# What this notebook does

It assumes that a separate notebook has built a day-by-day feature set, represented as a list. Each item in the list is the feature for one specific day. A feature has at least these three leading values: day_name as int, month as int, day_of_month as int. It may have further values, e.g., the number of commits that day or the number of lines of code changed that day.

The code below works with the feature set to build labels. Each label is an average of one feature value (e.g., commits, locc) or of a combination of feature values. For day d in the feature set, values are averaged over the sequence [d+1:d+n], where n is a parameter ranging from 1 to some maximum.
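
The averaging described above can be sketched as a small standalone function. This is a minimal sketch, not this notebook's own cell: the name `average_following` and the toy data are hypothetical, and it assumes the window is meant to cover exactly n days (d+1 through d+n inclusive), which in Python means a slice end of d+n+1.

```python
def average_following(features, feature_index, n):
    """Return one label per day d: the mean of one feature value over days d+1..d+n.

    Assumption: the window covers exactly n days after d (d+n included).
    Days that do not have n full following days get no label.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    labels = []
    for d in range(len(features) - n):
        window = features[d + 1 : d + n + 1]  # slice end is exclusive
        labels.append(sum(f[feature_index] for f in window) / n)
    return labels


# Toy feature set (not real data): [day_name, month, day_of_month, commits]
toy = [[0, 1, 1, 2], [1, 1, 2, 4], [2, 1, 3, 6], [3, 1, 4, 8]]
labels = average_following(toy, feature_index=3, n=2)
print(labels)  # [5.0, 7.0]
```

With n=2 the first label averages the commits of the second and third days (4 and 6), and the last two days receive no label because fewer than two days follow them.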

If you want some other form of label, one not based on averaging, then write your own notebook for it.

## Assumptions of this notebook

1. The code and notebooks we will be using, including this one, are in the following folder: /ideas-uo/machine_learning/predicting_project_activity

2. We expect to be able to execute this code: open('features_by_day_'+project+'.txt', 'r')

3. The final result will be written to the file 'commits_averaged_by_'+str(look_ahead)+'_'+project+'.txt' in the folder /ideas-uo/machine_learning/predicting_project_activity.

4. This notebook should be started in folder /ideas-uo/machine_learning/predicting_project_activity.
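
The file names in assumptions 2 and 3 can be built in one place. A minimal sketch, assuming the notebook is started in the folder named in assumption 4 so that relative paths resolve there; the helper names `features_path` and `labels_path` are hypothetical and are not used elsewhere in this notebook.

```python
from pathlib import Path


def features_path(project):
    """Relative path of the input feature file for a project."""
    return Path(f"features_by_day_{project}.txt")


def labels_path(project, look_ahead):
    """Relative path of the output label file for a project and window size."""
    return Path(f"commits_averaged_by_{look_ahead}_{project}.txt")


print(features_path("spack"))   # features_by_day_spack.txt
print(labels_path("spack", 7))  # commits_averaged_by_7_spack.txt
```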

## Parameters for this notebook

In [1]:
project = 'spack'
look_ahead = 7  # look-ahead window; the label for day i uses the slice [i+1:i+look_ahead]

In [2]:
# Index into a feature: the value we will average over. This is a bit brittle; the feature
# format should change to a dict so a key can be specified instead of a fixed index.

feature_index = 3  #3 = commits
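
The comment above suggests switching features to dicts so that a field can be named rather than indexed. A minimal sketch of such a conversion, assuming the field order described at the top of this notebook; the key names, the `extra` key, and the helper `to_dict` are hypothetical, and the file format on disk is left unchanged.

```python
# Assumed field order for the leading values of each feature.
FIELD_NAMES = ["day_name", "month", "day_of_month", "commits"]


def to_dict(feature, field_names=FIELD_NAMES):
    """Map a list-style feature to a dict keyed by field name.

    Any values beyond the named fields are kept, in order, under 'extra'.
    """
    named = dict(zip(field_names, feature))
    named["extra"] = list(feature[len(field_names):])
    return named


row = [2, 5, 17, 4]  # toy values, not real data
as_dict = to_dict(row)
print(as_dict["commits"])  # 4
```

A label routine could then read `f["commits"]` instead of `f[feature_index]`, which would survive a reordering of the feature list.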

## Read in data

In [3]:
import json
with open('features_by_day_'+project+'.txt', 'r') as f:
    features_by_day = json.load(f)  # parse the JSON list of per-day features

len(features_by_day)

2331

## Build the labels

In [5]:
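labels_by_day = []
for i in range(len(features_by_day) - look_ahead):
    # The slice end is exclusive, so next_chunk holds the look_ahead-1 days
    # after day i; use i+look_ahead+1 as the end to cover look_ahead full days.
    next_chunk = features_by_day[i+1:i+look_ahead]
    the_label = sum(f[feature_index] for f in next_chunk) / len(next_chunk)
    labels_by_day.append(the_label)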

In [6]:
len(labels_by_day)

2324
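
The label list is 7 entries shorter than the feature list (2331 versus 2324) because the last look_ahead days get no label. Since labels_by_day[i] belongs to features_by_day[i], pairing the two means dropping those trailing days. A minimal sketch with toy stand-ins for the notebook's variables; the values and the name `pairs` are hypothetical.

```python
# Toy stand-ins for the notebook's variables (not real data).
features_by_day = [[0, 1, 1, 3], [1, 1, 2, 0], [2, 1, 3, 5], [3, 1, 4, 1]]
labels_by_day = [2.5, 3.0]  # pretend labels for the first two days only

# zip stops at the shorter sequence, so trailing unlabeled days are dropped.
pairs = list(zip(features_by_day, labels_by_day))
print(len(pairs))  # 2
```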

In [7]:
labels_by_day[:10]

[0.8333333333333334,
 1.1666666666666667,
 2.1666666666666665,
 2.3333333333333335,
 2.3333333333333335,
 2.0,
 1.6666666666666667,
 1.3333333333333333,
 0.3333333333333333,
 0.16666666666666666]

In [8]:
with open('commits_averaged_by_'+str(look_ahead)+'_'+project+'.txt', 'w') as f:
    json.dump(labels_by_day, f)  # write the labels as a JSON list