# What this notebook does

It assumes that a separate notebook has built a CSV table in which each row is a day in the life of a project, sorted by date. The goal of this notebook is to load that table and build a day-by-day feature set, represented as a list. Each item in the list is the feature record for one day. A record has at least these first 3 values: weekday as int, month as int, day_of_month as int. Which further values are included is determined by notebook extensions such as alt_1, alt_2, etc.

Once this notebook and the build-labels notebook have both completed, we are ready to train a model.
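The three calendar values that start every record can be derived from a date alone. The sketch below assumes weekdays are numbered with Monday as 1, which agrees with the sample output later in this notebook (3 is Wednesday). The function name `calendar_features` is illustrative and is not taken from the project code.

```python
from datetime import date

def calendar_features(day):
    """Return [weekday, month, day_of_month] for a datetime.date.

    Assumption: weekdays use ISO numbering, Monday=1 .. Sunday=7.
    """
    return [day.isoweekday(), day.month, day.day]

# First day in the spack table: Wednesday, 13 February 2013
print(calendar_features(date(2013, 2, 13)))  # [3, 2, 13]
```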

## _alt_1

Alternative 1 is to look at the following feature set. Reminder: these are values for a single day in the life of the project.

* weekday
* month
* day_of_month
* total_commits
* total_locc (lines of code changed across all commits)
* total_developers (how many different developers were active on the day)
* total_files_changed
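As a rough illustration of how the per-day totals in this list could be derived from individual commit records, here is a minimal sketch. It assumes each commit is a dict with the keys `year`, `doy`, `locc` and `name` (column names that appear in the commit table shown later); the dict layout, the function name `daily_totals` and the sample data are hypothetical. `total_files_changed` is left out because no per-file column is visible in that table.

```python
from collections import defaultdict

def daily_totals(commits):
    """Aggregate commit records into per-day totals.

    Returns {(year, doy): {'total_commits', 'total_locc', 'total_developers'}}.
    """
    commit_count = defaultdict(int)
    locc_sum = defaultdict(int)
    developers = defaultdict(set)
    for c in commits:
        key = (c['year'], c['doy'])
        commit_count[key] += 1            # one more commit on this day
        locc_sum[key] += c['locc']        # lines of code changed
        developers[key].add(c['name'])    # distinct developers only
    return {
        key: {
            'total_commits': commit_count[key],
            'total_locc': locc_sum[key],
            'total_developers': len(developers[key]),
        }
        for key in commit_count
    }

# Made-up sample records, for illustration only
commits = [
    {'year': 2013, 'doy': 44, 'locc': 4, 'name': 'dev_a'},
    {'year': 2013, 'doy': 49, 'locc': 2, 'name': 'dev_a'},
    {'year': 2013, 'doy': 49, 'locc': 2, 'name': 'dev_b'},
]
totals = daily_totals(commits)
```

Keying on `(year, doy)` keeps the same day of the year in different years apart.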


## Assumptions of this notebook

1. The code and notebooks we will be using, including this one, are in the following folder: /ideas-uo/machine_learning/predicting_project_activity.

2. We expect to be able to execute this code: load_dir = repository_dir + '/machine_learning/predicting_project_activity/'; days_table = pd.read_csv(load_dir+project+'_days_table.csv').  

3. The final result will be written to load_dir + 'features_by_day_' + project + '.txt'.

4. This notebook should be started in folder /ideas-uo/machine_learning/predicting_project_activity.
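The paths named in these assumptions can be assembled in one place. This is only a sketch: `repository_dir` is not defined anywhere in this notebook, so the value `'/ideas-uo'` used below is an assumption inferred from assumption 1, and the helper names are hypothetical.

```python
import os

SUBDIR = os.path.join('machine_learning', 'predicting_project_activity')

def days_table_path(repository_dir, project):
    """Path of the input table, e.g. .../spack_days_table.csv."""
    return os.path.join(repository_dir, SUBDIR, project + '_days_table.csv')

def features_path(repository_dir, project):
    """Path of the output file, e.g. .../features_by_day_spack.txt."""
    return os.path.join(repository_dir, SUBDIR, 'features_by_day_' + project + '.txt')

# Assumed repository location, for illustration only
table_file = days_table_path('/ideas-uo', 'spack')
output_file = features_path('/ideas-uo', 'spack')
```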

## Parameters for this notebook

In [1]:
project = 'spack'

## Read the table based on commits

Each row represents a separate commit.

In [2]:
import pandas as pd

In [6]:
days_table = pd.read_csv(project+'_days_table.csv')  #produced prior to this notebook

In [7]:
len(days_table)

4972

In [38]:
days_table.head(50)

Unnamed: 0,day_name,day_of_month,doy,locc,message,month,name,utc_offset,year
0,Wednesday,13,44,4,Initial version of spack with one package:...,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
1,Monday,18,49,2,Require python2.7\n,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
2,Monday,18,49,2,"Dependencies now work. Added libelf, libd...",2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
3,Wednesday,20,51,1,Added libunwind and fixed link issues in c...,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
4,Thursday,21,52,1,Minor changes; loosened up parallel build ...,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
5,Friday,22,53,9,Better spack -h: added cmd descriptions.\n...,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
6,Monday,25,56,8,Simpler implementation of depends_on.\n,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
7,Monday,25,84,12,Moved install-spack to its own simpler com...,3,b'Todd Gamblin <tgamblin@llnl.gov>',61200,2013
8,Monday,25,84,12,Adding install script\n,3,b'Todd Gamblin <tgamblin@llnl.gov>',61200,2013
9,Thursday,9,129,11,Removed old versions.py\n,5,b'Todd Gamblin <tgamblin@llnl.gov>',61200,2013


In [33]:
day_table.columns

Index([   'day_of_week',   'day_of_month',            'doy',  'total_commits',
          'total_loccs', 'total_messages',          'month',           'year',
          'total_names',                0],
      dtype='object')

In [None]:
for i in range(len(day_table)):
    #pull out values
    a_row = day_table.iloc[i].to_dict()
    dom = a_row[

In [11]:
len(features_by_day)

2331

In [12]:
'''
[[3, 2, 13, 1],
 [4, 2, 14, 0],
 [5, 2, 15, 0],
 [6, 2, 16, 0],
 [7, 2, 17, 0],
 [1, 2, 18, 2],
 [2, 2, 19, 3],
 [3, 2, 20, 2],
 [4, 2, 21, 6],
 [5, 2, 22, 1]]
 '''
features_by_day[:10]  # first value is the weekday; 3 = Wednesday

[[3, 2, 13, 1],
 [4, 2, 14, 0],
 [5, 2, 15, 0],
 [6, 2, 16, 0],
 [7, 2, 17, 0],
 [1, 2, 18, 2],
 [2, 2, 19, 3],
 [3, 2, 20, 2],
 [4, 2, 21, 6],
 [5, 2, 22, 1]]
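The output above contains rows with zero commits, so days without activity are represented explicitly. One way to produce such rows is to walk the calendar from the first to the last day and fill in a count of 0 where no commits exist. This is a minimal sketch under that assumption, not necessarily how this notebook builds `features_by_day`; the function name `fill_days` is hypothetical.

```python
from datetime import date, timedelta

def fill_days(commits_per_day, start, end):
    """Build one [weekday, month, day_of_month, total_commits] row per calendar day.

    commits_per_day maps datetime.date -> commit count; absent days get 0.
    Assumption: weekdays use ISO numbering, Monday=1 .. Sunday=7.
    """
    rows = []
    day = start
    while day <= end:
        rows.append([day.isoweekday(), day.month, day.day,
                     commits_per_day.get(day, 0)])
        day += timedelta(days=1)
    return rows

# Counts for the two active days among the first six rows of the output above
counts = {date(2013, 2, 13): 1, date(2013, 2, 18): 2}
rows = fill_days(counts, date(2013, 2, 13), date(2013, 2, 18))
print(rows)
# [[3, 2, 13, 1], [4, 2, 14, 0], [5, 2, 15, 0], [6, 2, 16, 0], [7, 2, 17, 0], [1, 2, 18, 2]]
```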

In [13]:
just_commits = [rec[3] for rec in features_by_day]

In [14]:
max(just_commits)  #49

49

In [15]:
n = len(just_commits)
for i in range(max(just_commits)+1):
    print((i, just_commits.count(i)/n))

(0, 0.3341913341913342)
(1, 0.09523809523809523)
(2, 0.08065208065208065)
(3, 0.06992706992706993)
(4, 0.05405405405405406)
(5, 0.05148005148005148)
(6, 0.049335049335049334)
(7, 0.03517803517803518)
(8, 0.03088803088803089)
(9, 0.029601029601029602)
(10, 0.02702702702702703)
(11, 0.02145002145002145)
(12, 0.01673101673101673)
(13, 0.015873015873015872)
(14, 0.013728013728013728)
(15, 0.010725010725010725)
(16, 0.012012012012012012)
(17, 0.006864006864006864)
(18, 0.005577005577005577)
(19, 0.006435006435006435)
(20, 0.00429000429000429)
(21, 0.005148005148005148)
(22, 0.002574002574002574)
(23, 0.001716001716001716)
(24, 0.002145002145002145)
(25, 0.003432003432003432)
(26, 0.001716001716001716)
(27, 0.000429000429000429)
(28, 0.0)
(29, 0.002145002145002145)
(30, 0.001716001716001716)
(31, 0.001287001287001287)
(32, 0.000429000429000429)
(33, 0.0)
(34, 0.000429000429000429)
(35, 0.000858000858000858)
(36, 0.000429000429000429)
(37, 0.001287001287001287)
(38, 0.000858000858000858)
(39,
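The loop above calls `just_commits.count(i)` once per candidate value, which rescans the whole list each time. The same distribution can be computed in a single pass with `collections.Counter`. The sketch below uses the ten commit counts from the sample rows shown earlier; the function name `commit_distribution` is hypothetical.

```python
from collections import Counter

def commit_distribution(counts):
    """Return {commits_per_day: fraction_of_days} for every value from 0 to max."""
    n = len(counts)
    tally = Counter(counts)  # a missing key yields 0, so unseen values get 0.0
    return {k: tally[k] / n for k in range(max(counts) + 1)}

sample = [1, 0, 0, 0, 0, 2, 3, 2, 6, 1]  # total_commits of the first ten days
dist = commit_distribution(sample)
```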

In [16]:
fee_fie_foo()  # deliberately undefined: raises NameError to stop Run All here, so we can decide whether to save

NameError: name 'fee_fie_foo' is not defined
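Calling an undefined name works as a stop, but the resulting NameError looks like a real bug. A more explicit alternative is to raise a purpose-built exception with a message, since an uncaught exception in a cell normally halts Run All at that cell. The names `StopRunAll` and `stop_before_save` are hypothetical.

```python
class StopRunAll(Exception):
    """Raised deliberately to halt execution before the save step."""

def stop_before_save(message="Stopped before saving; run the remaining cells by hand."):
    # An uncaught exception ends the cell with an error, which stops Run All.
    raise StopRunAll(message)

# Usage in a notebook cell (commented out so this sketch does not raise on its own):
# stop_before_save()
```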

In [17]:
import json
with open('features_by_day_'+project+'.txt', 'w') as f:
    f.write(json.dumps(features_by_day))

#Now read the file back into a Python list object
with open('features_by_day_'+project+'.txt', 'r') as f:
    a = json.loads(f.read())
    
len(a) == len(features_by_day)

True
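The write-then-read check above can be wrapped in two small helpers so that the notebooks that consume the file load it the same way. This is a sketch only: the helper names are hypothetical, and it writes to a temporary directory so that no real output file is touched.

```python
import json
import os
import tempfile

def save_features(features, path):
    """Write the feature list to path as JSON."""
    with open(path, 'w') as f:
        json.dump(features, f)

def load_features(path):
    """Read a feature list previously written by save_features."""
    with open(path, 'r') as f:
        return json.load(f)

# Round-trip check using the first two sample rows shown earlier
features = [[3, 2, 13, 1], [4, 2, 14, 0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'features_by_day_example.txt')
    save_features(features, path)
    restored = load_features(path)
print(restored == features)  # True
```

Nested lists of ints survive the JSON round trip unchanged, so comparing the restored list with the original is a valid check.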

In [None]:
len(a)