# What this notebook does

It assumes that a separate notebook has built a csv table where each row is a commit event. The table is sorted by date. The goal of this notebook is to load that table and build a day-by-day feature set represented by a list. Each item in the list is a feature. A feature has at least these first 3 values: day_name as int, month as int, day_of_month as int. It may have further values, e.g., number of commits that day, number of lines of code changed that day.

In essence, the code below inverts a table by commits to a list by days.

## Assumptions of this notebook

1. The code and notebooks we will be using, including this one, are in the following folder: /ideas-uo/machine_learning/predicting_project_activity

2. We expect to be able to execute this code: sorted_table = pd.read_csv('sorted_'+project+'_table.csv')  

3. Final result will be written to '/ideas-uo/machine_learning/predicting_project_activity' + 'features_by_day_'+project+'.txt'.

4. This notebook should be started in folder /ideas-uo/machine_learning/predicting_project_activity.

## Parameters for this notebook

In [1]:
project = 'spack'

## Read the table based on commits

Each row represents a separate commit.

In [2]:
import pandas as pd

In [3]:
sorted_table = pd.read_csv('sorted_'+project+'_table.csv')  #produced prior to this notebook

In [4]:
len(sorted_table)

10913

In [5]:
sorted_table.head()

Unnamed: 0,day_name,day_of_month,doy,files,message,month,name,utc_offset,year
0,Wednesday,13,44,"[b'.gitignore', b'bin/spack']",Initial version of spack with one package:...,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
1,Monday,18,49,"[b'4 +1,4 @@']",Require python2.7\n,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
2,Monday,18,49,"[b',7 +19,7 @@ import spack']","Dependencies now work. Added libelf, libd...",2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
3,Tuesday,19,50,"[b',7 +73,8 @@ for var in [""LD_LIBRARY_PATH"", ...",Fixed passing of dependence prefixes to cc...,2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013
4,Tuesday,19,50,"[b'28 +4,29 @@ import os']","Fixes, remove parallel build for libdwarf ...",2,b'Todd Gamblin <tgamblin@llnl.gov>',57600,2013


In [6]:
import datetime
from dateutil import parser


## Here is starting date

In [7]:
starting_year = sorted_table.loc[0,'year']
starting_month = sorted_table.loc[0,'month']
starting_day = sorted_table.loc[0,'day_of_month']
starting_obj = datetime.date(starting_year, starting_month, starting_day)
starting_obj

datetime.date(2013, 2, 13)

## Here is ending date

In [8]:
ending_year = sorted_table.iloc[-1]['year']
ending_month = sorted_table.iloc[-1]['month']
ending_day = sorted_table.iloc[-1]['day_of_month']
ending_obj = datetime.date(ending_year, ending_month, ending_day)
ending_obj

datetime.date(2019, 7, 3)

We should end up with a list of this length, i.e., a list item for each day.

In [9]:
td = ending_obj - starting_obj
td.days

2331

## wrangling code

Goal: for everyday between starting and ending dates, create a feature for that day.

Actual method: loop through rows of commit table. Keep values needed to (a) count rows with same date, (b) count days skipped leading to a sequence of 0 entries, and (c) determine when switch years so can reset values.

In [10]:
current_day = int(sorted_table.loc[0,'doy'])  #day of year: 1-365 (or 366 on leap years)
current_year = sorted_table.loc[0,'year']
features_by_day = []           #where final sequence will be kept
dnint = {'Monday':1, 'Tuesday':2, 'Wednesday':3, 'Thursday':4, 'Friday':5, 'Saturday':6, 'Sunday':7}
day_tracker = {}  #use to track values accumulating for a single day
day_tracker_keys = ['commit', 'locc']

#Here is where extra feature values are defined. Have to be based on columns in the commit table
for key in day_tracker_keys:
    day_tracker[key] = 0

for i in range(len(sorted_table)):

    #pull out date pieces
    year = int(sorted_table.loc[i,'year'])
    day_of_year = int(sorted_table.loc[i,'doy'])
    
    #check if change years, e.g., change from 2013 to 2014
    if year!=current_year:
        current_year = year
        diff = day_of_year + (365 - current_day)  #account for skipped days at end of old year
    else:
        diff = day_of_year - current_day
    
    #diff now holds number of days incremented
    
    #No diff so same day - increment all tracked values for the day
    if diff==0:
        for key in day_tracker_keys:
            day_tracker[key] += 1
        continue
    
    #Now things get interesting. We need to move back in time to beginning edge of gap. If gap is size diff,
    #then move back diff days. That will give us the date before the gap begins.
    
    #First build date object - easier to do arithmetic on. This is date on ending edge of gap.
    month = int(sorted_table.loc[i,'month'])
    day_of_month = int(sorted_table.loc[i,'day_of_month'])
    end_gap_date = datetime.datetime(year, month, day_of_month)   #current row we are looking at
    
    begin_gap_date = end_gap_date - datetime.timedelta(days=diff) #looking back in time
    
    #record feature values for begin gap date
    prior_day_name = dnint[begin_gap_date.strftime('%A')]  #convert to int 1-7
    prior_month = begin_gap_date.month
    prior_day_of_month = begin_gap_date.day

    #build features list
    date_features = [prior_day_name, prior_month, prior_day_of_month]  #always include date data
    more_features = [tup[1]   for tup in sorted(day_tracker.items())]
    features_by_day.append(date_features+more_features)
    
    #Whew. Took care of recording data for the beginning data of gap.
    
    #diff = 1 so tomorrow is here :) Just reset things since no dates skipped
    if diff == 1:
        for key in day_tracker_keys:
            day_tracker[key] = 1
        current_day = day_of_year
        continue
    
    #we have a gap! need to fill in with 0 feature values for each day in gap
    if diff > 1:
        date_obj = begin_gap_date
        for key in day_tracker_keys:  #0 out the tracked items
            day_tracker[key] = 0
        for i in range(diff-1):
            date_obj += datetime.timedelta(days=1)  #handles month change overs
            day_name = dnint[date_obj.strftime('%A')]
            date_features = [day_name, date_obj.month, date_obj.day]
            more_features = [tup[1] for tup in sorted(day_tracker.items())]  #should all be 0
            features_by_day.append(date_features+more_features) 
        day_tracker['commits'] = 1  #record the new one we just saw for data at end of gap
        current_day = day_of_year  #now on new date
        continue
    
    print((i, day_of_year, year, diff))
    raise Exception  #should never get here

#check if have any values accumulated.
tracking_values = [tup[1] for tup in sorted(day_tracker.items())]
if any(tracking_values):
    date_features = [prior_day_name, prior_month, prior_day_of_month]
    more_features = tracking_values
    features_by_day.append(date_features+more_features)  

In [11]:
len(features_by_day)

2331

In [12]:
'''
[[3, 2, 13, 1],
 [4, 2, 14, 0],
 [5, 2, 15, 0],
 [6, 2, 16, 0],
 [7, 2, 17, 0],
 [1, 2, 18, 2],
 [2, 2, 19, 3],
 [3, 2, 20, 2],
 [4, 2, 21, 6],
 [5, 2, 22, 1]]
 '''
features_by_day[:10]  #3=wednesday

[[3, 2, 13, 1],
 [4, 2, 14, 0],
 [5, 2, 15, 0],
 [6, 2, 16, 0],
 [7, 2, 17, 0],
 [1, 2, 18, 2],
 [2, 2, 19, 3],
 [3, 2, 20, 2],
 [4, 2, 21, 6],
 [5, 2, 22, 1]]

In [13]:
just_commits = [rec[3] for rec in features_by_day]

In [14]:
max(just_commits)  #49

49

In [15]:
n = len(just_commits)
for i in range(max(just_commits)+1):
    print((i, just_commits.count(i)/n))

(0, 0.3341913341913342)
(1, 0.09523809523809523)
(2, 0.08065208065208065)
(3, 0.06992706992706993)
(4, 0.05405405405405406)
(5, 0.05148005148005148)
(6, 0.049335049335049334)
(7, 0.03517803517803518)
(8, 0.03088803088803089)
(9, 0.029601029601029602)
(10, 0.02702702702702703)
(11, 0.02145002145002145)
(12, 0.01673101673101673)
(13, 0.015873015873015872)
(14, 0.013728013728013728)
(15, 0.010725010725010725)
(16, 0.012012012012012012)
(17, 0.006864006864006864)
(18, 0.005577005577005577)
(19, 0.006435006435006435)
(20, 0.00429000429000429)
(21, 0.005148005148005148)
(22, 0.002574002574002574)
(23, 0.001716001716001716)
(24, 0.002145002145002145)
(25, 0.003432003432003432)
(26, 0.001716001716001716)
(27, 0.000429000429000429)
(28, 0.0)
(29, 0.002145002145002145)
(30, 0.001716001716001716)
(31, 0.001287001287001287)
(32, 0.000429000429000429)
(33, 0.0)
(34, 0.000429000429000429)
(35, 0.000858000858000858)
(36, 0.000429000429000429)
(37, 0.001287001287001287)
(38, 0.000858000858000858)
(39,

In [16]:
fee_fie_foo()  #break here from Run All to see if want to save

NameError: name 'fee_fie_foo' is not defined

In [17]:
import json
with open('features_by_day_'+project+'.txt', 'w') as f:
    f.write(json.dumps(features_by_day))

#Now read the file back into a Python list object
with open('features_by_day_'+project+'.txt', 'r') as f:
    a = json.loads(f.read())
    
len(a) == len(features_by_day)

True

In [None]:
len(a)