<h1>Produce commits by day</h1>

Use rows in sorted_commits.csv to produce a count of commits for inclusive dates in the table. Each row represents a single commit.

In [1]:
import pandas as pd

In [33]:
sorted_commits = pd.read_csv('sorted_commits.csv')

In [34]:
len(sorted_commits)

23059

In [25]:
sorted_commits.head()

Unnamed: 0,date,doy,files,message,person,year
0,2013-02-13 17:50:44-08:00,44,"['.gitignore', 'bin/spack']",Initial version of spack with one package:...,Todd Gamblin <tgamblin@llnl.gov>,2013
1,2013-02-13 17:50:44-08:00,44,"['.gitignore', 'bin/spack']",Initial version of spack with one package:...,Todd Gamblin <tgamblin@llnl.gov>,2013
2,2013-02-13 17:50:44-08:00,44,"['.gitignore', 'bin/spack']",Initial version of spack with one package:...,Todd Gamblin <tgamblin@llnl.gov>,2013
3,2013-02-13 17:50:44-08:00,44,"['.gitignore', 'bin/spack']",Initial version of spack with one package:...,Todd Gamblin <tgamblin@llnl.gov>,2013
4,2013-02-13 17:50:44-08:00,44,"['.gitignore', 'bin/spack']",Initial version of spack with one package:...,Todd Gamblin <tgamblin@llnl.gov>,2013


In [5]:
import datetime
from dateutil import parser


## Here is starting date

In [46]:
starting_day = int(sorted_commits.loc[0,'doy'])  #day of year for Feb 13
starting_day

44

In [47]:
starting_year = sorted_commits.loc[0,'year']
starting_year

2013

## Here is ending date

In [38]:
str_date = sorted_commits.loc[len(sorted_commits)-1,'date']
parser.parse(str_date)  #-05 hours => -18000 seconds


datetime.datetime(2019, 1, 2, 1, 15, 15, tzinfo=tzoffset(None, 3600))

In [49]:
ending_year = sorted_commits.iloc[-1]['year']  #get ending year from last row
ending_year

2019

In [50]:
ending_day = sorted_commits.iloc[-1]['doy']  #get ending year from last row
ending_day

2

## wrangling code

Goal: go through everyday between (2013, 2, 13) and (2019, 1, 2). For each day count how many commits occured, 0 is possible. Produce a list of commits per day.

Actual method: loop through rows of table. Keep values needed to (a) count rows with same date, (b) count days skipped leading to a sequence of 0 entries, and (c) determine when switch years so can reset values.

Boyana suggests building table with row per day with 0 count then merge sorted_commits into that. Still seems would have to look through to consolidate counts.

In [45]:
current_day = starting_day    #computed above
current_year = starting_year  #computed above
commits_by_day = []           #where final sequence will be kept
day_commits = 0               #count the commits for a single day

for i in range(len(sorted_commits)):
    year = sorted_commits.loc[i,'year']
    day_of_year = int(sorted_commits.loc[i,'doy'])
    
    #check if change years, e.g., change from 2013 to 2014
    if year!=current_year:
        current_year = year
        diff = day_of_year + (365 - current_day)  #account for skipped days at end of old year
    else:
        diff = day_of_year - current_day
    
    #diff now holds number of days skipped
    
    #No diff so same day - increment commits for the day
    if diff==0:
        day_commits += 1
        continue
    
    #diff = 1 so tomorrow is here :)
    if diff == 1:
        commits_by_day.append(day_commits)  #save accumulation from previous day
        day_commits = 1
        current_day = day_of_year
        continue
    
    #what if days skipped - need to fill in with 0
    if diff > 1:
        commits_by_day.append(day_commits)  #add the ones already counted
        for i in range(diff-1):
            commits_by_day.append(0)  #fill in 0 for missing days 
        day_commits = 1
        current_day = day_of_year
        continue
    
    print((i, current_day, current_year, day_of_year, year, diff))
    raise Exception  #should never get here

commits_by_day.append(day_commits)  #get the last one

    



In [51]:
len(sorted_commits)

23059

In [52]:
sum(commits_by_day)

23059

In [54]:
commits_by_day.count(0)/len(sorted_commits)  #roughly 3% of days have no commits

0.03282883039160415

In [None]:
fee_fie_foo()  #break here from Run All to see if want to save

In [55]:
import json
with open('commit_counts.txt', 'w') as f:
    f.write(json.dumps(commits_by_day))

#Now read the file back into a Python list object
with open('commit_counts.txt', 'r') as f:
    a = json.loads(f.read())
    
len(a) == len(commits_by_day)

True

In [56]:
len(a)

2149