# What this notebook does

The overall goal is to produce a table that can be used to create a feature set ready for model training. To do that, we first pull commit information about a GitHub project whose URL is listed in <code>project + '_url.txt'</code>, then format that information as a table where each row is a separate commit and the columns hold information about that commit. Here is the target:

<img src="https://www.dropbox.com/s/c2wrow809najo9s/Screenshot%202019-07-15%2011.11.08.png?raw=1">

After this we will build a second table that compacts the information into days. The result is a table that looks like this:

<img src="https://www.dropbox.com/s/53ddnk7201mmecz/Screenshot%202019-07-15%2011.09.43.png?raw=1">

## Assumptions of this notebook

1. The code and notebooks we will be using, including this one, are in the following folder: <code>/ideas-uo/machine_learning/predicting_project_activity</code>.

2. We expect to be able to execute this code: <code>project_url = project + '_url.txt'</code> and then <code>project_info = open(os.path.join('.',project_url),'r').readlines()</code>. Two lines are expected in the file: the GitHub URL of the repo and the name of the repo. Other lines will be ignored.

3. Two tables will be written to <code>'/ideas-uo/machine_learning/predicting_project_activity'</code>: (1) <code>project +  '_commits_table.csv'</code> and (2) <code>project +  '_days_table.csv'</code>.

4. This notebook should be started in folder <code>/ideas-uo/machine_learning/predicting_project_activity</code>.
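
As an illustration of assumption 2, here is a minimal sketch of reading the two expected lines while stripping the newline that <code>readlines()</code> leaves on each one. The helper name <code>read_project_info</code> and the sample URL written to the throwaway file are assumptions for the demo, not part of the notebook.

```python
import os
import tempfile

def read_project_info(path):
    """Return (url, name) from the first two lines of a project file."""
    with open(path, 'r') as handle:
        lines = [line.strip() for line in handle.readlines()]
    if len(lines) < 2:
        raise ValueError('expected at least 2 lines (url, name) in ' + path)
    return lines[0], lines[1]   # any further lines are ignored

# Demo with a throwaway file; the URL is assumed sample content.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'spack_url.txt')
with open(demo_path, 'w') as handle:
    handle.write('https://github.com/spack/spack\nspack\nthis line is ignored\n')

url, name = read_project_info(demo_path)
print(url, name)
```

Stripping matters because a name that still carries its newline would never compare equal to <code>project</code>.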


### Aside

This notebook was originally split into 2: one that produced the commit table and one that produced the day table. Because the 2 tables are strongly linked, they were combined into a single notebook.

## Parameters for this notebook

In [1]:
project = 'spack'  #or latte, etc.

# Part 1

Our goal is to load in and parse project data and produce a pandas dataframe that is commit-based, i.e., every row represents a separate commit.

## Do github pull

We assume this notebook was started from within the ideas-uo repository.


In [2]:
import os
starting_dir = os.getcwd() #~/ideas-uo/machine_learning/predicting_project_activity
starting_dir

'/Users/fickas/Dropbox/boyana/my_work/ideas-uo/machine_learning/predicting_project_activity'

In [3]:
#~/ideas-uo/machine_learning/predicting_project_activity
%pwd

'/Users/fickas/Dropbox/boyana/my_work/ideas-uo/machine_learning/predicting_project_activity'

In [4]:
#move up to ideas-uo folder - must be a better way
%cd ../..

/Users/fickas/Dropbox/boyana/my_work/ideas-uo


In [5]:
repository_dir = os.getcwd() #~/ideas-uo/
repository_dir

'/Users/fickas/Dropbox/boyana/my_work/ideas-uo'
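
One possible "better way" to find the repository folder is to compute it from the starting path instead of changing directories. This is only a sketch: the helper name <code>levels_up</code> and the demo path are hypothetical, and it assumes the notebook sits exactly two levels below the repository root, as stated in the assumptions.

```python
from pathlib import Path, PurePosixPath

def levels_up(path, n):
    """Return the ancestor n levels above path (n=1 is the parent)."""
    return path.parents[n - 1]

# Hypothetical location that mirrors the layout in the assumptions.
demo_start = PurePosixPath('/home/user/ideas-uo/machine_learning/predicting_project_activity')
demo_repo = levels_up(demo_start, 2)
demo_code = demo_repo / 'code'
print(demo_repo)
print(demo_code)

# In the notebook itself the equivalent would be levels_up(Path.cwd(), 2),
# which leaves the working directory untouched.
cwd_is_absolute = Path.cwd().is_absolute()
```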

In [6]:
!git fetch --all

Fetching origin


In [7]:
#Refresh local repository
!git pull

Already up-to-date.


In [8]:
code_dir = repository_dir + '/code'
code_dir

'/Users/fickas/Dropbox/boyana/my_work/ideas-uo/code'

In [9]:
import sys
sys.path.append(code_dir)

In [10]:
print( '\n'.join(sys.path))


/Users/fickas/anaconda2/envs/py36/lib/python36.zip
/Users/fickas/anaconda2/envs/py36/lib/python3.6
/Users/fickas/anaconda2/envs/py36/lib/python3.6/lib-dynload
/Users/fickas/anaconda2/envs/py36/lib/python3.6/site-packages
/Users/fickas/anaconda2/envs/py36/lib/python3.6/site-packages/IPython/extensions
/Users/fickas/.ipython
/Users/fickas/Dropbox/boyana/my_work/ideas-uo/code


## Our library

We wrote the hpcl library to act as a GitHub puller. See <code>/ideas-uo/code/hpcl/</code> for the source.

In [11]:
import os, sys, subprocess
from hpcl import Command
from hpcl import GitCommand
#import mysql.connector    #works from command line but not here


In [12]:
#move back to code dir /ideas-uo/code

os.chdir(code_dir)
os.getcwd()

'/Users/fickas/Dropbox/boyana/my_work/ideas-uo/code'

## Note: uses "project name"_url.txt to choose the project to load

Typically we only want to focus on a single project.

Pulling the project info can take some time.
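
The cloning cell carries a comment that it fails if the <code>tmp</code> folder does not already exist. A small sketch of guarding against that, assuming it is acceptable to create the folder on the fly (the helper name <code>ensure_dir</code> is hypothetical, and a temporary folder stands in for the code directory):

```python
import os
import tempfile

def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

base = tempfile.mkdtemp()                      # stand-in for the code directory
tmpdir_sketch = ensure_dir(os.path.join(base, 'tmp'))
ensure_dir(tmpdir_sketch)                      # a second call is a harmless no-op
print(os.path.isdir(tmpdir_sketch))
```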

In [13]:
project_url = project + '_url.txt'# Where to get the project from github

In [14]:
#Produces nested Python lists and dictionaries

def checkoutSubrepos(repos,tdir):
    #currdir = os.getcwd()
    for repopath in repos.keys():
        if repopath == tdir: continue
        retcode, out, err = Command.Command('git checkout %s%s' % repos[repopath]).run(dryrun=dry_run)
    return


if __name__ == "__main__":
    
    print('Going to load repos...')

    #Load the repo to clone from the following file
    project_info = open(os.path.join('.',project_url),'r').readlines()

    #Move to the tmp directory that will hold the repo
    currdir = os.getcwd()
    tmpdir = os.path.join(currdir,'tmp')  #sff: fails if does not already exist
    #os.chdir(tmpdir)

    commander = GitCommand.GitCommand(tmpdir)

    #Download repo
    url = project_info[0].strip()   #strip the trailing newline left by readlines
    name = project_info[1].strip()
    if name != project: print(('warning: names are different', name, project))
    print('Cloning: ' + name)
    commander.cloneRepo(url)

    print('Finished cloning repo.')

    #Setup DB connection, this will probably change soon.
    #mydb = mysql.connector.connect(host="localhost", port="3307", user="pythondb", passwd="********", database="gitstats")
    #print(mydb)
    #mycursor = mydb.cursor()
    #sql = "INSERT INTO stats (reponame, stats) VALUES(%s, %s)"

    #Now run commands on the repo

    if not os.path.isdir(name): print(('missing repo', name))

    print('Checking: ' + name)
    prefix,versions = commander.getRepoVersions(name)
    print(prefix)
    print(versions)

    #Gets all the commit data in a dictionary format {authorname:{'total_commits':0, 'commits':{'date':'', 'message':'', 'diffs':{'filename':'', 'diff':''}}}}
    active_developers = commander.getRepoCommitData(name)

    print(('number of developers', len(active_developers)))

    """ stats = getStats('.',repo)
    ts = repo + ', ' + out.strip().split('-')[0] 
    buf = ts
    for i in range(0,len(category_names)): 
        buf += ', %d' % stats[category_names[i]]
        linecounts[i].append(stats[category_names[i]]*0.001)
    outfile.write(buf+'\n')
    #years.append(int(ts))
    print(stats)

    val = (repo, buf)"""
    #mycursor.execute(sql,val)
    #mydb.commit()






Going to load repos...
Cloning: spack
Finished cloning repo.
Checking: spack

[b'cc76c0f5f9f8021cfb7423a226bd431c00d791ce', b'30d3b32085ab31c91ac45f2f14c5de07774823f7', b'0a0291678e283bf154081df67b0a1f5c909d1d19', b'f5a49280c3b9063c6deb29307cd6356bf75cedd5', b'34d23c617c89861e0d4ee1aad6a0acf1892502df', b'c3b003e6984f15807fe8675d90eab19679566363', b'8540d5390e70388625ee006562b450efa924113b']
git checkout b'cc76c0f5f9f8021cfb7423a226bd431c00d791ce'
b''
git checkout b'30d3b32085ab31c91ac45f2f14c5de07774823f7'
b''
git checkout b'0a0291678e283bf154081df67b0a1f5c909d1d19'
b''
git checkout b'f5a49280c3b9063c6deb29307cd6356bf75cedd5'
b''
git checkout b'34d23c617c89861e0d4ee1aad6a0acf1892502df'
b''
git checkout b'c3b003e6984f15807fe8675d90eab19679566363'
b''
git checkout b'8540d5390e70388625ee006562b450efa924113b'
b''
('number of developers', 569)


## Let's see what we have

In [15]:
len(active_developers)

569

In [16]:
dev_tups = list(active_developers.items())  #get it in form easier to see

## Nested structure is complicated

The code below dives into the structure for illustration. It can be skipped on a production run.

In [17]:
a_dev = dev_tups[0]  #record for one developer with commits buried in that record
a_dev

(b'Massimiliano Culpo <massimiliano.culpo@gmail.com>',
 {'commits': [{'date': b'Mon Jul 15 19:30:01 2019 +0200',
    'diffs': [{'diff': ['+import spack.error'],
      'filename': 'a/lib/spack/spack/cmd/uninstall.py b/lib/spack/spack/cmd/uninstall.py'}],
    'id': b'5acbe449e5840a7592e93d3ba35ff10e45ebc8a0',
    'message': '    spack uninstall can uninstall specs with multiple roots (#11977)\n    \n    Fixes #3690\n    Fixes #5637\n    \n    Uninstalling dependents of a spec was relying on a traversal of the\n    parents done by inspecting spec._dependents. This is in turn a\n    DependencyMap that maps a package name to a single DependencySpec object\n    (an edge in the DAG) and cannot thus model the case where a spec has\n    multiple configurations of the same parent package installed (for\n    example if different versions of the same Python library depend on\n    the same Python installation).\n    \n    This commit works around this issue by constructing the list of specs to\n   

In [18]:
a_dev[0]  #developer name

b'Massimiliano Culpo <massimiliano.culpo@gmail.com>'

In [19]:
a_dev[1].keys()  #dictionary with 2 keys and we only care about commits

dict_keys(['total_commits', 'commits'])

In [20]:
a_dev[1]['total_commits']

973

In [21]:
list_of_commits = a_dev[1]['commits']  #expect a separate row for each commit in this list
len(list_of_commits )

973

In [22]:
list_of_commits [0]  #dictionary of date, diffs, id, message - this corresponds to one row in the table

{'date': b'Mon Jul 15 19:30:01 2019 +0200',
 'diffs': [{'diff': ['+import spack.error'],
   'filename': 'a/lib/spack/spack/cmd/uninstall.py b/lib/spack/spack/cmd/uninstall.py'}],
 'id': b'5acbe449e5840a7592e93d3ba35ff10e45ebc8a0',
 'message': '    spack uninstall can uninstall specs with multiple roots (#11977)\n    \n    Fixes #3690\n    Fixes #5637\n    \n    Uninstalling dependents of a spec was relying on a traversal of the\n    parents done by inspecting spec._dependents. This is in turn a\n    DependencyMap that maps a package name to a single DependencySpec object\n    (an edge in the DAG) and cannot thus model the case where a spec has\n    multiple configurations of the same parent package installed (for\n    example if different versions of the same Python library depend on\n    the same Python installation).\n    \n    This commit works around this issue by constructing the list of specs to\n    be uninstalled in an alternative way, and adds tests to verify the\n    behavior

In [23]:
list_of_commits[0].keys()

dict_keys(['id', 'date', 'message', 'diffs'])

In [24]:
actual_diffs = list_of_commits [0]['diffs'] #a list of dictionaries with keys diff, filename
actual_diffs

[{'diff': ['+import spack.error'],
  'filename': 'a/lib/spack/spack/cmd/uninstall.py b/lib/spack/spack/cmd/uninstall.py'}]

In [25]:
list_of_diff_strings = actual_diffs[0]['diff']  #a list of strings - have to parse to pull out info
list_of_diff_strings

['+import spack.error']

### Aside

The complicated nesting of information takes a while to understand. Drawing out the structure is on our to-do list.
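
Until that drawing exists, a text outline can stand in for it. The sketch below walks a tiny made-up record shaped like the output shown above and prints one line per key with the type of its value. The function name <code>outline</code> and the sample data are assumptions for illustration only.

```python
# Made-up record with the same nesting as one entry of active_developers.
sample_developers = {
    b'Dev One <dev.one@example.com>': {
        'total_commits': 1,
        'commits': [
            {'id': b'0000aaaa',
             'date': b'Mon Jul 15 19:30:01 2019 +0200',
             'message': '    example message\n',
             'diffs': [
                 {'filename': 'a/pkg/mod.py b/pkg/mod.py',
                  'diff': ['+import os', '-import sys']},
             ]},
        ],
    },
}

def outline(value, depth=0):
    """Return indented lines that describe the nesting of dicts and lists."""
    pad = '  ' * depth
    if isinstance(value, dict):
        lines = []
        for key, item in value.items():
            lines.append('%s%r: %s' % (pad, key, type(item).__name__))
            lines.extend(outline(item, depth + 1))
        return lines
    if isinstance(value, list) and value:
        # Describe only the first element; the rest are assumed to share its shape.
        lines = ['%s[0] of %d: %s' % (pad, len(value), type(value[0]).__name__)]
        lines.extend(outline(value[0], depth + 1))
        return lines
    return []   # scalars (str, bytes, int) end the recursion

structure_lines = outline(sample_developers)
print('\n'.join(structure_lines))
```

Read top to bottom, the outline goes developer, then <code>commits</code>, then one commit, then <code>diffs</code>, then one diff with its list of change strings.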

## Goal reminder

The goal is to reformat the data. Currently the key is the developer and the value is a nested structure that includes all the commits for that person. We want to invert this: we'd like a list of commits, where each commit carries its own info, including the developer name.
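
A stripped-down sketch of that inversion, using made-up developers and leaving out the date and diff processing that the real loop performs (the function name <code>invert_developers</code> is hypothetical):

```python
def invert_developers(developers):
    """Turn {developer: {'commits': [...]}} into a flat list of commit dicts."""
    flat = []
    for developer, record in developers.items():
        for commit in record['commits']:
            row = dict(commit)          # shallow copy so the input is untouched
            row['name'] = developer     # carry the developer name onto the commit
            flat.append(row)
    return flat

sample = {
    b'Dev One <dev.one@example.com>': {'total_commits': 2, 'commits': [
        {'id': b'a1', 'date': b'Mon Jul 15 19:30:01 2019 +0200', 'message': 'first\n', 'diffs': []},
        {'id': b'a2', 'date': b'Tue Jul 16 08:00:00 2019 +0200', 'message': 'second\n', 'diffs': []}]},
    b'Dev Two <dev.two@example.com>': {'total_commits': 1, 'commits': [
        {'id': b'b1', 'date': b'Fri Jul 12 08:33:23 2019 -0700', 'message': 'third\n', 'diffs': []}]},
}

flat_commits = invert_developers(sample)
print(len(flat_commits))
print(flat_commits[0]['name'])
```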

Note that Python is awkward when it comes to time zones. We could not get <code>%z</code>, which is supposed to handle UTC offsets, to work. See the discussion here: https://stackoverflow.com/questions/3305413/python-strptime-and-timezones.

In particular, these formats fail when the string includes a UTC offset: <code>'%a %b %m %X %y %z'</code> and <code>'%c %z'</code>.

We resort to a utility package, <code>dateutil</code>, instead.
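
For reference, here is a stdlib-only sketch, assuming Python 3 and English day and month names in the current locale. In the first format quoted above, <code>%m</code> (month number) sits where the day of month appears in the string and <code>%y</code> expects a two-digit year, which may explain that failure independently of <code>%z</code>. With <code>%d</code> and <code>%Y</code> in those positions, <code>strptime</code> parses the sample date. The helper name <code>parse_git_date</code> is hypothetical; the notebook itself keeps using <code>dateutil</code>.

```python
from datetime import datetime, timedelta

GIT_DATE_FORMAT = '%a %b %d %H:%M:%S %Y %z'   # e.g. 'Tue Aug 2 14:58:31 2016 +0200'

def parse_git_date(raw):
    """Parse a git-style date given as str or bytes into an aware datetime."""
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')   # strptime only accepts str
    return datetime.strptime(raw, GIT_DATE_FORMAT)

parsed = parse_git_date(b'Tue Aug 2 14:58:31 2016 +0200')
print(parsed.isoformat())                        # 2016-08-02T14:58:31+02:00
print(parsed.utcoffset() == timedelta(hours=2))  # True
```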

In [26]:
import datetime
from dateutil import parser

In [27]:
x = parser.parse("Tue Aug 2 14:58:31 2016 +0200")  #+0200 hours => 7200 seconds
x  #just an example

datetime.datetime(2016, 8, 2, 14, 58, 31, tzinfo=tzoffset(None, 7200))

In [28]:
x.utcoffset().seconds

7200
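
One caveat worth a sketch: <code>timedelta.seconds</code> is never negative, so an offset west of UTC such as -0700 is reported as 61200 rather than -25200. <code>total_seconds()</code> keeps the sign. The example builds the datetimes with the stdlib <code>timezone</code> class so it runs without <code>dateutil</code>; the helper name <code>offset_seconds</code> is an assumption.

```python
from datetime import datetime, timedelta, timezone

def offset_seconds(moment):
    """Signed UTC offset of an aware datetime, in seconds."""
    return int(moment.utcoffset().total_seconds())

east = datetime(2016, 8, 2, 14, 58, 31, tzinfo=timezone(timedelta(hours=2)))
west = datetime(2019, 7, 15, 7, 32, 51, tzinfo=timezone(timedelta(hours=-7)))

print(east.utcoffset().seconds, offset_seconds(east))   # 7200 7200
print(west.utcoffset().seconds, offset_seconds(west))   # 61200 -25200
```

If the offset is meant to become a model feature, the signed value is probably the one that distinguishes east from west.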

## Helper function

Given a list of diff strings (see <code>list_of_diff_strings</code> above), count the lines of code changed (locc). At the moment this is trivial because we assume each string represents a single change (+ or -). So in theory we could just take the length of <code>change_list</code> as the locc. However, we want to verify that each string starts with a + or - and print a warning if not.

Eventually we want to come back to the case where one line is replaced with another. This is currently counted as 2 changes, a + and a -, when it should be counted as only 1. On our to-do list is to further process <code>change_list</code> to catch these types of replace actions.
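
One possible heuristic for that to-do item, offered as a sketch and not as the method this notebook uses: treat a run of <code>-</code> lines followed directly by a run of <code>+</code> lines as one group, and count the group as the longer of the two runs, so each paired line counts once. The function name is hypothetical, and the pairing is only a guess because the diff lists carry no context lines.

```python
def count_changes_pairing_replacements(change_list):
    """Count changed lines, counting a '-' line paired with a following '+' line once."""
    total = 0
    minus_run = 0   # length of the current run of '-' lines
    plus_run = 0    # length of the '+' run that directly follows it
    for change in change_list:
        sign = change[:1]
        if sign == '-':
            if plus_run:                       # a '+' run just ended: close the group
                total += max(minus_run, plus_run)
                minus_run = plus_run = 0
            minus_run += 1
        elif sign == '+':
            plus_run += 1
        else:
            # Unmarked line: close the current group and ignore the line itself.
            total += max(minus_run, plus_run)
            minus_run = plus_run = 0
    return total + max(minus_run, plus_run)

print(count_changes_pairing_replacements(['-old line', '+new line']))   # 1
print(count_changes_pairing_replacements(['+added', '-removed']))       # 2
```

An addition followed by a removal is left as 2 changes, since only a removal followed by an addition looks like a replacement.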

In [29]:
import re  #use in future

In [30]:
#This is a trivial method until we start thinking about replace actions
def count_changes(change_list):
    change_count = 0
    #Go through each change represented as a string, e.g., '+ share/spack/qa/.*'
    for change in change_list:
        #Big assumption: single + or - per change and it is the first char
        if change[:1] in ('+', '-'):   #slice so an empty string warns instead of raising
            change_count += 1
        else:
            print(('missing +- at index 0', change))
    #TODO: worry about a replace being counted as 2 instead of 1
    return change_count
        

## Create a list of dicts

Each dict will end up as a row in a commit-based table.

<code>entry</code> below is the 2-tuple version of a single developer record. It has 2 items, the developer name and the complicated stuff.

In [31]:
all_commits = []

for entry in dev_tups:
    name = entry[0]                               #'Gilles Fourestey <gilles.fourestey@epfl.ch>'
    commit_list = entry[1]['commits']      #a list of dictionaries with date, diffs, message as keys
    seen = []                               #need this because getting duplicates
    for a_commit in commit_list:
        if a_commit in seen: continue
        if not a_commit['diffs']: continue  #skip commit with empty diff
        seen.append(a_commit)
        new_dict = {}
        date = parser.parse(a_commit['date'])         #"Tue Aug 2 14:58:31 2016 +0200"
        new_dict['year'] = date.year
        new_dict['doy'] = min(365, int(date.strftime('%j'))) #punting on leap year
        new_dict['month'] = date.month
        new_dict['day_of_month'] = date.day
        new_dict['day_name'] = date.strftime('%A')  #e.g., Tuesday
        new_dict['utc_offset'] = date.utcoffset().seconds  #e.g., 7200; .seconds is never negative, so -0700 shows up as 61200

        new_dict['name'] = name
        new_dict['message'] = a_commit['message']
        
        diffs = a_commit['diffs']  #list of dictionaries diff, filename
        
        filenames = []
        changes_total = 0
        for real_diff in diffs:   #a dictionary diff filename
            filenames.append(real_diff['filename'])
            the_diffs = real_diff['diff']  #list of strings
            if the_diffs:
                change_count = count_changes(the_diffs)
                changes_total += change_count
            else:
                 print(('warning: empty diff list for', name, date.month, date.day, a_commit['message'], a_commit['id']))
        new_dict['filenames'] = filenames
        new_dict['locc'] = changes_total
        all_commits.append(new_dict)
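
The loop above removes duplicates by comparing whole commit dictionaries against a list, which gets slower as the list grows. A sketch of an alternative, assuming the <code>'id'</code> value (the commit hash) is enough to identify a commit; the function name <code>unique_commits</code> and the sample data are hypothetical.

```python
def unique_commits(commit_list):
    """Return commits in original order, keeping the first occurrence of each id."""
    seen_ids = set()
    unique = []
    for commit in commit_list:
        commit_id = commit['id']
        if commit_id in seen_ids:
            continue                 # duplicate of an earlier commit
        seen_ids.add(commit_id)
        unique.append(commit)
    return unique

demo_commits = [
    {'id': b'a1', 'message': 'first\n'},
    {'id': b'a1', 'message': 'first\n'},   # duplicate record
    {'id': b'b2', 'message': 'second\n'},
]
deduped = unique_commits(demo_commits)
print([c['id'] for c in deduped])
```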
        
    

In [32]:
len(all_commits)

10646

In [33]:
all_commits[:5]

[{'day_name': 'Monday',
  'day_of_month': 15,
  'doy': 196,
  'filenames': ['a/lib/spack/spack/cmd/uninstall.py b/lib/spack/spack/cmd/uninstall.py'],
  'locc': 1,
  'message': '    spack uninstall can uninstall specs with multiple roots (#11977)\n    \n    Fixes #3690\n    Fixes #5637\n    \n    Uninstalling dependents of a spec was relying on a traversal of the\n    parents done by inspecting spec._dependents. This is in turn a\n    DependencyMap that maps a package name to a single DependencySpec object\n    (an edge in the DAG) and cannot thus model the case where a spec has\n    multiple configurations of the same parent package installed (for\n    example if different versions of the same Python library depend on\n    the same Python installation).\n    \n    This commit works around this issue by constructing the list of specs to\n    be uninstalled in an alternative way, and adds tests to verify the\n    behavior. The core issue with DependencyMap is not resolved here.\n',
  'mo

Looks ok to me. Now converting to pandas dataframe is easy peasy.

In [34]:
import pandas as pd

In [48]:
raw_table = pd.DataFrame(all_commits)  #gotta love pandas

In [49]:
commits_table = raw_table[['month', 'day_of_month', 'day_name', 'year',
                           'locc', 'message', 'filenames', 'name', 'doy', 'utc_offset']]

In [50]:
commits_table.head()

| | month | day_of_month | day_name | year | locc | message | filenames | name | doy | utc_offset |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 15 | Monday | 2019 | 1 | spack uninstall can uninstall specs with m... | `[a/lib/spack/spack/cmd/uninstall.py b/lib/spac...` | `b'Massimiliano Culpo <massimiliano.culpo@gmail...` | 196 | 7200 |
| 1 | 6 | 13 | Thursday | 2019 | 3 | Make "spack compiler find" check PATH by d... | `[a/lib/spack/spack/cmd/compiler.py b/lib/spack...` | `b'Massimiliano Culpo <massimiliano.culpo@gmail...` | 164 | 7200 |
| 2 | 6 | 7 | Friday | 2019 | 2 | Compiler search uses a pool of workers (#1... | `[a/lib/spack/llnl/util/filesystem.py b/lib/spa...` | `b'Massimiliano Culpo <massimiliano.culpo@gmail...` | 158 | 7200 |
| 3 | 5 | 28 | Tuesday | 2019 | 10 | Cap the maximum number of build jobs (#113... | `[a/etc/spack/defaults/config.yaml b/etc/spack/...` | `b'Massimiliano Culpo <massimiliano.culpo@gmail...` | 148 | 7200 |
| 4 | 5 | 24 | Friday | 2019 | 6 | build env: simplify handling of parallel j... | `[a/lib/spack/spack/build_environment.py b/lib/...` | `b'Massimiliano Culpo <massimiliano.culpo@gmail...` | 144 | 7200 |


In [51]:
#df_sorted = df_commits.sort_values(by='date')  #fails because of UTC
commits_table = commits_table.sort_values(['year', 'doy'])  #primary and secondary
commits_table = commits_table.reset_index(drop=True)
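
Sorting on <code>year</code> and <code>doy</code> orders commits by each committer's local calendar day, which can differ from the order in absolute time when offsets are mixed. The stdlib sketch below, with made-up timestamps, shows the difference; timezone-aware <code>datetime</code> objects compare by absolute time, so a plain <code>sorted</code> call is enough for the second ordering.

```python
from datetime import datetime, timedelta, timezone

# Two made-up commits: late evening at -0700 and early morning at +0200.
late_west = datetime(2019, 7, 15, 23, 30, tzinfo=timezone(timedelta(hours=-7)))
early_east = datetime(2019, 7, 16, 8, 0, tzinfo=timezone(timedelta(hours=2)))

# Ordering by local (year, day of year), as the table sort does.
by_local_day = sorted([early_east, late_west],
                      key=lambda d: (d.year, d.timetuple().tm_yday))

# Ordering by absolute time: aware datetimes compare after conversion to UTC.
by_instant = sorted([late_west, early_east])

print(by_local_day[0] is late_west)    # True: July 15 locally comes first
print(by_instant[0] is early_east)     # True: 06:00 UTC precedes 06:30 UTC
```

Which ordering is preferable depends on whether the features should reflect the developer's own working day or a single project-wide clock.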

In [52]:
commits_table.head(50)

| | month | day_of_month | day_name | year | locc | message | filenames | name | doy | utc_offset |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 13 | Wednesday | 2013 | 5 | Initial version of spack with one package:... | `[a/.gitignore b/.gitignore, a/bin/spack b/bin/...` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 44 | 57600 |
| 1 | 2 | 18 | Monday | 2013 | 2 | Require python2.7\n | `[a/bin/spack b/bin/spack]` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 49 | 57600 |
| 2 | 2 | 18 | Monday | 2013 | 2 | Dependencies now work. Added libelf, libd... | `[a/bin/spack b/bin/spack]` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 49 | 57600 |
| 3 | 2 | 19 | Tuesday | 2013 | 3 | Fixed passing of dependence prefixes to cc... | `[a/lib/spack/env/cc b/lib/spack/env/cc]` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 50 | 57600 |
| 4 | 2 | 19 | Tuesday | 2013 | 6 | Fixes, remove parallel build for libdwarf ... | `[a/lib/spack/env/cc b/lib/spack/env/cc]` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 50 | 57600 |
| 5 | 2 | 19 | Tuesday | 2013 | 1 | rpaths for dependencies. elf, dwarf, cmak... | `[a/lib/spack/env/cc b/lib/spack/env/cc]` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 50 | 57600 |
| 6 | 2 | 20 | Wednesday | 2013 | 2 | Fixed bug in parallel make option.\n | `[a/lib/spack/spack/Package.py b/lib/spack/spac...` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 51 | 57600 |
| 7 | 2 | 20 | Wednesday | 2013 | 1 | Added libunwind and fixed link issues in c... | `[a/lib/spack/env/cc b/lib/spack/env/cc]` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 51 | 57600 |
| 8 | 2 | 21 | Thursday | 2013 | 4 | Better handling of stage.\n - better sy... | `[a/lib/spack/spack/Package.py b/lib/spack/spac...` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 52 | 57600 |
| 9 | 2 | 21 | Thursday | 2013 | 2 | Parallel bootstrap for cmake.\n | `[a/lib/spack/spack/Package.py b/lib/spack/spac...` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 52 | 57600 |


In [39]:
for x in dev_tups:
    if x[0] == b'Todd Gamblin <tgamblin@llnl.gov>':
        found = x
        break

In [40]:
z = found[1]['commits']
z

[{'date': b'Mon Jul 15 07:32:51 2019 -0700',
  'diffs': [{'diff': ['-    homepage = "https://computation.llnl.gov/projects/co-design/amg2013"',
     '+    homepage = "https://computing.llnl.gov/projects/co-design/amg2013"'],
    'filename': 'a/var/spack/repos/builtin/packages/amg/package.py b/var/spack/repos/builtin/packages/amg/package.py'}],
  'id': b'a3caf52cac9f19638220b837e4870210d698da0d',
  'message': '    packages: computation.llnl.gov is now computing.llnl.gov (#12013)\n'},
 {'date': b'Fri Jul 12 08:33:23 2019 -0700',
  'diffs': [{'diff': ['-    # This logic is derived from the cea-hpc/modules profile.sh example at',
     '-    # https://github.com/cea-hpc/modules/blob/master/init/profile.sh.in',
     '-    #',
     '-    # The objective is to correctly detect the shell type even when setup-env',
     '-    # is sourced within a script itself rather than a login terminal.',
     '-        echo ${BASH##*/}',
     '+        echo bash',
     '-        echo $ZSH_NAME',
     '+    

In [41]:
oct11 = []
for i,d in enumerate(z):
    if b'Oct 11' in d['date'] and 'pydoc' in d['message']:
        oct11.append(d['diffs'])
oct11

[[{'diff': [],
   'filename': 'a/lib/spack/spack/cmd/doc.py b/lib/spack/spack/cmd/doc.py'}],
 [{'diff': [],
   'filename': 'a/lib/spack/spack/cmd/doc.py b/lib/spack/spack/cmd/doc.py'}],
 [{'diff': [],
   'filename': 'a/lib/spack/spack/cmd/doc.py b/lib/spack/spack/cmd/doc.py'}],
 [{'diff': [],
   'filename': 'a/lib/spack/spack/cmd/doc.py b/lib/spack/spack/cmd/doc.py'}],
 [{'diff': [],
   'filename': 'a/lib/spack/spack/cmd/doc.py b/lib/spack/spack/cmd/doc.py'}],
 [{'diff': [],
   'filename': 'a/lib/spack/spack/cmd/doc.py b/lib/spack/spack/cmd/doc.py'}],
 [{'diff': [],
   'filename': 'a/lib/spack/spack/cmd/doc.py b/lib/spack/spack/cmd/doc.py'}]]

In [42]:
z[435]

{'date': b'Tue Oct 11 23:13:40 2016 -0700',
 'diffs': [{'diff': ["-        spack.curl.add_default_arg('-k')",
    '+        spack.insecure = True'],
   'filename': 'a/bin/spack b/bin/spack'}],
 'id': b'488e1bab2ca384863517375f47be33dca4f170f8',
 'message': '    Make `insecure` option work with curl AND git. (#1786)\n'}

In [53]:
commits_table.tail()

| | month | day_of_month | day_name | year | locc | message | filenames | name | doy | utc_offset |
|---|---|---|---|---|---|---|---|---|---|---|
| 10641 | 7 | 14 | Sunday | 2019 | 5 | bzip2: Add 1.0.8 (#12017)\n \n Updat... | `[a/var/spack/repos/builtin/packages/bzip2/pack...` | `b'Michael Kuhn <michael.kuhn@informatik.uni-ha...` | 195 | 7200 |
| 10642 | 7 | 15 | Monday | 2019 | 1 | spack uninstall can uninstall specs with m... | `[a/lib/spack/spack/cmd/uninstall.py b/lib/spac...` | `b'Massimiliano Culpo <massimiliano.culpo@gmail...` | 196 | 7200 |
| 10643 | 7 | 15 | Monday | 2019 | 2 | packages: computation.llnl.gov is now comp... | `[a/var/spack/repos/builtin/packages/amg/packag...` | `b'Todd Gamblin <tgamblin@llnl.gov>'` | 196 | 61200 |
| 10644 | 7 | 16 | Tuesday | 2019 | 0 | binutils: added '-Wno-narrowing' to CXXFLA... | `[a/var/spack/repos/builtin/packages/binutils/p...` | `b'Hironori-Yamaji <52182908+Hironori-Yamaji@us...` | 197 | 32400 |
| 10645 | 7 | 16 | Tuesday | 2019 | 3 | py-basemap: install without egg (#11961)\n... | `[a/var/spack/repos/builtin/packages/py-basemap...` | `b'Milton Woods <miltonjwoods@gmail.com>'` | 197 | 36000 |


In [54]:
len(commits_table)

10646

In [45]:
foo_fie_fum()  #break here from Run All and decide if want to write out

NameError: name 'foo_fie_fum' is not defined

Write the commit table out to a file.

In [55]:
save_dir = repository_dir + '/machine_learning/predicting_project_activity/'

In [56]:
commits_table.to_csv(save_dir + project +  '_commits_table.csv', index=False)
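
CSV stores text only, so list-valued cells such as <code>filenames</code> and bytes-valued cells such as <code>name</code> should be expected to come back as strings when the file is read again. The sketch below illustrates the round trip with the stdlib <code>csv</code> module and a made-up row; it assumes the cells were written as their Python <code>repr</code>, in which case <code>ast.literal_eval</code> can restore them.

```python
import ast
import csv
import io

row = {'locc': 1,
       'filenames': ['a/bin/spack b/bin/spack'],
       'name': b'Dev One <dev.one@example.com>'}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['locc', 'filenames', 'name'])
writer.writeheader()
writer.writerow(row)              # non-string values are written via str()

buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded['filenames']).__name__)     # str, not list

# literal_eval safely evaluates Python literals such as lists and bytes.
restored_filenames = ast.literal_eval(loaded['filenames'])
restored_name = ast.literal_eval(loaded['name'])
print(restored_filenames == row['filenames'], restored_name == row['name'])
```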

# Part 2

Now produce a day-based table using <code>commits_table</code> as the starting point.

In [57]:
import datetime
from dateutil import parser

## Here is starting date

starting_year = commits_table.loc[0,'year']
starting_month = commits_table.loc[0,'month']
starting_day = commits_table.loc[0,'day_of_month']
starting_obj = datetime.date(starting_year, starting_month, starting_day)
starting_obj

datetime.date(2013, 2, 13)

In [58]:
## Here is ending date

ending_year = commits_table.iloc[-1]['year']
ending_month = commits_table.iloc[-1]['month']
ending_day = commits_table.iloc[-1]['day_of_month']
ending_obj = datetime.date(ending_year, ending_month, ending_day)
ending_obj

datetime.date(2019, 7, 16)

In [59]:
#We should end up with a list of this length, i.e., a list item for each day.

td = ending_obj - starting_obj
td.days

2344

## Approach

Goal: for every day between the starting and ending dates, create a row for that day.

Actual method: loop through the rows of the commit table. Keep the values needed to (a) count rows with the same date, (b) count skipped days, which lead to a sequence of 0 entries, and (c) determine when the year changes so we can reset values.

Note that things like messages and filenames will be combined into a list that goes with a day, i.e., one day of activity can have multiple filenames and multiple messages associated with it.
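
For comparison, here is a compact stdlib sketch of the same goal that groups commits by calendar date and then walks every date from the first to the last commit. It is an alternative illustration, not the method used in the cells that follow: the function name <code>build_day_rows</code> and the demo commits are made up, it does not cap the day of year at 365, and it assumes at least one commit is present.

```python
from collections import defaultdict
from datetime import date, timedelta

def build_day_rows(commits):
    """Return one summary dict per calendar day from first to last commit, inclusive."""
    by_day = defaultdict(list)
    for commit in commits:
        key = date(commit['year'], commit['month'], commit['day_of_month'])
        by_day[key].append(commit)

    rows = []
    day, last = min(by_day), max(by_day)
    while day <= last:
        todays = by_day.get(day, [])                 # empty on days with no commits
        rows.append({
            'year': day.year, 'month': day.month, 'day_of_month': day.day,
            'day_of_week': day.isoweekday(),         # Monday=1 ... Sunday=7
            'doy': day.timetuple().tm_yday,          # 366 is possible in leap years
            'total_commits': len(todays),
            'total_loccs': sum(c['locc'] for c in todays),
            'total_messages': [c['message'] for c in todays],
            'total_filenames': [c['filenames'] for c in todays],
            'total_names': [c['name'] for c in todays],
        })
        day += timedelta(days=1)                     # handles month and year changes
    return rows

dev = b'Dev One <dev.one@example.com>'
demo_commits = [
    {'year': 2016, 'month': 2, 'day_of_month': 28, 'locc': 3,
     'message': 'first\n', 'filenames': ['a/x b/x'], 'name': dev},
    {'year': 2016, 'month': 2, 'day_of_month': 28, 'locc': 2,
     'message': 'second\n', 'filenames': ['a/y b/y'], 'name': dev},
    {'year': 2016, 'month': 3, 'day_of_month': 1, 'locc': 1,
     'message': 'third\n', 'filenames': ['a/z b/z'], 'name': dev},
]
day_rows = build_day_rows(demo_commits)
print([(r['month'], r['day_of_month'], r['total_commits']) for r in day_rows])
```

The demo spans February 29, 2016, so the middle row is a zero-activity leap day.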

In [68]:
#Fill out this new table. These columns will be the fodder for features in a feature set.

days_table = pd.DataFrame(columns= ['month', 'day_of_month', 'day_of_week', 'year', 'total_commits',
                           'total_loccs', 'total_messages', 'total_filenames', 'total_names',
                                    'doy'])


In [69]:
current_day = int(commits_table.loc[0,'doy'])  #day of year: 1-365 (Part 1 caps leap-year day 366 at 365)
current_year = commits_table.loc[0,'year']
dnint = {'Monday':1, 'Tuesday':2, 'Wednesday':3, 'Thursday':4, 'Friday':5, 'Saturday':6, 'Sunday':7}

names = []
messages = []
filenames = []
total_commits = 0
total_loccs = 0


for i in range(len(commits_table)):

    #pull out date pieces
    year = int(commits_table.loc[i,'year'])
    day_of_year = int(commits_table.loc[i,'doy'])
    
    #check if change years, e.g., change from 2013 to 2014
    if year!=current_year:
        current_year = year
        diff = day_of_year + (365 - current_day)  #account for skipped days at end of old year
    else:
        diff = day_of_year - current_day
    
    #diff now holds number of days incremented
    
    #No diff so same day - increment all tracked values for the day
    if diff==0:
        total_loccs += commits_table.loc[i,'locc']
        total_commits += 1
        messages.append(commits_table.loc[i,'message'])
        names.append(commits_table.loc[i,'name'])
        filenames.append(commits_table.loc[i,'filenames'])
        continue
    
    #Now things get interesting. We need to move back in time to beginning edge of gap. If gap is size diff,
    #then move back diff days. That will give us the date before the gap begins.
    
    #First build date object - easier to do arithmetic on. This is date on ending edge of gap.
    month = int(commits_table.loc[i,'month'])
    day_of_month = int(commits_table.loc[i,'day_of_month'])
    end_gap_date = datetime.datetime(year, month, day_of_month)   #current row we are looking at
    
    begin_gap_date = end_gap_date - datetime.timedelta(days=diff) #looking back in time
    
    #record row values for begin gap date
    prior_day_name = dnint[begin_gap_date.strftime('%A')]  #convert to int 1-7
    prior_month = begin_gap_date.month
    prior_day_of_month = begin_gap_date.day
    prior_year = begin_gap_date.year
    prior_doy = min(365, int(begin_gap_date.strftime('%j')))

    #build row and append
    new_row = {'day_of_week': prior_day_name, 'month': prior_month, 'day_of_month': prior_day_of_month,
               'doy': prior_doy, 'year': prior_year,
               'total_messages': messages, 'total_names': names, 'total_filenames': filenames,
               'total_loccs': total_loccs, 'total_commits': total_commits}
    #pd.concat instead of DataFrame.append, which was removed in pandas 2.0
    days_table = pd.concat([days_table, pd.DataFrame([new_row])], ignore_index=True)
    
    #Whew. Took care of recording data for the beginning date of gap.
    
    #diff = 1 so tomorrow is here :) Just reset things since no dates skipped
    if diff == 1:
        total_loccs = commits_table.loc[i,'locc']
        total_commits = 1
        messages = [commits_table.loc[i,'message']]
        names = [commits_table.loc[i,'name']]
        filenames = [commits_table.loc[i,'filenames']]
        current_day = day_of_year
        continue
    
    #we have a gap! need to fill in with 0 feature values for each day in gap
    if diff > 1:
        date_obj = begin_gap_date
        total_loccs = 0
        total_commits = 0
        messages = []
        names = []
        filenames = []
        for j in range(diff-1):
            date_obj += datetime.timedelta(days=1)  #handles month change overs
            day_name = dnint[date_obj.strftime('%A')]
            new_row = {'day_of_week': day_name, 'month': date_obj.month, 'day_of_month': date_obj.day,
               'doy': min(365, int(date_obj.strftime('%j'))), #punting on leap year
                'year': date_obj.year, 'total_names': names, 'total_filenames': filenames,
               'total_messages': messages, 'total_loccs': total_loccs, 'total_commits': total_commits}
            #pd.concat instead of DataFrame.append, which was removed in pandas 2.0
            days_table = pd.concat([days_table, pd.DataFrame([new_row])], ignore_index=True)
        #record the new commit we just saw for date at end of gap
        total_loccs = commits_table.loc[i,'locc']
        total_commits = 1
        messages = [commits_table.loc[i,'message']]
        names = [commits_table.loc[i,'name']]
        filenames = [commits_table.loc[i,'filenames']]
        current_day = day_of_year  #now on new date
        continue
    
    print((i, day_of_year, year, diff))
    raise RuntimeError('unexpected day difference: %s' % diff)  #should never get here

#check if have any values accumulated.
if total_commits>0:
    final_commit_row = commits_table.iloc[-1]
    end_date = datetime.datetime(final_commit_row['year'], final_commit_row['month'], final_commit_row['day_of_month'])
    new_row = {'day_of_week': dnint[end_date.strftime('%A')], 'month': end_date.month, 'day_of_month': end_date.day,
               'doy': final_commit_row['doy'], 'year': end_date.year,
               'total_messages': messages, 'total_names': names, 'total_filenames': filenames,
               'total_loccs': total_loccs, 'total_commits': total_commits}
    #pd.concat instead of DataFrame.append, which was removed in pandas 2.0
    days_table = pd.concat([days_table, pd.DataFrame([new_row])], ignore_index=True)

In [70]:
days_table.head(50)

| | month | day_of_month | day_of_week | year | total_commits | total_loccs | total_messages | total_filenames | total_names | doy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 13 | 3 | 2013 | 1 | 5 | `[ Initial version of spack with one package...` | `[[a/.gitignore b/.gitignore, a/bin/spack b/bin...` | `[b'Todd Gamblin <tgamblin@llnl.gov>']` | 44 |
| 1 | 2 | 14 | 4 | 2013 | 0 | 0 | `[]` | `[]` | `[]` | 45 |
| 2 | 2 | 15 | 5 | 2013 | 0 | 0 | `[]` | `[]` | `[]` | 46 |
| 3 | 2 | 16 | 6 | 2013 | 0 | 0 | `[]` | `[]` | `[]` | 47 |
| 4 | 2 | 17 | 7 | 2013 | 0 | 0 | `[]` | `[]` | `[]` | 48 |
| 5 | 2 | 18 | 1 | 2013 | 2 | 4 | `[ Require python2.7\n, Dependencies now...` | `[[a/bin/spack b/bin/spack], [a/bin/spack b/bin...` | `[b'Todd Gamblin <tgamblin@llnl.gov>', b'Todd G...` | 49 |
| 6 | 2 | 19 | 2 | 2013 | 3 | 10 | `[ Fixed passing of dependence prefixes to c...` | `[[a/lib/spack/env/cc b/lib/spack/env/cc], [a/l...` | `[b'Todd Gamblin <tgamblin@llnl.gov>', b'Todd G...` | 50 |
| 7 | 2 | 20 | 3 | 2013 | 2 | 3 | `[ Fixed bug in parallel make option.\n, ...` | `[[a/lib/spack/spack/Package.py b/lib/spack/spa...` | `[b'Todd Gamblin <tgamblin@llnl.gov>', b'Todd G...` | 51 |
| 8 | 2 | 21 | 4 | 2013 | 6 | 16 | `[ Better handling of stage.\n - better s...` | `[[a/lib/spack/spack/Package.py b/lib/spack/spa...` | `[b'Todd Gamblin <tgamblin@llnl.gov>', b'Todd G...` | 52 |
| 9 | 2 | 22 | 5 | 2013 | 1 | 9 | `[ Better spack -h: added cmd descriptions.\...` | `[[a/bin/spack b/bin/spack]]` | `[b'Todd Gamblin <tgamblin@llnl.gov>']` | 53 |


In [66]:
len(days_table)

2344

In [None]:
fee_fie_foe()  #stop here before writing. Will stop Run All here.

In [67]:
days_table.to_csv(save_dir + project +  '_days_table.csv', index=False)