# Git Data Extraction using Python

This notebook helps you load historical Git data from any public repository on GitHub.

## Pre-requisites
The code assumes you have the following libraries installed:

- pydriller
- pandas
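
Before running the cells, it can help to confirm that both libraries are importable. The sketch below uses only the standard library; the helper name `missing_packages` is an assumption made for illustration and is not part of PyDriller or pandas.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(['pydriller', 'pandas'])
if missing:
    # Suggest an install command for whatever is absent
    print('Install these first: pip install ' + ' '.join(missing))
else:
    print('All required libraries are available')
```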

## Part 1: Pulling Commit Data from GitHub

We will use the PyDriller library to pull commit data from GitHub for a public repository and build a list of commits.

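This process can take a very long time for repositories with many commits, because PyDriller first clones the remote repository to a temporary folder and then inspects every commit.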
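
If the full history is not needed, one way to shorten the run is to pass filters to `Repository`. The sketch below assumes the `since`, `only_in_branch`, and `only_no_merge` keyword arguments of PyDriller 2.x; the helper names `build_filters` and `open_repository`, and the choice of 2023 as a cut-off, are illustrative assumptions rather than part of this notebook.

```python
from datetime import datetime, timezone


def build_filters(since_year, branch='main', skip_merges=True):
    """Build keyword arguments for pydriller.Repository that narrow the traversal."""
    return {
        'since': datetime(since_year, 1, 1, tzinfo=timezone.utc),
        'only_in_branch': branch,
        'only_no_merge': skip_merges,
    }


def open_repository(path, **filters):
    """Create a filtered Repository (PyDriller is imported lazily)."""
    from pydriller import Repository
    return Repository(path, **filters)


filters = build_filters(2023)
print(filters)

# Uncomment to traverse only the filtered commits:
# repo = open_repository('https://github.com/dotnet/machinelearning', **filters)
```

The repository still has to be cloned once, so the saving comes from skipping commits during traversal rather than from the download itself.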

In [2]:
# We need PyDriller to pull git repository information
from pydriller import Repository


# Using the ML.NET public repository on GitHub
account = 'dotnet'
repository = 'machinelearning'
path = 'https://github.com/' + account + '/' + repository
repo = Repository(path)

print('Using repository ' + path)

Using repository https://github.com/dotnet/machinelearning


In [2]:


# Loop over each PyDriller commit to transform it to a commit usable for analysis later
# NOTE: This can take a LONG time if there are many commits

commits = []
for commit in repo.traverse_commits():

    hash = commit.hash
    try:

        # Gather a list of files modified in the commit
        files = []
        for f in commit.modified_files:
            if f.new_path is not None:
                files.append(f.new_path) 

        # Capture information about the commit in object format so I can reference it later
        record = {
            'hash': hash,
            'message': commit.msg,
            'author_name': commit.author.name,
            'author_email': commit.author.email,
            'author_date': commit.author_date,
            'author_tz': commit.author_timezone,
            'committer_name': commit.committer.name,
            'committer_email': commit.committer.email,
            'committer_date': commit.committer_date,
            'committer_tz': commit.committer_timezone,
            'in_main': commit.in_main_branch,
            'is_merge': commit.merge,
            'num_deletes': commit.deletions,
            'num_inserts': commit.insertions,
            'net_lines': commit.insertions - commit.deletions,
            'num_files': commit.files,
            'branches': ', '.join(commit.branches), # Comma separated list of branches the commit is found in
            'files': ', '.join(files), # Comma separated list of files the commit modifies
            'parents': ', '.join(commit.parents), # Comma separated list of parents
            # PyDriller Open Source Delta Maintainability Model (OS-DMM) stat. See https://pydriller.readthedocs.io/en/latest/deltamaintainability.html for metric definitions
            'dmm_unit_size': commit.dmm_unit_size,
            'dmm_unit_complexity': commit.dmm_unit_complexity,
            'dmm_unit_interfacing': commit.dmm_unit_interfacing,
        }
        # Omitted: modified_files (list), project_path, project_name
        commits.append(record)

    except Exception:
        print('Problem reading commit ' + hash)
        continue

Problem reading commit eae76959e6714af44caa212e102a5f06f0110e72
Problem reading commit df1c2af3369a5c87ea03df17d75d7e6fa730543f
Problem reading commit c3a20faa31c22eb85806d325dfe9f12d308c772e
Problem reading commit cb37c7e7f1e1b29b5608a2755db793c5435d10b1
Problem reading commit 2a927865769e10772a31af407f2a856fd6e4e523
Problem reading commit b2ac8e036e0fd932f18ca7f148367fc3d8a2c2a8
Problem reading commit a6024769c9da2ccac20531c36d8e137f3de64f6c
Problem reading commit 3b576fe058ed4f4331018bbc3eabc1ac26219644
Problem reading commit cc400493dc7934ca25c2720b46eeeef554a28749
Problem reading commit 8e5f7b42cd65660393b3ac59765ae166ee7ea4ad
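
The messages above show which commits failed but not why. A minimal sketch of one way to keep the reason is shown below. The helper `describe_failure`, the `failures` list, and the simulated `ValueError` are all assumptions for illustration; in the real loop the `try` body would be the record-building code.

```python
def describe_failure(commit_hash, error):
    """Build a small record explaining why a commit could not be read."""
    return {
        'hash': commit_hash,
        'error_type': type(error).__name__,
        'error_message': str(error),
    }


failures = []

# Stand-in for the body of the traversal loop; the error here is simulated
try:
    raise ValueError('simulated failure while reading a commit')
except Exception as error:
    # 'hypothetical-hash' is a placeholder, not a real commit
    failures.append(describe_failure('hypothetical-hash', error))

print(failures)
```

Collecting the failures in a list also makes it possible to load them into their own DataFrame and review them alongside the successful commits.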


## Part 2: Build a Pandas DataFrame
Now that we have a raw list of commits available, let's translate it to a Pandas DataFrame so we can validate that the data appears roughly correct before exporting.

In [3]:

import pandas as pd

# Translate this list of commits to a Pandas DataFrame and preview the first rows
df_commits = pd.DataFrame(commits)
df_commits.head()

Unnamed: 0,hash,message,author_name,author_email,author_date,author_tz,committer_name,committer_email,committer_date,committer_tz,...,num_deletes,num_inserts,net_lines,num_files,branches,files,parents,dmm_unit_size,dmm_unit_complexity,dmm_unit_interfacing
0,f0e639af5ffdc839aae8e65d19b5a9a1f0db634a,Initial commit,dotnet-bot,dotnet-bot@microsoft.com,2018-05-03 17:22:00-07:00,25200,Immo Landwerth,immol@microsoft.com,2018-05-03 17:22:00-07:00,25200,...,0,382168,382168,868,main,".gitattributes, .gitignore, BuildToolsVersion....",,0.399491,0.611602,0.630582
1,76cb2cdf5cc8b6c88ca44b8969153836e589df04,Get a working build (#1)\n\n* Set missing exec...,Sandy Armstrong,sanfordarmstrong@gmail.com,2018-05-04 12:47:21-07:00,25200,Eric Erhardt,eric.erhardt@microsoft.com,2018-05-04 14:47:21-05:00,18000,...,27,1749,1722,23,main,"Microsoft.ML.sln, build.sh, init-tools.sh, run...",f0e639af5ffdc839aae8e65d19b5a9a1f0db634a,,,
2,972f6232de173b5e294a34a847682e9b1e67d3af,Fixed the syntax of cited example. (#2),Zeeshan Ahmed,38438266+zeahmed@users.noreply.github.com,2018-05-04 14:06:13-07:00,25200,Eric Erhardt,eric.erhardt@microsoft.com,2018-05-04 16:06:13-05:00,18000,...,5,4,-1,1,main,README.md,76cb2cdf5cc8b6c88ca44b8969153836e589df04,,,
3,cde0d7d18ec9e93bde1d3a53c35f87430ac43fee,Add ML.NET Roadmap (#30)\n\n##Add Roadmap.md f...,Gleb K,glebk@microsoft.com,2018-05-05 01:11:31-07:00,25200,GitHub,noreply@github.com,2018-05-05 01:11:31-07:00,25200,...,1,128,127,3,main,"Microsoft.ML.sln, README.md, ROADMAP.md",972f6232de173b5e294a34a847682e9b1e67d3af,,,
4,979418886950e144b2cc561bdc5eb41d382cf829,Update contribution guide and issue/PR templates,Shauheen Zahirazami,shzahira@microsoft.com,2018-05-05 13:47:44-07:00,25200,Shauheen Zahirazami,shzahira@microsoft.com,2018-05-05 13:47:44-07:00,25200,...,0,53,53,3,main,"CONTRIBUTING.md, ISSUE_TEMPLATE.md, PULL_REQUE...",cde0d7d18ec9e93bde1d3a53c35f87430ac43fee,,,
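
In the sample rows above, an `author_tz` of 25200 is paired with a `-07:00` date and a `committer_tz` of 18000 with a `-05:00` date, which suggests the value is an offset in seconds west of UTC. Assuming that reading holds for the whole dataset, the sketch below converts it to a standard `datetime.timezone`; the function name is an assumption for illustration.

```python
from datetime import timedelta, timezone


def tz_from_git_offset(seconds_west_of_utc):
    """Convert a seconds-west-of-UTC value into a datetime.timezone."""
    # Positive values lie west of UTC, so the sign is flipped
    return timezone(timedelta(seconds=-seconds_west_of_utc))


print(tz_from_git_offset(25200))  # corresponds to the -07:00 rows
print(tz_from_git_offset(18000))  # corresponds to the -05:00 rows
```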


In [7]:
# Look at the trends in the OS-DMM metrics
df_commits[['dmm_unit_complexity', 'dmm_unit_interfacing', 'dmm_unit_size']].describe()

Unnamed: 0,dmm_unit_complexity,dmm_unit_interfacing,dmm_unit_size
count,1485.0,1491.0,1484.0
mean,0.678715,0.658743,0.400372
std,0.412242,0.403688,0.404226
min,0.0,0.0,0.0
25%,0.252525,0.283067,0.0
50%,0.975,0.892857,0.263397
75%,1.0,1.0,0.861607
max,1.0,1.0,1.0
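
The `count` row differs per column (1485, 1491, 1484), and the earlier preview shows empty DMM values for some commits, so not every commit receives a score. A rough sketch for measuring that coverage directly on the `commits` list follows. The `coverage` helper and the four-row `sample` are illustrative assumptions, not data from the repository.

```python
import math


def coverage(records, key):
    """Fraction of records whose value for `key` is present (not None or NaN)."""
    def present(value):
        if value is None:
            return False
        return not (isinstance(value, float) and math.isnan(value))

    if not records:
        return 0.0
    return sum(present(r.get(key)) for r in records) / len(records)


# Illustrative sample only; in the notebook, pass the `commits` list instead
sample = [
    {'dmm_unit_size': 0.4},
    {'dmm_unit_size': None},
    {'dmm_unit_size': float('nan')},
    {'dmm_unit_size': 0.5},
]
print(coverage(sample, 'dmm_unit_size'))
```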


## Part 3: Export to a CSV File
Because the commit extraction process takes a very long time, let's export the resulting data to a CSV file so we don't need to repeat it every time we analyze the data.

In [None]:
df_commits.to_csv('Commits.csv')
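
When the file is read back later, the index written by `to_csv` and the date columns need some care. The sketch below is one possible loader, assuming the column names used in this notebook; `load_commits` is a hypothetical helper. It normalizes the dates to UTC because the rows carry mixed offsets.

```python
import pandas as pd


def load_commits(csv_path):
    """Reload the exported commits, restoring the index and the date columns."""
    # The first CSV column is the DataFrame index written by to_csv
    df = pd.read_csv(csv_path, index_col=0)
    for column in ('author_date', 'committer_date'):
        # Dates were saved as text with differing UTC offsets; convert to UTC
        df[column] = pd.to_datetime(df[column], utc=True)
    return df


# df_commits = load_commits('Commits.csv')
```

Converting to UTC discards each commit's original offset, so keep the `author_tz` and `committer_tz` columns if local time matters for the analysis.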

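## Part 4: Breakdown by File
This section repeats the traversal but records one row per modified file rather than one row per commit, so commit-level values such as the message and line counts are repeated on each of that commit's rows.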

In [3]:
commits = []

for commit in repo.traverse_commits():
    hash = commit.hash
    try:
        for f in commit.modified_files:
            record = {
                'hash': hash,
                'message': commit.msg,
                'author_name': commit.author.name,
                'author_email': commit.author.email,
                'author_date': commit.author_date,
                'author_tz': commit.author_timezone,
                'committer_name': commit.committer.name,
                'committer_email': commit.committer.email,
                'committer_date': commit.committer_date,
                'committer_tz': commit.committer_timezone,
                'in_main': commit.in_main_branch,
                'is_merge': commit.merge,
                'num_deletes': commit.deletions,
                'num_inserts': commit.insertions,
                'net_lines': commit.insertions - commit.deletions,
                'num_files': commit.files,
                'branches': ', '.join(commit.branches), # Comma separated list of branches the commit is found in
                'filename': f.filename,
                'old_path': f.old_path,
                'new_path': f.new_path,
                'project_name': commit.project_name,
                'project_path': commit.project_path, 
                'parents': ', '.join(commit.parents), # Comma separated list of parents
            }
            # Omitted: modified_files (list), since each record describes a single modified file
            commits.append(record)
    except Exception:
        print('Problem reading commit ' + hash)
        continue        

Problem reading commit eae76959e6714af44caa212e102a5f06f0110e72
Problem reading commit df1c2af3369a5c87ea03df17d75d7e6fa730543f
Problem reading commit c3a20faa31c22eb85806d325dfe9f12d308c772e
Problem reading commit cb37c7e7f1e1b29b5608a2755db793c5435d10b1
Problem reading commit 2a927865769e10772a31af407f2a856fd6e4e523
Problem reading commit b2ac8e036e0fd932f18ca7f148367fc3d8a2c2a8
Problem reading commit a6024769c9da2ccac20531c36d8e137f3de64f6c
Problem reading commit 3b576fe058ed4f4331018bbc3eabc1ac26219644
Problem reading commit cc400493dc7934ca25c2720b46eeeef554a28749
Problem reading commit 8e5f7b42cd65660393b3ac59765ae166ee7ea4ad


In [4]:
import pandas as pd

# Translate this list of commits to a Pandas data frame, then export it to CSV for analysis
df_file_commits = pd.DataFrame(commits)

df_file_commits.to_csv('FileCommits.csv')

df_file_commits.head()

Unnamed: 0,hash,message,author_name,author_email,author_date,author_tz,committer_name,committer_email,committer_date,committer_tz,...,num_inserts,net_lines,num_files,branches,filename,old_path,new_path,project_name,project_path,parents
0,f0e639af5ffdc839aae8e65d19b5a9a1f0db634a,Initial commit,dotnet-bot,dotnet-bot@microsoft.com,2018-05-03 17:22:00-07:00,25200,Immo Landwerth,immol@microsoft.com,2018-05-03 17:22:00-07:00,25200,...,382168,382168,868,main,.gitattributes,,.gitattributes,machinelearning,C:\Users\Admin\AppData\Local\Temp\tmpvezg8ml9\...,
1,f0e639af5ffdc839aae8e65d19b5a9a1f0db634a,Initial commit,dotnet-bot,dotnet-bot@microsoft.com,2018-05-03 17:22:00-07:00,25200,Immo Landwerth,immol@microsoft.com,2018-05-03 17:22:00-07:00,25200,...,382168,382168,868,main,.gitignore,,.gitignore,machinelearning,C:\Users\Admin\AppData\Local\Temp\tmpvezg8ml9\...,
2,f0e639af5ffdc839aae8e65d19b5a9a1f0db634a,Initial commit,dotnet-bot,dotnet-bot@microsoft.com,2018-05-03 17:22:00-07:00,25200,Immo Landwerth,immol@microsoft.com,2018-05-03 17:22:00-07:00,25200,...,382168,382168,868,main,BuildToolsVersion.txt,,BuildToolsVersion.txt,machinelearning,C:\Users\Admin\AppData\Local\Temp\tmpvezg8ml9\...,
3,f0e639af5ffdc839aae8e65d19b5a9a1f0db634a,Initial commit,dotnet-bot,dotnet-bot@microsoft.com,2018-05-03 17:22:00-07:00,25200,Immo Landwerth,immol@microsoft.com,2018-05-03 17:22:00-07:00,25200,...,382168,382168,868,main,CONTRIBUTING.md,,CONTRIBUTING.md,machinelearning,C:\Users\Admin\AppData\Local\Temp\tmpvezg8ml9\...,
4,f0e639af5ffdc839aae8e65d19b5a9a1f0db634a,Initial commit,dotnet-bot,dotnet-bot@microsoft.com,2018-05-03 17:22:00-07:00,25200,Immo Landwerth,immol@microsoft.com,2018-05-03 17:22:00-07:00,25200,...,382168,382168,868,main,Directory.Build.props,,Directory.Build.props,machinelearning,C:\Users\Admin\AppData\Local\Temp\tmpvezg8ml9\...,
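
With one row per file, a natural next question is which files change most often. A minimal sketch using only the standard library is shown below. It assumes records shaped like the ones built in the loop above; the `most_changed_files` helper, the sample rows, and their `c1`/`c2` hashes are hypothetical.

```python
from collections import Counter


def most_changed_files(records, top_n=10):
    """Count how many distinct commits touched each path."""
    seen = set()
    counts = Counter()
    for record in records:
        # Deleted files have no new_path, so fall back to old_path
        path = record['new_path'] or record['old_path']
        key = (record['hash'], path)
        if key not in seen:
            seen.add(key)
            counts[path] += 1
    return counts.most_common(top_n)


# Illustrative rows only; in the notebook, pass the per-file `commits` list
sample = [
    {'hash': 'c1', 'old_path': None, 'new_path': 'README.md'},
    {'hash': 'c2', 'old_path': 'README.md', 'new_path': 'README.md'},
    {'hash': 'c2', 'old_path': 'build.sh', 'new_path': None},
]
print(most_changed_files(sample))
```

Renamed files are counted under their new path here, so a file's history is split across names unless renames are resolved separately.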
