# Git Data Extraction using Python

This notebook loads the commit history of any public Git repository on GitHub and exports it to CSV for later analysis.

## Prerequisites
The code assumes you have the following libraries installed:

- pydriller
- pandas
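
Before running the cells below, it can help to confirm that both libraries are importable. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of either library.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(["pydriller", "pandas"])
if missing:
    print("Install these first: pip install " + " ".join(missing))
else:
    print("All prerequisites are available")
```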

## Part 1: Pulling Commit Data from GitHub

We will use the PyDriller library to pull commit data from GitHub for a public repository and build a list of commits.

This can take a long time on repositories with many commits, because PyDriller clones the repository and then visits each commit in turn.
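
If the full history is more than you need, PyDriller's `Repository` accepts filters such as `since`, `only_in_branch`, and `only_no_merge` that narrow the set of commits visited. The sketch below only builds the filter arguments; `recent_history_filters` is a hypothetical helper, and the one-year window and the `main` branch are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

def recent_history_filters(days, branch="main", now=None):
    """Build keyword arguments that restrict a PyDriller traversal (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        "since": now - timedelta(days=days),  # only commits after this date
        "only_in_branch": branch,             # only commits in this branch
        "only_no_merge": True,                # skip merge commits
    }

filters = recent_history_filters(days=365)
print(filters)

# With PyDriller installed, pass them through when creating the repository:
# repo = Repository(path, **filters)
```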

In [1]:
# We need PyDriller to pull git repository information
from pydriller import Repository


# Using the Polyglot Notebooks (formerly .NET Interactive) public repository on GitHub
path = 'https://github.com/dotnet/interactive'
repo = Repository(path)

print('Using repository ' + path)

Using repository https://github.com/dotnet/interactive


In [2]:
def sanitize_message(msg):
    msg = msg.replace('\n', ' ')
    msg = msg.replace(',', '')
    msg = msg.replace('"', '')
    return msg
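
To see what the sanitizer does to a typical multi-line commit message, here is a small demonstration. The function is repeated so the snippet runs on its own, and the sample message is made up.

```python
def sanitize_message(msg):
    msg = msg.replace('\n', ' ')   # newlines become spaces
    msg = msg.replace(',', '')     # commas are dropped
    msg = msg.replace('"', '')     # double quotes are dropped
    return msg

raw = 'Fix "parser", again\n\nSee issue 12'
clean = sanitize_message(raw)
print(clean)
```

Each newline becomes its own space, so the blank line between the subject and the body leaves a double space in the result.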

In [3]:


# Loop over each PyDriller commit to transform it to a commit usable for analysis later
# NOTE: This can take a LONG time if there are many commits

commits = []
for commit in repo.traverse_commits():

    hash = commit.hash
    try:

        # Gather a list of files modified in the commit
        files = []
        for f in commit.modified_files:
            if f.new_path is not None:
                files.append(f.new_path) 

        # Sanitize the message to prevent it from confusing our resulting CSV
        msg = sanitize_message(commit.msg)

        # Optimization to prevent requesting same data twice
        author = commit.author
        committer = commit.committer
        inserts = commit.insertions
        deletions = commit.deletions

        # Capture information about the commit in object format so I can reference it later
        record = {
            'hash': hash,
            'message': msg,
            'author_name': author.name,
            'author_email': author.email,
            'author_date': commit.author_date,
            'author_tz': commit.author_timezone,
            'committer_name': committer.name,
            'committer_email': committer.email,
            'committer_date': commit.committer_date,
            'committer_tz': commit.committer_timezone,
            'in_main': commit.in_main_branch,
            'is_merge': commit.merge,
            'num_deletes': deletions,
            'num_inserts': inserts,
            'net_lines': inserts - deletions,
            'num_files': commit.files,
            'branches': ', '.join(commit.branches), # Comma separated list of branches the commit is found in
            'files': ', '.join(files), # Comma separated list of files the commit modifies
            # PyDriller Open Source Delta Maintainability Model (OS-DMM) stat. See https://pydriller.readthedocs.io/en/latest/deltamaintainability.html for metric definitions
            'dmm_unit_size': commit.dmm_unit_size,
            'dmm_unit_complexity': commit.dmm_unit_complexity,
            'dmm_unit_interfacing': commit.dmm_unit_interfacing,
        }
        # Omitted: modified_files (list), project_path, project_name
        commits.append(record)

    except Exception as er:
        print('Problem reading commit ' + hash)
        print(er)
        continue
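
The `new_path is not None` check in the loop matters because PyDriller reports `new_path` as `None` for a file that the commit deleted. Here is a sketch of that step in isolation, using `SimpleNamespace` stand-ins rather than real PyDriller objects; the file names are made up.

```python
from types import SimpleNamespace

# Stand-ins for PyDriller modified-file objects (illustrative data only)
modified_files = [
    SimpleNamespace(old_path=None, new_path='src/app.py'),        # added
    SimpleNamespace(old_path='README.md', new_path='README.md'),  # modified
    SimpleNamespace(old_path='old/util.py', new_path=None),       # deleted
]

# Same filter as the loop above, written as a list comprehension
files = [f.new_path for f in modified_files if f.new_path is not None]
print(', '.join(files))
```

Deleted files are dropped by this filter, so the `files` column can list fewer entries than `num_files` reports.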

## Part 2: Build a Pandas DataFrame
Now that we have a raw list of commits, let's convert it to a pandas DataFrame so we can check that the data looks roughly correct before exporting it.

In [4]:

import pandas as pd

# Translate the list of commits to a pandas DataFrame and preview the first rows
df_commits = pd.DataFrame(commits)
df_commits.head()

Unnamed: 0,hash,message,author_name,author_email,author_date,author_tz,committer_name,committer_email,committer_date,committer_tz,...,is_merge,num_deletes,num_inserts,net_lines,num_files,branches,files,dmm_unit_size,dmm_unit_complexity,dmm_unit_interfacing
0,7894423f9bac837f4c5fb2c9a0f4284da38f2069,Initial commit,Rich Lander,rlander@microsoft.com,2017-09-21 16:11:36-07:00,25200,GitHub,noreply@github.com,2017-09-21 16:11:36-07:00,25200,...,False,0,21,21,1,main,LICENSE,,,
1,42dd1a3280da0bf901058cd7812faa1355eaae29,Create README.md,Piotr Puszkiewicz,piotrp@microsoft.com,2017-09-21 16:22:28-07:00,25200,GitHub,noreply@github.com,2017-09-21 16:22:28-07:00,25200,...,False,0,2,2,1,main,README.md,,,
2,25139110fc53537334c2f2a745246b4fcf8203fb,Updated the readme,Maria Naggaga Nakanwagi,mnaggaga@microsoft.com,2017-09-22 18:47:48-07:00,25200,Maria Naggaga Nakanwagi,mnaggaga@microsoft.com,2017-09-22 18:47:48-07:00,25200,...,False,1,10,9,1,main,README.md,,,
3,3a88efed0961f689e692eb3d52b3d9d3ddca903b,Update README.md,LadyNaggaga,maria.naggaga@live.ca,2017-09-22 18:50:32-07:00,25200,GitHub,noreply@github.com,2017-09-22 18:50:32-07:00,25200,...,False,2,15,13,1,main,README.md,,,
4,0278d89a6150858193cee8e6d1ac0ce159ac4ad0,Update README.md,LadyNaggaga,maria.naggaga@live.ca,2017-09-22 18:51:31-07:00,25200,GitHub,noreply@github.com,2017-09-22 18:51:31-07:00,25200,...,False,1,1,0,1,main,README.md,,,
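
Beyond eyeballing the preview, a few automated checks can catch obvious problems. The sketch below works on plain dictionaries like the records built in Part 1 and uses only the standard library; `check_records` is a hypothetical helper, and the 40-character hex check assumes SHA-1 commit hashes.

```python
import re

# Two records copied from the preview above, trimmed to the fields being checked
sample = [
    {'hash': '7894423f9bac837f4c5fb2c9a0f4284da38f2069',
     'num_inserts': 21, 'num_deletes': 0, 'net_lines': 21},
    {'hash': '25139110fc53537334c2f2a745246b4fcf8203fb',
     'num_inserts': 10, 'num_deletes': 1, 'net_lines': 9},
]

def check_records(records):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    hashes = [r['hash'] for r in records]
    if len(set(hashes)) != len(hashes):
        problems.append('duplicate commit hashes')
    for r in records:
        # Assumes SHA-1 hashes: exactly 40 lowercase hex characters
        if not re.fullmatch(r'[0-9a-f]{40}', r['hash']):
            problems.append('malformed hash: ' + r['hash'])
        if r['net_lines'] != r['num_inserts'] - r['num_deletes']:
            problems.append('net_lines mismatch in ' + r['hash'])
    return problems

print(check_records(sample))
```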


In [5]:
# Look at the trends in the OS-DMM metrics
df_commits[['dmm_unit_complexity', 'dmm_unit_interfacing', 'dmm_unit_size']].describe()

Unnamed: 0,dmm_unit_complexity,dmm_unit_interfacing,dmm_unit_size
count,1935.0,1936.0,1945.0
mean,0.658387,0.678529,0.484984
std,0.440159,0.431222,0.434769
min,0.0,0.0,0.0
25%,0.0,0.058365,0.0
50%,1.0,1.0,0.454545
75%,1.0,1.0,1.0
max,1.0,1.0,1.0
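
The `count` row differs between the three columns because `describe()` skips missing values, and the DMM fields are empty for some commits (the first five commits in the preview above have none). Here is a standard-library sketch of the same idea, using hypothetical values rather than data from the repository.

```python
from statistics import mean

# Hypothetical DMM values for five commits; None marks a commit with no value
dmm_unit_size = [None, 1.0, 0.0, None, 0.5]

# Only the values that are present contribute to count and mean
present = [v for v in dmm_unit_size if v is not None]
print('count:', len(present))
print('mean:', mean(present))
```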


## Part 3: Export to a CSV File
Because the commit extraction process takes a very long time, let's export the resulting data to a CSV file so we don't need to repeat it every time we analyze the data.

In [6]:
df_commits.to_csv('Commits.csv')
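
One thing to keep in mind when loading the file again later is that CSV stores only text, so numbers and dates need to be converted back. The sketch below illustrates this with the standard-library `csv` module and an in-memory buffer instead of pandas and a real file; the record is trimmed from the preview in Part 2.

```python
import csv
import io
from datetime import datetime

record = {'hash': '7894423f9bac837f4c5fb2c9a0f4284da38f2069',
          'author_date': '2017-09-21 16:11:36-07:00',
          'num_inserts': 21}

# Write to an in-memory buffer so the sketch leaves nothing on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

# Read it back: every value is now a string
buffer.seek(0)
row = next(csv.DictReader(buffer))
inserts = int(row['num_inserts'])                       # convert text to int
authored = datetime.fromisoformat(row['author_date'])   # convert text to datetime
print(type(row['num_inserts']).__name__, inserts, authored.isoformat())
```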

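## Part 4: Breakdown by File
Parts 1 to 3 produced one row per commit. Here we walk the commit history a second time and record one row per file modified in each commit, repeating the commit-level fields on every row.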

In [7]:
commits = []

for commit in repo.traverse_commits():
    hash = commit.hash
    try:
        # Sanitize the message to prevent it from confusing our resulting CSV
        msg = sanitize_message(commit.msg)

        # Optimization to prevent requesting same data twice
        author = commit.author
        committer = commit.committer
        inserts = commit.insertions
        deletions = commit.deletions
        author_date = commit.author_date
        author_timezone = commit.author_timezone
        committer_date = commit.committer_date
        committer_timezone = commit.committer_timezone
        in_main_branch = commit.in_main_branch
        is_merge = commit.merge
        branches = ', '.join(commit.branches) # Comma separated list of branches the commit is found in
        project_name = commit.project_name
        project_path = commit.project_path

        for f in commit.modified_files:
            record = {
                'hash': hash,
                'message': msg,
                'author_name': author.name,
                'author_email': author.email,
                'author_date': author_date,
                'author_tz': author_timezone,
                'committer_name': committer.name,
                'committer_email': committer.email,
                'committer_date': committer_date,
                'committer_tz': committer_timezone,
                'in_main': in_main_branch,
                'is_merge': is_merge,
                'num_deletes': deletions,
                'num_inserts': inserts,
                'net_lines': inserts - deletions,
                'branches': branches,
                'filename': f.filename,
                'old_path': f.old_path,
                'new_path': f.new_path,
                'project_name': project_name,
                'project_path': project_path, 
            }
            # One record per modified file; commit-level fields repeat on every row
            commits.append(record)
    except Exception as er:
        print('Problem reading commit ' + hash)
        print(er)
        continue        

In [8]:
import pandas as pd

# Translate this list of commits to a Pandas data frame, then export it to CSV for analysis
df_file_commits = pd.DataFrame(commits)

df_file_commits.to_csv('FileCommits.csv')

df_file_commits.head()

Unnamed: 0,hash,message,author_name,author_email,author_date,author_tz,committer_name,committer_email,committer_date,committer_tz,...,is_merge,num_deletes,num_inserts,net_lines,branches,filename,old_path,new_path,project_name,project_path
0,7894423f9bac837f4c5fb2c9a0f4284da38f2069,Initial commit,Rich Lander,rlander@microsoft.com,2017-09-21 16:11:36-07:00,25200,GitHub,noreply@github.com,2017-09-21 16:11:36-07:00,25200,...,False,0,21,21,main,LICENSE,,LICENSE,interactive,C:\Users\Admin\AppData\Local\Temp\tmpxtrxdoy6\...
1,42dd1a3280da0bf901058cd7812faa1355eaae29,Create README.md,Piotr Puszkiewicz,piotrp@microsoft.com,2017-09-21 16:22:28-07:00,25200,GitHub,noreply@github.com,2017-09-21 16:22:28-07:00,25200,...,False,0,2,2,main,README.md,,README.md,interactive,C:\Users\Admin\AppData\Local\Temp\tmpxtrxdoy6\...
2,25139110fc53537334c2f2a745246b4fcf8203fb,Updated the readme,Maria Naggaga Nakanwagi,mnaggaga@microsoft.com,2017-09-22 18:47:48-07:00,25200,Maria Naggaga Nakanwagi,mnaggaga@microsoft.com,2017-09-22 18:47:48-07:00,25200,...,False,1,10,9,main,README.md,README.md,README.md,interactive,C:\Users\Admin\AppData\Local\Temp\tmpxtrxdoy6\...
3,3a88efed0961f689e692eb3d52b3d9d3ddca903b,Update README.md,LadyNaggaga,maria.naggaga@live.ca,2017-09-22 18:50:32-07:00,25200,GitHub,noreply@github.com,2017-09-22 18:50:32-07:00,25200,...,False,2,15,13,main,README.md,README.md,README.md,interactive,C:\Users\Admin\AppData\Local\Temp\tmpxtrxdoy6\...
4,0278d89a6150858193cee8e6d1ac0ce159ac4ad0,Update README.md,LadyNaggaga,maria.naggaga@live.ca,2017-09-22 18:51:31-07:00,25200,GitHub,noreply@github.com,2017-09-22 18:51:31-07:00,25200,...,False,1,1,0,main,README.md,README.md,README.md,interactive,C:\Users\Admin\AppData\Local\Temp\tmpxtrxdoy6\...
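
The `old_path` and `new_path` columns are enough to tell how each file was changed. The sketch below classifies an in-memory record, where a missing path is `None`; `change_kind` is a hypothetical helper. PyDriller also exposes a `change_type` attribute on each modified file, which this notebook does not record.

```python
def change_kind(old_path, new_path):
    """Classify a per-file record from its paths (hypothetical helper)."""
    if old_path is None:
        return 'added'      # the file did not exist before the commit
    if new_path is None:
        return 'deleted'    # the file does not exist after the commit
    if old_path != new_path:
        return 'renamed'    # the file moved to a different path
    return 'modified'

# Paths taken from the first and third rows of the preview above
print(change_kind(None, 'LICENSE'))
print(change_kind('README.md', 'README.md'))
```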
