# Git Data Extraction using Python

This notebook helps you load historical Git data from any public repository on GitHub.

## Prerequisites
The code assumes you have the following libraries installed:

- pydriller
- pandas

## Part 1: Pulling Commit Data from GitHub

We will use the PyDriller library to pull commit data from GitHub for a public repository and build a list of commits.

This process can take a very long time for large repositories, because PyDriller first clones the remote repository and then walks its history one commit at a time.
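One way to shorten the run is to narrow the traversal before it starts. The sketch below assumes PyDriller 2.x, whose `Repository` constructor accepts filters such as `since`, `to`, `only_in_branch`, and `only_no_merge`. The helper names `build_filters` and `open_repository` and the cutoff date are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def build_filters(since=None, to=None, branch=None, skip_merges=False):
    """Collect only the filters that were actually requested (hypothetical helper)."""
    filters = {}
    if since is not None:
        filters['since'] = since            # only commits after this date
    if to is not None:
        filters['to'] = to                  # only commits up to this date
    if branch is not None:
        filters['only_in_branch'] = branch  # only commits reachable from this branch
    if skip_merges:
        filters['only_no_merge'] = True     # leave out merge commits
    return filters


def open_repository(path, **filters):
    """Create a PyDriller Repository with the given filters (hypothetical helper)."""
    # Imported here so build_filters can be tried without PyDriller installed
    from pydriller import Repository
    return Repository(path, **filters)


# Example: restrict the analysis to non-merge commits from 2022 onward (assumed cutoff)
filters = build_filters(since=datetime(2022, 1, 1), skip_merges=True)
print(filters)
```

Passing the result as `open_repository(path, **filters)` would then yield a repository object whose `traverse_commits()` visits only the matching commits.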

In [3]:
# We need PyDriller to pull git repository information
from pydriller import Repository


# Using the ML.NET public repository on GitHub
account = 'dotnet'
repository = 'machinelearning'
path = 'https://github.com/' + account + '/' + repository
repo = Repository(path)

print('Using repository ' + path)

Using repository https://github.com/dotnet/machinelearning


In [5]:


# Loop over each PyDriller commit to transform it to a commit usable for analysis later
# NOTE: This can take a LONG time if there are many commits

commits = []
for commit in repo.traverse_commits():

    commit_hash = commit.hash  # avoid shadowing Python's built-in hash()

    # Gather a list of files modified in the commit
    files = []
    try:
        for f in commit.modified_files:
            if f.new_path is not None:
                files.append(f.new_path)
    except Exception:
        print('Could not read files for commit ' + commit_hash)
        continue

    # Capture information about the commit as a dict so it can be referenced later
    record = {
        'hash': commit_hash,
        'message': commit.msg,
        'author_name': commit.author.name,
        'author_email': commit.author.email,
        'author_date': commit.author_date,
        'author_tz': commit.author_timezone,
        'committer_name': commit.committer.name,
        'committer_email': commit.committer.email,
        'committer_date': commit.committer_date,
        'committer_tz': commit.committer_timezone,
        'in_main': commit.in_main_branch,
        'is_merge': commit.merge,
        'num_deletes': commit.deletions,
        'num_inserts': commit.insertions,
        'net_lines': commit.insertions - commit.deletions,
        'num_files': commit.files,
        'branches': ', '.join(commit.branches), # Comma separated list of branches the commit is found in
        'files': ', '.join(files), # Comma separated list of files the commit modifies
        'parents': ', '.join(commit.parents), # Comma separated list of parents
        # PyDriller Open Source Delta Maintainability Model (OS-DMM) stat. See https://pydriller.readthedocs.io/en/latest/deltamaintainability.html for metric definitions
        'dmm_unit_size': commit.dmm_unit_size,
        'dmm_unit_complexity': commit.dmm_unit_complexity,
        'dmm_unit_interfacing': commit.dmm_unit_interfacing,
    }
    # Omitted: modified_files (list), project_path, project_name
    commits.append(record)

ValueError: SHA b'1ecc365249e5cac5e72c66317a141298dc52f6e3' could not be resolved, git returned: b'1ecc365249e5cac5e72c66317a141298dc52f6e3 missing'
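The error above stops the whole loop, because only the `modified_files` lookup is wrapped in `try`/`except`. A minimal sketch of a more tolerant loop follows, assuming the error is raised while reading a commit's properties rather than by the traversal itself. The names `collect_records` and `to_record` are hypothetical, and the fake commits stand in for PyDriller objects so the example runs offline.

```python
from types import SimpleNamespace


def collect_records(commit_iter, to_record):
    """Build one record per commit, skipping commits that cannot be read."""
    records, skipped = [], []
    for commit in commit_iter:
        try:
            records.append(to_record(commit))
        except Exception as error:  # broad on purpose: one bad commit should not end the run
            skipped.append((getattr(commit, 'hash', '<unknown>'), str(error)))
    return records, skipped


def to_record(commit):
    """Stand-in for the record-building code in the loop above."""
    if commit.broken:
        raise ValueError('SHA could not be resolved')
    return {'hash': commit.hash, 'message': commit.msg}


# Fake commits used only for this demonstration
fake_commits = [
    SimpleNamespace(hash='a1', msg='Initial commit', broken=False),
    SimpleNamespace(hash='b2', msg='Unreadable commit', broken=True),
    SimpleNamespace(hash='c3', msg='Fix typo', broken=False),
]

records, skipped = collect_records(fake_commits, to_record)
print(len(records), 'records built;', len(skipped), 'skipped')
```

Keeping the skipped hashes in a separate list makes it possible to report how much of the history is missing from the exported data.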

## Part 2: Build a Pandas DataFrame
Now that we have a raw list of commits, let's convert it to a pandas DataFrame so we can check that the data looks roughly correct before exporting it.

In [None]:

import pandas as pd

# Convert the list of commit records to a pandas DataFrame and preview the first rows
df_commits = pd.DataFrame(commits)
df_commits.head()
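Beyond eyeballing `head()`, a few automated checks can confirm that the data is roughly correct. The sketch below uses a small hand-made sample with the same column names as the records above. The sample values and the particular checks are assumptions chosen for illustration, not a complete validation suite.

```python
import pandas as pd

# Hand-made sample rows that mimic the structure of the commit records
sample_commits = [
    {'hash': 'a1', 'num_inserts': 10, 'num_deletes': 4, 'net_lines': 6, 'num_files': 2},
    {'hash': 'b2', 'num_inserts': 0, 'num_deletes': 3, 'net_lines': -3, 'num_files': 1},
]
df_sample = pd.DataFrame(sample_commits)

# Each check evaluates to True when the data looks sane
checks = {
    'has_rows': len(df_sample) > 0,
    'no_missing_hashes': df_sample['hash'].notna().all(),
    'unique_hashes': df_sample['hash'].is_unique,
    'net_lines_consistent': (
        df_sample['net_lines'] == df_sample['num_inserts'] - df_sample['num_deletes']
    ).all(),
}

for name, passed in checks.items():
    print(name, 'OK' if passed else 'FAILED')
```

The same dictionary of checks could be evaluated against `df_commits` once the extraction has finished.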

## Part 3: Export to a CSV File
Because the commit extraction process takes a very long time, let's export the resulting data to a CSV file so we don't need to repeat the extraction every time we want to analyze the data.

In [None]:
# Writes e.g. machinelearning_Commits.csv to the current working directory
df_commits.to_csv(repository + '_Commits.csv')
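When the file is loaded again later, the dates come back as plain strings and the file list as a single comma-separated string. The following sketch shows one way to restore them, assuming the CSV was written with the default index column as above. The sample rows, the temporary file name, and the `file_list` column are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Sample rows with dates formatted the way timezone-aware datetimes appear in a CSV
sample = pd.DataFrame([
    {'hash': 'a1', 'author_date': '2022-01-01 10:00:00+02:00', 'files': 'a.py, b.py'},
    {'hash': 'b2', 'author_date': '2022-01-02 09:30:00-05:00', 'files': 'c.py'},
])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'sample_Commits.csv')
    sample.to_csv(csv_path)                      # same call style as the export above
    loaded = pd.read_csv(csv_path, index_col=0)  # first column holds the saved index

# Mixed UTC offsets are normalized to UTC so the column gets a single datetime dtype
loaded['author_date'] = pd.to_datetime(loaded['author_date'], utc=True)

# Turn the comma-separated file string back into a Python list
loaded['file_list'] = loaded['files'].str.split(', ')

print(loaded[['hash', 'author_date', 'file_list']])
```

Normalizing to UTC loses the original offset, which is why the records above also keep the author and committer timezones in separate columns.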