# GitStractor Repository Analysis Data Extraction Tool

Created by [Matt Eland](https://MattEland.dev)

This notebook helps you extract historical Git commit data and file statistics from any local Git repository.

## Pre-requisites
The code assumes you have the following libraries installed:

- pydriller
- pandas

### Setup: Output file paths and repository

In [1]:
# This repository path should be a LOCAL path already cloned on disk
repository_path = 'C:\\Dev\\GitStractor'
repository_branch = 'main'

# Declare our paths of interest
commit_data_path = 'Commits.csv'
file_commit_data_path = 'FileCommits.csv'
file_size_data_path = 'FileSizes.csv'
file_data_path = 'FileData.csv'
author_data_path = 'Authors.csv'
merged_data_path = 'MergedData.csv'
merged_file_data_path = 'MergedFileData.csv'

# Number of worker threads to use when reading commits
num_threads = 16

### Configure Authors

People sometimes commit under different names or email addresses, so we configure aliases that map each variation back to a single author. Additionally, the final project's rubric required a map view. The best way to support that with this dataset is to let the user specify the country and city of each author.

In [2]:
author_info = [
    { 
        'name': 'Matt Eland', 
        'email': 'matt.eland@gmail.com', 
        'aliases': ['matt@mattondatascience.com']
    },
]
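
To illustrate how an alias list like this could be applied, here is a minimal sketch that maps any known email address back to its canonical author. It assumes aliases are email addresses (as in the example entry) and that matching is case-insensitive. The helper names `build_alias_lookup` and `resolve_author` are hypothetical; the actual resolution logic lives inside `GitAnalyzer` and is not shown in this notebook.

```python
def build_alias_lookup(author_info):
    """Map every known email (primary or alias, lowercased) to its author entry."""
    lookup = {}
    for author in author_info:
        lookup[author['email'].lower()] = author
        for alias in author.get('aliases', []):
            lookup[alias.lower()] = author
    return lookup


def resolve_author(name, email, lookup):
    """Return the canonical (name, email) pair, or the inputs unchanged if unknown."""
    author = lookup.get(email.lower())
    if author is None:
        return name, email
    return author['name'], author['email']


# Same shape as the author_info list configured above
example_authors = [
    {
        'name': 'Matt Eland',
        'email': 'matt.eland@gmail.com',
        'aliases': ['matt@mattondatascience.com'],
    },
]

lookup = build_alias_lookup(example_authors)
print(resolve_author('matt', 'Matt@MattOnDataScience.com', lookup))
# → ('Matt Eland', 'matt.eland@gmail.com')
```

Commits from addresses that are not listed pass through unchanged, so unconfigured authors still appear in the output under the name and email they committed with.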

## Part 1: Pulling Commit Data

We will use the PyDriller library to pull commit data from a local git repository and build a list of commits.

This process can take a very long time depending on the size of the repository: roughly 0.25 to 1 second *per commit*, depending on your machine and the value of `num_threads`.
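
To get a feel for what that per-commit range means on a larger repository, the arithmetic can be sketched as follows. The function name `estimate_runtime_seconds` and the 10,000-commit figure are illustrative assumptions, not measurements.

```python
def estimate_runtime_seconds(commit_count, low=0.25, high=1.0):
    """Return (best case, worst case) runtime in seconds for a given commit count."""
    return commit_count * low, commit_count * high


# Hypothetical repository with 10,000 commits
low_seconds, high_seconds = estimate_runtime_seconds(10_000)
print(f'{low_seconds / 60:.0f} to {high_seconds / 60:.0f} minutes')
# → 42 to 167 minutes
```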

In [3]:
from Scripts import GitAnalyzer 

# This will pull all repository data and write to Commits.csv and FileCommits.csv
GitAnalyzer.analyze_repository(repository_path, 
                               num_threads=num_threads,
                               author_info=author_info,
                               commits_file_path=commit_data_path,
                               file_commits_file_path=file_commit_data_path,
                               branch=repository_branch)

Analyzing Git Repository at C:\Dev\GitStractor
Fetching commits. This can take a long time...
Read 2 commits and 11 file commits
Saved to Commits.csv
Saved to FileCommits.csv
Repository Data Pulled Successfully
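
The internals of `GitAnalyzer` are not shown in this notebook. As a rough sketch of what commit traversal with PyDriller can look like, the code below flattens each commit into a dictionary that could become one CSV row. The function names `commit_to_row` and `read_commit_rows` and the chosen columns are assumptions for illustration. The PyDriller calls used (`Repository`, `only_in_branch`, `num_workers`, `traverse_commits`, and the commit attributes) follow the PyDriller 2.x API.

```python
from datetime import datetime
from types import SimpleNamespace


def commit_to_row(commit):
    """Flatten one commit-like object into a dict suitable for a CSV row."""
    return {
        'hash': commit.hash,
        'author_name': commit.author.name,
        'author_email': commit.author.email,
        'author_date': commit.author_date,
        'message': commit.msg,
        'files_changed': len(commit.modified_files),
    }


def read_commit_rows(repository_path, branch, num_threads=1):
    """Traverse one branch of a local repository and return one row per commit."""
    # Imported lazily so commit_to_row can be tried without PyDriller installed
    from pydriller import Repository

    repo = Repository(repository_path,
                      only_in_branch=branch,
                      num_workers=num_threads)
    return [commit_to_row(commit) for commit in repo.traverse_commits()]


# Stand-in object with the same attribute names as a PyDriller commit,
# used here only to show the shape of the resulting row
stand_in_commit = SimpleNamespace(
    hash='abc123',
    author=SimpleNamespace(name='Matt Eland', email='matt.eland@gmail.com'),
    author_date=datetime(2023, 1, 1),
    msg='Initial commit',
    modified_files=['README.md', 'Scripts/GitAnalyzer.py'],
)
print(commit_to_row(stand_in_commit))
```

Note that PyDriller's documentation states that commit order is not preserved when `num_workers` is greater than 1, so any analysis that depends on ordering should sort by date afterwards.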


## Part 2: Building File Size Information

Next, we'll walk the repository directory, inspect each source file, and write the results to a single CSV file (`FileSizes.csv`).

In [4]:
from Scripts import FileAnalyzer

FileAnalyzer.build_file_sizes(repository_path, file_size_data_path)

Reading file metrics from C:\Dev\GitStractor
5 source files read from C:\Dev\GitStractor
File size information saved to FileSizes.csv
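
A directory walk of this kind can be sketched with the standard library alone. The version below is an illustration, not the `FileAnalyzer` implementation: it assumes every file outside `.git` counts as a source file and records a relative path, a byte count, and a line count. The function names and column names are hypothetical.

```python
import csv
import os


def collect_file_sizes(root, ignored_dirs=('.git',)):
    """Walk root and return one dict per file with its path, byte size, and line count."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d not in ignored_dirs]
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # Binary mode avoids decoding errors on non-text files
            with open(full_path, 'rb') as f:
                line_count = sum(1 for _ in f)
            rows.append({
                'path': os.path.relpath(full_path, root).replace(os.sep, '/'),
                'bytes': os.path.getsize(full_path),
                'lines': line_count,
            })
    return rows


def write_file_sizes(rows, output_path):
    """Write the collected rows to a CSV file with a header row."""
    with open(output_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['path', 'bytes', 'lines'])
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_file_sizes(collect_file_sizes(repository_path), file_size_data_path)` would produce a file comparable in spirit to `FileSizes.csv`, though the real column set may differ.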


## Part 3: Data Aggregation

Next, we'll combine the data into two files, `MergedFileData.csv` and `FileData.csv`, so that more in-depth analysis doesn't require managing the joins downstream.

**NOTE:** Nuances of commit history are currently lost for files that have been renamed.

In [5]:
from Scripts import GitDataMerger

GitDataMerger.generate_merged_file_data(file_commit_data_path, 
                                        file_size_data_path, 
                                        merged_file_data_path,
                                        file_data_path)

Loading file commit data from FileCommits.csv
Loading file size data from FileSizes.csv
Merged file data created in MergedFileData.csv
Aggregating file data
Writing file data to FileData.csv
File generation completed
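
The merge and aggregation steps can be pictured with a small pandas sketch. The column names and sample rows below are invented for illustration, and the sketch assumes the two datasets are joined on file path; the real `GitDataMerger` logic and CSV schemas may differ. Under that assumption, it also shows how a renamed file can lose its history: commits recorded under the old path find no matching size row, and the new path has no earlier commits attached.

```python
import pandas as pd

# Hypothetical per-file commit records (one row per file touched in a commit)
file_commits = pd.DataFrame({
    'file_path': ['src/app.py', 'src/app.py', 'README.md', 'src/old_name.py'],
    'commit_hash': ['a1', 'b2', 'a1', 'a1'],
    'lines_added': [10, 5, 3, 7],
})

# Hypothetical current file sizes; old_name.py now exists as new_name.py
file_sizes = pd.DataFrame({
    'file_path': ['src/app.py', 'README.md', 'src/new_name.py'],
    'bytes': [420, 90, 150],
})

# Left join keeps every file commit and attaches size data where the path matches
merged = file_commits.merge(file_sizes, on='file_path', how='left')

# One summary row per file
file_data = (
    merged.groupby('file_path', as_index=False)
          .agg(commit_count=('commit_hash', 'nunique'),
               total_lines_added=('lines_added', 'sum'),
               bytes=('bytes', 'first'))
)

print(merged)
print(file_data)
```

In the merged output, `src/old_name.py` has no size value and `src/new_name.py` does not appear at all, which is the kind of gap the note above describes.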


## Ready for Analysis
Next up is the actual data analysis workflow.

In [6]:
print('Analysis complete. Data files are ready for import')

Analysis complete. Data files are ready for import
