# Git Repository Analysis Data Extraction

This notebook extracts commit history and file statistics from any local git repository and saves them as CSV files for later analysis.

## Prerequisites
The code assumes you have the following libraries installed:

- pydriller
- pandas
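
Before running the longer steps, it can help to confirm that both libraries are importable. The following is a minimal sketch; `missing_packages` is a hypothetical helper and is not part of this notebook's `Scripts` package.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# The two libraries this notebook lists as prerequisites
missing = missing_packages(["pydriller", "pandas"])
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All prerequisites are installed.")
```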

### Setup: Output file paths and repository

In [None]:
# This path should point to a git repository that is already cloned to LOCAL disk
repository_path = 'D:\\OneDrive\\Documents\\FU\\DATA605\\FinalProject'

# Declare our paths of interest
commit_data_path = 'Commits.csv'
file_commit_data_path = 'FileCommits.csv'
file_size_data_path = 'FileSizes.csv'
author_data_path = 'Authors.csv'
merged_data_path = 'MergedFileData.csv'
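
Since the later steps are slow, a quick sanity check on the repository path can catch a mistyped location early. This is a sketch only: `looks_like_git_repo` is a hypothetical helper, and it assumes that a clone is recognizable by a `.git` entry at its root.

```python
import tempfile
from pathlib import Path


def looks_like_git_repo(path):
    """Return True if `path` is a directory that contains a `.git` entry."""
    root = Path(path)
    return root.is_dir() and (root / ".git").exists()


# Demonstrate on a scratch directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as scratch:
    before_init = looks_like_git_repo(scratch)  # no .git entry yet
    (Path(scratch) / ".git").mkdir()
    after_init = looks_like_git_repo(scratch)   # .git entry now present
```

In the notebook, the same check could be applied as `looks_like_git_repo(repository_path)` before calling any of the analysis steps.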

## Part 1: Pulling Commit Data from the Git Repository

We will use the PyDriller library to read the commit history and build a list of commits. PyDriller can mine either a remote repository URL or a local clone; this notebook expects the local clone set in `repository_path`.

This step walks the full history, so it can take a very long time on large repositories.

In [3]:
from Scripts import GitAnalyzer 

# This will pull all repository data and write to Commits.csv and FileCommits.csv
GitAnalyzer.analyze_repository(repository_path, commit_data_path, file_commit_data_path)

Analyzing Git Repository at D:\OneDrive\Documents\FU\DATA605\FinalProject
Fetching commits. This will take a long time...
Saving Commits
Saved to Commits.csv
Fetching file commits. This will take a very long time...
Saving Commits
Saved to FileCommits.csv
Repository Data Pulled Successfully
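
The internals of `GitAnalyzer` are not shown here, so the following is only a sketch of how a commit traversal with PyDriller might look, not the notebook's actual implementation. The helper names and the CSV column names are assumptions, and the standard `csv` module stands in for whatever writer the script really uses.

```python
import csv

# Assumed column names; the real Commits.csv layout may differ
COMMIT_FIELDS = ["hash", "author", "author_date", "message"]


def commit_to_row(commit):
    """Flatten one PyDriller commit object into a plain dict."""
    return {
        "hash": commit.hash,
        "author": commit.author.name,
        "author_date": commit.author_date.isoformat(),
        "message": commit.msg,
    }


def write_commits(repo_path, out_path):
    """Traverse every commit in the repository and write one CSV row per commit."""
    # Imported lazily so the helpers above can be used without PyDriller installed
    from pydriller import Repository

    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COMMIT_FIELDS)
        writer.writeheader()
        for commit in Repository(repo_path).traverse_commits():
            writer.writerow(commit_to_row(commit))
```

A file-level variant could loop over `commit.modified_files` inside the same traversal and record fields such as `filename`, `added_lines`, and `deleted_lines` for each entry, which is presumably why that pass takes longer.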


## Part 2: Building File Size Information

Next, we'll walk the repository directory, read each source file, and write the per-file data to FileSizes.csv.

In [4]:
from Scripts import FileAnalyzer

FileAnalyzer.build_file_sizes(repository_path, file_size_data_path)

Reading file metrics from D:\OneDrive\Documents\FU\DATA605\FinalProject
4 source files read from D:\OneDrive\Documents\FU\DATA605\FinalProject
File size information saved to FileSizes.csv
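
As an illustration of the directory walk, here is a minimal sketch. It is not the `FileAnalyzer` implementation: the extension list, the `collect_file_sizes` name, and the `path`, `bytes`, and `lines` columns are all assumptions made for the example.

```python
import os
import tempfile

# Assumed set of extensions that count as "source files"
SOURCE_EXTENSIONS = {".py", ".cs", ".js", ".java"}


def collect_file_sizes(root, extensions=SOURCE_EXTENSIONS):
    """Walk `root` and return one dict per source file with its size and line count."""
    rows = []
    for folder, dirnames, filenames in os.walk(root):
        # Prune git's internal folder in place so os.walk does not descend into it
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() not in extensions:
                continue
            full_path = os.path.join(folder, name)
            with open(full_path, "rb") as handle:
                line_count = sum(1 for _ in handle)
            rows.append({
                "path": os.path.relpath(full_path, root),
                "bytes": os.path.getsize(full_path),
                "lines": line_count,
            })
    return rows


# Small demonstration on a scratch directory
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "a.py"), "wb") as handle:
        handle.write(b"x = 1\ny = 2\n")
    with open(os.path.join(scratch, "notes.txt"), "wb") as handle:
        handle.write(b"not a source file\n")
    demo_rows = collect_file_sizes(scratch)
```

The resulting list of dicts could then be written out with `pandas.DataFrame(rows).to_csv(file_size_data_path, index=False)`.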


## Part 3: Building a List of Authors

Next, we'll determine the unique authors of the code and save them to a separate file called Authors.csv.

In [1]:
from Scripts import GitAuthors

GitAuthors.identify_authors(commit_data_path, author_data_path)

Reading commit data from Commits.csv
Saved author information to Authors.csv
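
One way to derive unique authors from commit data with pandas is a group-by on the author column. The sketch below assumes `hash` and `author` columns, which may not match the real Commits.csv, and `unique_authors` is a hypothetical helper rather than the `GitAuthors` code.

```python
import pandas as pd


def unique_authors(commits):
    """Return one row per distinct author, with that author's commit count."""
    return (
        commits.groupby("author", as_index=False)
        .agg(commit_count=("hash", "count"))
        .sort_values("commit_count", ascending=False, ignore_index=True)
    )


# Tiny in-memory example in place of reading Commits.csv
demo_commits = pd.DataFrame({
    "hash": ["a1", "b2", "c3"],
    "author": ["Ann", "Bob", "Ann"],
})
demo_authors = unique_authors(demo_commits)
```

With real data, the input would come from `pd.read_csv(commit_data_path)` and the result would be saved with `to_csv(author_data_path, index=False)`.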


## Part 4: Merging Data Together

Finally, we'll merge the file commit data and the file size data into a single MergedFileData.csv file, so that more in-depth analysis can run against one DataFrame.

In [None]:
from Scripts import GitDataMerger

GitDataMerger.generate_merged_data(file_commit_data_path, file_size_data_path, merged_data_path)
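
The merge itself is likely a join between the two CSV files on a shared file path column. The following sketch shows that idea with pandas; the `path` join key, the other column names, and the `merge_file_data` helper are assumptions, since the `GitDataMerger` source is not shown.

```python
import pandas as pd


def merge_file_data(file_commits, file_sizes):
    """Attach size columns to every file commit row, keeping rows without a match."""
    return file_commits.merge(
        file_sizes,
        on="path",               # assumed shared key
        how="left",              # keep commits for files that no longer exist on disk
        validate="many_to_one",  # each path should appear once in the size data
    )


# Tiny in-memory example in place of the two CSV files
demo_file_commits = pd.DataFrame({
    "hash": ["a1", "b2", "c3"],
    "path": ["app.py", "app.py", "old.py"],
})
demo_file_sizes = pd.DataFrame({"path": ["app.py"], "lines": [120]})
demo_merged = merge_file_data(demo_file_commits, demo_file_sizes)
```

A left join is used here because files that were deleted later in the history still have commits but no current size; those rows end up with missing values in the size columns rather than being dropped.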