# GitStractor Jupyter Notebook Data Visualization

This notebook provides examples of visualizing codebases using [GitStractor](https://github.com/integerman/gitstractor) and Jupyter Notebooks.

Contact [Matt Eland](https://MattEland.dev) ([@IntegerMan](https://twitter.com/IntegerMan)) with questions.

## Requirements

This notebook currently requires:

- CSV files generated by [GitStractor](https://github.com/integerman/gitstractor)
- Jupyter Notebooks running some version of Python (tested using Python 3.8.8)
- The following Python libraries:
  - pandas
  - plotly (for `plotly.express` and `plotly.graph_objects`)
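
Before running the cells below, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of GitStractor.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

# Top-level names for the libraries listed above
required = ["pandas", "plotly"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```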

In [230]:
# Load Dependencies
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

## Data Loading

In [231]:
# Project Name shows up in some visualizations
project_name = 'Accessible AI Blog'

# This should point to the location containing the GitStractor CSV files
data_dir = 'C:\\tools\\gitstractor'

# These are the default GitStractor file names and shouldn't need to be customized
author_file = data_dir + '\\Authors.csv'
commits_file = data_dir + '\\Commits.csv'
file_commits_file = data_dir + '\\FileCommits.csv'
files_file = data_dir + '\\Files.csv'
final_structure_file = data_dir + '\\FinalStructure.csv'
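
The paths above are built with Windows separators. As a sketch of a more portable alternative, the same file names can be joined with `pathlib`; the directory shown is the one from the cell above and should be adjusted for your machine. `pd.read_csv` accepts `Path` objects, so the rest of the notebook would not need to change.

```python
from pathlib import Path

# Assumption: same folder as in the cell above; adjust for your machine
data_dir = Path('C:/tools/gitstractor')

# Default GitStractor file names, taken from the cell above
author_file = data_dir / 'Authors.csv'
commits_file = data_dir / 'Commits.csv'
file_commits_file = data_dir / 'FileCommits.csv'
files_file = data_dir / 'Files.csv'
final_structure_file = data_dir / 'FinalStructure.csv'

print(author_file.name)
```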

### Load Authors

In [232]:
df_authors = pd.read_csv(author_file)

df_authors.head(5)

,Name,Email,NumCommits,TotalBytes
0,Matt Eland,Matt.Eland@GMail.com,229,5033233
1,repo-visualizer,repo-visualizer@users.noreply.github.com,42,2332366


### Load Commits

In [233]:
df_commits = pd.read_csv(commits_file, parse_dates=['AuthorDateUTC','CommitterDate'])

# Engineer Date Component Columns
df_commits['date'] = df_commits['AuthorDateUTC'].dt.date
df_commits['year'] = df_commits['AuthorDateUTC'].dt.year
df_commits['month'] = df_commits['AuthorDateUTC'].dt.month
df_commits['year-month'] = df_commits['AuthorDateUTC'].to_numpy().astype('datetime64[M]')
df_commits['weekday'] = df_commits['AuthorDateUTC'].dt.weekday
df_commits['weekday_name'] = df_commits['AuthorDateUTC'].dt.strftime("%A")

df_commits.head(5)

,CommitHash,AuthorEmail,AuthorDateUTC,CommitterEmail,CommitterDate,Message,NumFiles,AddedFiles,DeletedFiles,TotalFiles,TotalBytes,FileNames,date,year,month,year-month,weekday,weekday_name
0,0df430136e5ca0af02175d3acc40935e572528a5,Matt.Eland@GMail.com,2022-07-04 02:56:36,Matt.Eland@GMail.com,2022-07-04 02:56:36,Initial commit,3,3,0,3,7175,"dfcfd5 .gitignore @ 0df430 (Added), 144780 LIC...",2022-07-04,2022,7,2022-07-01,0,Monday
1,aec4a2c1f9283bc6da50e264a3ca8743b1f05410,Matt.Eland@GMail.com,2022-07-04 03:07:48,Matt.Eland@GMail.com,2022-07-04 03:07:48,Project structure setup,11,9,0,12,579000,"f0bb9e .gitignore @ aec4a2 (Modified), e34634 ...",2022-07-04,2022,7,2022-07-01,0,Monday
2,85f8e28806fa77c22764300502889d9af4a4740d,Matt.Eland@GMail.com,2022-07-04 03:52:20,Matt.Eland@GMail.com,2022-07-04 03:52:20,Initial setup,12,9,0,20,7922,ec60a2 MattEland.WhereDoggo/MattEland.WhereDog...,2022-07-04,2022,7,2022-07-01,0,Monday
3,5b407858d926bd50620376f0d7ebf871b392e87d,Matt.Eland@GMail.com,2022-07-04 04:03:53,Matt.Eland@GMail.com,2022-07-04 04:03:53,Now tracking game knowledge as events,6,2,0,22,3716,bca261 MattEland.WhereDoggo/MattEland.WhereDog...,2022-07-04,2022,7,2022-07-01,0,Monday
4,c8886dca6f7bbb999670b9dd89dd03dd530cecac,Matt.Eland@GMail.com,2022-07-04 04:32:35,Matt.Eland@GMail.com,2022-07-04 04:32:35,Doggo night phase,11,3,0,25,8770,06c73b MattEland.WhereDoggo/MattEland.WhereDog...,2022-07-04,2022,7,2022-07-01,0,Monday
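
To see what the engineered date columns contain, here is a small self-contained sketch on two timestamps. The first is the initial commit time shown above; the second is made up for illustration. Casting to `datetime64[M]` truncates each timestamp to the first day of its month, and `dt.weekday` counts from Monday as 0.

```python
import pandas as pd

# First value is from the output above; the second is made up
df = pd.DataFrame({'AuthorDateUTC': pd.to_datetime(['2022-07-04 02:56:36',
                                                    '2022-11-15 10:00:00'])})

# Same transformations as in the cell above
df['year-month'] = df['AuthorDateUTC'].to_numpy().astype('datetime64[M]')
df['weekday'] = df['AuthorDateUTC'].dt.weekday
df['weekday_name'] = df['AuthorDateUTC'].dt.strftime('%A')

print(df)
```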


### Load File Commits

In [234]:
df_file_commits = pd.read_csv(file_commits_file)

df_file_commits.head()

,FilePath,FileHash,CommitHash,AuthorEmail,AuthorDateUTC,CommitterEmail,CommitterDate,Message,Bytes,Lines
0,.gitignore,dfcfd56f444f9ae40e1082c07fe254cc547136cf,0df430136e5ca0af02175d3acc40935e572528a5,Matt.Eland@GMail.com,7/4/2022 2:56:36 AM,Matt.Eland@GMail.com,7/4/2022 2:56:36 AM,Initial commit,6002,350
1,LICENSE,1447802c73221ff4f9a42fe06f922d70a1c1c108,0df430136e5ca0af02175d3acc40935e572528a5,Matt.Eland@GMail.com,7/4/2022 2:56:36 AM,Matt.Eland@GMail.com,7/4/2022 2:56:36 AM,Initial commit,1067,21
2,README.md,fa40df806092ac52d73705a6e53e941c1686b328,0df430136e5ca0af02175d3acc40935e572528a5,Matt.Eland@GMail.com,7/4/2022 2:56:36 AM,Matt.Eland@GMail.com,7/4/2022 2:56:36 AM,Initial commit,106,2
3,.gitignore,f0bb9e4ba2142c160fc712ab167ffed69820d129,aec4a2c1f9283bc6da50e264a3ca8743b1f05410,Matt.Eland@GMail.com,7/4/2022 3:07:48 AM,Matt.Eland@GMail.com,7/4/2022 3:07:48 AM,Project structure setup,6036,352
4,MattEland.WhereDoggo/MattEland.WhereDoggo.Core...,e34634acd363adb59e62845b3168ff76f79a6a92,aec4a2c1f9283bc6da50e264a3ca8743b1f05410,Matt.Eland@GMail.com,7/4/2022 3:07:48 AM,Matt.Eland@GMail.com,7/4/2022 3:07:48 AM,Project structure setup,628,21
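
The date columns in this file are loaded as plain strings. If they are needed as timestamps, one option is to parse them with an explicit format. The sketch below assumes month/day/year order with a 12-hour clock, which is consistent with the same commits appearing as 2022-07-04 in the commits table.

```python
import pandas as pd

# Sample values copied from the FileCommits output above
raw = pd.Series(['7/4/2022 2:56:36 AM', '7/4/2022 3:07:48 AM'])

# Assumption: month/day/year with a 12-hour clock, as the samples suggest
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')

print(parsed)
```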


### Load File Structure

In [235]:
df_files = pd.read_csv(final_structure_file)

df_files.fillna('.', inplace=True)
df_files.head(5)

,CommitHash,FileHash,Filename,Extension,FilePath,State,Lines,Bytes,CreatedDateUTC,Path1,Path2,Path3,Path4,Path5
0,5646a44aa369f67be1ff7f261c4d85511aece73a,568164e0c18ed80f5fffc7296d258f67e242951a,AIDesign.md,.md,AIDesign.md,Final,39,953,12/1/2022 7:28:40 PM,.,.,.,.,.
1,5646a44aa369f67be1ff7f261c4d85511aece73a,df87cf951fb4858ab7a76b68dd479c98b2df2404,encodings.xml,.xml,MattEland.WhereDoggo/.idea/.idea.MattEland.Whe...,Final,4,169,12/1/2022 7:28:40 PM,MattEland.WhereDoggo,.idea,.idea.MattEland.WhereDoggo,.idea,.
2,5646a44aa369f67be1ff7f261c4d85511aece73a,7b08163cebc50fb3e777eea4881b68fcebc10590,indexLayout.xml,.xml,MattEland.WhereDoggo/.idea/.idea.MattEland.Whe...,Final,8,198,12/1/2022 7:28:40 PM,MattEland.WhereDoggo,.idea,.idea.MattEland.WhereDoggo,.idea,.
3,5646a44aa369f67be1ff7f261c4d85511aece73a,4bb9f4d2a0dca07b653f7a6218425e1a2ff64d0a,projectSettingsUpdater.xml,.xml,MattEland.WhereDoggo/.idea/.idea.MattEland.Whe...,Final,6,184,12/1/2022 7:28:40 PM,MattEland.WhereDoggo,.idea,.idea.MattEland.WhereDoggo,.idea,.
4,5646a44aa369f67be1ff7f261c4d85511aece73a,6c0b8635858dc7ad44b93df54b762707ce49eefc,vcs.xml,.xml,MattEland.WhereDoggo/.idea/.idea.MattEland.Whe...,Final,6,183,12/1/2022 7:28:40 PM,MattEland.WhereDoggo,.idea,.idea.MattEland.WhereDoggo,.idea,.
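
Judging from the sample rows, `Path1` through `Path5` appear to hold the directory names of `FilePath`, and the `'.'` values come from the `fillna` call above. The sketch below illustrates that relationship on made-up paths. It is an inference from the output, not GitStractor's actual implementation, and `split_directories` is a hypothetical helper.

```python
def split_directories(file_path, depth=5, filler='.'):
    """Return the first `depth` directory names of a path, padded with `filler`."""
    parts = file_path.split('/')[:-1]   # drop the file name itself
    parts = parts[:depth]               # keep at most `depth` levels
    return parts + [filler] * (depth - len(parts))

# Made-up paths for illustration
print(split_directories('README.md'))
print(split_directories('src/app/models/user.py'))
```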


## Data Visualization

In [236]:
# Declare standard styles here

theme_discrete = px.colors.qualitative.Prism
theme_diverging_neutral = px.colors.diverging.RdYlBu
theme_diverging = px.colors.diverging.Picnic_r
theme_diverging_r = px.colors.diverging.Picnic
theme_sequential = px.colors.sequential.Agsunset
theme_continuous = px.colors.diverging.balance
theme_hot = px.colors.sequential.Reds
theme_cold = px.colors.sequential.Blues

template = 'plotly_dark'

In [237]:
# Utility Formatting functions
def format_and_show_short(fig):
    fig.update_layout(template=template,
                      height=400)
    fig.show()

def format_and_show(fig):
    fig.update_layout(template=template,
                      height=550)
    fig.show()

def format_and_show_tall(fig):
    fig.update_layout(template=template,
                      height=800)
    fig.show()

def format_and_show_3d(fig):
    fig.update_layout(template=template,
                      width=800,
                      height=600)
    fig.show()

def format_and_show_sunburst(fig):
    fig.update_layout(template=template,
                      width=1024,
                      height=800)
    fig.show()
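
The five helpers above differ only in their size arguments. As a sketch of one way to reduce the repetition, a single function can take the size as parameters. The names `show_with_size` and `SIZES` are hypothetical; the sizes mirror those used above.

```python
def show_with_size(fig, height, width=None, template='plotly_dark'):
    """Apply the shared template and a size to a figure, then display it."""
    layout = {'template': template, 'height': height}
    if width is not None:
        layout['width'] = width
    fig.update_layout(**layout)
    fig.show()
    return fig

# Hypothetical presets mirroring the sizes used above
SIZES = {
    'short': {'height': 400},
    'default': {'height': 550},
    'tall': {'height': 800},
    '3d': {'height': 600, 'width': 800},
    'sunburst': {'height': 800, 'width': 1024},
}
```

A call such as `show_with_size(fig, **SIZES['tall'])` would then stand in for `format_and_show_tall(fig)`.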

### File Structure

Visualizations of the static structure of the Git repository's final state.

In [238]:
file_labels = {
    'Path1': 'Project',
    'Path2': 'Area',
    'Lines': 'Lines of Code',
    'Lines_sum': 'Total Lines of Code',
}

In [239]:
# Files by File Size (Lines of Code)
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Lines',
                 title=project_name + ' Largest Files (Lines)',
                 labels=file_labels,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [240]:
# Files by File Size (Bytes)
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Bytes',
                 title=project_name + ' Largest Files (Bytes)',
                 labels=file_labels,
                 values='Bytes') # Size tiles by bytes to match the title and color
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [241]:
# Sunburst diagram. Same data as a treemap, but different presentation
fig = px.sunburst(df_files,
                 path=['Path1','Path2','Path3','Filename'],
                 color='Lines',
                 title=project_name + ' Size of Code Files by Project and Directory',
                 hover_data=['FilePath'],
                 color_continuous_scale='sunsetdark',
                 labels=file_labels,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_sunburst(fig)

In [242]:
# Files by Directory Structure
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Path1',
                 title=project_name + ' Project Structure',
                 labels=file_labels,
                 color_discrete_sequence=theme_discrete,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [243]:
# Files by Extension
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Extension',
                 title=project_name + ' File Types',
                 labels=file_labels,
                 color_discrete_sequence=theme_discrete,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)
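
The same breakdown by file type can be checked numerically. The following sketch aggregates lines of code per extension on made-up rows that use the same column names as `df_files`; replacing `sample` with `df_files` should work if those columns are present.

```python
import pandas as pd

# Made-up rows with the same column names as df_files
sample = pd.DataFrame({
    'Extension': ['.cs', '.cs', '.md', '.xml'],
    'Lines': [120, 80, 39, 4],
})

# Total lines and number of files per extension, largest first
lines_by_extension = (sample.groupby('Extension')['Lines']
                            .agg(['sum', 'count'])
                            .sort_values('sum', ascending=False))

print(lines_by_extension)
```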

In [244]:
fig = px.histogram(df_files,
                   x="Lines",
                   title=project_name + ' Frequency of File Sizes by Area',
                   color='Path2',
                   labels=file_labels,
                   color_discrete_sequence=theme_discrete)
format_and_show(fig)

In [245]:
# Overall box plot for all code
fig = px.box(df_files,
             title=project_name + ' Lines of Code by Area',
             x='Lines',
             y='Path2',
             color='Path2',
             color_discrete_sequence=theme_discrete,
             labels=file_labels,
             hover_data=['FilePath'],
             points='outliers') # Acceptable values: 'all', 'outliers', 'suspectedoutliers', or False
fig.update_traces(quartilemethod='linear', jitter=1)
format_and_show(fig)

## Commit Graphs

Visualizations of commit patterns.

In [255]:
# Replacement Values to make the graphs look nice
commit_labels = {
                     'TotalBytes': 'Bytes',
                     'NumFiles': '# Files',
                     'weekday_name': 'Weekday',
                     'AuthorEmail': 'Author',
                     'AuthorDateUTC': 'Date',

                     'net_lines':'Net Lines',
                     'num_deletes': 'Lines Deleted',
                     'num_inserts': 'Lines Added',
                     'num_files': 'Files Modified',
                     'date': 'Date',
                     'datetime': 'Date',
                     'filename': 'File',
                     'message': 'Commit Message',
                     'hash': 'Hash',
                     'author_name': 'Author',
                     'count': 'Count',
                     'avg_net': 'Avg. Net Lines',
                     'num_commits': 'Commits',
                     'num_authors': 'Authors',
                     'sum_net': 'Total Net Lines',
                     'lines': 'Lines of Code',
                     'project': 'Project',
                 }

In [247]:
fig = px.scatter(df_commits,
                 title= project_name + ' bytes per commit',
                 x='AuthorDateUTC',
                 y='TotalBytes',
                 color='TotalBytes',
                 # TotalBytes is numeric, so a continuous color scale applies
                 color_continuous_scale=theme_sequential,
                 hover_data=['AuthorEmail'],
                 labels=commit_labels,
                 hover_name='Message')
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)

In [248]:
fig = px.scatter(df_commits,
                 title= project_name + ' files per commit',
                 x='AuthorDateUTC',
                 y='NumFiles',
                 color='NumFiles',
                 # NumFiles is numeric, so a continuous color scale applies
                 color_continuous_scale=theme_sequential,
                 labels=commit_labels,
                 hover_data=['AuthorEmail'],
                 hover_name='Message')
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)

In [249]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Commits over Time (Day of Week Colorized)',
                 x='date', 
                 color='weekday_name',
                 color_discrete_sequence=theme_sequential,
                 labels=commit_labels,
                 hover_name='Message')
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)
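
To read the weekday distribution as numbers rather than colors, the weekday names can be counted and put in calendar order. This is a minimal sketch on made-up values standing in for `df_commits['weekday_name']`.

```python
import pandas as pd

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Made-up weekday names standing in for df_commits['weekday_name']
names = pd.Series(['Monday', 'Monday', 'Sunday', 'Wednesday', 'Monday'])

# Count occurrences, then reorder Monday..Sunday, filling missing days with 0
counts = names.value_counts().reindex(weekday_order, fill_value=0)

print(counts)
```

The same list can be passed to Plotly Express as `category_orders={'weekday_name': weekday_order}` to keep the legend in calendar order.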

In [250]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Commits by Year (Day of Week Colorized)',
                 x='year', 
                 color='weekday_name',
                 color_discrete_sequence=theme_sequential,
                 labels=commit_labels,
                 hover_name='Message')
fig.update_layout(xaxis_title='Year')
format_and_show_short(fig)

In [251]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Commits by Month (Day of Week Colorized)',
                 x='year-month', 
                 color='weekday_name',
                 color_discrete_sequence=theme_sequential,
                 labels=commit_labels,
                 hover_name='Message')
fig.update_layout(xaxis_title='Month')
format_and_show_short(fig)

In [252]:
num_days = (df_commits['date'].max() - df_commits['date'].min()).days
num_days

150
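
Subtracting two `date` values yields a `timedelta`, and `.days` is the difference between them, so the number of calendar days covered is one more when both endpoints are counted. The sketch below uses the first commit date shown earlier and an assumed last date, for illustration only.

```python
from datetime import date

# First commit date from the table above; the last date is an assumption
first = date(2022, 7, 4)
last = date(2022, 12, 1)

span = (last - first).days
print(span)        # 150
print(span + 1)    # 151 calendar days when both endpoints are counted
```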

In [253]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Daily Commits (Weekdays Colorized)',
                 x='date', 
                 color='weekday_name',
                 nbins=num_days,
                 color_discrete_sequence=theme_sequential,
                 hover_name='Message',
                 labels=commit_labels)
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)

## Authors

Visualizations of author behaviors and tendencies.

In [256]:
fig = px.scatter(df_commits, 
                 title=project_name + ' Total Bytes per Commit by Author',
                 x='AuthorDateUTC', 
                 y='TotalBytes',
                 color='AuthorEmail',
                 color_discrete_sequence=theme_discrete,
                 hover_name='Message',
                 labels=commit_labels)

fig.update_traces(marker=dict(size=5), selector=dict(mode='markers'))

format_and_show(fig)

In [261]:
df_attributed = df_commits[df_commits['AuthorEmail'] != '(no author)']
df_attributed = df_attributed[df_attributed['AuthorEmail'] != 'unknown']

# Note: sum_inserts and sum_deletes count added and deleted files, not lines
df_contributor_monthly = df_attributed.groupby(['year-month','AuthorEmail']).agg(
        count=('CommitHash', pd.Series.nunique),
        sum_files=('NumFiles', 'sum'),
        sum_inserts=('AddedFiles', 'sum'),
        sum_deletes=('DeletedFiles', 'sum')).sort_index(ascending=False)

df_contributor_monthly.head()

year-month,AuthorEmail,count,sum_files,sum_inserts,sum_deletes
2022-12-01,repo-visualizer@users.noreply.github.com,5,5,0,0
2022-12-01,Matt.Eland@GMail.com,11,93,23,0
2022-11-01,repo-visualizer@users.noreply.github.com,11,11,0,0
2022-11-01,Matt.Eland@GMail.com,70,266,56,0
2022-07-01,repo-visualizer@users.noreply.github.com,26,48,1,0
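
The grouped frame is indexed by both month and author. As a sketch of an alternative shape, `reset_index()` turns those index levels back into ordinary columns, which can then be referenced by name when plotting. The rows below are made up and use the same column names as `df_commits`.

```python
import pandas as pd

# Made-up commits with the same column names as df_commits
sample = pd.DataFrame({
    'year-month': pd.to_datetime(['2022-11-01', '2022-11-01', '2022-12-01']),
    'AuthorEmail': ['a@example.com', 'a@example.com', 'b@example.com'],
    'CommitHash': ['h1', 'h2', 'h3'],
    'NumFiles': [3, 2, 5],
})

# Aggregate per month and author, then flatten the two-level index
monthly = (sample.groupby(['year-month', 'AuthorEmail'])
                 .agg(count=('CommitHash', 'nunique'),
                      sum_files=('NumFiles', 'sum'))
                 .reset_index())

print(monthly.columns.tolist())
```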


In [266]:
fig = px.scatter(df_contributor_monthly,
        x=df_contributor_monthly.index.get_level_values(0),
        y='sum_inserts',
        size='count',
        labels=commit_labels,
        color=df_contributor_monthly.index.get_level_values(1),
        color_discrete_sequence=theme_discrete)
fig.update_layout(title=project_name + " Files added by month by author",
                  xaxis_title='Year / Month',
                  yaxis_title='Files Added',
                  legend_title='Author')
format_and_show(fig)

In [268]:
fig = px.scatter(df_contributor_monthly,
                 x=df_contributor_monthly.index.get_level_values(0), 
                 y=df_contributor_monthly.index.get_level_values(1),
                 color=df_contributor_monthly.index.get_level_values(1),
                 size='count',
                 color_discrete_sequence=theme_discrete,
                 labels=commit_labels)
fig.update_layout(title=project_name + " Monthly Contribution History",
                  xaxis_title='Year / Month',
                  yaxis_title='Author',
                  legend_title='Author')
format_and_show(fig)

In [274]:
# Overall box plot for all authors
fig = px.box(df_commits,
             title=project_name + ' bytes per commit by Author',
             x='TotalBytes',
             y='AuthorEmail',
             color='AuthorEmail',
             labels=commit_labels,
             color_discrete_sequence=theme_discrete,
             hover_data=['CommitHash','AuthorDateUTC','Message'],
             points='outliers') # Acceptable values: 'all', 'outliers', 'suspectedoutliers', or False
fig.update_traces(quartilemethod='linear', jitter=1)
format_and_show(fig)