# GitStractor Jupyter Notebook Data Visualization

This notebook serves as a set of examples for visualizing codebases using [GitStractor](https://github.com/integerman/gitstractor) and Jupyter Notebooks.

Contact [Matt Eland](https://MattEland.dev) ([@IntegerMan](https://twitter.com/IntegerMan)) with questions.

## Requirements

This notebook currently requires:

- CSV files generated by [GitStractor](https://github.com/integerman/gitstractor)
- Jupyter Notebook running Python 3 (tested with Python 3.8.8)
- The following Python libraries:
  - pandas
  - plotly (provides `plotly.express` and `plotly.graph_objects`)

In [3]:
# Load Dependencies
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

## Data Loading

In [4]:
# Project Name shows up in some visualizations
project_name = 'AccessibleAI Blog'

# This should point to the location containing the GitStractor CSV files
data_dir = 'C:\\tools\\gitstractor'

# These are the default GitStractor file names and shouldn't need to be customized
author_file = data_dir + '\\Authors.csv'
commits_file = data_dir + '\\Commits.csv'
file_commits_file = data_dir + '\\FileCommits.csv'
files_file = data_dir + '\\Files.csv'
final_structure_file = data_dir + '\\FinalStructure.csv'
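
The paths above are joined with Windows-style separators. As a minimal sketch, assuming the same default GitStractor file names, `pathlib` can build equivalent paths that also work on macOS and Linux. The `gitstractor_paths` helper and the `'data'` directory are hypothetical and not part of GitStractor.

```python
from pathlib import Path

# Default GitStractor output file names, as used in the cell above
GITSTRACTOR_FILES = {
    'authors': 'Authors.csv',
    'commits': 'Commits.csv',
    'file_commits': 'FileCommits.csv',
    'files': 'Files.csv',
    'final_structure': 'FinalStructure.csv',
}

def gitstractor_paths(data_dir):
    """Hypothetical helper: map each logical name to a path under data_dir."""
    base = Path(data_dir)
    return {key: base / name for key, name in GITSTRACTOR_FILES.items()}

# 'data' is a placeholder directory; point it at the folder holding the CSV files
paths = gitstractor_paths('data')
print(paths['commits'])
```

`pd.read_csv` accepts `Path` objects directly, so the resulting values can be passed to it unchanged.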

### Load Authors

In [5]:
df_authors = pd.read_csv(author_file)

df_authors.head(5)

Unnamed: 0,Name,Email,NumCommits,TotalBytes
0,Tatham Oddie,tatham@oddie.com.au,51,332102
1,Richard Banks,rbanks54@msn.com,5,41619
2,Philippe Miossec,pmiossec@gmail.com,11,71711
3,Richard Banks,rbanks54@users.noreply.github.com,5,15023
4,alkampfergit,ricci.gianmaria@gmail.com,1,8272


### Load Commits

In [6]:
def add_date_columns(df, dateColName):
    df['date'] = df[dateColName].dt.date
    df['year'] = df[dateColName].dt.year
    df['month'] = df[dateColName].dt.month
    df['year-month'] = df[dateColName].to_numpy().astype('datetime64[M]')
    df['weekday'] = df[dateColName].dt.weekday
    df['weekday_name'] = df[dateColName].dt.strftime("%A")
    
    return df
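
To make the derived columns concrete, here is a small standard-library sketch that computes the same fields for a single timestamp. The `date_parts` function is illustrative only; the notebook itself relies on the pandas `.dt` accessor shown above.

```python
from datetime import datetime

def date_parts(ts):
    """Return the fields add_date_columns derives, for one timestamp."""
    return {
        'date': ts.date(),
        'year': ts.year,
        'month': ts.month,
        'year-month': ts.date().replace(day=1),  # truncated to the first day of the month
        'weekday': ts.weekday(),                 # Monday=0 ... Sunday=6, same as pandas
        'weekday_name': ts.strftime('%A'),       # day name (locale-dependent)
    }

parts = date_parts(datetime(2013, 11, 5, 0, 8, 43))
print(parts['year-month'], parts['weekday'])  # 2013-11-01 1
```

A `weekday` of 1 is a Tuesday because the numbering starts at 0 for Monday.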

In [7]:
df_commits = pd.read_csv(commits_file, parse_dates=['AuthorDateUTC','CommitterDateUTC'])

# Engineer Date Component Columns
df_commits = add_date_columns(df_commits, 'AuthorDateUTC')

# Grab the name of the author from our authors dataframe
df_commits = df_commits.merge(df_authors[['Name','Email']], right_on='Email', left_on='AuthorEmail')
df_commits.drop(columns=['Email'], inplace=True)
df_commits.rename(columns={'Name':'AuthorName'}, inplace=True)

df_commits.head(5)

Unnamed: 0,CommitHash,AuthorEmail,AuthorDateUTC,CommitterEmail,CommitterDateUTC,Message,NumFiles,AddedFiles,DeletedFiles,TotalFiles,...,FileNames,TotalLines,NetLines,date,year,month,year-month,weekday,weekday_name,AuthorName
0,824aa6bcc14c72f50292ca2b5b6525245df1e0f3,tatham@oddie.com.au,2013-11-05 00:08:43,tatham@oddie.com.au,2013-11-05 00:08:43,Add .gitignore,0,0,0,0,...,,0,0,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
1,593872d838c7afaff0b1b7d4a548eb85e64a796d,tatham@oddie.com.au,2013-11-05 00:25:52,tatham@oddie.com.au,2013-11-05 00:25:52,Stub project structure and first failing test,8,8,0,8,...,"c8fbbe GitViz.sln @ 593872 (Added), 08c12a Log...",263,263,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
2,a5897ad17c39fdff706158f95fef08aa27b5c924,tatham@oddie.com.au,2013-11-05 00:46:33,tatham@oddie.com.au,2013-11-05 00:46:33,Parse the initial commit in a repo,6,3,0,10,...,"e86261 Logic/LogParser.cs @ a5897a (Modified),...",208,94,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
3,c70c8f165ef99eef8029787a6e52aa811081d760,tatham@oddie.com.au,2013-11-05 00:50:40,tatham@oddie.com.au,2013-11-05 00:50:40,Parse commits with a single parent,2,0,0,10,...,"792225 Logic/LogParser.cs @ c70c8f (Modified),...",56,11,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
4,fda42ec9bafd6dd857ed6f54e9c173ed477ecb24,tatham@oddie.com.au,2013-11-05 00:56:29,tatham@oddie.com.au,2013-11-05 00:56:29,Parse multiple parent hashes,3,0,0,10,...,"9b3b39 Logic/Commit.cs @ fda42e (Modified), 25...",76,12,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
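
The merge above uses pandas' default inner join, so any commit whose `AuthorEmail` has no matching `Email` in the authors file is dropped from `df_commits`. The following sketch, using made-up rows rather than real GitStractor output, shows one way to check for such commits with a left join and `indicator=True`.

```python
import pandas as pd

# Made-up rows for illustration only; real data comes from the GitStractor CSV files
authors = pd.DataFrame({'Name': ['Ada'], 'Email': ['ada@example.com']})
commits = pd.DataFrame({
    'CommitHash': ['a1', 'b2'],
    'AuthorEmail': ['ada@example.com', 'ghost@example.com'],
})

# A left join keeps every commit; indicator=True adds a '_merge' column
# that records whether each row found a match in the authors table
checked = commits.merge(authors, how='left', left_on='AuthorEmail',
                        right_on='Email', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['CommitHash'].tolist())  # ['b2']
```

If the unmatched list is empty for your data, the inner join loses nothing.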


In [8]:
df_file_commits = pd.read_csv(file_commits_file, parse_dates=['AuthorDateUTC','CommitterDateUTC'])

# Add Date Columns
df_file_commits = add_date_columns(df_file_commits, 'AuthorDateUTC')

# Grab the name of the author from our authors dataframe
df_file_commits = df_file_commits.merge(df_authors[['Name','Email']], right_on='Email', left_on='AuthorEmail')
df_file_commits.drop(columns=['Email'], inplace=True)
df_file_commits.rename(columns={'Name':'AuthorName'}, inplace=True)

df_file_commits.head()

Unnamed: 0,FilePath,FileHash,CommitHash,AuthorEmail,AuthorDateUTC,CommitterEmail,CommitterDateUTC,Message,Bytes,Lines,NetLines,date,year,month,year-month,weekday,weekday_name,AuthorName
0,GitViz.sln,c8fbbe58e5028a2c70a60b040ffc0f9f3aa4af1d,593872d838c7afaff0b1b7d4a548eb85e64a796d,tatham@oddie.com.au,2013-11-05 00:25:52,tatham@oddie.com.au,2013-11-05 00:25:52,Stub project structure and first failing test,1425,28,28,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
1,Logic/Commit.cs,08c12a10c35fa71c49945cb4a4cb52663bafdd64,593872d838c7afaff0b1b7d4a548eb85e64a796d,tatham@oddie.com.au,2013-11-05 00:25:52,tatham@oddie.com.au,2013-11-05 00:25:52,Stub project structure and first failing test,154,8,8,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
2,Logic/LogParser.cs,f99b6f7caea89e37bd46250076f495430268c157,593872d838c7afaff0b1b7d4a548eb85e64a796d,tatham@oddie.com.au,2013-11-05 00:25:52,tatham@oddie.com.au,2013-11-05 00:25:52,Stub project structure and first failing test,256,13,13,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
3,Logic/Logic.csproj,5a880a431f5cf034c5054a0319b37930a95c9ff4,593872d838c7afaff0b1b7d4a548eb85e64a796d,tatham@oddie.com.au,2013-11-05 00:25:52,tatham@oddie.com.au,2013-11-05 00:25:52,Stub project structure and first failing test,2400,54,54,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie
4,Logic/Properties/AssemblyInfo.cs,a2b1d7d741a5367d9ccad3367bca71b1420fd551,593872d838c7afaff0b1b7d4a548eb85e64a796d,tatham@oddie.com.au,2013-11-05 00:25:52,tatham@oddie.com.au,2013-11-05 00:25:52,Stub project structure and first failing test,1386,36,36,2013-11-05,2013,11,2013-11-01,1,Tuesday,Tatham Oddie


### Load File Structure

In [9]:
df_files = pd.read_csv(final_structure_file, parse_dates=['CreatedDateUTC'])

df_files.fillna('.', inplace=True)
df_files.head(5)

Unnamed: 0,CommitHash,FileHash,Filename,Extension,FilePath,State,Lines,Bytes,CreatedDateUTC,Path1,Path2,Path3,Path4,Path5
0,04c46f70abc8a56beb179fa86cfb0020c8d2d5da,63f01b690c0d045633c3d811da6f4fd17bfee7ac,GitViz.sln,.sln,GitViz.sln,Final,34,1879,2013-11-05 04:34:36,.,.,.,.,.
1,f8e2eae0f98e11fec63659a760e2ddae161999d2,7c08d3c98bf5b7b3910fe387ea50bbe8d51e2e68,Commit.cs,.cs,Logic/Commit.cs,Final,15,348,2013-11-06 01:33:00,Logic,.,.,.,.
2,963129f57529c88989b8c336f0f5c8f7a620890c,4044de8a2f08ab94a914e9d7360d35bdf11f7651,CommitEdge.cs,.cs,Logic/CommitEdge.cs,Final,12,215,2013-11-05 06:42:27,Logic,.,.,.,.
3,963129f57529c88989b8c336f0f5c8f7a620890c,afa6b4d80636fe1fe08552405635ea328911e465,CommitGraph.cs,.cs,Logic/CommitGraph.cs,Final,8,131,2013-11-05 06:42:27,Logic,.,.,.,.
4,619a59ec387cde9c6e0013873a844ef0340ba85a,4b94c9ae77b4131e3939e35532a998a2948a2a54,FsckParser.cs,.cs,Logic/FsckParser.cs,Final,20,547,2013-11-06 01:47:17,Logic,.,.,.,.
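
The `Path1` through `Path5` columns drive the hierarchy in the treemap and sunburst charts later in this notebook. Judging only from the sample rows above, they appear to hold the directory components of `FilePath`, with `.` filling unused levels after `fillna`. The sketch below reproduces that layout under that assumption; `split_path_levels` is a hypothetical helper and may not match how GitStractor actually derives these columns.

```python
def split_path_levels(file_path, depth=5, filler='.'):
    """Split a file's directories into a fixed number of levels, padding with filler."""
    directories = file_path.split('/')[:-1]   # drop the file name itself
    levels = directories[:depth]              # ignore directories deeper than depth
    return levels + [filler] * (depth - len(levels))

print(split_path_levels('GitViz.sln'))       # ['.', '.', '.', '.', '.']
print(split_path_levels('Logic/Commit.cs'))  # ['Logic', '.', '.', '.', '.']
```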


In [10]:
# Get Aggregate level data for each file
df_file_commits_agg = df_file_commits.groupby('FilePath').agg(
    num_commits=('FileHash',pd.Series.nunique),
    sum_bytes=('Bytes', 'sum'),
    avg_bytes=('Bytes', 'mean'),
    avg_lines=('Lines', 'mean'),
    min_date=('date', 'min'),
    max_date=('date', 'max'),
    first_author=('AuthorName', 'first'),
    last_author=('AuthorName', 'last'),
    modal_author=('AuthorName', lambda x : x.value_counts().index[0]),
)
df_file_commits_agg.head()

FilePath,num_commits,sum_bytes,avg_bytes,avg_lines,min_date,max_date,first_author,last_author,modal_author
GitViz.sln,2,3304,1652.0,31.0,2013-11-05,2013-11-05,Tatham Oddie,Tatham Oddie,Tatham Oddie
Logic/Commit.cs,6,1484,247.333333,11.5,2013-11-05,2013-11-06,Tatham Oddie,Tatham Oddie,Tatham Oddie
Logic/CommitEdge.cs,2,430,215.0,12.0,2013-11-05,2013-11-05,Tatham Oddie,Tatham Oddie,Tatham Oddie
Logic/CommitGraph.cs,2,262,131.0,8.0,2013-11-05,2013-11-05,Tatham Oddie,Tatham Oddie,Tatham Oddie
Logic/FsckParser.cs,2,1088,544.0,20.0,2013-11-06,2013-11-06,Tatham Oddie,Tatham Oddie,Tatham Oddie
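
The `modal_author` aggregation picks, for each file, the author who appears most often in that file's commit history: `value_counts()` sorts by frequency, so `index[0]` is the most frequent value. A standard-library sketch of the same idea, using an illustrative `modal_author` function:

```python
from collections import Counter

def modal_author(authors):
    """Return the most frequent author name in a sequence of names."""
    # most_common(1) returns a list holding one (value, count) pair
    return Counter(authors).most_common(1)[0][0]

history = ['Tatham Oddie', 'Richard Banks', 'Tatham Oddie']
print(modal_author(history))  # Tatham Oddie
```

When two authors are tied, this sketch and the pandas version may not pick the same one, so treat the result as "a most common author" rather than a strict ranking.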


In [11]:
# Merge aggregate data into the file dataset
df_files = df_files.merge(df_file_commits_agg, on='FilePath')
df_files.head(5)

Unnamed: 0,CommitHash,FileHash,Filename,Extension,FilePath,State,Lines,Bytes,CreatedDateUTC,Path1,...,Path5,num_commits,sum_bytes,avg_bytes,avg_lines,min_date,max_date,first_author,last_author,modal_author
0,04c46f70abc8a56beb179fa86cfb0020c8d2d5da,63f01b690c0d045633c3d811da6f4fd17bfee7ac,GitViz.sln,.sln,GitViz.sln,Final,34,1879,2013-11-05 04:34:36,.,...,.,2,3304,1652.0,31.0,2013-11-05,2013-11-05,Tatham Oddie,Tatham Oddie,Tatham Oddie
1,f8e2eae0f98e11fec63659a760e2ddae161999d2,7c08d3c98bf5b7b3910fe387ea50bbe8d51e2e68,Commit.cs,.cs,Logic/Commit.cs,Final,15,348,2013-11-06 01:33:00,Logic,...,.,6,1484,247.333333,11.5,2013-11-05,2013-11-06,Tatham Oddie,Tatham Oddie,Tatham Oddie
2,963129f57529c88989b8c336f0f5c8f7a620890c,4044de8a2f08ab94a914e9d7360d35bdf11f7651,CommitEdge.cs,.cs,Logic/CommitEdge.cs,Final,12,215,2013-11-05 06:42:27,Logic,...,.,2,430,215.0,12.0,2013-11-05,2013-11-05,Tatham Oddie,Tatham Oddie,Tatham Oddie
3,963129f57529c88989b8c336f0f5c8f7a620890c,afa6b4d80636fe1fe08552405635ea328911e465,CommitGraph.cs,.cs,Logic/CommitGraph.cs,Final,8,131,2013-11-05 06:42:27,Logic,...,.,2,262,131.0,8.0,2013-11-05,2013-11-05,Tatham Oddie,Tatham Oddie,Tatham Oddie
4,619a59ec387cde9c6e0013873a844ef0340ba85a,4b94c9ae77b4131e3939e35532a998a2948a2a54,FsckParser.cs,.cs,Logic/FsckParser.cs,Final,20,547,2013-11-06 01:47:17,Logic,...,.,2,1088,544.0,20.0,2013-11-06,2013-11-06,Tatham Oddie,Tatham Oddie,Tatham Oddie


## Data Visualization

In [12]:
# Declare standard styles here

theme_discrete = px.colors.qualitative.Prism
theme_diverging_neutral = px.colors.diverging.RdYlBu
theme_diverging = px.colors.diverging.Picnic_r
theme_diverging_r = px.colors.diverging.Picnic
theme_sequential = px.colors.sequential.Agsunset
theme_continuous = px.colors.diverging.balance
theme_hot = px.colors.sequential.Reds
theme_cold = px.colors.sequential.Blues

template = 'plotly_dark'

In [13]:
# Utility Formatting functions
def format_and_show_short(fig):
    fig.update_layout(template=template,
                      height=400)
    fig.show()

def format_and_show(fig):
    fig.update_layout(template=template,
                      height=550)
    fig.show()

def format_and_show_tall(fig):
    fig.update_layout(template=template,
                      height=800)
    fig.show()

def format_and_show_3d(fig):
    fig.update_layout(template=template,
                      width=800,
                      height=600)
    fig.show()

def format_and_show_sunburst(fig):
    fig.update_layout(template=template,
                      width=1024,
                      height=800)
    fig.show()

### File Structure

Data visualizations exploring the static structure of the git repository's final state.

In [14]:
file_labels = {
    'Path1': 'Project',
    'Path2': 'Area',
    'Lines': 'Lines of Code',
    'Lines_sum': 'Total Lines of Code',
    'num_commits': '# Commits',
}

In [15]:
# Files by File Size
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Lines',
                 title=project_name + ' Largest Files (Lines)',
                 labels=file_labels,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [16]:
# Files by File Size (Bytes)
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Bytes',
                 title=project_name + ' Largest Files (Bytes)',
                 labels=file_labels,
                 values='Bytes')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [17]:
# Sunburst diagram. Same data as a treemap, but different presentation
fig = px.sunburst(df_files,
                 path=['Path1','Path2','Path3','Filename'],
                 color='Lines',
                 title=project_name + ' Size of Code Files by Project and Directory',
                 hover_data=['FilePath'],
                 color_continuous_scale='sunsetdark',
                 labels=file_labels,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_sunburst(fig)

In [18]:
# Files by Directory Structure
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Path1',
                 title=project_name + ' Project Structure',
                 labels=file_labels,
                 color_discrete_sequence=theme_discrete,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [19]:
# Files by Extension
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='Extension',
                 title=project_name + ' File Types',
                 labels=file_labels,
                 color_discrete_sequence=theme_discrete,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [20]:
fig = px.histogram(df_files,
                   x="Lines",
                   title=project_name + ' Frequency of File Sizes by Area',
                   color='Path2',
                   labels=file_labels,
                   color_discrete_sequence=theme_discrete)
format_and_show(fig)

In [21]:
# Overall box plot for all code
fig = px.box(df_files,
             title=project_name + ' Lines of Code by Area',
             x='Lines',
             y='Path2',
             color='Path2',
             color_discrete_sequence=theme_discrete,
             labels=file_labels,
             hover_data=['FilePath'],
             points='outliers') # Acceptable values: 'all', 'outliers', 'suspectedoutliers', or False
fig.update_traces(quartilemethod='linear', jitter=1)
format_and_show(fig)

In [22]:
# Files by Date Created
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='min_date',
                 title=project_name + ' Files by Creation Date',
                 labels=file_labels,
                 color_discrete_sequence=theme_sequential,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [23]:
# Files by Date Modified
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='max_date',
                 title=project_name + ' Files by Date Last Modified',
                 labels=file_labels,
                 color_discrete_sequence=theme_sequential,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [24]:
# Files by Number of Commits
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='num_commits',
                 title=project_name + ' Files by # Commits',
                 labels=file_labels,
                 color_continuous_scale=theme_sequential,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [25]:
# Sunburst diagram. Same data as a treemap, but different presentation
fig = px.sunburst(df_files,
                 path=['Path1','Path2','Path3','Filename'],
                 color='num_commits',
                 title=project_name + ' # Commits by Project Structure',
                 hover_data=['FilePath'],
                 color_continuous_scale='sunsetdark',
                 labels=file_labels,
                 values='num_commits')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_sunburst(fig)

In [26]:
# Files by Creator (author of the first commit touching each file)
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='first_author',
                 title=project_name + ' Files by Creator',
                 labels=file_labels,
                 color_discrete_sequence=theme_discrete,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [27]:
# Files by Last Modified By (author of the most recent commit touching each file)
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='last_author',
                 title=project_name + ' Files by Last Modified By',
                 labels=file_labels,
                 color_discrete_sequence=theme_discrete,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [28]:
# Files by Most Common Author
fig = px.treemap(df_files,
                 path=[px.Constant(project_name),'Path1','Path2','Path3','Filename'],
                 color='modal_author',
                 title=project_name + ' Files by Most Common Author',
                 labels=file_labels,
                 color_discrete_sequence=theme_discrete,
                 values='Lines')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
format_and_show_tall(fig)

In [29]:
df_file_commits_daily = df_file_commits.groupby('date').agg(
    NetLines=('NetLines', 'sum'),
    NumFiles=('FilePath', pd.Series.count),
    NumCommits=('CommitHash', pd.Series.count)
)

df_file_commits_daily.head()

date,NetLines,NumFiles,NumCommits
2013-11-05,1942,97,97
2013-11-06,353,48,48
2013-11-11,14,4,4
2013-11-13,33,5,5
2014-04-22,18,2,2


In [30]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_file_commits_daily.index,
            mode='lines+markers',
            marker=dict(
                color=df_file_commits_daily['NumFiles'],
                size=8,
                colorscale=px.colors.diverging.balance
            ),
            y=df_file_commits_daily['NumFiles']))
fig.update_layout(title=project_name + " Files Changed",
                  yaxis_title="# Files Changed",
                  xaxis_title="Date")
format_and_show(fig)

### Commit Graphs

Data visualizations exploring commit patterns.

In [31]:
# Replacement Values to make the graphs look nice
commit_labels = {
                     'TotalBytes': 'Bytes',
                     'NumFiles': '# Files',
                     'weekday_name': 'Weekday',
                     'AuthorEmail': 'Author E-Mail',
                     'AuthorDateUTC': 'Date',
                     'AuthorName': 'Author',

                     'net_lines':'Net Lines',
                     'num_deletes': 'Lines Deleted',
                     'num_inserts': 'Lines Added',
                     'num_files': 'Files Modified',
                     'date': 'Date',
                     'datetime': 'Date',
                     'filename': 'File',
                     'message': 'Commit Message',
                     'hash': 'Hash',
                     'author_name': 'Author',
                     'count': 'Count',
                     'avg_net': 'Avg. Net Lines',
                     'num_commits': 'Commits',
                     'num_authors': 'Authors',
                     'sum_net': 'Total Net Lines',
                     'lines': 'Lines of Code',
                     'project': 'Project',
                 }

In [32]:
fig = px.scatter(df_commits, 
                 title=project_name + ' Bytes per Commit',
                 x='AuthorDateUTC',
                 y='TotalBytes',
                 color='TotalBytes',
                 color_continuous_scale=theme_sequential,
                 hover_data=['AuthorName'],
                 labels=commit_labels,
                 hover_name='Message')
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)

In [33]:
fig = px.scatter(df_commits, 
                 title=project_name + ' Files per Commit',
                 x='AuthorDateUTC',
                 y='NumFiles',
                 color='NumFiles',
                 color_continuous_scale=theme_sequential,
                 labels=commit_labels,
                 hover_data=['AuthorName'],
                 hover_name='Message')
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)

In [34]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Commits by Day of Week',
                 x='date', 
                 color='weekday_name',
                 color_discrete_sequence=theme_sequential,
                 labels=commit_labels)
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)

In [35]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Commits by Year (Day of Week Colorized)',
                 x='year', 
                 color='weekday_name',
                 color_discrete_sequence=theme_sequential,
                 labels=commit_labels)
fig.update_layout(xaxis_title='Year')
format_and_show_short(fig)

In [36]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Commits by Month (Day of Week Colorized)',
                 x='year-month', 
                 color='weekday_name',
                 color_discrete_sequence=theme_sequential,
                 labels=commit_labels)
fig.update_layout(xaxis_title='Month')
format_and_show_short(fig)

In [37]:
# Determine the length of the project in days. This allows us to bin future graphs by the exact number of days in the project
num_days = (df_commits['date'].max() - df_commits['date'].min()).days
num_days

1967
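
The project length is simply the difference between the latest and earliest commit dates. A short standard-library sketch with two example dates shows what `.days` returns and how it differs from counting both endpoints:

```python
from datetime import date

# Example dates only; the notebook uses the min and max of df_commits['date']
first_commit = date(2013, 11, 5)
last_commit = date(2013, 11, 13)

# Subtracting two dates yields a timedelta; .days is the whole-day difference
span_days = (last_commit - first_commit).days
print(span_days)      # 8
print(span_days + 1)  # 9 calendar days when both endpoints are counted
```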

In [38]:
fig = px.histogram(df_commits, 
                 title=project_name + ' Daily Commits (Weekdays Colorized)',
                 x='date', 
                 color='weekday_name',
                 nbins=num_days,
                 color_discrete_sequence=theme_sequential,
                 hover_name='Message',
                 labels=commit_labels)
fig.update_layout(xaxis_title='Date')
format_and_show_short(fig)

In [39]:
# Let's now remove author from things and just look at aggregate daily totals
df_commits_daily = df_commits.groupby('date').agg(
        num_files=('NumFiles', 'sum'),
        num_authors=('AuthorEmail', pd.Series.nunique),
        num_commits=('CommitHash', pd.Series.nunique),
        sum_inserts=('AddedFiles', 'sum'),
        sum_deletes=('DeletedFiles', 'sum'),
        sum_total_files=('TotalFiles', 'sum'),
        min_total_files = ('TotalFiles', 'min'), 
        max_total_files=('TotalFiles', 'max'),
        avg_total_files=('TotalFiles', 'mean'),
        min_files = ('NumFiles', 'min'), 
        max_files=('NumFiles', 'max'),
        avg_files=('NumFiles', 'mean'),
        sum_net=('NetLines','sum'),
        avg_net=('NetLines','mean'),
        min_net=('NetLines','min'),
        max_net=('NetLines','max'),
        total_lines=('TotalLines', 'max'),
        min_deletes = ('DeletedFiles', 'min'), 
        max_deletes=('DeletedFiles', 'max'),
        avg_deletes=('DeletedFiles', 'mean'),
        min_inserts = ('AddedFiles', 'min'), 
        max_inserts=('AddedFiles', 'max'),
        avg_inserts=('AddedFiles', 'mean'))

df_commits_monthly = df_commits.groupby('year-month').agg(
        num_files=('NumFiles', 'sum'),
        num_authors=('AuthorEmail', pd.Series.nunique),
        num_commits=('CommitHash', pd.Series.nunique),
        sum_inserts=('AddedFiles', 'sum'),
        sum_deletes=('DeletedFiles', 'sum'),
        sum_total_files=('TotalFiles', 'sum'),
        min_total_files = ('TotalFiles', 'min'), 
        max_total_files=('TotalFiles', 'max'),
        avg_total_files=('TotalFiles', 'mean'),
        min_files = ('NumFiles', 'min'), 
        max_files=('NumFiles', 'max'),
        avg_files=('NumFiles', 'mean'),
        sum_net=('NetLines','sum'),
        avg_net=('NetLines','mean'),
        min_net=('NetLines','min'),
        max_net=('NetLines','max'),
        total_lines=('TotalLines', 'max'),
        min_deletes = ('DeletedFiles', 'min'), 
        max_deletes=('DeletedFiles', 'max'),
        avg_deletes=('DeletedFiles', 'mean'),
        min_inserts = ('AddedFiles', 'min'), 
        max_inserts=('AddedFiles', 'max'),
        avg_inserts=('AddedFiles', 'mean'))

agg_commit_hover_data = ['sum_inserts', 'sum_deletes', 'min_files', 'min_inserts', 'min_deletes', 'max_files', 'max_inserts', 'max_deletes', 'avg_files', 'avg_inserts','avg_deletes']
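
The daily and monthly aggregations above repeat the same list of named aggregations. One way to keep them in sync, sketched here with made-up rows and a shortened list of columns, is to define the specification once as a dictionary and unpack it into each `agg` call. The `commit_aggs` name and the `month` column are illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'date': ['2013-11-05', '2013-11-05', '2013-11-06'],
    'month': ['2013-11', '2013-11', '2013-11'],
    'CommitHash': ['a1', 'b2', 'c3'],
    'AddedFiles': [8, 3, 0],
    'DeletedFiles': [0, 1, 2],
})

# One named-aggregation spec, reused for every grouping
commit_aggs = dict(
    num_commits=('CommitHash', 'nunique'),
    sum_inserts=('AddedFiles', 'sum'),
    sum_deletes=('DeletedFiles', 'sum'),
)

daily = df.groupby('date').agg(**commit_aggs)
monthly = df.groupby('month').agg(**commit_aggs)
print(monthly.loc['2013-11', 'sum_inserts'])  # 11
```

With a shared specification, a column mix-up only needs to be corrected in one place.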

In [40]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_commits_daily.index,
            mode='lines+markers',
            marker=dict(
                color=df_commits_daily['sum_net'],
                size=8,
                colorscale=px.colors.diverging.balance
            ),
            y=df_commits_daily['sum_net']))
fig.update_layout(title=project_name + " Daily Net Changes",
                  yaxis_title="Net Change (Lines of Code)",
                  xaxis_title="Date")
format_and_show(fig)

In [41]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_commits_monthly.index,
            mode='lines+markers',
            marker=dict(
                color=df_commits_monthly['sum_net'],
                size=8,
                colorscale=px.colors.diverging.balance
            ),
            y=df_commits_monthly['sum_net']))
fig.update_layout(title=project_name + " Monthly Net Changes",
                  yaxis_title="Net Change (Lines of Code)",
                  xaxis_title="Date")
format_and_show(fig)

In [42]:
fig = px.scatter(df_commits_daily, 
                 title=project_name + ' Daily Commit Counts',
                 x=df_commits_daily.index,
                 y='num_commits', 
                 color='num_commits',
                 hover_data=agg_commit_hover_data,
                 hover_name=df_commits_daily.index,
                 color_continuous_scale=theme_sequential,
                 labels=commit_labels)

fig.update_layout(xaxis_title='Date')

format_and_show(fig)

In [43]:
fig = px.scatter(df_commits_daily, 
                 title=project_name + ' Daily Commit and Author Counts',
                 x=df_commits_daily.index,
                 y='num_commits', 
                 color='num_authors',
                 hover_name=df_commits_daily.index,
                 hover_data=agg_commit_hover_data,
                 color_continuous_scale=theme_sequential,
                 labels=commit_labels)

fig.update_layout(xaxis_title='Date')

format_and_show(fig)

In [44]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_commits_monthly.index,
            mode='lines+markers',
            name='Commits',
            line=dict(
                color='Purple'
            ),
            marker=dict(
                color=df_commits_monthly['num_authors'],
                size=8,
                colorscale=theme_sequential,
                colorbar=dict(
                    title="Authors"
                ),
            ),
            y=df_commits_monthly['num_commits']))

fig.update_layout(xaxis_title='Date',yaxis_title='Commits', title=project_name + " Monthly Commits and Authors")

format_and_show(fig)

In [45]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_commits_monthly.index,
            mode='lines+markers',
            name='Authors',
            line=dict(
                color='Purple'
            ),
            marker=dict(
                color=df_commits_monthly['num_commits'],
                size=8,
                colorscale=theme_sequential,
                colorbar=dict(
                    title="Commits"
                ),
            ),
            y=df_commits_monthly['num_authors']))

fig.update_layout(xaxis_title='Date',yaxis_title='Authors', title=project_name +" Monthly Authors and Commits")

format_and_show(fig)

In [46]:
fig = px.scatter_3d(df_commits_monthly, 
                 title=project_name + ' Monthly Commit and Author Counts',
                 x=df_commits_monthly.index,
                 y='num_commits', 
                 z='num_authors',
                 color='num_authors',
                 hover_name=df_commits_monthly.index,
                 hover_data=agg_commit_hover_data,
                 color_continuous_scale=theme_sequential,
                 labels=commit_labels)

# 3D charts keep their axes under the scene, so the title is set there
fig.update_layout(scene=dict(xaxis_title='Date'))

format_and_show_3d(fig)

In [47]:
fig = px.scatter_3d(df_commits_daily, 
                 title=project_name + ' Daily Commit and Average Files Modified',
                 x=df_commits_daily.index,
                 y='num_commits', 
                 z='avg_files',
                 color='avg_files',
                 hover_data=agg_commit_hover_data,
                 hover_name=df_commits_daily.index,
                 color_continuous_scale=theme_sequential,
                 labels=commit_labels)

# 3D charts keep their axes under the scene, so the title is set there
fig.update_layout(scene=dict(xaxis_title='Date'))

format_and_show_3d(fig)

### Authors

Data visualizations exploring author behaviors and tendencies.

In [48]:
fig = px.scatter(df_commits, 
                 title=project_name + ' Total Bytes per Commit by Author',
                 x='AuthorDateUTC', 
                 y='TotalBytes',
                 color='AuthorName',
                 color_discrete_sequence=theme_discrete,
                 hover_name='Message',
                 labels=commit_labels)

fig.update_traces(marker=dict(size=5), selector=dict(mode='markers'))

format_and_show(fig)

In [49]:
# Exclude commits that have no usable author attribution
df_attributed = df_commits[~df_commits['AuthorName'].isin(['(no author)', 'unknown'])]

df_contributor_monthly = df_attributed.groupby(['year-month','AuthorName']).agg(
        count=('CommitHash', pd.Series.nunique),
        sum_files=('NumFiles', 'sum'),
        sum_inserts=('AddedFiles', 'sum'),
        sum_deletes=('DeletedFiles', 'sum')).sort_index(ascending=False)

In [50]:
fig = px.scatter(df_contributor_monthly,
        x=df_contributor_monthly.index.get_level_values(0),
        y='sum_inserts',
        size='count',
        labels=commit_labels,
        color=df_contributor_monthly.index.get_level_values(1),
        color_discrete_sequence=theme_discrete)
fig.update_layout(title=project_name + " Files Added by Month by Author",
                  xaxis_title='Year / Month',
                  yaxis_title='Files Added',
                  legend_title='Author')
format_and_show(fig)

In [51]:
fig = px.scatter(df_contributor_monthly,
                 x=df_contributor_monthly.index.get_level_values(0), 
                 y=df_contributor_monthly.index.get_level_values(1),
                 color=df_contributor_monthly.index.get_level_values(1),
                 size='count',
                 color_discrete_sequence=theme_discrete,
                 labels=commit_labels)
fig.update_layout(title=project_name + " Monthly Contribution History",
                  xaxis_title='Year / Month',
                  yaxis_title='Author',
                  legend_title='Author')
format_and_show(fig)

In [52]:
# Overall box plot for all authors
fig = px.box(df_commits,
             title=project_name + ' Bytes per Commit by Author',
             x='TotalBytes',
             y='AuthorName',
             color='AuthorName',
             labels=commit_labels,
             color_discrete_sequence=theme_discrete,
             hover_data=['CommitHash','AuthorDateUTC','Message'],
             points='outliers') # Acceptable values: 'all', 'outliers', 'suspectedoutliers', or False
fig.update_traces(quartilemethod='linear', jitter=1)
format_and_show(fig)

In [53]:
# Overall box plot for all authors
fig = px.box(df_commits,
             title=project_name + ' Net Lines of Code per Commit by Author',
             x='NetLines',
             y='AuthorName',
             color='AuthorName',
             labels=commit_labels,
             color_discrete_sequence=theme_discrete,
             hover_data=['CommitHash','AuthorDateUTC','Message'],
             points='outliers') # Acceptable values: 'all', 'outliers', 'suspectedoutliers', or False
fig.update_traces(quartilemethod='linear', jitter=1)
format_and_show(fig)

In [54]:
# Overall box plot for all authors
fig = px.box(df_commits,
             title=project_name + ' # Files Modified per Commit by Author',
             x='NumFiles',
             y='AuthorName',
             color='AuthorName',
             labels=commit_labels,
             color_discrete_sequence=theme_discrete,
             hover_data=['CommitHash','AuthorDateUTC','Message'],
             points='outliers') # Acceptable values: 'all', 'outliers', 'suspectedoutliers', or False
fig.update_traces(quartilemethod='linear', jitter=1)
format_and_show(fig)