In [149]:
# Import things we'll need for analysis
import pandas as pd

# Loading Data
Before we can do any analysis we need to load our data for various types of things.

Let's load the data for:

- Releases
- Issues
- Commits
- Files, Directory Structure, and File Sizes

## Loading Release Data from GitHub
Release data is provided via [the GitHub's CLI](https://cli.github.com/) using the following command:

```bash
gh release list | tr '\t' '|' > releases.csv
```

This will create a `releases.csv` file with different values separated by a `|` delimiter. Note the we use the `|` delimiter because commas in string fields otherwise might prove a problem.

Note that GitHub releases do not include a header row so we'll have to specify the column headers manually when loading data.

Let's load that CSV file into a Pandas DataFrame which will let us work with tabular data in an efficient way.

In [150]:
# Load release data
df_releases = pd.read_csv('releases.csv', sep='|', names=['ID','Tag','Name','Date'],parse_dates=['Date'])

# Ensure the Date column can be worked with as a date later on
#df_releases['Date'] = pd.to_datetime(df_releases['Date'], errors='coerce', utc=True)

# Ensure our releases are tagged as releases instead of NA
df_releases['Tag'] = df_releases['Tag'].fillna('Release')

# Display the top 5 Releases
df_releases.head()

Unnamed: 0,ID,Tag,Name,Date
0,release-20210321,Latest,release-20210321,2021-03-21 11:20:15+00:00
1,playtest-20210131,Pre-release,playtest-20210131,2021-01-30 21:27:38+00:00
2,playtest-20201213,Pre-release,playtest-20201213,2020-12-12 20:21:43+00:00
3,Release 20200503,Release,release-20200503,2020-05-03 12:50:18+00:00
4,Playtest 20200426,Pre-release,playtest-20200426,2020-04-26 11:54:14+00:00


## Loading Issue Data from GitHub
Issue data is provided via [the GitHub's CLI](https://cli.github.com/) using the following command:

```bash
gh issue list --limit 10000 --state all | tr '\t' '|' > issues.csv
```

This will create a `issues.csv` file with different values separated by a `|` delimiter. Note the we use the `|` delimiter because commas in string fields otherwise might prove a problem.

Note that GitHub issues do not include a header row so we'll have to specify the column headers manually when loading data.

In [151]:
# Load issue data
df_issues = pd.read_csv('issues.csv', sep='\t', names=['ID', 'Status', 'Message', 'Labels', 'Date'])

# Ensure the Date column can be worked with as a date later on. Sample date format: 2021-12-28 11:42:22 +0000 UTC
df_issues['Date'] = pd.to_datetime(df_issues['Date'], utc=True, format='%Y-%m-%d %H:%M:%S %z UTC')

# Ensure our labels use empty strings instead of NA
df_issues['Labels'] = df_issues['Labels'].fillna('')

# Display the top 5 issues
df_issues.head()

Unnamed: 0,ID,Status,Message,Labels,Date
0,19857,OPEN,Engine / Mod credit tab highlighting doesn't work,Bug,2021-12-28 11:42:22+00:00
1,19855,CLOSED,ChangesHealth incorrect work,Limitation,2021-12-28 23:18:47+00:00
2,19849,OPEN,Connection Failed during Singleplayer Skirmish,Bug,2021-12-20 08:16:24+00:00
3,19848,OPEN,Feedback wanted on sample HD assets for Red Alert,"Red Alert, Question / Support, Artwork",2021-12-21 11:04:23+00:00
4,19845,OPEN,Broken attack move for vehicel with turret weapon,Bug,2021-12-14 20:52:29+00:00


## Load Git Commit Data
Git Commit data was pulled via a time-consuming process using [PyDriller](https://pydriller.readthedocs.io/en/latest/intro.html) and saved into an `OpenRA_FileCommits.csv` file. This process takes multiple hours depending on your processing, disk, memory, and network connection and is best done overnight. See `GitFileDataExtraction.ipynb` for more details on that process.

Git Commit data was not available as part of the GitHub CLI at the time of this experiment, but might be supported now if you are reading this later after early 2022.

Let's load the previously saved CSV data into a data frame.

In [152]:
df_commits = pd.read_csv('OpenRA_FileCommits.csv', parse_dates=['author_date'])
df_commits.head()

Unnamed: 0.1,Unnamed: 0,hash,message,author_name,author_date,in_main,is_merge,num_deletes,num_inserts,net_lines,num_files,branches,filename,old_path,new_path,project_name,project_path,parents
0,0,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,chrisf,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,Blowfish.cs,,MixBrowser\Blowfish.cs,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
1,1,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,chrisf,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MM.DAT,MixBrowser\MM.DAT,MixBrowser\MM.DAT,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
2,2,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,chrisf,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MixBrowser.csproj,,MixBrowser\MixBrowser.csproj,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
3,3,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,chrisf,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MixEntry.cs,,MixBrowser\MixEntry.cs,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
4,4,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,chrisf,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,Program.cs,,MixBrowser\Program.cs,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,


Okay, great, but we need to do some cleaning. There are a few authors who appear in this data under multiple names. For example, *chrisf* in the commits above should be merged into *Chris Forbes*.

Merging authors is a common problem and one that can be exposed by exporting the unique authors to a CSV file and looking at it manually in Excel with the following code:

```py
df_commits = df.groupby('author_name')
               .agg(count=('hash', 'count'))
               .sort_values('count', ascending=False)
               .to_csv('authors.csv')
```

However, we've already done that, so let's do the renaming now:

In [153]:
# Data Cleaning - Handle duplicate names identified for contributors
df_commits['author_name'] = df_commits['author_name'].replace(['chrisf'], 'Chris Forbes')
df_commits['author_name'] = df_commits['author_name'].replace(['Curtis S'], 'Curtis Shmyr')
df_commits['author_name'] = df_commits['author_name'].replace(['dan9550'], 'Dan9550')
df_commits['author_name'] = df_commits['author_name'].replace(['DArcy Rush'], 'D\'Arcy Rush')
df_commits['author_name'] = df_commits['author_name'].replace(['David JimÃ©nez'], 'David JimeÌnez')
df_commits['author_name'] = df_commits['author_name'].replace(['Deniz AyÄ±kol'], 'Deniz Ayikol')
df_commits['author_name'] = df_commits['author_name'].replace(['forcecore'], 'Forcecore')
df_commits['author_name'] = df_commits['author_name'].replace(['Guido L', 'Guido L.'], 'Guido Lipke')
df_commits['author_name'] = df_commits['author_name'].replace(['huwpascoe'], 'Huw Pascoe')
df_commits['author_name'] = df_commits['author_name'].replace(['Matija H', 'Matija HustiÄ‡'], 'matija-hustic')
df_commits['author_name'] = df_commits['author_name'].replace(['Matthias MailÃƒÂ¤nder', 'Matthias MailaÌˆnder'], 'Matthias MailÃ¤nder')
df_commits['author_name'] = df_commits['author_name'].replace(['MustaphaTR'], 'Mustapha')
df_commits['author_name'] = df_commits['author_name'].replace(['pchote'], 'Paul Chote')
df_commits['author_name'] = df_commits['author_name'].replace(['penev92'], 'Pavel Penev')
df_commits['author_name'] = df_commits['author_name'].replace(['PiÃ«t Delport'], 'Pi Delport')
df_commits['author_name'] = df_commits['author_name'].replace(['pizzaoverhead'], 'Pizzaoverhead')
df_commits['author_name'] = df_commits['author_name'].replace(['Scott_NZ'], 'ScottNZ')
df_commits['author_name'] = df_commits['author_name'].replace(['unknown', 'Unknown'], '(no author)')
df_commits.head()

Unnamed: 0.1,Unnamed: 0,hash,message,author_name,author_date,in_main,is_merge,num_deletes,num_inserts,net_lines,num_files,branches,filename,old_path,new_path,project_name,project_path,parents
0,0,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,Blowfish.cs,,MixBrowser\Blowfish.cs,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
1,1,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MM.DAT,MixBrowser\MM.DAT,MixBrowser\MM.DAT,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
2,2,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MixBrowser.csproj,,MixBrowser\MixBrowser.csproj,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
3,3,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MixEntry.cs,,MixBrowser\MixEntry.cs,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,
4,4,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,Program.cs,,MixBrowser\Program.cs,OpenRA,C:\Users\MattE\AppData\Local\Temp\tmpsnu5cl6m\...,


We can see that *chrisf* now displays properly as *Chris Forbes*. We do still have some extra cleanup to do of columns that either don't make sense (such as `Unnamed: 0`) or data we won't be using.

We also want to rename the `new_path` column to `fullpath` to make things simpler for us in the future when we want to join between Pandas DataFrames.

Let's fix those columns now.

In [154]:
# Drop not needed columns
df_commits.drop('Unnamed: 0', axis=1, inplace=True)
df_commits.drop('project_name', axis=1, inplace=True)
df_commits.drop('project_path', axis=1, inplace=True)
df_commits.drop('old_path', axis=1, inplace=True)

# Rename the new_path column to fullpath for ease of merging DataFrames later
df_commits.rename(columns={'new_path': 'fullpath'})

df_commits.head()

Unnamed: 0,hash,message,author_name,author_date,in_main,is_merge,num_deletes,num_inserts,net_lines,num_files,branches,filename,new_path,parents
0,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,Blowfish.cs,MixBrowser\Blowfish.cs,
1,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MM.DAT,MixBrowser\MM.DAT,
2,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MixBrowser.csproj,MixBrowser\MixBrowser.csproj,
3,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,MixEntry.cs,MixBrowser\MixEntry.cs,
4,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,2007-06-19 08:51:17+00:00,True,False,0,1350,1350,11,bleed,Program.cs,MixBrowser\Program.cs,


Next let's add some more specific date-based columns and ensure we sort by commit date to make our lives easier in visualization later on.

In [155]:
# Engineer Date Columns
df_commits['datetime'] = pd.to_datetime(df_commits['author_date'], errors='coerce', utc=True)
df_commits['date'] = df_commits['datetime'].dt.date
df_commits['year'] = df_commits['datetime'].dt.year
df_commits['month'] = df_commits['datetime'].dt.month
df_commits['year-month'] = df_commits['datetime'].to_numpy().astype('datetime64[M]')
df_commits['weekday'] = df_commits['datetime'].dt.weekday
df_commits = df_commits.sort_values('date')

# We no longer need the raw author_date column
df_commits.drop('author_date', axis=1, inplace=True)

df_commits.head()

  df_commits['year-month'] = df_commits['datetime'].to_numpy().astype('datetime64[M]')


Unnamed: 0,hash,message,author_name,in_main,is_merge,num_deletes,num_inserts,net_lines,num_files,branches,filename,new_path,parents,datetime,date,year,month,year-month,weekday
0,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,True,False,0,1350,1350,11,bleed,Blowfish.cs,MixBrowser\Blowfish.cs,,2007-06-19 08:51:17+00:00,2007-06-19,2007,6,2007-06-01,1
18,765c0ac0673c10471f5b7b46a008eb78ffa143b2,git-svn-id: svn://svn.ijw.co.nz/svn/OpenRa@105...,Chris Forbes,True,False,13,3,-10,2,bleed,Program.cs,MixBrowser\Program.cs,3fdefe451aa8cba9b4057b07dd8ffc8fa2d90f85,2007-06-19 11:31:31+00:00,2007-06-19,2007,6,2007-06-01,1
17,765c0ac0673c10471f5b7b46a008eb78ffa143b2,git-svn-id: svn://svn.ijw.co.nz/svn/OpenRa@105...,Chris Forbes,True,False,13,3,-10,2,bleed,MixFile.cs,MixBrowser\MixFile.cs,3fdefe451aa8cba9b4057b07dd8ffc8fa2d90f85,2007-06-19 11:31:31+00:00,2007-06-19,2007,6,2007-06-01,1
16,3fdefe451aa8cba9b4057b07dd8ffc8fa2d90f85,git-svn-id: svn://svn.ijw.co.nz/svn/OpenRa@105...,Chris Forbes,True,False,2,18,16,2,bleed,Program.cs,MixBrowser\Program.cs,711a99a02215319659a790804e70ed34277346a5,2007-06-19 10:30:35+00:00,2007-06-19,2007,6,2007-06-01,1
15,3fdefe451aa8cba9b4057b07dd8ffc8fa2d90f85,git-svn-id: svn://svn.ijw.co.nz/svn/OpenRa@105...,Chris Forbes,True,False,2,18,16,2,bleed,MixFile.cs,MixBrowser\MixFile.cs,711a99a02215319659a790804e70ed34277346a5,2007-06-19 10:30:35+00:00,2007-06-19,2007,6,2007-06-01,1


## File Changes
Next, let's load a list of files in the file system from a pre-created `filesizes.csv` file. This file was created using python code in `FileAnalysis.ipynb` if you are curious about how it was generated, but the process was simple: Loop through a project's folder structure, note any files present, count the number of lines in each file, and output the full path and file size data to a CSV file.

In [156]:
df_files = pd.read_csv('filesizes.csv')
df_files.head()

Unnamed: 0.1,Unnamed: 0,fullpath,project,path,filename,ext,lines
0,0,OpenRA.Game/Activities/Activity.cs,OpenRA.Game,Activities,Activity.cs,.cs,291
1,1,OpenRA.Game/Activities/CallFunc.cs,OpenRA.Game,Activities,CallFunc.cs,.cs,33
2,2,OpenRA.Game/./Actor.cs,OpenRA.Game,.,Actor.cs,.cs,645
3,3,OpenRA.Game/./CPos.cs,OpenRA.Game,.,CPos.cs,.cs,150
4,4,OpenRA.Game/./CryptoUtil.cs,OpenRA.Game,.,CryptoUtil.cs,.cs,261


Okay, great. We have data, but there's a column that doesn't look helpful and we'll need to handle the issue of `/./` appearing in some `fullpath` columns for files that occur in the root of their projects.

In [157]:
# Remove the unwanted column
df_files.drop('Unnamed: 0', axis=1, inplace=True)

# This is a function we'll apply to each row of our DataFrame
def fix_file_path(row):
    if row['path'] == '.':
        row['fullpath'] = row['project'] + '\\' + row['filename']
    else:
        row['fullpath'] = row['project'] + '\\' + row['path'] + '\\' + row['filename']
    return row

# Apply the function to each row and update the DataFrame with the result
df_files = df_files.apply(fix_file_path, axis=1)

df_files.head()

Unnamed: 0,fullpath,project,path,filename,ext,lines
0,OpenRA.Game\Activities\Activity.cs,OpenRA.Game,Activities,Activity.cs,.cs,291
1,OpenRA.Game\Activities\CallFunc.cs,OpenRA.Game,Activities,CallFunc.cs,.cs,33
2,OpenRA.Game\Actor.cs,OpenRA.Game,.,Actor.cs,.cs,645
3,OpenRA.Game\CPos.cs,OpenRA.Game,.,CPos.cs,.cs,150
4,OpenRA.Game\CryptoUtil.cs,OpenRA.Game,.,CryptoUtil.cs,.cs,261


That's now fixed and we now have working data for files, issues, releases, and commits.

# Statistical Analysis

Now that we have data, let's use some statistical techniques to determine if we can spot anything unusual about our data.

## Releases

In [158]:
df_releases.describe(datetime_is_numeric=True, include='all')

Unnamed: 0,ID,Tag,Name,Date
count,30,30,30,30
unique,30,3,30,
top,release-20210321,Pre-release,release-20210321,
freq,1,20,1,
mean,,,,2019-05-16 23:40:25.633333248+00:00
min,,,,2017-09-23 19:22:58+00:00
25%,,,,2018-08-05 07:16:37.750000128+00:00
50%,,,,2019-06-04 18:09:54+00:00
75%,,,,2020-03-03 08:08:01.500000+00:00
max,,,,2021-03-21 11:20:15+00:00


Okay, that's not a lot of data there, for releases, but it does tell us that there were 30 total releases with the first release being in September of 2017 and the latest release was in March of 2021.

## Issues

In [159]:
df_issues.describe(datetime_is_numeric=True, include='all')

Unnamed: 0,ID,Status,Message,Labels,Date
count,8607.0,8607,8607,8607,8607
unique,,2,8548,701,
top,,CLOSED,My game crashed,Bug,
freq,,7086,19,957,
mean,10705.062043,,,,2017-04-17 05:40:05.329615616+00:00
min,2001.0,,,,2012-04-05 08:47:53+00:00
25%,5947.5,,,,2015-03-07 09:01:16.500000+00:00
50%,10880.0,,,,2017-01-30 22:01:38+00:00
75%,15241.5,,,,2019-07-21 13:53:25+00:00
max,19857.0,,,,2021-12-29 20:46:08+00:00


Alright, so there've been 8607 issues in the project between April of 2012 and the end of 2021 (when this data was pulled), showing the project is still very active. Looking at the percentiles, it does appear that the issues are roughly evenly distributed between the years as well.

## Commits

In [160]:
df_commits.describe(datetime_is_numeric=True, include='all')

Unnamed: 0,hash,message,author_name,in_main,is_merge,num_deletes,num_inserts,net_lines,num_files,branches,filename,new_path,parents,datetime,date,year,month,year-month,weekday
count,109022,109022,109022,109022,109022,109022.0,109022.0,109022.0,109022.0,109022,109022,106939,109011,109022,109022,109022.0,109022.0,109022,109022.0
unique,23663,23300,386,1,1,,,,,1,5991,10076,21072,,3636,,,,
top,6810469634d43a7a3e8ab2664942e162c3f4436a,Updated copyright years.,Paul Chote,True,False,,,,,bleed,map.yaml,OpenRA.Mods.Common\OpenRA.Mods.Common.csproj,5a7a09a6a712c5c3861d8ba0328837c746580c4e,,2016-02-21,,,,
freq,1480,1480,33621,109022,109022,,,,,109022,3035,634,1480,,1984,,,,
mean,,,,,,715.189503,713.422539,-1.766964,146.740695,,,,,2014-12-20 13:22:48.098649600+00:00,,2014.515125,5.970868,2014-12-05 14:45:08.634954496,3.535479
min,,,,,,0.0,0.0,-26898.0,1.0,,,,,2007-06-19 08:51:17+00:00,,2007.0,1.0,2007-06-01 00:00:00,0.0
25%,,,,,,6.0,14.0,0.0,4.0,,,,,2011-08-16 06:42:05+00:00,,2011.0,3.0,2011-08-01 00:00:00,2.0
50%,,,,,,38.0,63.0,2.0,11.0,,,,,2014-12-25 22:40:40+00:00,,2014.0,6.0,2014-12-01 00:00:00,4.0
75%,,,,,,237.0,292.0,34.0,56.0,,,,,2017-06-28 21:54:39+00:00,,2017.0,9.0,2017-06-01 00:00:00,5.0
max,,,,,,43617.0,43618.0,26898.0,1480.0,,,,,2021-12-05 16:22:22+00:00,,2021.0,12.0,2021-12-01 00:00:00,6.0


That's a lot more data and not all of it makes sense to look at in this view, but it does tell us that `in_main`, `is_merge`, and `branches` appear to only have one unique value, meaning those columns won't tell us much.

Before we go on, let's drop those columns.

In [161]:
df_commits.drop('in_main', axis=1, inplace=True)
df_commits.drop('is_merge', axis=1, inplace=True)
df_commits.drop('branches', axis=1, inplace=True)
df_commits.head(1) # Only show the first row

Unnamed: 0,hash,message,author_name,num_deletes,num_inserts,net_lines,num_files,filename,new_path,parents,datetime,date,year,month,year-month,weekday
0,b59ba43934a3a6837410db51cf60157cf854e52d,openra first commit!\r\n\r\ngit-svn-id: svn://...,Chris Forbes,0,1350,1350,11,Blowfish.cs,MixBrowser\Blowfish.cs,,2007-06-19 08:51:17+00:00,2007-06-19,2007,6,2007-06-01,1


It also tells us that there are over 100,000 commits in this dataset, so visualizing some of this data may be a challenge given the scale.

Looking at the date values, it looks like the project was first active in October of 2011 and the date of peak activity seems to be near the end of 2014.

## Files

In [162]:
df_files.describe(datetime_is_numeric=True, include='all')

Unnamed: 0,fullpath,project,path,filename,ext,lines
count,1362,1362,1362,1362,1362,1362.0
unique,1362,10,75,1347,1,
top,OpenRA.Game\Activities\Activity.cs,OpenRA.Mods.Common,Traits,Program.cs,.cs,
freq,1,968,155,5,1362,
mean,,,,,,136.984581
std,,,,,,146.008808
min,,,,,,18.0
25%,,,,,,56.25
50%,,,,,,88.0
75%,,,,,,160.0


Okay, so we're working with 1362 distinct files that range in size from 18 lines to 1411 lines long. The data is limited to just C# files (`.cs`) and the average file size is 136 lines of code, which seems reasonably healthy at a glance.

# Exploratory Data Analysis

Now that we've got our data and a rough idea of the distribution of that data, let's start visualizing it.

There are many different data visualization libraries in Python, but my favorite is Plotly since it gives us very interactive visuals, ease of use through Plotly express, a variety of themes, and power to customize with Graph Objects. It also integrates nicely with Dash if we wanted to make an interactive dashboard.

Let's import Plotly now and set up some styling settings.

In [164]:
import plotly.express as px

theme_discrete = px.colors.qualitative.Prism
theme_diverging_neutral = px.colors.diverging.RdYlBu
theme_diverging = px.colors.diverging.Picnic_r
theme_diverging_r = px.colors.diverging.Picnic
theme_sequential = px.colors.sequential.Agsunset
theme_continuous= px.colors.diverging.balance

template = 'plotly_white'

Now with Plotly, let's start exploring our data.