# Dataset analysis

In [None]:
import pandas as pd

df = pd.read_csv('commits.csv')
df

## Statistics
A few quick commands to check whether we have enough data to build the developers' corpus.

In [None]:
# Check number of changed methods by author (before alias analysis)
df.groupby(['author'])['changed_methods'].sum().sort_values(ascending=False)

In [None]:
# Check number of added lines by author (before alias analysis)
df.groupby(['author'])['added_line'].sum().sort_values(ascending=False)
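The two checks above can also be combined into a single per-author summary. The following is a minimal sketch on a made-up toy frame (the values and the `per_author` name are illustrative assumptions, not part of the dataset); it uses pandas named aggregation to report commit count, changed methods and added lines side by side.

```python
import pandas as pd

# Toy stand-in for commits.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "author": ["alice", "bob", "alice", "carol"],
    "changed_methods": [3, 1, 2, 5],
    "added_line": [40, 10, 25, 60],
})

# One row per author: number of commits plus the two totals used above.
per_author = (
    toy.groupby("author")
       .agg(
           commits=("changed_methods", "count"),
           changed_methods=("changed_methods", "sum"),
           added_lines=("added_line", "sum"),
       )
       .sort_values("added_lines", ascending=False)
)
print(per_author)
```

Seeing the three figures together makes it easier to spot authors with many commits but little code, who may not contribute enough material to a corpus.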

## Pre-processing
In this step, we remove null values and bot users, and group aliases that likely belong to the same developer.

### Dataset inspection

In [None]:
df.info()

In [None]:
# Remove rows containing null values
df = df.dropna()
df.info()
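Before dropping rows, it can help to see which columns actually hold the missing values. This is a minimal sketch on an invented toy frame (column values are assumptions for illustration); it counts nulls per column and then reports how many rows survive `dropna()`.

```python
import pandas as pd

# Toy stand-in for the commits table; values are invented for illustration.
toy = pd.DataFrame({
    "author": ["alice", "bob", None, "carol"],
    "email": ["a@x.org", None, "c@x.org", "d@x.org"],
    "changed_files": [3, 1, 2, 5],
})

# Count missing values per column before deciding what to drop.
null_counts = toy.isna().sum()
print(null_counts)

# dropna() with no arguments drops every row that has at least one null.
cleaned = toy.dropna()
print(f"kept {len(cleaned)} of {len(toy)} rows")
```

If most nulls sit in a column that later steps do not use, `dropna(subset=[...])` would discard fewer commits; the notebook drops any row with a null.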

### Remove bots

Remove all authors whose name contains the '[bot]' substring.

In [None]:
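# Match '[bot]' as a literal substring (regex=False avoids escaping the brackets)
df = df[~df["author"].str.contains("[bot]", regex=False)]

In [None]:
# Sanity check: no '[bot]' authors should remain
df[df["author"].str.contains("[bot]", regex=False)]["author"].unique()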

In [None]:
df[df["author"].str.contains("GitHub")]["author"].unique()

In [None]:
# Drop the authors whose name contains 'GitHub'
df = df[~df["author"].str.contains("GitHub")]

In [None]:
df[df["author"].str.contains("GitHub")]["author"].unique()
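The filtering above depends on `[bot]` being matched as a literal substring. As a minimal sketch on invented author names (all of them assumptions for illustration), the snippet below contrasts a literal match with an unescaped regular expression, where `[bot]` is a character class that matches any single `b`, `o` or `t`.

```python
import pandas as pd

# Invented author names, for illustration only.
authors = pd.DataFrame({
    "author": ["alice", "dependabot[bot]", "GitHub", "bob", "renovate[bot]"],
})

# regex=False treats "[bot]" as a literal substring.
is_bot = authors["author"].str.contains("[bot]", regex=False)
is_github = authors["author"].str.contains("GitHub", regex=False)

# Unescaped, "[bot]" is a character class and also flags "bob" and "GitHub".
too_greedy = authors["author"].str.contains("[bot]")

humans = authors[~(is_bot | is_github)]
print(humans["author"].tolist())
print(authors[too_greedy]["author"].tolist())
```

Escaping the brackets in a raw string (`r"\[bot\]"`) gives the same result as `regex=False`.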

## Remove outliers
Remove all commits whose number of modified files exceeds the third quartile plus 1.5 times the inter-quartile range (Q3 + 1.5 × IQR).

In [None]:
# Calculate the inter-quartile range
Q1, Q3 = df['changed_files'].quantile(0.25), df['changed_files'].quantile(0.75)
IQR = Q3 - Q1
print(f'Q1 = {Q1}, Q3 = {Q3}, IQR = {IQR}')

In [None]:
# Remove all commits that modify more than Q3 + 1.5 * IQR files
threshold = Q3 + 1.5 * IQR
df = df.query('changed_files <= @threshold')
df
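To make the threshold concrete, here is a minimal sketch of the same upper-fence rule on an invented list of file counts (the numbers and the names `toy`, `upper_fence` and `kept` are assumptions for illustration). It uses pandas' default linear interpolation for the quantiles.

```python
import pandas as pd

# Invented file counts; 40 plays the role of an unusually large commit.
toy = pd.DataFrame({"changed_files": [1, 2, 2, 3, 3, 4, 5, 40]})

q1 = toy["changed_files"].quantile(0.25)
q3 = toy["changed_files"].quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Keep rows at or below the fence; values strictly above it count as outliers.
kept = toy[toy["changed_files"] <= upper_fence]
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}, fence = {upper_fence}")
print(kept["changed_files"].tolist())
```

Only the upper fence is applied here, since a commit cannot modify a negative number of files and a lower fence would rarely exclude anything.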

## Alias disambiguation with gambit

In this phase, we cluster together the instances that most likely belong to the same developer. Since we only have the developer's name and email, we rely on "gambit", a disambiguation tool presented by Gote and Zingg in "gambit – An Open Source Name Disambiguation Tool for Version Control Systems".

In [None]:
!pip install gambit-disambig

Here we transform our data into the format expected by the gambit library.

In [None]:
# Keep one row per (author, email) pair and rename the columns for gambit
aliases_df = df[['author', 'email']].drop_duplicates()
aliases_df = aliases_df.rename(columns={'author': 'alias_name', 'email': 'alias_email'})
aliases_df

In [None]:
import gambit

disamb_df = gambit.disambiguate_aliases(aliases_df)
disamb_df

Export the alias mapping to a separate CSV file.

In [None]:
# index=False avoids an extra 'Unnamed: 0' column when the file is read back
disamb_df.to_csv('disamb.csv', index=False)

This way, the DataFrame can be reloaded later without re-running the disambiguation.

In [None]:
disamb_df = pd.read_csv('disamb.csv')
disamb_df
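A detail worth knowing about the CSV round trip: when a DataFrame is written with its index, `read_csv` brings that index back as an extra unnamed column. The sketch below shows this on an invented mapping (the rows are assumptions for illustration, not gambit output) and uses an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Invented alias mapping, for illustration only.
mapping = pd.DataFrame({
    "alias_name": ["Alice", "alice"],
    "alias_email": ["alice@x.org", "alice@x.org"],
    "author_id": [0, 0],
})

# Written with the default index, the reloaded frame gains a leading column.
with_index = pd.read_csv(io.StringIO(mapping.to_csv()))

# Written with index=False, only the original columns come back.
without_index = pd.read_csv(io.StringIO(mapping.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

The extra column is harmless for the merge that follows, because only `alias_name`, `alias_email` and `author_id` are selected.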

Here we map the generated 'author_id' (a unique identifier used to distinguish developers) onto our dataset.

In [None]:
# Attach the author_id to every commit by matching on the (name, email) pair
merged_df = df.merge(
    disamb_df[['alias_name', 'alias_email', 'author_id']],
    left_on=['author', 'email'],
    right_on=['alias_name', 'alias_email'],
    how='inner',
)

In [None]:
# The alias columns duplicate 'author' and 'email', so drop them
merged_df = merged_df.drop(columns=['alias_name', 'alias_email'])
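An inner join silently drops any commit whose (author, email) pair has no row in the alias table. As a minimal sketch on invented data (the `commits` and `aliases` frames and their values are assumptions for illustration), a left join with `indicator=True` lists such commits, and `validate="many_to_one"` raises an error if an alias pair appears more than once in the mapping.

```python
import pandas as pd

# Invented commits and alias mapping, for illustration only.
commits = pd.DataFrame({
    "author": ["Alice", "alice", "Bob", "Eve"],
    "email": ["alice@x.org", "alice@x.org", "bob@y.org", "eve@z.org"],
    "changed_files": [2, 1, 4, 3],
})
aliases = pd.DataFrame({
    "alias_name": ["Alice", "alice", "Bob"],
    "alias_email": ["alice@x.org", "alice@x.org", "bob@y.org"],
    "author_id": [0, 0, 1],
})

checked = commits.merge(
    aliases,
    left_on=["author", "email"],
    right_on=["alias_name", "alias_email"],
    how="left",
    indicator=True,            # adds a '_merge' column with the match status
    validate="many_to_one",    # each alias pair must be unique in the mapping
)

# Commits that found no author_id in the mapping.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["author", "email"]])
```

If `unmatched` is empty, the inner join used in the notebook keeps every commit.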

Save the merged DataFrame (containing the author_id) to a separate file.

In [None]:
merged_df.to_csv('commits_with_authors.csv', index=False)