When describing the data, in particular, you should show (non-exhaustive list):

    That you can handle the data in its size.
    That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
    That you considered ways to enrich, filter, transform the data according to your needs.
    That you have updated your plan in a reasonable way, reflecting your improved knowledge after data acquaintance. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.
    That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.


# Milestone 2: Analyzing Success

In [1]:
# Imports.
import pandas as pd
import numpy as np
import os
# Spark doesn't support reading XML files natively, so we use spark-xml
# (source: <https://github.com/databricks/spark-xml/>)
# Note that we're using spark-xml 0.4.2 as that fixes <https://github.com/databricks/spark-xml/issues/92>,
# which is necessary to read our dataset. (0.4.2 isn't released yet, which is why we compiled it ourselves.)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-xml_2.11-0.4.2.jar pyspark-shell'
from pyspark.sql import SparkSession, SQLContext

# Paths.
DATASETS_DIR = './data/datasets'

# 1. Data Retrieval 

The project datasets were retrieved from the following sources:

- [Gitential Datasets for Open Source Projects (retrieved in January 2018)](https://github.com/gitential/datasets) (2.31 G): there is no unified dataset provided by Gitential (each repo's dataset is separate). We used a mixed Jupyter Notebook to extract all the datasets' links with a shell one-liner, and downloaded them with Python.

- [GitHub API](https://developer.github.com/v3/) (? G): to augment our datasets, we're also using this API (with the [PyGithub Python library](https://github.com/PyGithub/PyGithub)) to obtain additional information about the repos we're interested in. We use the API to get the number of stars, forks and stargazers of a project. We also use the API to get each project's issues and each issue's comments (where applicable, since projects on GitHub can choose not to have an issues tracker).  
**<span style="color:green">(For implementation details, see `retrieve_additional_data_github.ipynb`)</span>**.

- [StackOverflow Posts data dump](https://archive.org/details/stackexchange) (62 G): we asked for this dataset to be downloaded on the EPFL cluster.

- [Reddit comments from 2005-12 to 2017-03](http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b) (304 G): this dataset is available on the EPFL cluster.

# 2. Data Loading

### <span style='color:green'> 2.1 - GitHub API data (Issues & Comments) </span>

In [43]:
# load issues data into dataframe 
issues_df = pd.read_csv('./data/github_issues.csv', index_col=0)

# display a sample
display(issues_df.head(3))

Unnamed: 0,body,closed_at,comments,created_at,html_url,number,state,title,updated_at,closed_by,user,assignee,assignees,labels,milestone,pull_request
341328385,- fix bug #19992\r\n- 2 tests amended in frame...,2018-07-18 10:23:30,3,2018-07-15 15:18:30,https://github.com/pandas-dev/pandas/pull/21921,21921,closed,BUG:Clip with a list-like threshold with a nan...,2018-07-18 10:23:47,jreback,makbigc,,,"['Bug', 'Missing-data']",0.23.4,https://github.com/pandas-dev/pandas/pull/21921
341342552,- [x] closes #21792\r\n- [ ] tests added / pas...,,10,2018-07-15 18:49:01,https://github.com/pandas-dev/pandas/pull/21922,21922,open,Concatenation of series of differing types sho...,2018-11-21 15:42:35,,xhochy,,,"['Bug', 'ExtensionArray']",,https://github.com/pandas-dev/pandas/pull/21922
341349059,"May close #21905, will need to check with OP.\r\n",2018-07-17 00:37:13,15,2018-07-15 20:16:15,https://github.com/pandas-dev/pandas/pull/21923,21923,closed,[BUG] change types to Py_ssize_t to fix #21905,2018-07-17 01:02:49,jreback,jbrockmendel,,,"['32bit', 'Bug']",0.24.0,https://github.com/pandas-dev/pandas/pull/21923


In [27]:
# load the comments data into a dataframe
comments_df = pd.read_csv('./data/github_comments.csv', index_col=0)

# display a sample
comments_df.head(5)

Unnamed: 0,body,created_at,updated_at,parent
142689649,It seems OK to me. I assume we still have at l...,2015-09-23 18:28:51,2015-09-23 18:28:51,107977847
142690747,@srowen So this changes it so that all of the ...,2015-09-23 18:33:36,2015-09-23 18:33:36,107977847
142699766,[Test build #42915 has finished](https://amp...,2015-09-23 19:07:00,2015-09-23 19:07:00,107977847
142701746,[Test build #42916 has finished](https://amp...,2015-09-23 19:16:05,2015-09-23 19:16:05,107977847
142733640,cc'ing a few people: @mccheah (who wrote the o...,2015-09-23 21:24:23,2015-09-23 21:24:23,108009077


**Note: this is only a sample of the data we want to retrieve. Therefore, we cannot perform correlation/distribution analysis yet (since we need the full data for that).**

(This is because the GitHub API, which we're using to build this additional dataset, is limited to 5,000 requests per hour. We're retrieving around 300,000 issues in total, spread across all the repositories we're studying, as well as all of their comments, which amounts to between half a million and a million requests. We expect this to finish around Monday. For implementation details, please check the `retrieve_additional_data_github.ipynb` notebook.)
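As a rough sanity check on that timeline, the sketch below turns the request counts quoted above into a lower bound on retrieval time. The helper name `retrieval_time_days` is hypothetical, and the estimate assumes the limit is fully used every hour with no retries or pauses.

```python
# Rough runtime estimate for the GitHub API retrieval.
RATE_LIMIT_PER_HOUR = 5000  # hourly request limit quoted above


def retrieval_time_days(n_requests, rate_per_hour=RATE_LIMIT_PER_HOUR):
    """Lower bound on wall-clock days needed for n_requests under the rate limit."""
    hours = n_requests / rate_per_hour
    return hours / 24


low = retrieval_time_days(500_000)     # optimistic request count
high = retrieval_time_days(1_000_000)  # pessimistic request count
print('between {:.1f} and {:.1f} days'.format(low, high))
```

Under these assumptions the retrieval takes several days of continuous requests, which is why only a sample is available for this milestone.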

### <span style='color:green'> 2.2 - Projects Data (commits and tags) </span>

In [11]:
# proper project name couldn't be identified for these 5 cases
# this was done manually to solve the issue
AMBIGUOUS_NAMES = {
    'apache-incubator-superset': 'apache-incubator/superset',
    'keras-team-keras': 'keras-team/keras',
    'pandas-dev-pandas': 'pandas-dev/pandas',
    'rust-lang-rust': 'rust-lang/rust',
    'scikit-learn-scikit-learn': 'scikit-learn/scikit-learn'
}
DIR_GITHUB_MAPPING = {}

# assign each directory with a github path
for dir_name in os.listdir(DATASETS_DIR):
    if dir_name in AMBIGUOUS_NAMES:
        github_path = AMBIGUOUS_NAMES[dir_name]
    else:
        github_path = dir_name.replace('-', '/')
    DIR_GITHUB_MAPPING[dir_name] = github_path
# drop the macOS metadata entry if present (pop avoids a KeyError when it is absent)
DIR_GITHUB_MAPPING.pop('.DS_Store', None)

In [12]:
# load commits into a data frame
commits = {}
for dir_name, github_path in DIR_GITHUB_MAPPING.items():
    commits[github_path] = pd.read_json('{}/{}/commits.json.gz'.format(DATASETS_DIR, dir_name))
commits_df = pd.concat(commits, names=['project'])
commits_df = commits_df.reset_index(level='project').reset_index(drop=True)

In [17]:
# display a sample
commits_df.head(5)

Unnamed: 0,project,age,author_email,author_email_dedup,author_name,author_name_dedup,author_time,committer_email,committer_email_dedup,committer_name,...,comp_i,delay,id,ismerge,loc_d,loc_i,message,ndiffs,nfiles,squashof
0,Microsoft/CNTK,-1,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,Yinggong ZHAO,2014-07-29 10:12:20,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,...,0,0,bc9b0d6b0aebc469b2f84664de590b59d6fdf79f,False,0,0,test\n,1,1,-1
1,Microsoft/CNTK,-1,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,Yinggong ZHAO,2014-08-29 16:21:42,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,...,248008,0,61694509551f38e031c74f3d9409b44fe50224cf,False,0,139349,First Release of CNTK\n,1,492,-1
2,Microsoft/CNTK,-1,jd*****@microsoft.com,jd*****@microsoft.com,Jasha Droppo,Jasha Droppo,2014-08-31 12:27:42,jd*****@microsoft.com,jd*****@microsoft.com,Jasha Droppo,...,0,0,9515bfbd104a5ba4f4214e2d883e8e3af2acd01c,False,0,0,Added the ASR/TIMIT/decoding to ExampleSetups ...,1,6,-1
3,Microsoft/CNTK,-1,do****@microsoft.com,do****@microsoft.com,Dong Yu,Dong Yu,2014-09-01 14:43:21,do****@microsoft.com,do****@microsoft.com,Dong Yu,...,0,0,52eabc6e8852b6a8342ae304a606663f7f8ae15f,False,1,0,remove #include SimpleCNNBuilder.h\n,1,1,-1
4,Microsoft/CNTK,-1,do****@microsoft.com,do****@microsoft.com,Dong Yu,Dong Yu,2014-09-02 17:16:40,do****@microsoft.com,do****@microsoft.com,Dong Yu,...,3,0,f5a490c2afbffd515a9ddfbe3053e76bb9cbfe17,False,1,1,"remove "";"" from ""if (pass == ndlPassInitial);""...",1,1,-1


In [18]:
# load tags into a data frame
tags = {}
for dir_name, github_path in DIR_GITHUB_MAPPING.items():
    tags[github_path] = pd.read_json('{}/{}/tags.json.gz'.format(DATASETS_DIR, dir_name))
tags_df = pd.concat(tags, names=['project'])
tags_df = tags_df.reset_index(level='project').reset_index(drop=True)

In [24]:
# set id as the index
tags_df = tags_df.set_index('id')
# display a sample
tags_df.head(4)

Unnamed: 0_level_0,project,author_time,message,name,type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a0a32466a95c7f907ebb66e7f879cc314ec1506f,Microsoft/CNTK,2016-01-22 10:15:34,,refs/tags/2015-12-08,1.0
35cb5738e7ef794177a2fff06892a39700722dee,Microsoft/CNTK,2016-06-14 18:29:56,,refs/tags/feature/CNTKCustomMKL,1.0
56a2a15f64676ea4c0e0a0a681a57b19a46f64c6,Microsoft/CNTK,2016-01-25 20:53:43,Release CNTK Beta (Windows+Linux) 2016-01-26\n,refs/tags/r2016-01-26,1.0
2f9a48c71dc0a6097498cb7e90ac3b151ab536dd,Microsoft/CNTK,2016-02-05 11:06:20,Release CNTK Beta (Windows+Linux) 2016-02-08\n,refs/tags/r2016-02-08,1.0


### <span style='color:green'> 2.3 - StackOverflow data</span>

In [3]:
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)

# Read the data.
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='row').load('./data/Posts.xml')

KeyboardInterrupt: 

In [None]:
# Convert the _CreationDate column to timestamp format.
from pyspark.sql.functions import to_timestamp
df = df.withColumn('CreationDate', to_timestamp(df._CreationDate))
df = df.drop('_CreationDate').withColumnRenamed('CreationDate', '_CreationDate')

# Display some of the data.
pd.DataFrame(df.take(5), columns=df.columns)

### <span style='color:green'> 2.4 - Reddit data</span>

In [None]:
# TODO (Germain): load the Reddit data

# 3. Data Inspection

### <span style='color:green'> 3.1 - GitHub API data (Issues & Comments) </span>

In [29]:
display(issues_df.head(3))

Unnamed: 0,body,closed_at,comments,created_at,html_url,number,state,title,updated_at,closed_by,user,assignee,assignees,labels,milestone,pull_request
341328385,- fix bug #19992\r\n- 2 tests amended in frame...,2018-07-18 10:23:30,3,2018-07-15 15:18:30,https://github.com/pandas-dev/pandas/pull/21921,21921,closed,BUG:Clip with a list-like threshold with a nan...,2018-07-18 10:23:47,jreback,makbigc,,,"['Bug', 'Missing-data']",0.23.4,https://github.com/pandas-dev/pandas/pull/21921
341342552,- [x] closes #21792\r\n- [ ] tests added / pas...,,10,2018-07-15 18:49:01,https://github.com/pandas-dev/pandas/pull/21922,21922,open,Concatenation of series of differing types sho...,2018-11-21 15:42:35,,xhochy,,,"['Bug', 'ExtensionArray']",,https://github.com/pandas-dev/pandas/pull/21922
341349059,"May close #21905, will need to check with OP.\r\n",2018-07-17 00:37:13,15,2018-07-15 20:16:15,https://github.com/pandas-dev/pandas/pull/21923,21923,closed,[BUG] change types to Py_ssize_t to fix #21905,2018-07-17 01:02:49,jreback,jbrockmendel,,,"['32bit', 'Bug']",0.24.0,https://github.com/pandas-dev/pandas/pull/21923


The columns 'updated_at', 'number' and 'assignee' were dropped because they do not provide any significant information for our analysis:
- **'updated_at'**: we are more interested in the time it took for the issue to be closed than in when it was last updated, because the reason for an update is not clear.
- **'number'** (repo-specific sequential ID): we already have the global tracking number (used as the index of the data), so this column does not provide any further information.
- **'assignee'**: redundant, since the information is already present in the 'assignees' column.
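Since the time it takes to close an issue is the quantity of interest, here is a minimal sketch of how it could be derived from `created_at` and `closed_at`. The toy rows reuse timestamps from the sample above; the `time_to_close` column name is an assumption for illustration, not part of the dataset.

```python
import pandas as pd

# Toy rows shaped like issues_df (timestamps copied from the sample above).
toy_issues = pd.DataFrame({
    'created_at': ['2018-07-15 15:18:30', '2018-07-15 18:49:01'],
    'closed_at': ['2018-07-18 10:23:30', None],
})

created = pd.to_datetime(toy_issues['created_at'])
closed = pd.to_datetime(toy_issues['closed_at'])

# Open issues have no closing time, so their duration stays missing (NaT).
toy_issues['time_to_close'] = closed - created
print(toy_issues['time_to_close'])
```

Keeping open issues as missing values, rather than filling them in, lets later aggregations skip them explicitly.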

In [44]:
# remove unneeded columns 'updated_at', 'number' and 'assignee'
issues_df = issues_df.drop(['updated_at', 'number','assignee'], axis=1)

# display changed data
issues_df.head(5)

Unnamed: 0,body,closed_at,comments,created_at,html_url,state,title,closed_by,user,assignees,labels,milestone,pull_request
341328385,- fix bug #19992\r\n- 2 tests amended in frame...,2018-07-18 10:23:30,3,2018-07-15 15:18:30,https://github.com/pandas-dev/pandas/pull/21921,closed,BUG:Clip with a list-like threshold with a nan...,jreback,makbigc,,"['Bug', 'Missing-data']",0.23.4,https://github.com/pandas-dev/pandas/pull/21921
341342552,- [x] closes #21792\r\n- [ ] tests added / pas...,,10,2018-07-15 18:49:01,https://github.com/pandas-dev/pandas/pull/21922,open,Concatenation of series of differing types sho...,,xhochy,,"['Bug', 'ExtensionArray']",,https://github.com/pandas-dev/pandas/pull/21922
341349059,"May close #21905, will need to check with OP.\r\n",2018-07-17 00:37:13,15,2018-07-15 20:16:15,https://github.com/pandas-dev/pandas/pull/21923,closed,[BUG] change types to Py_ssize_t to fix #21905,jreback,jbrockmendel,,"['32bit', 'Bug']",0.24.0,https://github.com/pandas-dev/pandas/pull/21923
341349603,- [ ] <s>closes #16045</s><b>update</b>Not any...,2018-09-08 02:46:54,6,2018-07-15 20:24:21,https://github.com/pandas-dev/pandas/pull/21924,closed,move rename functionality out of internals,jreback,jbrockmendel,,"['Internals', 'Refactor']",0.24.0,https://github.com/pandas-dev/pandas/pull/21924
341355270,"Hi,\r\n\r\nThe `corr` method for DataFrames is...",,2,2018-07-15 21:50:46,https://github.com/pandas-dev/pandas/issues/21925,open,Allow different methods of correlation when us...,,dsaxton,,"['Apply', 'Enhancement']",Contributions Welcome,


**NaN value handling:**

In [46]:
# get the number of Nan values for each column
issues_df.isnull().sum()

body              17
closed_at        594
comments           0
created_at         0
html_url           0
state              0
title              0
closed_by        575
user               0
assignees       1958
labels           193
milestone        850
pull_request     974
dtype: int64

- **body:** an issue doesn't need a body; the problem can be explained in the title alone, as in this [example](https://github.com/pandas-dev/pandas/pull/22038).
- **closed_at:** an open issue has no closing time, so the value is NaN, as in this [example](https://github.com/pandas-dev/pandas/pull/21922).
- **closed_by:** an issue that is not closed has no `closed_by` attribute. Note that there are fewer `closed_by` NaNs than `closed_at` NaNs; this can be explained by issues that were closed and then reopened, as in [this reopened issue](https://github.com/pandas-dev/pandas/issues/22116).
- **assignees, labels and milestone** are optional fields of an issue, so NaNs are expected.
- **pull_request:** an entry without a `pull_request` value is a regular GitHub issue; otherwise it is a pull request (the GitHub API does not separate the two).
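The rules above can be turned into simple checks. The following is a minimal sketch on toy rows shaped like `issues_df`; the `is_pull_request` column and the `all_open_unclosed` variable are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like issues_df.
toy = pd.DataFrame({
    'state': ['closed', 'open', 'open'],
    'closed_at': ['2018-07-18 10:23:30', np.nan, np.nan],
    'pull_request': ['https://github.com/pandas-dev/pandas/pull/21921',
                     'https://github.com/pandas-dev/pandas/pull/21922',
                     np.nan],
})

# A missing pull_request URL marks a plain issue; anything else is a pull request.
toy['is_pull_request'] = toy['pull_request'].notnull()

# Sanity check: every open entry should lack a closing time.
open_rows = toy[toy['state'] == 'open']
all_open_unclosed = open_rows['closed_at'].isnull().all()
print(toy[['state', 'is_pull_request']], all_open_unclosed)
```

Running the same check on the full data would flag any open entry that unexpectedly carries a `closed_at` value.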

In [50]:
# display a sample
comments_df.head(5)

Unnamed: 0,body,created_at,updated_at,parent
142689649,It seems OK to me. I assume we still have at l...,2015-09-23 18:28:51,2015-09-23 18:28:51,107977847
142690747,@srowen So this changes it so that all of the ...,2015-09-23 18:33:36,2015-09-23 18:33:36,107977847
142699766,[Test build #42915 has finished](https://amp...,2015-09-23 19:07:00,2015-09-23 19:07:00,107977847
142701746,[Test build #42916 has finished](https://amp...,2015-09-23 19:16:05,2015-09-23 19:16:05,107977847
142733640,cc'ing a few people: @mccheah (who wrote the o...,2015-09-23 21:24:23,2015-09-23 21:24:23,108009077


- No columns will be dropped because all of them provide relevant information for later analysis.

In [53]:
# get the number of Nan values for each column
comments_df.isnull().sum()

body          0
created_at    0
updated_at    0
parent        0
dtype: int64

- No null values are present in the issue comments data.

### <span style='color:green'> 3.2 - Projects data </span>

In [56]:
# use the id column as the index
commits_df = commits_df.set_index('id')
# display a sample
commits_df.head(5)

Unnamed: 0_level_0,project,age,author_email,author_email_dedup,author_name,author_name_dedup,author_time,committer_email,committer_email_dedup,committer_name,...,comp_d,comp_i,delay,ismerge,loc_d,loc_i,message,ndiffs,nfiles,squashof
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
bc9b0d6b0aebc469b2f84664de590b59d6fdf79f,Microsoft/CNTK,-1,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,Yinggong ZHAO,2014-07-29 10:12:20,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,...,0,0,0,False,0,0,test\n,1,1,-1
61694509551f38e031c74f3d9409b44fe50224cf,Microsoft/CNTK,-1,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,Yinggong ZHAO,2014-08-29 16:21:42,do****@stggpu1.redmond.corp.microsoft.com,al*****@microsoft.com,unknown,...,0,248008,0,False,0,139349,First Release of CNTK\n,1,492,-1
9515bfbd104a5ba4f4214e2d883e8e3af2acd01c,Microsoft/CNTK,-1,jd*****@microsoft.com,jd*****@microsoft.com,Jasha Droppo,Jasha Droppo,2014-08-31 12:27:42,jd*****@microsoft.com,jd*****@microsoft.com,Jasha Droppo,...,0,0,0,False,0,0,Added the ASR/TIMIT/decoding to ExampleSetups ...,1,6,-1
52eabc6e8852b6a8342ae304a606663f7f8ae15f,Microsoft/CNTK,-1,do****@microsoft.com,do****@microsoft.com,Dong Yu,Dong Yu,2014-09-01 14:43:21,do****@microsoft.com,do****@microsoft.com,Dong Yu,...,0,0,0,False,1,0,remove #include SimpleCNNBuilder.h\n,1,1,-1
f5a490c2afbffd515a9ddfbe3053e76bb9cbfe17,Microsoft/CNTK,-1,do****@microsoft.com,do****@microsoft.com,Dong Yu,Dong Yu,2014-09-02 17:16:40,do****@microsoft.com,do****@microsoft.com,Dong Yu,...,3,3,0,False,1,1,"remove "";"" from ""if (pass == ndlPassInitial);""...",1,1,-1


- None of the columns will be dropped because all of them are needed for later analysis steps.

In [57]:
# get the number of Nan values for each column
commits_df.isnull().sum()

project                      0
age                          0
author_email                 0
author_email_dedup           0
author_name                  0
author_name_dedup            0
author_time                  0
committer_email              0
committer_email_dedup    22313
committer_name               0
committer_name_dedup     22313
committer_time               0
comp_d                       0
comp_i                       0
delay                        0
ismerge                      0
loc_d                        0
loc_i                        0
message                      0
ndiffs                       0
nfiles                       0
squashof                     0
dtype: int64

- The only NaNs present are in `committer_email_dedup` and `committer_name_dedup`, which makes sense because these columns hold deduplicated information that is optional and thus not necessarily present.
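If a complete committer identity is needed later, one option is to fall back to the raw value whenever the deduplicated one is missing. This is only a sketch of that idea on toy data: the addresses are made up, the `committer_email_final` column is hypothetical, and whether the raw value is an acceptable substitute is an assumption that would need checking against the dataset documentation.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like commits_df (addresses are invented for illustration).
toy_commits = pd.DataFrame({
    'committer_email': ['a@example.com', 'b@example.com'],
    'committer_email_dedup': ['a.canonical@example.com', np.nan],
})

# Hypothetical helper column: prefer the deduplicated value, fall back to the raw one.
toy_commits['committer_email_final'] = (
    toy_commits['committer_email_dedup'].fillna(toy_commits['committer_email'])
)
print(toy_commits)
```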

### <span style='color:green'> 3.3 - StackOverflow data</span>

In [None]:
# germain
# display a sample
stackoverflow_df.head(5)

- None of the columns will be dropped because all of them are needed for later analysis steps.

In [None]:
# germain
# get the number of Nan values for each column
stackoverflow_df.isnull().sum()

- No null values are present in comments of issues data.

### <span style='color:green'> 3.4 - Reddit data</span>

In [None]:
# germain
# display a sample
reddit_df.head(5)

- None of the columns will be dropped because all of them are needed for later analysis steps.

In [None]:
# germain
# get the number of Nan values for each column
reddit_df.isnull().sum()

- No null values are present in comments of issues data.

# 4. Data Preparation:
- converting data type
- filter / transform
- Description
- distributions
- correlations / dependence

### <span style='color:green'> 4.1 - GitHub API data (Issues & Comments) </span>

In [59]:
# check the types of each column
issues_df.dtypes

body            object
closed_at       object
comments         int64
created_at      object
html_url        object
state           object
title           object
closed_by       object
user            object
assignees       object
labels          object
milestone       object
pull_request    object
dtype: object

In [168]:
# convert each column to its appropriate type
issues_df.closed_at = pd.to_datetime(issues_df.closed_at)
issues_df.created_at = pd.to_datetime(issues_df.created_at)
# TODO (Germain): change this to 'project' instead of 'html_url'
issues_df.state = issues_df.state.astype('category')
# `user` holds plain strings, so it stays an object column

In [169]:
issues_df.dtypes

body                    object
closed_at       datetime64[ns]
comments                 int64
created_at      datetime64[ns]
html_url                object
state                 category
title                   object
closed_by               object
user                    object
assignees               object
labels                  object
milestone               object
pull_request            object
dtype: object

In [136]:
# describe the comments column (the only int column)
pd.DataFrame(issues_df['comments'].describe())

Unnamed: 0,comments
count,1963.0
mean,5.201732
std,6.481846
min,0.0
25%,2.0
50%,4.0
75%,6.0
max,94.0


In [140]:
# describe the closed_at and created_at columns
pd.DataFrame(issues_df[['closed_at','created_at']].describe())

Unnamed: 0,closed_at,created_at
count,1369,1963
unique,1203,1963
top,2018-11-18 18:32:51,2018-08-09 11:48:09
freq,3,1
first,2018-07-16 15:18:48,2018-07-15 15:18:30
last,2018-11-24 15:43:57,2018-11-24 09:51:00


In [152]:
pd.DataFrame(issues_df['state'].describe())

Unnamed: 0,state
count,1963
unique,2
top,closed
freq,1369


In [155]:
# check the types of each column
comments_df.dtypes

body          object
created_at    object
updated_at    object
parent         int64
dtype: object

In [170]:
# convert to appropriate types
comments_df.updated_at = pd.to_datetime(comments_df.updated_at)
comments_df.created_at = pd.to_datetime(comments_df.created_at)
comments_df.dtypes

body                  object
created_at    datetime64[ns]
updated_at    datetime64[ns]
parent                 int64
dtype: object

In [171]:
# describe the data
comments_df[['updated_at','created_at']].describe()

Unnamed: 0,updated_at,created_at
count,171985,171985
unique,170599,170586
top,2018-10-22 16:36:50,2018-10-22 16:36:50
freq,13,13
first,2015-09-23 18:28:51,2015-09-23 18:28:51
last,2018-11-24 17:31:16,2018-11-24 15:56:21


Based on the sample used for this milestone, the value ranges and distributions look plausible so far:
- the time values fall within plausible year ranges (judging by the first and last timestamps).
- the comment counts are in a normal range (min is 0 and max is 94 for this sample of the data).
- the categorical data checks out as well.

**Note: this is only a sample of the data we want to retrieve. Therefore, we cannot perform correlation/distribution analysis yet (since we need the full data for that).**


### <span style='color:green'> 4.2 - Projects data </span>

In [174]:
# check data types
commits_df.dtypes

project                          object
age                               int64
author_email                     object
author_email_dedup               object
author_name                      object
author_name_dedup                object
author_time              datetime64[ns]
committer_email                  object
committer_email_dedup            object
committer_name                   object
committer_name_dedup             object
committer_time           datetime64[ns]
comp_d                            int64
comp_i                            int64
delay                             int64
ismerge                            bool
loc_d                             int64
loc_i                             int64
message                          object
ndiffs                            int64
nfiles                            int64
squashof                          int64
dtype: object

All columns already have the appropriate data type.

In [175]:
# describe the integer data
commits_df.describe()

Unnamed: 0,age,comp_d,comp_i,delay,loc_d,loc_i,ndiffs,nfiles,squashof
count,760016.0,760016.0,760016.0,760016.0,760016.0,760016.0,760016.0,760016.0,760016.0
mean,16188.32,850.402367,1653.404,-1844900.0,483.495596,1134.851,1.15594,35.957244,309.021527
std,982235.1,7397.177527,14830.75,10640610.0,5396.217236,10896.62,0.363096,261.872079,1976.791309
min,-109167.0,0.0,0.0,-383369700.0,0.0,0.0,1.0,0.0,-1.0
25%,-1.0,0.0,1.0,-59575.5,0.0,1.0,1.0,1.0,-1.0
50%,-1.0,5.0,14.0,0.0,3.0,10.0,1.0,2.0,-1.0
75%,-1.0,47.0,101.0,0.0,27.0,65.0,1.0,5.0,-1.0
max,259430100.0,956073.0,2569923.0,31535770.0,826769.0,1435685.0,6.0,61451.0,44150.0


In [181]:
# describe the time data
commits_df[['author_time','committer_time']].describe()

Unnamed: 0,author_time,committer_time
count,760016,760016
unique,575196,571758
top,2016-06-03 15:38:25,2016-04-21 10:56:45
freq,170,979
first,1999-12-29 14:20:26,1999-12-29 14:20:26
last,2018-12-31 09:53:18,2018-01-19 21:29:15


In [182]:
commits_df['ismerge'].describe()

count     760016
unique         2
top        False
freq      617053
Name: ismerge, dtype: object

Everything looks plausible so far in terms of value ranges and distributions:
- the time values fall within an acceptable range (the first and last timestamps span 1999-2018).
- the integer columns have normal value ranges (min, max), as shown in the data frame above.
- the categorical data (`ismerge`) checks out as well.

*Note: the huge maximum for `age` is a result of it being expressed in seconds. The max value corresponds to around 8 years, which falls within the observed time range of the commits (1999-2018).*
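The conversion behind that note is a one-liner. The sketch below assumes, as stated above, that `age` is measured in seconds, and uses an average year length of 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year length, leap days included

max_age_seconds = 259430100  # max of the age column in the describe() table above
max_age_years = max_age_seconds / SECONDS_PER_YEAR
print(round(max_age_years, 1))
```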

In [186]:
# compute the correlation matrix once, then keep only coefficients above 0.3
corr = commits_df.corr()
corr[corr > 0.3]

Unnamed: 0,age,comp_d,comp_i,delay,ismerge,loc_d,loc_i,ndiffs,nfiles,squashof
age,1.0,,,,,,,,,
comp_d,,1.0,0.471823,,,0.965573,0.433444,,0.463603,
comp_i,,0.471823,1.0,,,0.387511,0.985779,,0.557978,
delay,,,,1.0,,,,,,
ismerge,,,,,1.0,,,0.892249,,0.325822
loc_d,,0.965573,0.387511,,,1.0,0.369714,,0.423406,
loc_i,,0.433444,0.985779,,,0.369714,1.0,,0.560638,
ndiffs,,,,,0.892249,,,1.0,,
nfiles,,0.463603,0.557978,,,0.423406,0.560638,,1.0,
squashof,,,,,0.325822,,,,,1.0


In [187]:
commits_df = commits_df.drop('ismerge', axis=1)

These correlations make sense given the definition of our data columns.

For example, `ismerge` and `ndiffs` are correlated since `ismerge` is True whenever a commit has two or more parents or is a git squash, and `ndiffs` is the number of diffs and parents of a commit. In fact, the information in `ismerge` can be obtained from `ndiffs` and `squashof`. We therefore decide to drop `ismerge`.

However, the other correlated variables might still be useful to us during our analysis.  
For instance, `comp_d` (whitespace complexity deleted, e.g. number of spaces removed by a commit) and `loc_d` (lines of code deleted by a commit) are obviously highly correlated (deleting lines of code also deletes spaces), but they can still be useful together in some cases. If an author's commits mostly remove whitespace complexity without removing many lines of code, we might be able to classify this author as someone who mostly enforces code style or coding guidelines.

For this reason, we choose not to remove any of the remaining columns at this stage.
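To review correlated columns without reading the full matrix, each pair can be listed once. Below is a minimal sketch on synthetic data: the column names mirror `commits_df`, but the values are randomly generated, and the 0.3 threshold is the one used in the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for commits_df: comp_d is built to track loc_d closely.
rng = np.random.RandomState(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'loc_d': base,
    'comp_d': 2 * base + rng.normal(scale=0.1, size=200),
    'delay': rng.normal(size=200),  # independent noise
})

corr = toy.corr()
# Keep each pair once by masking the diagonal and the lower triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
strong_pairs = pairs[pairs > 0.3]
print(strong_pairs)
```

The result is a Series indexed by column pairs, which is easier to scan than a mostly empty matrix when the number of columns grows.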

# 5. Plan Update
- updated your plan in a reasonable way
- reflecting your improved knowledge 
- discuss how your data suits your project needs
- discuss the methods you’re going to use, giving their essential mathematical details
- potentially discussing alternatives to your choices that you considered but dropped.