When describing the data, in particular, you should show (non-exhaustive list):

- That you can handle the data at its full size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you have considered ways to enrich, filter, and transform the data according to your needs.
- That you have updated your plan in a reasonable way, reflecting your improved knowledge after getting acquainted with the data. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.


# Milestone 2: Analyzing Success

In [2]:
# Imports.
import os  # needed below for os.listdir

import numpy as np
import pandas as pd

# Paths.
DATASETS_DIR = './data/datasets'

# 1. Data Retrieval 

The project datasets were retrieved from the following sources:

- [Gitential Datasets for Open Source Projects (retrieved in January 2018)](https://github.com/gitential/datasets) (2.31 G): Gitential does not provide a unified dataset; each repository has its own. We extracted the links to all the datasets with a shell one-liner inside a Jupyter notebook that mixes shell and Python, then downloaded them with Python.

- [GitHub API](https://developer.github.com/v3/) (? G): to augment our datasets, we're also using this API (with the [PyGithub Python library](https://github.com/PyGithub/PyGithub)) to obtain additional information about the repos we're interested in. We use the API to get the number of stars, forks and stargazers of a project. We also use the API to get each project's issues and each issue's comments (where applicable, since projects on GitHub can choose not to have an issues tracker).  
**<span style="color:green">(For implementation details, see `retrieve_additional_data_github.ipynb`)</span>**.

- [Stack Overflow Posts data dump](https://archive.org/details/stackexchange) (62 G): we requested that this dataset be downloaded to the EPFL cluster.

- [Reddit comments from 2005-12 to 2017-03](http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b) (304 G): this dataset is available on the EPFL cluster.
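
As a rough illustration of the GitHub API step listed above, the sketch below reads a repository's popularity counters and, when an issue tracker exists, the comments of each issue. It is not the code from `retrieve_additional_data_github.ipynb`: the helper names `repo_summary` and `issue_comments` are hypothetical, and the commented-out usage assumes PyGithub is installed and a personal access token is stored in a `GITHUB_TOKEN` environment variable. The attributes and methods it relies on (`full_name`, `stargazers_count`, `forks_count`, `has_issues`, `get_issues`, `get_comments`) are those of PyGithub's `Repository` and `Issue` objects.

```python
def repo_summary(repo):
    """Return the popularity counters of a PyGithub Repository-like object."""
    return {
        'full_name': repo.full_name,
        'stars': repo.stargazers_count,
        'forks': repo.forks_count,
        'has_issues': repo.has_issues,
    }


def issue_comments(repo, max_issues=None):
    """Map issue number -> list of comment bodies (hypothetical helper).

    Returns an empty dict for projects without an issue tracker.
    """
    if not repo.has_issues:
        return {}
    comments = {}
    for i, issue in enumerate(repo.get_issues(state='all')):
        if max_issues is not None and i >= max_issues:
            break
        comments[issue.number] = [c.body for c in issue.get_comments()]
    return comments


# Usage against the real API (assumption: PyGithub installed, token available):
# import os
# from github import Github
# gh = Github(os.environ['GITHUB_TOKEN'])
# repo = gh.get_repo('keras-team/keras')
# print(repo_summary(repo))
# print(issue_comments(repo, max_issues=5))
```

Two caveats apply when running this at scale: the GitHub issues endpoint also returns pull requests, so they may need to be filtered out (PyGithub exposes `issue.pull_request` for this), and every `get_comments()` call costs at least one request against the API rate limit.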

# 2. Data Loading

In [3]:
AMBIGUOUS_NAMES = {
    'apache-incubator-superset': 'apache-incubator/superset',
    'keras-team-keras': 'keras-team/keras',
    'pandas-dev-pandas': 'pandas-dev/pandas',
    'rust-lang-rust': 'rust-lang/rust',
    'scikit-learn-scikit-learn': 'scikit-learn/scikit-learn'
}
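DIR_GITHUB_MAPPING = {}

# Map each dataset directory name ('owner-repo') to its GitHub path ('owner/repo').
# Names whose owner or repo itself contains a hyphen are resolved via AMBIGUOUS_NAMES.
for dir_name in os.listdir(DATASETS_DIR):
    if dir_name in AMBIGUOUS_NAMES:
        github_path = AMBIGUOUS_NAMES[dir_name]
    else:
        github_path = dir_name.replace('-', '/')
    DIR_GITHUB_MAPPING[dir_name] = github_path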

In [20]:
# Load each project's commits, keyed by GitHub path.
commits = {}
for dir_name, github_path in DIR_GITHUB_MAPPING.items():
    commits[github_path] = pd.read_json('{}/{}/commits.json.gz'.format(DATASETS_DIR, dir_name))
# Concatenate into one DataFrame with the GitHub path in a 'project' column.
commits_df = pd.concat(commits, names=['project'])
commits_df = commits_df.reset_index(level='project').reset_index(drop=True)

In [23]:
commits_df.head(4)

| | project | age | author_email | author_email_dedup | author_name | author_name_dedup | author_time | committer_email | committer_email_dedup | committer_name | ... | comp_i | delay | id | ismerge | loc_d | loc_i | message | ndiffs | nfiles | squashof |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Microsoft/CNTK | -1 | `do****@stggpu1.redmond.corp.microsoft.com` | `al*****@microsoft.com` | unknown | Yinggong ZHAO | 2014-07-29 10:12:20 | `do****@stggpu1.redmond.corp.microsoft.com` | `al*****@microsoft.com` | unknown | ... | 0 | 0 | bc9b0d6b0aebc469b2f84664de590b59d6fdf79f | False | 0 | 0 | `test\n` | 1 | 1 | -1 |
| 1 | Microsoft/CNTK | -1 | `do****@stggpu1.redmond.corp.microsoft.com` | `al*****@microsoft.com` | unknown | Yinggong ZHAO | 2014-08-29 16:21:42 | `do****@stggpu1.redmond.corp.microsoft.com` | `al*****@microsoft.com` | unknown | ... | 248008 | 0 | 61694509551f38e031c74f3d9409b44fe50224cf | False | 0 | 139349 | `First Release of CNTK\n` | 1 | 492 | -1 |
| 2 | Microsoft/CNTK | -1 | `jd*****@microsoft.com` | `jd*****@microsoft.com` | Jasha Droppo | Jasha Droppo | 2014-08-31 12:27:42 | `jd*****@microsoft.com` | `jd*****@microsoft.com` | Jasha Droppo | ... | 0 | 0 | 9515bfbd104a5ba4f4214e2d883e8e3af2acd01c | False | 0 | 0 | `Added the ASR/TIMIT/decoding to ExampleSetups ...` | 1 | 6 | -1 |
| 3 | Microsoft/CNTK | -1 | `do****@microsoft.com` | `do****@microsoft.com` | Dong Yu | Dong Yu | 2014-09-01 14:43:21 | `do****@microsoft.com` | `do****@microsoft.com` | Dong Yu | ... | 0 | 0 | 52eabc6e8852b6a8342ae304a606663f7f8ae15f | False | 1 | 0 | `remove #include SimpleCNNBuilder.h\n` | 1 | 1 | -1 |


# 3. Data Inspection
- Description
- format / type
- distributions
- missing values
- correlations


# 4. Data Preparation
- Remove redundant data (as in HW1)
- Handle missing values
- Convert data types
- Filter / transform

# 5. Plan Update
- updated your plan in a reasonable way
- reflecting your improved knowledge 
- discuss how your data suits your project needs
- discuss the methods you’re going to use, giving their essential mathematical details
- potentially discussing alternatives to your choices that you considered but dropped.

Stack Overflow
- Tags
- Response time
- number of answers
- number of positive and negative votes
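
As a sketch of how the features listed above might be extracted, the code below streams a `Posts.xml`-style file row by row, so the 62 G dump never has to fit in memory. It assumes the usual layout of the Stack Exchange dump: one `<row>` element per post, with attributes such as `Id`, `PostTypeId` (1 for questions, 2 for answers), `ParentId`, `CreationDate`, `Score`, `AnswerCount` and `Tags`. Further assumptions: tags are encoded as `<tag1><tag2>`, response time is defined as the delay until the earliest answer, and `Score` is only the net vote count (separate positive and negative counts would have to come from `Votes.xml`, which is not handled here). The function names and the sample data are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
from datetime import datetime

TAG_RE = re.compile(r'<([^<>]+)>')


def iter_posts(source):
    """Stream <row> elements from a Posts.xml-like file, one dict at a time."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # keep a handle on the root element
    for event, elem in context:
        if event != 'end' or elem.tag != 'row':
            continue
        a = elem.attrib
        yield {
            'id': int(a['Id']),
            'is_question': a['PostTypeId'] == '1',
            'parent_id': int(a['ParentId']) if 'ParentId' in a else None,
            'created': datetime.strptime(a['CreationDate'][:19], '%Y-%m-%dT%H:%M:%S'),
            'score': int(a.get('Score', 0)),
            'answer_count': int(a.get('AnswerCount', 0)),
            'tags': TAG_RE.findall(a.get('Tags', '')),
        }
        root.clear()  # drop processed rows so memory use stays flat


def first_answer_delays(posts):
    """Return {question_id: seconds until the earliest answer}."""
    asked, first_answer = {}, {}
    for p in posts:
        if p['is_question']:
            asked[p['id']] = p['created']
        elif p['parent_id'] is not None:
            prev = first_answer.get(p['parent_id'])
            if prev is None or p['created'] < prev:
                first_answer[p['parent_id']] = p['created']
    return {qid: (first_answer[qid] - t).total_seconds()
            for qid, t in asked.items() if qid in first_answer}


# Made-up sample in the assumed format.
SAMPLE = b'''<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" CreationDate="2017-01-01T10:00:00.000" Score="5" AnswerCount="2" Tags="&lt;python&gt;&lt;pandas&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2017-01-01T10:30:00.000" Score="3" />
  <row Id="3" PostTypeId="2" ParentId="1" CreationDate="2017-01-01T10:05:00.000" Score="-1" />
</posts>'''

posts = list(iter_posts(io.BytesIO(SAMPLE)))
delays = first_answer_delays(posts)
print(posts[0]['tags'], delays)
```

The two dictionaries in `first_answer_delays` grow with the number of questions, so on the full dump this join would more plausibly run as a distributed job on the cluster; the sketch only pins down the definitions of the features.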