**This notebook handles the retrieval and creation of two related datasets using the Github API, in order to enrich our main dataset:**

- **Github issues**: the issues and pull requests for each of the projects we're interested in.
- **Github comments**: for each issue/pull request, its corresponding comments (replies).

In [None]:
# Imports.
import os
import time
from datetime import datetime

import pandas as pd
from github import Github
from github.GithubException import GithubException, RateLimitExceededException

# Paths.
DATA_DIR = './data'
DATASETS_DIR = './data/datasets'
OAUTH_TOKEN = 'secret'  # Placeholder: substitute a real token before running.

In [8]:
# Github API client.
g = Github(OAUTH_TOKEN)

To fetch each repo's issues and comments, we first need to determine which Github repository each directory in our Gitential dataset corresponds to.

The directory structure is as follows:
```
data/
    datasets/
        angular-angular/
        antirez-redis/
        apache-incubator-superset/
        apache-mesos/
        keras-team-keras/
        pandas-dev-pandas/
        ...
```

For example, `angular-angular` corresponds to the Github project `angular/angular` (owner: angular, project name: angular).

However, some of these names are ambiguous. For instance, does `apache-incubator-superset` correspond to `apache-incubator/superset` or `apache/incubator-superset`? We handle these few cases manually, and resolve the rest automatically by replacing `-` with `/`.

In [106]:
# Define ambiguous names and their corresponding Github repos.
AMBIGUOUS_NAMES = {
    'apache-incubator-superset': 'apache-incubator/superset',
    'keras-team-keras': 'keras-team/keras',
    'pandas-dev-pandas': 'pandas-dev/pandas',
    'rust-lang-rust': 'rust-lang/rust',
    'scikit-learn-scikit-learn': 'scikit-learn/scikit-learn'
}
# This is where we'll store the directory<->Github repo mapping.
DIR_GITHUB_MAPPING = {}

# Go over the directories.
for dir_name in os.listdir(DATASETS_DIR):
    if dir_name in AMBIGUOUS_NAMES:
        # Handle ambiguous names.
        github_path = AMBIGUOUS_NAMES[dir_name]
    else:
        # Handle non-ambiguous names.
        github_path = dir_name.replace('-', '/')
    # Update the mapping.
    DIR_GITHUB_MAPPING[dir_name] = github_path
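
As a sanity check on the rule above, here is a minimal standalone sketch showing why multi-hyphen names need a manual entry. The helpers `resolve_github_path` and `is_valid_repo_path` and the sample lists are hypothetical and only illustrate the idea; the overrides dictionary is deliberately incomplete, so one name falls through to the automatic rule.

```python
def resolve_github_path(dir_name, overrides):
    """Map a dataset directory name to an 'owner/repo' path."""
    if dir_name in overrides:
        # Manually resolved (ambiguous) name.
        return overrides[dir_name]
    # Automatic rule: every '-' becomes '/'.
    return dir_name.replace('-', '/')


def is_valid_repo_path(path):
    """A usable path has exactly one '/' separating a non-empty owner and name."""
    owner, _, name = path.partition('/')
    return bool(owner) and bool(name) and '/' not in name


# Hypothetical sample: 'rust-lang-rust' is intentionally left out of the overrides.
overrides = {'keras-team-keras': 'keras-team/keras'}
sample_dirs = ['angular-angular', 'keras-team-keras', 'rust-lang-rust']

resolved = {d: resolve_github_path(d, overrides) for d in sample_dirs}
unresolved = [d for d, path in resolved.items() if not is_valid_repo_path(path)]
print(unresolved)
```

Any directory reported in `unresolved` would need its own entry in the manual mapping.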

For each Github repository we're interested in, we extract the following data for the **issues**:

- **id**: a (global) ID of the issue.
- **title**: the title of the issue.
- **body**: the body text of the issue.
- **number**: the number of the issue (for the current repo).
- **comments**: number of comments.
- **html_url**: the URL to the issue.
- **state**: whether the issue is open or closed.
- **user**: the username of the person who opened the issue.
- **created_at**: when the issue was created.
- **updated_at**: when the issue was last updated.
- **closed_at**: when the issue was closed (or `None` if it was never closed).
- **closed_by**: the username of the person who closed the issue (`None` if it was never closed).
- **assignee**: (legacy) the username of the person who is assigned to this issue (`None` if no one is assigned).
- **assignees**: same as above, but supports multiple users (`None` if there are no assignees).
- **labels**: the list of labels applied to this issue (`None` if there are no labels).
- **milestone**: the milestone's title that is applied to this issue (`None` if no milestone is applied).
- **pull_request**: the pull request URL if this issue is a pull request (`None` if it's a normal issue).
- **project**: the name of the project this issue belongs to.

For each issue, we also extract the following data for its **comments**:

- **body**: the text of the comment.
- **created_at**: when the comment was created.
- **updated_at**: when the comment was last updated.
- **parent**: the ID of the issue the comment belongs to.

Note that the Github API limits authenticated clients to 5000 requests/hour. The code handles this by sleeping and retrying whenever the limit is exceeded.
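
One way to decide how long to wait is to use the reset time that the API reports together with the remaining quota. The sketch below is an alternative, not the approach used in the code that follows. The function `seconds_until_reset` and its `margin_seconds` parameter are hypothetical, and the safety margin is an assumption meant to absorb a slightly late quota refresh. With PyGithub, the reset time is assumed to be available as `g.get_rate_limit().core.reset`.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_reset(reset_at, now, margin_seconds=60):
    """Seconds to sleep until the quota resets, plus a safety margin.

    Both datetimes must be either timezone-aware or naive; mixing them fails.
    """
    delta = (reset_at - now).total_seconds()
    # Never return a negative wait if the reset time is already in the past.
    return max(0.0, delta) + margin_seconds


# Hypothetical timestamps, for illustration only.
now = datetime(2018, 11, 25, 0, 48, tzinfo=timezone.utc)
reset_at = now + timedelta(minutes=30)
wait = seconds_until_reset(reset_at, now)
print(wait)
```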

In [116]:
# This is where we'll store the data we're fetching.
issues_data = {}
comments_data = {}

# Descriptors we want to fetch for issues/comments without any special handling.
# Some descriptors require some extra logic, which we do in the main code.
descriptors = ['body', 'closed_at', 'comments', 'created_at', 'html_url', 'number', 'state', 'title', 'updated_at']
comments_descriptors = ['body', 'created_at', 'updated_at']
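
The descriptor lists drive a plain `getattr` loop over each API object. A minimal sketch of that pattern, using a stand-in object instead of a real API response (`pick_attributes` and `fake_comment` are hypothetical names, and the values are made up):

```python
from types import SimpleNamespace


def pick_attributes(obj, names):
    """Copy the named attributes of obj into a plain dictionary."""
    return {name: getattr(obj, name) for name in names}


# Stand-in for a comment object; attributes not listed below are ignored.
fake_comment = SimpleNamespace(
    id=1,
    body='Looks good to me.',
    created_at='2018-11-25T00:00:00Z',
    updated_at='2018-11-25T00:05:00Z',
    extra='not copied',
)

row = pick_attributes(fake_comment, ['body', 'created_at', 'updated_at'])
print(row)
```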

In [None]:
def extract_issue_data(issues_data, comments_data, issue, github_path):
    '''Given an issue, fetch its data and comments and store them in issues_data and comments_data.'''
    # Get and store the issue's comments.
    for comment in issue.get_comments():
        comment_data = {}
        for descriptor in comments_descriptors:
            comment_data[descriptor] = getattr(comment, descriptor)
        comment_data['parent'] = issue.id
        comments_data[comment.id] = comment_data
    # Get and store the issue's data.
    issue_data = {}
    for descriptor in descriptors:
        issue_data[descriptor] = getattr(issue, descriptor)
    issue_data['closed_by'] = issue.closed_by.login if issue.closed_by else None
    issue_data['user'] = issue.user.login
    issue_data['assignee'] = issue.assignee.login if issue.assignee else None
    issue_data['assignees'] = [user.login for user in issue.assignees] if issue.assignees else None
    issue_data['labels'] = [label.name for label in issue.labels] if issue.labels else None
    issue_data['milestone'] = issue.milestone.title if issue.milestone else None
    issue_data['pull_request'] = issue.pull_request.html_url if issue.pull_request else None
    issue_data['project'] = github_path
    issues_data[issue.id] = issue_data

# For each directory and corresponding Github repo, fetch the issues and comments.
for dir_name, github_path in DIR_GITHUB_MAPPING.items():
    print('Fetching issues for {}'.format(github_path))
    repo = g.get_repo(github_path)
    for issue in repo.get_issues(state='all'):
        # We don't want our code to crash when Github returns an error (e.g. API rate limit exceeded).
        # Therefore, we keep trying/sleeping until fetching an issue's data succeeds.
        success = False
        while not success:
            try:
                # If we've already retrieved this issue's data, skip it.
                if issue.id not in issues_data:
                    extract_issue_data(issues_data, comments_data, issue, github_path)
                success = True
            except RateLimitExceededException as exc:
                # If we've exceeded the rate limit, sleep and try again later.
                # While the limit is hourly, we sleep for 10 minutes at a time instead of 1 hour at a time
                # because the limit refresh seems to lag a bit sometimes. This way, we don't risk waiting an
                # extra hour if it lagged a bit too much.
                print('Rate limit exceeded. Sleeping (time: {})... ({})'.format(datetime.now(), github_path))
                remaining = 0
                while remaining < 4000:
                    time.sleep(10 * 60)
                    remaining = g.get_rate_limit().core.remaining
            except GithubException as exc:
                print(exc)
                time.sleep(60)

Fetching issues for pandas-dev/pandas
Rate limit exceeded. Sleeping (time: 2018-11-25 00:48:16.554675)... (pandas-dev/pandas)
Rate limit exceeded. Sleeping (time: 2018-11-25 01:47:02.831549)... (pandas-dev/pandas)
Rate limit exceeded. Sleeping (time: 2018-11-25 02:53:55.383979)... (pandas-dev/pandas)
Rate limit exceeded. Sleeping (time: 2018-11-25 04:00:50.323763)... (pandas-dev/pandas)
Rate limit exceeded. Sleeping (time: 2018-11-25 05:07:13.648489)... (pandas-dev/pandas)
Rate limit exceeded. Sleeping (time: 2018-11-25 06:14:07.311023)... (pandas-dev/pandas)
Rate limit exceeded. Sleeping (time: 2018-11-25 07:20:10.565698)... (pandas-dev/pandas)
Fetching issues for Microsoft/CNTK
Rate limit exceeded. Sleeping (time: 2018-11-25 08:25:44.606466)... (Microsoft/CNTK)
Fetching issues for caffe2/caffe2
Rate limit exceeded. Sleeping (time: 2018-11-25 09:31:49.292134)... (caffe2/caffe2)
Fetching issues for apache/spark
Rate limit exceeded. Sleeping (time: 2018-11-25 10:38:45.413574)... (apache

In [109]:
# Transform our dictionaries into pandas DataFrames for easier handling.
issues_df = pd.DataFrame.from_dict(issues_data, orient='index')
comments_df = pd.DataFrame.from_dict(comments_data, orient='index')
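
To illustrate the resulting layout, here is a small sketch with two made-up records (`sample_issues` and its values are hypothetical). With `orient='index'`, the outer keys (the issue IDs) become the row index and the inner keys become the columns.

```python
import pandas as pd

# Hypothetical records keyed by issue ID, mirroring the shape of issues_data.
sample_issues = {
    101: {'title': 'Example issue A', 'state': 'open', 'comments': 2},
    102: {'title': 'Example issue B', 'state': 'closed', 'comments': 0},
}

sample_df = pd.DataFrame.from_dict(sample_issues, orient='index')
print(sample_df)
```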

**Please note that the dataset is still being retrieved. The code below was run only on a sample, which will be explored further in the main notebook:**

In [111]:
# Count number of retrieved issues from each project.
issues_df.html_url.str.split('/').apply(lambda x: x[3] + '/' + x[4]).value_counts()

pandas-dev/pandas    1963
Name: html_url, dtype: int64
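
The count above relies on the shape of an issue URL: after splitting on `/`, the owner sits at index 3 and the repository name at index 4. A minimal pure-Python sketch of the same idea (`project_from_url` is a hypothetical helper and the URLs are made-up examples):

```python
from collections import Counter


def project_from_url(html_url):
    """Extract 'owner/repo' from an issue or pull request URL."""
    # 'https://github.com/owner/repo/issues/1'.split('/') gives
    # ['https:', '', 'github.com', 'owner', 'repo', 'issues', '1'].
    parts = html_url.split('/')
    return parts[3] + '/' + parts[4]


# Hypothetical URLs, for illustration only.
urls = [
    'https://github.com/pandas-dev/pandas/issues/1',
    'https://github.com/pandas-dev/pandas/pull/2',
    'https://github.com/apache/spark/pull/3',
]

counts = Counter(project_from_url(url) for url in urls)
print(counts)
```

Since each issue record also stores a `project` field, counting on that column should give the same breakdown.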

In [113]:
# Save to disk.
comments_df.to_csv('{}/github_comments.csv'.format(DATA_DIR))
issues_df.to_csv('{}/github_issues.csv'.format(DATA_DIR))
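
CSV has no list type, so list-valued columns such as `assignees` and `labels` are expected to be written as their string representation, and missing values to come back as empty cells or NaN. This is an assumption worth verifying on the saved files. Under that assumption, a minimal sketch for restoring the lists after reloading (`parse_list_cell` is a hypothetical helper):

```python
import ast


def parse_list_cell(cell):
    """Turn a CSV cell such as "['bug', 'docs']" back into a Python list.

    Anything that is not a non-empty string (for example NaN) maps to None.
    """
    if not isinstance(cell, str) or cell == '':
        return None
    # literal_eval only parses Python literals, so it is safer than eval.
    return ast.literal_eval(cell)


labels = parse_list_cell("['bug', 'docs']")
missing = parse_list_cell(float('nan'))
print(labels, missing)
```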