## Extracting Issues from GitHub Repositories
### Author: carterp@cs.uoregon.edu

A Python implementation for extracting issues and comments from a GitHub repository using the GitHub REST API (https://docs.github.com/v3/).

In [644]:
import pandas as pd
import requests
import urllib.parse as up

owner = "HPCL" # owner or organization of repo
repo = "autoperf" # repo name

#### Rate Limiting
The API allows at most 60 unauthenticated requests per hour (https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting), which is problematic for larger repositories (see **Pagination** below). At best, a single call returns up to 100 results per page, for both issues and comments. Authenticating with a personal access token raises the limit to 5000 requests per hour, which is sufficient for large repositories (see https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token for setting up a personal access token).
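
As a minimal sketch of how the remaining quota might be monitored, the helper below reads the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub attaches to REST responses. The function name `rate_limit_status` and the sample header values are hypothetical and not part of this notebook; a plain `dict` is used here, so the header names must match exactly (the `headers` object of a `requests` response is case-insensitive).

```python
from datetime import datetime, timezone

def rate_limit_status(headers):
    """Return (remaining_requests, reset_time_utc) from GitHub response headers.

    `headers` is any mapping of header names to string values,
    e.g. `response.headers` from a `requests` response.
    """
    remaining = int(headers["X-RateLimit-Remaining"])
    # The reset header is a Unix timestamp (seconds since the epoch, UTC)
    reset_at = datetime.fromtimestamp(int(headers["X-RateLimit-Reset"]), tz=timezone.utc)
    return remaining, reset_at

# Made-up header values, for illustration only
sample_headers = {"X-RateLimit-Remaining": "42", "X-RateLimit-Reset": "1600000000"}
remaining, reset_at = rate_limit_status(sample_headers)
print(remaining, reset_at.isoformat())
```

Checking these values between calls would make it possible to pause before the limit is reached instead of failing partway through a fetch.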

In [645]:
user = "..." # GitHub username (string)
token = "..." # personal access token (string)

session = requests.Session()
session.auth = (user, token)

In [646]:
# setup payload params (except page number)
per_page = 100 # max val is 100, fewest API calls
state = "all" # grab open and closed issues

# setup headers
accept = "application/vnd.github.v3+json"
headers = {"accept" : accept}

topics_url = f"https://api.github.com/repos/{owner}/{repo}/issues"
comments_url = f"https://api.github.com/repos/{owner}/{repo}/issues/comments"

In [647]:
def api_request(url, payload=None, headers=None):
    # Fetch from GitHub REST API
    try:
        response = session.get(url, params=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        return None
    return response

#### Pagination
In repositories with no more than 100 issues/comments (maximum entries that can be grabbed per page), the API will return all the data on a single call. However, when this limit is exceeded it is necessary to use pagination (https://docs.github.com/en/rest/guides/traversing-with-pagination) and make additional API calls to fetch all the issues/comments for the repository.

##### Example:
$I$ = Number of issues in a repository \
$C$ = Number of comments in a repository \
$A(x) = \lceil \frac{x}{100} \rceil$ = Number of API calls needed to fetch $x$ items at 100 items per page \
If $I = 1184$, $C = 4862$, then the number of API calls is $A(I) + A(C) = \lceil \frac{1184}{100} \rceil + \lceil \frac{4862}{100} \rceil = 12 + 49 = 61$, which exceeds the unauthenticated limit of 60 requests per hour and shows the importance of using authenticated requests. The actual number can be slightly larger, because one extra call is made for each issue referenced by unpaired comments (see **Finding Unpaired Comments** below).
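
The arithmetic in this example can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `api_calls` is hypothetical and does not appear elsewhere in the notebook.

```python
import math

def api_calls(num_items, per_page=100):
    """Number of paginated requests needed to fetch `num_items` results."""
    return math.ceil(num_items / per_page)

issues, comments = 1184, 4862
total = api_calls(issues) + api_calls(comments)
print(total)  # 12 + 49 = 61
```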

In [648]:
def retrieve_data(url):
    # Retrieve all pages of results from the API via pagination
    page = 1
    payload = {"per_page" : per_page, "page" : page, "state" : state}
    response = api_request(url=url, payload=payload, headers=headers)
    if response is None: # api_request returns None on a connection error
        raise ConnectionError(f"Could not reach {url}")
    response.raise_for_status() # fail early on HTTP errors (e.g. rate limit exceeded)
    data = response.json()
    
    # Case where pagination is not needed
    if "last" not in response.links:
        return pd.DataFrame(data)
    
    # Determine last page from the Link header (parsed by requests into response.links)
    last_page_url = response.links["last"]["url"]
    last_page = int(up.parse_qs(up.urlparse(last_page_url).query)["page"][0])
    
    for page in range(page + 1, last_page + 1):
        payload["page"] = page
        response = api_request(url=url, payload=payload, headers=headers)
        if response is None:
            raise ConnectionError(f"Could not reach {url} (page {page})")
        response.raise_for_status()
        data += response.json()
        
    return pd.DataFrame(data)
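
To make the last-page lookup concrete, here is a small standalone sketch that extracts the page number of the `rel="last"` entry from a `Link` header string using only the standard library. The function name `last_page_from_link` and the sample header value are hypothetical; the sample is merely shaped like the header GitHub returns on the first page of a paginated result.

```python
import re
import urllib.parse as up

def last_page_from_link(link_header):
    """Return the page number of the rel="last" link, or None if absent."""
    # Each entry looks like: <https://...?page=5>; rel="last"
    for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', link_header):
        if rel == "last":
            return int(up.parse_qs(up.urlparse(url).query)["page"][0])
    return None

# Hypothetical header value, for illustration only
sample = (
    '<https://api.github.com/repositories/1/issues?per_page=100&page=2>; rel="next", '
    '<https://api.github.com/repositories/1/issues?per_page=100&page=5>; rel="last"'
)
print(last_page_from_link(sample))
```

Matching on the `rel` value rather than on the position of an entry keeps the lookup independent of the order in which the links are listed. The `requests` library also exposes the parsed header as `response.links`.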

In [649]:
topics = retrieve_data(topics_url) # retrieves all topics

In [650]:
topics.head()

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,assignees,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,body,performed_via_github_app
0,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/28,599648698,MDU6SXNzdWU1OTk2NDg2OTg=,28,Results not going to configured results directory,...,"[{'login': 'brnorris03', 'id': 3604514, 'node_...",,1,2020-04-14T15:12:15Z,2020-04-14T15:14:35Z,2020-04-14T15:14:35Z,MEMBER,,Results placement ignores output config option.,
1,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/26,598524497,MDU6SXNzdWU1OTg1MjQ0OTc=,26,Simple timer tool,...,"[{'login': 'brnorris03', 'id': 3604514, 'node_...",{'url': 'https://api.github.com/repos/HPCL/aut...,0,2020-04-12T16:44:43Z,2020-04-12T16:44:43Z,,MEMBER,,Eliminate dependency on TAU by adding a simple...,
2,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/25,598273781,MDU6SXNzdWU1OTgyNzM3ODE=,25,Setup git actions testing,...,"[{'login': 'brnorris03', 'id': 3604514, 'node_...",{'url': 'https://api.github.com/repos/HPCL/aut...,2,2020-04-11T14:11:05Z,2020-04-11T22:31:20Z,2020-04-11T22:31:20Z,MEMBER,resolved,Created by Boyana Norris via monday.com integr...,
3,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/24,598273776,MDU6SXNzdWU1OTgyNzM3NzY=,24,Setup git actions testing,...,"[{'login': 'brnorris03', 'id': 3604514, 'node_...",{'url': 'https://api.github.com/repos/HPCL/aut...,0,2020-04-11T14:11:03Z,2020-04-11T20:16:55Z,2020-04-11T20:16:55Z,MEMBER,,Simple git actions baseline testing with pytest.,
4,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/23,598183592,MDU6SXNzdWU1OTgxODM1OTI=,23,Initial refactoring,...,"[{'login': 'brnorris03', 'id': 3604514, 'node_...",{'url': 'https://api.github.com/repos/HPCL/aut...,0,2020-04-11T03:54:08Z,2020-04-14T20:55:51Z,2020-04-14T20:55:50Z,MEMBER,,"First pass at cleaning up -- remove globals, o...",


In [651]:
repo_comments = retrieve_data(comments_url) # retrieves all comments

In [652]:
repo_comments = repo_comments.assign(paired=False) # keep track of comments of a deleted issue

In [653]:
repo_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired
0,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/1#issu...,https://api.github.com/repos/HPCL/autoperf/iss...,105728415,MDEyOklzc3VlQ29tbWVudDEwNTcyODQxNQ==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-05-27T02:30:57Z,2015-05-27T02:30:57Z,MEMBER,Added all contents of metric_spec to installat...,,False
1,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/6#issu...,https://api.github.com/repos/HPCL/autoperf/iss...,109154857,MDEyOklzc3VlQ29tbWVudDEwOTE1NDg1Nw==,"{'login': 'xdai', 'id': 6415069, 'node_id': 'M...",2015-06-05T03:48:12Z,2015-06-05T03:48:12Z,COLLABORATOR,Fixed in 48d5392abb5dff3b08dc8281b4de51b5289c5...,,False
2,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/9#issu...,https://api.github.com/repos/HPCL/autoperf/iss...,111305188,MDEyOklzc3VlQ29tbWVudDExMTMwNTE4OA==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-06-11T23:24:06Z,2015-06-11T23:24:44Z,MEMBER,Partially implemented in c4156aa -- only SUMMA...,,False
3,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/10#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,111633920,MDEyOklzc3VlQ29tbWVudDExMTYzMzkyMA==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-06-12T22:35:37Z,2015-06-12T22:35:37Z,MEMBER,Fixed in 4f12176\n,,False
4,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,139623857,MDEyOklzc3VlQ29tbWVudDEzOTYyMzg1Nw==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-09-11T18:43:28Z,2015-09-11T18:43:28Z,MEMBER,Also I don't think this should be requiring th...,,False


#### Build Issue Threads
Create an issue thread from the topic (the initial problem/question) and its comments. Each issue thread is a tuple of the form `(topic_df, comments_df)`.
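
The shape of an issue thread can be illustrated on plain Python records before working with DataFrames. This is a simplified sketch under assumed toy data: `toy_topics`, `toy_comments`, and `build_threads` are hypothetical names, and lists of dicts stand in for the DataFrames used in this notebook.

```python
# Toy records standing in for rows of the issue and comment tables
toy_topics = [
    {"url": "issues/1", "title": "First issue"},
    {"url": "issues/2", "title": "Second issue"},
]
toy_comments = [
    {"issue_url": "issues/1", "created_at": "2020-01-01T00:00:00Z", "body": "older"},
    {"issue_url": "issues/1", "created_at": "2020-02-01T00:00:00Z", "body": "newer"},
]

def build_threads(topics, comments):
    """Pair each topic with its comments, most recent comment first."""
    threads = []
    for topic in topics:
        matching = [c for c in comments if c["issue_url"] == topic["url"]]
        # ISO 8601 UTC timestamps in the same format sort correctly as strings
        matching.sort(key=lambda c: c["created_at"], reverse=True)
        threads.append((topic, matching))
    return threads

threads = build_threads(toy_topics, toy_comments)
print([c["body"] for c in threads[0][1]])  # ['newer', 'older']
```

An issue without comments still produces a thread; its second element is simply empty.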

In [654]:
def build_issue_threads(topics, repo_comments):
    # Pair each topic with its comments; matched comments are marked as paired in repo_comments
    issue_threads = []
    for index, row in topics.iterrows():
        topic = topics.loc[topics["node_id"] == row["node_id"]]
        matches = repo_comments["issue_url"] == row["url"] # comments belonging to this issue
        # sort by date (most recent first)
        comments = repo_comments.loc[matches].sort_values("created_at", ascending=False)
        repo_comments.loc[matches, "paired"] = True
        issue_threads.append((topic, comments))
    
    return issue_threads

In [655]:
issue_threads = build_issue_threads(topics=topics, repo_comments=repo_comments)

#### Finding Unpaired Comments (i.e. Comments for Deleted Issues)
In larger repositories, some of the fetched comments reference an issue that has since been deleted. This was found when testing with https://github.com/jupyter/help, where only 19 of the 1669 comments fetched were unpaired.
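
The idea of an unpaired comment can be shown with a small standalone sketch: collect the issue URLs that comments point to, and keep those that match no fetched issue. The names `find_unpaired`, `toy_issue_urls`, and `toy_comments` are hypothetical, and the data is made up for illustration.

```python
def find_unpaired(issue_urls, comments):
    """Return the sorted issue URLs that comments reference but no fetched issue has."""
    known = set(issue_urls)
    return sorted({c["issue_url"] for c in comments if c["issue_url"] not in known})

toy_issue_urls = ["issues/1", "issues/2"]
toy_comments = [
    {"id": 10, "issue_url": "issues/1"},
    {"id": 11, "issue_url": "issues/3"},  # refers to an issue that was not fetched
    {"id": 12, "issue_url": "issues/3"},
]
print(find_unpaired(toy_issue_urls, toy_comments))  # ['issues/3']
```

Collecting the URLs into a set means each missing issue is reported once, however many comments reference it.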

In [656]:
deleted_issues = []

unpaired_comments = repo_comments.loc[~repo_comments["paired"]] # comments not matched to any fetched issue
problematic_urls = unpaired_comments["issue_url"].unique().tolist() # unique issues
problematic_urls

[]

In [657]:
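for url in problematic_urls: # remove unpaired comments from repo_comments
    response = api_request(url=url)
    # api_request returns None on a connection error; a 404 means the issue no longer exists
    if response is not None and response.status_code == 404:
        deleted_issues.append(url)
        repo_comments = repo_comments[repo_comments.issue_url != url]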

In [658]:
deleted_issues

[]

In [659]:
unpaired_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired


In [660]:
repo_comments.drop("paired", axis=1, inplace=True) # remove paired column so it is only GitHub data
repo_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app
0,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/1#issu...,https://api.github.com/repos/HPCL/autoperf/iss...,105728415,MDEyOklzc3VlQ29tbWVudDEwNTcyODQxNQ==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-05-27T02:30:57Z,2015-05-27T02:30:57Z,MEMBER,Added all contents of metric_spec to installat...,
1,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/6#issu...,https://api.github.com/repos/HPCL/autoperf/iss...,109154857,MDEyOklzc3VlQ29tbWVudDEwOTE1NDg1Nw==,"{'login': 'xdai', 'id': 6415069, 'node_id': 'M...",2015-06-05T03:48:12Z,2015-06-05T03:48:12Z,COLLABORATOR,Fixed in 48d5392abb5dff3b08dc8281b4de51b5289c5...,
2,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/9#issu...,https://api.github.com/repos/HPCL/autoperf/iss...,111305188,MDEyOklzc3VlQ29tbWVudDExMTMwNTE4OA==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-06-11T23:24:06Z,2015-06-11T23:24:44Z,MEMBER,Partially implemented in c4156aa -- only SUMMA...,
3,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/10#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,111633920,MDEyOklzc3VlQ29tbWVudDExMTYzMzkyMA==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-06-12T22:35:37Z,2015-06-12T22:35:37Z,MEMBER,Fixed in 4f12176\n,
4,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,139623857,MDEyOklzc3VlQ29tbWVudDEzOTYyMzg1Nw==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-09-11T18:43:28Z,2015-09-11T18:43:28Z,MEMBER,Also I don't think this should be requiring th...,


#### Example Usage

In [661]:
# example usage
num_comments = 2 # pick a thread with more than this many comments to show a conversation; decrease if StopIteration is raised
ex_issue = next(x for x in issue_threads if len(x[1]) > num_comments)

In [662]:
ex_issue[0]

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,assignees,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,body,performed_via_github_app
13,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14,106068474,MDU6SXNzdWUxMDYwNjg0NzQ=,14,TAU_MAKEFILE env variable,...,[],,5,2015-09-11T18:42:34Z,2015-09-19T16:10:07Z,,MEMBER,,Users shouldn't have to set any env. variables...,


In [663]:
ex_issue[1]

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired
16,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,141684027,MDEyOklzc3VlQ29tbWVudDE0MTY4NDAyNw==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-09-19T16:10:07Z,2015-09-19T16:10:07Z,MEMBER,"That would work, but it brings up a different ...",,False
14,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,141631088,MDEyOklzc3VlQ29tbWVudDE0MTYzMTA4OA==,"{'login': 'xdai', 'id': 6415069, 'node_id': 'M...",2015-09-19T06:39:58Z,2015-09-19T06:40:09Z,COLLABORATOR,How about a generic `[Env]` section and export...,,False
11,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,139881818,MDEyOklzc3VlQ29tbWVudDEzOTg4MTgxOA==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-09-13T14:27:23Z,2015-09-13T14:27:23Z,MEMBER,I don't think we can avoid having to deal with...,,False
8,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,139846797,MDEyOklzc3VlQ29tbWVudDEzOTg0Njc5Nw==,"{'login': 'xdai', 'id': 6415069, 'node_id': 'M...",2015-09-13T06:27:24Z,2015-09-13T06:27:24Z,COLLABORATOR,Autoperf depends on user script to build the u...,,False
4,https://api.github.com/repos/HPCL/autoperf/iss...,https://github.com/HPCL/autoperf/issues/14#iss...,https://api.github.com/repos/HPCL/autoperf/iss...,139623857,MDEyOklzc3VlQ29tbWVudDEzOTYyMzg1Nw==,"{'login': 'brnorris03', 'id': 3604514, 'node_i...",2015-09-11T18:43:28Z,2015-09-11T18:43:28Z,MEMBER,Also I don't think this should be requiring th...,,False


In [664]:
ex_issue[0]["html_url"].values[0] # verify data at the issue url

'https://github.com/HPCL/autoperf/issues/14'

### TODO
* <del>Retrieve closed issues</del>
* <del>Fix issue_threads to iterate over issue numbers in issue_heads</del>
* <del>Fix authentication limit</del>
* <del>Rework comment fetching to use non-deprecated API call and pagination</del>
* <del>Figure out issue with unpaired comments</del>