## Extracting Issues from GitHub and GitLab Repositories
### Author: carterp@cs.uoregon.edu

A Python implementation for extracting issues and comments from a repository, using either the GitHub REST API (https://docs.github.com/v3/) or the GitLab API (https://docs.gitlab.com/ee/api).

In [24]:
import datetime
import pandas as pd
import requests

# date constants
LATEST = datetime.datetime(year=datetime.MAXYEAR, month=12, day=31) # latest entry to fetch
OLDEST = datetime.datetime.utcfromtimestamp(0) # oldest entry to fetch

#### Configuration Step
Set the values in this step before running the notebook. Be sure to select the source with the `source_github` boolean.

In [25]:
source_github = True # True for GitHub, False for GitLab

start = datetime.datetime(year=2019, month=1, day=1) # example start date
end = datetime.datetime(year=2019, month=1, day=31) # example end date

date_range = (start, end) # range of results to fetch; use (OLDEST, LATEST) to encompass all issues

if source_github:
    # authentication config
    user = "..." # GitHub username
    token = "..." # personal access token
    
    # target repository
    owner = "jupyter" # owner or organization of repo
    repo = "help" # repo name
    
    # filtering parameters
    state = "all" # filter by issue state - options: all, open, closed
else:
    # authentication config
    token = "..." # personal access token
    
    # target repository
    project_id = "1441932" # GitLab project id
    
    # filtering parameters
    state = "all" # filter by issue state - options: all, opened, closed

#### Rate Limiting
The GitHub API allows no more than 60 unauthenticated requests per hour (https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting), which is problematic for larger repositories (see **Pagination** below). At maximum efficiency, each call retrieves up to 100 results per page, for both issues and comments. Using a personal access token for authenticated requests raises the limit to 5000 requests per hour, which is sufficient for large repositories (see https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token for setting up a personal access token). GitLab rate limits at 600 requests per minute.
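
As a rough illustration of working within the GitHub limit, the sketch below reads the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub attaches to its responses and computes how long to wait before the next request. The helper name `seconds_until_reset` and its `now` parameter are hypothetical and not part of this notebook.

```python
import datetime


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next GitHub request.

    `headers` is a mapping of response headers. Returns 0.0 when requests
    remain in the current window or when the rate-limit headers are absent.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # X-RateLimit-Reset is the window's reset time in UTC epoch seconds.
    reset_epoch = int(headers.get("X-RateLimit-Reset", 0))
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc).timestamp()
    return max(0.0, reset_epoch - now)
```

One possible use is to call `time.sleep(seconds_until_reset(response.headers))` before retrying a request that was rejected for exceeding the limit.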

In [26]:
session = requests.Session()

In [27]:
# pre-fetching payload & authentication setup
if source_github:
    # authenticate
    session.auth = (user, token)
    
    # setup headers & params
    since = date_range[0].isoformat() # issues updated at or later
    accept = "application/vnd.github.v3+json"
    headers = {"accept" : accept}
    topics_url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    comments_url = f"https://api.github.com/repos/{owner}/{repo}/issues/comments"
else:
    # authenticate (update rather than replace, to keep the session's default headers)
    session.headers.update({"Private-Token" : token})
    headers = None # no extra per-request headers; retrieve_data expects this name to exist
    
    # setup params
    scope = "all" # encompass every issue
    with_labels_details = "true" # more info on labels
    created_after, created_before = date_range # translate date_range
    created_after, created_before = created_after.isoformat(), created_before.isoformat()
    topics_url = f"https://gitlab.com/api/v4/projects/{project_id}/issues"

In [28]:
def api_request(url, payload=None, headers=None):
    # Fetch from API
    try:
        response = session.get(url, params=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        return None
    return response

#### Pagination
In repositories with no more than 100 issues/comments (maximum entries that can be grabbed per page), the API will return all the data on a single call. However, when this limit is exceeded it is necessary to use pagination (https://docs.github.com/en/rest/guides/traversing-with-pagination) and make additional API calls to fetch all the issues/comments for the repository.
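
One way to traverse the pages is to follow the `next` link that `requests` exposes through `response.links`, rather than incrementing a page counter. The generator below is a minimal sketch of that idea; `iter_pages` and its `get` parameter are hypothetical names, and `get` is assumed to behave like `session.get`.

```python
def iter_pages(get, url, params=None):
    """Yield each page's decoded JSON, following the `next` link until none remains.

    `get` is any callable accepting get(url, params=...) and returning an object
    that exposes `.json()` and `.links` the way `requests.Response` does.
    """
    while url:
        response = get(url, params=params)
        yield response.json()
        # The next-page URL already carries the query string, so the
        # original params are only sent with the first request.
        url = response.links.get("next", {}).get("url")
        params = None
```

With a real session this could be consumed as `[item for page in iter_pages(session.get, topics_url, payload) for item in page]`, assuming any required headers are set on the session itself.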

##### Example:
$I$ = Number of issues in a repository\
$C$ = Number of comments in a repository \
$A(x) =  \lceil \frac{x}{100} \rceil$ is the total number of API calls \
If $I = 1184$, $C = 4862$, then the number of API calls is $A(I) + A(C) = \lceil \frac{1184}{100} \rceil + \lceil \frac{4862}{100} \rceil =  12 + 49 = 61$, which exceeds the unauthenticated limit of 60 requests per hour and shows the importance of using authenticated requests. The actual number can be slightly larger because of the extra requests made for unpaired comments (see **Finding Unpaired Comments** below).
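
The arithmetic above can be checked with a few lines of Python. The function name `api_calls` is only illustrative; it implements $A(x)$ as defined in this example.

```python
import math

PER_PAGE = 100  # maximum number of entries returned per page


def api_calls(entries, per_page=PER_PAGE):
    """Number of paginated calls needed to fetch `entries` items, i.e. A(x)."""
    return math.ceil(entries / per_page)


issues, comments = 1184, 4862
total = api_calls(issues) + api_calls(comments)
print(api_calls(issues), api_calls(comments), total)  # 12 49 61
```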

In [29]:
def retrieve_data(url):
    # Retrieve full API information via pagination
    page = 1
    per_page = 100
    if source_github:
        payload = {"per_page" : per_page, "page" : page, "state" : state, "since" : since}
    else:
        payload = {"per_page" : per_page, "page" : page, "state" : state, "scope" : scope,
                   "with_labels_details" : with_labels_details, "created_after" : created_after,
                   "created_before" : created_before}
    
    response = api_request(url=url, payload=payload, headers=headers)
    data = response.json()

    # Case where pagination is not needed
    if not response.links:
        return pd.DataFrame(data)
    
    # Keep requesting pages while the response headers advertise a "next" link
    while "next" in response.links:
        page += 1
        payload["page"] = page
        response = api_request(url=url, payload=payload, headers=headers)
        data += response.json()
        
    return pd.DataFrame(data)

In [30]:
topics = retrieve_data(topics_url) # retrieves all topics

In [31]:
# apply date range
if not topics.empty:
    mask = (topics['created_at'] >= date_range[0].isoformat()) & (topics['created_at'] <= date_range[1].isoformat())
    topics = topics.loc[mask]

In [32]:
topics.head()

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,body,performed_via_github_app,pull_request
45,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/513,404462675,MDU6SXNzdWU0MDQ0NjI2NzU=,513,Successfully executed cells get stuck in busy ...,...,,0,2019-01-29T20:01:10Z,2019-01-29T20:01:10Z,,NONE,,I need help with this issue:\r\nhttps://github...,,
46,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/512,404444283,MDU6SXNzdWU0MDQ0NDQyODM=,512,metadata-aware themes,...,,0,2019-01-29T19:12:56Z,2019-01-29T19:12:56Z,,NONE,,I've been trying to use cell metadata to imple...,,
47,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/511,404344190,MDU6SXNzdWU0MDQzNDQxOTA=,511,How to search kernels only in one path.,...,,0,2019-01-29T15:25:06Z,2019-01-29T15:25:06Z,,NONE,,I defined JUPYTER_PATH contining only a truste...,,
48,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/510,403777793,MDU6SXNzdWU0MDM3Nzc3OTM=,510,Difficulties with references to pysqlite2 afte...,...,,0,2019-01-28T11:54:00Z,2019-01-28T11:54:00Z,,NONE,,"**My setup:**\r\n\r\n Windows 7, 64 bit\r\n...",,
49,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/509,403474262,MDU6SXNzdWU0MDM0NzQyNjI=,509,slideshows with JupyterLab,...,,1,2019-01-26T19:30:12Z,2019-01-29T13:06:49Z,,NONE,,I've seen tools for creating slideshow present...,,


In [33]:
repo_comments = pd.DataFrame(None)
if source_github: # GitLab only fetches comments for each issue
    repo_comments = retrieve_data(comments_url) # retrieves all comments
    repo_comments = repo_comments.assign(paired=False) # keep track of comments of a deleted issue

In [34]:
repo_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired
0,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/480#iss...,https://api.github.com/repos/jupyter/help/issu...,451353915,MDEyOklzc3VlQ29tbWVudDQ1MTM1MzkxNQ==,"{'login': 'PetitLepton', 'id': 9269985, 'node_...",2019-01-04T05:00:25Z,2019-01-04T05:00:25Z,NONE,"Hi,\r\nit turns out that the problem seems to ...",,False
1,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/480#iss...,https://api.github.com/repos/jupyter/help/issu...,451359024,MDEyOklzc3VlQ29tbWVudDQ1MTM1OTAyNA==,"{'login': 'PetitLepton', 'id': 9269985, 'node_...",2019-01-04T05:54:56Z,2019-01-04T05:54:56Z,NONE,This is a similar issue: https://github.com/ip...,,False
2,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/489#iss...,https://api.github.com/repos/jupyter/help/issu...,451366012,MDEyOklzc3VlQ29tbWVudDQ1MTM2NjAxMg==,"{'login': 'CharlesAnalyst', 'id': 28387630, 'n...",2019-01-04T06:55:26Z,2019-01-04T06:55:26Z,NONE,This issue can be solved by the answer:\r\nhtt...,,False
3,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/493#iss...,https://api.github.com/repos/jupyter/help/issu...,453099196,MDEyOklzc3VlQ29tbWVudDQ1MzA5OTE5Ng==,"{'login': 'kspeeckaert', 'id': 6348619, 'node_...",2019-01-10T13:40:54Z,2019-01-10T13:40:54Z,NONE,I compared the output to a working environment...,,False
4,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/89#issu...,https://api.github.com/repos/jupyter/help/issu...,453770444,MDEyOklzc3VlQ29tbWVudDQ1Mzc3MDQ0NA==,"{'login': 'awatson31911', 'id': 25552118, 'nod...",2019-01-12T18:29:01Z,2019-01-12T18:29:01Z,NONE,"@d-demirci \r\nperfect, thanks a lot!",,False


#### Build Issue Threads
Create an issue thread from the topic (the initial problem/question) and its comments. Issue threads are in the form `(topic_df, comments_df)`.
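
To make the `(topic_df, comments_df)` shape concrete, here is a small sketch on toy data. The URLs, titles, and the helper name `pair_threads` are made up for illustration; only the `url`, `issue_url`, and `created_at` column names mirror the GitHub responses shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for the fetched topics and comments (values are hypothetical).
topics = pd.DataFrame({
    "url": ["https://example.invalid/issues/1", "https://example.invalid/issues/2"],
    "title": ["first", "second"],
})
comments = pd.DataFrame({
    "issue_url": ["https://example.invalid/issues/1"] * 2,
    "created_at": ["2019-01-02T00:00:00Z", "2019-01-03T00:00:00Z"],
    "body": ["older", "newer"],
})


def pair_threads(topics, comments):
    """Return a list of (topic_df, comments_df) tuples, most recent comment first."""
    threads = []
    for _, row in topics.iterrows():
        topic = topics.loc[topics["url"] == row["url"]]
        matched = comments.loc[comments["issue_url"] == row["url"]]
        threads.append((topic, matched.sort_values("created_at", ascending=False)))
    return threads


threads = pair_threads(topics, comments)
```

Here `threads[0]` pairs the first topic with its two comments, while `threads[1]` pairs the second topic with an empty comments DataFrame.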

In [35]:
def build_issue_threads(topics, repo_comments):
    issue_threads = []
    for index, row in topics.iterrows():
        if source_github:
            topic = topics.loc[topics["node_id"] == row["node_id"]]
            # sort by date (most recent first)
            comments = repo_comments.loc[repo_comments["issue_url"] == row["url"]].sort_values("created_at", ascending=False)
            comments.drop("paired", axis=1, inplace=True) # remove paired column
            repo_comments.loc[repo_comments["issue_url"] == row["url"], "paired"] = True
            issue_threads.append((topic, comments))
        else:
            # this is really slow: one API request is made per issue
            topic = topics.loc[topics["iid"] == row["iid"]]
            iid = row["iid"] # project-scoped issue id
            comments_url =  f"https://gitlab.com/api/v4/projects/{project_id}/issues/{iid}/notes"
            response = api_request(url=comments_url)
            comments = pd.DataFrame(response.json())
            issue_threads.append((topic, comments))
    
    return issue_threads

In [36]:
issue_threads = build_issue_threads(topics=topics, repo_comments=repo_comments)

#### Finding Unpaired Comments (i.e. Comments for Deleted Issues)
For larger repositories, some of the fetched comments reference an issue that is not among the fetched topics, for example a deleted issue; this was found when testing with https://github.com/jupyter/help. In that test, only 19 of the 1669 comments fetched were unpaired. This does not apply to GitLab repositories, as comments are fetched separately for each issue. Note that narrowing `date_range` increases the number of problematic URLs: comments are requested by update time (`since`), whereas topics are filtered by creation date, so comments on issues created outside the range are also left unpaired.
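
The idea of an unpaired comment can be sketched without any API calls: it is a comment whose `issue_url` does not match the `url` of any fetched topic. The helper `unpaired_issue_urls` and the URLs below are hypothetical and serve only to illustrate the check.

```python
def unpaired_issue_urls(topic_urls, comment_issue_urls):
    """Return the unique issue URLs that comments point to but no topic has,
    in first-seen order."""
    known = set(topic_urls)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(u for u in comment_issue_urls if u not in known))


topic_urls = ["https://example.invalid/issues/513", "https://example.invalid/issues/512"]
comment_issue_urls = [
    "https://example.invalid/issues/480",
    "https://example.invalid/issues/480",
    "https://example.invalid/issues/513",
    "https://example.invalid/issues/89",
]
problem_urls = unpaired_issue_urls(topic_urls, comment_issue_urls)
```

Each URL found this way can then be requested once to tell a deleted issue (a 404 response) apart from an issue that merely falls outside the date range.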

In [37]:
if source_github and not topics.empty:
    deleted_issues = []

    unpaired_comments = repo_comments.loc[repo_comments["paired"] == False]
    problematic_urls = unpaired_comments["issue_url"].unique().tolist() # unique issues
    for url in problematic_urls: # remove unpaired comments from repo_comments
        response = api_request(url=url)
        if response.status_code == 404:
            deleted_issues.append(url)
            repo_comments = repo_comments[repo_comments.issue_url != url]

In [38]:
problematic_urls

['https://api.github.com/repos/jupyter/help/issues/480',
 'https://api.github.com/repos/jupyter/help/issues/89',
 'https://api.github.com/repos/jupyter/help/issues/241',
 'https://api.github.com/repos/jupyter/help/issues/426',
 'https://api.github.com/repos/jupyter/help/issues/150',
 'https://api.github.com/repos/jupyter/help/issues/175',
 'https://api.github.com/repos/jupyter/help/issues/476',
 'https://api.github.com/repos/jupyter/help/issues/138',
 'https://api.github.com/repos/jupyter/help/issues/478',
 'https://api.github.com/repos/jupyter/help/issues/369',
 'https://api.github.com/repos/jupyter/help/issues/50',
 'https://api.github.com/repos/jupyter/help/issues/190',
 'https://api.github.com/repos/jupyter/help/issues/524',
 'https://api.github.com/repos/jupyter/help/issues/525',
 'https://api.github.com/repos/jupyter/help/issues/526',
 'https://api.github.com/repos/jupyter/help/issues/527',
 'https://api.github.com/repos/jupyter/help/issues/531',
 'https://api.github.com/repos/ju

In [39]:
deleted_issues

[]

In [40]:
unpaired_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired
0,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/480#iss...,https://api.github.com/repos/jupyter/help/issu...,451353915,MDEyOklzc3VlQ29tbWVudDQ1MTM1MzkxNQ==,"{'login': 'PetitLepton', 'id': 9269985, 'node_...",2019-01-04T05:00:25Z,2019-01-04T05:00:25Z,NONE,"Hi,\r\nit turns out that the problem seems to ...",,False
1,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/480#iss...,https://api.github.com/repos/jupyter/help/issu...,451359024,MDEyOklzc3VlQ29tbWVudDQ1MTM1OTAyNA==,"{'login': 'PetitLepton', 'id': 9269985, 'node_...",2019-01-04T05:54:56Z,2019-01-04T05:54:56Z,NONE,This is a similar issue: https://github.com/ip...,,False
4,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/89#issu...,https://api.github.com/repos/jupyter/help/issu...,453770444,MDEyOklzc3VlQ29tbWVudDQ1Mzc3MDQ0NA==,"{'login': 'awatson31911', 'id': 25552118, 'nod...",2019-01-12T18:29:01Z,2019-01-12T18:29:01Z,NONE,"@d-demirci \r\nperfect, thanks a lot!",,False
9,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/241#iss...,https://api.github.com/repos/jupyter/help/issu...,455207852,MDEyOklzc3VlQ29tbWVudDQ1NTIwNzg1Mg==,"{'login': 'akaihola', 'id': 13725, 'node_id': ...",2019-01-17T15:16:25Z,2019-01-17T15:16:25Z,NONE,"As a work-around for the simplest use cases, t...",,False
12,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/426#iss...,https://api.github.com/repos/jupyter/help/issu...,455876051,MDEyOklzc3VlQ29tbWVudDQ1NTg3NjA1MQ==,"{'login': 'roitmaster', 'id': 17810956, 'node_...",2019-01-20T15:33:07Z,2019-01-20T15:33:07Z,NONE,# packages in environment at /Users/mroitman/D...,,False


In [41]:
if source_github:
    repo_comments.drop("paired", axis=1, inplace=True) # remove paired column 

#### Example Usage
The first element of each tuple is the topic DataFrame and the second is the comments DataFrame.

In [42]:
issue_threads = [(i, pd.DataFrame(c)) for i, c in issue_threads]

In [43]:
# example usage
num_comments = -1 # decrease this if StopIteration raised, just to show a conversation
ex_issue = next(x for x in issue_threads if len(x[1]) > num_comments)

In [44]:
ex_issue[0]

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,body,performed_via_github_app,pull_request
45,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/513,404462675,MDU6SXNzdWU0MDQ0NjI2NzU=,513,Successfully executed cells get stuck in busy ...,...,,0,2019-01-29T20:01:10Z,2019-01-29T20:01:10Z,,NONE,,I need help with this issue:\r\nhttps://github...,,


In [45]:
ex_issue[1]

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app


In [46]:
if source_github:
    url = ex_issue[0]["html_url"].values[0] # verify data at the issue url
    display(url)

'https://github.com/jupyter/help/issues/513'

### TODO
* <del> Retrieve closed issues </del>
* <del> Fix issue_threads to iterate over issue numbers in issue_heads </del>
* <del> Fix authentication limit </del>
* <del> Rework comment fetching to use non-deprecated API call and pagination </del>
* <del> Figure out issue with unpaired comments </del>
* <del> Remove 'paired' column from issue threads (comments) </del>
* <del> GitLab support </del>