## Extracting Issues from GitHub and GitLab Repositories
### Author: carterp@cs.uoregon.edu

Python implementation for extracting issues and comments from a GitHub repository via the GitHub REST API (https://docs.github.com/v3/) or from a GitLab project via the GitLab API (https://docs.gitlab.com/ee/api). Select the source with the `source_github` boolean.

In [185]:
import pandas as pd
import requests

source_github = False # True for GitHub, False for GitLab
session = requests.Session()

#### Rate Limiting
The GitHub API allows no more than 60 unauthenticated requests per hour (https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting), which is problematic for larger repositories (see **Pagination** below). At best, a single call returns up to 100 results per page, for both issues and comments. Authenticating with a personal access token raises the limit to 5000 requests per hour, which is sufficient for large repositories (see https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token for setting up a personal access token). GitLab rate limits at 600 requests per minute.
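
As a rough sketch of how a script could respect the GitHub limit, the helper below reads the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub attaches to its responses and computes how long to wait. The function name `seconds_until_reset` and the sample header values are assumptions for illustration and are not part of this notebook.

```python
import time

def seconds_until_reset(headers, now=None):
    """Return the number of seconds to wait, or 0 if requests remain."""
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # epoch seconds
    if remaining > 0:
        return 0
    return max(0, reset_at - now)

# Hypothetical header values, for illustration only
example_headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}
wait = seconds_until_reset(example_headers, now=940)
```

With a real `requests` response, `response.headers` could be passed in directly, since it behaves like a dictionary.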

In [186]:
# setup target repo and authentication for github or gitlab
if source_github:
    owner = "HPCL" # owner or organization of repo
    repo = "autoperf" # repo name
    user = "..." # GitHub username
    token = "..." # personal access token
    
    session.auth = (user, token)
    
    # setup payload params (except page number)
    per_page = 100 # max val is 100, fewest API calls
    state = "all" # grab open and closed issues
    
    # setup headers
    accept = "application/vnd.github.v3+json"
    headers = {"accept" : accept}
    topics_url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    comments_url = f"https://api.github.com/repos/{owner}/{repo}/issues/comments"
else:
    token = "...-sZMX" # personal access token
    project_id = "1293394" # GitLab project id
    
    session.headers = {"Private-Token" : token}
    headers = None # no extra per-request headers for GitLab; retrieve_data reads this name
    
    # setup payload params (except page number)
    per_page = 100 # max val is 100, fewest API calls
    scope = "all" # encompasses every issue
    state = "all" # grab open and closed issues
    with_labels_details = "true" # more info on labels
        
    topics_url = f"https://gitlab.com/api/v4/projects/{project_id}/issues"

In [187]:
def api_request(url, payload=None, headers=None):
    # Fetch from the API; returns None if the connection fails
    try:
        response = session.get(url, params=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        return None
    return response

#### Pagination
In repositories with no more than 100 issues/comments (maximum entries that can be grabbed per page), the API will return all the data on a single call. However, when this limit is exceeded it is necessary to use pagination (https://docs.github.com/en/rest/guides/traversing-with-pagination) and make additional API calls to fetch all the issues/comments for the repository.

##### Example:
$I$ = Number of issues in a repository\
$C$ = Number of comments in a repository \
$A(x) =  \lceil \frac{x}{100} \rceil$ is the total number of API calls \
If $I = 1184$, $C = 4862$, then the number of API calls is $A(I) + A(C) = \lceil \frac{1184}{100} \rceil + \lceil \frac{4862}{100} \rceil =  12 + 49 = 61$, which already exceeds the 60-request unauthenticated limit and shows the importance of using authenticated requests. The number can be slightly larger because of unpaired comments (see **Finding Unpaired Comments** below).
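
The arithmetic above can be checked with a few lines of Python. The helper name `api_calls` is hypothetical; it only restates $A(x)$ with the page size as a parameter.

```python
import math

def api_calls(n_items, per_page=100):
    """Number of paginated requests needed to fetch n_items."""
    return math.ceil(n_items / per_page)

total = api_calls(1184) + api_calls(4862)
print(total)  # 61
```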

In [188]:
def retrieve_data(url):
    # Retrieve full API information via pagination
    page = 1
    if source_github:
        payload = {"per_page" : per_page, "page" : page, "state" : state}
    else:
        payload = {"per_page" : per_page, "page" : page, "state" : state, "scope" : scope, 
                   "with_labels_details" : with_labels_details}
    
    response = api_request(url=url, payload=payload, headers=headers)
    data = response.json()

    # Case where pagination is not needed
    if not response.links:
        return pd.DataFrame(data)
    
    # Keep requesting pages while the response advertises a "next" link
    while "next" in response.links.keys():
        page += 1
        payload["page"] = page
        response = api_request(url=url, payload=payload, headers=headers)
        data += response.json()
        
    return pd.DataFrame(data)
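
The page-following loop in `retrieve_data` can be tried without network access. The sketch below is a simplified stand-in, not the notebook's function: `collect_pages`, `fake_fetch`, and `PAGES` are hypothetical names, and the fake fetcher returns a `has_next` flag in place of the `next` entry in `response.links`.

```python
def collect_pages(fetch_page):
    """Collect items from every page.

    fetch_page(page) is a hypothetical callable returning (items, has_next).
    """
    items, page = [], 1
    while True:
        batch, has_next = fetch_page(page)
        items.extend(batch)
        if not has_next:
            return items
        page += 1

# Fake three-page source standing in for the API
PAGES = {1: [1, 2], 2: [3, 4], 3: [5]}

def fake_fetch(page):
    # has_next is True while a following page exists
    return PAGES[page], (page + 1) in PAGES

all_items = collect_pages(fake_fetch)
```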

In [189]:
topics = retrieve_data(topics_url) # retrieves all topics

In [190]:
topics.head()

Unnamed: 0,id,iid,project_id,title,description,state,created_at,updated_at,closed_at,closed_by,...,time_stats,task_completion_status,weight,blocking_issues_count,has_tasks,_links,references,moved_to_id,health_status,task_status
0,69471626,407,1293394,Factura cancelada me aparece en saldo en contra,"Que tal, tengo el caso de que el monto de una ...",closed,2020-08-05T15:58:11.170Z,2020-08-06T15:16:52.424Z,2020-08-06T15:16:52.408Z,"{'id': 324889, 'name': 'Mauricio Baeza', 'user...",...,"{'time_estimate': 0, 'total_time_spent': 0, 'h...","{'count': 0, 'completed_count': 0}",,,False,{'self': 'https://gitlab.com/api/v4/projects/1...,"{'short': '#407', 'relative': '#407', 'full': ...",,,
1,69470616,406,1293394,Contraseña default,Tuve que escarbar un poco en el código para en...,closed,2020-08-05T15:27:41.825Z,2020-08-06T16:56:54.159Z,2020-08-06T15:15:07.528Z,"{'id': 324889, 'name': 'Mauricio Baeza', 'user...",...,"{'time_estimate': 0, 'total_time_spent': 0, 'h...","{'count': 0, 'completed_count': 0}",,,False,{'self': 'https://gitlab.com/api/v4/projects/1...,"{'short': '#406', 'relative': '#406', 'full': ...",,,
2,69470562,405,1293394,DEBUG - Need make conf.py,"\nQue tal, cuando llamo python main.py para cu...",closed,2020-08-05T15:25:49.701Z,2020-08-06T16:57:51.762Z,2020-08-06T15:14:12.575Z,"{'id': 324889, 'name': 'Mauricio Baeza', 'user...",...,"{'time_estimate': 0, 'total_time_spent': 0, 'h...","{'count': 0, 'completed_count': 0}",,,False,{'self': 'https://gitlab.com/api/v4/projects/1...,"{'short': '#405', 'relative': '#405', 'full': ...",,,
3,53812155,404,1293394,Usuario o contraseña incorrecta,"Buenas tardes, \n\nNo puedo hacer que el siste...",closed,2020-07-02T21:03:23.048Z,2020-07-14T02:54:33.639Z,2020-07-14T02:54:33.618Z,"{'id': 324889, 'name': 'Mauricio Baeza', 'user...",...,"{'time_estimate': 0, 'total_time_spent': 0, 'h...","{'count': 0, 'completed_count': 0}",,,False,{'self': 'https://gitlab.com/api/v4/projects/1...,"{'short': '#404', 'relative': '#404', 'full': ...",,,
4,49929061,403,1293394,Actualización de la MV,Buenas tardes estimados:\nMe gustaría mucho po...,opened,2020-06-22T22:47:30.244Z,2020-08-14T19:26:29.951Z,,,...,"{'time_estimate': 0, 'total_time_spent': 0, 'h...","{'count': 0, 'completed_count': 0}",,,False,{'self': 'https://gitlab.com/api/v4/projects/1...,"{'short': '#403', 'relative': '#403', 'full': ...",,,


In [191]:
repo_comments = pd.DataFrame(None)
if source_github: # GitLab only fetches comments for each issue
    repo_comments = retrieve_data(comments_url) # retrieves all comments
    repo_comments = repo_comments.assign(paired=False) # keep track of comments of a deleted issue

In [192]:
repo_comments.head()

#### Build Issue Threads
Create an issue thread from the topic (the initial problem/question) and its comments. Issue threads are in the form `(topic_df, comments_df)`.
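
To illustrate the pairing idea on its own, here is a minimal plain-Python sketch that uses lists of dictionaries instead of dataframes. The `url`, `issue_url`, and `created_at` keys mirror the GitHub fields used in this notebook; `group_comments` and the sample records are made up for illustration.

```python
from collections import defaultdict

def group_comments(topics, comments):
    """Pair each topic with its comments, most recent comment first."""
    by_issue = defaultdict(list)
    for comment in comments:
        by_issue[comment["issue_url"]].append(comment)
    threads = []
    for topic in topics:
        # ISO 8601 timestamps in the same format sort chronologically as strings
        ordered = sorted(by_issue.get(topic["url"], []),
                         key=lambda c: c["created_at"], reverse=True)
        threads.append((topic, ordered))
    return threads

# Made-up sample records
sample_topics = [{"url": "u/1", "title": "first"}, {"url": "u/2", "title": "second"}]
sample_comments = [
    {"issue_url": "u/1", "created_at": "2020-01-01T00:00:00Z", "body": "a"},
    {"issue_url": "u/1", "created_at": "2020-01-02T00:00:00Z", "body": "b"},
]
sample_threads = group_comments(sample_topics, sample_comments)
```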

In [193]:
def build_issue_threads(topics, repo_comments):
    issue_threads = []
    for index, row in topics.iterrows():
        if source_github:
            topic = topics.loc[topics["node_id"] == row["node_id"]]
            # sort by date (most recent first)
            comments = repo_comments.loc[repo_comments["issue_url"] == row["url"]].sort_values("created_at", ascending=False)
            comments = comments.drop(columns="paired") # remove paired column without modifying a slice in place
            repo_comments.loc[repo_comments["issue_url"] == row["url"], "paired"] = True
            issue_threads.append((topic, comments))
        else:
            topic = topics.loc[topics["iid"] == row["iid"]]
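            iid = row["iid"] # project-scoped issue id
            comments_url = f"https://gitlab.com/api/v4/projects/{project_id}/issues/{iid}/notes"
            # request up to 100 notes per issue; GitLab returns only 20 per page by default
            response = api_request(url=comments_url, payload={"per_page" : per_page})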
            comments = pd.DataFrame(response.json())
            issue_threads.append((topic, comments))
    
    return issue_threads

In [194]:
issue_threads = build_issue_threads(topics=topics, repo_comments=repo_comments)

#### Finding Unpaired Comments (i.e. Comments for Deleted Issues)
For larger repositories, some of the fetched comments reference a deleted issue. This was found when testing with https://github.com/jupyter/help, where only 19 of the 1669 comments fetched were unpaired. It does not apply to GitLab repositories, since comments are only fetched per issue.
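
A small sketch of the detection step, assuming plain dictionaries rather than dataframes: a comment counts as unpaired when its `issue_url` matches none of the fetched topics. The name `find_unpaired` and the sample data are hypothetical. The notebook additionally confirms that an issue is deleted by checking for a 404 response, which this sketch leaves out.

```python
def find_unpaired(topic_urls, comments):
    """Return comments whose issue_url matches no fetched topic."""
    known = set(topic_urls)
    return [c for c in comments if c["issue_url"] not in known]

# Made-up sample data
fetched_urls = ["u/1", "u/2"]
fetched_comments = [
    {"issue_url": "u/1", "body": "kept"},
    {"issue_url": "u/9", "body": "orphan"},
]
unpaired = find_unpaired(fetched_urls, fetched_comments)
orphan_urls = sorted({c["issue_url"] for c in unpaired})
```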

In [195]:
if source_github:
    deleted_issues = []

    unpaired_comments = repo_comments.loc[~repo_comments["paired"]] # comments never matched to a topic
    problematic_urls = unpaired_comments["issue_url"].unique().tolist() # unique issues
    for url in problematic_urls: # remove unpaired comments from repo_comments
        response = api_request(url=url)
        if response.status_code == 404:
            deleted_issues.append(url)
            repo_comments = repo_comments[repo_comments.issue_url != url]

In [196]:
problematic_urls

[]

In [197]:
deleted_issues

[]

In [198]:
unpaired_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired


In [199]:
if source_github:
    repo_comments.drop("paired", axis=1, inplace=True) # remove paired column 

#### Example Usage
The first element of each tuple is the topic dataframe and the second is the comments dataframe.

In [200]:
issue_threads = [(i, pd.DataFrame(c)) for i, c in issue_threads]

In [201]:
# example usage
num_comments = 2 # pick a thread long enough to show a conversation; lower this if StopIteration is raised
ex_issue = next(x for x in issue_threads if len(x[1]) > num_comments)

In [202]:
ex_issue[0]

Unnamed: 0,id,iid,project_id,title,description,state,created_at,updated_at,closed_at,closed_by,...,time_stats,task_completion_status,weight,blocking_issues_count,has_tasks,_links,references,moved_to_id,health_status,task_status
1,69470616,406,1293394,Contraseña default,Tuve que escarbar un poco en el código para en...,closed,2020-08-05T15:27:41.825Z,2020-08-06T16:56:54.159Z,2020-08-06T15:15:07.528Z,"{'id': 324889, 'name': 'Mauricio Baeza', 'user...",...,"{'time_estimate': 0, 'total_time_spent': 0, 'h...","{'count': 0, 'completed_count': 0}",,,False,{'self': 'https://gitlab.com/api/v4/projects/1...,"{'short': '#406', 'relative': '#406', 'full': ...",,,


In [203]:
ex_issue[1]

Unnamed: 0,id,type,body,attachment,author,created_at,updated_at,system,noteable_id,noteable_type,resolvable,confidential,noteable_iid,commands_changes
0,392014318,,"No, si lo cambias antes, este valor debe estab...",,"{'id': 324889, 'name': 'Mauricio Baeza', 'user...",2020-08-06T16:56:54.139Z,2020-08-06T16:56:54.139Z,False,69470616,Issue,False,False,406,{}
1,392012238,,Nunca lo dejo pero es necesario para entrar la...,,"{'id': 1741440, 'name': 'Luis García-Pimentel ...",2020-08-06T16:51:25.906Z,2020-08-06T16:51:25.906Z,False,69470616,Issue,False,False,406,{}
2,391964029,,¿escarbar? ya deberían conocer todo de pi a pa...,,"{'id': 324889, 'name': 'Mauricio Baeza', 'user...",2020-08-06T15:15:07.033Z,2020-08-06T15:15:07.033Z,False,69470616,Issue,False,False,406,{}
3,391386601,,changed title from **Con{-s-}traseña default**...,,"{'id': 1741440, 'name': 'Luis García-Pimentel ...",2020-08-05T15:50:46.778Z,2020-08-05T15:50:46.780Z,True,69470616,Issue,False,False,406,{}


In [204]:
if source_github:
    url = ex_issue[0]["html_url"].values[0] # verify data at the issue url
    display(url)

### TODO
* <del> Retrieve closed issues </del>
* <del> Fix issue_threads to iterate over issue numbers in issue_heads </del>
* <del> Fix authentication limit </del>
* <del> Rework comment fetching to use non-deprecated API call and pagination </del>
* <del> Figure out issue with unpaired comments </del>
* <del> Remove 'paired' column from issue threads (comments) </del>
* <del> GitLab support </del>