## Extracting Issues from GitHub and GitLab Repositories
### Author: carterp@cs.uoregon.edu

Python implementation for extracting issues and comments from a GitHub repository using the GitHub REST API (https://docs.github.com/v3/), or from a GitLab project using the GitLab API (https://docs.gitlab.com/ee/api).

In [870]:
import datetime
import pandas as pd
import requests

# date constants
LATEST = datetime.datetime(year=datetime.MAXYEAR, month=12, day=31) # latest entry to fetch
OLDEST = datetime.datetime(year=1970, month=1, day=1) # oldest entry to fetch (Unix epoch; utcfromtimestamp is deprecated since Python 3.12)

#### Configuration Step
Edit the values in this cell before running the rest of the notebook. Be sure to specify which source to use via the `source_github` boolean.

In [871]:
source_github = True # True for GitHub, False for GitLab

# example of a custom date range (uncomment and use date_range = (start, end)):
#start = datetime.datetime(year=2019, month=1, day=1)
#end = datetime.datetime(year=2019, month=1, day=31)

date_range = (OLDEST, LATEST) # range of results to fetch, use OLDEST, LATEST if you want to encompass all issues

if source_github:
    # authentication config
    user = "..." # GitHub username
    token = "..." # personal access token
    
    # target repository
    owner = "jupyter" # owner or organization of repo
    repo = "help" # repo name
    
    # filtering parameters
    state = "all" # filter by the state of the issue - options: all, open, closed
else:
    # authentication config
    token = "..." # personal access token
    
    # target repository
    project_id = "1441932" # GitLab project id
    
    # filtering parameters
    state = "all" # filter by the state of the issue - options: all, opened, closed

#### Rate Limiting
The GitHub API allows no more than 60 unauthenticated requests per hour (https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting), which is problematic for larger repositories (see **Pagination** below). At best, a single call returns up to 100 results per page, for both issues and comments. Using a personal access token for authenticated requests raises this limit to 5000 requests per hour, which is sufficient for large repositories (see https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token for setting up a personal access token). GitLab rate limits at 600 requests per minute.
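One way to see how much of the quota is left is to read the rate-limit headers that GitHub attaches to each response. The sketch below assumes the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers (the latter in UTC epoch seconds). The helper name `seconds_until_reset` is hypothetical and not part of this notebook.

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before sending the next request.

    `headers` is a mapping of response headers (e.g. `response.headers`).
    Returns 0 while requests remain in the current window.
    """
    # normalize header names, since plain dicts are case-sensitive
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0
    reset_at = int(lowered.get("x-ratelimit-reset", 0))  # UTC epoch seconds
    now = time.time() if now is None else now
    return max(0, reset_at - int(now))
```

Calling `time.sleep(seconds_until_reset(response.headers))` before the next request would then pause only when the quota is exhausted.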

In [872]:
session = requests.Session()

In [873]:
# pre-fetching payload & authentication setup
if source_github:
    # authenticate
    session.auth = (user, token)
    
    # setup headers & params
    since = date_range[0].isoformat() # issues updated at or later
    accept = "application/vnd.github.v3+json"
    headers = {"accept" : accept}
    topics_url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    comments_url = f"https://api.github.com/repos/{owner}/{repo}/issues/comments"
else:
    # authenticate
    session.headers.update({"Private-Token" : token}) # keep the session's default headers
    
    # setup headers & params
    headers = None # retrieve_data passes headers; GitLab needs no extra per-request headers
    scope = "all" # include every issue
    with_labels_details = "true" # more info on labels
    created_after, created_before = date_range # translate date_range into ISO 8601 strings
    created_after, created_before = created_after.isoformat(), created_before.isoformat()
    topics_url = f"https://gitlab.com/api/v4/projects/{project_id}/issues"

In [874]:
def api_request(url, payload=None, headers=None):
    # Fetch from API
    try:
        response = session.get(url, params=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        return None
    return response

#### Pagination
In repositories with no more than 100 issues/comments (the maximum number of entries that can be fetched per page), the API returns all the data in a single call. When this limit is exceeded, it is necessary to use pagination (https://docs.github.com/en/rest/guides/traversing-with-pagination) and make additional API calls to fetch all the issues/comments for the repository.
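The paging loop itself can be illustrated without touching the network. In this minimal sketch, `collect_pages`, `fetch_page` and `fake_fetch` are hypothetical names: `fetch_page(page)` stands in for one API call and reports whether a further page exists, much like the "next" link in a paginated response.

```python
def collect_pages(fetch_page, first_page=1):
    """Gather the items from every page.

    `fetch_page(page)` is a hypothetical callable that returns a tuple
    `(items, has_next)` for the given page number.
    """
    items = []
    page = first_page
    while True:
        page_items, has_next = fetch_page(page)
        items.extend(page_items)
        if not has_next:
            return items
        page += 1


# stand-in for the API: 250 made-up entries served 100 per page
entries = list(range(250))

def fake_fetch(page, per_page=100):
    start = (page - 1) * per_page
    chunk = entries[start:start + per_page]
    return chunk, start + per_page < len(entries)

all_entries = collect_pages(fake_fetch)
```

Here three calls are made (pages 1 to 3), and the loop stops as soon as a page reports that nothing follows it.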

##### Example:
$I$ = Number of issues in a repository\
$C$ = Number of comments in a repository \
$A(x) = \lceil \frac{x}{100} \rceil$ is the number of API calls needed to fetch $x$ entries \
If $I = 1184$, $C = 4862$, then the number of API calls is $A(I) + A(C) = \lceil \frac{1184}{100} \rceil + \lceil \frac{4862}{100} \rceil =  12 + 49 = 61$, which already exceeds the 60 unauthenticated requests allowed per hour and shows the importance of using authenticated requests. The actual number can be slightly larger, since each issue URL referenced by unpaired comments costs one additional request (see **Finding Unpaired Comments** below).
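The same arithmetic can be checked in a few lines of Python. The function name `api_calls` is only illustrative, and the page size of 100 is the maximum quoted above.

```python
import math

PER_PAGE = 100  # maximum entries per page

def api_calls(entries, per_page=PER_PAGE):
    """Number of requests needed to fetch `entries` items."""
    return math.ceil(entries / per_page)

total_calls = api_calls(1184) + api_calls(4862)
print(total_calls)  # 12 + 49
```

Note that the formula gives 0 for an empty repository, although in practice one request is still made to find that out.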

In [875]:
def retrieve_data(url):
    # Retrieve full API information via pagination
    page = 1
    per_page = 100
    if source_github:
        payload = {"per_page" : per_page, "page" : page, "state" : state, "since" : since}
    else:
        payload = {"per_page" : per_page, "page" : page, "state" : state, "scope" : scope, 
                   "with_labels_details" : with_labels_details, "created_after" : created_after,
                  "created_before" : created_before}
    
    response = api_request(url=url, payload=payload, headers=headers)
    data = response.json()

    # Case where pagination is not needed
    if not response.links:
        return pd.DataFrame(data)
    
    # Follow the "next" link in the response headers until the last page is reached
    while "next" in response.links:
        page += 1
        payload["page"] = page
        response = api_request(url=url, payload=payload, headers=headers)
        data += response.json()
        
    return pd.DataFrame(data)

In [876]:
topics = retrieve_data(topics_url) # retrieves all topics

In [877]:
# apply date range (uses the configured date_range rather than the OLDEST/LATEST constants)
if not topics.empty:
    start, end = date_range
    mask = (topics['created_at'] >= start.isoformat()) & (topics['created_at'] <= end.isoformat())
    topics = topics.loc[mask]

In [878]:
topics.head()

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,body,performed_via_github_app,pull_request
0,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/558,424444960,MDU6SXNzdWU0MjQ0NDQ5NjA=,558,ipython: How do I set scrollback history length?,...,,0,2019-03-23T00:45:58Z,2019-03-23T00:45:58Z,,NONE,,My readline/scrollback history is limited to a...,,
1,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/pull/557,423897771,MDExOlB1bGxSZXF1ZXN0MjYzMzY5MTMy,557,Deprecate this repo,...,,3,2019-03-21T19:12:59Z,2019-03-21T20:10:19Z,2019-03-21T20:08:49Z,MEMBER,,@choldgraf @consideRatio fyi for review,,{'url': 'https://api.github.com/repos/jupyter/...
2,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/556,423721239,MDU6SXNzdWU0MjM3MjEyMzk=,556,"installed notebook, service starts, blank browser",...,,4,2019-03-21T13:10:40Z,2019-03-25T08:59:16Z,,NONE,,"Hi I used pip install, installed no problems, ...",,
3,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/555,423541043,MDU6SXNzdWU0MjM1NDEwNDM=,555,Detailed reference for customizing Markdown wi...,...,,0,2019-03-21T01:36:20Z,2019-03-21T01:36:20Z,,NONE,,"Hi,\r\nI want to customize my markdown cells. ...",,
4,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/554,423318438,MDU6SXNzdWU0MjMzMTg0Mzg=,554,Unloading Jupyter notebook,...,,0,2019-03-20T15:30:43Z,2019-03-20T15:30:43Z,,NONE,,"Please, I could not open jupyter notebook to r...",,


In [879]:
repo_comments = pd.DataFrame(None)
if source_github: # for GitLab, comments are fetched per issue in build_issue_threads
    repo_comments = retrieve_data(comments_url) # retrieves all comments
    repo_comments = repo_comments.assign(paired=False) # keep track of comments of a deleted issue

In [880]:
repo_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired
0,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/pull/1#issueco...,https://api.github.com/repos/jupyter/help/issu...,204612181,MDEyOklzc3VlQ29tbWVudDIwNDYxMjE4MQ==,"{'login': 'willingc', 'id': 2680980, 'node_id'...",2016-04-01T23:57:58Z,2016-04-01T23:57:58Z,MEMBER,Thanks @jhamrick @rgbkrk and @Carreau \n,,False
1,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/pull/2#issueco...,https://api.github.com/repos/jupyter/help/issu...,204612299,MDEyOklzc3VlQ29tbWVudDIwNDYxMjI5OQ==,"{'login': 'willingc', 'id': 2680980, 'node_id'...",2016-04-01T23:59:04Z,2016-04-01T23:59:04Z,MEMBER,Thanks @rgbkrk \n,,False
2,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/3#issue...,https://api.github.com/repos/jupyter/help/issu...,204758505,MDEyOklzc3VlQ29tbWVudDIwNDc1ODUwNQ==,"{'login': 'willingc', 'id': 2680980, 'node_id'...",2016-04-02T17:05:46Z,2016-04-02T17:05:46Z,MEMBER,"Hi @danieldmm,\nI tried your examples using th...",,False
3,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/3#issue...,https://api.github.com/repos/jupyter/help/issu...,204760336,MDEyOklzc3VlQ29tbWVudDIwNDc2MDMzNg==,"{'login': 'takluyver', 'id': 327925, 'node_id'...",2016-04-02T17:18:58Z,2016-04-02T17:18:58Z,MEMBER,"If stuff gets stuck, you can try interrupting ...",,False
4,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/pull/4#issueco...,https://api.github.com/repos/jupyter/help/issu...,204832384,MDEyOklzc3VlQ29tbWVudDIwNDgzMjM4NA==,"{'login': 'rgbkrk', 'id': 836375, 'node_id': '...",2016-04-03T00:15:33Z,2016-04-03T00:15:33Z,MEMBER,"Looks like this needs a rebase, happy to merge.\n",,False


#### Build Issue Threads
Create an issue thread from the topic (the initial problem/question) and its comments. Issue threads are in the form `(topic_df, comments_df)`.
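The pairing idea can be shown on plain dictionaries before looking at the DataFrame version. This is a sketch under stated assumptions, not the notebook's implementation: `group_comments` is a hypothetical helper, the records are made up, and only the `url`, `issue_url` and `created_at` fields mirror the API columns used here.

```python
from collections import defaultdict

def group_comments(topics, comments):
    """Pair each topic with its comments, most recent first."""
    by_issue = defaultdict(list)
    for comment in comments:
        by_issue[comment["issue_url"]].append(comment)

    threads = []
    for topic in topics:
        thread = sorted(by_issue.get(topic["url"], []),
                        key=lambda c: c["created_at"], reverse=True)
        threads.append((topic, thread))
    return threads


# made-up records for illustration only
sample_topics = [{"url": "https://example.com/issues/1"},
                 {"url": "https://example.com/issues/2"}]
sample_comments = [
    {"issue_url": "https://example.com/issues/1",
     "created_at": "2019-01-01T00:00:00Z", "body": "first"},
    {"issue_url": "https://example.com/issues/1",
     "created_at": "2019-01-02T00:00:00Z", "body": "second"},
]
threads = group_comments(sample_topics, sample_comments)
```

Grouping the comments once up front avoids scanning the whole comment list again for every topic.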

In [881]:
def build_issue_threads(topics, repo_comments):
    issue_threads = []
    for index, row in topics.iterrows():
        if source_github:
            topic = topics.loc[topics["node_id"] == row["node_id"]]
            # sort by date (most recent first)
            comments = repo_comments.loc[repo_comments["issue_url"] == row["url"]].sort_values("created_at", ascending=False)
            comments = comments.drop(columns="paired") # remove paired column
            repo_comments.loc[repo_comments["issue_url"] == row["url"], "paired"] = True
            issue_threads.append((topic, comments))
        else:
            # this is really slow: one API request per issue
            topic = topics.loc[topics["iid"] == row["iid"]]
            iid = row["iid"] # issue iid (project-internal id)
            comments_url =  f"https://gitlab.com/api/v4/projects/{project_id}/issues/{iid}/notes"
            response = api_request(url=comments_url)
            comments = pd.DataFrame(response.json())
            issue_threads.append((topic, comments))
    
    return issue_threads

In [882]:
issue_threads = build_issue_threads(topics=topics, repo_comments=repo_comments)

#### Finding Unpaired Comments (i.e. Comments for Deleted Issues)
In larger repositories, some of the fetched comments reference a deleted issue; this was found when testing with https://github.com/jupyter/help. With that repository, only 19 of the 1669 comments fetched were unpaired. This does not apply to GitLab, where comments are only fetched per issue. Note that narrowing `date_range` will increase the number of problematic URLs, because comments whose issues fall outside the range are also left unpaired.
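Conceptually, the problematic URLs are the issue URLs that appear among the comments but not among the topics. A minimal sketch with made-up URLs follows; `unpaired_issue_urls` is a hypothetical helper, not a function defined in this notebook.

```python
def unpaired_issue_urls(topic_urls, comment_issue_urls):
    """Issue URLs referenced by comments but absent from the topics."""
    known = set(topic_urls)
    # dict.fromkeys de-duplicates while keeping first-seen order
    return [url for url in dict.fromkeys(comment_issue_urls) if url not in known]


# made-up URLs for illustration only
topic_urls = ["https://example.com/issues/1", "https://example.com/issues/2"]
comment_issue_urls = [
    "https://example.com/issues/1",
    "https://example.com/issues/3",
    "https://example.com/issues/3",
]
candidates = unpaired_issue_urls(topic_urls, comment_issue_urls)
```

Each candidate still has to be requested once to tell a deleted issue (404) apart from one that merely falls outside the selected range.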

In [883]:
if source_github and not topics.empty:
    deleted_issues = []

    unpaired_comments = repo_comments.loc[repo_comments["paired"] == False]
    problematic_urls = unpaired_comments["issue_url"].unique().tolist() # unique issues
    for url in problematic_urls: # remove comments whose issue no longer exists
        response = api_request(url=url)
        if response is not None and response.status_code == 404: # issue was deleted
            deleted_issues.append(url)
            repo_comments = repo_comments[repo_comments.issue_url != url]

In [884]:
problematic_urls

['https://api.github.com/repos/jupyter/help/issues/63']

In [885]:
deleted_issues

['https://api.github.com/repos/jupyter/help/issues/63']

In [886]:
unpaired_comments.head()

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app,paired
340,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/63#issu...,https://api.github.com/repos/jupyter/help/issu...,232642018,MDEyOklzc3VlQ29tbWVudDIzMjY0MjAxOA==,"{'login': 'parente', 'id': 153745, 'node_id': ...",2016-07-14T11:42:22Z,2016-07-14T11:42:22Z,MEMBER,@dvigneshwer The https://github.com/jupyter-in...,,False
341,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/63#issu...,https://api.github.com/repos/jupyter/help/issu...,232659623,MDEyOklzc3VlQ29tbWVudDIzMjY1OTYyMw==,"{'login': 'parente', 'id': 153745, 'node_id': ...",2016-07-14T13:05:37Z,2016-07-14T13:05:37Z,MEMBER,Try this as written:\n\n`jupyter cms install -...,,False
342,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/63#issu...,https://api.github.com/repos/jupyter/help/issu...,232678008,MDEyOklzc3VlQ29tbWVudDIzMjY3ODAwOA==,"{'login': 'parente', 'id': 153745, 'node_id': ...",2016-07-14T14:15:23Z,2016-07-14T14:15:23Z,MEMBER,Are you automating the install in Jupyter Hub ...,,False
343,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/63#issu...,https://api.github.com/repos/jupyter/help/issu...,232870094,MDEyOklzc3VlQ29tbWVudDIzMjg3MDA5NA==,"{'login': 'shubhjain26', 'id': 20471277, 'node...",2016-07-15T06:27:04Z,2016-07-15T06:27:04Z,NONE,![jupytererror](https://cloud.githubuserconten...,,False
344,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/63#issu...,https://api.github.com/repos/jupyter/help/issu...,232918418,MDEyOklzc3VlQ29tbWVudDIzMjkxODQxOA==,"{'login': 'parente', 'id': 153745, 'node_id': ...",2016-07-15T10:33:25Z,2016-07-15T10:33:25Z,MEMBER,Something is wrong if you see the quick setup ...,,False


In [887]:
if source_github:
    repo_comments.drop("paired", axis=1, inplace=True) # remove paired column 

#### Example Usage
The first element of each tuple is the topic DataFrame and the second is the comments DataFrame.
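Because each thread is a plain `(topic, comments)` tuple, ordinary Python tools apply to the list. As a small sketch, the hypothetical helper `longest_thread` picks the thread with the most comments; lists stand in for the DataFrames here, since `len` works on both.

```python
def longest_thread(issue_threads):
    """Return the (topic, comments) pair with the most comments."""
    return max(issue_threads, key=lambda thread: len(thread[1]))


# made-up threads for illustration only
sample_threads = [
    ("topic A", ["c1"]),
    ("topic B", ["c1", "c2", "c3"]),
    ("topic C", []),
]
busiest = longest_thread(sample_threads)
```

`max` raises `ValueError` on an empty list, so this assumes at least one thread was built.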

In [888]:
issue_threads = [(i, pd.DataFrame(c)) for i, c in issue_threads]

In [889]:
# example usage
num_comments = -1 # picks the first thread with more comments than this; raise it to show a longer conversation, lower it if StopIteration is raised
ex_issue = next(x for x in issue_threads if len(x[1]) > num_comments)

In [890]:
ex_issue[0]

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,body,performed_via_github_app,pull_request
0,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://api.github.com/repos/jupyter/help/issu...,https://github.com/jupyter/help/issues/558,424444960,MDU6SXNzdWU0MjQ0NDQ5NjA=,558,ipython: How do I set scrollback history length?,...,,0,2019-03-23T00:45:58Z,2019-03-23T00:45:58Z,,NONE,,My readline/scrollback history is limited to a...,,


In [891]:
ex_issue[1]

Unnamed: 0,url,html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,performed_via_github_app


In [892]:
if source_github:
    url = ex_issue[0]["html_url"].values[0] # verify data at the issue url
    display(url)

'https://github.com/jupyter/help/issues/558'

### TODO
* <del> Retrieve closed issues </del>
* <del> Fix issue_threads to iterate over issue numbers in issue_heads </del>
* <del> Fix authentication limit </del>
* <del> Rework comment fetching to use non-deprecated API call and pagination </del>
* <del> Figure out issue with unpaired comments </del>
* <del> Remove 'paired' column from issue threads (comments) </del>
* <del> GitLab support </del>