# Setup

* First, we import the necessary libraries.
* Then we load the configuration file.
  * The config file contains information such as the database configuration and the API key.
* Then we set up the SQLite database.

In [None]:
import requests
import time
import sqlite3
from datetime import datetime
from IPython.display import JSON
import json
import configparser as cp
import pandas as pd
import asyncio
import aiofiles
import aiohttp
import nest_asyncio


In [None]:
"""
    Reads the config file and returns the config object
"""
config = cp.ConfigParser()
config.read('config.ini')

['config.ini']

## SQLite

I decided to use a SQLite database instead of simply keeping the data in memory or in a JSON file, not just because of the size of the dataset, but also because of the inherent structure of the data itself.

The many-to-many relationship between the repos and the contributors is expected to be queried a lot, so it makes more sense to have a database which can handle relational queries than to join dataframes manually.

In [None]:
db = sqlite3.connect(config['DB']['NAME'])
   

db.execute("""
           CREATE TABLE IF NOT EXISTS users (
               id TEXT PRIMARY KEY,
               name TEXT NOT NULL,
               location TEXT,
               addedToDB datetime,
               createdAt datetime
            )
              """)

db.execute(""" 
           CREATE TABLE IF NOT EXISTS repos (
               id TEXT PRIMARY KEY,
               name TEXT NOT NULL,
               owner TEXT,
               url TEXT,
               stargazerCount INTEGER,
               watchers INTEGER,
               primaryLanguage text,
               isFork boolean,
               forkCount INTEGER,
               updatedAt datetime,
               createdAt datetime,
               addedToDB datetime,
               allCommits boolean,
               FOREIGN KEY (owner) REFERENCES users(id)
            )
           """)

db.execute("""
           CREATE TABLE IF NOT EXISTS commits (
               id TEXT PRIMARY KEY,
               repo TEXT NOT NULL,
               user TEXT,
               createdAt datetime,
               additions INTEGER,
               deletions INTEGER,
               addedToDB datetime,
               FOREIGN KEY (repo) REFERENCES repos(id),
               FOREIGN KEY (user) REFERENCES users(id)
            )
            """)

db.commit()
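
To illustrate the kind of relational request this schema is meant for, here is a minimal sketch that counts commits per contributor and repository with a single join. It assumes reduced versions of the three tables above and a handful of made-up rows in an in-memory database (`demo_db` and all sample values are illustrative only):

```python
import sqlite3

# In-memory database with reduced versions of the three tables
demo_db = sqlite3.connect(":memory:")
demo_db.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE repos (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE commits (id TEXT PRIMARY KEY, repo TEXT NOT NULL, user TEXT);
    INSERT INTO users VALUES ('u1', 'alice'), ('u2', 'bob');
    INSERT INTO repos VALUES ('r1', 'waves'), ('r2', 'mars');
    INSERT INTO commits VALUES
        ('c1', 'r1', 'u1'), ('c2', 'r1', 'u2'), ('c3', 'r2', 'u1'), ('c4', 'r1', 'u1');
""")

# The commits table links repos and users, so one join resolves the relationship
contributors = demo_db.execute("""
    SELECT repos.name, users.name, COUNT(*) AS commit_count
    FROM commits
    JOIN repos ON commits.repo = repos.id
    JOIN users ON commits.user = users.id
    GROUP BY repos.id, users.id
    ORDER BY repos.name, users.name
""").fetchall()
print(contributors)
```

Doing the same with dataframes would take two merges followed by a group-by, which is the manual joining the database is supposed to spare us.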

In [None]:
# Query to insert a repo into the database
INSERT_REPO = """
                        INSERT OR IGNORE INTO repos(
                            id,
                            name,
                            owner,
                            url,
                            stargazerCount,
                            watchers,
                            %s
                            isFork,
                            forkCount,
                            updatedAt,
                            createdAt,
                            addedToDB,
                            allCommits
                        )
                        VALUES("%s", "%s", "%s","%s",  %s,  %s, %s %s, %s, %s, %s, CURRENT_TIMESTAMP, FALSE)
                        """

# Query to insert a user into the database
# TODO change to parametrized executemany query
# [https://stackoverflow.com/questions/5616895/how-do-i-use-prepared-statements-for-inserting-multiple-records-in-sqlite-using]
INSERT_USER = """
                        INSERT OR IGNORE INTO users(id, name %s %s, addedToDB)
                        VALUES( "%s", "%s" %s %s, CURRENT_TIMESTAMP)
                    """

# Query to insert a commit into the database
INSERT_COMMIT = """
                          INSERT OR IGNORE INTO commits(id, repo, user, createdAt, additions, deletions, addedToDB)
                          VALUES("%s", "%s", "%s", %s, %s, %s, CURRENT_TIMESTAMP)
                      """

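The TODO above points toward parameterized queries. A minimal sketch of what that could look like with the `?` placeholders of `sqlite3`, assuming the `users` table from above in an in-memory database (`demo_db`, `INSERT_USER_PARAM` and the sample rows are illustrative and not part of the actual pipeline):

```python
import sqlite3

# In-memory database so the sketch does not touch the real data
demo_db = sqlite3.connect(":memory:")
demo_db.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        location TEXT,
        addedToDB datetime,
        createdAt datetime
    )
""")

# "?" placeholders let sqlite3 handle the quoting; missing values are passed as None (NULL)
INSERT_USER_PARAM = """
    INSERT OR IGNORE INTO users(id, name, location, createdAt, addedToDB)
    VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)
"""

# Hypothetical sample rows; the apostrophe needs no manual escaping
sample_users = [
    ("U_1", "alice", "Val d'Isère", 1204329600.0),
    ("U_2", "bob", None, None),
    ("U_1", "alice", "duplicate id, ignored", None),
]

demo_db.executemany(INSERT_USER_PARAM, sample_users)
demo_db.commit()

rows = demo_db.execute("SELECT id, name, location FROM users ORDER BY id").fetchall()
print(rows)
```

With placeholders the column list stays fixed, so the optional `%s` slots for `createdAt` and `location` would no longer be needed.
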
## Splitting up the Queries

Due to GitHub's limit of 1000 items returned per query\[1\], we need to create queries which each return fewer than 1000 items but together still cover the entire dataset.

Previous attempts\[2\] to solve this exact problem constrained their queries by the number of stars of each repository.
This method only works as long as there are fewer than 1000 repositories with the same number of stars.

This was then mitigated by using the creation date of the repository as a second constraint.
As described in the corresponding blog article\[3\], this solution works by:

* First querying the GitHub GraphQL API for the result count, i.e. how many items a given query would return
* If the count is above 1000, taking the creation dates of the youngest and oldest repository and splitting the query's time range in half
* Then checking the size of these two queries again and, if they are still above 1000 results, repeating the process until every query is below 1000 results
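
The steps above can be sketched as a small recursive function. This is only an illustration of the halving idea: `split_range` and `count_results` are hypothetical names, and `fake_count` stands in for the GraphQL request that would return the real result count.

```python
from datetime import datetime, timezone

def split_range(start, end, count_results, limit=1000):
    """Recursively halve [start, end] until every sub-range holds at most `limit` results.

    `start` and `end` are Unix timestamps; `count_results(start, end)` is a
    hypothetical callable standing in for the API request that returns the
    result count of a query restricted to that creation-date range.
    """
    if count_results(start, end) <= limit or end - start <= 1:
        # Small enough, or the range cannot be halved any further
        return [(start, end)]
    middle = (start + end) // 2
    return (split_range(start, middle, count_results, limit)
            + split_range(middle + 1, end, count_results, limit))

# Stand-in for the API: pretend one repository is created every 60 seconds
def fake_count(start, end):
    return (end - start) // 60 + 1

first = int(datetime(2008, 1, 1, tzinfo=timezone.utc).timestamp())
last = int(datetime(2008, 1, 3, tzinfo=timezone.utc).timestamp())
ranges = split_range(first, last, fake_count)
print(len(ranges), "ranges, largest holds", max(fake_count(s, e) for s, e in ranges), "items")
```

The sub-ranges do not overlap and together cover the full time span, so no repository is queried twice or skipped.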

In [None]:
# Convert a Unix timestamp to the UTC string format required by the GitHub API (the trailing "Z" denotes UTC)
from datetime import timezone
to_string = lambda stamp : datetime.fromtimestamp(stamp, tz=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

### Adding the data to the database

While we are querying the API, we also need a way to store the data we are getting.
It doesn't make sense to keep all of the data in memory and serialize it later on: the data would be too large, and if we were to run into an error we can't recover from (like a longer than anticipated connection issue), we would lose all the data we had already queried.

Therefore we store the data in the previously set up SQLite database, by preparing the insert statements now and executing them as we get the data from the API.
This is done by adding each repository to the database and assigning it a creator.
If that creator does not already exist in the database, we add it as well.

The data we receive does not always contain the same information, so we need to check whether each field is present and, if not, add a null value or an empty string to the database.
The empty strings can later be cleaned or replaced as needed during the data analysis.
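
A minimal sketch of such a presence check, assuming the user object arrives as a dict decoded from the JSON response (`user_row` is a hypothetical helper, not a function used elsewhere in this notebook):

```python
def user_row(user):
    """Map a user object from the API response to (id, name, location, createdAt).

    `user` is a dict decoded from JSON, or None if the API returned no user.
    """
    if not user:
        return None
    return (
        user["id"],
        user.get("login") or "",   # name is NOT NULL in the schema, so fall back to ""
        user.get("location"),      # None ends up as NULL in the database
        user.get("createdAt"),
    )

print(user_row({"id": "U_1", "login": "alice"}))
print(user_row(None))
```

`dict.get` returns `None` for missing keys, which keeps the check for each optional field down to a single expression.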

The upper part shows the code for adding the data to the database once it is received from the API in the form of a JSON object.
Below that is the actual query we send to the API, which is written as a GraphQL query.

todo: fill table

| Key | Type | Description |
| --- | --- | --- |
| id | String | The id of the repository |
....

For logging purposes we also add the date of the query to the database, so we can see how the data changes over time.


## Querying the GitHub API

Now that we can query GitHub in manageable chunks, we can start to query the API.
We are still using the GraphQL API for this, as it enables us to fetch only the data we actually need.
The REST API would require us to fetch the entire repository object, which contains a lot of unnecessary and redundant data.

Just because we can now request these bite-sized chunks of data doesn't mean that a single query will actually return all of them.
In order to keep loading times low, GitHub uses pagination to limit the amount of data returned per query.
This means that we can only get 100 items per request, which is why we need to use the `endCursor` to get the next 100 items.

The cursor functions like a little bookmark which tells the API where we left off and where to continue from; it needs to be passed as a parameter to the next query.
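
The bookmark idea can be sketched independently of the real API. In the following illustration `fetch_page` is a hypothetical callable standing in for one request, and `fake_fetch_page` serves made-up data in pages of 100; real cursors are opaque strings rather than numbers.

```python
def fetch_all(fetch_page):
    """Collect the items of every page by following `endCursor` until `hasNextPage` is False.

    `fetch_page(cursor)` is a hypothetical callable standing in for one API
    request; it returns a dict shaped like the `nodes` / `pageInfo` part of
    a GraphQL connection.
    """
    items, cursor, has_next_page = [], None, True
    while has_next_page:
        page = fetch_page(cursor)
        items.extend(page["nodes"])
        # Remember where this page ended so the next request continues from there
        cursor = page["pageInfo"]["endCursor"]
        has_next_page = page["pageInfo"]["hasNextPage"]
    return items

# Stand-in for the API: 250 items served in pages of 100
DATA = list(range(250))

def fake_fetch_page(cursor, page_size=100):
    start = 0 if cursor is None else int(cursor)
    end = min(start + page_size, len(DATA))
    return {
        "nodes": DATA[start:end],
        "pageInfo": {"endCursor": str(end), "hasNextPage": end < len(DATA)},
    }

all_items = fetch_all(fake_fetch_page)
print(len(all_items))
```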

### Querying the Repos

The query itself consists of 4 parts:

1. A little snippet requesting the current state of the rate limit, so we can keep track of how many requests we have left and when to stop
2. The filter for the repositories, consisting of the following:
    * Only repositories which are public (this is a bit redundant, as the API only returns public repositories or the ones you have access to)
    * A lower limit on the number of stars the repository has; everything below 15 is ignored, as it indicates little relevance
    * The date range of the repositories; this is where we plug in our previously calculated date ranges
3. Some metadata about the query itself: the total count of items, the cursor for the next page, and whether there is a next page at all
4. The exact values we are interested in:
   1. Information about the repository itself, like the name, the URL, the description, the creation date, the number of stars and the number of forks
   2. Information about its creator, like the name, the profile creation date and the id
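
The following is a sketch of how these four parts could fit together, assuming the `search` connection of the GitHub GraphQL API. It is not necessarily the exact query used for this dataset; `REPO_QUERY` and `build_repo_query` are illustrative names, and the selection of fields follows the list above.

```python
REPO_QUERY = """
{
  rateLimit {
    cost
    remaining
    resetAt
  }
  search(query: "is:public stars:>=15 created:%s..%s", type: REPOSITORY, first: 100 %s) {
    repositoryCount
    pageInfo {
      endCursor
      hasNextPage
    }
    nodes {
      ... on Repository {
        id
        name
        url
        description
        createdAt
        stargazerCount
        forkCount
        owner {
          id
          login
          ... on User {
            createdAt
          }
        }
      }
    }
  }
}
"""

def build_repo_query(created_from, created_to, cursor=None):
    # The cursor argument is only added from the second page onwards
    after = ', after: "%s"' % cursor if cursor else ""
    return REPO_QUERY % (created_from, created_to, after)

print(build_repo_query("2008-01-01T00:00:00Z", "2008-02-01T00:00:00Z"))
```

Part 1 is the `rateLimit` block, part 2 the search string, part 3 `repositoryCount` and `pageInfo`, and part 4 the fields inside `nodes`.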

### Queries for the Commits

But we do not only want the data about the repositories and the person who initially created them, we also want to know who actually worked on them.
This is where the commits come in...

The commits are the actual changes to the code, which are made by contributors.
We could have queried them as part of the repository query, but this seems to overload the API and would also make the code utterly unreadable.

This is why we simply query the commits for every repository separately, which makes the query itself a lot simpler.

To get to the commits we need the creator name and the repository name, as the API does not accept the repository id itself.
Unfortunately the creator name is not always provided by our query, because not only users but also organizations can create repositories.
If a repository is created by an organization, the creator name is not provided by the API call directly.

Indirectly we can get the creator name by splitting the repository name at the `/` and taking the first part of the string, which is the name of the organization.
With this information we can then query the API for the commits.
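
As a small illustration, the same split can be applied to the stored repository URL, assuming it has the form `https://github.com/<owner>/<name>` (`owner_and_name` is a hypothetical helper):

```python
def owner_and_name(url):
    """Return (owner, name) from a URL of the form https://github.com/<owner>/<name>."""
    parts = url.split("/")
    # parts is ["https:", "", "github.com", "<owner>", "<name>"]
    return parts[3], parts[4]

print(owner_and_name("https://github.com/defunkt/matzbot"))
```

This works the same way for users and organizations, since both appear as the first path segment.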

Structurally the query is very similar to the one for the repositories: we are still using the GraphQL API and we are still traversing the result pages with a cursor.

#### Parallelizing the queries

#TODO include sources

Querying the 1,236,664 repositories took several hours, because we didn't start to query the next repo until we got the response for the previous one. Doing it that way is called sequential or synchronous querying.
It takes a lot of time, but I just left it running overnight and it was done in the morning. :)

But with every repo having X commits on average, resulting in X queries for every repo, the simple synchronous approach would result in a lot of waiting time....
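
To show the difference, here is a minimal sketch of concurrent querying with `asyncio`. The requests are simulated with `asyncio.sleep`; `fake_query`, `query_all` and the limit of 10 concurrent requests are assumptions made for the illustration.

```python
import asyncio

async def fake_query(repo, delay=0.05):
    # Stand-in for one API request; the sleep simulates network latency
    await asyncio.sleep(delay)
    return f"{repo}: done"

async def query_all(repos, max_concurrent=10):
    # The semaphore caps how many requests are in flight at the same time
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(repo):
        async with semaphore:
            return await fake_query(repo)

    # gather schedules all coroutines and returns the results in input order
    return await asyncio.gather(*(limited(repo) for repo in repos))

repos = [f"owner/repo{i}" for i in range(50)]
results = asyncio.run(query_all(repos))
print(len(results), results[0])
```

While one request waits for its response the others can proceed, so the total time is governed by the number of rounds rather than by the sum of all delays. Inside a notebook with an already running event loop, `await query_all(repos)` would take the place of `asyncio.run`.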



In [None]:
commit_query = """
                 {
                  rateLimit {
                    cost
                    remaining
                    resetAt
                  }
                  repository(name: "%s", owner: "%s") {
                    id,
                    defaultBranchRef {
                      target {
                        ... on Commit {
                          id
                          history(first: 100 %s) {
                            edges {
                              node {
                                id
                                committedDate
                                additions
                                deletions   
                                author {
                                  user {
                                    id
                                    login
                                    location
                                  }
                                }
                              }
                            }
                            totalCount
                            pageInfo {
                              endCursor
                              hasNextPage
                            }
                          }
                        }
                      }
                    }
                  }
                }"""

async def get_commits(repo):
  print(f"Getting commits for {repo[0]}/{repo[1]}")
  cursor = None # Used to keep track of the current page in the query
  has_next_page = True # Used to indicate if there are more pages to query

  while (has_next_page):
    async with aiohttp.ClientSession() as session:
        async with session.post('https://api.github.com/graphql',
                headers={'Authorization': 'bearer '+ config['API']['KEY']},
                json={"query": commit_query % (
                  repo[1],
                  repo[0],
                  f"after: \"{cursor}\"" if cursor else "" )
                }
              ) as resp: 
            commit_query_response = await resp.json()

    # If we encounter an error we write it to the error log and retry the same page
    if commit_query_response.get("errors") or resp.status != 200:
        print("ERROR!")
        with open(f"{config['GENERAL']['ERROR_LOG']}/{repo[1]}_{repo[0]}.log", "a+") as error_log:
          error_log.write(f"[{datetime.now()}] : {commit_query_response} : {repo[0]}/{repo[1]} : {cursor} \n ")
        # Pause before retrying so that a persistent error does not turn into a tight request loop
        await asyncio.sleep(5)
        continue

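    if commit_query_response["data"]["rateLimit"]["remaining"] <= 10:
      # resetAt is a UTC timestamp, so it is compared against the current UTC time
      reset_time = datetime.strptime(commit_query_response["data"]["rateLimit"]["resetAt"], '%Y-%m-%dT%H:%M:%SZ')
      while datetime.utcnow() < reset_time:
          print(f"Rate limit reached, waiting until {reset_time} to continue")
          # asyncio.sleep yields to the event loop instead of blocking it like time.sleep
          await asyncio.sleep(20)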
          
    cursor = commit_query_response["data"]["repository"]["defaultBranchRef"]["target"]["history"]["pageInfo"]["endCursor"]
    has_next_page = commit_query_response["data"]["repository"]["defaultBranchRef"]["target"]["history"]["pageInfo"]["hasNextPage"]
    
    print(f"Writing commits for {repo[0]}/{repo[1]} to database")
    print(commit_query_response["data"]["repository"]["defaultBranchRef"]["target"]["history"]["edges"])
    db.execute('BEGIN TRANSACTION')
    for commit in commit_query_response["data"]["repository"]["defaultBranchRef"]["target"]["history"]["edges"]:
      commit = commit["node"]
      if commit.get("author") and commit["author"].get("user"): 
        db.execute(INSERT_USER % (
                        ", createdAt" if commit["author"]["user"].get("createdAt") else "",
                        ", location" if commit["author"]["user"].get("location") else "",
                        commit["author"]["user"]["id"],
                        commit["author"]["user"].get("login") if commit["author"]["user"].get("login") else "",
                        f", \n {datetime.strptime(commit['author']['user']['createdAt'], '%Y-%m-%dT%H:%M:%SZ').timestamp()}" if commit["author"]['user'].get("createdAt") else "",
                        ", \n '%s'" % (commit['author']['user']['location'].replace("'", r"''")) if commit["author"]['user'].get("location") else "",
                    )
                )
        
      db.execute(INSERT_COMMIT % (
                            commit["id"],
                            commit_query_response["data"]["repository"]["id"],
                            commit["author"]["user"]["id"] if commit.get("author") and commit["author"].get("user") else "",
                            datetime.strptime(commit["committedDate"], '%Y-%m-%dT%H:%M:%SZ').timestamp(),
                            commit["additions"],
                            commit["deletions"]
                        )
        )
    db.commit()
        

  db.execute(f""" UPDATE repos
                  SET allCommits = True
                  WHERE id = "{
                    commit_query_response["data"]["repository"]["id"]
                    }"
              """)
  db.commit()
  print(f"Completed {repo[0]}/{repo[1]}")  

#### Batching the Queries

This 

In [None]:

batchsize = 3000
nest_asyncio.apply()
loop = asyncio.get_event_loop()

while True:
    df = pd.read_sql_query("""
                           SELECT users.id, repos.url
                           FROM repos
                           JOIN users ON repos.owner = users.id
                           WHERE repos.allCommits is False
                           LIMIT %d
                           """ % batchsize, db)
    if df.empty:
        break
    repos = df["url"].str.split("/").str[3:5].values.tolist()
    
    loop.run_until_complete(
        asyncio.gather(
            *[get_commits(repo) for repo in repos]
        )
    )
    print (f"Completed {len(repos)} repos")
print("Done!!! :D")

Getting commits for dyoder/waves
Getting commits for myabc/merb_global
Getting commits for defunkt/matzbot
Getting commits for mully/redmine_ticket_emailer
Getting commits for bousquet/tableau
Getting commits for scrooloose/crondle
Getting commits for judofyr/gemify
Getting commits for jackdempsey/attachmerb_fu
Getting commits for thelema/ocaml-community
Getting commits for pcapriotti/github-trac
Getting commits for ymendel/flac2mp3
Getting commits for dustin/ruby-freebase
Getting commits for drnic/datamapper-tmbundle
Getting commits for rubys/mars
Getting commits for jdp/tarn
Getting commits for bumi/tokenizer
Getting commits for norbert/has_uuid
Getting commits for gregwebs/jquery-uitableedit
Getting commits for freels/radiant-extensions
Getting commits for KirinDave/fuzed-old
Getting commits for gnu-lorien/crack-attack
Getting commits for spicycode/spicy-config
Getting commits for TekNoLogic/TourGuide
Getting commits for mattly/hpreserve
Getting commits for shadoi/puppet

## Bibliography

[1] “Resources in the REST API,” GitHub Docs. https://docs.github.com/en/rest/overview/resources-in-the-rest-api?apiVersion=2022-11-28 (accessed Mar. 07, 2023).

[2] danvk, “How can I get a list of all public GitHub repos with more than 20 stars?,” Stack Overflow, Feb. 02, 2020. https://stackoverflow.com/q/60022429 (accessed Mar. 07, 2023).


[3] D. Vanderkam, “GitHub Stars and the h-index: A Journey,” Medium, Feb. 10, 2020. https://danvdk.medium.com/github-stars-and-the-h-index-a-journey-c104cfe37da6 (accessed Mar. 06, 2023).