# Setup

* First, we import the necessary libraries.
* Then we load the configuration file.
  * The config file contains settings such as the database name and the API key.
* Finally, we set up the SQLite database.

In [None]:
import requests
import time
import sqlite3
from datetime import datetime
from IPython.display import JSON
import pandas as pd
import pickle
import configparser as cp

In [None]:
"""
    Reads the config file and returns the config object
"""
config = cp.ConfigParser()
config.read('config.ini')
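
The later cells read `config['DB']['NAME']`, `config['API']['KEY']` and `config['GENERAL']['STARS']`. A minimal sketch of a matching `config.ini` is shown below, parsed from a string so that it runs without a file. Only the section and key names come from this notebook; all values, as well as the names `EXAMPLE_CONFIG`, `example_config` and `min_stars`, are placeholders.

```python
import configparser as cp

# Hypothetical contents of config.ini; only the section and key names
# are taken from the notebook, the values are placeholders.
EXAMPLE_CONFIG = """
[DB]
NAME = github.db

[API]
KEY = <your personal access token>

[GENERAL]
STARS = 15
"""

example_config = cp.ConfigParser()
example_config.read_string(EXAMPLE_CONFIG)

# ConfigParser returns every value as a string, so numbers need converting.
min_stars = example_config["GENERAL"].getint("STARS")
print(example_config["DB"]["NAME"], min_stars)
```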

## SQLite

I decided to use a SQLite database instead of simply keeping the data in memory or in a JSON file, not just because of the size of the dataset, but also because of the inherent structure of the data itself.

The n-to-n relationship between the repos and the contributors is expected to be queried a lot, so it makes more sense to use a database that can handle relational queries than to join dataframes manually.

In [None]:
db = sqlite3.connect(config['DB']['NAME'])

if True: # Set to True to reset the database and start from scratch
    db.execute("DROP TABLE IF EXISTS users")
    db.execute("DROP TABLE IF EXISTS repos")
    db.execute("DROP TABLE IF EXISTS commits")

db.execute("""
           CREATE TABLE IF NOT EXISTS users (
               id TEXT PRIMARY KEY,
               name TEXT NOT NULL,
               location TEXT,
               createdAt datetime
            )
              """)

db.execute(""" 
           CREATE TABLE IF NOT EXISTS repos (
               id TEXT PRIMARY KEY,
               name TEXT NOT NULL,
               owner TEXT,
               url TEXT,
               stargazerCount INTEGER,
               watchers INTEGER,
               primaryLanguage text,
               isFork boolean,
               forkCount INTEGER,
               updatedAt datetime,
               createdAt datetime,
               FOREIGN KEY (owner) REFERENCES users(id)
            )
           """)

db.execute("""
           CREATE TABLE IF NOT EXISTS commits (
               id TEXT PRIMARY KEY,
               createdAt datetime,
               additions INTEGER,
               deletions INTEGER,
               repo TEXT,
               user TEXT,
               FOREIGN KEY (repo) REFERENCES repos(id),
               FOREIGN KEY (user) REFERENCES users(id)
            )
            """)
db.commit()
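
To illustrate why a relational database suits the n-to-n relationship between repos and contributors, here is a small sketch on an in-memory database. It uses a reduced version of the three tables and made-up sample rows; the connection name `demo` and the data are assumptions for illustration only.

```python
import sqlite3

# In-memory database with a reduced version of the three tables above.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE repos (id TEXT PRIMARY KEY, name TEXT NOT NULL, owner TEXT);
    CREATE TABLE commits (id TEXT PRIMARY KEY, repo TEXT, user TEXT);
""")

# Made-up sample rows, purely for illustration.
demo.executemany("INSERT INTO users VALUES (?, ?)", [("u1", "alice"), ("u2", "bob")])
demo.executemany(
    "INSERT INTO repos VALUES (?, ?, ?)",
    [("r1", "alpha", "u1"), ("r2", "beta", "u2")],
)
demo.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [("c1", "r1", "u1"), ("c2", "r1", "u2"), ("c3", "r2", "u2")],
)

# Number of distinct contributors per repository, resolved through the commits table.
contributors_per_repo = demo.execute("""
    SELECT repos.name, COUNT(DISTINCT commits.user)
    FROM repos
    JOIN commits ON commits.repo = repos.id
    GROUP BY repos.id
    ORDER BY repos.name
""").fetchall()
print(contributors_per_repo)
```

A single `JOIN` replaces what would otherwise be a manual merge of two dataframes.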

## Splitting up the Queries

Because GitHub limits every search to 1000 returned items\[1\], we need to create queries that each return no more than 1000 items but together still cover the entire dataset.

Previous attempts\[2\] to solve this exact problem constrained their queries by the number of stars of each repository.
This method only works as long as there are fewer than 1000 repositories with the same number of stars.

This was then mitigated by using the creation date of the repository as a second constraint.
As described in the corresponding blog article\[3\], this solution works by:

* First querying the GitHub GraphQL API for the number of results a given query would return
* If that count is above 1000, taking the creation dates of the youngest and the oldest repository and splitting the time range in half
* Then checking the size of the two resulting queries again and, if either is still above 1000 results, repeating the process until every query returns no more than 1000 results
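
The steps above can be sketched without any network access. In the following sketch, `split_range` and `fake_count` are hypothetical names, and `fake_count` stands in for the API count query by pretending that exactly one repository is created per second. The `end - start <= 1` guard is an addition that stops the recursion if a range cannot be split any further.

```python
def split_range(start, end, count_in_range, limit=1000):
    """Split start..end into sub-ranges that each hold at most `limit` items."""
    # Stop when the range is small enough or cannot be halved any further.
    if count_in_range(start, end) <= limit or end - start <= 1:
        return [(start, end)]
    middle = (start + end) // 2
    return (split_range(start, middle, count_in_range, limit)
            + split_range(middle, end, count_in_range, limit))


# Stand-in for the API count query: pretend one repository is created per second.
def fake_count(start, end):
    return end - start


ranges = split_range(0, 3000, fake_count)
print(ranges)
```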

In [None]:
# Convert a Unix timestamp to the UTC string format required by the GitHub API (the trailing "Z" denotes UTC)
to_string = lambda stamp : datetime.utcfromtimestamp(stamp).strftime('%Y-%m-%dT%H:%M:%SZ')

In [None]:

def split_querys(start, end):
  global amount_of_repos
  global repos_done

  repo_count_response = requests.post(
          'https://api.github.com/graphql',
          headers={'Authorization': 'bearer '+ config['API']['KEY']},
          json={"query": count_query % (config['GENERAL']['STARS'], to_string(start), to_string(end))}
      )
  
  # On the first run we get the total number of repos 
  # This is used to calculate the progress of the script
  if (amount_of_repos is None):
    amount_of_repos = repo_count_response.json()["data"]["search"]["repositoryCount"]
    repos_done = 0

  # If we are close to the rate limit we sleep until the rate limit resets
  if repo_count_response.json()["data"]["rateLimit"]["remaining"] < 10:
    reset_time = datetime.strptime( repo_count_response.json()["data"]["rateLimit"]["resetAt"], '%Y-%m-%dT%H:%M:%SZ')
    
    # resetAt is given in UTC, so it has to be compared against the current UTC time
    while datetime.utcnow() < reset_time:
      seconds_till_reset = (reset_time - datetime.utcnow()).total_seconds()
      print ("Sleeping till %s... %d minutes and %d seconds left..." % ( reset_time, *divmod(seconds_till_reset, 60)))
      time.sleep(5)

  # If the number of repos in the time range is greater than 1000
  if repo_count_response.json()["data"]["search"]["repositoryCount"] > 1000:
    # We split the range in half and do the same query on each half
    # This will continue recursively until the number of repos is less than 1000
    split_querys(start, (start + end)//2)
    split_querys((start + end)//2, end) 
    
  else:
    # If we finally get a range with no more than 1000 repos we add the timestamps to the sections list
    sections.append((start, end))
    repos_done = repos_done+repo_count_response.json()["data"]["search"]["repositoryCount"]
    print(f"Working on {to_string(start)} to {to_string(end)}. Progress: {repos_done/amount_of_repos*100:.2f}%")
    
# The query to get the number of repos in a given time range as well as the current state of the rate limit
count_query = ''' query { 
                   rateLimit {
                    cost
                    remaining
                    resetAt
                  }
                  search(
                    query:"is:public, stars:>%s, created:%s..%s"
                    type: REPOSITORY, first: 1) {
                    repositoryCount
                  }
                } '''

sections = []

start = 1167609600 # Timestamp for 2007-01-01 (GitHub was founded in 2008, so this covers all repos)
end = 1678209714  # Fixed timestamp from the time of writing (for consistency we do not use time.time())

amount_of_repos = None

split_querys(start, end)
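
The rate-limit check inside `split_querys` could also be factored into a small helper. The following is a sketch under assumptions: the name `seconds_until_reset` and its parameters are hypothetical, and the sample values are made up. It treats `resetAt` as a UTC timestamp, which is what the trailing `Z` indicates.

```python
from datetime import datetime, timezone


def seconds_until_reset(rate_limit, now=None, threshold=10):
    """Return how long to wait before the next request (0 if enough requests remain)."""
    if rate_limit["remaining"] >= threshold:
        return 0.0
    # Parse the reset time and mark it explicitly as UTC.
    reset_at = datetime.strptime(
        rate_limit["resetAt"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())


# Made-up rateLimit object in the shape returned by the count query.
sample = {"cost": 1, "remaining": 3, "resetAt": "2023-03-07T18:00:00Z"}
fixed_now = datetime(2023, 3, 7, 17, 58, 30, tzinfo=timezone.utc)
wait = seconds_until_reset(sample, now=fixed_now)
print(wait)
```

With such a helper, the waiting loop would shrink to a single `time.sleep(seconds_until_reset(...))` call.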

In [None]:
with open("overnight_sections.p", "wb") as file:  # the with block closes the file reliably
    pickle.dump(sections, file)

## Querying the Github API

Now that we can query GitHub in manageable chunks, we can start to download the actual data.
We are still using the GraphQL API for this, as it enables us to fetch only the data we actually need.
The REST API would require us to fetch the entire repository object, which contains a lot of unnecessary and redundant data.

Just because a chunk is small enough to be queried doesn't mean that a single request will return all of it.
To keep loading times low, GitHub uses pagination to limit the amount of data returned per request.
This means that we can only get 100 items per request, which is why we need the `endCursor` to get the next 100 items.

The cursor works like a bookmark that tells the API where we left off and where to continue; it has to be passed as a parameter to the next query.
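
The bookmark idea can be sketched offline. In the following sketch, `fetch_page` is a hypothetical stand-in for the API that serves 250 made-up items in pages of at most 100. Real GitHub cursors are opaque strings; an integer offset is used here only to keep the sketch short.

```python
# Stand-in for the API: 250 items served in pages of at most 100.
ITEMS = list(range(250))


def fetch_page(cursor=None, page_size=100):
    """Hypothetical helper that mimics a paginated response."""
    begin = 0 if cursor is None else cursor
    page = ITEMS[begin:begin + page_size]
    end_cursor = begin + len(page)
    return {
        "items": page,
        "pageInfo": {"endCursor": end_cursor, "hasNextPage": end_cursor < len(ITEMS)},
    }


collected = []
cursor = None          # no cursor on the first request
has_next_page = True
requests_made = 0

while has_next_page:
    response = fetch_page(cursor)
    collected.extend(response["items"])
    # Remember where we left off and whether another page exists.
    cursor = response["pageInfo"]["endCursor"]
    has_next_page = response["pageInfo"]["hasNextPage"]
    requests_made += 1

print(len(collected), requests_made)
```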

### Querying the Repos

The query itself consists of four parts:

1. A small snippet requesting the current state of the rate limit, so we can keep track of how many requests we have left and when to pause
2. The filter for the repositories, consisting of the following:
    * Only public repositories (this is a bit redundant, as the API only returns public repositories and the ones you have access to)
    * A lower limit on the number of stars; everything below 15 is ignored, as it indicates little relevance
    * The date range of the repositories; this is where we plug in our previously calculated date ranges
3. Some metadata about the query itself: the total count of items, the cursor for the next page, and whether there is a next page at all
4. The exact fields we are interested in:
   1. Information about the repository itself, like the name, the URL, the creation date, the number of stars and the number of forks
   2. Information about its creator, like the name, the profile creation date and the id
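
As an alternative to building the query with string formatting, the filter and the cursor can be passed as GraphQL variables in the request body. The following is a sketch under assumptions, not the approach used in this notebook: `build_search_payload` is a hypothetical helper, the field selection is reduced to the metadata part, and nothing is sent over the network.

```python
def build_search_payload(min_stars, created_from, created_to, cursor=None):
    """Build the JSON body for one page of the repository search."""
    # Same filter string as in the notebook, assembled from its three parts.
    search_filter = f"is:public, stars:>{min_stars}, created:{created_from}..{created_to}"
    query = """
    query ($searchQuery: String!, $cursor: String) {
      rateLimit { cost remaining resetAt }
      search(query: $searchQuery, type: REPOSITORY, first: 100, after: $cursor) {
        repositoryCount
        pageInfo { hasNextPage endCursor }
      }
    }
    """
    # A cursor of None is serialized as null, which requests the first page.
    return {"query": query, "variables": {"searchQuery": search_filter, "cursor": cursor}}


payload = build_search_payload(15, "2007-01-01T00:00:00Z", "2008-01-01T00:00:00Z")
print(payload["variables"])
```

The returned dictionary would be passed as the `json` argument of `requests.post`, in place of the formatted query string.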

In [None]:
repos_downloaded = 0 # Used to keep track of the progress of the script
nodes = [] # Used to store the repos before they are written to the database


def download_repos (start, end):
  global repos_downloaded 
  global nodes
  cursor = None # Used to keep track of the current page in the query
  has_next_page = True # Used to indicate if there are more pages to query

  repo_query= """
                {
                  rateLimit {
                    cost
                    remaining
                    resetAt
                  }
                  search(
                    query: "is:public, stars:>%s, created:%s..%s"
                    %s
                    type: REPOSITORY
                    first: 100
                  ) {
                    repositoryCount
                    pageInfo {
                      hasNextPage
                      endCursor
                    }
                    edges {
                      node {
                        ... on Repository {
                          createdAt
                          forkCount
                          isFork
                          updatedAt
                          primaryLanguage {
                            name
                          }
                          watchers {
                            totalCount
                          }
                          stargazerCount
                          databaseId
                          owner {
                            id
                            ... on User {
                              id
                              createdAt
                              location
                              databaseId
                              login
                            }
                          }
                          id
                          name
                          url
                        }
                      }
                    }
                  }
                }"""

  while (has_next_page):
    print("-"*100)

    repo_query_response = requests.post(
              'https://api.github.com/graphql',
              headers={'Authorization': 'bearer '+ config['API']['KEY']},
              json={"query": repo_query % (config['GENERAL']["STARS"], to_string(start), to_string(end), f"after: \"{cursor}\"" if cursor else "" )}
          )
    # If we are close to the rate limit we sleep until the rate limit resets
    if repo_query_response.json()["data"]["rateLimit"]["remaining"] < 10:
      reset_time = datetime.strptime( repo_query_response.json()["data"]["rateLimit"]["resetAt"], '%Y-%m-%dT%H:%M:%SZ')
      # resetAt is given in UTC, so it has to be compared against the current UTC time
      while datetime.utcnow() < reset_time:
        seconds_till_reset = (reset_time - datetime.utcnow()).total_seconds()
        print ("Sleeping till %s... %d minutes and %d seconds left..." % ( reset_time, *divmod(seconds_till_reset, 60)))
        time.sleep(5)

    # Summing up the progress made so far
    repos_downloaded = repos_downloaded + len(repo_query_response.json()["data"]["search"]["edges"])
    
    # Updating the cursor and has_next_page variables to know if and where to continue the query
    cursor = repo_query_response.json()["data"]["search"]["pageInfo"]["endCursor"]
    has_next_page = repo_query_response.json()["data"]["search"]["pageInfo"]["hasNextPage"]
    
    # Adding the repos to the nodes list (extend avoids copying the whole list on every page)
    nodes.extend(repo_query_response.json()["data"]["search"]["edges"])
    
    # Presenting the progress of the script
    print(f"""Downloading {repos_downloaded}/{amount_of_repos} repos \
          Requests left: {repo_query_response.json()['data']['rateLimit']['remaining']} \
          Progress: {repos_downloaded/amount_of_repos*100:.2f}%.""")

# Actually calling the previously defined function for each section
for section in sections:
  download_repos(section[0], section[1])

In [None]:
with open("overnight_nodes.p", "wb") as file:  # the with block closes the file reliably
    pickle.dump(nodes, file)

### Adding the data to the database

Now that we have collected the data about the repositories, we need to add it to the database.
For every repository, we first add its creator to the database and then the repository itself, with a reference to its creator.

It would actually be more efficient to do this step during the download, as it would make better use of the waiting time between the queries and remove the need to keep all the data in memory.
But this would make the code even less readable than it already is.


In [None]:
# Timestamp format used by the GitHub API
ISO_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

def to_timestamp(value):
    # Convert an API date string to a Unix timestamp; missing values are stored as NULL
    return datetime.strptime(value, ISO_FORMAT).timestamp() if value else None

for i, item in enumerate(nodes):
    item = item["node"]
    owner = item["owner"]

    print(f" Progress: {i/len(nodes)*100:.2f}% Current Repo: {item['name']}")

    # Parameterized queries let sqlite3 handle the quoting, so names or
    # locations that contain quotation marks no longer break the statement
    db.execute(
        "INSERT OR IGNORE INTO users(id, name, location, createdAt) VALUES (?, ?, ?, ?)",
        (
            owner["id"],
            owner.get("login") or "",
            owner.get("location") or None,
            to_timestamp(owner.get("createdAt")),
        ),
    )

    db.execute(
        """
        INSERT OR IGNORE INTO repos(
            id, name, owner, url, stargazerCount, watchers, primaryLanguage,
            isFork, forkCount, updatedAt, createdAt
        )
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """,
        (
            item["id"],
            item["name"],
            owner["id"],
            item["url"],
            item["stargazerCount"],
            item["watchers"]["totalCount"],
            item["primaryLanguage"]["name"] if item.get("primaryLanguage") else None,
            item["isFork"],
            item["forkCount"],
            to_timestamp(item["updatedAt"]),
            to_timestamp(item["createdAt"]),
        ),
    )

# Commit once after all rows have been inserted
db.commit()

### Queries for the Commits

But we do not only want the data about the repositories and the person who initially created them; we also want to know who actually worked on them.
This is where the commits come in.

The commits are the actual changes to the code, made by the contributors.
We could have queried them together with the repositories, but this seems to overload the API and would make the code utterly unreadable.

This is why we simply query the commits for every repository separately, which makes the query itself a lot simpler.

To get to the commits we need the creator name and the repository name, as the API does not provide a unique identifier for the repositories.
But we cannot simply use the creator name stored in the database, as the names of organizations are missing there: the repository query only requests the `login` for owners of type `User`.

To get to the data anyway, we take the owner name from the repository URL and use that as the creator name.
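
Extracting the owner from the URL could look like the following sketch. The helper name `owner_and_name` is hypothetical, and the sketch assumes that repository URLs have the form `https://github.com/<owner>/<name>`.

```python
from urllib.parse import urlparse


def owner_and_name(repo_url):
    """Split a repository URL into the owner login and the repository name."""
    # The URL path has the form /<owner>/<name>.
    path_parts = urlparse(repo_url).path.strip("/").split("/")
    if len(path_parts) < 2:
        raise ValueError(f"Not a repository URL: {repo_url}")
    return path_parts[0], path_parts[1]


owner, name = owner_and_name("https://github.com/torvalds/linux")
print(owner, name)
```

This works for users and organizations alike, because the URL always contains the owner's login.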

## Bibliography

[1] “Resources in the REST API,” GitHub Docs. https://docs.github.com/en/rest/overview/resources-in-the-rest-api?apiVersion=2022-11-28 (accessed Mar. 07, 2023).

[2] danvk, “How can I get a list of all public GitHub repos with more than 20 stars?,” Stack Overflow, Feb. 02, 2020. https://stackoverflow.com/q/60022429 (accessed Mar. 07, 2023).


[3] D. Vanderkam, “GitHub Stars and the h-index: A Journey,” Medium, Feb. 10, 2020. https://danvdk.medium.com/github-stars-and-the-h-index-a-journey-c104cfe37da6 (accessed Mar. 06, 2023).