# Setup

* First, we import the necessary libraries.
* Then we load the configuration file.
  * The config file contains settings such as the database name and the API key.
* Then we set up the SQLite database.

In [9]:
import requests
import time
import sqlite3
from datetime import datetime
from IPython.display import JSON
import pandas as pd
import pickle
import configparser as cp

In [10]:
# Read the config file into a ConfigParser object
config = cp.ConfigParser()
config.read('config.ini')

['config.ini']
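
The notebook does not show `config.ini` itself. The following is a minimal sketch of what it might look like, assuming only the section and key names that are used later (`config['DB']['NAME']` and `config['API']['KEY']`). The values, and the names `example_ini` and `example_config`, are placeholders for illustration.

```python
import configparser as cp

# Hypothetical config contents; only the section and key names
# (DB/NAME and API/KEY) come from the notebook, the values are placeholders.
example_ini = """
[DB]
NAME = repos.db

[API]
KEY = <your-github-token>
"""

example_config = cp.ConfigParser()
example_config.read_string(example_ini)

db_name = example_config['DB']['NAME']
api_key = example_config['API']['KEY']
print(db_name)
```

Keeping the token in a separate file like this means it does not have to appear in the notebook itself.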

## SQLite

I decided to use a SQLite database instead of simply keeping the data in memory or in a JSON file, not just because of the size of the dataset, but also because of the inherent structure of the data itself.

The n-to-n relationship between the repos and the contributors is expected to be queried a lot, so it makes more sense to use a database that can handle relational queries than to join dataframes manually.

In [11]:
db = sqlite3.connect(config['DB']['NAME'])
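
To illustrate the kind of relational query meant here, the following is a small sketch using an in-memory database. The table and column names (`repos`, `contributors`, `repo_contributors`, `login`) and the sample rows are assumptions made for this example; the notebook does not define its actual schema at this point.

```python
import sqlite3

# In-memory database so the sketch does not touch the real data file
demo_db = sqlite3.connect(":memory:")
demo_db.executescript("""
    CREATE TABLE repos (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contributors (id INTEGER PRIMARY KEY, login TEXT);
    -- Junction table modelling the n-to-n relationship
    CREATE TABLE repo_contributors (
        repo_id INTEGER REFERENCES repos(id),
        contributor_id INTEGER REFERENCES contributors(id),
        PRIMARY KEY (repo_id, contributor_id)
    );
""")
demo_db.executemany("INSERT INTO repos VALUES (?, ?)", [(1, "alpha"), (2, "beta")])
demo_db.executemany("INSERT INTO contributors VALUES (?, ?)", [(10, "alice"), (11, "bob")])
demo_db.executemany("INSERT INTO repo_contributors VALUES (?, ?)", [(1, 10), (1, 11), (2, 10)])

# All repos a given contributor has worked on, resolved with a single join
rows = demo_db.execute("""
    SELECT r.name FROM repos r
    JOIN repo_contributors rc ON rc.repo_id = r.id
    JOIN contributors c ON c.id = rc.contributor_id
    WHERE c.login = ?
    ORDER BY r.name
""", ("alice",)).fetchall()
print(rows)
```

The junction table is what replaces the manual dataframe join: each (repo, contributor) pair is stored once and can be queried from either side.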

## Splitting up the Queries

Because GitHub returns at most 1000 items per search query\[1\], we need to create queries that each return no more than 1000 items but together cover the entire dataset.

Previous attempts\[2\] to solve this exact problem constrained their queries by the number of stars of each repository.
This method only works as long as there are no more than 1000 repositories with the same number of stars.

This was then mitigated by using the creation date of the repository as a second constraint.
As described in the corresponding blog article\[3\], this solution works by:

* First querying the GitHub GraphQL API for the number of results a given query would return
* If the count is above 1000, taking the creation dates of the youngest and oldest repository and splitting the time range in half
* Then checking the size of the two resulting queries again and, if either is still above 1000 results, repeating the process until every query returns at most 1000 results
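
The steps above can be sketched without any network access. In this sketch the API count is replaced by a stand-in function over made-up timestamps; `split_range`, `fake_count` and `fake_created` are hypothetical names, and the ranges are treated as half-open, which is an assumption of this example rather than a property of the GitHub search syntax.

```python
import bisect

def split_range(start, end, count_in_range, limit=1000):
    """Recursively halve [start, end) until each piece holds at most `limit` items."""
    # Stop when the range is small enough, or cannot be split any further
    if count_in_range(start, end) <= limit or end - start <= 1:
        return [(start, end)]
    middle = (start + end) // 2
    return (split_range(start, middle, count_in_range, limit)
            + split_range(middle, end, count_in_range, limit))

# Stand-in for the API count: 5000 fake creation timestamps, one every 10 seconds
fake_created = list(range(0, 50_000, 10))

def fake_count(start, end):
    # Number of fake timestamps t with start <= t < end
    return bisect.bisect_left(fake_created, end) - bisect.bisect_left(fake_created, start)

pieces = split_range(0, 50_000, fake_count)
print(len(pieces), pieces[0], pieces[-1])
```

Because the halving is driven only by the count, dense periods end up with many short ranges and sparse periods with a few long ones.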

In [12]:
# Convert a Unix timestamp to a UTC string in the format required by the GitHub API
from datetime import timezone
to_string = lambda stamp : datetime.fromtimestamp(stamp, tz=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

In [13]:

def split_querys(start, end):
  global amount_of_repos
  global repos_done

  repo_count_response = requests.post(
          'https://api.github.com/graphql',
          headers={'Authorization': 'bearer '+ config['API']['KEY']},
          json={"query": count_query % (to_string(start), to_string(end))}
      )
  
  # On the first run we get the total number of repos 
  # This is used to calculate the progress of the script
  if (amount_of_repos is None):
    amount_of_repos = repo_count_response.json()["data"]["search"]["repositoryCount"]
    repos_done = 0

  # If we are close to the rate limit we sleep until the rate limit resets
  if repo_count_response.json()["data"]["rateLimit"]["remaining"] < 10:
    # resetAt is a UTC timestamp, so it is compared against the current UTC time
    reset_time = datetime.strptime( repo_count_response.json()["data"]["rateLimit"]["resetAt"], '%Y-%m-%dT%H:%M:%SZ')
    
    while datetime.utcnow() < reset_time:
      seconds_till_reset = (reset_time - datetime.utcnow()).total_seconds()
      print ("Sleeping till %s... %d minutes and %d seconds left..." % ( reset_time, *divmod(seconds_till_reset, 60)))
      time.sleep(5)

  # If the number of repos in the time range is greater than 1000
  if repo_count_response.json()["data"]["search"]["repositoryCount"] > 1000:
    # We split the range in half and do the same query on each half
    # This will continue recursively until the number of repos is at most 1000
    split_querys(start, (start + end)//2)
    split_querys((start + end)//2, end) 
    
  else:
    # Once we finally get a range with at most 1000 repos we add its timestamps to the sections list
    sections.append((start, end))
    repos_done = repos_done+repo_count_response.json()["data"]["search"]["repositoryCount"]
    print(f"Working on {to_string(start)} to {to_string(end)}. Progress: {repos_done/amount_of_repos*100:.2f}%")
    
# The query to get the number of repos in a given time range as well as the current state of the rate limit
count_query = ''' query { 
                   rateLimit {
                    cost
                    remaining
                    resetAt
                  }
                  search(
                    query:"is:public, stars:>15, created:%s..%s"
                    type: REPOSITORY, first: 1) {
                    repositoryCount
                  }
                } '''

sections = []

start = 1167609600 # Timestamp for 2007-01-01 (GitHub was founded in 2008, so this covers all repos)
end = 1678209714  # Fixed timestamp of the time of writing (for consistency, time.time() is not used)

amount_of_repos = None

split_querys(start, end)

Working on 2007-01-01T00:00:00Z to 2008-01-05T08:35:07Z. Progress: 0.00%
Working on 2008-01-05T08:35:07Z to 2008-04-06T17:43:53Z. Progress: 0.03%
Working on 2008-04-06T17:43:53Z to 2008-07-08T01:52:40Z. Progress: 0.12%
Working on 2008-07-08T01:52:40Z to 2008-08-23T05:57:03Z. Progress: 0.16%
Working on 2008-08-23T05:57:03Z to 2008-10-08T10:01:27Z. Progress: 0.20%
Working on 2008-10-08T10:01:27Z to 2008-11-23T13:05:50Z. Progress: 0.26%


KeyboardInterrupt: 
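
The rate-limit wait is repeated in both download loops. As a sketch, it could be factored into a small helper that works in UTC throughout. `seconds_until_reset` is a hypothetical name introduced for this example and is not part of the notebook; the timestamp format is the one the notebook already parses from `resetAt`.

```python
from datetime import datetime, timezone

def seconds_until_reset(reset_at, now=None):
    """Seconds from `now` until the `resetAt` timestamp (0 if already past)."""
    # Parse the timestamp and mark it explicitly as UTC
    reset_time = datetime.strptime(reset_at, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (reset_time - now).total_seconds())

# Example with a fixed "now" so the result does not depend on the clock
now = datetime(2023, 3, 7, 12, 0, 0, tzinfo=timezone.utc)
wait = seconds_until_reset("2023-03-07T12:30:15Z", now)
minutes, seconds = divmod(wait, 60)
print("%d minutes and %d seconds left" % (minutes, seconds))
```

A caller could then pass the result to `time.sleep` instead of polling every five seconds.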

In [77]:
# Load the previously saved sections; the with block closes the file afterwards
with open("sections.pkl", "rb") as f:
    sections = pickle.load(f)
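
The cell that wrote `sections.pkl` is not shown in the notebook. A minimal sketch of the save and load round trip could look as follows; the list contents are illustrative placeholders, and the file is written to a temporary directory here so that an existing `sections.pkl` is not overwritten.

```python
import os
import pickle
import tempfile

# Placeholder sections; the real list is produced by split_querys
example_sections = [(1167609600, 1199522107), (1199522107, 1207503833)]

# Written to a temporary directory so an existing sections.pkl is not overwritten
path = os.path.join(tempfile.mkdtemp(), "sections.pkl")

with open(path, "wb") as f:
    pickle.dump(example_sections, f)

with open(path, "rb") as f:
    restored_sections = pickle.load(f)

print(restored_sections == example_sections)
```

Saving the sections once avoids repeating the count queries, and their rate-limit cost, every time the download step is restarted.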

In [81]:
repos_downloaded = 0
nodes = []


def download_repos (start, end):
  global repos_downloaded
  global nodes
  cursor = None
  has_next_page = True

  repo_query= """
                {
                  rateLimit {
                    cost
                    remaining
                    resetAt
                  }
                  search(
                    query: "is:public, stars:>15, created:%s..%s"
                    %s
                    type: REPOSITORY
                    first: 100
                  ) {
                    repositoryCount
                    pageInfo {
                      hasNextPage
                      endCursor
                    }
                    edges {
                      node {
                        ... on Repository {
                          createdAt
                          forkCount
                          isFork
                          updatedAt
                          primaryLanguage {
                            name
                          }
                          watchers {
                            totalCount
                          }
                          stargazerCount
                          databaseId
                          owner {
                            id
                            ... on User {
                              id
                              createdAt
                              databaseId
                              name
                            }
                          }
                          id
                          name
                        }
                      }
                    }
                  }
                }"""

  while (has_next_page):
    print("-"*100)

    repo_query_response = requests.post(
              'https://api.github.com/graphql',
              headers={'Authorization': 'bearer '+ config['API']['KEY']},
              json={"query": repo_query % (to_string(start), to_string(end), f"after: \"{cursor}\"" if cursor else "" )}
          )
    # If we are close to the rate limit we sleep until the rate limit resets
    if repo_query_response.json()["data"]["rateLimit"]["remaining"] < 10:
      # resetAt is a UTC timestamp, so it is compared against the current UTC time
      reset_time = datetime.strptime( repo_query_response.json()["data"]["rateLimit"]["resetAt"], '%Y-%m-%dT%H:%M:%SZ')

      while datetime.utcnow() < reset_time:
        seconds_till_reset = (reset_time - datetime.utcnow()).total_seconds()
        print ("Sleeping till %s... %d minutes and %d seconds left..." % ( reset_time, *divmod(seconds_till_reset, 60)))
        time.sleep(5)

    repos_downloaded = repos_downloaded + len(repo_query_response.json()["data"]["search"]["edges"])
    cursor = repo_query_response.json()["data"]["search"]["pageInfo"]["endCursor"]
    has_next_page = repo_query_response.json()["data"]["search"]["pageInfo"]["hasNextPage"]
    nodes = nodes + repo_query_response.json()["data"]["search"]["edges"]
    
    print(f"Downloading {repos_downloaded}/{amount_of_repos} repos. Progress: {repos_downloaded/amount_of_repos*100:.2f}%.")
    print( repo_query_response.json()["data"]["search"]["pageInfo"])

for section in sections:
  download_repos(section[0], section[1])

----------------------------------------------------------------------------------------------------
Downloading 1/1235319 repos. Progress: 0.00%.
{'hasNextPage': False, 'endCursor': 'Y3Vyc29yOjE='}
----------------------------------------------------------------------------------------------------
Downloading 101/1235319 repos. Progress: 0.01%.
{'hasNextPage': True, 'endCursor': 'Y3Vyc29yOjEwMA=='}
----------------------------------------------------------------------------------------------------
Downloading 201/1235319 repos. Progress: 0.02%.
{'hasNextPage': True, 'endCursor': 'Y3Vyc29yOjIwMA=='}
----------------------------------------------------------------------------------------------------
Downloading 301/1235319 repos. Progress: 0.02%.
{'hasNextPage': True, 'endCursor': 'Y3Vyc29yOjMwMA=='}
----------------------------------------------------------------------------------------------------
Downloading 401/1235319 repos. Progress: 0.03%.
{'hasNextPage': True, 'endCursor': 'Y3Vy

KeyboardInterrupt: 

In [56]:
for node in nodes:
    print(node["node"]["databaseId"])

584556174
582970339
583706111
583824425
584522336
584351443
583172165
583488732
584477080
583999445
583450891
583085674
584443067
584591805
583772808
583430700
583047772
583665827
583824526
584429431
583335116
584262559
583615061
583765500
583704527
583954357
583616027
584017827
583809866
584131216
584165139
584547922
583057559
584502403
584051226
584622314
583459605
583369920
584630595
583174341
584105200
583537416
583803874
584363621
583512692
583375059
584349473
584189373
583699147
583375511
583640426
584022883
582989161
583386618
584239673
584145750
583790421
583773699
583689551
583315360
583597081
583738217
583124988
583331123
583419398
584580166
583268270
584574007
583220952
584521206
584573194
583024727
583508796
583079961
584177219
583835507
583966597
583446148
583332730
582969786
584622033
583935180
584404380
583942910
584365872
583624397
584317678
583500307
584102226
584125419
584219588
583927613
584034634
584370823
583806262
583476976
583785011
583059661
584200680
583409095


## Bibliography

[1] “Resources in the REST API,” GitHub Docs. https://docs.github.com/en/rest/overview/resources-in-the-rest-api?apiVersion=2022-11-28 (accessed Mar. 07, 2023).

[2] danvk, “How can I get a list of all public GitHub repos with more than 20 stars?,” Stack Overflow, Feb. 02, 2020. https://stackoverflow.com/q/60022429 (accessed Mar. 07, 2023).


[3] D. Vanderkam, “GitHub Stars and the h-index: A Journey,” Medium, Feb. 10, 2020. https://danvdk.medium.com/github-stars-and-the-h-index-a-journey-c104cfe37da6 (accessed Mar. 06, 2023).