## Efficiently crawl data from GitHub's GraphQL API

Note to self:

- Utilize GraphQL for Efficiency: GraphQL allows you to request exactly the data you need, reducing unnecessary data transfer.
- Public Schema for Data Structure: This [public GH GraphQL schema](https://docs.github.com/en/graphql/overview/public-schema) is quite comprehensive.
- Implement Graceful Retry for Rate Limits: GitHub's GraphQL API enforces rate limits, so wrap requests in a retry mechanism that waits for the limit to reset instead of failing or hammering the API. Refer to the rate limits [documentation](https://docs.github.com/en/graphql/overview/rate-limits-and-node-limits-for-the-graphql-api) for guidance.
- Batch Queries by Year due to Max Results Limit: The maximum number of results you can fetch for one query is 1,000. To work around this, segment your queries by year or another logical division so that each segment stays under the limit and all desired data can still be retrieved.
- Review and Adjust the Base Query as Necessary: The [existing query](ospo_stats/query.py) may need some changes, especially to handle data updates efficiently. Since the entire crawl takes only about 3 minutes, the current setup seems fine for now.
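
A minimal sketch of the graceful-retry note above, not the code in `ospo_stats`. It assumes the request is wrapped in a hypothetical `send()` callable that returns `(status, headers, body)`, and it relies on the `retry-after`, `x-ratelimit-remaining` and `x-ratelimit-reset` response headers described in the rate limits documentation. The check in `is_rate_limited` is a heuristic and should be verified against that documentation.

```python
import time


def is_rate_limited(status, headers, body):
    """Heuristic (an assumption): decide whether a response signals a rate limit."""
    if status in (403, 429) or "retry-after" in headers:
        return True
    # GraphQL can answer 200 with an "errors" payload once the budget is used up.
    return headers.get("x-ratelimit-remaining") == "0" and "errors" in body


def wait_seconds(headers, now, attempt):
    """Pick a pause length from the rate-limit headers (keys must be lowercase)."""
    if "retry-after" in headers:
        return float(headers["retry-after"])
    if headers.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in headers:
        # x-ratelimit-reset is a UTC epoch timestamp; add one second of slack.
        return max(0.0, float(headers["x-ratelimit-reset"]) - now) + 1.0
    # No usable header: fall back to exponential backoff, capped at one hour.
    return min(60.0 * 2 ** attempt, 3600.0)


def call_with_retry(send, max_attempts=5, sleep=time.sleep, clock=time.time):
    """Call send() until it is no longer rate limited or attempts run out."""
    for attempt in range(max_attempts):
        status, raw_headers, body = send()
        headers = {k.lower(): v for k, v in raw_headers.items()}
        if not is_rate_limited(status, headers, body):
            return body
        if attempt < max_attempts - 1:
            sleep(wait_seconds(headers, clock(), attempt))
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting; in the crawler, `send` would wrap whatever HTTP call posts the GraphQL query.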



In [None]:
from ospo_stats.gh import discover_repos

# Takes around 3 minutes for 2010 to 2024 and stores all raw data in the `data/` dir.
# Run once only.
discover_repos("uw-madison", 2010, 2024, overwrite=True)
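
How `discover_repos` splits the crawl internally is not shown here. As a rough illustration of the batch-by-year idea, the sketch below builds one search string per year using GitHub's `created:START..END` search qualifier. The function name `yearly_search_queries` is hypothetical, and treating the end year as inclusive is an assumption.

```python
def yearly_search_queries(keyword, start_year, end_year):
    """Return {year: search string}, one search per calendar year (end inclusive)."""
    return {
        year: f"{keyword} created:{year}-01-01..{year}-12-31"
        for year in range(start_year, end_year + 1)
    }


queries = yearly_search_queries("uw-madison", 2010, 2024)
print(queries[2011])
```

If a single year ever exceeds the 1,000-result limit, the same pattern can be applied to smaller windows such as months.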

## Visualizations

- Cumulative repos by year
- Cumulative stars by year? (Not implemented yet)
- Cumulative commits by year? (Not implemented yet)

### Calculate cumulative metrics by year

In [None]:
import pandas as pd
import altair as alt
from ospo_stats.parser import load

df = load("data/")
df['year'] = pd.to_datetime(df['created_at']).dt.year

cumulative_repos = df.groupby('year').name.nunique().cumsum()
plot_df = cumulative_repos.reset_index(name='n')

plot_df.head(5)

### Plot cumulative metrics by year

In [None]:
plot = alt.Chart(plot_df).mark_line(point=True).encode(
    x='year:O', 
    y='n:Q',
).properties(
    title="Yearly Growth of UW–Madison's Open-Source Repositories on GitHub",
    width=600,
    height=400
)

plot

### Somewhat interesting statistics

- Top-10 most-starred repos
- Top-10 repos by commit count

In [None]:
df.sort_values('stars', ascending=False).head(10)

In [None]:
df.sort_values('commits', ascending=False).head(10)

## Future plans / discussion

1. Get some inspiration from: https://r-universe.dev/search/
1. What info is needed for outreach? Exemplary OSS repos?
1. Categorize repos by `description` and `readme` with an LLM.
1. Fix the misleading stars-over-time and commits-over-time plots.
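
One possible reading of the last item: a repository's total star count grouped by its creation year credits every star to the year the repo was created, not the year the star was given. A sketch of the alternative, counting events by their own timestamp, is below. The input layout (one row per event with a `starred_at` column) and the function name `cumulative_by_year` are assumptions, not the actual output of `parse_stargazers`.

```python
import pandas as pd


def cumulative_by_year(events, time_col):
    """Cumulative event count per year, using each event's own timestamp."""
    years = pd.to_datetime(events[time_col]).dt.year
    per_year = years.value_counts().sort_index()
    # Fill in years with no events so the cumulative line has no gaps.
    all_years = range(int(per_year.index.min()), int(per_year.index.max()) + 1)
    per_year = per_year.reindex(all_years, fill_value=0)
    return per_year.cumsum().rename_axis("year").reset_index(name="n")


# Hypothetical star events for illustration only.
events = pd.DataFrame(
    {"starred_at": ["2019-05-01T00:00:00Z", "2019-06-01T00:00:00Z", "2021-01-01T00:00:00Z"]}
)
stars_by_year = cumulative_by_year(events, "starred_at")
```

The result has the same `year` and `n` columns as `plot_df`, so it could be passed to the same chart. The same function would work for commits given one row per commit with a timestamp column.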

In [1]:
from ospo_stats.gh import get_stargazers, get_commits
from ospo_stats.parser import parse_stargazers, parse_commits, parse_discover_response
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO) 

In [None]:
gs = get_stargazers("ad-freiburg", "qlever")
gs = [parse_stargazers(g) for g in gs]
pd.DataFrame(gs)


In [None]:
cs = get_commits("jasonlo", "funsearch")
cs = [parse_commits(c) for c in cs]
pd.DataFrame(cs)

In [3]:
import json
with open("data/repos_2011.json", "r") as f:
    data = json.load(f)

In [4]:
data

[{'repo': {'owner': {'login': 'aaronb'},
   'name': 'xv6',
   'url': 'https://github.com/aaronb/xv6',
   'description': 'UW-Madison fork of xv6, the teaching operating system based on UNIX version 6',
   'createdAt': '2011-06-24T02:56:54Z',
   'pushedAt': '2012-01-30T17:23:27Z',
   'stargazers': {'totalCount': 3},
   'issues': {'totalCount': 0},
   'defaultBranchRef': {'target': {'history': {'totalCount': 30}}},
   'readme_standard': None,
   'readme_lower': None}},
 {'repo': {'owner': {'login': 'khazelton'},
   'name': 'math801f11',
   'url': 'https://github.com/khazelton/math801f11',
   'description': 'Repository for materials associated with Math 801, Fall 2011, UW-Madison',
   'createdAt': '2011-09-12T22:52:11Z',
   'pushedAt': '2012-01-26T12:22:58Z',
   'stargazers': {'totalCount': 1},
   'issues': {'totalCount': 0},
   'defaultBranchRef': {'target': {'history': {'totalCount': 35}}},
   'readme_standard': None,
   'readme_lower': None}}]
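
For reference, a sketch of how one raw record like the ones above could be flattened into a row with the column names used earlier (`name`, `created_at`, `stars`, `commits`). This is not the implementation of `parse_discover_response`. The function name `flatten_repo` is hypothetical, and the fallback of 0 commits when `defaultBranchRef` is missing is an assumption.

```python
def flatten_repo(record):
    """Flatten one raw record from data/repos_<year>.json into a flat dict."""
    repo = record["repo"]
    # defaultBranchRef can be null (for example, an empty repository).
    branch = repo.get("defaultBranchRef") or {}
    history = (branch.get("target") or {}).get("history") or {}
    return {
        "owner": repo["owner"]["login"],
        "name": repo["name"],
        "url": repo["url"],
        "description": repo.get("description"),
        "created_at": repo["createdAt"],
        "pushed_at": repo.get("pushedAt"),
        "stars": repo["stargazers"]["totalCount"],
        "issues": repo["issues"]["totalCount"],
        "commits": history.get("totalCount", 0),
    }
```

With `data` loaded as above, `pd.DataFrame([flatten_repo(r) for r in data])` would give one row per repository.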