# 📈 GitHub API Analyst Test

This notebook is part of the Data APY Analyst technical test.

It connects to the GitHub REST API to:
- Search Public repositories
- Retrieve commit data
- list file contents in a repo

It handles:
- Authentication(optional token)
- Pagination
- Rate limiting
- Error handling

Includes helper functions for post-processin and downloading



In [1]:
#Imports

import requests, time, os

## 🔐 Authentication

The notebook use an optionan GitHub Personal Access Token (PAT)
Git allows 5,000 request/hour if a valid token is provided, otherwise the notebook will fall back to public mode (60 request/hour)



In [15]:
#First validate if we have a correct token

GITHUB_TOKEN ="" #Import your token from a .env

def validate_token(token : str = None) -> dict:
    """Validate if a token is valid"""
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
    }

    try:
      resp = requests.get("https://api.github.com/user", headers=headers)
      if resp.status_code == 200:
        print(f"🔐 Authenticated as: {resp.json().get('login')}")
        return headers
      elif resp.status_code == 401:
        print("⚠️ Invalid or expired GitHub token. Falling back to unauthenticated mode.")
      else:
        print(f"❌ Auth Error {resp.status_code}:{resp.text}")

    except Exception as e:
      print(f"❌ Auth Error: {e}")
    return {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }


HEADERS = validate_token(token=GITHUB_TOKEN)
if "Authorization" not in HEADERS:
  print(f"⚠️ Using public API Git Hub")
else:
  print(f"🔐 Using private API Git Hub")

⚠️ Invalid or expired GitHub token. Falling back to unauthenticated mode.
⚠️ Using public API Git Hub


## ⛓️‍💥 GitHub Helper

Handles:

- Authentication (use validates headers)
- Rate limit errors (pauses)
- Pagination (optional)
- Normalizing results

In this way, is eassier to interact with differents endpoins

In [3]:
def github_get(url, params=None, paginate=False):
    """GET a GitHub REST endpoint, handle rate‑limit & optional pagination."""
    all_items, page = [], 1
    while True:
      try:
        resp = requests.get(url, headers=HEADERS, params={**(params or {}), "page": page, "per_page": 100})
        if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
            reset = int(resp.headers["X-RateLimit-Reset"])
            sleep_for = max(reset - time.time(), 0) + 1
            print(f"Rate‑limited. Sleeping {sleep_for:.0f}s…")
            time.sleep(sleep_for)
            continue
        resp.raise_for_status()
        data = resp.json()
        if data is None:
          print("Error: Empty response")
          break

        if isinstance(data, dict) and "items" in data:
            batch = data["items"]
        else:
            batch = data

        all_items.extend(batch)

        if not paginate or "next" not in resp.links or not batch:
            break
        page += 1
      except Exception as e:
        print(f"Error: {e}")
        break
    return all_items

## 🧭 Rate Limit Handling

GitHub's API has a primary rate limit of 60 requests per hour for unauthenticated users and 5,000 per hour for authenticated users.

The `github_get()` function handles rate limits automatically:

- It checks the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers.
- If the limit is hit (403 response), it calculates how long to wait before retrying.
- It then pauses using `time.sleep()` and retries the request.

This prevents errors and ensures smooth API usage even in long loops.

In [None]:
# 1. search repositories

repos = github_get(
    "https://api.github.com/search/repositories",
    params={"q": "topic:data-science", "per_page": 5}
)
print("repos", len(repos), "repositories")
# 2. Commits from a popular repo
commits = github_get(
    "https://api.github.com/repos/pandas-dev/pandas/commits",
    params={"per_page": 5},
    paginate=True
)
print("Fetched", len(commits), "commits")

# 3. Contents of the repo root
contents = github_get(
    "https://api.github.com/repos/pandas-dev/pandas/contents",
    paginate=False
)
print("Root has", len(contents), "items")

repos 100 repositories
Fetched 36249 commits
Root has 31 items


## 🔄 Reusable Functions

some helpers that

- Search a repositories based in a query, allows the user seach per pages
- Obtain the repo commits and content from a given repo  

In [5]:
def search_repositories(query: str, per_page: int = 10, pages: int =1) -> list:
    """Search for repositories using the GitHub EndPoints"""
    all_results = []

    for page in range (1, pages +1):
      results = github_get(
          url = "https://api.github.com/search/repositories",
          params = {"q": query, "per_page": per_page, "page": pages},
          paginate = False #GutHub API uses a secondary rate limit, its necesary we control de loop
      )
      all_results.extend(results)
      if len(results)<per_page:
        break

    return all_results



In [6]:
def get_repo_commits(owner: str, repo: str, per_page: int = 100, max_pages: int = 1):
  """ Get commits from a repository """
  all_commits =[]

  for page in range(1, max_pages + 1):
    commits = github_get(
        url = f"https://api.github.com/repos/{owner}/{repo}/commits",
        params = {"per_page": per_page, "page": page},
        paginate = False
    )
    all_commits.extend(commits)
    if len(commits) < per_page:
      break
  return all_commits

In [7]:
def get_repo_contents(owner: str, repo: str, path: str=""):
  """ List contents of a given path in a repository """
  contents = github_get(
      url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}",
      paginate = False
  )
  return contents

In [8]:
repos = search_repositories("data-science", per_page= 5, pages=1)
repos[5]

{'id': 43759462,
 'node_id': 'MDEwOlJlcG9zaXRvcnk0Mzc1OTQ2Mg==',
 'name': 'DataSciencePython',
 'full_name': 'ujjwalkarn/DataSciencePython',
 'private': False,
 'owner': {'login': 'ujjwalkarn',
  'id': 5948390,
  'node_id': 'MDQ6VXNlcjU5NDgzOTA=',
  'avatar_url': 'https://avatars.githubusercontent.com/u/5948390?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/ujjwalkarn',
  'html_url': 'https://github.com/ujjwalkarn',
  'followers_url': 'https://api.github.com/users/ujjwalkarn/followers',
  'following_url': 'https://api.github.com/users/ujjwalkarn/following{/other_user}',
  'gists_url': 'https://api.github.com/users/ujjwalkarn/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/ujjwalkarn/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/ujjwalkarn/subscriptions',
  'organizations_url': 'https://api.github.com/users/ujjwalkarn/orgs',
  'repos_url': 'https://api.github.com/users/ujjwalkarn/repos',
  'events_url': 'https://api.git

In [9]:
commits = get_repo_commits("fastai", "fastai", per_page=100, max_pages=1)
commits[0]

{'sha': '1ac4ee147baf86d2f66f13da9d755a4970f1160b',
 'node_id': 'C_kwDOBiNAztoAKDFhYzRlZTE0N2JhZjg2ZDJmNjZmMTNkYTlkNzU1YTQ5NzBmMTE2MGI',
 'commit': {'author': {'name': 'Jeremy Howard',
   'email': 'github@jhoward.fastmail.fm',
   'date': '2025-06-05T22:32:37Z'},
  'committer': {'name': 'GitHub',
   'email': 'noreply@github.com',
   'date': '2025-06-05T22:32:37Z'},
  'message': 'Merge pull request #4098 from Timmecom/update-plot-lr-find\n\nUpdate plot lr find',
  'tree': {'sha': 'd035f20c1f2a1ac6963fa6e8c4d276cceaf34ce3',
   'url': 'https://api.github.com/repos/fastai/fastai/git/trees/d035f20c1f2a1ac6963fa6e8c4d276cceaf34ce3'},
  'url': 'https://api.github.com/repos/fastai/fastai/git/commits/1ac4ee147baf86d2f66f13da9d755a4970f1160b',
  'comment_count': 0,
  'verification': {'verified': True,
   'reason': 'valid',
   'signature': '-----BEGIN PGP SIGNATURE-----\n\nwsFcBAABCAAQBQJoQhsFCRC1aQ7uu5UhlAAADDYQAAPjCLkpZFLx8i1CYKktPU66\nDY62TnkGBYBXiNGmcHjksHa+bc+yCHPY6BcDrQFXbOiTavzb822USZyoRmAu

In [10]:
contents = get_repo_contents("fastai", "fastai", path="")
contents[0]

{'name': '.devcontainer.json',
 'path': '.devcontainer.json',
 'sha': '5c93e3519d22eedfe281bc5b0bcf29029d4ee825',
 'size': 545,
 'url': 'https://api.github.com/repos/fastai/fastai/contents/.devcontainer.json?ref=main',
 'html_url': 'https://github.com/fastai/fastai/blob/main/.devcontainer.json',
 'git_url': 'https://api.github.com/repos/fastai/fastai/git/blobs/5c93e3519d22eedfe281bc5b0bcf29029d4ee825',
 'download_url': 'https://raw.githubusercontent.com/fastai/fastai/main/.devcontainer.json',
 'type': 'file',
 '_links': {'self': 'https://api.github.com/repos/fastai/fastai/contents/.devcontainer.json?ref=main',
  'git': 'https://api.github.com/repos/fastai/fastai/git/blobs/5c93e3519d22eedfe281bc5b0bcf29029d4ee825',
  'html': 'https://github.com/fastai/fastai/blob/main/.devcontainer.json'}}

## 📦 Bonus Functions

This functions helps the user:
- Clean commits to extract important information 
- Filter repositories by stars or language
- Download files or entire repositores (zip)



In [11]:
def clean_commits(raw_commits):
    """Extract useful fields from raw commit data."""
    cleaned = []
    for commit in raw_commits:
        cleaned.append({
            "sha": commit.get("sha"),
            "author": commit.get("commit", {}).get("author", {}).get("name"),
            "date": commit.get("commit", {}).get("author", {}).get("date"),
            "message": commit.get("commit", {}).get("message"),
            "url": commit.get("html_url"),
        })
    return cleaned

clean_commits(commits)

[{'sha': '1ac4ee147baf86d2f66f13da9d755a4970f1160b',
  'author': 'Jeremy Howard',
  'date': '2025-06-05T22:32:37Z',
  'message': 'Merge pull request #4098 from Timmecom/update-plot-lr-find\n\nUpdate plot lr find',
  'url': 'https://github.com/fastai/fastai/commit/1ac4ee147baf86d2f66f13da9d755a4970f1160b'},
 {'sha': 'c5df718db7f3ff7b72928dd21e28026d525bfb88',
  'author': 'Timmecom',
  'date': '2025-06-05T20:20:27Z',
  'message': 'Ran nbdev_export',
  'url': 'https://github.com/fastai/fastai/commit/c5df718db7f3ff7b72928dd21e28026d525bfb88'},
 {'sha': 'a08b61d99a1b00ff7a1cb857d02d5bb9cdb0ba4a',
  'author': 'Timmecom',
  'date': '2025-06-04T10:59:13Z',
  'message': 'Add return statement for figure in plot_lr_find method within the notebook.',
  'url': 'https://github.com/fastai/fastai/commit/a08b61d99a1b00ff7a1cb857d02d5bb9cdb0ba4a'},
 {'sha': '76457f3a094e874c44887df3ccc6fb037f925c83',
  'author': 'Timmecom',
  'date': '2025-06-04T10:53:52Z',
  'message': 'Remove the return statement from

In [12]:
def filter_repositories(repos, min_stars = 0, max_stars = 0, language = None):
  """ Return repos with at least min_stars and at most max_stars and optional language"""
  filtered_repos = []
  for r in repos:
    if r["stargazers_count"] < min_stars or r["stargazers_count"] > max_stars:
      continue
    if language and r["language"] != language:
      continue
    filtered_repos.append(r)
  return filtered_repos

filter_repositories(repos, min_stars= 10000, max_stars= 100000, language= "Python")

[{'id': 29749635,
  'node_id': 'MDEwOlJlcG9zaXRvcnkyOTc0OTYzNQ==',
  'name': 'data-science-ipython-notebooks',
  'full_name': 'donnemartin/data-science-ipython-notebooks',
  'private': False,
  'owner': {'login': 'donnemartin',
   'id': 5458997,
   'node_id': 'MDQ6VXNlcjU0NTg5OTc=',
   'avatar_url': 'https://avatars.githubusercontent.com/u/5458997?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/donnemartin',
   'html_url': 'https://github.com/donnemartin',
   'followers_url': 'https://api.github.com/users/donnemartin/followers',
   'following_url': 'https://api.github.com/users/donnemartin/following{/other_user}',
   'gists_url': 'https://api.github.com/users/donnemartin/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/donnemartin/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/donnemartin/subscriptions',
   'organizations_url': 'https://api.github.com/users/donnemartin/orgs',
   'repos_url': 'https://api.github.com/us

In [13]:
def download_file_from_repo(owner: str, repo: str, path: str, save_as: str = None):
    """Download a single file from a GitHub repo's contents."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    response = requests.get(url, headers=HEADERS)

    if response.status_code == 200:
        data = response.json()
        download_url = data.get("download_url")

        if not download_url:
            print("⚠️ This path is not a downloadable file.")
            return

        file_data = requests.get(download_url)
        save_path = save_as or path.split("/")[-1]

        with open(save_path, "wb") as f:
            f.write(file_data.content)
        print(f"✅ File downloaded: {save_path}")

    else:
        print(f"❌ Error fetching file: {response.status_code}")


download_file_from_repo("fastai", "fastai", "README.md")


✅ File downloaded: README.md


In [14]:
def download_repo_zip(owner: str, repo: str, branch: str = "main", save_as: str = None):
    """Download a full GitHub repo as a ZIP archive."""
    zip_url = f"https://github.com/{owner}/{repo}/archive/refs/heads/{branch}.zip"
    response = requests.get(zip_url)

    if response.status_code == 200:
        filename = save_as or f"{repo}-{branch}.zip"
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"✅ Repository downloaded: {filename}")
    else:
        print(f"❌ Error downloading ZIP: {response.status_code}")

download_repo_zip("fastai", "fastai", branch="main")


✅ Repository downloaded: fastai-main.zip


## 📘 Reflection

This thest provided a valuable opportunity to worj with GithHub's API and develop a structure.

In a production environment, I would further optimize the worrkflow by modularizing helper functions into reusable packages, implementing robust loggins and incorporating retry mechanisms with backoff for resilient error handling.

This notebook can be shared and rerun with a minimal setup.

