# GitHub API - Data Extraction Notebook

### Objective

This notebook demonstrates how to extract, handle, and explore data using the GitHub REST API. The focus is on retrieving public repositories, commit histories, and file contents for technical analysis.

Author: Gabriela Rivera Plascencia


##  Notebook Structure

1. Objective and Scope  
2. Library Imports  
3. Token-Based Authentication  
4. Reusable Functions  
   - Authentication Setup  
   - Safe Request Handling  
   - Pagination Management  
5. API Endpoint Tests  
   -  `/search/repositories`: Search for popular repositories  
   - `/repos/{owner}/{repo}/commits`: Retrieve commit history  
   - `/repos/{owner}/{repo}/contents/{path}`: Get file contents  
6. Comments and Observations  
7. Final Thoughts  


### Utility Functions


In [1]:
import requests
import json
from pprint import pprint

# Authentication
def get_headers(token):
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28"
    }

# Pagination
def get_paginated_results(url, headers, max_pages=5):
    results = []
    page = 1
    while page <= max_pages:
        response = requests.get(f"{url}&page={page}", headers=headers)
        if response.status_code != 200:
            print(f"Error on page {page}: {response.status_code}")
            break
        data = response.json()
        if 'items' in data:
            results.extend(data['items'])
        else:
            break
        page += 1
    return results

# Error handling
def safe_request(url, headers):
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print("Request failed:", e)
        return None


## Library Imports

In [19]:
## Import Libraries

import requests
import time
import pandas as pd
from pprint import pprint

### Authentication

This section handles the secure authentication required to access the GitHub REST API.  
We use a **Personal Access Token (PAT)** retrieved securely through user input, which is passed in the request headers for all subsequent API calls.  
Proper authentication increases the rate limit and grants access to protected endpoints.


In [None]:
# Replace with your actual GitHub token
TOKEN = "Enter your full GitHub token:"


In [21]:
# Authentication headers
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28"
}

###  GET Search Repositories

This endpoint allows us to search public repositories on GitHub based on a query. In this example, we search for Python repositories with more than 10,000 stars, sorted by the number of stars.




In [32]:
def search_repositories(query="data", language="python", sort="stars", order="desc", per_page=30, page=1):
    url = "https://api.github.com/search/repositories"
    params = {
        "q": f"{query}+language:{language}",
        "sort": sort,
        "order": order,
        "per_page": per_page,
        "page": page
    }
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:
        return response.json()["items"]
    else:
        print(f"Error: {response.status_code}")
        print(response.json())
        return []


In [33]:
# Define the search query
query = "language:python stars:>10000"
url = f"https://api.github.com/search/repositories?q={query}&sort=stars&order=desc"

# Get paginated results (up to 3 pages)
repos = get_paginated_results(url, headers, max_pages=3)

# Print the first repository as an example
print(f"Total repositories retrieved: {len(repos)}")
print(repos[0])


Total repositories retrieved: 90
{'id': 13491895, 'node_id': 'MDEwOlJlcG9zaXRvcnkxMzQ5MTg5NQ==', 'name': 'free-programming-books', 'full_name': 'EbookFoundation/free-programming-books', 'private': False, 'owner': {'login': 'EbookFoundation', 'id': 14127308, 'node_id': 'MDEyOk9yZ2FuaXphdGlvbjE0MTI3MzA4', 'avatar_url': 'https://avatars.githubusercontent.com/u/14127308?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/EbookFoundation', 'html_url': 'https://github.com/EbookFoundation', 'followers_url': 'https://api.github.com/users/EbookFoundation/followers', 'following_url': 'https://api.github.com/users/EbookFoundation/following{/other_user}', 'gists_url': 'https://api.github.com/users/EbookFoundation/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/EbookFoundation/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/EbookFoundation/subscriptions', 'organizations_url': 'https://api.github.com/users/EbookFoundation/orgs', 'repos_url': 'http

In [34]:
# Example: Get paginated results for /search/repositories
search_url = "https://api.github.com/search/repositories?q=data+language:python&sort=stars&order=desc"
repo_results = get_paginated_results(search_url, headers, max_pages=3)

print(f"Total repositories retrieved: {len(repo_results)}")


Total repositories retrieved: 90


### Get Commits

This section retrieves the list of commits for a specified repository.  
It supports filtering by branch (`sha`), file path, and date range.  
We implement pagination to extract a full commit history and include error handling for rate limits or invalid requests.


In [35]:
def get_commits(owner, repo, per_page=100, max_pages=5):
    commits = []
    for page in range(1, max_pages + 1):
        url = f"https://api.github.com/repos/{owner}/{repo}/commits"
        params = {"per_page": per_page, "page": page}
        response = requests.get(url, headers=HEADERS, params=params)
        if response.status_code == 200:
            data = response.json()
            if not data:
                break
            commits.extend(data)
        elif response.status_code == 403:
            print("Rate limit reached.")
            break
        else:
            print(f"Error: {response.status_code}")
            print(response.json())
            break
    return commits


### Repository Commits

This section fetches the commit history of a known open-source repository (`pandas-dev/pandas`).  
Using pagination, we retrieve up to 3 pages of commit data (300 commits) to demonstrate API behavior over time.


In [41]:
commits_url = "https://api.github.com/repos/pandas-dev/pandas/commits"

commits = get_commits (commits_url, headers, per_page=100, max_pages=3)
print(f"Total commits fetched: {len(commits)}")

Error: 401
{'message': 'Bad credentials', 'documentation_url': 'https://docs.github.com/rest', 'status': '401'}
Total commits fetched: 0


### Sample Output Preview

To validate the structure and content of the retrieved data, we print a preview of the first element.  
This helps confirm that key fields (e.g., `name`, `html_url`, `stargazers_count`) are present and properly formatted.


In [27]:

print(repo_results[0])


{'id': 145553672, 'node_id': 'MDEwOlJlcG9zaXRvcnkxNDU1NTM2NzI=', 'name': 'funNLP', 'full_name': 'fighting41love/funNLP', 'private': False, 'owner': {'login': 'fighting41love', 'id': 11475294, 'node_id': 'MDQ6VXNlcjExNDc1Mjk0', 'avatar_url': 'https://avatars.githubusercontent.com/u/11475294?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/fighting41love', 'html_url': 'https://github.com/fighting41love', 'followers_url': 'https://api.github.com/users/fighting41love/followers', 'following_url': 'https://api.github.com/users/fighting41love/following{/other_user}', 'gists_url': 'https://api.github.com/users/fighting41love/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/fighting41love/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/fighting41love/subscriptions', 'organizations_url': 'https://api.github.com/users/fighting41love/orgs', 'repos_url': 'https://api.github.com/users/fighting41love/repos', 'events_url': 'https://api.github.com/

### Get Repos Contents

This endpoint allows us to access the content of a file in a repository. In this case, we read the README file from the root of the pandas-dev/pandas repository.


In [39]:
def get_file_content(owner, repo, path):
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}")
        print(response.json())
        return {}

In [40]:
path = "README.md"
url_content = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"

readme_data = safe_request(url_content, headers)

# Print information about the README file
if readme_data:
    print(f"File name: {readme_data['name']}")
    print(f"Encoding: {readme_data['encoding']}")
    print(f"Download URL: {readme_data['download_url']}")
    print("File content preview (truncated):")
    print(readme_data['content'][:300])  # Show just the beginning


NameError: name 'owner' is not defined

###  Function Testing

This section is dedicated to testing the reusable functions created for authentication, error handling, and pagination.  
We verify that each function behaves as expected with real API responses.  
This modular approach improves maintainability and ensures consistent behavior across endpoints.


In [43]:
# Search for test repositories

repos = search_repositories()
for repo in repos[:3]:
    print(f"Repo: {repo['full_name']} - ⭐ {repo['stargazers_count']}")

# Get commits from the first repository
if repos:
    owner, name = repos[0]["owner"]["login"], repos[0]["name"]
    commits = get_commits(owner, name)
    print(f"\nTotal commits obtenidos de {owner}/{name}: {len(commits)}")
    if commits:
        print(f"Primer commit SHA: {commits[0]['sha']}")

# Read a sample file (README.md)
file_content = get_file_content(owner, name, "README.md")
if file_content:
    print("\nArchivo README.md encontrado:")
    print(file_content.get("Objetive"))


Repo: fighting41love/funNLP - ⭐ 74232
Repo: lk-geimfari/mimesis - ⭐ 4589
Repo: kaitai-io/kaitai_struct - ⭐ 4215
Error: 401
{'message': 'Bad credentials', 'documentation_url': 'https://docs.github.com/rest', 'status': '401'}

Total commits obtenidos de fighting41love/funNLP: 0

Archivo README.md encontrado:
None


### Convert to DataFrame

After successfully retrieving and validating the API responses, this section converts selected JSON data into structured Pandas DataFrames.  
This allows for easier exploration, analysis, and export of the extracted data.  
We also inspect the resulting structure to ensure data integrity and usability.


In [42]:
df = pd.DataFrame([
    {
        "sha": c["sha"],
        "author": c["commit"]["author"]["name"],
        "message": c["commit"]["message"],
        "date": c["commit"]["author"]["date"]
    }
    for c in commits
])

df.head()