# GitHub API Interaction with Python
This notebook demonstrates how to interact with the GitHub API using Python. The following functionalities are covered:
1. Authentication and testing the API connection.
2. Searching for repositories using various query parameters.
3. Fetching commits for a specific repository.
4. Retrieving repository contents.
5. Handling pagination for large datasets.
6. Exporting the cleaned data to CSV and JSON files.

I also include error handling and rate limit management to ensure the script runs smoothly.


## Authentication
The authentication step uses a Personal Access Token (PAT) to authenticate with the GitHub API. This token is necessary for accessing endpoints that require authorization and for increasing the rate limit to 5,000 requests per hour.
- **Function**: `test_auth`
- **Purpose**: To verify that the provided token is valid and functional.
- **Key Components**:
  - Base URL for GitHub API: `https://api.github.com`
  - Authorization headers: Include the PAT in the `Authorization` header.
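
A hard-coded token is easy to leak when a notebook is shared. As a minimal sketch, the headers could instead be built from an environment variable. The helper name `build_headers` and the variable name `GITHUB_TOKEN` are assumptions made for illustration; they are not used by the cells in this notebook.

```python
import os


def build_headers(token=None):
    """Build GitHub API request headers from a token (hypothetical helper)."""
    # GITHUB_TOKEN is an assumed environment variable name chosen for this sketch
    token = token or os.environ.get("GITHUB_TOKEN")
    if not token:
        raise ValueError("No token provided and GITHUB_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
```

Calling `build_headers("my-token")` returns a dictionary that can be passed as `headers=` to `requests.get`.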



In [3]:
import requests
import time


BASE_URL = "https://api.github.com"
# Placeholder: replace with your own PAT and avoid committing a real token with the notebook
TOKEN = "your_personal_token"

HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "X-GitHub-Api-Version": "2022-11-28"
}

In [None]:
def test_auth():
    """Test if the authentication token is valid."""
    url = f"{BASE_URL}/user"
    try:
        # A timeout keeps the cell from hanging indefinitely on network problems
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        # Print only the account login rather than the full profile payload
        print("Authentication successful:", response.json().get("login"))
    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
        raise
    except Exception as err:
        print(f"Other error occurred: {err}")
        raise


if __name__ == "__main__":
    # Test authentication
    try:
        test_auth()
    except Exception:
        print("Authentication failed. Exiting.")
        # exit() is an interactive helper; raising SystemExit stops execution predictably
        raise SystemExit(1) from None


## Class: `GitHubAPI`

### **Description**
A reusable class to interact with the GitHub API, handling authentication and providing methods to execute common API operations.

## API Interaction Functions
### Search Repositories
- **Function**: `search_repositories`
- **Purpose**: Searches for repositories on GitHub based on a query (e.g., programming language, popularity).
- **Parameters**:
  - `query`: Search keywords (e.g., `language:Python`).
  - `sort`: Sort by criteria (e.g., stars, forks).
  - `order`: Sorting order (`asc` or `desc`).
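
Several search qualifiers can be combined in one `q` string by separating them with spaces. The sketch below shows one way to assemble the parameters; `build_search_params` is a hypothetical helper, not part of the `GitHubAPI` class in this notebook.

```python
def build_search_params(keywords="", language=None, min_stars=None,
                        sort="stars", order="desc"):
    """Assemble query parameters for /search/repositories (hypothetical helper)."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    parts = []
    if keywords:
        parts.append(keywords)
    if language:
        parts.append(f"language:{language}")
    if min_stars is not None:
        # Range qualifier: only repositories with at least this many stars
        parts.append(f"stars:>={min_stars}")
    return {"q": " ".join(parts), "sort": sort, "order": order}
```

The resulting dictionary can be passed as `params=` to `requests.get`, which takes care of URL encoding.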

### List Commits
- **Function**: `list_commits`
- **Purpose**: Fetches the list of commits for a specific repository.
- **Parameters**:
  - `owner`: The owner of the repository.
  - `repo`: The name of the repository.

### Get Repository Contents
- **Function**: `get_contents`
- **Purpose**: Retrieves the contents of a specific file or directory in a repository.
- **Parameters**:
  - `owner`: The owner of the repository.
  - `repo`: The name of the repository.
  - `path`: The path of the file or directory.
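
For a single file, the contents endpoint returns the file body base64-encoded in a `content` field, while a directory path returns a list of entries. A minimal sketch of decoding a file response follows; `decode_file_content` is a hypothetical helper and it assumes the file is UTF-8 text.

```python
import base64


def decode_file_content(item):
    """Decode the base64 body of a file returned by the contents endpoint."""
    if isinstance(item, list):
        raise ValueError("Got a directory listing, not a single file")
    if item.get("type") != "file" or item.get("encoding") != "base64":
        raise ValueError("Response is not a base64-encoded file")
    # b64decode ignores the newlines that the API inserts into the encoded text
    raw = base64.b64decode(item["content"])
    # Assumption: the file is UTF-8 text; binary files should use `raw` directly
    return raw.decode("utf-8")
```

With the class defined below, this could be applied to the result of `get_contents("microsoft", "vscode", "README.md")` to obtain readable text instead of the raw JSON.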

In [None]:
class GitHubAPI:
    """A class to interact with the GitHub API."""

    def __init__(self, base_url, headers):
        """
        Initialize the GitHubAPI class with a base URL and headers.
        :param base_url: The base URL for the GitHub API
        :param headers: Headers for API authentication
        """
        self.base_url = base_url
        self.headers = headers

    def search_repositories(self, query="language:Python", sort="stars", order="desc"):
        """
        Search for repositories based on a query.
        :param query: Search query (e.g., language:Python)
        :param sort: Sort parameter (e.g., stars, forks)
        :param order: Order parameter (e.g., asc, desc)
        :return: List of repository items
        """
        url = f"{self.base_url}/search/repositories"
        params = {"q": query, "sort": sort, "order": order}
        try:
            response = requests.get(url, headers=self.headers, params=params, timeout=10)
            response.raise_for_status()
            return response.json().get("items", [])
        except requests.exceptions.HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f"Other error occurred: {err}")
        # Return an empty list on failure so the result is always iterable
        return []

    def list_commits(self, owner, repo):
        """
        List commits for a specific repository.
        :param owner: Repository owner
        :param repo: Repository name
        :return: List of commits
        """
        url = f"{self.base_url}/repos/{owner}/{repo}/commits"
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f"Other error occurred: {err}")
        # Return an empty list on failure so the result is always iterable
        return []

    def get_contents(self, owner, repo, path):
        """
        Retrieve contents of a file or directory.
        :param owner: Repository owner
        :param repo: Repository name
        :param path: Path to the file or directory
        :return: Parsed JSON (a dict for a file, a list for a directory), or None on failure
        """
        url = f"{self.base_url}/repos/{owner}/{repo}/contents/{path}"
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f"Other error occurred: {err}")
        # Signal failure explicitly instead of falling off the end of the method
        return None


if __name__ == "__main__":
    # Initialize the GitHubAPI class
    github_api = GitHubAPI(BASE_URL, HEADERS)

    # Example 1: Search for Python repositories sorted by stars
    print("Searching for Python repositories sorted by stars:")
    repositories = github_api.search_repositories()
    if repositories:
        for repo in repositories[:5]:  # Show only the top 5 results
            print(f"Repository: {repo['full_name']} | Stars: {repo['stargazers_count']}")

    # Example 2: List commits for a specific repository
    print("\nListing commits for the 'microsoft/vscode' repository:")
    commits = github_api.list_commits("microsoft", "vscode")
    if commits:
        for commit in commits[:5]:  # Show only the top 5 commits
            print(f"Commit: {commit['commit']['message']} | Author: {commit['commit']['author']['name']}")

    # Example 3: Get contents of a specific path
    print("\nRetrieving contents of the README file in 'microsoft/vscode':")
    contents = github_api.get_contents("microsoft", "vscode", "README.md")
    if contents:
        print(contents)


## Handling Pagination
GitHub API responses for large datasets are paginated. This means:
- Only a limited number of results are returned per request (default: 30, max: 100).
- Subsequent pages must be fetched using the `Link` header in the response.

### Function: `fetch_paginated_results`
- **Purpose**: To automatically fetch all pages of results from a paginated API endpoint.
- **Implementation**:
  - Continuously parse the `Link` header for the URL of the next page.
  - Append results from each page to a list.
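
To make the header format concrete, here is a sketch of extracting the `rel="next"` URL from a `Link` header value. `parse_next_link` is a hypothetical helper, and it assumes the URLs themselves contain no commas.

```python
def parse_next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None if absent."""
    if not link_header:
        return None
    # Entries look like: <https://...>; rel="next", <https://...>; rel="last"
    for part in link_header.split(","):
        sections = part.split(";")
        if len(sections) < 2:
            continue
        url = sections[0].strip().strip("<>")
        rels = [s.strip() for s in sections[1:]]
        if 'rel="next"' in rels:
            return url
    return None
```

The `requests` library also exposes the parsed header as `response.links`, so `response.links.get("next", {}).get("url")` gives the same value without manual parsing.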


In [None]:
class PageFetcher:
    def __init__(self, base_url, headers):
        """
        Initializes the PageFetcher with a base URL and headers.

        :param base_url: Base URL for the API
        :param headers: Headers to include in API requests
        """
        self.base_url = base_url
        self.headers = headers

    def fetch_paginated_results(self, url, max_pages=None):
        """
        Fetches results from a paginated API endpoint by following the 'Link' header.

        :param url: API endpoint with pagination
        :param max_pages: Maximum number of pages to fetch (optional, for testing purposes)
        :return: List of all results from the fetched pages
        """
        results = []
        page_count = 0  # To track the number of pages fetched

        while url:
            response = requests.get(url, headers=self.headers, timeout=10)
            if response.status_code != 200:
                raise Exception(f"Failed to fetch data: {response.status_code}, {response.text}")

            data = response.json()
            if isinstance(data, list):
                results.extend(data)
            else:
                # Search endpoints wrap their results in an "items" list
                results.extend(data.get("items", []))

            # Increment the page count
            page_count += 1
            if max_pages and page_count >= max_pages:
                print(f"Stopping after fetching {page_count} pages (testing limit reached).")
                break

            # requests parses the 'Link' header into response.links;
            # a missing "next" entry means this was the last page
            url = response.links.get("next", {}).get("url")

        return results


if __name__ == "__main__":
    # Create an instance of the PageFetcher
    fetcher = PageFetcher(BASE_URL, HEADERS)

    # Initial URL for the first page
    initial_url = f"{BASE_URL}/repos/microsoft/vscode/contributors?per_page=10"

    try:
        # Set a maximum number of pages to fetch for testing (e.g., 2 pages)
        max_pages_to_fetch = 2
        contributors = fetcher.fetch_paginated_results(initial_url, max_pages=max_pages_to_fetch)
        print(f"Total contributors fetched: {len(contributors)}")
        for contributor in contributors:
            print(contributor)
    except Exception as e:
        print(f"Error fetching contributors: {e}")


## Rate Limit Management
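GitHub imposes rate limits to prevent abuse:
- **Unauthenticated requests**: 60 requests per hour.
- **Authenticated requests**: 5,000 requests per hour.
- **Secondary rate limits**: Additional restrictions on bursts and concurrent requests, enforced separately from the hourly quota.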

### Function: `check_rate_limit`
- **Purpose**: Monitors the current rate limit and determines the time until reset.
- **Implementation**:
  - Sends a request to `/rate_limit`.
  - Logs the remaining requests and reset time.
  
### Function: `wait_if_rate_limited`
- **Purpose**: Pauses execution if the rate limit is exceeded.
- **Implementation**:
  - Calculates the wait time until the rate limit resets.
  - Delays execution using `time.sleep`.
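
Every API response also reports the quota in its `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers, so the wait time can be computed without an extra request. The sketch below illustrates the calculation; `seconds_until_reset` is a hypothetical helper, and the one-second buffer is an arbitrary choice.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to sleep, in seconds, based on rate limit headers."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0  # Quota left, no need to wait
    # X-RateLimit-Reset is a Unix timestamp in seconds
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    now = time.time() if now is None else now
    # Add one second so the next request lands after the reset (assumed buffer)
    return max(0, int(reset_at - now) + 1)
```

`response.headers` from `requests` is case-insensitive, so it can be passed in directly; a plain dictionary must use the exact header names shown.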


In [None]:
class GitHubRateLimiter:
    """A class to handle GitHub API rate limiting."""

    def __init__(self, base_url, headers):
        """
        Initialize the GitHubRateLimiter class.
        :param base_url: The base URL for the GitHub API
        :param headers: Headers for API authentication
        """
        self.base_url = base_url
        self.headers = headers

    def check_rate_limit(self):
        """
        Check the current rate limit for the GitHub API.
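        :return: Dictionary with rate limit details, or None if the request fails
        """
        url = f"{self.base_url}/rate_limit"
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            payload = response.json()
            # The core quota lives under resources.core; the top-level "rate" object is a fallback
            rate_limit = payload.get("resources", {}).get("core") or payload.get("rate", {})
            print("Rate Limit:", rate_limit)
            return rate_limit
        except requests.exceptions.HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f"Other error occurred: {err}")
        # Both failure paths return None explicitly
        return None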

    def wait_if_rate_limited(self):
        """
        Pause execution if the rate limit is exceeded.
        """
        rate_limit = self.check_rate_limit()
        if rate_limit and rate_limit.get("remaining", 0) == 0:
            reset_time = rate_limit.get("reset", 0)
            wait_time = reset_time - int(time.time())
            if wait_time > 0:
                print(f"Rate limit exceeded. Waiting {wait_time} seconds.")
                time.sleep(wait_time)
            else:
                print("Rate limit reset time has already passed. Continuing...")


if __name__ == "__main__":
    # Initialize the GitHubRateLimiter class
    rate_limiter = GitHubRateLimiter(BASE_URL, HEADERS)

    # Example: Check the current rate limit
    print("Checking rate limit:")
    rate_limiter.check_rate_limit()

    # Example: Wait if rate limit is exceeded
    print("\nTesting rate limit handling:")
    rate_limiter.wait_if_rate_limited()


## Data Export
After retrieving and cleaning the data, it can be exported to local files for further analysis or sharing.
- **Formats**:
  - CSV: Tabular format for easy processing.
  - JSON: Nested format for detailed analysis.

### Implementation:
- Save repositories as a CSV file.
- Save commit data as a JSON file.
- When running in Colab, use the `files` module from `google.colab` to download the outputs directly.

### Example:
- CSV: `repositories.csv`
- JSON: `commits.json`
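
The export step can be sketched with the standard library alone. The function names and the chosen CSV columns below are assumptions for illustration; the columns are fields that appear in repository search results.

```python
import csv
import json


def export_repositories_csv(repositories, path="repositories.csv",
                            fields=("full_name", "stargazers_count", "html_url")):
    """Write selected fields of each repository to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(fields))
        writer.writeheader()
        for repo in repositories:
            # Keep only the chosen columns; missing keys become empty cells
            writer.writerow({field: repo.get(field, "") for field in fields})


def export_commits_json(commits, path="commits.json"):
    """Write the commit list to a JSON file, preserving its nesting."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(commits, fh, indent=2, ensure_ascii=False)
```

In Colab, `files.download("repositories.csv")` after `from google.colab import files` would then offer the file for download.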


## Error Handling
Failed requests are reported with a readable message instead of an unexplained crash:
- **Try-Except Blocks**:
  - Catch HTTP errors (e.g., `401 Unauthorized`, `403 Forbidden`).
  - Handle network issues or invalid responses.
- **Key Functions**:
  - `test_auth`: Checks for authentication errors.
  - `fetch_paginated_results`: Raises an exception when a page request does not return `200 OK`.
  - `check_rate_limit`: Handles rate limit issues.
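
As a rough illustration of telling these failures apart, the sketch below maps a status code and response headers to a short explanation. `explain_http_error` is a hypothetical helper and the messages are only suggestions.

```python
def explain_http_error(status_code, headers=None):
    """Return a short, human-readable explanation for a failed request."""
    headers = headers or {}
    if status_code == 401:
        return "Unauthorized: the token is missing, expired, or invalid."
    if status_code in (403, 429):
        # A remaining quota of 0 indicates the primary rate limit was hit
        if str(headers.get("X-RateLimit-Remaining")) == "0":
            return "Rate limit exhausted: wait until the reset time."
        # A Retry-After header usually accompanies a secondary rate limit
        if "Retry-After" in headers:
            return "Secondary rate limit: retry after the indicated delay."
        if status_code == 429:
            return "Too many requests: slow down and retry later."
        return "Forbidden: the token lacks permission for this resource."
    if status_code == 404:
        return "Not found: check the owner, repository, and path."
    return f"Unexpected HTTP status {status_code}."
```

Inside an `except requests.exceptions.HTTPError as err` block, the inputs are available as `err.response.status_code` and `err.response.headers`.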


## Summary
This notebook provides a comprehensive workflow for interacting with the GitHub API, including:
1. Authentication and connection testing.
2. Data retrieval for repositories, commits, and contents.
3. Pagination handling for large datasets.
4. Exporting data for further use.
5. Error handling and rate limit management.

By following these steps, I ensure efficient and reliable interaction with the GitHub API.
