Extracting and transforming data from GitHub with the GitHub REST API.


>[Install extra library](#scrollTo=YmqZtuFVDKpJ)

>[Import necessary libraries](#scrollTo=6cbXaYaPDOcZ)

>[Authorization](#scrollTo=1bcx6Dv43kya)

>[Rate limit](#scrollTo=9-9rL5nO8ec1)

>>>[Control primary rate limit](#scrollTo=9-9rL5nO8ec1)

>>>[Control secondary rate limit](#scrollTo=9-9rL5nO8ec1)

>[Pagination](#scrollTo=bM0YMTqccpbi)

>[Collection](#scrollTo=6whHRGT5Djsz)

>>[Search repo](#scrollTo=Y1jAzGUmL13T)

>>[Collect all commits by the repositories](#scrollTo=j4yADKNsBKP-)

>>[Collect repo's content from main branch from main page](#scrollTo=l0cR_VFzhjCP)



# Install extra library

In [None]:
# install extra library for auth
!pip install python-dotenv

# Import necessary libraries

In [2]:
# import necessary libraries
import requests
import pandas as pd
from dotenv import load_dotenv
import json
import os
from datetime import datetime
import re
import time


# Authorization

Prepare a `.env` file:<br>
GITHUB-TOKEN=Bearer YOUR TOKEN<br>
VERSION=API VERSION<br>
- `Bearer YOUR TOKEN`: the word `Bearer` followed by your fine-grained personal access token. [Create token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-fine-grained-personal-access-token)
- `API VERSION`: the REST API version to request. [Supported versions](https://docs.github.com/en/rest/about-the-rest-api/api-versions?apiVersion=2022-11-28#supported-api-versions)

Save the file in the notebook's working directory.

Check the authorization with the next cell.

In [None]:
# check .env file
def check_auth_headers():
    # load environment variables .env
    load_dotenv(".env")

    # Extract variables
    api_token = os.getenv("GITHUB-TOKEN")
    api_version = os.getenv("VERSION")
    check_exist = 'You fill .env'
    check_auth_headers_flag = True

    if api_token is None or api_version is None:
        check_exist = 'Error - you miss token or version, check .env'
        check_auth_headers_flag = False

    return check_exist, check_auth_headers_flag, api_token, api_version
auth_check = check_auth_headers()
# Print the status and the API version only, so the token is not echoed into the notebook output
print(auth_check[0], f'version - {auth_check[3]}', sep='\n')


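# Authenticating to the REST API
def autorization():
    # Read the .env values once instead of re-parsing the file for every field
    check_exist, is_valid, api_token, api_version = check_auth_headers()
    if not is_valid:
        print(check_exist)
        return None

    url = "https://api.github.com/octocat"
    headers = {
        'Accept': 'application/vnd.github+json',
        'Authorization': api_token,
        'X-GitHub-Api-Version': api_version
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return headers
    # Return None explicitly so callers can tell that authentication failed
    print(f'Authentication failed. Status code: {response.status_code}')
    return None

# Report the outcome without displaying the headers, which contain the token
print('Authorized' if autorization() else 'Not authorized')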

# Rate limit
- **Primary rate limit** <br>
Requests authenticated with a personal access token are limited to 5,000 requests per hour. <br>[Check](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#primary-rate-limit-for-authenticated-users)
- **Secondary rate limit**
-- No more than 100 concurrent requests are allowed.
-- No more than 900 points per minute are allowed for REST API endpoints.
-- No more than 90 seconds of CPU time per 60 seconds of real time is allowed.
-- Too many requests that consume excessive computing resources in a short period of time can also trigger the limit.<br>
[Check](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#about-secondary-rate-limits)

### Control primary rate limit
-- The `x-ratelimit-remaining` response header should be greater than 0.
-- When `x-ratelimit-remaining` is 0, send the next request only after the time given in `x-ratelimit-reset` (UTC epoch seconds).

### Control secondary rate limit
-- If the `retry-after` response header is present, do not retry the request until that many seconds have elapsed.
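The two control rules above can be folded into one small helper. This is a minimal sketch, not the notebook's own implementation: the name `seconds_to_wait` and the `now` parameter (added so the function can be checked without a real clock) are assumptions, and `headers` is expected to be a plain mapping of response header names to string values.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how many seconds to pause before the next request (0 if no pause is needed)."""
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalise them first
    lowered = {key.lower(): value for key, value in headers.items()}

    # Secondary rate limit: the server states the pause directly
    if "retry-after" in lowered:
        return int(lowered["retry-after"])

    # Primary rate limit: wait until the reset time (UTC epoch seconds), plus one spare second
    if lowered.get("x-ratelimit-remaining") == "0":
        reset_time = int(lowered["x-ratelimit-reset"])
        return max(reset_time - int(now), 0) + 1

    return 0


# Made-up header values, for illustration only
print(seconds_to_wait({"Retry-After": "30"}))
print(seconds_to_wait({"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}, now=940))
```

Clamping the difference at zero avoids passing a negative value to `time.sleep` when the reset time has already passed.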

# Pagination
- Some endpoints paginate their results.<br>
When a response is paginated, its headers include [Link](https://docs.github.com/en/rest/using-the-rest-api/using-pagination-in-the-rest-api?apiVersion=2022-11-28), which holds the URLs of the neighbouring pages (for example `rel="next"` and `rel="last"`).
-- page - the current page number
-- per_page - the number of results returned per page
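To show what following the `Link` header looks like, here is a small sketch that pulls the `rel="next"` URL out of a header value using only the standard library. The helper name `next_page_url` is hypothetical and the sample header is an illustrative value, not output captured from a real request.

```python
import re


def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header value, or None if there is none."""
    # Each entry looks like: <https://...>; rel="next"
    for match in re.finditer(r'<([^>]+)>\s*;\s*rel="([^"]+)"', link_header):
        url, relation = match.groups()
        if relation == "next":
            return url
    return None


# Illustrative header value
sample = (
    '<https://api.github.com/search/repositories?q=etl&page=2>; rel="next", '
    '<https://api.github.com/search/repositories?q=etl&page=34>; rel="last"'
)
print(next_page_url(sample))
```

When the `requests` library is used, `response.links` exposes the same information already parsed, so `response.links.get("next", {}).get("url")` is an alternative to a hand-written parser.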

# Collection

## Search repo
- 30 results per page by default
- 30 requests per minute with a fine-grained token (the search endpoints have their own rate limit)
- search by name only
- extract the owner login and repo names
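The points above can be sketched in two small functions: one builds a URL-encoded search request restricted to repository names, the other extracts the name and owner login from a search response. The function names `build_search_url` and `extract_repos` are hypothetical, and the sample response is made up to mirror only the fields used here.

```python
from urllib.parse import urlencode


def build_search_url(term, per_page=30):
    """Build a repository search URL that matches the term against repo names only."""
    query = {"q": f"{term} in:name", "per_page": per_page}
    # urlencode escapes spaces and the colon in the in:name qualifier
    return "https://api.github.com/search/repositories?" + urlencode(query)


def extract_repos(search_response):
    """Return [repo name, owner login] pairs from a parsed search response."""
    return [
        [item["name"], item["owner"]["login"]]
        for item in search_response.get("items", [])
    ]


# Made-up response containing only the fields read above
sample_response = {"items": [{"name": "hello-world", "owner": {"login": "octocat"}}]}

print(build_search_url("etl"))
print(extract_repos(sample_response))
```

Encoding the query explicitly keeps the URL valid when the search term contains spaces or special characters.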

In [None]:
# Set the search term for the repository search
max_length = 256

while True:
    user_input = input("Type search term (up to 256 characters): ")
    if len(user_input) <= max_length:
        break
    else:
        print(f"Too long! Max length: {max_length}. Try again.")


url = f'https://api.github.com/search/repositories?q={user_input} in:name'
all_results = []  # List of result
payload = {}
headers = autorization()

# Set the request limit
requests_per_hour = 5000  # max request per hour for auth user
time_between_requests = 3600 / requests_per_hour  # time interval between requests
secondary_rate_limit_wait = 60  # seconds to wait after hitting a secondary rate limit

while url:
    print(f'url:{url}')  # show the URL being requested, including the page number
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        data = response.json()
        all_results.extend(data.get("items", []))  # add the current page's results

        # requests parses the Link header into response.links;
        # follow rel="next" if it exists, otherwise url becomes None and the loop ends
        url = response.links.get("next", {}).get("url")
        time.sleep(time_between_requests)  # pause between requests

    # handling 403 and 429 errors
    elif response.status_code in (403, 429):
        # Secondary rate limit: wait for as long as Retry-After asks
        if 'Retry-After' in response.headers:
            wait_time = int(response.headers['Retry-After'])
            print(f"Secondary rate limit reached. Waiting for {wait_time} seconds...")
            time.sleep(wait_time)
            continue
        # Primary rate limit: wait until the reset time
        if response.headers.get('X-RateLimit-Remaining') == '0':
            reset_time = int(response.headers['X-RateLimit-Reset'])
            wait_time = max(reset_time - int(time.time()), 0)  # never sleep a negative time
            print(f"Primary rate limit reached. Waiting for {wait_time} seconds...")
            time.sleep(wait_time + 1)
            continue
        # A 403 that is not a rate limit (for example, missing permissions): stop instead of looping forever
        error_message = response.json().get("message", "No error message provided")
        print(f"Error: {response.status_code} {error_message}")
        break

    # handling other errors
    else:
        error_message = response.json().get("message", "No error message provided")
        print(f"Error: {response.status_code} {error_message}")
        break
# extract each repo's name and owner login
repo_list = [[i['name'], i['owner']['login']] for i in all_results]

## Collect all commits by the repositories
- repo_list - the list of all found repos and their owners.

In [None]:
all_commits = []
commit_data_list = []
for i in repo_list:
    repo_name = i[0]
    owner_name = i[1]
    url = f'https://api.github.com/repos/{owner_name}/{repo_name}/commits'

    while url:
        print(url)
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            data = response.json()
            for commit in data:
                # Note: the name and date are taken from the committer field
                commit_data_list.append({
                    'owner': owner_name,
                    'repository': repo_name,
                    'commit_author': commit['commit']['committer']['name'],
                    'commit_date': commit['commit']['committer']['date'],
                    'commit_message': commit['commit']['message']
                })
            all_commits.extend(data)

            # Follow the rel="next" link if it exists, otherwise url becomes None and the loop ends
            url = response.links.get("next", {}).get("url")

        # handling 403 and 429 errors
        elif response.status_code in (403, 429):
            # Secondary rate limit: wait for as long as Retry-After asks
            if 'Retry-After' in response.headers:
                wait_time = int(response.headers['Retry-After'])
                print(f"Secondary rate limit reached. Waiting for {wait_time} seconds...")
                time.sleep(wait_time)
                continue
            # Primary rate limit: wait until the reset time
            if response.headers.get('X-RateLimit-Remaining') == '0':
                reset_time = int(response.headers['X-RateLimit-Reset'])
                wait_time = max(reset_time - int(time.time()), 0)  # never sleep a negative time
                print(f"Primary rate limit reached. Waiting for {wait_time} seconds...")
                time.sleep(wait_time + 1)
                continue
            # A 403 that is not a rate limit: skip this repo instead of looping forever
            error_message = response.json().get("message", "No error message provided")
            print(f"Error: {response.status_code} {error_message}")
            break

        # handling other errors (the rest of this repo is skipped)
        else:
            error_message = response.json().get("message", "No error message provided")
            print(f"Error: {response.status_code} {error_message}")
            break


# Transform to df, drop rows with a missing commit message, save as .csv
if not commit_data_list:
    print(f'No commits found for "{user_input}"')
else:
    commits_df = pd.DataFrame(commit_data_list)
    commits_df.dropna(subset=['commit_message'], inplace=True)
    commits_df.to_csv('Commit_list.csv')
    commits_df.info()

## Collect repo's content from main branch from main page

In [None]:

all_content = []
content_data_list = []
for i in repo_list:
    repo_name = i[0]
    owner_name = i[1]
    url = f'https://api.github.com/repos/{owner_name}/{repo_name}/contents'

    # Retry the same repo after a rate-limit wait instead of skipping it
    while True:
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            data = response.json()
            for item in data:
                content_data_list.append({
                    'owner': owner_name,
                    'repository': repo_name,
                    'content_name': item['name'],
                    'content_path': item['path'],
                    'content_size': item['size'],
                    'content_type': item['type']
                })
            all_content.extend(data)
            break

        if response.status_code in (403, 429):
            # Secondary rate limit: wait for as long as Retry-After asks
            if 'Retry-After' in response.headers:
                wait_time = int(response.headers['Retry-After'])
                print(f"Secondary rate limit reached. Waiting for {wait_time} seconds...")
                time.sleep(wait_time)
                continue
            # Primary rate limit: wait until the reset time
            if response.headers.get('X-RateLimit-Remaining') == '0':
                reset_time = int(response.headers['X-RateLimit-Reset'])
                wait_time = max(reset_time - int(time.time()), 0)  # never sleep a negative time
                print(f"Primary rate limit reached. Waiting for {wait_time} seconds...")
                time.sleep(wait_time + 1)
                continue

        # Any other error (including a 403 that is not a rate limit): report it and move to the next repo
        error_message = response.json().get("message", "No error message provided")
        print(f"Error: {response.status_code} {error_message}")
        break


# Transform to df, drop rows with missing values, save as .csv
if not content_data_list:
    print(f'No repository content found for "{user_input}"')
else:
    content_df = pd.DataFrame(content_data_list)
    content_df.dropna(subset=['content_path', 'content_size', 'content_type'], inplace=True)
    content_df.info()
    content_df.to_csv('Content_list.csv')