## BigQuery - GitHub
This basic Python kernel shows you how to query the `commits` table in the GitHub Repos BigQuery dataset. We will use this information to obtain a representative sample of all the public repositories on GitHub. To run 

In [None]:
from google.cloud import bigquery
import pandas as pd
import math
import random

In [None]:

client = bigquery.Client()
QUERY = """
        SELECT *
        FROM `bigquery-public-data.github_repos.commits`
        LIMIT 2000
        """

query_job = client.query(QUERY)

iterator = query_job.result(timeout=30)
rows = list(iterator)

commit_messages = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows[0].keys()))

# Look at the first 10 rows of the commits table
commit_messages.head(10)

In [None]:

def get_unique_repo_names(dataset_id='bigquery-public-data', table_id='github_repos.commits', limit=2000):
    """
    Get all unique values of the 'repo_name' variable in a BigQuery table.

    Parameters:
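    - dataset_id (str): The project that hosts the table (the first part of the fully qualified name).
    - table_id (str): The dataset-qualified table name, e.g. 'github_repos.commits'.
    - limit (int): The maximum number of rows to retrieve. Pass a falsy value (None, 0, False) for no limit.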

    Returns:
    - pd.Series: A Pandas Series containing unique repo names.
    """
    client = bigquery.Client()

    # Build the SQL query
    if limit:
        # Cap the number of distinct repo names returned
        query = f"""
            SELECT DISTINCT repo_name
            FROM `{dataset_id}.{table_id}`, UNNEST(repo_name) AS repo_name
            LIMIT {limit}
        """
    else:
        # No limit: return every distinct repo name in the table
        query = f"""
            SELECT DISTINCT repo_name
            FROM `{dataset_id}.{table_id}`, UNNEST(repo_name) AS repo_name
        """
    
    # Run the query
    query_job = client.query(query)
    iterator = query_job.result(timeout=30)
    rows = list(iterator)

    # Create a Pandas Series from the results
    repo_names = pd.Series([row['repo_name'] for row in rows], name='repo_name')

    return repo_names

unique_repo_names = get_unique_repo_names(limit=False)

## Subsample 
Now that we have the total number of repositories in the population, we can apply a rule of thumb to obtain a representative sample. I take a subsample to reduce the training time of the models while preserving the original distribution of the population. In a more constrained scenario I would also check that the sample preserves the proportions of the different languages that appear in the population; for this challenge, however, I simplify the problem and only draw 9,900 random repositories. For a production version of this application, we could use the entire population in the database and train the model on a multicore EC2 instance. For a more robust sample extraction, we could also implement stratified sampling of the repositories.
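
The rule of thumb can be written out explicitly. The block below is a sketch that assumes the intended rule is Cochran's sample-size formula with a finite population correction, which is what the parameters of `calculate_sample_size` (z-score, `p = 0.5`, margin of error, population size) suggest. The symbols $n_0$, $n$, $e$, and $N$ are notation introduced here, not names from the notebook.

```latex
% n_0: sample size for an infinite population
% n:   sample size after the finite population correction
% z:   z-score for the confidence level, p: assumed proportion,
% e:   margin of error, N: population size
n_0 = \frac{z^2 \, p \, (1 - p)}{e^2},
\qquad
n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}

% Worked example with z = 1.96, p = 0.5, e = 0.01, N = 3{,}300{,}000:
n_0 = \frac{1.96^2 \times 0.25}{0.01^2} = 9604,
\qquad
n = \frac{9604}{1 + \dfrac{9603}{3{,}300{,}000}} \approx 9576.1
```

Because the population is large compared with $n_0$, the correction removes fewer than 30 repositories from the uncorrected estimate.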


Stratified sampling is a sampling technique involving the division of the total population into smaller subgroups, known as strata, and subsequently taking a sample from each stratum in proportion to its size in the overall population. This methodology aims to ensure that all subgroups are adequately represented in the final sample. In the context of analyzing GitHub repositories, applying a stratified approach would involve identifying different categories of repositories, such as those written in different programming languages or belonging to various application domains. 
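
The idea can be sketched in a few lines of Python. This is a minimal illustration with proportional allocation, not part of the pipeline in this notebook: the function name `stratified_sample_sketch` and the `strata` mapping (repository name to a stratum label such as its primary language) are hypothetical, and the language labels would have to come from a source other than the `commits` table.

```python
import random
from collections import defaultdict


def stratified_sample_sketch(items, strata, sample_size, seed=42):
    """Draw a proportionally allocated stratified sample (illustrative only).

    items: list of repository names.
    strata: hypothetical dict mapping each repository name to a stratum label.
    sample_size: target total; rounding per stratum can shift it slightly.
    """
    rng = random.Random(seed)  # local generator, so the global seed is untouched

    # Group the items by stratum label
    groups = defaultdict(list)
    for item in items:
        groups[strata[item]].append(item)

    total = len(items)
    sample = []
    for label in sorted(groups):  # sorted for a reproducible draw order
        members = groups[label]
        # Proportional share, at least one item, never more than the stratum holds
        k = min(len(members), max(1, round(sample_size * len(members) / total)))
        sample.extend(rng.sample(members, k))
    return sample


# Toy population: 80 "python" repositories and 20 "go" repositories
toy_repos = [f"user/py-{i}" for i in range(80)] + [f"user/go-{i}" for i in range(20)]
toy_strata = {name: ("python" if "/py-" in name else "go") for name in toy_repos}
toy_sample = stratified_sample_sketch(toy_repos, toy_strata, sample_size=10)
```

In the toy population the two strata keep their 80/20 split in the sample, which simple random sampling only achieves on average.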

In [None]:

from statistics import NormalDist

def calculate_sample_size(population_size, confidence_level=0.95, margin_of_error=0.01):
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%)
    z_score = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    p = 0.5  # conservative estimate: maximizes p * (1 - p)
    # Sample size for an infinite population
    n_0 = (z_score ** 2 * p * (1 - p)) / margin_of_error ** 2
    # Finite population correction
    sample_size = math.ceil(n_0 / (1 + (n_0 - 1) / population_size))
    return sample_size

total_repositories = 3_300_000
sample_size = calculate_sample_size(total_repositories)
print(f"Recommended sample size: {sample_size}")


In [None]:

def get_representative_sample(repo_names, sample_size=10, seed=42):
    """
    Get a representative sample of repositories from the provided list.

    Parameters:
    - repo_names (pd.Series): Pandas Series containing repository names.
    - sample_size (int): The size of the representative sample.
    - seed (int): Seed for the random number generator.

    Returns:
    - list: A list containing a representative sample of repository names.
    """
    # Set the seed for reproducibility
    random.seed(seed)

    # Check if the sample size is greater than the total number of repositories
    if sample_size > len(repo_names):
        raise ValueError("Sample size cannot be greater than the total number of repositories.")

    # Get a representative sample using random sampling
    sample = random.sample(repo_names.tolist(), sample_size)

    return sample

# Example usage
# Assuming you already have a Pandas Series named unique_repo_names
subset_sample = get_representative_sample(unique_repo_names, sample_size=sample_size, seed=123)

# Print the representative sample
print(len(subset_sample))
subset_sample[:10]

In [None]:
query = """
    SELECT
        repo_name, 
        COUNT(*) AS commit_count,
        year, week_number
    FROM

    (SELECT
        ARRAY_TO_STRING(repo_name, ',') AS repo_name,
        FORMAT_TIMESTAMP('%Y%m%d', TIMESTAMP_SECONDS(committer.date.seconds)) AS date,
        -- ISOYEAR pairs with ISOWEEK, so commits around New Year land in the right week
        EXTRACT(ISOYEAR FROM TIMESTAMP_SECONDS(committer.date.seconds)) AS year,
        EXTRACT(ISOWEEK FROM TIMESTAMP_SECONDS(committer.date.seconds)) AS week_number,
        EXTRACT(MONTH FROM TIMESTAMP_SECONDS(committer.date.seconds)) AS month,
        EXTRACT(DAY FROM TIMESTAMP_SECONDS(committer.date.seconds)) AS day,
    FROM
        `bigquery-public-data.github_repos.commits`) A
        
    GROUP BY
        A.repo_name, A.year, A.week_number
"""




query_job = client.query(query)
iterator = query_job.result()
rows = list(iterator)
result_df = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows[0].keys()))


## Preprocess Data
Now that we have the commit history for a subsample of 9,900 repositories, I want to extract the variables required for our forecasting models. The main variables for this exercise are the date and the number of commits per week. This information is contained in the committer fields (represented as JSON). After applying the preprocess_data function, we are ready to start exploring the data and forming assumptions for the forecasting models.
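
As an illustration of the kind of step this involves, the sketch below turns weekly counts into a gap-free series per repository. It is not the `preprocess_data` function, which is not shown here. The helper name `weekly_series_sketch` is hypothetical, and it assumes rows shaped like the aggregation query's output, `(repo_name, year, week_number, commit_count)`, with `year` being the ISO year that matches `week_number`.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_series_sketch(rows):
    """Hypothetical helper: build a gap-free weekly commit series per repository.

    rows: iterable of (repo_name, year, week_number, commit_count) tuples,
    where year is assumed to be the ISO year for week_number.
    """
    counts = defaultdict(dict)
    for repo, year, week, n in rows:
        # Monday of the ISO week (requires Python 3.8+)
        week_start = date.fromisocalendar(year, week, 1)
        counts[repo][week_start] = counts[repo].get(week_start, 0) + n

    series = {}
    for repo, by_week in counts.items():
        current, last = min(by_week), max(by_week)
        filled = []
        while current <= last:
            # Weeks without commits are missing from the query result; fill with 0
            filled.append((current, by_week.get(current, 0)))
            current += timedelta(weeks=1)
        series[repo] = filled
    return series


# Toy input: commits in ISO weeks 1 and 3 of 2020, none in week 2
example_rows = [("octo/demo", 2020, 1, 4), ("octo/demo", 2020, 3, 2)]
example_series = weekly_series_sketch(example_rows)
```

Filling the empty weeks with zeros matters for forecasting, since most time-series models expect evenly spaced observations.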

In [None]:
result_df.to_csv("../../data/commit_history.csv")

In [None]:
sample_df = result_df[result_df.repo_name.isin(subset_sample)]
# Save the filtered sample, not the full result
sample_df.to_csv("../../data/commit_history_sample.csv")