## Subsample 
Now that we know the total number of repositories in the population, we can apply a rule of thumb to obtain a representative sample. I'm taking a subsample to reduce the training time of the models while still preserving the original distribution of the population. In a more constrained scenario I would also check that the sample preserves the proportion of each programming language in the population; for this challenge, however, I will simplify the problem and draw only 9,900 random repositories. For a production version of this application we could use the entire population in the database and train the model on a multicore EC2 instance. For a more robust sample extraction, we could also implement stratified sampling of the repositories.


Stratified sampling is a sampling technique involving the division of the total population into smaller subgroups, known as strata, and subsequently taking a sample from each stratum in proportion to its size in the overall population. This methodology aims to ensure that all subgroups are adequately represented in the final sample. In the context of analyzing GitHub repositories, applying a stratified approach would involve identifying different categories of repositories, such as those written in different programming languages or belonging to various application domains. 
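
The proportional allocation described above can be sketched with pandas. This is only an illustration under stated assumptions: the `language` column, the toy data, and the helper name `stratified_sample` are hypothetical and do not exist in the raw dataset used in this notebook.

```python
import pandas as pd


def stratified_sample(df, strata_col, frac, seed=42):
    """Draw the same fraction of rows from every stratum (proportional allocation)."""
    return df.groupby(strata_col, group_keys=False).sample(frac=frac, random_state=seed)


# Toy population; the `language` column is a hypothetical stratum label
repos = pd.DataFrame({
    "repo_name": [f"repo_{i}" for i in range(100)],
    "language": ["Python"] * 60 + ["Java"] * 30 + ["Go"] * 10,
})

sample = stratified_sample(repos, "language", frac=0.1)
print(sample["language"].value_counts())
```

Because every stratum is sampled with the same fraction, the language shares in the sample mirror those of the population, which plain random sampling only achieves on average.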


In [None]:
import pandas as pd
import os
import math
import random
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Get the current working directory
current_dir = os.getcwd()

# Navigate to the data directory from the forecast directory
data_path = os.path.join(current_dir, "../data/raw/commit_history_raw.csv")

# Read the CSV file
df_raw = pd.read_csv(data_path)

### Reducing information
The raw data was obtained directly from Google BigQuery. It contains thousands of repositories, and the combination of user name and repository name may repeat, so repo_name is not a valid ID. Even though the data was previously grouped at the repo_name level, there are special cases where more than one user authors a project and one of them is also a collaborator on a similar repository with the same name. We are going to assume that every repo_name belongs to the same user and the same repository, so we need to group the information again.

In [None]:
df_raw.shape

In [None]:
df_regroup = df_raw.groupby(["repo_author_single", "year", "week_number"])["commit_count"].sum()

In [None]:
df_regroup = df_regroup.reset_index()

In [None]:
unique_repo_names =df_regroup.repo_author_single.unique()

## Preprocess Data
Now that we have all the information for a subsample of 9605 repositories, I want to extract the variables required by our forecast models. The main variables for this exercise are the date and the number of commits per week. This information is contained in the committer variables (represented as JSON). After applying the preprocess_data function we are ready to start exploring the data and forming assumptions for the forecasting models.

In [None]:
def get_representative_sample(repo_names, sample_size=10, seed=42):
    """
    Get a representative sample of repositories from the provided list.

    Parameters:
    - repo_names (pd.Series): Pandas Series containing repository names.
    - sample_size (int): The size of the representative sample.
    - seed (int): Seed for the random number generator.

    Returns:
    - list: A list containing a representative sample of repository names.
    """
    # Set the seed for reproducibility
    random.seed(seed)

    # Check if the sample size is greater than the total number of repositories
    if sample_size > len(repo_names):
        raise ValueError("Sample size cannot be greater than the total number of repositories.")

    # Get a representative sample using random sampling
    sample = random.sample(repo_names.tolist(), sample_size)

    return sample


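from statistics import NormalDist


def calculate_sample_size(population_size, confidence_level=0.95, margin_of_error=0.01):
    """Cochran's sample size for a proportion, with finite population correction."""
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%)
    z_score = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    p = 0.5  # most conservative assumption for the proportion
    # Sample size for an infinite population: n0 = z^2 * p * (1 - p) / e^2
    n0 = (z_score ** 2 * p * (1 - p)) / margin_of_error ** 2
    # Finite population correction: n = n0 / (1 + (n0 - 1) / N)
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))

# The population is the set of distinct repositories, not the repo-week rows
total_repositories = len(unique_repo_names)
sample_size = calculate_sample_size(total_repositories)
print(f"Recommended sample size: {sample_size}")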



# Example usage
# Assuming you already have a Pandas Series named unique_repo_names
subset_sample = get_representative_sample(unique_repo_names, sample_size=sample_size, seed=123)

# Print the representative sample
print(len(subset_sample))
subset_sample[:10]
    


## Data Analysis
### Date Selection
In the following plot we can see how many observations we collected for each date. We can already spot some outliers dated before 1998. Our models generally behave better the more data we pass to them, which is particularly true when using seasonal variables such as months and years. However, we will focus only on the last three years of data to reduce the computation time of the algorithms and the size of the training dataset, and to make the models easier to evaluate. In a production setting we could use the entire history while still keeping only observations that happened after 2000.

In [None]:
def format_date(df):
    df['week_number'] = df['week_number'].astype(int)
    df['year'] = df['year'].astype(int)
    # Build the Monday of each (year, week) pair; %W counts weeks starting on Monday
    df['date'] = pd.to_datetime(
        df['year'].astype(str) + df['week_number'].astype(str).str.zfill(2) + '1',
        format='%Y%W%w',
    )
    # Move each date forward to the Sunday that closes its week
    df['date'] = df['date'] + pd.to_timedelta((6 - df['date'].dt.dayofweek) % 7, unit='D')
    return df

df_regroup = format_date(df_regroup)

In [None]:
#Make sure that we are using Sunday as the first day of the week, this is the default week start for the shift and window functions
max_date = pd.to_datetime('2022-01-01')
min_date = df_regroup['date'].min()

three_years_ago = max_date - pd.DateOffset(years=3)
df_regroup = df_regroup[(df_regroup.date >= three_years_ago)&(df_regroup.date <= max_date)]

sorted_dates = sorted(df_regroup.date.unique())
if sorted_dates[0].dayofweek != 6:
    raise ValueError("Weeks are not starting on Sunday")

In [None]:
sorted_dates[0].dayofweek

In [None]:
max_date = df_regroup['date'].max()
min_date = df_regroup['date'].min()
print("max_date: " , max_date) 
print("min_date: " , min_date) 


plt.figure(figsize=(12, 6))

# Histogram of the number of repo-week observations per date
sns.histplot(x='date', data=df_regroup, kde=False)

# Set labels and title
plt.xlabel('Date')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Observations per Date')

# Show the plot
plt.show()



In [None]:
max_date = pd.to_datetime('2022-01-01')
min_date = df_regroup['date'].min()

three_years_ago = max_date - pd.DateOffset(years=3)
df_regroup = df_regroup[(df_regroup.date >= three_years_ago)&(df_regroup.date <= max_date)]

print("max_date: " , max_date) 
print("min_date: " , min_date) 

In [None]:



plt.figure(figsize=(12, 6))

# Histogram of the number of repo-week observations per date, after filtering
sns.histplot(x='date', data=df_regroup, kde=False)

# Set labels and title
plt.xlabel('Date')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Observations per Date')

# Show the plot
plt.show()



### Replace outliers
We want to detect the outliers in our data. A common rule of thumb is to flag every data point that falls more than 2 standard deviations away from the sample mean. First we want to check whether the number of commits per repository follows a normal distribution without heavy tails. We then want to repeat the same analysis at the repository-week level and set all flagged observations to the bound we just defined (mean ± 2 std).
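
As a rough illustration of the tail check and the capping rule, the sketch below uses synthetic, right-skewed counts rather than the notebook's data. The helper name `upper_bound` and the lognormal toy series are assumptions made for the example. Since commit counts cannot be negative, only the upper bound (mean + 2 std) is applied.

```python
import numpy as np
import pandas as pd


def upper_bound(values, n_std=2.0):
    """Upper outlier bound: sample mean plus n_std sample standard deviations."""
    return values.mean() + n_std * values.std()


# Synthetic right-skewed counts standing in for weekly commit counts
rng = np.random.default_rng(0)
commits = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=5_000)).round()

# A normal distribution has skewness and excess kurtosis close to 0;
# large positive values point to a heavy right tail
print("skewness:", commits.skew())
print("excess kurtosis:", commits.kurt())

# Cap every observation above the bound at the bound itself
bound = upper_bound(commits)
capped = commits.clip(upper=bound)
print("bound:", bound, "max after capping:", capped.max())
```

When the skewness and kurtosis are far from zero, the mean and standard deviation are themselves inflated by the tail, so the 2-std bound should be read as a pragmatic cap rather than a formal outlier test.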


In [None]:

# Draw the sample of repositories (same seed as before, so the same sample)
subset_sample = get_representative_sample(unique_repo_names, sample_size=sample_size, seed=123)

# Print the sample size
print(len(subset_sample))

df_sample = df_regroup[df_regroup.repo_author_single.isin(subset_sample)]

In [None]:
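# Weekly-level statistics and the outlier benchmark (mean + 2 std)
mean = df_sample['commit_count'].mean()
std = df_sample['commit_count'].std()
outlier_benchmark = mean + (2*std)
max_commits_week = df_sample['commit_count'].max()
# Keep only weeks with at most 100 commits so the histogram stays readable
df_test = df_sample[df_sample.commit_count<=100]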



In [None]:
print("outlier_benchmark:" , outlier_benchmark)
print("max_commits_week:" , max_commits_week)
print("mean:" , mean)

#### Repositories Weekly Commit Distribution

In [None]:

plt.figure(figsize=(12, 6))

# Histogram of weekly commit counts
sns.histplot(x='commit_count', data=df_test, kde=False)

# Add a vertical line for the outlier benchmark
plt.axvline(x=outlier_benchmark, color='red', linestyle='--', label='Outlier Benchmark')

# Set labels and title
plt.xlabel('Commit Count')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Commit Count')

# Add a legend
plt.legend()

# Show the plot
plt.show()


For visualization purposes I have already truncated the distribution, since some repositories received more than 5,000 commits in a single week. We can see that the average falls at 7.73 commits per week. We will repeat the same plot, this time grouping the data at the repository level:


In [None]:
df_group_repo = df_sample.groupby(["repo_author_single"])["commit_count"].sum().reset_index()

# Repository-level statistics and the outlier benchmark (mean + 2 std)
mean = df_group_repo['commit_count'].mean()
std = df_group_repo['commit_count'].std()
outlier_benchmark = mean + (2*std)
# Despite its name, this now holds the largest total commit count of any repository
max_commits_week = df_group_repo['commit_count'].max()
# Keep only repositories with at most 100 total commits so the histogram stays readable
df_test = df_group_repo[df_group_repo.commit_count<=100]



In [None]:
print("outlier_benchmark:" , outlier_benchmark)
print("max_commits_week:" , max_commits_week)
print("mean:" , mean)

In [None]:

# Set the figure size
plt.figure(figsize=(12, 6))

# Histogram of total commit counts per repository
sns.histplot(x='commit_count', data=df_test, kde=False)

# Add a vertical line for the outlier benchmark
plt.axvline(x=outlier_benchmark, color='red', linestyle='--', label='Outlier Benchmark')

# Set labels and title
plt.xlabel('Commit Count')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Commit Count')

# Add a legend
plt.legend()

# Show the plot
plt.show()

Let's repeat the same plot for the imputed data and see how the outlier benchmark changes.

In [None]:
# Apply outlier imputation: cap weekly counts at the benchmark.
# Note: outlier_benchmark currently holds the repository-level value from the previous cell.
df_bounded = df_sample.copy()
df_bounded.loc[df_bounded["commit_count"]>=outlier_benchmark, "commit_count"] = int(round(outlier_benchmark))

df_group_repo_bounded = df_bounded.groupby(["repo_author_single"])["commit_count"].sum().reset_index()

mean = df_group_repo_bounded['commit_count'].mean()
std = df_group_repo_bounded['commit_count'].std()
outlier_benchmark = mean + (2*std)
max_commits_week = df_group_repo_bounded['commit_count'].max()
# Keep only repositories with at most 100 total commits for plotting
df_test = df_group_repo_bounded[df_group_repo_bounded.commit_count<=100]

print("outlier_benchmark:" , outlier_benchmark)
print("max_commits_week:" , max_commits_week)
print("mean:" , mean)


In [None]:

# Set the figure size
plt.figure(figsize=(12, 6))

# Histogram of total commit counts per repository after capping
sns.histplot(x='commit_count', data=df_test, kde=False)

# Add a vertical line for the outlier benchmark
plt.axvline(x=outlier_benchmark, color='red', linestyle='--', label='Outlier Benchmark')

# Set labels and title
plt.xlabel('Commit Count')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Commit Count')

# Add a legend
plt.legend()

# Show the plot
plt.show()

It doesn't seem that capping weekly outliers solves the outlier problem at the repository level. There is a trade-off in how to treat these outliers. On the one hand, we don't want to delete important observations from our dataset: if we are building a tool that tells the user which repositories will have more support in the long term, we don't want to delete the atypical repositories that happen to have a lot of user support. On the other hand, these values could still be outliers: these scarce observations represent very popular repositories with a lot of support, and it is very unlikely that a random repository will behave the same way. In other words, the outliers can bias the model toward overestimating the total number of commits, but if we drop them we may entirely remove a subcategory of our data (very important, popular repositories with long-term support).

I'll make another assumption here: I will treat these values as outliers and delete these repositories entirely from my sample. If I have enough time, I'll come back and implement a stratified outlier replacement, reducing all the weekly commit_count values for these special cases by the same proportion so that each repository's total is less than 470 commits.
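
The proportional reduction mentioned above could look roughly like the following sketch. It is not implemented in this notebook: the helper name `scale_down_repos` and the toy data are hypothetical, and the cap of 470 is simply the figure quoted in the text.

```python
import pandas as pd


def scale_down_repos(df, cap, repo_col="repo_author_single", count_col="commit_count"):
    """Scale the weekly counts of each over-cap repository by one common factor."""
    # Total commits of the repository, repeated on every one of its rows
    totals = df.groupby(repo_col)[count_col].transform("sum")
    # Factor is 1 for repositories at or under the cap, cap / total otherwise
    factor = (cap / totals).clip(upper=1.0)
    out = df.copy()
    out[count_col] = out[count_col] * factor
    return out


# Toy data: repository "a" totals 1000 commits, repository "b" totals 30
weekly = pd.DataFrame({
    "repo_author_single": ["a", "a", "b", "b"],
    "commit_count": [600, 400, 10, 20],
})

scaled = scale_down_repos(weekly, cap=470)
print(scaled.groupby("repo_author_single")["commit_count"].sum())
```

Scaling every week by the same factor keeps the shape of each repository's weekly series while bringing its total down to the cap. The scaled counts become floats, so they would need rounding if integer counts are required downstream.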

Note: We should also shorten the lifespan of the repositories so that we compare them over the same period of time, for example by taking only the first year of existence of each repository. Nevertheless, we only have 

In [None]:
df_bounded.columns


In [None]:
df_bounded = df_bounded.rename(columns= {"repo_author_single":"repo_name"})
len(df_bounded.repo_name.unique())

In [None]:
print("Max commits per week: ", df_bounded.commit_count.max())
print("Max date: ", df_bounded.date.max())
print("Min date: ", df_bounded.date.min())
print("Shape: ", df_bounded.shape)

output_path= os.path.join(current_dir, "../data/preprocess/commit_history_subset.csv")
df_bounded.to_csv(output_path, index=False)

In [None]:
testing_sample = df_bounded.repo_name.unique()[:150]
df_testing = df_bounded[df_bounded.repo_name.isin(testing_sample)]

print("Max commits per week: ", df_testing.commit_count.max())
print("Max date: ", df_testing.date.max())
print("Min date: ", df_testing.date.min())
print("Shape: ", df_testing.shape)

testing_path= os.path.join(current_dir, "../data/preprocess/commit_history_subset_test.csv")
df_testing.to_csv(testing_path)




### Data imputation
For the last step of the data preprocessing, I'm going to impute the commit count for all the non-reported weeks. We can be sure that there is no omitted data in either the subsample or the population data, since we collected it directly from the BigQuery open-source data. However, we need to complete the series for the feature engineering process. The script complete_series.py deals with this problem. From this point forward I'll be working with the sample data alone, to reduce the computation and preprocessing time of all the functions involved.
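
To make the idea of completing the series concrete, here is a minimal sketch of one way to fill non-reported weeks with zero commits. It is not the content of complete_series.py, which is not shown here. The helper name `fill_missing_weeks` is hypothetical, and the sketch assumes one row per repository and week, with dates that fall on Sundays.

```python
import pandas as pd


def fill_missing_weeks(df, repo_col="repo_name", date_col="date", count_col="commit_count"):
    """Give every repository one row per week, with 0 commits for missing weeks."""
    # All Sundays between the first and last observed date
    weeks = pd.date_range(df[date_col].min(), df[date_col].max(), freq="W-SUN")
    # Every (repository, week) combination
    full_index = pd.MultiIndex.from_product(
        [df[repo_col].unique(), weeks], names=[repo_col, date_col]
    )
    return (
        df.set_index([repo_col, date_col])[count_col]
        .reindex(full_index, fill_value=0)
        .reset_index()
    )


# Toy data: repository "a" has no row for 2021-01-10, "b" only has that week
toy = pd.DataFrame({
    "repo_name": ["a", "a", "b"],
    "date": pd.to_datetime(["2021-01-03", "2021-01-17", "2021-01-10"]),
    "commit_count": [5, 2, 7],
})

completed = fill_missing_weeks(toy)
print(completed)
```

This version pads every repository over the global date range, so a repository also gets zero-commit weeks before its first recorded commit. Restricting the range per repository would be a reasonable refinement.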
