# Demonstration of PyCIRAS

## Usage

The user needs to import the setup script to build the environment.

In [None]:
import setup_notebook_environment

After this, the user can access the functionality through a variety of entry points of the form **pyciras.run_**

In [None]:
import pyciras

### Full Analysis - Code Quality, Git metrics, Unit-Testing, Stargazers data

The user can specify a list of git repos they want to analyze directly in the notebook, or in the default **/repos.txt** file.

In [None]:
repos = ['https://github.com/SamuelThand/TDD-Hangman',
         'https://github.com/coinse/sadl',
         'https://github.com/zhangj111/astnn']

There are some options for the analysis:

- Repo URLs
- Chunk Size - How many repos the program should analyze before writing results to disk
- Parallelism - Whether the program should use subprocesses equal to the chunk size, to speed up computation
- Remove Repos - The program can remove the downloaded repositories after the analysis, to save storage space
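
To make the chunk size option concrete, here is a minimal sketch of how a list of repo URLs could be split into chunks. The helper `chunked` is hypothetical and is not part of PyCIRAS; it only illustrates the idea that results are written to disk after each group of `chunk_size` repos.

```python
from typing import Iterator, List


def chunked(urls: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size URLs (illustrative helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(urls), chunk_size):
        yield urls[start:start + chunk_size]


repos = ['https://github.com/SamuelThand/TDD-Hangman',
         'https://github.com/coinse/sadl',
         'https://github.com/zhangj111/astnn']

# With chunk_size=2, three repos are processed as one chunk of two and one chunk of one
chunks = list(chunked(repos, chunk_size=2))
for number, chunk in enumerate(chunks, start=1):
    print(f"chunk {number}: {len(chunk)} repo(s)")
```

A smaller chunk size means results are saved more often, at the cost of more frequent disk writes.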

For long-running analysis, we support **ntfy** push notifications, so you can leave it running and get notified when completed.

In [None]:
pyciras.run_analysis(repo_urls=repos, chunk_size=2, parallelism=False, remove_repos_after_completion=False)

### Single Analysis - More targeted and efficient mining for specific data

In [None]:
# pyciras.run_analysis(repo_urls=repos, chunk_size = 2, parallelism = False, remove_repos_after_completion=False, 
#                     analyse_stargazers=False,
#                     analyze_code_quality=False,
#                     analyze_repositories=True,
#                     analyze_unit_testing=True)

### Gaining insights from the data

We have included tools for data analysis and visualisation that can be used to examine the data acquired through our tool.

The data produced during the experiment can always be found in the **pyciras.data_directory**, which is the timestamped data directory for the analysis.


#### Plotting

In [None]:
from utility import config
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv(
    config.DATA_FOLDER / pyciras.data_directory / "stargazers-over-time.csv")

In [None]:
# Parse 'date' as datetime and use it as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Resample to get the last entry of each month
df_monthly = df.resample('M').last()

# Drop months where every column is NaN (months without any data)
df_monthly.dropna(how='all', inplace=True)

# Reset index to get 'date' back as a column for plotting
df_line = df_monthly.reset_index()

# plotting
plt.figure(figsize=(20, 15))
for column in df_line.columns[1:]:  # Adjust column indexing if necessary
    plt.plot(df_line['date'], df_line[column], marker='o', label=column)

plt.title('Project Stars Over Time')
plt.xlabel('Date')
plt.ylabel('Stars')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)

# Improve the date format on the X-axis
import matplotlib.dates as mdates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())

plt.tight_layout()
plt.show()
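
The monthly resampling step in the cell above can be seen in isolation on a small toy frame. This is a sketch with made-up values and a hypothetical column name `example_repo`; it passes a `pd.offsets.MonthEnd()` offset object instead of the `'M'` string alias, on the assumption that the two are equivalent for this purpose and that the object form avoids alias deprecation warnings in newer pandas versions.

```python
import pandas as pd

# Toy star counts (made-up data): two samples in January, none in February, two in March
toy = pd.DataFrame(
    {
        "date": ["2024-01-05", "2024-01-20", "2024-03-02", "2024-03-28"],
        "example_repo": [10, 12, 15, 19],
    }
)
toy["date"] = pd.to_datetime(toy["date"])
toy = toy.set_index("date")

# Keep the last observation of each calendar month; empty months become NaN rows
monthly = toy.resample(pd.offsets.MonthEnd()).last()

# Drop the months that contain no data at all
monthly_clean = monthly.dropna(how="all")

print(monthly)
print(monthly_clean)
```

February appears in `monthly` as an all-NaN row, which is why the `dropna(how='all')` step is needed before plotting.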

## Statistical Analysis

Let's say we are interested in doing a correlation analysis between the Pylint score of a project and the number of stars it has.
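
As a reminder of what the correlation coefficient measures, the sketch below computes the Pearson correlation on made-up numbers, once with `Series.corr` (whose default method is Pearson) and once from its definition as covariance divided by the product of the standard deviations. The column names `pylint_score` and `stars` are hypothetical and do not come from the PyCIRAS output.

```python
import pandas as pd

# Made-up data: scores rise linearly while stars grow faster than linearly
toy = pd.DataFrame(
    {
        "pylint_score": [5.0, 6.0, 7.0, 8.0],
        "stars": [10, 20, 40, 80],
    }
)

# Built-in Pearson correlation
r_builtin = toy["pylint_score"].corr(toy["stars"])

# Same quantity from the definition: cov(x, y) / (std(x) * std(y))
covariance = toy["pylint_score"].cov(toy["stars"])
r_manual = covariance / (toy["pylint_score"].std() * toy["stars"].std())

print("Pearson r (built-in):", r_builtin)
print("Pearson r (manual):  ", r_manual)
```

The value is close to, but below, 1 because the relationship is increasing without being perfectly linear.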

### Extracting the relevant data

In [None]:
astnn_pylint = pd.read_csv(config.DATA_FOLDER / pyciras.data_directory / "pylint-astnn.csv")
stargazers_over_time = pd.read_csv(config.DATA_FOLDER / pyciras.data_directory / "stargazers-over-time.csv")

In [None]:
astnn_pylint

In [None]:
stargazers_over_time

In [None]:
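# Take explicit copies so later column assignments do not trigger SettingWithCopyWarning
astnn_global_note_over_time = astnn_pylint[['date', 'stats.global_note']].copy()
astnn_stargazers_over_time = stargazers_over_time[['date', 'astnn']].copy()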

In [None]:
astnn_global_note_over_time

In [None]:
astnn_stargazers_over_time

### Joining the dataframes on date using an As Of Join

This type of join is useful when you want to merge observations as of certain times without having exact matches in time. For instance, if you have a value on 2024-02-13 in one set and the closest date in the other set is 2024-02-12, it will merge these two records.
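
The example in the paragraph above can be reproduced on a toy pair of frames. This is a minimal sketch with made-up values and hypothetical column names (`score`, `stars`); it also shows how the `tolerance` argument of `pd.merge_asof` limits how far apart two dates may be and still match.

```python
import pandas as pd

# Made-up observations; both frames must be sorted by the join key
scores = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-02-13", "2024-03-10"]),
        "score": [7.5, 8.1],
    }
)
stars = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-02-12", "2024-03-20", "2024-06-01"]),
        "stars": [100, 120, 150],
    }
)

# Match each score to the star count with the nearest date, at most 30 days away
merged = pd.merge_asof(
    scores, stars, on="date", direction="nearest", tolerance=pd.Timedelta("30 days")
)

# With a tighter tolerance, rows without a close enough match get NaN
strict = pd.merge_asof(
    scores, stars, on="date", direction="nearest", tolerance=pd.Timedelta("5 days")
)

print(merged)
print(strict)
```

The 2024-02-13 score pairs with the 2024-02-12 star count in both cases, while the 2024-03-10 score only finds a partner when the tolerance is at least 10 days.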

In [None]:
# Parse the 'date' column as UTC and keep only the calendar date
astnn_global_note_over_time['date'] = pd.to_datetime(astnn_pylint['date'], utc=True).dt.date

# For the 'date' column in 'stargazers_over_time', parse it as UTC,
# drop the timezone to get a naive datetime, then keep only the calendar date
astnn_stargazers_over_time['date'] = pd.to_datetime(stargazers_over_time['date'], utc=True).dt.tz_localize(None).dt.date

In [None]:
astnn_global_note_over_time

In [None]:
astnn_stargazers_over_time

#### As Of Join

In [None]:
# Convert to datetime objects
astnn_global_note_over_time['date'] = pd.to_datetime(astnn_global_note_over_time['date'])
astnn_stargazers_over_time['date'] = pd.to_datetime(astnn_stargazers_over_time['date'])

# Sort the DataFrames by 'date' column
astnn_global_note_over_time_sorted = astnn_global_note_over_time.sort_values('date', ascending=True)
astnn_stargazers_over_time_sorted = astnn_stargazers_over_time.sort_values('date', ascending=True)

# Perform the 'as of' merge
global_note_stargazers_asof = pd.merge_asof(
    left=astnn_global_note_over_time_sorted,
    right=astnn_stargazers_over_time_sorted,
    left_on='date',
    right_on='date',
    direction='nearest',
    tolerance=pd.Timedelta('1000 days')  # TODO: adjust to your data; rows with no match within the tolerance get NaN
)

# Now global_note_stargazers_asof will contain the merged data


In [None]:
# Set the display option to show all rows
pd.set_option('display.max_rows', None)

global_note_stargazers_asof

### Conclusion of the demo experiment

In [None]:
# Now you can perform correlation analysis and descriptive statistics on 'stats.global_note' and 'astnn' in the global_note_stargazers_asof dataframe

import seaborn as sns

# Create a scatterplot with a regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='stats.global_note', y='astnn', data=global_note_stargazers_asof, fit_reg=True)

plt.title('Scatterplot with Line of Best Fit')
plt.xlabel('ASTNN Global Note from Pylint')
plt.ylabel('ASTNN Stargazers')
plt.show()

# Pearson correlation (the pandas default) between the Pylint global note and the star count
correlation = global_note_stargazers_asof['stats.global_note'].corr(global_note_stargazers_asof['astnn'])

# Print the correlation
print('Correlation:', correlation)
print()

# Descriptive statistics
print('Descriptive statistics for astnn global note:')
print(global_note_stargazers_asof['stats.global_note'].describe())

print('\nDescriptive statistics for astnn stargazers over time:')
print(global_note_stargazers_asof['astnn'].describe())