# Demonstration of PyCIRAS

## Usage

### Prepare base imports
Import the setup script, which builds the environment, and then import `pyciras`.

In [1]:
import setup_notebook_environment
import pyciras

### Mining - Code Quality, Git metrics, Unit-Testing, Stargazers data

You can list the Git repositories to analyze directly in the notebook, or in the default **/repos.txt** file.

For long-running analyses, we support **ntfy** push notifications, so you can leave the process running and get notified when it completes. This requires modifications in `.env` and `utility/config.py`.
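For orientation, publishing to ntfy amounts to an HTTP POST of the message body to a topic URL. The sketch below only illustrates that request shape using the standard library. It is not taken from PyCIRAS, and how PyCIRAS reads `.env` or sends its notifications is not shown here. The function name `build_ntfy_request` and the topic name are hypothetical.

```python
import urllib.request


def build_ntfy_request(topic, message, server="https://ntfy.sh"):
    """Build (but do not send) a POST request publishing `message` to an ntfy topic."""
    url = f"{server.rstrip('/')}/{topic}"
    return urllib.request.Request(
        url,
        data=message.encode("utf-8"),   # the request body is the notification text
        headers={"Title": "PyCIRAS"},   # optional notification title
        method="POST",
    )


# Hypothetical topic name; pick your own, hard-to-guess topic.
request = build_ntfy_request("my-hypothetical-topic", "Mining completed")
print(request.get_method(), request.full_url)

# To actually send the notification (requires network access):
# urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example runnable offline.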

In [3]:
repos = ['https://github.com/SamuelThand/TDD-Hangman',
         'https://github.com/coinse/sadl',
         'https://github.com/zhangj111/astnn']
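If you prefer to keep the list in a file, the following minimal sketch shows one way such a file could be read. It assumes one URL per line and treats blank lines and lines starting with `#` as skippable; both are assumptions for illustration, not the documented `repos.txt` format. The helper `load_repo_urls` is hypothetical and is not part of PyCIRAS.

```python
import tempfile
from pathlib import Path


def load_repo_urls(path):
    """Return the non-empty, non-comment lines of a text file as a list of URLs."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


# Demonstration with a temporary file standing in for repos.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    repos_file = Path(tmp_dir) / "repos.txt"
    repos_file.write_text(
        "https://github.com/coinse/sadl\n"
        "\n"
        "# comments and blank lines are skipped (assumed convention)\n"
        "https://github.com/zhangj111/astnn\n",
        encoding="utf-8",
    )
    demo_urls = load_repo_urls(repos_file)

print(demo_urls)
```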

#### Clone Repositories
You can clone the repositories in advance to ensure that all repositories can be accessed before executing a long mining process. However, this is not needed as `pyciras.run_mining()` will clone the currently mined repository if not found.

The function `run_repo_cloner()` clones all repositories to the folder specified in `utility/config.py`. The parameters are listed below:
- **repo_urls**: List of repository URLs to clone. If None, the list is loaded from `repos.txt`.
- **chunk_size**: Number of repositories to clone in each operation chunk. Defaults to 1.
- **multiprocessing**: Flag to enable or disable multiprocessing during cloning. Defaults to False.

In [None]:
pyciras.run_repo_cloner(repo_urls=repos,  # replace with 'None' to use repos.txt
                        chunk_size=100,
                        multiprocessing=True)
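To make the effect of `chunk_size` concrete, here is a minimal sketch of splitting a list into chunks. It only illustrates the general idea; PyCIRAS's internal chunking may differ, and the helper `chunked` is hypothetical.

```python
def chunked(items, chunk_size=1):
    """Yield consecutive slices of `items`, each with at most `chunk_size` elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


example_repos = ['https://github.com/SamuelThand/TDD-Hangman',
                 'https://github.com/coinse/sadl',
                 'https://github.com/zhangj111/astnn']

for chunk in chunked(example_repos, chunk_size=2):
    print(len(chunk), chunk)
```

With three repositories and `chunk_size=100`, as in the cell above, everything falls into a single chunk.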

#### Mine the Repositories
Start the mining process to gather data. You can change the contents of `utility/config.py` to save the data in different formats. If you mine a large amount of data, do not use JSON, as it will create enormous files that may fill up your RAM.

The function `run_mining()` will mine the aspects specified in the parameters listed below:
- **repo_urls**: A list of repository URLs to be mined. If None, URLs will be loaded from `repos.txt`.
- **chunk_size**: The number of repositories to process in each chunk. Defaults to 1.
- **multiprocessing**: Enables or disables multiprocessing for the mining operations. Defaults to False.
- **persist_repos**: If True, cloned repositories will be persisted in a local directory. Defaults to True.
- **stargazers**: If True, information about stargazers will be collected for each repository. Defaults to True.
- **metadata**: Enable or disable the metadata analysis. Defaults to True.
- **lint**: Enables or disables linting analysis. Defaults to True.
- **test**: Enables or disables testing analysis. Defaults to True.
- **git**: Enables or disables Git history analysis. Defaults to True.

In [None]:
pyciras.run_mining(repo_urls=repos,  # replace with 'None' to use repos.txt
                   chunk_size=1,
                   multiprocessing=False,
                   persist_repos=False,
                   stargazers=True,
                   metadata=True,
                   test=True,
                   git=True,
                   lint=True)
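One reason a row-oriented format such as CSV tends to be easier on memory is that it can be consumed one row at a time, whereas a single large JSON document is typically parsed as a whole. The sketch below illustrates this with the standard library's `csv` module on a small in-memory sample. The column name `score` and the helper `streaming_mean` are hypothetical and do not reflect the actual PyCIRAS output schema.

```python
import csv
import io


def streaming_mean(file_obj, column):
    """Compute the mean of a numeric column while holding only one row in memory."""
    total = 0.0
    count = 0
    for row in csv.DictReader(file_obj):
        cell = row.get(column)
        if not cell:          # skip missing or empty cells
            continue
        total += float(cell)
        count += 1
    return total / count if count else float("nan")


# Small in-memory sample standing in for a large CSV file on disk
sample = io.StringIO("repo,score\na,1.0\nb,\nc,3.0\n")
mean_score = streaming_mean(sample, "score")
print(mean_score)
```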

## Statistical Analysis
Let's say we are interested in visualizing DMM and running a correlation analysis against the number of stars.

### Load and display the git CSV
Load the CSV into a DataFrame and inspect its contents.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats  # used by the correlation test
from IPython.display import display

In [8]:
git_df = pd.read_csv('../out/data/<CHANGE ME>/git.csv')
display(git_df.head())
display(git_df.describe())

In [None]:
metadata_df = pd.read_csv('../out/data/<CHANGE ME>/metadata.csv')
display(metadata_df.head())
display(metadata_df.describe())

### Display the data in a boxplot

In [10]:
# Step 1: Extract data
average_dmm_method_lines_of_code = git_df['average_dmm_method_lines_of_code']
average_dmm_method_cyclomatic_complexity = git_df['average_dmm_method_cyclomatic_complexity']
average_dmm_method_number_of_parameters = git_df['average_dmm_method_number_of_parameters']

# Step 2: Combine into a single DataFrame
aggregated_data = pd.DataFrame({
    'Unit Size': average_dmm_method_lines_of_code,
    'Unit Complexity': average_dmm_method_cyclomatic_complexity,
    'Unit Interfacing': average_dmm_method_number_of_parameters
})

# Step 3: Create a boxplot using seaborn
plt.figure(figsize=(5, 5))
sns.boxplot(data=aggregated_data)

plt.title('DMM Scores In Artifacts')
plt.ylabel('Score')

plt.tight_layout()
plt.show()

Unnamed: 0,date,TDD-Hangman,astnn,sadl
0,2019-01-03,0,1,1
1,2019-01-16,0,1,2
2,2019-04-12,0,3,3
3,2019-05-07,0,3,4
4,2019-05-31,0,7,5
...,...,...,...,...
128,2021-05-07,0,95,27
129,2021-05-14,0,96,27
130,2021-05-16,0,97,27
131,2021-05-25,0,99,28


### Perform correlation test

In [12]:
var1 = 'average_dmm_method_cyclomatic_complexity'
var2 = 'stargazerCount'

# Load and merge datasets; drop rows missing either variable so both series stay the same length
dataframe = pd.merge(git_df, metadata_df, on='repo')[[var1, var2]].dropna()
display(dataframe.head())
display(dataframe.describe())

# Shapiro-Wilk Test - Normal distribution of data
shapiro_test_var1 = stats.shapiro(dataframe[var1].dropna())
shapiro_test_var2 = stats.shapiro(dataframe[var2].dropna())
display(f"Shapiro-Wilk Test for {var1}: Statistic={shapiro_test_var1.statistic}, P-value={shapiro_test_var1.pvalue}")
display(f"Shapiro-Wilk Test for {var2}: Statistic={shapiro_test_var2.statistic}, P-value={shapiro_test_var2.pvalue}")

# Decide on the correlation method based on normality test
if shapiro_test_var1.pvalue > 0.05 and shapiro_test_var2.pvalue > 0.05:
    # If data is normally distributed, use Pearson's correlation
    pearson_corr = dataframe[[var1, var2]].corr(method='pearson')
    correlation = pearson_corr.iloc[0, 1]
    p_value = stats.pearsonr(dataframe[var1].dropna(), dataframe[var2].dropna())[1]
    method_used = 'Pearson'
    print("Pearson's correlation matrix:\n", pearson_corr)
    print("Pearson's p-value:", p_value)
else:
    # If data is not normally distributed, use Spearman's correlation
    spearman_corr, p_value = stats.spearmanr(dataframe[var1].dropna(), dataframe[var2].dropna())
    correlation = spearman_corr
    method_used = 'Spearman'
    print("Spearman's correlation coefficient:", spearman_corr)
    print("Spearman's p-value:", p_value)

# Interpretation and reporting of results
alpha = 0.05
correlation_strength = 'no'
if abs(correlation) > 0.7:
    correlation_strength = 'strong'
elif abs(correlation) > 0.3:
    correlation_strength = 'moderate'
elif abs(correlation) > 0.1:
    correlation_strength = 'weak'

correlation_type = 'positive' if correlation > 0 else 'negative'
significance = 'statistically significant' if p_value < alpha else 'not statistically significant'

report = (f"There is a {correlation_strength} {correlation_type} correlation between {var1} and {var2} "
          f"using {method_used} method. The correlation coefficient is {correlation:.3f}. "
          f"The p-value = {p_value:.3e} which is {'smaller' if p_value < alpha else 'greater'} than the "
          f"significance level (alpha) of {alpha}. The relationship is {significance}.")

display(report)
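The interpretation step in the cell above can also be factored into a small function, which makes the thresholds easy to check in isolation. This is a sketch that mirrors the cut-offs used in the cell (0.7, 0.3, 0.1 and alpha = 0.05). The function name `describe_correlation` is hypothetical, and the thresholds are conventions chosen in this notebook rather than universal rules.

```python
def describe_correlation(coefficient, p_value, alpha=0.05):
    """Classify a correlation coefficient by strength, direction, and significance."""
    magnitude = abs(coefficient)
    if magnitude > 0.7:
        strength = 'strong'
    elif magnitude > 0.3:
        strength = 'moderate'
    elif magnitude > 0.1:
        strength = 'weak'
    else:
        strength = 'no'

    direction = 'positive' if coefficient > 0 else 'negative'
    significant = p_value < alpha
    return strength, direction, significant


# Illustrative inputs only; these are not results from the mined data
print(describe_correlation(0.75, 0.01))
print(describe_correlation(-0.2, 0.5))
```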

Unnamed: 0,date,stats.global_note
0,2018-08-23 18:45:14+08:00,5.74856
1,2018-08-23 19:42:01+08:00,5.74856
2,2018-08-23 19:43:25+08:00,5.74856
3,2018-08-23 19:54:30+08:00,5.74856
4,2018-08-23 20:04:16+08:00,5.74856
5,2019-01-03 20:26:24+08:00,5.74856
6,2019-01-03 20:27:01+08:00,5.74856
7,2019-01-28 15:17:30+08:00,5.74856
8,2019-01-28 15:24:33+08:00,5.74856
9,2019-01-28 15:26:05+08:00,5.74856
