<a href="https://colab.research.google.com/github/KetiLaz/Core_CSR_Genes/blob/main/Bootstrap_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bootstrap

This script randomly selects 51 genes from a CSV file, excluding the genes extracted from Darling. These 51 genes serve as the control genes.
For each study, it then counts how many control genes have a p-value < 0.05 and computes the corresponding percentage.
Finally, a paired t-test compares the per-study percentages of the control genes with those of the core genes to check whether the difference between them is statistically significant.
The whole procedure is repeated 100 times.
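
The selection step described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own code: the helper name `sample_control_genes`, the made-up gene IDs, and the de-duplication of the candidate pool are all assumptions added for the example.

```python
import random

def sample_control_genes(candidate_ids, excluded_ids, k=51, seed=None):
    """Hypothetical helper: draw up to k gene IDs at random, skipping excluded IDs."""
    rng = random.Random(seed)          # local generator, so the global seed is untouched
    excluded = set(excluded_ids)       # set membership tests are O(1)
    # Assumption: drop duplicate IDs (keeping first-seen order) before sampling
    pool = [g for g in dict.fromkeys(candidate_ids) if g not in excluded]
    return rng.sample(pool, min(k, len(pool)))

# Made-up Ensembl-style IDs, for illustration only
candidates = [f"ENSG{i:011d}" for i in range(200)]
excluded = candidates[:20]
controls = sample_control_genes(candidates, excluded, k=51, seed=42)
print(len(controls))
```

Sampling without replacement from the filtered pool guarantees that no excluded gene and no repeated gene ends up in the control set.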

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Import libraries
import pandas as pd
import os
import random
import numpy as np

In [None]:
# Set a seed for reproducibility
np.random.seed(42)
random.seed(42)

In [None]:
# Process one edgeR CSV file: count how many of the chosen control genes are present in the study,
# how many of those are statistically significant (PValue < 0.05), and the resulting percentage
def process_conrtol_genes(csv_file, control_genes):
    # Read CSV file
    df = pd.read_csv(csv_file, sep=";")
    # Check how many of the control genes are present in the file
    present_genes = df[df['Ensembl ID'].isin(control_genes)]
    num_genes_present = len(present_genes)
    # Check how many genes have PValue < 0.05
    num_genes_pvalue_lt_05 = len(present_genes[present_genes['PValue'] < 0.05])
    # Calculate the percentage relative to all selected control genes (not only those present in this study)
    percentage = (num_genes_pvalue_lt_05 / len(control_genes)) * 100
    # Return results
    return num_genes_present, num_genes_pvalue_lt_05, percentage
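
To make the counting logic concrete, here is a small sketch that applies the same steps to an in-memory DataFrame instead of a file. The function name `summarize_control_genes` and the toy values are hypothetical. The extra `pct_of_present` value is not computed by the notebook; it is shown only to illustrate how the choice of denominator changes the result.

```python
import pandas as pd

def summarize_control_genes(df, control_genes):
    """Hypothetical in-memory variant of the per-study summary."""
    present = df[df["Ensembl ID"].isin(control_genes)]
    n_present = len(present)
    n_sig = int((present["PValue"] < 0.05).sum())
    # Denominator used in the notebook: all selected control genes
    pct_of_selected = n_sig / len(control_genes) * 100
    # Alternative denominator (assumption, for comparison): genes present in the study
    pct_of_present = n_sig / n_present * 100 if n_present else float("nan")
    return n_present, n_sig, pct_of_selected, pct_of_present

# Toy study with four genes; "G9" is selected as a control but absent from the study
toy = pd.DataFrame({"Ensembl ID": ["G1", "G2", "G3", "G4"],
                    "PValue": [0.01, 0.20, 0.03, 0.60]})
result = summarize_control_genes(toy, ["G1", "G2", "G3", "G9"])
print(result)
```

Here three of the four control genes are present and two are significant, so the percentage is 50% of the selected genes but about 66.7% of the genes actually measured in the study.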

In [None]:
# Directory containing CSV files to iterate through
data_dir = "/content/drive/MyDrive/Διπλωματική/Final_files/edgeR_csv"

# File listing the genes to be excluded from the random selection
exclusion_genes_file = "/content/drive/MyDrive/Διπλωματική/Mesh Search/all_response_genes.csv"

# Get a list of all the edgeR CSV files in the directory
csv_files = [file for file in os.listdir(data_dir) if file.endswith(".csv")]

# The directory where the results will be stored as CSV files
output_dir = "/content/drive/MyDrive/Διπλωματική/Final_files/Bootstrap"

In [None]:
# Read the CSV file that contains the genes to be excluded from the control gene selection
exclusion_genes_df = pd.read_csv(exclusion_genes_file, sep = ";")

# Get a list of genes to exclude
exclude_genes = exclusion_genes_df['Ensembl ID'].tolist()


In [None]:
# Perform 100 iterations of the bootstrap analysis
for i in range(100):
    # Randomly select 51 genes from a random CSV file, excluding the genes extracted from Darling
    random_edgeR_file = random.choice(csv_files)
    random_edgeR_file_path = os.path.join(data_dir, random_edgeR_file)
    df = pd.read_csv(random_edgeR_file_path, sep=";")
    available_genes = df[~df['Ensembl ID'].isin(exclude_genes)]['Ensembl ID'].tolist()
    control_genes = random.sample(available_genes, min(51, len(available_genes)))

    # Initialize a list to store results for the current iteration
    results = []

    # Iterate through all CSV files (including the one the control genes were drawn from)
    for csv_file in csv_files:
        study_name = os.path.splitext(csv_file)[0]
        csv_file_path = os.path.join(data_dir, csv_file)
        # Process CSV file
        num_genes_present, num_genes_pvalue_lt_05, percentage = process_conrtol_genes(csv_file_path, control_genes)
        # Append results
        results.append({"Study Name": study_name,
                        "Genes Present": num_genes_present,
                        "PValue < 0.05": num_genes_pvalue_lt_05,
                        "Percentage": percentage})

    # Create DataFrame from results
    control_genes_df = pd.DataFrame(results)

    # Generate output file path for the current iteration
    output_file_path = os.path.join(output_dir, f"iteration_{i+1}.csv")

    # Write results to a CSV file
    control_genes_df.to_csv(output_file_path, index=False)

# Paired t-test

Using the output of the previous script, we now run a paired t-test between each iteration of control genes and the core stress genes.
In other words, we test whether the percentages of genes with PValue < 0.05 differ significantly between control and core genes.
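
As a small illustration of what the paired t-test computes, the sketch below runs `ttest_rel` on made-up per-study percentages and checks the statistic against the textbook formula, the mean of the paired differences divided by its standard error. The numbers are invented for the example and are not results from this analysis.

```python
import numpy as np
from scipy.stats import ttest_rel

# Made-up percentages for five studies (same study order in both arrays)
control_pct = np.array([5.9, 3.9, 7.8, 2.0, 9.8])
core_pct = np.array([31.4, 25.5, 43.1, 19.6, 37.3])

# Paired t-test as provided by SciPy
t_stat, p_val = ttest_rel(control_pct, core_pct)

# Same statistic by hand: mean difference / (sample SD of differences / sqrt(n))
diff = control_pct - core_pct
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print(t_stat, t_manual, p_val)
```

A negative t-statistic here means the control percentages are lower than the core percentages, because the difference is taken as control minus core.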

In [None]:
from scipy.stats import ttest_rel

In [None]:
# Directory containing the control gene files (output files from previous iterations)
control_files_dir = "/content/drive/MyDrive/Διπλωματική/Final_files/Bootstrap"

# File path of the core stress genes file
core_genes_file_path = "/content/drive/MyDrive/Διπλωματική/Final_files/core_summary.csv"

In [None]:
# Read the core stress genes file into a DataFrame
core_genes_df = pd.read_csv(core_genes_file_path, sep=";")

# Initialize a dictionary to store the results of t-tests
t_test_results = {}

# Initialize a counter for statistically significant differences
significant_diff_count = 0

Index(['Study Name', 'Genes Present', 'PValue < 0.05', 'Percentage'], dtype='object')

In [None]:
# Iterate through each control gene file
for control_file in os.listdir(control_files_dir):
    if control_file.endswith(".csv"):
        # Read the control gene file into a DataFrame
        control_file_path = os.path.join(control_files_dir, control_file)
        control_df = pd.read_csv(control_file_path)

        # Use the file name without its extension (e.g. iteration_1) as the key for the t-test results dictionary
        study_name = os.path.splitext(control_file)[0]

        # Perform a paired t-test with the core stress genes file based on the "Percentage" column
        t_statistic, p_value = ttest_rel(control_df['Percentage'], core_genes_df['Percentage'])

        # Store the t-test results in the dictionary
        t_test_results[study_name] = {'t_statistic': t_statistic, 'p_value': p_value}

        # Check if the p-value is less than 0.05
        if p_value < 0.05:
            significant_diff_count += 1
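
Note that `ttest_rel` pairs values by position, so the test above is only meaningful if both DataFrames list the studies in the same order. One way to guard against a mismatch, sketched here on toy data, is to merge on the study name first. This assumes that both files share a `Study Name` column with one row per study; the toy DataFrames and their values are invented.

```python
import pandas as pd
from scipy.stats import ttest_rel

# Toy data: same studies, deliberately listed in a different order
control_df = pd.DataFrame({"Study Name": ["A", "B", "C", "D"],
                           "Percentage": [4.0, 6.0, 2.0, 8.0]})
core_df = pd.DataFrame({"Study Name": ["D", "C", "B", "A"],
                        "Percentage": [40.0, 20.0, 35.0, 30.0]})

# Align rows by study; validate raises if a study name is duplicated on either side
paired = control_df.merge(core_df, on="Study Name",
                          suffixes=("_control", "_core"),
                          validate="one_to_one")

t_stat, p_val = ttest_rel(paired["Percentage_control"], paired["Percentage_core"])
print(paired)
print(t_stat, p_val)
```

With the default inner join, studies missing from either file are dropped, so comparing `len(paired)` with the original row counts is a quick check that nothing was lost.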

In [None]:
# Convert the dictionary to a DataFrame
t_test_results_df = pd.DataFrame.from_dict(t_test_results, orient='index')

# Save the t-test results to a CSV file
t_test_results_df.to_csv("/content/drive/MyDrive/Διπλωματική/Final_files/t_test_results.csv")

# Print the number of control files with statistically significant differences
# (count only the files that were actually tested, not every entry in the directory)
print(f"{significant_diff_count} out of {len(t_test_results)} had statistically significant differences.")

100 out of 100 had statistically significant differences.


# Parameters and Statistics for the Bootstrap Analysis
We will calculate summary statistics (mean, median and standard deviation) of the t-statistics and p-values obtained from the bootstrap analysis.

In [None]:
# File path of the CSV file containing T-statistics and P-values
csv_file_path = "/content/drive/MyDrive/Διπλωματική/Final_files/t_test_results.csv"

In [None]:
# Read the CSV file into a DataFrame
bootstrap_results = pd.read_csv(csv_file_path)

In [None]:
# Extract T-statistics and P-values from the DataFrame
t_statistics = bootstrap_results['t_statistic']
p_values = bootstrap_results['p_value']

In [None]:
# Calculate mean, median, and standard deviation of T-statistic
mean_t_statistic = np.mean(t_statistics)
median_t_statistic = np.median(t_statistics)
std_t_statistic = np.std(t_statistics)

In [None]:
# Calculate mean, median, and standard deviation of P-value
mean_p_value = np.mean(p_values)
median_p_value = np.median(p_values)
std_p_value = np.std(p_values)
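
The same summary can be written more compactly with pandas, as sketched below on a toy DataFrame with invented values. One detail to keep in mind when comparing numbers: NumPy's `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the t-test results table (values are made up)
toy_results = pd.DataFrame({"t_statistic": [-8.0, -9.0, -10.0],
                            "p_value": [1e-6, 1e-8, 1e-7]})

# Mean and median of both columns in one call
summary = toy_results[["t_statistic", "p_value"]].agg(["mean", "median"])

# Standard deviation under both conventions
pop_std = toy_results["t_statistic"].std(ddof=0)      # population SD
sample_std = toy_results["t_statistic"].std(ddof=1)   # sample SD (pandas default)
numpy_std = np.std(toy_results["t_statistic"].to_numpy())  # NumPy default, ddof=0

print(summary)
print(pop_std, sample_std, numpy_std)
```

With 100 iterations the two conventions differ only slightly, but stating which one is used makes the reported spread easier to reproduce.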

In [None]:
# Print the calculated statistics
print("Mean T-Statistic:", mean_t_statistic)
print("Median T-Statistic:", median_t_statistic)
print("Standard Deviation of T-Statistic:", std_t_statistic)
print("Mean P-Value:", mean_p_value)
print("Median P-Value:", median_p_value)
print("Standard Deviation of P-Value:", std_p_value)

Mean T-Statistic: -8.71501494373564
Median T-Statistic: -8.574179636226972
Standard Deviation of T-Statistic: 1.3823250761732524
Mean P-Value: 1.8566834004916785e-08
Median P-Value: 3.621009452065445e-13
Standard Deviation of P-Value: 1.5284812735177514e-07
