### SAD Score Distribution

Author: Sophie Sigfstead

Purpose of the notebook: I am currently working on a better thresholding mechanism for our SNP activity difference (SAD) scores. One proposed method is to use false discovery rate (FDR) control to include or exclude SNPs based on whether or not we reject the null hypothesis SAD_<track_i>_<snp_j> = 0.
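
To make the proposed procedure concrete, here is a minimal sketch of a two-sided z-test followed by Benjamini-Hochberg FDR control. The function names, the `scale` argument (an assumed null standard deviation for one track), the toy scores, and `alpha=0.05` are all illustrative assumptions, not choices made in this notebook. How to estimate the null scale is exactly what the distribution plots are meant to inform.

```python
import math

import numpy as np


def z_test_pvalues(scores, scale):
    """Two-sided p-values for H0: score = 0, assuming a normal null with std `scale`."""
    z = np.abs(np.asarray(scores, dtype=float)) / scale
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2))
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])


def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared against alpha * i / m
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: reject everything up to the largest rank that passes
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject


# Made-up scores for a single hypothetical track (not real SAD values)
scores = np.array([0.0001, -0.0040, 0.0002, 0.0035, -0.0001])
pvals = z_test_pvalues(scores, scale=0.001)  # scale=0.001 is an assumed null std
keep = benjamini_hochberg(pvals, alpha=0.05)
print(keep)
```

In practice `statsmodels.stats.multitest.multipletests(pvals, method="fdr_bh")` performs the same correction; the hand-written version is shown only to make the step-up rule explicit.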

To do these computations (e.g., a z-test), I'd first like to see how the SAD scores are distributed across tracks. The simple visualizations below show what this looks like.
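
Alongside the plots, a per-track table of summary statistics can flag tracks whose scores are clearly non-normal. The sketch below is one possible way to do this; `summarize_tracks` is a hypothetical helper, and the toy DataFrame is synthetic data standing in for the real SAD columns.

```python
import numpy as np
import pandas as pd


def summarize_tracks(sad_df):
    """Per-column mean, std, skewness, and excess kurtosis (0 for a normal)."""
    return pd.DataFrame(
        {
            "mean": sad_df.mean(),
            "std": sad_df.std(),
            "skew": sad_df.skew(),
            "excess_kurtosis": sad_df.kurt(),
        }
    )


# Synthetic stand-in data: one normal track and one heavy-tailed track
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "SAD0": rng.normal(0.0, 1e-4, size=10_000),
        "SAD1": rng.standard_t(df=3, size=10_000) * 1e-4,
    }
)
summary = summarize_tracks(toy)
print(summary)
```

A track with excess kurtosis far above 0 has heavier tails than a normal distribution, which would make a plain z-test overstate significance for that track.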

In [1]:
import os

import pandas as pd
from tqdm import tqdm

# Directory containing the CSV files
csv_directory = "../GWAS_Data/1000genomes_as_csv"

# Maximum number of rows to collect across all files
max_rows = 5_000_000

# Rows per chunk when reading; adjust to fit available memory
chunk_size = 10**6

sad_chunks = []  # SAD-only chunks, concatenated once after the loop
rows_collected = 0

# Loop through all CSV files in the directory, with a progress bar
for file_name in tqdm(os.listdir(csv_directory)):
    if not file_name.endswith(".csv"):  # Skip anything that is not a CSV
        continue
    file_path = os.path.join(csv_directory, file_name)

    # Read the CSV file in chunks to handle large file sizes
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        # Keep only the columns whose names start with "SAD"
        sad_columns = [col for col in chunk.columns if col.startswith("SAD")]
        # Trim the last chunk so the total never exceeds max_rows
        sad_chunk = chunk[sad_columns].iloc[: max_rows - rows_collected]
        sad_chunks.append(sad_chunk)
        rows_collected += len(sad_chunk)

        # Stop reading this file once the row limit is reached
        if rows_collected >= max_rows:
            break

    # Stop looping over files once the row limit is reached
    if rows_collected >= max_rows:
        break

# Concatenate once; calling pd.concat inside the loop re-copies all rows each time
combined_df = pd.concat(sad_chunks, ignore_index=True) if sad_chunks else pd.DataFrame()

# Save the combined DataFrame to a CSV file
combined_df.to_csv("combined_sad_values.csv", index=False)

# Display the first few rows of the combined DataFrame
print(combined_df.head())



       SAD0      SAD1      SAD2      SAD3      SAD4      SAD5      SAD6  \
0  0.000029  0.000004 -0.000087 -0.000120 -0.000009  0.000006  0.000010   
1  0.000142 -0.000034 -0.000021  0.000067  0.000462  0.000327  0.000635   
2 -0.000003  0.000141  0.000202  0.000088  0.000104  0.000171  0.000078   
3  0.000049  0.000041  0.000054  0.000051  0.000037  0.000045  0.000041   
4  0.000023  0.000017  0.000014  0.000039  0.000024  0.000031  0.000016   

       SAD7      SAD8      SAD9  ...    SAD674    SAD675    SAD676    SAD677  \
0 -0.000017 -0.000023 -0.000057  ... -0.002296  0.002531 -0.000041  0.002285   
1  0.000135  0.000390 -0.000043  ...  0.002403  0.002401  0.000389  0.001868   
2  0.000091  0.000102  0.000137  ...  0.000347 -0.000556  0.001503 -0.000402   
3  0.000027  0.000044  0.000039  ...  0.000195  0.000122  0.000713 -0.000057   
4  0.000019  0.000024  0.000011  ...  0.000821  0.001335  0.000609  0.001102   

     SAD678    SAD679    SAD680    SAD681    SAD682    SAD683  
0  0

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# `combined_df` (built in the previous cell) holds the columns SAD0, SAD1, ..., SAD683
columns_to_plot = [col for col in combined_df.columns if col.startswith("SAD")]

# KDEs over millions of rows for hundreds of tracks are slow, so plot a random
# subsample; the size of 100,000 rows is an arbitrary choice
plot_df = combined_df.sample(n=min(len(combined_df), 100_000), random_state=0)

# Set up the plot
plt.figure(figsize=(12, 8))

# Plot each track as a KDE (kernel density estimate) curve. The curves are left
# unlabeled because a legend with one entry per track would cover the figure.
for column in columns_to_plot:
    sns.kdeplot(plot_df[column], linewidth=1)

# Add title, axis labels, and grid
plt.title("KDE of SAD Scores per Track", fontsize=16)
plt.xlabel("SAD score", fontsize=14)
plt.ylabel("Density", fontsize=14)
plt.grid(True)

# Show the plot
plt.tight_layout()
plt.show()
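
The KDE overlay gives only a visual impression of the tails. As a numeric complement, one option is to compare the fraction of scores lying more than three standard deviations from a track's mean with the roughly 0.27% expected under a normal distribution. This is a sketch under assumptions: `tail_fraction` and `normal_tail_fraction` are hypothetical helpers, the 3-sigma cutoff is arbitrary, and the toy track is synthetic.

```python
import math

import numpy as np


def tail_fraction(values, n_sigma=3.0):
    """Fraction of values more than `n_sigma` sample standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return float(np.mean(np.abs(z) > n_sigma))


def normal_tail_fraction(n_sigma=3.0):
    """Two-sided tail mass of a standard normal beyond `n_sigma`."""
    return math.erfc(n_sigma / math.sqrt(2.0))


# Synthetic stand-in for one track (not real SAD values)
rng = np.random.default_rng(0)
toy_track = rng.normal(0.0, 1e-4, size=200_000)
print(tail_fraction(toy_track), normal_tail_fraction())
```

With the real data, `combined_df.apply(tail_fraction)` would return one tail fraction per SAD column; values well above the normal reference suggest heavy tails in that track.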