# GWAS 1: Window Size Analysis SD = 0.5
Author: Sophie Sigfstead 

Purpose: This notebook analyzes how locus (window) size affects the results of our method for GWAS 1 (Depression).
Specifically, my goal was to test various window sizes and see how each one affects the accuracy of our method (i.e., using the brain track set) compared with random track sets.

Window size here means the distance around a significant SNP within which other SNPs are eliminated in the GWAS 1 procedure. SNPs are selected in order of p-value, and any SNP whose base-pair position falls within the window (to the left or right) of a selected SNP is eliminated, on the basis that it is likely to be in LD with the selected SNP. This repeats until no SNPs that meet the p-value threshold remain. In Dr. Cai's original study, she used +/- 1Mb as the window size, and we have replicated this procedure. However, this creates very large windows and potentially doesn't differentiate our method as well (as examined in this notebook).
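The selection procedure described above can be sketched as a short greedy loop. This is a minimal illustration and not the pipeline's actual code: the column names (`snp`, `chr`, `bp`, `p`), the function name `select_lead_snps`, the example data, and the threshold value are all assumptions, and restricting elimination to the same chromosome is also assumed.

```python
import pandas as pd

def select_lead_snps(snps, window, p_threshold):
    """Greedy window-based selection (illustrative sketch, hypothetical column names)."""
    # Keep only SNPs that meet the p-value threshold, most significant first
    candidates = snps[snps["p"] <= p_threshold].sort_values("p")
    selected = []
    while not candidates.empty:
        lead = candidates.iloc[0]
        selected.append(lead["snp"])
        # Eliminate the lead SNP and every SNP within +/- window bp on the same chromosome
        near = (candidates["chr"] == lead["chr"]) & (
            (candidates["bp"] - lead["bp"]).abs() <= window
        )
        candidates = candidates[~near]
    return selected

# Made-up example data, for illustration only
example = pd.DataFrame({
    "snp": ["rs1", "rs2", "rs3", "rs4", "rs5"],
    "chr": [1, 1, 1, 2, 1],
    "bp": [1_000_000, 1_300_000, 1_600_000, 1_000_000, 5_000_000],
    "p": [1e-9, 1e-8, 5e-8, 2e-8, 0.5],
})

for window in [250_000, 500_000, 1_000_000]:
    print(window, select_lead_snps(example, window, p_threshold=1e-6))
```

In this toy example, smaller windows retain more lead SNPs, which matches the trend in the reference set sizes reported later in this notebook.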

In the gwas_1_single_track_analysis directory, I performed multiple analyses comparing the results of our method using brain track sets versus random sets, using a 1Mb window size. Surprisingly, while the brain tracks were a bit better, the improvement was smaller than we'd expect. On closer inspection, within the matching loci, the distance between our identified SNP and the study's identified SNP was at most 500kb, while the same was not necessarily true for the random tracks. As such, it may be that the 1Mb window is too large to differentiate our method, or it could be that tissue specificity is not as important as we assumed.
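One way to inspect the distance between an identified SNP and the study's SNP within a matching locus is sketched below. This is only an illustration under assumptions: the column names (`snp`, `chr`, `bp`), the function name `match_loci`, the example data, and the rule that a locus "matches" when the nearest identified SNP lies within the window are all hypothetical and may differ from how the combined results were actually computed.

```python
import pandas as pd

def match_loci(study, found, window):
    """For each study SNP, report the nearest found SNP on the same chromosome
    if it lies within `window` bp (illustrative sketch, hypothetical columns)."""
    rows = []
    for s in study.itertuples(index=False):
        same_chr = found[found["chr"] == s.chr]
        if same_chr.empty:
            continue
        dist = (same_chr["bp"] - s.bp).abs()
        nearest = dist.idxmin()
        if dist[nearest] <= window:
            rows.append({
                "study_snp": s.snp,
                "found_snp": same_chr.loc[nearest, "snp"],
                "distance_bp": int(dist[nearest]),
            })
    return pd.DataFrame(rows, columns=["study_snp", "found_snp", "distance_bp"])

# Made-up example data, for illustration only
study = pd.DataFrame({"snp": ["rsA", "rsB", "rsC"], "chr": [1, 1, 2],
                      "bp": [1_000_000, 3_000_000, 500_000]})
found = pd.DataFrame({"snp": ["rsA", "rsX", "rsY"], "chr": [1, 1, 3],
                      "bp": [1_000_000, 3_400_000, 500_000]})

matches = match_loci(study, found, window=1_000_000)
print(matches)
```

A distance of 0 corresponds to an exact SNP match, while a non-zero distance within the window is a locus match only. Shrinking `window` therefore shows how many locus matches depend on the window being large.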

In this notebook, I present the results of running our method with various window sizes (250kb, 375kb, 500kb, 750kb, 1Mb), comparing the brain track set ("reference") to the aggregated result of 16 random sets (id = a through p). The random sets contain only non-brain tracks and were sampled without replacement, so that the sets contain a diverse representation of all the tracks.
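As a rough sketch of how random sets like these could be drawn, the snippet below shuffles a pool of non-brain track IDs and slices it into disjoint sets. Everything here is assumed for illustration: the pool of 400 track IDs, the set size of 20, the seed, and the function name `sample_disjoint_sets` are hypothetical, and the actual sets may have been sampled differently (for example, without replacement only within each set).

```python
import random

def sample_disjoint_sets(track_ids, set_size, labels, seed=0):
    """Split a shuffled pool into one disjoint set per label (illustrative sketch)."""
    pool = list(track_ids)
    if set_size * len(labels) > len(pool):
        raise ValueError("Not enough tracks to build disjoint sets of this size")
    rng = random.Random(seed)
    rng.shuffle(pool)
    # Consecutive slices of a shuffled pool never share a track
    return {
        label: pool[i * set_size:(i + 1) * set_size]
        for i, label in enumerate(labels)
    }

# Hypothetical pool of non-brain track IDs and the 16 set labels (a through p)
non_brain_tracks = list(range(400))
labels = list("abcdefghijklmnop")

random_sets = sample_disjoint_sets(non_brain_tracks, set_size=20, labels=labels)
print({label: tracks[:3] for label, tracks in random_sets.items()})
```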

In [1]:
import pandas as pd

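# Window sizes to test, in base pairs
window_sizes = [250000, 375000, 500000, 750000, 1000000]

# IDs of the 16 random track sets
keys = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p']

# SNPs located in coding regions, used to count coding vs. non-coding SNPs
coding_snps_file = pd.read_csv('../gwas_1_single_track_analysis/coding_region_set.csv')
coding_snps_set = set(coding_snps_file['snp'])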

In [2]:
# Create a reference table: one row per window size
reference_rows = []
for window in window_sizes:
    reference_path = f"snp_lists_results/id=reference/window={window}/filtered_snps_gwas_1_sd=0.0.csv"
    reference_snps = pd.read_csv(reference_path)
    # Count how many reference SNPs fall in coding regions
    reference_coding_snps = int(reference_snps['snp'].isin(coding_snps_set).sum())
    reference_rows.append({'window': window, 'total_snps': len(reference_snps),
                           'total_coding': reference_coding_snps,
                           'total_non_coding': len(reference_snps) - reference_coding_snps})

# Build the table from the collected rows (avoids the private DataFrame._append method)
reference_results = pd.DataFrame(reference_rows, columns=['window', 'total_snps', 'total_coding', 'total_non_coding'])
reference_results.sort_values(by=['window'], ascending=False)

| | window | total_snps | total_coding | total_non_coding |
|---|---|---|---|---|
| 4 | 1000000 | 29 | 0 | 29 |
| 3 | 750000 | 31 | 0 | 31 |
| 2 | 500000 | 37 | 1 | 36 |
| 1 | 375000 | 37 | 0 | 37 |
| 0 | 250000 | 44 | 0 | 44 |


The above table simply outlines the reference set sizes for each window size. For example, for window size 250000, the reference set contains 44 total SNPs, 0 of which are coding.
Below, I create a results table for each window size. Note that all of these results use a 0.5 SD threshold.

In [3]:
import pandas as pd

RESULT_COLUMNS = ["set", "total_snps_found", "total_snps_overlap", "total_loci_overlap", "total_coding_snps", "average_p_value"]
RANDOM_SET_KEYS = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p']


def load_combined_result(set_id, window):
    """Read the single-row combined result (SD = 0.5) for one track set and window size."""
    data = pd.read_csv(f"combined_results/id={set_id}/window={window}/combined_result_sd=0.5.csv").iloc[0]
    return {
        "set": set_id,
        "total_snps_found": data['num_snps_found'],
        "total_snps_overlap": data['num_snps_overlap'],
        "total_loci_overlap": data['num_loci_overlap'],
        "total_coding_snps": data['num_coding_snps'],
        "average_p_value": data['p_value'],
    }


# Function to create a table for each window size
def create_window_size_results(window):
    # Start with the reference (brain track set) result
    rows = [load_combined_result("reference", window)]

    # Collect data for each random set (a - p)
    id_data = pd.DataFrame([load_combined_result(key, window) for key in RANDOM_SET_KEYS], columns=RESULT_COLUMNS)

    # Aggregate the statistics across all random sets
    numeric = id_data.drop(columns="set").astype(float)
    for stat in ["mean", "median", "max", "min"]:
        rows.append({"set": f"random_aggregated_{stat}", **numeric.agg(stat).to_dict()})

    # Building the table from a list of rows avoids concatenating onto an empty DataFrame
    return pd.DataFrame(rows, columns=RESULT_COLUMNS)
   


### Result for window size = 1Mb

In [4]:
df = create_window_size_results(1000000)
df


| | set | total_snps_found | total_snps_overlap | total_loci_overlap | total_coding_snps | average_p_value |
|---|---|---|---|---|---|---|
| 0 | reference | 32.0 | 14.0 | 28.0 | 3.0 | 9.215529e-08 |
| 1 | random_aggregated_mean | 31.4375 | 17.375 | 28.0 | 2.4375 | 8.108919e-08 |
| 2 | random_aggregated_median | 31.0 | 17.5 | 28.0 | 2.5 | 8.156106e-08 |
| 3 | random_aggregated_max | 35.0 | 21.0 | 28.0 | 3.0 | 9.725933e-08 |
| 4 | random_aggregated_min | 30.0 | 14.0 | 28.0 | 1.0 | 6.988058e-08 |


The total number of possible SNPs / overlapping loci is 29 (the reference set size for this window). Every set, brain or random, recovers 28 of these loci. This is not a surprise, as it has already been seen in the gwas_1_single_track_analysis notebooks. In addition, the number of matching SNPs is higher on average in the random sets (mean 17.4) than in the brain track set (14). The random sets also find slightly fewer SNPs in total on average (31.4 versus 32).

### Result for window size = 750000

In [5]:
df = create_window_size_results(750000)
df


| | set | total_snps_found | total_snps_overlap | total_loci_overlap | total_coding_snps | average_p_value |
|---|---|---|---|---|---|---|
| 0 | reference | 34.0 | 14.0 | 30.0 | 3.0 | 9.215529e-08 |
| 1 | random_aggregated_mean | 33.4375 | 18.3125 | 30.0 | 1.8125 | 8.108919e-08 |
| 2 | random_aggregated_median | 33.0 | 18.0 | 30.0 | 2.0 | 8.156106e-08 |
| 3 | random_aggregated_max | 37.0 | 23.0 | 30.0 | 3.0 | 9.725933e-08 |
| 4 | random_aggregated_min | 32.0 | 14.0 | 30.0 | 1.0 | 6.988058e-08 |


Again, there is not much difference between the brain track set and the random sets. The reference set size here is 31, and both the brain tracks and the random sets recover 30 loci. The number of matching SNPs is again higher on average in the random sets (mean 18.3 versus 14), which is surprising. The 3 coding SNPs found in the 1Mb analysis are also seen consistently here.

### Result for window size = 500000

In [6]:
df = create_window_size_results(500000)
df


| | set | total_snps_found | total_snps_overlap | total_loci_overlap | total_coding_snps | average_p_value |
|---|---|---|---|---|---|---|
| 0 | reference | 40.0 | 20.0 | 36.0 | 3.0 | 9.215529e-08 |
| 1 | random_aggregated_mean | 39.4375 | 24.3125 | 36.0 | 2.625 | 8.108919e-08 |
| 2 | random_aggregated_median | 39.0 | 24.5 | 36.0 | 3.0 | 8.156106e-08 |
| 3 | random_aggregated_max | 43.0 | 29.0 | 36.0 | 3.0 | 9.725933e-08 |
| 4 | random_aggregated_min | 38.0 | 20.0 | 36.0 | 2.0 | 6.988058e-08 |


Similar results to above; the reference set size is n = 37, with 1 coding SNP.

### Result for window size = 375000

In [7]:
df = create_window_size_results(375000)
df


| | set | total_snps_found | total_snps_overlap | total_loci_overlap | total_coding_snps | average_p_value |
|---|---|---|---|---|---|---|
| 0 | reference | 40.0 | 20.0 | 36.0 | 2.0 | 9.215529e-08 |
| 1 | random_aggregated_mean | 39.4375 | 23.5625 | 36.0 | 1.9375 | 8.108919e-08 |
| 2 | random_aggregated_median | 39.0 | 24.0 | 36.0 | 2.0 | 8.156106e-08 |
| 3 | random_aggregated_max | 43.0 | 29.0 | 36.0 | 3.0 | 9.725933e-08 |
| 4 | random_aggregated_min | 38.0 | 19.0 | 36.0 | 1.0 | 6.988058e-08 |


Similar results to above; the reference set size is n = 37, with 0 coding SNPs.

### Result for window size = 250000


In [8]:
df = create_window_size_results(250000)
df


| | set | total_snps_found | total_snps_overlap | total_loci_overlap | total_coding_snps | average_p_value |
|---|---|---|---|---|---|---|
| 0 | reference | 47.0 | 21.0 | 43.0 | 2.0 | 9.215529e-08 |
| 1 | random_aggregated_mean | 46.625 | 25.0 | 43.0 | 1.9375 | 8.108919e-08 |
| 2 | random_aggregated_median | 46.5 | 25.5 | 43.0 | 2.0 | 8.156106e-08 |
| 3 | random_aggregated_max | 50.0 | 32.0 | 43.0 | 3.0 | 9.725933e-08 |
| 4 | random_aggregated_min | 45.0 | 19.0 | 43.0 | 1.0 | 6.988058e-08 |


Similar results to above; the reference set size is n = 44.

Initial impressions: it may be that SAD scores >= 0.5 SD are simply very good at capturing high-activity or otherwise important genomic regions. It may be very hard to differentiate the methods at this threshold because, no matter which track set is used, we recapture essentially every reference locus (for example, 28 of 29 at the 1Mb window).
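To make the comparison across window sizes easier to see at a glance, the per-window tables can be condensed into one summary. The sketch below hard-codes the values reported in the tables in this notebook (reference set size, loci overlap, and SNP overlap for the reference set and the random-set mean); the derived column names are my own and are only for illustration.

```python
import pandas as pd

# Values transcribed from the result tables in this notebook (SD = 0.5)
summary = pd.DataFrame({
    "window": [1000000, 750000, 500000, 375000, 250000],
    "reference_set_size": [29, 31, 37, 37, 44],
    "loci_overlap_reference": [28, 30, 36, 36, 43],
    "loci_overlap_random_mean": [28.0, 30.0, 36.0, 36.0, 43.0],
    "snps_overlap_reference": [14, 14, 20, 20, 21],
    "snps_overlap_random_mean": [17.375, 18.3125, 24.3125, 23.5625, 25.0],
})

# Fraction of reference loci recovered by the brain track set
summary["loci_recovered_fraction"] = (
    summary["loci_overlap_reference"] / summary["reference_set_size"]
)
# Positive values mean the random sets match more SNPs on average than the brain set
summary["snps_overlap_gap"] = (
    summary["snps_overlap_random_mean"] - summary["snps_overlap_reference"]
)

print(summary[["window", "loci_recovered_fraction", "snps_overlap_gap"]])
```

At every window size the loci overlap is identical for the brain and random sets, and the SNP overlap gap favors the random sets, which is consistent with the impression that this threshold does not separate the two.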


This analysis has been repeated at 1.0 SD and 2.0 SD thresholds, with similar results.