# PGLYRP2 - Single Gene Analysis in GP2 genotyping imputed data

#### GP2 Data Release 9


Project: Sex Differences in PGLYRP2 Variant rs892145 in Parkinson's Disease

Version: Python/3.10.12

Last Updated: 12-JUNE-2025

Update Description: Updated gene coordinates and fixed issues with the removal of related individuals

Gene coordinates PGLYRP2 from NCBI gene: hg38 (chr19:15468645-15479501)

## Notebook overview
In this notebook we run regression analyses of *PGLYRP2* gene variants against PD status in the GP2 NBA data using PLINK.

## Description

* Loading Python libraries
* Set paths to the GP2 data
* Install packages
* Create a covariate file with GP2 data
* Annotation of the gene
* Turn binary files into VCF
* Annotate the gene using ANNOVAR
* Burden analyses using RVTESTs
* Case/control analyses
* Sex stratified case/control analyses
* Interaction analyses
* Copy files

## Loading Python libraries

In [None]:
# Use the os package to interact with the environment
import os

# Bring in Pandas for Dataframe functionality
import pandas as pd

import subprocess

# Numpy for basics
import numpy as np

# Use pathlib for file path manipulation
import pathlib

# Use StringIO for working with file contents
from io import StringIO

# Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

# Import the IPython HTML rendering for displaying links to Google Cloud Console
# (IPython.display is the public module; importing display from IPython.core.display is deprecated)
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

# BigQuery for querying data
from google.cloud import bigquery

# Import sys
import sys

## Set paths

In [None]:
## Set path to GP2_RELEASE_PATH

In [None]:
## Set path to:

# EXTENDED_CLINICAL_DATA_PATH
# CLINICAL_DATA_PATH 
# RELATED_DATA_PATH
# RAW_GENO_PATH
# IMPUTED_GENO_PATH
# PCS_PATH

## Install Packages

In [None]:
%%capture
%%bash

# Install plink 1.9
cd /home/jupyter/
if test -e /home/jupyter/plink1.9; then
    echo "Plink is already installed in /home/jupyter/"
else
    echo "Plink is not installed"
    cd /home/jupyter

    wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip 

    unzip -o plink_linux_x86_64_20190304.zip
    mv plink plink1.9
fi

In [None]:
%%bash

# chmod plink 1.9 to make sure you have permission to run the program
chmod u+x /home/jupyter/plink1.9

In [None]:
%%capture
%%bash

# Install plink 2.0
cd /home/jupyter/
if test -e /home/jupyter/plink2; then

echo "Plink2 is already installed in /home/jupyter/"
else
echo "Plink2 is not installed"
cd /home/jupyter/

wget http://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_latest.zip

unzip -o plink2_linux_x86_64_latest.zip

fi

In [None]:
%%bash

# chmod plink 2 to make sure you have permission to run the program
chmod u+x /home/jupyter/plink2

In [None]:
%%capture
%%bash

# Install ANNOVAR: We are adding the download link after registration on the annovar website
# https://www.openbioinformatics.org/annovar/annovar_download_form.php

if test -e /home/jupyter/annovar; then

echo "annovar is already installed in /home/jupyter/annovar"
else
echo "annovar is not installed"
cd /home/jupyter/

wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz

tar xvfz annovar.latest.tar.gz

fi

In [None]:
%%capture
%%bash

# Install BCFtools

if test -e /home/jupyter/bcftools; then
    echo "BCFtools is already installed in /home/jupyter/bcftools"
else
    echo "BCFtools is not installed"
    cd /home/jupyter/

    # Download BCFtools 1.21 (pinned release)
    wget https://github.com/samtools/bcftools/releases/download/1.21/bcftools-1.21.tar.bz2

    # Extract the downloaded file
    tar -xvjf bcftools-1.21.tar.bz2

    # Move into the extracted directory
    cd bcftools-1.21

    # Compile and install BCFtools
    make
    make install

    # Move the installed BCFtools to a specific directory
    mv bcftools /home/jupyter/bcftools
fi

In [None]:
%%capture
%%bash

# Install ANNOVAR: Download resources for annotation

cd /home/jupyter/annovar/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar clinvar_20170905 humandb/
#perl annotate_variation.pl -buildver hg38 -downdb cytoBand humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ensGene humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar exac03 humandb/ 
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp147 humandb/ 
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp47a humandb/ #latest version of dbNSFP
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad211_genome humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ljb26_all humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/


In [None]:
%%bash
cd /home/jupyter/annovar/humandb
ls

In [None]:
%%bash

# Install RVTESTS: Option 1 (~15min)
if [ ! -e /home/jupyter/tools/rvtests ]; then
    echo "RVTESTS not found. Installing..."
    mkdir -p /home/jupyter/tools/rvtests
    cd /home/jupyter/tools/rvtests

    wget https://github.com/zhanxw/rvtests/releases/download/v2.1.0/rvtests_linux64.tar.gz 

    tar -zxvf rvtests_linux64.tar.gz
else
    echo "RVTESTS is already installed."
fi

In [None]:
%%bash
cd /home/jupyter/tools/rvtests/executable
ls

In [None]:
! chmod 777 /home/jupyter/tools/rvtests/executable/rvtest

In [None]:
%%bash

/home/jupyter/tools/rvtests/executable/rvtest --help
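
Before moving on, it can help to confirm that every binary installed above is present and executable. The following is a minimal sketch; `check_tools` is a hypothetical helper (not part of any of the tools), and the paths are the install locations used in the cells above.

```python
import os
from pathlib import Path


def check_tools(paths):
    """Return {name: bool} telling whether each path is an executable file."""
    return {
        name: Path(path).is_file() and os.access(path, os.X_OK)
        for name, path in paths.items()
    }


# Install locations assumed from the cells above
tools = {
    "plink1.9": "/home/jupyter/plink1.9",
    "plink2": "/home/jupyter/plink2",
    "rvtest": "/home/jupyter/tools/rvtests/executable/rvtest",
}

for name, ok in check_tools(tools).items():
    print(f"{name}: {'ready' if ok else 'missing or not executable'}")
```

A `False` entry usually means the download step was skipped or the `chmod` cell has not been run yet.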

## Create a covariate file with GP2 data

In [None]:
# Let's load the master key
key = pd.read_csv(CLINICAL_DATA_PATH, low_memory=False)
print(f'Clinical data (num rows, num columns): {key.shape}')
pd.set_option('display.max_columns', None)
key.head()

In [None]:
# Subsetting to keep only a few columns 
key = key[['GP2ID', 'baseline_GP2_phenotype_for_qc', 'biological_sex_for_qc', 'age_at_sample_collection', 'age_of_onset', 'nba_label','wgs_label']]
# Renaming the columns
key.rename(columns = {'GP2ID':'IID',
                                     'baseline_GP2_phenotype_for_qc':'phenotype',
                                     'biological_sex_for_qc':'SEX', 
                                     'age_at_sample_collection':'AGE', 
                                     'age_of_onset':'AAO'}, inplace = True)
key

In [None]:
key["label"] = key["nba_label"].combine_first(key["wgs_label"])
key = key.drop(columns=["nba_label", "wgs_label"])
key

In [None]:
!mkdir -p /home/jupyter/PGLYRP2_NBA_R9/

In [None]:
ancestries = {'AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH'}

for ancestry in ancestries:
    !mkdir -p /home/jupyter/PGLYRP2_NBA_R9/{ancestry}

In [None]:
# Preview the relatedness file for one ancestry (EUR used as an example)
related_df = pd.read_csv(f'{RELATED_DATA_PATH}/EUR_release9_vwb.related')
related_df.head()

In [None]:
ancestries = {'AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH'}

for ancestry in ancestries:
    
    print(f'WORKING ON: {ancestry}')
    
    ## Subset to keep ancestry of interest 
    ancestry_key = key[key['label']==ancestry].copy()
    ancestry_key = ancestry_key.reset_index(drop=True)

    # Convert phenotype to binary (1/2)
    ## Assign conditions so case=2 and control=1 (PLINK convention); any other label becomes <NA>
    # PD = 2; control = 1
    pheno_mapping = {"PD": 2, "Control": 1}
    ancestry_key['PHENO'] = ancestry_key['phenotype'].map(pheno_mapping).astype('Int64')
    
    # Check value counts of pheno
    print(ancestry_key['PHENO'].value_counts(dropna=False).to_string())
    
    ## Get the PCs
    pcs = pd.read_csv(f'{RAW_GENO_PATH}/{ancestry}/{ancestry}_release9_vwb.eigenvec', sep='\t')
    
    #Select just first 5 PCs
    selected_columns = ['IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
    pcs = pd.DataFrame(data=pcs.iloc[:, 1:7].values, columns=selected_columns)
    
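    # Drop the first row
    # NOTE: pd.read_csv already consumes the first line as the header, so this removes the first
    # sample unless the .eigenvec file carries a second header row; check pcs.head() before keeping it
    pcs = pcs.drop(0)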
    
    # Reset the index to remove any potential issues
    pcs = pcs.reset_index(drop=True)
    
    # Check size
    print(f'PCs: {pcs.shape}')
    
    # Check value counts of SEX
    sex_og_values = ancestry_key['SEX'].value_counts(dropna=False)
    print(f'Sex value counts - original:\n {sex_og_values.to_string()}')
    
    # Convert sex to binary (1/2)
    ## Assign conditions so female=2 and male=1 (PLINK convention); any other value becomes <NA>
    # Female = 2; Male = 1
    sex_mapping = {"Female": 2, "Male": 1}
    ancestry_key['SEX'] = ancestry_key['SEX'].map(sex_mapping).astype('Int64')
    
    # Check value counts of SEX after recoding
    sex_recode_values = ancestry_key['SEX'].value_counts(dropna=False)
    print(f'Sex value counts - recoded:\n{sex_recode_values.to_string()}')
    
    ## Make covariate file
    df = pd.merge(ancestry_key,pcs, on='IID')
    print(f'Check columns for covariate file: {df.columns}')
    
    # Load information about related individuals in the ancestry analyzed
    related_df = pd.read_csv(f'{RELATED_DATA_PATH}/{ancestry}_release9_vwb.related')
    related_df['IID1'] = related_df['IID1'].str.replace('_s1', '', regex=False)
    print(f'Related individuals: {related_df.shape}')
    
    # Make a list of just one set of related people
    related_list = list(related_df['IID1'])
    print(f'Number of related IIDs in dataset before filtering: {df["IID"].isin(related_list).sum()}')
    
    # Remove one member of each related pair (the IID1 column) and keep the other
    df = df[~df["IID"].isin(related_list)]
    
    # Check size
    print(f'Unrelated individuals: {df.shape}')
    
    #Make additional columns - FID, fatid and matid - these are needed for RVtests!!
    #RVtests needs the first 5 columns to be fid, iid, fatid, matid and sex otherwise it does not run correctly
    #Uppercase column name is ok
    #See https://zhanxw.github.io/rvtests/#phenotype-file
    df['FID'] = 0
    df['FATID'] = 0
    df['MATID'] = 0
    
    ## Clean up and keep columns we need
    final_df = df[['FID','IID', 'FATID', 'MATID', 'SEX', 'AGE','AAO', 'PHENO','PC1', 'PC2', 'PC3', 'PC4', 'PC5']].copy()
    
    ##DO NOT replace missing values with -9 as this is misinterpreted by RVtests - needs to be nonnumeric
    #Leave missing values as NA
    
    #Check number of PD cases missing age
    pd_missAge = final_df[(final_df['PHENO']==2)&(final_df['AGE'].isna())]
    print(f'Number of PD cases missing age: {pd_missAge.shape[0]}')
    
    #Check number of controls missing age
    control_missAge = final_df[(final_df['PHENO']==1)&(final_df['AGE'].isna())]
    print(f'Number of controls missing age: {control_missAge.shape[0]}')
   
    ## Make file of sample IDs to keep
    samples_toKeep = final_df[['FID', 'IID']].copy()
    samples_toKeep.columns = ['#FID','IID']
    samplestokeep_path = pathlib.Path(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep')
    
    # Create the output file's parent folder on the local disk, if it doesn't already exist.
    if not samplestokeep_path.parent.exists():
        !mkdir -p {samplestokeep_path.parent}
        print(f'Created {samplestokeep_path.parent}')
    samples_toKeep.to_csv(samplestokeep_path, sep = '\t', index=False)
    finaldf_path = pathlib.Path(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt')
    
    # Create the output file's parent folder on the local disk, if it doesn't already exist.
    if not finaldf_path.parent.exists():
        !mkdir -p {finaldf_path.parent}
        print(f'Created {finaldf_path.parent}')
   
    final_df.to_csv(finaldf_path, sep = '\t', na_rep='NA', index=False)
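
The covariate-file logic can be checked on a toy table before running it on the real data. This is a minimal sketch with made-up sample IDs and values; it only illustrates that unmapped phenotype or sex labels end up as `NA` (not `-9`) and that the first five columns follow the order RVtests expects.

```python
import io

import pandas as pd

# Made-up rows for illustration only
toy_key = pd.DataFrame({
    "IID": ["S1", "S2", "S3", "S4"],
    "phenotype": ["PD", "Control", "Other", "PD"],
    "SEX": ["Female", "Male", "Male", None],
    "AGE": [61.0, 58.0, None, 70.0],
})

# Labels missing from the mapping become <NA> in a nullable integer column
toy_key["PHENO"] = toy_key["phenotype"].map({"PD": 2, "Control": 1}).astype("Int64")
toy_key["SEX"] = toy_key["SEX"].map({"Female": 2, "Male": 1}).astype("Int64")

# Drop one member of each related pair (hypothetical list)
related_list = ["S4"]
toy_key = toy_key[~toy_key["IID"].isin(related_list)].copy()

# RVtests expects fid, iid, fatid, matid, sex as the first five columns
toy_key["FID"] = 0
toy_key["FATID"] = 0
toy_key["MATID"] = 0
toy_covar = toy_key[["FID", "IID", "FATID", "MATID", "SEX", "AGE", "PHENO"]]

# Write to an in-memory buffer instead of disk, with missing values as NA
buffer = io.StringIO()
toy_covar.to_csv(buffer, sep="\t", na_rep="NA", index=False)
covar_text = buffer.getvalue()
print(covar_text)
```

Sample `S3` keeps its row but carries `NA` for both `AGE` and `PHENO`, so it is excluded only from analyses that need those fields.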

In [None]:
finaldf = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/EUR/EUR_covariate_file.txt', sep='\t')
finaldf.head()

In [None]:
samplestokeep = pd.read_csv('/home/jupyter/PGLYRP2_NBA_R9/EUR/EUR.samplestokeep', sep='\t')
samplestokeep.head()

In [None]:
#Check mean age and SD for cases and controls, also divided by the sexes

ancestries = {'AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH'}

for ancestry in ancestries:
    print(f'\n\nWORKING ON: {ancestry}')
    final_df = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt', sep='\t')

    # Check value counts of PHENO (e.g., 1 = control, 2 = case)
    pheno_counts = final_df['PHENO'].value_counts(dropna=False)
    print(f'PHENO counts:\n{pheno_counts}')

    # Mean and SD of AGE per PHENO
    age_stats = final_df.groupby('PHENO')['AGE'].agg(['mean', 'std']).rename(columns={'mean': 'Mean Age', 'std': 'SD Age'})
    print(f'Mean age and SD:\n{age_stats}')

    # Counts of SEX per PHENO
    sex_counts = pd.crosstab(final_df['PHENO'], final_df['SEX'])
    # Rename by value so the labels stay correct even if one sex is absent (SEX: 1=Male, 2=Female)
    sex_counts = sex_counts.rename(columns={1: 'Male (1)', 2: 'Female (2)'})
    print(f'Sex counts:\n{sex_counts}')

    # Optional: Add sex percentage if you want to see proportions too
    sex_pct = pd.crosstab(final_df['PHENO'], final_df['SEX'], normalize='index') * 100
    sex_pct = sex_pct.rename(columns={1: '% Male', 2: '% Female'})
    print(f'Sex percentages:\n{sex_pct}')
    

In [None]:
# Males - Check N cases and controls for individuals with age data:

ancestries = {'AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH'}

for ancestry in ancestries:
    print(f'\n\nWORKING ON - MALES: {ancestry}')
    final_df = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt', sep='\t')

    #Males
    df_filtered_males_age = final_df[final_df['AGE'].notna() & (final_df['AGE'] != '') & (final_df['SEX'] == 1)]

    # Check value counts of PHENO (e.g., 1 = control, 2 = case)
    pheno_counts = df_filtered_males_age['PHENO'].value_counts(dropna=False)
    print(f'PHENO counts:\n{pheno_counts}')


In [None]:
# Females - Check N cases and controls for individuals with age data:

ancestries = {'AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH'}

for ancestry in ancestries:
    print(f'\n\nWORKING ON - FEMALES: {ancestry}')
    final_df = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt', sep='\t')

    #Females
    df_filtered_females_age = final_df[final_df['AGE'].notna() & (final_df['AGE'] != '') & (final_df['SEX'] == 2)]

    # Check value counts of PHENO (e.g., 1 = control, 2 = case)
    pheno_counts = df_filtered_females_age['PHENO'].value_counts(dropna=False)
    print(f'PHENO counts:\n{pheno_counts}')


## Annotation of the gene

### Extract the region using PLINK

- Extract *PGLYRP2* gene 
- *PGLYRP2* coordinates: chr19:15468645-15479501
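
To avoid retyping the coordinates in each PLINK call, the region string can be split into its parts once. This is a small sketch; `parse_region` is a hypothetical helper, and the only input assumed is the hg38 region string given above.

```python
import re


def parse_region(region: str) -> dict:
    """Split a 'chrN:start-end' string into chromosome, start and end."""
    match = re.fullmatch(r"(?:chr)?(\w+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"Unrecognised region string: {region!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("Region start must not exceed region end")
    # Both ends are inclusive, hence the +1
    return {"chr": chrom, "from_bp": start, "to_bp": end, "length_bp": end - start + 1}


pglyrp2_region = parse_region("chr19:15468645-15479501")
print(pglyrp2_region)
```

The `chr`, `from_bp` and `to_bp` values map directly onto the `--chr`, `--from-bp` and `--to-bp` flags used in the extraction step.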

In [None]:
## extract region using plink
for ancestry in ancestries:
    
    ! /home/jupyter/plink2 \
    --pfile {IMPUTED_GENO_PATH}/{ancestry}/chr19_{ancestry}_release9_vwb \
    --chr 19 \
    --from-bp 15468645 \
    --to-bp 15479501 \
    --mac 2 \
    --hwe 0.0001 \
    --make-bed \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2

In [None]:
# Visualize bim file
! head /home/jupyter/PGLYRP2_NBA_R9/EUR/EUR_PGLYRP2.bim

In [None]:
# Visualize fam file
! head /home/jupyter/PGLYRP2_NBA_R9/EUR/EUR_PGLYRP2.fam

## Turn binary files into VCF

In [None]:
for ancestry in ancestries:
        
    ## Turn binary files into VCF
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --recode vcf id-paste=iid \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2

In [None]:
!sudo apt-get update -y
!sudo apt-get install -y tabix

In [None]:
### Bgzip and Tabix (zip and index the file)
for ancestry in ancestries:    
    ! bgzip -f /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.vcf
    ! tabix -f -p vcf /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.vcf.gz 
    

## Annotate using ANNOVAR

In [None]:
## annotate using ANNOVAR
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
for ancestry in ancestries:
        
    ! perl /home/jupyter/annovar/table_annovar.pl /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.vcf.gz /home/jupyter/annovar/humandb/ -buildver hg38 \
    -out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.annovar \
    -remove -protocol refGene,clinvar_20170905,dbnsfp47a \
    -operation g,f,f \
    --nopolish \
    -nastring . \
    -vcfinput

In [None]:
test = pd.read_csv('/home/jupyter/PGLYRP2_NBA_R9/CAS/CAS_PGLYRP2.annovar.hg38_multianno.txt',sep='\t')
test.head()

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
        
    print(f'WORKING ON: {ancestry}')
    
    # Read in ANNOVAR multianno file
    gene = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.annovar.hg38_multianno.txt', sep = '\t')
    
    #Filter for the correct gene name (sometimes other genes are also included)
    gene = gene[gene['Gene.refGene'] == 'PGLYRP2']
    
    # Convert the CADD scores to float, set errors to NaN
    gene['CADD_phred'] = pd.to_numeric(gene['CADD_phred'], errors='coerce')

    #Print number of variants in the different categories
    results = [] 

    intronic = gene[gene['Func.refGene']== 'intronic']
    upstream = gene[gene['Func.refGene']== 'upstream']
    downstream = gene[gene['Func.refGene']== 'downstream']
    utr5 = gene[gene['Func.refGene']== 'UTR5']
    utr3 = gene[gene['Func.refGene']== 'UTR3']
    splicing = gene[gene['Func.refGene']== 'splicing']
    exonic = gene[gene['Func.refGene']== 'exonic']
    stopgain = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'stopgain')]
    stoploss = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'stoploss')]
    startloss = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'startloss')]
    frameshift_deletion = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'frameshift deletion')]
    frameshift_insertion = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'frameshift insertion')]
    nonframeshift_deletion = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'nonframeshift deletion')]
    nonframeshift_insertion = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'nonframeshift insertion')]
    coding_nonsynonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'nonsynonymous SNV')]
    coding_synonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'synonymous SNV')]
        
    print(ancestry)
    print('Total variants: ', len(gene))
    print("Intronic: ", len(intronic))
    print("Upstream: ", len(upstream))
    print("Downstream: ", len(downstream))
    print('UTR3: ', len(utr3))
    print('UTR5: ', len(utr5))
    print("Splicing: ", len(splicing))
    print("Total exonic: ", len(exonic))
    print("Stopgain: ", len(stopgain))
    print("Stoploss: ", len(stoploss))
    print("Startloss: ", len(startloss))
    print("Frameshift deletion: ", len(frameshift_deletion))
    print("Frameshift insertion: ", len(frameshift_insertion))
    print("Non-frameshift insertion: ", len(nonframeshift_insertion))
    print("Non-frameshift deletion: ", len(nonframeshift_deletion))
    print('Synonymous: ', len(coding_synonymous))
    print("Nonsynonymous: ", len(coding_nonsynonymous))
    results.append((gene, intronic, upstream, downstream, utr3, utr5, splicing,exonic,stopgain,stoploss,startloss, frameshift_deletion,frameshift_insertion,nonframeshift_deletion,nonframeshift_insertion,coding_synonymous, coding_nonsynonymous))
    print('\n')
    
    ## For rvtests
    
    # Potential functional: These are variants annotated as frameshift, nonframeshift, startloss, stoploss, stopgain, splicing, missense, exonic, UTR5, UTR3, upstream (-100bp), downstream (+100bp), or ncRNA. 
    potentially_functional = gene[gene['Func.refGene'] != 'intronic']
    # Coding: These are variants annotated as frameshift, nonframeshift, startloss, stoploss, stopgain, splicing, or missense.
    coding_variants = gene[(gene['Func.refGene'] == 'splicing') | ((gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] != 'synonymous SNV'))]
    # Loss of function: These are variants annotated as frameshift, startloss,stopgain, or splicing.
    loss_of_function = gene[(gene['Func.refGene'] == 'splicing') | (gene['ExonicFunc.refGene'] == 'stopgain') | (gene['ExonicFunc.refGene'] == 'startloss') | (gene['ExonicFunc.refGene'] == 'frameshift deletion') | (gene['ExonicFunc.refGene'] == 'frameshift insertion')]
    
    # Save in PLINK format
    variants_toKeep = potentially_functional[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.potentially_functional.variantstoKeep.txt', sep="\t", index=False, header=False)

    variants_toKeep2 = coding_variants[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep2.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.coding_variants.variantstoKeep.txt', sep="\t", index=False, header=False)

    variants_toKeep3 = loss_of_function[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep3.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.loss_of_function.variantstoKeep.txt', sep="\t", index=False, header=False)
        
    maf_01 = gene[gene['Otherinfo1'] < 0.01]
    variants_toKeep4 = maf_01[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep4.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.maf_01.variantstoKeep.txt', sep="\t", index=False, header=False)

    maf_03 = gene[gene['Otherinfo1'] < 0.03]
    variants_toKeep5 = maf_03[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep5.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.maf_03.variantstoKeep.txt', sep="\t", index=False, header=False)

    # For assoc
    
    # These are all exonic variants
    exonic = gene[gene['Func.refGene'] == 'exonic']
    
    # Save in PLINK format
    variants_toKeep7 = exonic[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep7.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.exonic.variantstoKeep.txt', sep="\t", index=False, header=False)
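
The three variant classes defined above can be sanity-checked on a toy annotation table. This is a minimal sketch with made-up rows; only the two ANNOVAR column names (`Func.refGene`, `ExonicFunc.refGene`) are taken from the cell above.

```python
import pandas as pd

# Made-up annotations, one row per variant
toy_anno = pd.DataFrame({
    "Func.refGene": ["intronic", "exonic", "exonic", "splicing", "UTR3", "exonic"],
    "ExonicFunc.refGene": [".", "synonymous SNV", "nonsynonymous SNV", ".", ".", "stopgain"],
})
func = toy_anno["Func.refGene"]
exonic_func = toy_anno["ExonicFunc.refGene"]

# Potentially functional: everything that is not intronic
toy_potentially_functional = toy_anno[func != "intronic"]

# Coding: splicing, or exonic but not synonymous (explicit parentheses around the AND)
toy_coding = toy_anno[(func == "splicing") | ((func == "exonic") & (exonic_func != "synonymous SNV"))]

# Loss of function: splicing, stopgain, startloss or frameshift
lof_labels = ["stopgain", "startloss", "frameshift deletion", "frameshift insertion"]
toy_lof = toy_anno[(func == "splicing") | exonic_func.isin(lof_labels)]

print(len(toy_potentially_functional), len(toy_coding), len(toy_lof))
```

Using `isin` for the loss-of-function labels gives the same selection as chaining `==` comparisons with `|`, and is easier to extend.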
    

In [None]:
!head /home/jupyter/PGLYRP2_NBA_R9/EAS/EAS_PGLYRP2.potentially_functional.variantstoKeep.txt

## Burden Analyses using RVTests

In [None]:
#Prepare the file format for RVTESTs
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['potentially_functional', 'coding_variants','loss_of_function','maf_01','maf_03']

#Loop over all the ancestries and the 5 variant classes
for ancestry in ancestries:
    for variant_class in variant_classes:
                
        # Print the command to be executed (for debugging purposes)
        print(f'Running plink to extract {variant_class} variants for ancestry: {ancestry}')
        
        #Extract relevant variants
        ! /home/jupyter/plink2 \
        --pfile {IMPUTED_GENO_PATH}/{ancestry}/chr19_{ancestry}_release9_vwb \
        --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
        --extract range /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.{variant_class}.variantstoKeep.txt \
        --recode vcf id-paste=iid \
        --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.{variant_class}
        
        # Print the command to be executed (for debugging purposes)
        print(f'Running bgzip and tabix for {variant_class} variants for ancestry: {ancestry}')
        
        ## Bgzip and Tabix (zip and index the file)
        ! bgzip -f /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.{variant_class}.vcf
        ! tabix -f -p vcf /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.{variant_class}.vcf.gz

In [None]:
#Loop over the different variant classes 
#Run with all covariates (including age)
#Run RVtests
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['potentially_functional', 'coding_variants','loss_of_function','maf_01','maf_03']

for ancestry in ancestries:
    for variant_class in variant_classes:
                
        # Print the command to be executed (for debugging purposes)
        print(f'Running RVtests for {variant_class} variants for ancestry: {ancestry}')
        
        ## RVtests with covariates 
        #Make sure the pheno and covariate file starts with these first 5 columns: fid, iid, fatid, matid, sex
        #The pheno-name flag only works when the pheno/covar file is structured properly
        ! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
        --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.burden.{variant_class} \
        --kernel skat,skato \
        --inVcf /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.{variant_class}.vcf.gz \
        --pheno /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
        --pheno-name PHENO \
        --gene PGLYRP2 \
        --geneFile /home/jupyter/refFlat.txt \
        --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
        --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
        

In [None]:
# Look at one of the output files
! cat /home/jupyter/PGLYRP2_NBA_R9/EUR/EUR_PGLYRP2.burden.maf_03.Skat.assoc

In [None]:
#Loop over the different variant classes 
#Run with all covariates (excluding age)
#Run RVtests
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['potentially_functional', 'coding_variants','loss_of_function','maf_01','maf_03']

for ancestry in ancestries:
    for variant_class in variant_classes:
                
        # Print the command to be executed (for debugging purposes)
        print(f'Running RVtests for {variant_class} variants for ancestry: {ancestry}')
        
        ## RVtests with covariates 
        #Make sure the pheno and covariate file starts with these first 5 columns: fid, iid, fatid, matid, sex
        #The pheno-name flag only works when the pheno/covar file is structured properly
        ! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
        --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.burden.{variant_class}_NOAGE \
        --kernel skat,skato \
        --inVcf /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.{variant_class}.vcf.gz \
        --pheno /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
        --pheno-name PHENO \
        --gene PGLYRP2 \
        --geneFile /home/jupyter/refFlat.txt \
        --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
        --covar-name SEX,PC1,PC2,PC3,PC4,PC5
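
The two RVtests runs differ only in the covariate list and the output suffix, so the command can also be assembled by one function. This is a sketch under assumptions: `build_rvtest_cmd` is a hypothetical helper, it uses only the flags and paths that appear in the cells above, and it builds the argument list without executing anything.

```python
from pathlib import Path


def build_rvtest_cmd(ancestry, variant_class, include_age=True,
                     workdir="/home/jupyter/PGLYRP2_NBA_R9",
                     rvtest="/home/jupyter/tools/rvtests/executable/rvtest",
                     gene_file="/home/jupyter/refFlat.txt"):
    """Return the rvtest argument list for one ancestry and variant class."""
    base = Path(workdir) / ancestry
    covar_file = base / f"{ancestry}_covariate_file.txt"
    covars = ["SEX", "PC1", "PC2", "PC3", "PC4", "PC5"]
    if include_age:
        covars.insert(1, "AGE")
    # Runs without age get the _NOAGE suffix, as in the cell above
    suffix = "" if include_age else "_NOAGE"
    return [
        rvtest, "--noweb", "--hide-covar",
        "--out", str(base / f"{ancestry}_PGLYRP2.burden.{variant_class}{suffix}"),
        "--kernel", "skat,skato",
        "--inVcf", str(base / f"{ancestry}_PGLYRP2.{variant_class}.vcf.gz"),
        "--pheno", str(covar_file),
        "--pheno-name", "PHENO",
        "--gene", "PGLYRP2",
        "--geneFile", gene_file,
        "--covar", str(covar_file),
        "--covar-name", ",".join(covars),
    ]


rvtest_cmd = build_rvtest_cmd("EUR", "maf_03", include_age=False)
print(" ".join(rvtest_cmd))
```

The returned list can be passed to `subprocess.run(rvtest_cmd, check=True)` once the input files exist.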
        

In [None]:
# Look at one of the output files
! cat /home/jupyter/PGLYRP2_NBA_R9/AFR/AFR_PGLYRP2.burden.maf_03_NOAGE.Skat.assoc

## Case/Control Analysis

### Glossary

- CHR Chromosome code
- SNP Variant identifier
- A1 Allele 1 (usually minor)
- A2 Allele 2 (usually major)
- MAF Allele 1 frequency in all subjects
- F_A/MAF_A Allele 1 frequency in cases
- F_U/MAF_U Allele 1 frequency in controls
- NCHROBS_A Number of case allele observations
- NCHROBS_U Number of control allele observations
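
To make the glossary concrete, the allele frequencies and the allelic odds ratio can be worked out by hand from allele counts. This is a sketch with made-up counts; `allelic_summary` is a hypothetical helper, not a PLINK function.

```python
def allelic_summary(a1_cases, nchrobs_cases, a1_controls, nchrobs_controls):
    """Compute F_A, F_U and the allelic odds ratio from allele counts."""
    f_a = a1_cases / nchrobs_cases          # A1 frequency in cases
    f_u = a1_controls / nchrobs_controls    # A1 frequency in controls
    a2_cases = nchrobs_cases - a1_cases
    a2_controls = nchrobs_controls - a1_controls
    # Odds of carrying A1 in cases divided by the odds in controls (2x2 allele table)
    odds_ratio = (a1_cases * a2_controls) / (a2_cases * a1_controls)
    return {"F_A": f_a, "F_U": f_u, "OR": odds_ratio}


# Made-up example: 100 cases and 100 controls, so 200 allele observations each
toy_summary = allelic_summary(a1_cases=60, nchrobs_cases=200,
                              a1_controls=40, nchrobs_controls=200)
print(toy_summary)
```

Here A1 is more frequent in cases (0.30) than in controls (0.20), which corresponds to an odds ratio above 1.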

## ALL VARIANTS
#### assoc

In [None]:
#Run case-control analysis using plink assoc for ALL variants, not adjusting for any covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --assoc \
    --maf 0.01 \
    --mac 2 \
    --hwe 0.0001 \
    --adjust \
    --allow-no-sex \
    --ci 0.95 \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants
    
    #--recode A writes a .raw text file with one row per sample and one column per variant, coded as the number of copies (0/1/2) of the counted allele (A1, usually the minor allele).
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants

In [None]:
#Process results from plink assoc unadjusted analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05 - we will save results dataframe of all variants
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    #Look at assoc results
    freq = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants.assoc', sep=r'\s+')
    
    #Filter for significant variants p < 0.05 - if any
    sig_all_nonadj = freq[freq['P']<0.05]
    
    print(f'There are {len(sig_all_nonadj)} variants with p-value < 0.05')

    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants.raw', sep=r'\s+')

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Pre-filter the dataset into cases and controls
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample totals per group
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        # Note: the 'Carrier Freq' columns hold the counted-allele frequency among called genotypes
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })
        
    # Collect the per-variant results into a DataFrame
    df_results = pd.DataFrame(results)
    df_results['SNP'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
          
    #Merge with the assoc file
    sig_merge = freq[['SNP','A1','F_A','F_U','A2','L95','OR','U95','P']]
    merged = pd.merge(df_results, sig_merge, on='SNP', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}') 
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_assoc.txt', sep = '\t', index=False)
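
The per-variant counting in the loop above can be sketched as a small helper. This is a minimal, pure-Python illustration on toy lists rather than the notebook's actual code path; the function name `summarize_genotypes` and the toy genotypes are hypothetical. Note that the quantity the notebook stores under 'Carrier Freq' is computed as (2 x hom + het) / (2 x called genotypes), which is the frequency of the counted allele.

```python
from collections import Counter

def summarize_genotypes(genotypes):
    """Summarize one .raw-style column of 0/1/2 allele counts (anything else = missing)."""
    counts = Counter(g for g in genotypes if g in (0, 1, 2))
    hom, het, hom_ref = counts[2], counts[1], counts[0]
    called = hom + het + hom_ref
    return {
        "hom": hom,
        "het": het,
        "hom_ref": hom_ref,
        "missing": len(genotypes) - called,
        # Frequency of the counted allele among called genotypes only
        "allele_freq": (2 * hom + het) / (2 * called) if called else float("nan"),
    }

# Toy genotypes (hypothetical, not GP2 data)
toy_cases = [2, 1, None, 0]
toy_controls = [0, 1, 0, 0]
case_summary = summarize_genotypes(toy_cases)
control_summary = summarize_genotypes(toy_controls)
print(case_summary)
print(control_summary)
```

Returning NaN when no genotype is called avoids a division by zero for ancestries with very few samples.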

#### glm - Adjusting for age

In [None]:
#Run case-control analysis for all variants, adjusting for SEX, AGE and PC1-PC5
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --glm \
    --adjust \
    --maf 0.01 \
    --mac 2 \
    --ci 0.95 \
    --hwe 0.0001 \
    --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age
    
    # --recode A writes a .raw text file of additive (0/1/2) counts of the A1 allele (the minor allele by default) for each sample and variant.
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age

In [None]:
#Process results from plink glm analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05 - we will save results dataframe of all coding variants

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    print(f'WORKING ON: {ancestry}')
    
    #Read in glm results
    assoc = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age.PHENO1.glm.logistic.hybrid', delim_whitespace=True)
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Filter for significant variants p < 0.05 - if any
    significant = assoc_add[assoc_add['P']<0.05]
    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the samples into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })

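    # Collect the per-variant summaries into a dataframe;
    # strip the counted-allele suffix (e.g. '_T') that --recode A appends to each variant ID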
    df_results = pd.DataFrame(results)
    df_results['ID'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
    
    #Merge with the glm file
    sig_merge = assoc_add[['ID','A1','A1_FREQ','OBS_CT','L95','OR','U95','LOG(OR)_SE','Z_STAT','P']]
    merged = pd.merge(df_results, sig_merge, on='ID', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}')
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_glm_age.txt', sep = '\t', index=False)
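
The `OR`, `LOG(OR)_SE`, `L95`, `U95`, `Z_STAT` and `P` columns merged above are related through the usual Wald formulas. The sketch below is a sanity check under that assumption, not PLINK's own implementation; the function names and the example values are hypothetical, and results may differ slightly from PLINK output because of rounding in the report.

```python
import math

def wald_ci(odds_ratio, log_or_se, z_crit=1.959964):
    """95% Wald confidence interval for an odds ratio, computed on the log scale."""
    log_or = math.log(odds_ratio)
    return math.exp(log_or - z_crit * log_or_se), math.exp(log_or + z_crit * log_or_se)

def wald_z_and_p(odds_ratio, log_or_se):
    """Wald z statistic and two-sided normal p-value for log(OR) = 0."""
    z = math.log(odds_ratio) / log_or_se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical values, not taken from the GP2 results
lower, upper = wald_ci(1.2, 0.1)
z_stat, p_value = wald_z_and_p(1.2, 0.1)
print(lower, upper, z_stat, p_value)
```

Comparing these recomputed values with the reported columns for a few rows is a quick way to confirm that the merged table was read and aligned correctly.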

Check whether any variants are significant in the adjusted analysis:

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for anc in ancestries:
    file_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2.allvariants_age.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        glm_age_adjust = pd.read_csv(file_path, delim_whitespace=True)
        sorted_glm = glm_age_adjust.sort_values(by='BONF', ascending=True) #Sort on Bonferroni, smallest to largest
        print(f"\nTop entries for {anc}:")
        print(sorted_glm.head())  # or change to sorted_glm.to_string(index=False) for a cleaner look
    except FileNotFoundError:
        print(f"File not found for ancestry: {anc}")
    except Exception as e:
        print(f"Error processing {anc}: {e}")



#### glm - Not adjusting for age

In [None]:
#Run case-control analysis for all variants, adjusting for SEX and PC1-PC5 (AGE excluded)
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --glm \
    --adjust \
    --maf 0.01 \
    --mac 2 \
    --ci 0.95 \
    --hwe 0.0001 \
    --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --covar-name SEX,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage
    
    # --recode A writes a .raw text file of additive (0/1/2) counts of the A1 allele (the minor allele by default) for each sample and variant.
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage

In [None]:
#Process results from plink glm analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05 - we will save results dataframe of all coding variants

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    print(f'WORKING ON: {ancestry}')
    
    #Read in glm results
    assoc = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage.PHENO1.glm.logistic.hybrid', delim_whitespace=True)
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Filter for significant variants p < 0.05 - if any
    significant = assoc_add[assoc_add['P']<0.05]
    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the samples into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })

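    # Collect the per-variant summaries into a dataframe;
    # strip the counted-allele suffix (e.g. '_T') that --recode A appends to each variant ID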
    df_results = pd.DataFrame(results)
    df_results['ID'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
    
    #Merge with the glm file
    sig_merge = assoc_add[['ID','A1','A1_FREQ','OBS_CT','L95','OR','U95','LOG(OR)_SE','Z_STAT','P']]
    merged = pd.merge(df_results, sig_merge, on='ID', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}')
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_glm_noage.txt', sep = '\t', index=False)

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for anc in ancestries:
    file_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2.allvariants_noage.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        glm_noage_adjust = pd.read_csv(file_path, delim_whitespace=True)
        sorted_glm = glm_noage_adjust.sort_values(by='BONF', ascending=True) #Sort on Bonferroni, smallest to largest
        print(f"\nTop entries for {anc}:")
        print(sorted_glm.head())  # or change to sorted_glm.to_string(index=False) for a cleaner look
    except FileNotFoundError:
        print(f"File not found for ancestry: {anc}")
    except Exception as e:
        print(f"Error processing {anc}: {e}")
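
The `BONF` and `HOLM` columns in the `.adjusted` files are multiple-testing corrections of the unadjusted p-values. A minimal sketch of the two standard procedures follows, assuming the textbook definitions; the function names and toy p-values are hypothetical, and this is not PLINK's code.

```python
def bonferroni(pvalues):
    """Bonferroni correction: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def holm(pvalues):
    """Holm step-down correction, returned in the original order of the p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

# Toy p-values (hypothetical)
toy_p = [0.01, 0.04, 0.03]
bonf_adjusted = bonferroni(toy_p)
holm_adjusted = holm(toy_p)
print(bonf_adjusted)
print(holm_adjusted)
```

Holm is never more conservative than Bonferroni, and the two agree for the smallest p-value, which is why the top row of a sorted table can show identical values in both columns.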

### LD calculations for identified variants of interest 

In [None]:
# Check LD between SNPs of interest.
# Variant flagged as potentially significant, from "Top entries for AFR (Males - no age)":
#   #CHROM                  ID A1     UNADJ        GC      BONF      HOLM
# 0      19  chr19:15470473:T:A  A  0.000665  0.020361  0.043217  0.043217
#
# LD is calculated against the rs892145 variant, chr19:15475861A>T [GRCh38].
#
# Update the variant IDs in the --ld call below with the variants of interest.

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --ld chr19:15470473:T:A chr19:15475861:A:T \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_ld_result
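
As a rough cross-check of the LD results, r² can be approximated from the 0/1/2 allele counts in a `.raw` file as the squared Pearson correlation between two variants. This is a simplified sketch: PLINK's `--ld` works from estimated haplotype frequencies, so its r² can differ from this genotype-based value. The function name and toy vectors are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two 0/1/2 allele-count vectors.

    Pairs where either genotype is None are skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return float("nan")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a monomorphic variant has no defined correlation
    return cov * cov / (var_x * var_y)

# Toy allele counts (hypothetical)
r2_linked = genotype_r2([0, 1, 2, 0, None], [0, 1, 2, 0, 1])
r2_unlinked = genotype_r2([0, 0, 2, 2], [0, 2, 0, 2])
print(r2_linked, r2_unlinked)
```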
    

## Extracting information on rs892145 for all ancestries + HWE

##### assoc

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_assoc.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")
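
The lookup above uses `str.contains`, which matches substrings, so an ID that merely starts with the target (for example an indel at the same position) would also be returned. A minimal sketch of an exact match on the variant ID is shown below; the helper names and the toy column names are hypothetical.

```python
def variant_id_from_raw_column(column_name):
    """Drop the counted-allele suffix (e.g. '_T') from a .raw column name."""
    return column_name.rsplit("_", 1)[0]

def exact_matches(column_names, target_variant):
    """Return the column names whose variant ID equals the target exactly."""
    return [name for name in column_names
            if variant_id_from_raw_column(name) == target_variant]

# Toy column names (hypothetical); the second one shares a prefix with the target
toy_columns = ["chr19:15475861:A:T_T", "chr19:15475861:A:TG_TG", "chr19:15470473:T:A_A"]
target = "chr19:15475861:A:T"
substring_hits = [name for name in toy_columns if target in name]
exact_hits = exact_matches(toy_columns, target)
print(substring_hits)
print(exact_hits)
```

If more than one row is printed for an ancestry, an exact comparison of this kind makes clear which row is rs892145.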

##### glm - age

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_glm_age.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"Match in {ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

In [None]:
#Get the adjusted results for the SNP
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where the ID column contains the target variant
        match = df[df['ID'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

##### glm - no age

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_glm_noage.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

In [None]:
#Get the adjusted results for the SNP
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where the ID column contains the target variant
        match = df[df['ID'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

#### HWE - rs892145

In [None]:
## Extract the SNP by position and calculate HWE in samples with PHENO1 == 1 (controls)
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --chr 19 \
    --from-bp 15475861 \
    --to-bp 15475861 \
    --keep-if PHENO1==1 \
    --hardy \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2_HWE

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    file_path = f"/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2_HWE.hardy"
    print(f"--- Contents of {file_path} ---")
    try:
        with open(file_path, 'r') as file:
            print(file.read())
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    print("\n" + "-"*80 + "\n")
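
To help interpret the genotype counts printed above, the sketch below computes a Hardy-Weinberg chi-square test (1 degree of freedom) from three genotype counts. This is an approximation for illustration only: PLINK reports an exact test, so its p-values will differ, especially at low counts. The function name and example counts are hypothetical.

```python
import math

def hwe_chi_square(hom_a1, het, hom_a2):
    """Chi-square goodness-of-fit test for Hardy-Weinberg proportions (1 df)."""
    n = hom_a1 + het + hom_a2
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * hom_a1 + het) / (2 * n)  # frequency of allele A1
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (hom_a1, het, hom_a2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: one set in exact HWE, one with no heterozygotes
chi2_balanced, p_balanced = hwe_chi_square(25, 50, 25)
chi2_skewed, p_skewed = hwe_chi_square(50, 0, 50)
print(chi2_balanced, p_balanced)
print(chi2_skewed, p_skewed)
```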

## Sex-stratified analyses

#### assoc - males

In [None]:
#Run case-control analysis using plink assoc for ALL variants, not adjusting for any covariates and only in males (=1)
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --assoc \
    --filter-males \
    --maf 0.01 \
    --mac 2 \
    --hwe 0.0001 \
    --adjust \
    --allow-no-sex \
    --ci 0.95 \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_males
    
    # --recode A writes a .raw text file of additive (0/1/2) counts of the A1 allele (the minor allele by default) for each sample and variant.
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --filter-males \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_males

In [None]:
#Process results from plink assoc unadjusted analysis 
#As there are very few or no significant variants with p-value < 0.05 - we will save results dataframe of all coding variants
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    #Look at assoc results
    freq = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_males.assoc', delim_whitespace=True)
    
    #Filter for significant variants p < 0.05 - if any
    sig_all_nonadj = freq[freq['P']<0.05]
    
    print(f'There are {len(sig_all_nonadj)} variants with p-value < 0.05')

    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_males.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the samples into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })
        
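    # Collect the per-variant summaries into a dataframe;
    # strip the counted-allele suffix (e.g. '_T') that --recode A appends to each variant ID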
    df_results = pd.DataFrame(results)
    df_results['SNP'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
          
    #Merge with the assoc file
    sig_merge = freq[['SNP','A1','F_A','F_U','A2','L95','OR','U95','P']]
    merged = pd.merge(df_results, sig_merge, on='SNP', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}') 
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_assoc_males.txt', sep = '\t', index=False)
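
The `OR` and `P` columns from the unadjusted `--assoc` run come from a basic allelic case/control test. The sketch below reproduces that kind of test from a 2x2 table of allele counts, assuming the standard 1-df chi-square and Wald interval formulas and no empty cells; it is not PLINK's code, and the function name and counts are hypothetical.

```python
import math

def allelic_test(a1_cases, a2_cases, a1_controls, a2_controls):
    """Allelic chi-square test (1 df), odds ratio and 95% Wald CI from allele counts."""
    a, b, c, d = a1_cases, a2_cases, a1_controls, a2_controls
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return {
        "chi2": chi2,
        "p": math.erfc(math.sqrt(chi2 / 2)),  # chi-square survival function, 1 df
        "or": odds_ratio,
        "l95": math.exp(math.log(odds_ratio) - 1.959964 * se),
        "u95": math.exp(math.log(odds_ratio) + 1.959964 * se),
    }

# Hypothetical allele counts: 30 A1 / 70 A2 in cases, 20 A1 / 80 A2 in controls
toy_result = allelic_test(30, 70, 20, 80)
print(toy_result)
```

Allele counts can be derived from the genotype counts in the merged table as 2 x hom + het for A1 and 2 x hom_ref + het for A2.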

#### glm - males - age

In [None]:
#Run case-control analysis for all variants in males only, adjusting for AGE and PC1-PC5
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --glm \
    --filter-males \
    --adjust \
    --maf 0.01 \
    --mac 2 \
    --ci 0.95 \
    --hwe 0.0001 \
    --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --covar-name AGE,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_males
    
    # --recode A writes a .raw text file of additive (0/1/2) counts of the A1 allele (the minor allele by default) for each sample and variant.
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --filter-males \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_males

In [None]:
#Process results from plink glm analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05 - we will save results dataframe of all coding variants

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    print(f'WORKING ON: {ancestry}')
    
    #Read in glm results
    assoc = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_males.PHENO1.glm.logistic.hybrid', delim_whitespace=True)
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Filter for significant variants p < 0.05 - if any
    significant = assoc_add[assoc_add['P']<0.05]
    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_males.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the samples into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })

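    # Collect the per-variant summaries into a dataframe;
    # strip the counted-allele suffix (e.g. '_T') that --recode A appends to each variant ID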
    df_results = pd.DataFrame(results)
    df_results['ID'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
    
    #Merge with the glm file
    sig_merge = assoc_add[['ID','A1','A1_FREQ','OBS_CT','L95','OR','U95','LOG(OR)_SE','Z_STAT','P']]
    merged = pd.merge(df_results, sig_merge, on='ID', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}')
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_age_males.txt', sep = '\t', index=False)

Check whether any variants are significant in the adjusted analysis:

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for anc in ancestries:
    file_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2.allvariants_age_males.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        glm_age_adjust = pd.read_csv(file_path, delim_whitespace=True)
        sorted_glm = glm_age_adjust.sort_values(by='BONF', ascending=True) #Sort on Bonferroni, smallest to largest
        print(f"\nTop entries for {anc}:")
        print(sorted_glm.head())  # or change to sorted_glm.to_string(index=False) for a cleaner look
    except FileNotFoundError:
        print(f"File not found for ancestry: {anc}")
    except Exception as e:
        print(f"Error processing {anc}: {e}")



#### glm - males - no age

In [None]:
#Run case-control analysis for all variants in males only, adjusting for PC1-PC5 (AGE excluded)
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --glm \
    --adjust \
    --filter-males \
    --maf 0.01 \
    --mac 2 \
    --ci 0.95 \
    --hwe 0.0001 \
    --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --covar-name PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_males
    
    # --recode A writes a .raw text file of additive (0/1/2) counts of the A1 allele (the minor allele by default) for each sample and variant.
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --filter-males \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_males

In [None]:
#Process results from plink glm analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05 - we will save results dataframe of all coding variants

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    print(f'WORKING ON: {ancestry}')
    
    #Read in glm results
    assoc = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_males.PHENO1.glm.logistic.hybrid', delim_whitespace=True)
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Filter for significant variants p < 0.05 - if any
    significant = assoc_add[assoc_add['P']<0.05]
    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_males.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the samples into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })

    # Build the results dataframe and recover the variant ID from the .raw column name (ID_allele)
    df_results = pd.DataFrame(results)
    df_results['ID'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
    
    #Merge with the glm file
    sig_merge = assoc_add[['ID','A1','A1_FREQ','OBS_CT','L95','OR','U95','LOG(OR)_SE','Z_STAT','P']]
    merged = pd.merge(df_results, sig_merge, on='ID', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}')
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_noage_males.txt', sep = '\t', index=False)
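
The genotype counting in the cell above is repeated in every processing cell. As a minimal sketch, the same arithmetic can be isolated in one function. The helper name `allele_counts` and the toy genotype list are illustrative assumptions, not part of the original notebook. The sketch assumes additive coding as written by `--recode A` (0, 1 or 2 copies of the counted allele), with `None` standing in for a missing call. Note that the value stored under 'Carrier Freq in Cases' and 'Carrier Freq in Controls' is the frequency of the counted allele among non-missing chromosomes, not the fraction of carriers.

```python
def allele_counts(genotypes):
    """Summarise additive-coded genotypes (0/1/2, None = missing call)."""
    hom = sum(1 for g in genotypes if g == 2)      # two copies of the counted allele
    het = sum(1 for g in genotypes if g == 1)      # one copy
    hom_ref = sum(1 for g in genotypes if g == 0)  # zero copies
    called = hom + het + hom_ref
    missing = len(genotypes) - called
    # Allele frequency over non-missing chromosomes; NaN if nothing was called
    freq = (2 * hom + het) / (2 * called) if called else float('nan')
    return {'hom': hom, 'het': het, 'hom_ref': hom_ref,
            'missing': missing, 'freq': freq}

# Hypothetical genotypes for six samples at one variant
toy = [0, 1, 2, None, 0, 1]
summary = allele_counts(toy)
print(summary)
```

Guarding the denominator matters for ancestries where a group has no called genotypes; the notebook's own expression would divide by zero in that case.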

Check whether any variants are significant in the adjusted analysis:

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for anc in ancestries:
    file_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2.allvariants_noage_males.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        glm_age_adjust = pd.read_csv(file_path, delim_whitespace=True)
        sorted_glm = glm_age_adjust.sort_values(by='BONF', ascending=True) #Sort on Bonferroni, smallest to largest
        print(f"\nTop entries for {anc}:")
        print(sorted_glm.head())  # or change to sorted_glm.to_string(index=False) for a cleaner look
    except FileNotFoundError:
        print(f"File not found for ancestry: {anc}")
    except Exception as e:
        print(f"Error processing {anc}: {e}")
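
For orientation, the `BONF` column that the cell above sorts on holds Bonferroni-corrected p-values. A minimal sketch of that correction follows, assuming the standard definition (unadjusted p-value multiplied by the number of tests, capped at 1). The function name and the p-values are made up for illustration.

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests, cap at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical unadjusted p-values for three variants
toy_p = [0.004, 0.03, 0.5]
adjusted = bonferroni(toy_p)

# Variants that remain below 0.05 after correction
significant = [p for p in adjusted if p < 0.05]
print(adjusted, len(significant))
```

A variant with an unadjusted p-value just under 0.05 will generally not survive this correction once more than one variant is tested, which is why the unadjusted counts printed earlier and the adjusted tables can disagree.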



#### assoc - females

In [None]:
#Run case-control analysis using plink assoc for ALL variants, not adjusting for any covariates and only in females (=2)
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --assoc \
    --filter-females \
    --maf 0.01 \
    --mac 2 \
    --hwe 0.0001 \
    --adjust \
    --allow-no-sex \
    --ci 0.95 \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_females
    
    #--recode A writes an additive-coded text file (.raw): one column per variant holding each sample's count (0/1/2) of the counted allele
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --filter-females \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_females

In [None]:
#Process results from plink assoc unadjusted analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05, we save a results dataframe of all tested variants
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    #Look at assoc results
    freq = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_females.assoc', delim_whitespace=True)
    
    #Filter for significant variants p < 0.05 - if any
    sig_all_nonadj = freq[freq['P']<0.05]
    
    print(f'There are {len(sig_all_nonadj)} variants with p-value < 0.05')

    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_females.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the dataset into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })
        
    # Build the results dataframe and recover the variant ID from the .raw column name (ID_allele)
    df_results = pd.DataFrame(results)
    df_results['SNP'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
          
    #Merge with the assoc file
    sig_merge = freq[['SNP','A1','F_A','F_U','A2','L95','OR','U95','P']]
    merged = pd.merge(df_results, sig_merge, on='SNP', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}') 
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_assoc_females.txt', sep = '\t', index=False)
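
The `rsplit('_', 1)` step above relies on how `--recode A` labels genotype columns: the variant ID, an underscore, then the counted allele. A small sketch of that parsing follows, using hypothetical column names and a hypothetical helper `split_raw_column`. Splitting from the right keeps any underscores inside the variant ID intact.

```python
def split_raw_column(column_name):
    """Split a .raw genotype column name into (variant ID, counted allele)."""
    variant_id, counted_allele = column_name.rsplit('_', 1)
    return variant_id, counted_allele

# Hypothetical column names; the second has underscores inside the ID itself
names = ['chr19:15475861:A:T_T', 'chr19_KI270_random:100:G:C_C']
parsed = [split_raw_column(name) for name in names]
for variant_id, counted_allele in parsed:
    print(variant_id, counted_allele)
```

The recovered ID is what makes the merge with the PLINK results possible, since the association output lists variants by ID without the allele suffix.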

#### glm - females - age

In [None]:
#Run case-control analysis (plink2 glm) for all variants in females, adjusting for age and PC1-PC5
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --glm \
    --filter-females \
    --adjust \
    --maf 0.01 \
    --mac 2 \
    --ci 0.95 \
    --hwe 0.0001 \
    --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --covar-name AGE,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_females
    
    #--recode A writes an additive-coded text file (.raw): one column per variant holding each sample's count (0/1/2) of the counted allele
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --filter-females \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_females

In [None]:
#Process results from plink glm analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05, we save a results dataframe of all tested variants

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    print(f'WORKING ON: {ancestry}')
    
    #Read in glm results
    assoc = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_females.PHENO1.glm.logistic.hybrid', delim_whitespace=True)
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Filter for significant variants p < 0.05 - if any
    significant = assoc_add[assoc_add['P']<0.05]
    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_females.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the dataset into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })

    # Build the results dataframe and recover the variant ID from the .raw column name (ID_allele)
    df_results = pd.DataFrame(results)
    df_results['ID'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
    
    #Merge with the glm file
    sig_merge = assoc_add[['ID','A1','A1_FREQ','OBS_CT','L95','OR','U95','LOG(OR)_SE','Z_STAT','P']]
    merged = pd.merge(df_results, sig_merge, on='ID', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}')
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_age_females.txt', sep = '\t', index=False)

Check whether any variants are significant in the adjusted analysis:

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for anc in ancestries:
    file_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2.allvariants_age_females.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        glm_age_adjust = pd.read_csv(file_path, delim_whitespace=True)
        sorted_glm = glm_age_adjust.sort_values(by='BONF', ascending=True) #Sort on Bonferroni, smallest to largest
        print(f"\nTop entries for {anc}:")
        print(sorted_glm.head())  # or change to sorted_glm.to_string(index=False) for a cleaner look
    except FileNotFoundError:
        print(f"File not found for ancestry: {anc}")
    except Exception as e:
        print(f"Error processing {anc}: {e}")



#### glm - females - no age

In [None]:
#Run case-control analysis (plink2 glm) for all variants in females, adjusting for PC1-PC5 only (no age)
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --glm \
    --adjust \
    --filter-females \
    --maf 0.01 \
    --mac 2 \
    --ci 0.95 \
    --hwe 0.0001 \
    --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --covar-name PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_females
    
    #--recode A writes an additive-coded text file (.raw): one column per variant holding each sample's count (0/1/2) of the counted allele
    ! /home/jupyter/plink1.9 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --filter-females \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_females

In [None]:
#Process results from plink glm analysis for ALL variants
#As there are very few or no significant variants with p-value < 0.05, we save a results dataframe of all tested variants

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    print(f'WORKING ON: {ancestry}')
    
    #Read in glm results
    assoc = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_females.PHENO1.glm.logistic.hybrid', delim_whitespace=True)
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Filter for significant variants p < 0.05 - if any
    significant = assoc_add[assoc_add['P']<0.05]
    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Read in plink recoded data (.raw file)
    recode = pd.read_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_females.raw', delim_whitespace=True)

    # Make a list from the column names
    column_names = recode.columns.tolist()

    # Drop the first 6 columns to keep the variants 
    variants = column_names[6:]

    print(f'Number of variants in {ancestry} for PGLYRP2: {len(variants)}')

    # Split the dataset into cases (PHENOTYPE == 2) and controls (PHENOTYPE == 1)
    cases_data = recode[recode['PHENOTYPE'] == 2]
    controls_data = recode[recode['PHENOTYPE'] == 1]

    # Sample counts used as denominators below
    total_cases = cases_data.shape[0]
    total_controls = controls_data.shape[0]
    results = []

    for variant in variants:
        ## For PD cases
        hom_cases = (cases_data[variant] == 2).sum()
        het_cases = (cases_data[variant] == 1).sum()
        hom_ref_cases = (cases_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_cases = total_cases - (hom_cases + het_cases + hom_ref_cases)  # Missing data count
        freq_cases = (2 * hom_cases + het_cases) / (2 * (total_cases - missing_cases))  # Adjust for missing data in denominator

        ## For controls
        hom_controls = (controls_data[variant] == 2).sum()
        het_controls = (controls_data[variant] == 1).sum()
        hom_ref_controls = (controls_data[variant] == 0).sum()  # Homozygous reference genotype
        missing_controls = total_controls - (hom_controls + het_controls + hom_ref_controls)  # Missing data count
        freq_controls = (2 * hom_controls + het_controls) / (2 * (total_controls - missing_controls))  # Adjust for missing data in denominator
    
        # Append results in dictionary format
        results.append({
            'Variant': variant,
            'Hom Cases': hom_cases,
            'Het Cases': het_cases,
            'Hom Ref Cases': hom_ref_cases,
            'Missing Cases': missing_cases,
            'Total Cases': total_cases,
            'Carrier Freq in Cases': freq_cases,
            'Hom Controls': hom_controls,
            'Het Controls': het_controls,
            'Hom Ref Controls': hom_ref_controls,
            'Missing Controls': missing_controls,
            'Total Controls': total_controls,
            'Carrier Freq in Controls': freq_controls
        })

    # Build the results dataframe and recover the variant ID from the .raw column name (ID_allele)
    df_results = pd.DataFrame(results)
    df_results['ID'] = df_results['Variant'].apply(lambda x: x.rsplit('_', 1)[0])

    #Print dimensions of the df_results dataframe
    print(f'df_results shape: {df_results.shape}')
    
    #Merge with the glm file
    sig_merge = assoc_add[['ID','A1','A1_FREQ','OBS_CT','L95','OR','U95','LOG(OR)_SE','Z_STAT','P']]
    merged = pd.merge(df_results, sig_merge, on='ID', how='right')
    
    #Print dimensions of the merged dataframe (just adding more columns)
    print(f'Merged dataframe shape: {merged.shape}')
    
    ## Save to CSV
    merged.to_csv(f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_noage_females.txt', sep = '\t', index=False)

Check whether any variants are significant in the adjusted analysis:

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for anc in ancestries:
    file_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2.allvariants_noage_females.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        glm_age_adjust = pd.read_csv(file_path, delim_whitespace=True)
        sorted_glm = glm_age_adjust.sort_values(by='BONF', ascending=True) #Sort on Bonferroni, smallest to largest
        print(f"\nTop entries for {anc}:")
        print(sorted_glm.head())  # or change to sorted_glm.to_string(index=False) for a cleaner look
    except FileNotFoundError:
        print(f"File not found for ancestry: {anc}")
    except Exception as e:
        print(f"Error processing {anc}: {e}")



### Extracting information on rs892145 for all ancestries + HWE

#### males

##### assoc

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_assoc_males.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"Match in {ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")
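
`str.contains` performs a substring match (a regular-expression match by default), so a target ID would also match any longer ID that contains it, for example an indel at the same position. The sketch below, on made-up column names, shows the difference between substring matching and comparing the ID part exactly. It illustrates the trade-off only and does not change the lookup above; `variant_id` is a hypothetical helper.

```python
def variant_id(column_name):
    """Strip the trailing '_<allele>' suffix that --recode A appends to the variant ID."""
    return column_name.rsplit('_', 1)[0]

# Hypothetical 'Variant' column values; the second is an indel at the same position
columns = ['chr19:15475861:A:T_T', 'chr19:15475861:A:TG_TG', 'chr19:15470000:C:G_G']
target = 'chr19:15475861:A:T'

substring_hits = [c for c in columns if target in c]           # behaves like str.contains
exact_hits = [c for c in columns if variant_id(c) == target]   # exact ID comparison

print(substring_hits)
print(exact_hits)
```

If more than one row is printed for an ancestry, an exact comparison along these lines would separate the SNP of interest from its neighbours.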

##### glm - age

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_age_males.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

In [None]:
#Get the adjusted results for the SNP
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_males.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the rows where the ID column contains the target variant
        match = df[df['ID'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

##### glm - no age

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_noage_males.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

In [None]:
#Get the adjusted results for the SNP - noage
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_males.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the rows where the ID column contains the target variant
        match = df[df['ID'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

#### females

##### assoc

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_assoc_females.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"Match in {ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

##### glm - age

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_age_females.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

In [None]:
#Get the adjusted results for the SNP
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_age_females.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the rows where the ID column contains the target variant
        match = df[df['ID'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

##### glm - no age

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.allvariants_noage_females.txt'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the row where Variant column contains the target variant
        match = df[df['Variant'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

In [None]:
#Get the adjusted results for the SNP -no age
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

# Variant to search for
target_variant = 'chr19:15475861:A:T'

# Loop through each ancestry group
for ancestry in ancestries:
    filename = f'/home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2.allvariants_noage_females.PHENO1.glm.logistic.hybrid.adjusted'
    try:
        df = pd.read_csv(filename, sep='\t')
        
        # Filter the rows where the ID column contains the target variant
        match = df[df['ID'].str.contains(target_variant, na=False)]
        
        if not match.empty:
            print(f"{ancestry}:")
            print(match.to_string(index=False))
            print("\n" + "="*80 + "\n")
        else:
            print(f"No match found in {ancestry}.")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

## Interaction analyses

##### Test whether there is a significant interaction between the SNP of interest and sex using R
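
As a sketch of what such a test typically looks like, an interaction analysis fits a logistic model in which the genotype effect is allowed to differ by sex. The form below is an assumption based on the covariates prepared for this step (SEX and PC1 to PC5), not a transcription of the notebook's R model. Here $G$ is the additive genotype (0, 1 or 2 copies of the counted allele) and $S$ is an indicator for sex relative to the chosen reference level.

$$
\operatorname{logit} P(\mathrm{PD}=1) = \beta_0 + \beta_G\,G + \beta_S\,S + \beta_{GS}\,(G \times S) + \sum_{k=1}^{5} \gamma_k\,\mathrm{PC}_k
$$

Under this form, $\beta_G$ is the per-allele log odds ratio in the reference sex, $\beta_G + \beta_{GS}$ is the per-allele log odds ratio in the other sex, and the test of interest is whether $\beta_{GS} = 0$.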

In [None]:
# Install rpy2, the Python interface to R (R itself must already be installed)

!pip install rpy2

In [None]:
# Load rpy2 and activate the R interface

import rpy2.robjects as robjects


In [None]:
# Test if R is working correctly 

robjects.r('R.version.string')

In [None]:
# To use R natively in a cell, load the R magic extension

%load_ext rpy2.ipython

In [None]:
!echo "chr19:15475861:A:T" > /home/jupyter/PGLYRP2_NBA_R9/single_snp.txt
!echo "chr19:15475861:A:T T" > /home/jupyter/PGLYRP2_NBA_R9/reference_allele.txt

In [None]:
# Create a new covariate file where genotypes are coded additively (0,1,2)

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    ! /home/jupyter/plink2 \
    --bfile /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2 \
    --extract /home/jupyter/PGLYRP2_NBA_R9/single_snp.txt \
    --keep /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}.samplestokeep \
    --pheno /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --pheno-name PHENO \
    --covar /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_covariate_file.txt \
    --covar-name SEX,PC1,PC2,PC3,PC4,PC5 \
    --ref-allele /home/jupyter/PGLYRP2_NBA_R9/reference_allele.txt \
    --recode A \
    --out /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/{ancestry}_PGLYRP2_interaction

In [None]:
!head /home/jupyter/PGLYRP2_NBA_R9/EUR/EUR_PGLYRP2_interaction.raw

In [None]:
!head /home/jupyter/PGLYRP2_NBA_R9/EUR/EUR_PGLYRP2_interaction.cov

In [None]:
#Merge the ANC_PGLYRP2_interaction.raw with the ANC_PGLYRP2_interaction.cov on IID
#Drop SEX from one of the files to avoid duplicates

ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for anc in ancestries:
    cov_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2_interaction.cov'
    raw_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2_interaction.raw'

    try:
        # Load both files
        cov_df = pd.read_csv(cov_path, delim_whitespace=True)
        raw_df = pd.read_csv(raw_path, delim_whitespace=True)

        # Rename '#IID' to 'IID' for merge compatibility
        cov_df = cov_df.rename(columns={'#IID': 'IID'})

        # Drop 'SEX' from cov file to avoid duplication (optional, since values are the same)
        if 'SEX' in cov_df.columns:
            cov_df = cov_df.drop(columns=['SEX'])

        # Merge
        merged_df = pd.merge(raw_df, cov_df, on='IID')

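        # Save back to original .raw path (space-delimited, no index)
        # Note: this overwrites the PLINK output, so re-running this cell without regenerating the .raw file
        # would merge the covariates a second time and create duplicated (_x/_y) columns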
        merged_df.to_csv(raw_path, sep=' ', index=False)

        print(f"Saved merged file: {raw_path}")

    except FileNotFoundError:
        print(f"Missing file for {anc}")
    except Exception as e:
        print(f"Error processing {anc}: {e}")


In [None]:
# Choose the ancestry to inspect
anc = 'EUR'

# Path to the saved merged file
merged_path = f'/home/jupyter/PGLYRP2_NBA_R9/{anc}/{anc}_PGLYRP2_interaction.raw'

# Read the merged file
merged_df = pd.read_csv(merged_path, sep=r'\s+')

# Display the first few rows
print(f"\nPreview of merged file for {anc}:")
print(merged_df.head())

In [None]:
%%R
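# Debug check: print one row (line 140 of the AAC file) and count its
# whitespace-separated fields to compare against the header
file_path <- "/home/jupyter/PGLYRP2_NBA_R9/AAC/AAC_PGLYRP2_interaction.raw"
lines <- readLines(file_path)
cat("Line 140 content:\n", lines[140], "\n")
cat("Number of elements:", length(strsplit(lines[140], "\\s+")[[1]]), "\n")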

In [None]:
%%R
# Vector of ancestries
ancestries <- c('AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH')

# Loop over each ancestry
for (ancestry in ancestries) {
  file_path <- paste0("/home/jupyter/PGLYRP2_NBA_R9/", ancestry, "/", ancestry, "_PGLYRP2_interaction.raw")
  
  # Check if file exists
  if (file.exists(file_path)) {
    cat("\nReading:", file_path, "\n")
    
    # Try reading the file with more robust options
    interaction_data <- tryCatch({
      read.table(file_path,
                 header = TRUE,
                 sep = "",           # treat any whitespace as delimiter
                 fill = TRUE,        # fill incomplete lines
                 comment.char = "",  # disable comments
                 quote = "",         # disable quote parsing
                 fileEncoding = "UTF-8")
    }, error = function(e) {
      cat("❌ Error reading file for ancestry:", ancestry, "\n")
      cat("   ➤ File path:", file_path, "\n")
      cat("   ➤ Error message:", e$message, "\n\n")
      return(NULL)
    })
    
    # Proceed only if successful
    if (!is.null(interaction_data)) {
      interaction_data$PHENOTYPE <- ifelse(interaction_data$PHENOTYPE == 2, 1, 0)
      interaction_data$SEX <- factor(interaction_data$SEX, levels = c(1, 2), labels = c("Male", "Female"))
      interaction_data$SEX <- relevel(interaction_data$SEX, ref = "Female")
      
      cat("✅ Successfully processed:", ancestry, "\n")
      print(head(interaction_data))
    }
  } else {
    cat("⚠️ File not found for ancestry:", ancestry, "\n")
  }
}

In [None]:
%%R
ancestries <- c('AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH')

for (ancestry in ancestries) {
  file_path <- paste0("/home/jupyter/PGLYRP2_NBA_R9/", ancestry, "/", ancestry, "_PGLYRP2_interaction.raw")
  
  if (file.exists(file_path)) {
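    interaction_data <- read.table(file_path,
                                   header = TRUE,
                                   sep = "",           # treat any whitespace as delimiter
                                   fill = TRUE,        # fill incomplete lines
                                   comment.char = "",  # disable comments
                                   quote = "",         # disable quote parsing
                                   fileEncoding = "UTF-8")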
    
    interaction_data$PHENOTYPE <- ifelse(interaction_data$PHENOTYPE == 2, 1, 0)
    interaction_data$SEX <- factor(interaction_data$SEX, levels = c(1, 2), labels = c("Male", "Female"))
    interaction_data$SEX <- relevel(interaction_data$SEX, ref = "Female")

    colnames(interaction_data) <- make.names(colnames(interaction_data))

    # Fit logistic regression model with interaction term
    glm_interaction <- glm(PHENOTYPE ~ SEX * `chr19.15475861.A.T_T` + PC1 + PC2 + PC3 + PC4 + PC5,
                           data = interaction_data,
                           family = binomial)
    
    # Print ancestry and model summary
    cat("\nAncestry:", ancestry, "\n")
    print(summary(glm_interaction))
    
    # Print coefficients with their names
    cat("\nCoefficients and names:\n")
    coeffs <- coef(glm_interaction)
    for (name in names(coeffs)) {
      cat(name, ": ", coeffs[name], "\n")
    }
    
    # Print odds ratios with 95% confidence intervals
    cat("\nOdds Ratios and 95% CI:\n")
    odds_ratios <- exp(cbind(OR = coef(glm_interaction), suppressMessages(confint(glm_interaction))))
    print(odds_ratios)
    
    cat("\n", strrep("=", 80), "\n")
    
  } else {
    cat("\nFile not found for ancestry:", ancestry, "\n")
  }
}
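
Because `SEX` is releveled with Female as the reference, the variant coefficient is the per-allele log odds ratio in females, and the `SEXMale:chr19.15475861.A.T_T` interaction coefficient is the difference between the male and female log odds ratios. The sketch below shows how the sex-specific odds ratios could be recovered from those two coefficients. The numeric values are hypothetical placeholders, not results from this analysis, and the interval is a Wald approximation, which will generally differ from the profile-likelihood intervals returned by `confint`.

```python
import math


def stratum_odds_ratios(beta_snp, beta_interaction):
    """Per-allele odds ratios by sex when Female is the reference level."""
    or_female = math.exp(beta_snp)
    or_male = math.exp(beta_snp + beta_interaction)
    return or_female, or_male


def male_log_or_se(var_snp, var_interaction, cov_snp_interaction):
    """Standard error of (beta_snp + beta_interaction), using the variances
    and covariance of the two coefficients (e.g. from vcov() in R)."""
    return math.sqrt(var_snp + var_interaction + 2.0 * cov_snp_interaction)


# Hypothetical coefficients and (co)variances, NOT taken from the GP2 data
beta_snp, beta_interaction = 0.10, -0.25
or_female, or_male = stratum_odds_ratios(beta_snp, beta_interaction)

se_male = male_log_or_se(var_snp=0.004, var_interaction=0.009,
                         cov_snp_interaction=-0.004)
log_or_male = beta_snp + beta_interaction
ci_male = (math.exp(log_or_male - 1.96 * se_male),
           math.exp(log_or_male + 1.96 * se_male))

print(f"OR per allele (female): {or_female:.3f}")
print(f"OR per allele (male):   {or_male:.3f}")
print(f"Wald 95% CI (male):     {ci_male[0]:.3f} to {ci_male[1]:.3f}")
```

The covariance term matters here: the variant and interaction estimates are usually negatively correlated, so ignoring it would tend to overstate the width of the male interval.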


In [None]:
%%R
# Coefficient names from the last model fitted in the loop above
names(coef(glm_interaction))

## Copy files

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    # -p creates missing parent directories and does not fail if the directory already exists
    !mkdir -p /home/jupyter/workspace/pglyrp2/pglyrp2/{ancestry}
    !cp /home/jupyter/PGLYRP2_NBA_R9/{ancestry}/* /home/jupyter/workspace/pglyrp2/pglyrp2/{ancestry}
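
The same copy step could also be written in pure Python, which makes it safe to re-run and skips ancestries that have no output directory. This is a sketch under assumptions rather than the notebook's own method: the helper name `copy_ancestry_outputs` is hypothetical, and the demonstration uses temporary directories instead of the real workspace paths.

```python
import shutil
import tempfile
from pathlib import Path


def copy_ancestry_outputs(source_root, dest_root, ancestries):
    """Copy regular files from source_root/<ancestry>/ to dest_root/<ancestry>/.

    Returns a dict mapping each ancestry to the number of files copied;
    ancestries without a source directory are reported as 0.
    """
    copied = {}
    for ancestry in ancestries:
        source_dir = Path(source_root) / ancestry
        dest_dir = Path(dest_root) / ancestry
        copied[ancestry] = 0
        if not source_dir.is_dir():
            continue
        # exist_ok=True makes the step repeatable
        dest_dir.mkdir(parents=True, exist_ok=True)
        for item in source_dir.iterdir():
            if item.is_file():
                shutil.copy2(item, dest_dir / item.name)
                copied[ancestry] += 1
    return copied


# Demonstration on temporary directories with a single placeholder file
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "results"
    (src / "EUR").mkdir(parents=True)
    (src / "EUR" / "EUR_PGLYRP2_interaction.raw").write_text("FID IID\n")
    summary = copy_ancestry_outputs(src, Path(tmp) / "workspace", ["EUR", "AJ"])
    copied_exists = (
        Path(tmp) / "workspace" / "EUR" / "EUR_PGLYRP2_interaction.raw"
    ).exists()

print(summary)
```

Like the plain `cp` call above, this copies only regular files and leaves any subdirectories behind.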