# MAPT - Haplotype analyses

## GP2 NBA data release 7

## Project: Exploring MAPT-containing H1 and H2 haplotypes in Parkinson's Disease across diverse populations

Version: Python/3.10.12

Last Updated: 29-MAY-2024

Gene coordinates for the region of 17q21.31 (containing MAPT) from the UCSC Browser: chr17:42,800,001-46,800,000 (GRCh38/hg38)

Notebook overview: This notebook compares the frequency of MAPT haplotypes in PD cases and controls using the tagging SNP rs1052553. The analysis shown here is for the AAC ancestry group; it was repeated for the other ancestries available in GP2, with the exception of FIN because of its low sample size.


1. Set up everything. 
2. Extract the rs1052553 SNP - H1/H2 haplotype 
3. Calculate HWE for the SNP
4. Get the frequency and number of individuals of H1 vs H2 without covariates
5. Get the frequency and number of individuals of H1/H1, H1/H2 and H2/H2 
6. Run association analysis with covariates for H1 vs H2
7. Testing the groups H1/H1 vs H1/H2 and H2/H2 (dominant model)
8. Testing the three groups: H1/H1 (reference) vs H1/H2 vs H2/H2



### Getting Started

- Loading Python libraries and defining functions
- Installing packages
- Preparing input files:
  - Copying files
  - Removing related individuals
  - Removing non-PD case/control individuals

#### Loading Python libraries and defining functions

In [None]:
# Use the os package to interact with the environment
import os

# Bring in Pandas for Dataframe functionality
import pandas as pd

# Numpy for basics
import numpy as np

# Use StringIO for working with file contents
from io import StringIO

# Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

# Enable interaction with the FireCloud API
from firecloud import api as fapi

# Import the IPython HTML rendering for displaying links to Google Cloud Console
# (display and HTML are exposed by IPython.display; IPython.core.display is deprecated for this)
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

# BigQuery for querying data
from google.cloud import bigquery

# Import sys (used to print executed commands to stderr)
import sys

In [None]:
# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}', file=sys.stderr)
    !$command
    
def shell_return(command):
    print(f'Executing: {command}', file=sys.stderr)
    output = !$command
    return '\n'.join(output)

# Utility routine for printing a query before executing it
def bq_query(query):
    print(f'Executing: {query}', file=sys.stderr)
    return pd.read_gbq(query, project_id=BILLING_PROJECT_ID, dialect='standard')

# Utility routine for displaying a message and a link
def display_html_link(description, link_text, url):
    html = f'''
    <p>
    </p>
    <p>
    {description}
    <a target=_blank href="{url}">{link_text}</a>.
    </p>
    '''

    display(HTML(html))

# Utility routines for reading files from Google Cloud Storage
def gcs_read_file(path):
    """Return the contents of a file in GCS"""
    contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}
    return '\n'.join(contents)
    
def gcs_read_csv(path, sep=None):
    """Return a DataFrame from the contents of a delimited file in GCS"""
    return pd.read_csv(StringIO(gcs_read_file(path)), sep=sep, engine='python')

# Utility routine for displaying a message and link to Cloud Console
def link_to_cloud_console_gcs(description, link_text, gcs_path):
    url = '{}?{}'.format(
        os.path.join('https://console.cloud.google.com/storage/browser',
                     gcs_path.replace("gs://","")),
        urllib.parse.urlencode({'userProject': BILLING_PROJECT_ID}))

    display_html_link(description, link_text, url)

In [None]:
# Set up billing project and data path variables
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

## Print the information to check we are in the proper release and billing 
## This will be different for you, the user, depending on the billing project your workspace is on
print('Billing and Workspace')
print(f'Workspace Name: {WORKSPACE_NAME}')
print(f'Billing Project: {BILLING_PROJECT_ID}')
print(f'Workspace Bucket, where you can upload and download data: {WORKSPACE_BUCKET}')
print('')

## GP2 v7.0
## Explicitly define release v7.0 path 
GP2_RELEASE_PATH = 'gs://gp2tier2/path/to/release/7'
GP2_CLINICAL_RELEASE_PATH = f'{GP2_RELEASE_PATH}/clinical_data'
GP2_RAW_GENO_PATH = f'{GP2_RELEASE_PATH}/raw_genotypes'
GP2_IMPUTED_GENO_PATH = f'{GP2_RELEASE_PATH}/imputed_genotypes'
GP2_META_RELEASE_PATH = f'{GP2_RELEASE_PATH}/meta_data'
GP2_SUMSTAT_RELEASE_PATH = f'{GP2_RELEASE_PATH}/summary_statistics'

print('GP2 v7.0')
print(f'Path to GP2 v7.0 Clinical Data @ `GP2_CLINICAL_RELEASE_PATH`: {GP2_CLINICAL_RELEASE_PATH}')
print(f'Path to GP2 v7.0 Metadata @ `GP2_META_RELEASE_PATH`: {GP2_META_RELEASE_PATH}')
print(f'Path to GP2 v7.0 Raw Genotype Data @ `GP2_RAW_GENO_PATH`: {GP2_RAW_GENO_PATH}')
print(f'Path to GP2 v7.0 Imputed Genotype Data @ `GP2_IMPUTED_GENO_PATH`: {GP2_IMPUTED_GENO_PATH}')
print(f'Path to GP2 v7.0 summary statistics: {GP2_SUMSTAT_RELEASE_PATH}')

#### Installing packages and software

In [None]:
%%bash
#Installing plink

mkdir -p ~/tools
cd ~/tools

if test -e /home/jupyter/tools/plink; then
echo "Plink1.9 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink \n    -------"
wget -N http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip 
unzip -o plink_linux_x86_64_20190304.zip
echo -e "\n plink downloaded and unzipped in /home/jupyter/tools \n "

fi


if test -e /home/jupyter/tools/plink2; then
echo "Plink2 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink2 \n    -------"
wget -N https://s3.amazonaws.com/plink2-assets/alpha6/plink2_linux_x86_64_20250129.zip
unzip -o plink2_linux_x86_64_20250129.zip
echo -e "\n plink2 downloaded and unzipped in /home/jupyter/tools \n "

fi

In [None]:
%%bash
ls /home/jupyter/tools/

In [None]:
%%bash

# chmod plink 1.9 
chmod u+x /home/jupyter/tools/plink

In [None]:
%%bash

# chmod plink 2.0
chmod u+x /home/jupyter/tools/plink2

#### Preparing input files

In [None]:
# Make a directory
print("Making a working directory")
WORK_DIR = '/home/jupyter/Team6_haplo/'
shell_do(f'mkdir -p {WORK_DIR}')

##### Retrieve the files needed, including the genotype files (using the raw genotype files) and the covariate file

In [None]:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} ls {GP2_RAW_GENO_PATH}')

In [None]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {GP2_RAW_GENO_PATH}/AAC/AAC_* {WORK_DIR}')


Get the covariate file

In [None]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} ls {GP2_CLINICAL_RELEASE_PATH}')

In [None]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {GP2_CLINICAL_RELEASE_PATH}/master_key_release7_final.csv {WORK_DIR}')


##### Remove related individuals

In [None]:
# Select the file that matches with your population
shell_do(f'gsutil -u {BILLING_PROJECT_ID} ls {GP2_META_RELEASE_PATH}/related_samples/')

In [None]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {GP2_META_RELEASE_PATH}/related_samples/AAC_release7.related {WORK_DIR}')

In [None]:
!cat /home/jupyter/Team6_haplo/AAC_release7.related

The ID columns are:
- ID1: Individual ID for the first individual of the pair
- ID2: Individual ID for the second individual of the pair

We remove the individuals listed under ID1, so that only one person in each pair is excluded.
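
The selection described above can be sketched in pandas. This is a minimal illustration on toy data, not the notebook's actual step: the column names `IID1`, `IID2` and `REL` and the sample IDs are assumptions, so check the header of the real `.related` file before reusing it.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for a *.related file; column names and IDs are assumed.
related_csv = StringIO(
    "IID1,IID2,REL\n"
    "S1,S2,duplicate\n"
    "S1,S3,first_degree\n"
    "S4,S5,second_degree\n"
)
pairs = pd.read_csv(related_csv)

# Drop the first member of every pair; a sample that appears in
# several pairs only needs to be listed once.
to_remove = pairs["IID1"].drop_duplicates().tolist()
print(to_remove)
```

Removing the first member of each pair is simple, but it does not always minimise the number of excluded samples when one individual is related to several others.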

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR


cut -d, -f2 AAC_release7.related > related_ids.txt


In [None]:
!cat /home/jupyter/Team6_haplo/related_ids.txt

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile AAC_release7 \
--remove related_ids.txt \
--make-pgen \
--out AAC_release7_nonrelated

##### Remove non-PD case/control individuals

Double-check with the numbers found here for your ancestry group before moving on: https://gp2.org/the-components-of-gp2s-fifth-data-release/

The --prune flag keeps only individuals with a PLINK phenotype of 1 (control) or 2 (case). We need to do this because the MAF in the combined (all samples) group will be different if individuals with a missing phenotype are not removed.
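
As a rough illustration of what this filter does, the sketch below applies the same rule to a toy sample table in pandas. The table contents are made up; the only assumption carried over from PLINK is its phenotype coding (1 = control, 2 = case, -9 or 0 = missing).

```python
import pandas as pd

# Toy .psam-like table; sample IDs and phenotypes are invented.
psam = pd.DataFrame({
    "#IID": ["S1", "S2", "S3", "S4"],
    "PHENO1": [1, 2, -9, 2],
})

# Keep only samples with a defined case/control status.
kept = psam[psam["PHENO1"].isin([1, 2])]
n_controls = int((kept["PHENO1"] == 1).sum())
n_cases = int((kept["PHENO1"] == 2).sum())
print(f"{n_cases} cases, {n_controls} controls")
```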

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile AAC_release7_nonrelated \
--prune \
--make-pgen \
--out AAC_release7_nonrelated_pdc

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR
head AAC_release7_nonrelated_pdc.pvar

### Extract the region of interest 

Here we are interested in the SNP rs1052553
- This SNP was used in the Nigerian MAPT paper
- This SNP will be used as a proxy for the H1/H2 haplotype
- rs1052553 coordinates in GRCh38: 17:45996523
- We also add --mind to remove individuals whose genotype for this variant is missing


In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile AAC_release7_nonrelated_pdc \
--chr 17 \
--from-bp 45996523  \
--to-bp 45996523 \
--mind \
--make-pgen \
--out haplo_h1h2

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR
head haplo_h1h2.pvar

As you can see, there are two variants here with the same coordinates (at least for the AAC population). This is because there were multiple probes for the same variant during genotyping; the results for the two variants should be identical.
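
One way to spot such duplicated positions programmatically is sketched below on a toy variant table. The second and third IDs are invented placeholders, and the column names follow the usual `.pvar` header (`#CHROM`, `POS`, `ID`), which is assumed here.

```python
import pandas as pd

# Toy .pvar-like table; "probe_dup_example" and "other_variant" are made-up IDs.
pvar = pd.DataFrame({
    "#CHROM": [17, 17, 17],
    "POS": [45996523, 45996523, 45996600],
    "ID": ["rs1052553", "probe_dup_example", "other_variant"],
})

# keep=False flags every member of a duplicated (chromosome, position) group.
dups = pvar[pvar.duplicated(subset=["#CHROM", "POS"], keep=False)]
print(dups["ID"].tolist())
```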

### Calculate HWE

In [None]:
%%bash
# Check whether the SNP deviates from HWE (controls only, PHENO1==1)

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2 \
--hardy \
--keep-if PHENO1==1 \
--out haplo_h1h2


In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR
head haplo_h1h2.hardy

Add the p-value to the HWE sheet
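
For intuition about what the HWE test measures, here is a small worked sketch using the textbook chi-square approximation with 1 degree of freedom. This is not the statistic PLINK reports (its --hardy output is based on an exact test), and the genotype counts are invented.

```python
import math


def hwe_chi_square(n_hom1, n_het, n_hom2):
    """Chi-square HWE test (1 df) from the three genotype counts.

    Assumes the variant is polymorphic (both alleles observed).
    """
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)  # frequency of allele 1
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom1, n_het, n_hom2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts in perfect equilibrium versus counts with no heterozygotes at all.
chi2_eq, p_eq = hwe_chi_square(25, 50, 25)
chi2_dev, p_dev = hwe_chi_square(50, 0, 50)
print(chi2_eq, p_eq, chi2_dev, p_dev)
```

With small genotype counts (such as rare H2/H2 carriers) the chi-square approximation is unreliable, which is one reason to rely on the exact-test p-value from PLINK for the results sheet.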

#### Put together the covar file

In [None]:
clin = pd.read_csv('/home/jupyter/Team6_haplo/master_key_release7_final.csv')
clin.info()

In [None]:
gen = pd.read_csv('/home/jupyter/Team6_haplo/AAC_release7.psam', sep='\t')
gen.info()

In [None]:
pcs = pd.read_csv('/home/jupyter/Team6_haplo/AAC_release7.eigenvec', sep='\t')
pcs.info()

In [None]:
gen2 = pd.merge(gen, clin, left_on='#IID', right_on='GP2sampleID')
gen2.info()

In [None]:
gen3 = pd.merge(gen2, pcs, left_on='#IID', right_on='IID')
gen3.info()

In [None]:
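# .copy() makes plink_clin an independent DataFrame, so the fillna assignments below do not raise SettingWithCopyWarning
plink_clin = gen3[['#IID', 'SEX', 'PHENO1', 'age_at_sample_collection', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', "PC6","PC7","PC8","PC9","PC10"]].copy()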
plink_clin.head()

In [None]:
#Set missing values to -9 (plink format)
plink_clin['PHENO1'] = plink_clin['PHENO1'].fillna(-9)
plink_clin['age_at_sample_collection'] = plink_clin['age_at_sample_collection'].fillna(-9)
plink_clin['SEX'] = plink_clin['SEX'].fillna(-9)

In [None]:
plink_clin.head()

In [None]:
#Rename age_at_sample_collection  
plink_clin = plink_clin.rename(columns={'age_at_sample_collection': 'AGE'})
plink_clin.head()

In [None]:
plink_clin.to_csv('/home/jupyter/Team6_haplo/covars.txt', sep='\t', index=False, na_rep='-9')

In [None]:
%%bash
WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

ls

### Get the frequencies for H1 and H2 in cases and controls - "without covariates" (also N)

This includes all individuals, even the ones with missing covariates.
This part has been updated in this notebook (hopefully for the better!)

#### A) H1 vs H2

##### 1) Convert to plink1.9 binary files

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2  \
--make-bed \
--out haplo_h1h2_recode_bed

##### 2) Run --assoc to get the allele frequency in cases and in controls, and the p-value for the difference in allele frequency between cases and controls

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink \
--bfile haplo_h1h2_recode_bed  \
--assoc \
--ci 0.95 \
--out haplo_h1h2

Obtain the number of cases and controls from the plink output. Here: 509 cases and 301 controls

In [None]:
!cat /home/jupyter/Team6_haplo/haplo_h1h2.assoc

Add the numbers above to the H1 vs H2 (WITHOUT covariates, --assoc) sheet in our results document

Add the frequencies and the number of cases and controls to the first sheet:
- F_A = frequency in affected individuals (PD)
- F_U = frequency in unaffected individuals (controls)

Below, we also get the allele counts. These do not need to be added to the sheet; they are kept in case they are needed later.
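
The relationship between allele counts, F_A, F_U and the allelic odds ratio can be sketched as follows. The counts are hypothetical and the helper name `allelic_summary` is introduced only for this illustration.

```python
def allelic_summary(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Return (F_A, F_U, OR) for allele A1 from a 2x2 table of allele counts."""
    f_a = case_a1 / (case_a1 + case_a2)  # A1 frequency in cases
    f_u = ctrl_a1 / (ctrl_a1 + ctrl_a2)  # A1 frequency in controls
    odds_ratio = (case_a1 * ctrl_a2) / (case_a2 * ctrl_a1)
    return f_a, f_u, odds_ratio


# Hypothetical counts: 20 A1 and 80 A2 alleles in cases, 40 and 60 in controls.
f_a, f_u, odds_ratio = allelic_summary(20, 80, 40, 60)
print(f_a, f_u, odds_ratio)
```

Note that these are allele counts, so each genotyped individual contributes two alleles to the table.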

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink \
--bfile haplo_h1h2_recode_bed  \
--assoc counts \
--ci 0.95 \
--out haplo_h1h2_counts

In [None]:
!cat /home/jupyter/Team6_haplo/haplo_h1h2_counts.assoc

### Get the frequencies for H1/H1, H1/H2 and H2/H2 

H1/H1, H1/H2 and H2/H2 groups, coded as the number of copies of the rs1052553 G (H2) allele:

- H1/H1 = 0
- H1/H2 = 1
- H2/H2 = 2
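
As a compact alternative to counting each group in its own cell, the genotype-by-status table can be built in one step with `pd.crosstab`. The sketch below uses invented data with the same column names as the `.raw` file (`PHENOTYPE` coded 1 = control, 2 = case, and `rs1052553_G`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the .raw file; one sample has a missing genotype.
raw = pd.DataFrame({
    "PHENOTYPE": [1, 1, 1, 2, 2, 2, 2],
    "rs1052553_G": [0, 1, np.nan, 0, 0, 1, 2],
})

# Drop missing genotypes, then label the 0/1/2 codes.
clean = raw.dropna(subset=["rs1052553_G"]).copy()
clean["rs1052553_G"] = clean["rs1052553_G"].astype(int)
clean["group"] = clean["rs1052553_G"].map({0: "H1/H1", 1: "H1/H2", 2: "H2/H2"})

# Rows: phenotype (1 = control, 2 = case); columns: haplotype group.
table = pd.crosstab(clean["PHENOTYPE"], clean["group"])
print(table)
```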

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink \
--bfile haplo_h1h2_recode_bed \
--recode A \
--out haplo_h1h2_recodeA

In [None]:
!head /home/jupyter/Team6_haplo/haplo_h1h2_recodeA.raw

In [None]:
cas_haplos = pd.read_csv('/home/jupyter/Team6_haplo/haplo_h1h2_recodeA.raw', sep=' ')
cas_haplos.head()

In [None]:
#total number of samples
cas_haplos = pd.read_csv('/home/jupyter/Team6_haplo/haplo_h1h2_recodeA.raw', sep=' ')
cas_haplos.info()

In [None]:
# Remove samples whose genotype is missing in the rs1052553_G column

cas_haplos_clean = cas_haplos[~cas_haplos['rs1052553_G'].isna()]
cas_haplos_clean.info()

In [None]:
#select only PD cases (total N of PD cases)
cas_haplos_case = cas_haplos_clean[cas_haplos_clean['PHENOTYPE']==2]
cas_haplos_case.info()

In [None]:
#select only controls (Total N of controls)
cas_haplos_control = cas_haplos_clean[cas_haplos_clean['PHENOTYPE']==1]
cas_haplos_control.info()


In [None]:
#total no of H1H1 in both pd & ctrl (no need to write this down in the table)
cas_h1h1 = cas_haplos_clean[cas_haplos_clean['rs1052553_G'] == 0]
cas_h1h1.info()

In [None]:
#no of H1H1 in pd (case) only
cas_h1h1_cases = cas_haplos_case[cas_haplos_case['rs1052553_G'] == 0]
cas_h1h1_cases.info()

In [None]:
#no of H1H1 in ctrls only
cas_h1h1_controls = cas_haplos_control[cas_haplos_control['rs1052553_G'] == 0]
cas_h1h1_controls.info()

In [None]:
#total no of H1H2 in both pd & ctrl (no need to write this down in the table)
cas_h1h2 = cas_haplos_clean[cas_haplos_clean['rs1052553_G'] == 1]
cas_h1h2.info()

In [None]:
#no of H1H2 in pd (case) only
cas_h1h2_cases = cas_haplos_case[cas_haplos_case['rs1052553_G'] == 1]
cas_h1h2_cases.info()

In [None]:
#no of H1H2 in ctrls only
cas_h1h2_controls = cas_haplos_control[cas_haplos_control['rs1052553_G'] == 1]
cas_h1h2_controls.info()

In [None]:
#total no of H2H2 in both pd & ctrl (no need to write this down in the table)
cas_h2h2 = cas_haplos_clean[cas_haplos_clean['rs1052553_G'] == 2]
cas_h2h2.info()

In [None]:
#no of H2H2 in pd (case) only
cas_h2h2_cases = cas_haplos_case[cas_haplos_case['rs1052553_G'] == 2]
cas_h2h2_cases.info()

In [None]:
#no of H2H2 in ctrls only
cas_h2h2_controls = cas_haplos_control[cas_haplos_control['rs1052553_G'] == 2]
cas_h2h2_controls.info()

In [None]:
#save the cas_h1h1_cases to look at the AAO (H1/H1 vs H1/H2 and H2/H2)
cas_h1h1_cases.iloc[:, :2].to_csv('aac_h1h1_cases.csv', index=False)

### Analysing the association between PD and H1 vs H2 haplotypes. Run association analysis with covariates

Covariates: age, sex, PC1-PC5

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2 \
--glm hide-covar firth-fallback pheno-ids \
--covar-name AGE,SEX,PC1,PC2,PC3,PC4,PC5 \
--pheno-name PHENO1 \
--pheno /home/jupyter/Team6_haplo/covars.txt \
--ci 0.95 \
--covar-variance-standardize \
--covar /home/jupyter/Team6_haplo/covars.txt \
--out haplo_h1h2_glm

In [None]:
!cat /home/jupyter/Team6_haplo/haplo_h1h2_glm.PHENO1.glm.logistic.hybrid

OBS_CT in the regression output is the number of samples in the regression (not alleles as before). Here: 639 samples. Note that this differs from the number PLINK reports when loading the phenotypes, since it loads all 509 cases and 301 controls (810 samples). We therefore need to find out how many cases and controls were included in the regression with covariates.

The pheno-ids modifier to --glm writes a list of the IDs of the individuals that were kept in the analysis. We can use this list to extract and count the cases and controls that were included.
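
A pandas version of this count could look like the sketch below. The data are invented, and the `#IID` header of the ID list is an assumption; check the header of the actual `.id` file.

```python
import pandas as pd

# Toy stand-ins: IDs kept in the regression and the phenotype table.
kept_ids = pd.DataFrame({"#IID": ["S1", "S2", "S4"]})
covars = pd.DataFrame({
    "#IID": ["S1", "S2", "S3", "S4"],
    "PHENO1": [2, 1, 2, 2],  # PLINK coding: 1 = control, 2 = case
})

# Restrict the phenotype table to the samples used in the model, then count.
in_model = covars.merge(kept_ids, on="#IID", how="inner")
counts = in_model["PHENO1"].map({1: "control", 2: "case"}).value_counts()
n_cases = int(counts.get("case", 0))
n_controls = int(counts.get("control", 0))
print(f"{n_cases} cases, {n_controls} controls in the regression")
```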

In [None]:
covars = pd.read_csv('/home/jupyter/Team6_haplo/haplo_h1h2_glm.PHENO1.glm.logistic.hybrid.id', sep='\t')
covars.head()

len(covars.index)

As you can see, 639 out of 810 samples (in the AAC population) were included in the regression when the covariates were added. In the cells below we check the haplotype frequency for these 639 samples.

#### Get the frequency of the H1 and H2 haplotypes for the individuals that were included in the regression analysis with covariates:

ALL

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2 \
--keep haplo_h1h2_glm.PHENO1.glm.logistic.hybrid.id \
--freq \
--make-pgen \
--out haplo_h1h2_covariates_N_all

In [None]:
!cat /home/jupyter/Team6_haplo/haplo_h1h2_covariates_N_all.afreq

CASES

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2 \
--keep haplo_h1h2_glm.PHENO1.glm.logistic.hybrid.id \
--keep-if PHENO1='2' \
--freq \
--make-pgen \
--out haplo_h1h2_covariates_N_cases

In [None]:
!cat /home/jupyter/Team6_haplo/haplo_h1h2_covariates_N_cases.afreq

Here we can see that the plink output gives us "411 cases and 0 controls remaining after main filters."

CONTROLS

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2 \
--keep haplo_h1h2_glm.PHENO1.glm.logistic.hybrid.id \
--keep-if PHENO1='1' \
--freq \
--make-pgen \
--out haplo_h1h2_covariates_N_controls

In [None]:
!cat /home/jupyter/Team6_haplo/haplo_h1h2_covariates_N_controls.afreq

As you can see, there were 411 cases and 228 controls (639 samples) in the adjusted analysis

### Testing H1/H1 vs H1/H2 and H2/H2 (dominant model)

The 'dominant' modifier specifies a model assuming full dominance for the A1 allele, i.e. the first genotype column is changed to 0..1..1 encoding. Similarly, 'recessive' makes the first genotype column use 0..0..1 encoding.

Hence, the dominant modifier in PLINK compares H1/H1 against H1/H2 and H2/H2 combined. We do this because few individuals carry H2/H2 (all three groups are tested separately a few cells down).
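
The two encodings quoted above can be illustrated with a short NumPy sketch; the dosage vector is made up.

```python
import numpy as np

# Additive coding: number of copies of the counted allele per sample.
dosage = np.array([0, 1, 2, 1, 0])

# Dominant coding (0..1..1): carriers of at least one copy are grouped.
dominant = (dosage >= 1).astype(int)

# Recessive coding (0..0..1): only homozygous carriers are coded 1.
recessive = (dosage == 2).astype(int)

print(dominant.tolist(), recessive.tolist())
```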


In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2 \
--glm dominant hide-covar firth-fallback \
--covar-name AGE,SEX,PC1,PC2,PC3,PC4,PC5 \
--pheno-name PHENO1 \
--pheno /home/jupyter/Team6_haplo/covars.txt \
--ci 0.95 \
--covar-variance-standardize \
--covar /home/jupyter/Team6_haplo/covars.txt \
--out haplo_h1h2_gml_dominant

In [None]:
!cat /home/jupyter/Team6_haplo/haplo_h1h2_gml_dominant.PHENO1.glm.logistic.hybrid

Enter the results above to the H1/H1 vs H1/H2 and H2/H2 sheet

### Testing the three groups: H1/H1 (reference) vs H1/H2 vs H2/H2

Genotype coding:

- H1/H1 = 0
- H1/H2 = 1
- H2/H2 = 2

We are evaluating the risk of having PD for an H1/H2 carrier or an H2/H2 carrier compared with an H1/H1 carrier.

In [None]:
cas_haplos = pd.read_csv('/home/jupyter/Team6_haplo/haplo_h1h2_recodeA.raw', sep=' ')
cas_haplos.head()

Set phenotype to 0 and 1

In [None]:
cas_haplos['PHENOTYPE'] -= 1
cas_haplos.head()

Rename the genotype column to Haplo; it is treated as a categorical variable through C(Haplo) in the model formula further down

In [None]:
cas_haplos.rename(columns={"rs1052553_G": "Haplo"}, inplace=True)

In [None]:
cas_haplos.info() 

Add the covariates and set missing values (-9) to NA. PLINK interprets -9 as a missing value but Python does not

In [None]:
covars = pd.read_csv('/home/jupyter/Team6_haplo/covars.txt', sep='\t')
covars.head()

In [None]:
covars.replace(-9.0, np.nan, inplace=True)
covars.head()

In [None]:
haplo_groups = pd.merge(cas_haplos, covars, left_on='IID', right_on='#IID')
haplo_groups.info()

In [None]:
import statsmodels.formula.api as smf

In [None]:
haplo_log = smf.logit(formula = 'PHENOTYPE ~ C(Haplo) + SEX_x + AGE + PC1 + PC2 + PC3 + PC4 + PC5' , data = haplo_groups).fit() 
haplo_log.summary() 


To get the odds ratio (OR) and 95% confidence interval (CI) for the OR (above is for the coef)
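
For a single coefficient, the conversion amounts to exponentiating the estimate and its confidence limits. The sketch below assumes a normal-approximation (Wald) interval, beta ± z·SE, and uses made-up values for the coefficient and standard error.

```python
import math


def odds_ratio_ci(beta, se, z=1.959964):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Hypothetical coefficient of 0 (no effect) with a standard error of 0.5.
odds_ratio, ci_low, ci_high = odds_ratio_ci(0.0, 0.5)
print(odds_ratio, ci_low, ci_high)
```

Because the interval is symmetric on the log scale, it is asymmetric around the odds ratio after exponentiation.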

In [None]:
params = haplo_log.params
conf = haplo_log.conf_int()
conf['Odds Ratio'] = params
conf.columns = ['2.5% CI', '97.5% CI', 'Odds Ratio']
print(np.exp(conf))

Note: an error message can occur if there are no H2/H2 haplotype carriers
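
A quick check before fitting can show whether this situation applies. The sketch below lists genotype/phenotype combinations with no samples; it is an illustrative helper (`empty_cells` is a name introduced here), run on invented data with the `Haplo` and `PHENOTYPE` columns coded as in this section. Fitting problems of this kind typically arise when a genotype group is absent, or is present only among cases or only among controls.

```python
import pandas as pd


def empty_cells(df, geno_col="Haplo", pheno_col="PHENOTYPE"):
    """List (genotype, phenotype) combinations that have no samples."""
    table = pd.crosstab(df[geno_col], df[pheno_col])
    # Make sure all three genotype groups and both statuses are represented.
    table = table.reindex(index=[0, 1, 2], columns=[0, 1], fill_value=0)
    return [
        (g, p)
        for g in table.index
        for p in table.columns
        if table.loc[g, p] == 0
    ]


# Toy data without any H2/H2 (Haplo == 2) carriers.
toy = pd.DataFrame({
    "Haplo": [0, 0, 1, 1, 0],
    "PHENOTYPE": [0, 1, 0, 1, 1],
})
missing = empty_cells(toy)
print(missing)
```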

Please add the results to the H1/H1 vs H1/H2 vs H2/H2 sheet

### Test the H1 association to PD

Here we flip the minor and major alleles so that the H2 haplotype becomes the reference haplotype
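
Under an allelic or additive model, switching the reference allele inverts the odds ratio and swaps the confidence limits. The sketch below shows this with made-up numbers; `flip_odds_ratio` is a helper introduced only for illustration.

```python
def flip_odds_ratio(odds_ratio, ci_low, ci_high):
    """Express an OR and its CI relative to the other allele."""
    # The old upper limit becomes the new lower limit, and vice versa.
    return 1 / odds_ratio, 1 / ci_high, 1 / ci_low


# Hypothetical H2 result: OR 0.5 with 95% CI 0.25 to 1.0.
flipped = flip_odds_ratio(0.5, 0.25, 1.0)
print(flipped)
```

This gives a quick sanity check on the flipped PLINK runs: the p-value should be unchanged and the OR should be the reciprocal of the unflipped result.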

#### Assoc

In [None]:
## Prepare allele.txt file to flip major/minor alleles
# Define the SNP and allele
snp = 'rs1052553'
allele = 'G'

# Specify the output file name
output_file = '/home/jupyter/Team6_haplo/alleles.txt'

# Write the SNP and allele to the file
with open(output_file, 'w') as f:
    f.write(f"{snp} {allele}\n")

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink \
--bfile haplo_h1h2_recode_bed  \
--assoc \
--a2-allele alleles.txt \
--ci 0.95 \
--out flip_haplo_h1h2

In [None]:
! cat /home/jupyter/Team6_haplo/flip_haplo_h1h2.assoc

#### glm

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile haplo_h1h2 \
--ref-allele alleles.txt \
--make-pgen \
--out flipped_plink_file

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile flipped_plink_file \
--glm hide-covar omit-ref firth-fallback pheno-ids \
--covar-name AGE,SEX,PC1,PC2,PC3,PC4,PC5 \
--pheno-name PHENO1 \
--pheno /home/jupyter/Team6_haplo/covars.txt \
--ci 0.95 \
--covar-variance-standardize \
--covar /home/jupyter/Team6_haplo/covars.txt \
--out flip_haplo_h1h2_glm

In [None]:
! cat /home/jupyter/Team6_haplo/flip_haplo_h1h2_glm.PHENO1.glm.logistic.hybrid

### GWAS for locus

The region to be used corresponds to 17q21.31 chr17:42800001-46800000 according to UCSC Genome Browser on Human (GRCh38/hg38)

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr17%3A42800001%2D46800000&hgsid=2300497464_YMBoqmHnJWaakVS6O5MFDdk8kbGB
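
As a small sanity check on the coordinates, the sketch below confirms that the tagging SNP position used earlier (17:45996523) falls inside this region. The helper `in_region` is introduced here for illustration only.

```python
# Region boundaries as given above (GRCh38/hg38, 1-based, inclusive).
REGION = ("17", 42_800_001, 46_800_000)


def in_region(chrom, pos, region=REGION):
    """Return True if a variant lies within the region on the same chromosome."""
    region_chrom, start, end = region
    return str(chrom) == region_chrom and start <= pos <= end


print(in_region(17, 45_996_523))
```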

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile AAC_release7_nonrelated_pdc \
--chr 17 \
--maf 0.01 \
--from-bp 42800001 \
--to-bp 46800000 \
--mind \
--make-pgen \
--out AAC_locus_17q

In [None]:
%%bash

WORK_DIR='/home/jupyter/Team6_haplo/'
cd $WORK_DIR

/home/jupyter/tools/plink2 \
--pfile AAC_locus_17q \
--maf 0.01 \
--glm hide-covar --ci 0.95 \
--covar /home/jupyter/Team6_haplo/covars.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--pheno-name PHENO1 \
--pheno /home/jupyter/Team6_haplo/covars.txt \
--covar-variance-standardize \
--out AAC_locus_17q_out

In [None]:
! head /home/jupyter/Team6_haplo/AAC_locus_17q_out.PHENO1.glm.logistic.hybrid

In [None]:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp -r /home/jupyter/Team6_haplo/AAC_locus_17q_out.PHENO1.glm.logistic.hybrid {WORKSPACE_BUCKET}/PLOT_All_pops/AAC_locus_17q.txt')
