## __Single gene analysis__

```GP2 ❤️ Open Science 😍```

### Description

Using individual-level data, this notebook covers the following steps:

__0. Getting Started__
- Loading Python libraries
- Defining functions
- Installing packages

__1. Copy data from workspace to cloud environment__

__2. Extract SH3GL2__

__3. Annotate SH3GL2 variants__

__4. Extract coding/non-syn variants__

__5. Calculate frequency in cases versus controls__

__6. Calculate frequency (homozygotes) in cases versus controls__

__7. Save out results__



### Loading Python libraries

In [1]:
# Use the os package to interact with the environment
import os

# Bring in Pandas for Dataframe functionality
import pandas as pd

# NumPy for numerical basics
import numpy as np

# Use StringIO for working with file contents
from io import StringIO

# Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

# Enable interaction with the FireCloud API
from firecloud import api as fapi

# Import the IPython HTML rendering for displaying links to Google Cloud Console
# (IPython.display is the public module; importing display from IPython.core.display is deprecated)
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

# BigQuery for querying data
from google.cloud import bigquery

# Import sys for writing messages to stderr
import sys

### Defining functions

In [2]:
# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}', file=sys.stderr)
    !$command
    
def shell_return(command):
    print(f'Executing: {command}', file=sys.stderr)
    output = !$command
    return '\n'.join(output)

# Utility routine for printing a query before executing it
def bq_query(query):
    print(f'Executing: {query}', file=sys.stderr)
    return pd.read_gbq(query, project_id=BILLING_PROJECT_ID, dialect='standard')

# Utility routine for displaying a message and a link
def display_html_link(description, link_text, url):
    html = f'''
    <p>
    </p>
    <p>
    {description}
    <a target=_blank href="{url}">{link_text}</a>.
    </p>
    '''

    display(HTML(html))

# Utility routines for reading files from Google Cloud Storage
def gcs_read_file(path):
    """Return the contents of a file in GCS"""
    contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}
    return '\n'.join(contents)
    
def gcs_read_csv(path, sep=None):
    """Return a DataFrame from the contents of a delimited file in GCS"""
    return pd.read_csv(StringIO(gcs_read_file(path)), sep=sep, engine='python')

# Utility routine for displaying a message and link to Cloud Console
def link_to_cloud_console_gcs(description, link_text, gcs_path):
    url = '{}?{}'.format(
        os.path.join('https://console.cloud.google.com/storage/browser',
                     gcs_path.replace("gs://","")),
        urllib.parse.urlencode({'userProject': BILLING_PROJECT_ID}))

    display_html_link(description, link_text, url)
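
The URL construction in `link_to_cloud_console_gcs` can be checked outside a notebook session. The sketch below is an illustration only: `build_console_url` is a hypothetical helper that is not part of this notebook, the bucket and project names are made up, and it swaps `os.path.join` for `posixpath.join` so that the result does not depend on the operating system's path separator.

```python
import posixpath
import urllib.parse

CONSOLE_ROOT = 'https://console.cloud.google.com/storage/browser'

def build_console_url(gcs_path, billing_project):
    """Return a Cloud Console browser URL for a gs:// path (hypothetical helper)."""
    # Strip the scheme so only "bucket/object" remains
    bucket_path = gcs_path.replace('gs://', '', 1)
    # userProject tells the console which project to bill for requester-pays buckets
    query = urllib.parse.urlencode({'userProject': billing_project})
    return f'{posixpath.join(CONSOLE_ROOT, bucket_path)}?{query}'

# Made-up bucket and project names, for illustration only
url = build_console_url('gs://example-bucket/some/folder', 'example-project')
print(url)
```

Keeping the string building separate from `display_html_link` makes it possible to test the URL without rendering HTML.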

### Set paths

In [3]:
# Set up billing project and data path variables
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

## Print the information to check we are in the proper release and billing 
## This will be different for you, the user, depending on the billing project your workspace is on
print('Billing and Workspace')
print(f'Workspace Name: {WORKSPACE_NAME}')
print(f'Billing Project: {BILLING_PROJECT_ID}')
print(f'Workspace Bucket, where you can upload and download data: {WORKSPACE_BUCKET}')
print('')


## GP2 v5.0
## Explicitly define release v5.0 path 
GP2_RELEASE_PATH = 'gs://gp2tier2/release5_11052023'
GP2_CLINICAL_RELEASE_PATH = f'{GP2_RELEASE_PATH}/clinical_data'
GP2_RAW_GENO_PATH = f'{GP2_RELEASE_PATH}/raw_genotypes'
GP2_IMPUTED_GENO_PATH = f'{GP2_RELEASE_PATH}/imputed_genotypes'
print('GP2 v5.0')
print(f'Path to GP2 v5.0 Clinical Data: {GP2_CLINICAL_RELEASE_PATH}')
print(f'Path to GP2 v5.0 Raw Genotype Data: {GP2_RAW_GENO_PATH}')
print(f'Path to GP2 v5.0 Imputed Genotype Data: {GP2_IMPUTED_GENO_PATH}')


## AMP-PD v3.0
## Explicitly define release v3.0 path 
AMP_RELEASE_PATH = 'gs://amp-pd-data/releases/2022_v3release_1115'
AMP_CLINICAL_RELEASE_PATH = f'{AMP_RELEASE_PATH}/clinical'

AMP_WGS_RELEASE_PATH = 'gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS'
AMP_WGS_RELEASE_PLINK_PATH = os.path.join(AMP_WGS_RELEASE_PATH, 'plink')

print('AMP-PD v3.0')
print(f'Path to AMP-PD v3.0 Clinical Data: {AMP_CLINICAL_RELEASE_PATH}')
print(f'Path to AMP-PD v3.0 WGS Data: {AMP_WGS_RELEASE_PLINK_PATH}')
print('')

Billing and Workspace
Workspace Name: Endophilin-A
Billing Project: terra-18d8e41c
Workspace Bucket, where you can upload and download data: gs://fc-f7a400c1-827e-48f8-b7b6-90c488a000a4

GP2 v5.0
Path to GP2 v5.0 Clinical Data: gs://gp2tier2/release5_11052023/clinical_data
Path to GP2 v5.0 Raw Genotype Data: gs://gp2tier2/release5_11052023/raw_genotypes
Path to GP2 v5.0 Imputed Genotype Data: gs://gp2tier2/release5_11052023/imputed_genotypes
AMP-PD v3.0
Path to AMP-PD v3.0 Clinical Data: gs://amp-pd-data/releases/2022_v3release_1115/clinical
Path to AMP-PD v3.0 WGS Data: gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink



### Install packages

### Install Plink 1.9 and Plink 2.0

In [248]:
%%bash

mkdir -p ~/tools
cd ~/tools

if test -e /home/jupyter/tools/plink; then
echo "Plink1.9 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink \n    -------"
wget -N http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip 
unzip -o plink_linux_x86_64_20190304.zip
echo -e "\n plink downloaded and unzipped in /home/jupyter/tools \n "

fi


if test -e /home/jupyter/tools/plink2; then
echo "Plink2 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink2 \n    -------"
wget -N https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_avx2_20220603.zip
unzip -o plink2_linux_avx2_20220603.zip
echo -e "\n plink2 downloaded and unzipped in /home/jupyter/tools \n "

fi

Plink1.9 is already installed in /home/jupyter/tools/
Plink2 is already installed in /home/jupyter/tools/


In [249]:
%%bash
ls /home/jupyter/tools/

annovar
annovar.latest.tar.gz
LICENSE
plink
plink2
plink2_linux_avx2_20220603.zip
plink_linux_x86_64_20190304.zip
prettify
toy.map
toy.ped


### Remove restrictions

In [250]:
%%bash

# chmod plink 1.9 
chmod u+x /home/jupyter/tools/plink

In [251]:
%%bash

# chmod plink 2.0
chmod u+x /home/jupyter/tools/plink2

### Install ANNOVAR

In [252]:
%%bash

# Install ANNOVAR:
# https://www.openbioinformatics.org/annovar/annovar_download_form.php

if test -e /home/jupyter/tools/annovar; then

echo "annovar is already installed in /home/jupyter/tools/"
else
echo "annovar is not installed"
cd /home/jupyter/tools/

wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz

tar xvfz annovar.latest.tar.gz

fi

annovar is already installed in /home/jupyter/tools/


In [253]:
%%bash
ls /home/jupyter/tools/

annovar
annovar.latest.tar.gz
LICENSE
plink
plink2
plink2_linux_avx2_20220603.zip
plink_linux_x86_64_20190304.zip
prettify
toy.map
toy.ped


### Install ANNOVAR: Download sources of annotation

In [254]:
%%bash

cd /home/jupyter/tools/annovar/

perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar clinvar_20140902 humandb/
#perl annotate_variation.pl -buildver hg38 -downdb cytoBand humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ensGene humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar exac03 humandb/ 
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp147 humandb/ 
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp30a humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad211_genome humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ljb26_all humandb/


NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGeneMrna.fa.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGeneVersion.txt.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg38 build version, with files saved at the 'humandb' directory
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20140902.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20140902.txt.idx.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished d

In [255]:
%%bash
ls /home/jupyter/tools/annovar/

annotate_variation.pl
coding_change.pl
convert2annovar.pl
example
humandb
retrieve_seq_from_fasta.pl
table_annovar.pl
variants_reduction.pl


## Copy data from AMP-PD bucket to workspace

In [4]:
# Make a working directory
print("Making a working directory")
WORK_DIR = '/home/jupyter/SH3GL2/'
shell_do(f'mkdir -p {WORK_DIR}') # the f-string substitutes WORK_DIR into the shell command

Making a working directory


Executing: mkdir -p /home/jupyter/SH3GL2/


In [5]:
# Check directory where AMP-PD data is
shell_do(f'gsutil -u {BILLING_PROJECT_ID} ls {AMP_WGS_RELEASE_PLINK_PATH}/pfiles')

Executing: gsutil -u terra-18d8e41c ls gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles


gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.log
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.pgen
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.psam
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.pvar
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr1.pgen
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr1.psam
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr1.pvar
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr10.pgen
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr10.psam
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr10.pvar
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr11.pgen
gs://am

In [258]:
#Copy data
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {AMP_WGS_RELEASE_PLINK_PATH}/pfiles/chr9* {WORK_DIR}')

Executing: gsutil -u terra-18d8e41c -m cp -r gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9* /home/jupyter/SH3GL2/


Copying gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9.pgen...
Copying gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9.psam...
Copying gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9.pvar...
\ [3/3 files][  8.8 GiB/  8.8 GiB] 100% Done  52.2 MiB/s ETA 00:00:00           
Operation completed over 3 objects/8.8 GiB.                                      


### Create a covariate file with AMP-PD data

In [6]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {AMP_RELEASE_PATH}/amp_pd_case_control.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {AMP_CLINICAL_RELEASE_PATH}/Demographics.csv {WORK_DIR}')

Executing: gsutil -u terra-18d8e41c -m cp -r gs://amp-pd-data/releases/2022_v3release_1115/amp_pd_case_control.csv /home/jupyter/SH3GL2/


Copying gs://amp-pd-data/releases/2022_v3release_1115/amp_pd_case_control.csv...
/ [1/1 files][735.1 KiB/735.1 KiB] 100% Done                                    
Operation completed over 1 objects/735.1 KiB.                                    


Executing: gsutil -u terra-18d8e41c -m cp -r gs://amp-pd-data/releases/2022_v3release_1115/clinical/Demographics.csv /home/jupyter/SH3GL2/


Copying gs://amp-pd-data/releases/2022_v3release_1115/clinical/Demographics.csv...
/ [1/1 files][622.2 KiB/622.2 KiB] 100% Done                                    
Operation completed over 1 objects/622.2 KiB.                                    


In [5]:
pd_case_control_df = pd.read_csv(f'{WORK_DIR}/amp_pd_case_control.csv')
#pd_case_control_df.head()

In [6]:
pd_case_control_latest_df = pd_case_control_df[['participant_id', 'diagnosis_latest', 'case_control_other_latest']].copy()

In [7]:
pd_case_control_latest_df.columns = ['ID', 'LATEST_DX', 'CASE_CONTROL']

In [8]:
case_con_reduced = pd_case_control_latest_df.copy()
case_con_reduced.drop_duplicates(subset=['ID'], inplace=True)

In [9]:
conditions = [
    (case_con_reduced['CASE_CONTROL'] == "Case"),
    (case_con_reduced['CASE_CONTROL'] == "Control")]

In [10]:
choices = [2,1]
case_con_reduced['PHENO'] = np.select(conditions, choices, default=-9).astype(np.int64)

In [11]:
case_con_reduced.reset_index(drop=True, inplace=True)
#case_con_reduced

In [12]:
demographics_df = pd.read_csv(f'{WORK_DIR}/Demographics.csv')
#demographics_df.head()

In [13]:
# Rename the columns used downstream in a single call
demographics_df.rename(columns={'participant_id': 'ID',
                                'age_at_baseline': 'BASELINE_AGE',
                                'race': 'RACE'}, inplace=True)

In [14]:
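# Keep one row per participant, taking the earliest visit.
# Note: the merge in the next cell uses demographics_df, not this baseline frame.
demographics_baseline_df = demographics_df \
.sort_values('visit_month', ascending=True) \
.drop_duplicates('ID').sort_index()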

In [15]:
demographics_df_casecon = demographics_df.merge(case_con_reduced, on='ID', how='outer')
#demographics_df_casecon.head()

In [16]:
conditions = [
     (demographics_df_casecon['sex'] == "Male"),
     (demographics_df_casecon['sex'] == "Female")]

In [17]:
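choices = [1,2]
# PLINK codes unknown sex as 0, so use that as the fallback instead of None,
# which cannot be cast cleanly to an integer
demographics_df_casecon['SEX'] = np.select(conditions, choices, default=0).astype(np.int64)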

In [18]:
demographics_df_casecon_toKeep_sex = demographics_df_casecon[['ID', 'ID', 'SEX'
                                                          ]].copy()
demographics_df_casecon_toKeep_sex.columns = ['FID','IID', 'SEX']
demographics_df_casecon_toKeep_sex.to_csv(f'{WORK_DIR}/SEX.txt',index=False)

In [19]:
demographics_df_casecon_toKeep_pheno = demographics_df_casecon[['ID', 'ID', 'PHENO'
                                                          ]].copy()
demographics_df_casecon_toKeep_pheno.columns = ['FID','IID', 'PHENO']
demographics_df_casecon_toKeep_pheno.to_csv(f'{WORK_DIR}/PHENO.txt',index=False)
demographics_df_casecon_toKeep_pheno.to_csv(f'{WORKSPACE_BUCKET}/PHENO_AMPPD.txt',index=False)
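
The recoding above can be tried on a small example before running it on the real tables. This is a minimal sketch on made-up records (the IDs and values are invented); it assumes the usual PLINK 1.9 conventions of 2 = case, 1 = control, -9 = missing phenotype, and 1 = male, 2 = female, 0 = unknown sex.

```python
import numpy as np
import pandas as pd

# Toy records; the IDs and values are made up for illustration
toy = pd.DataFrame({
    'ID': ['P1', 'P2', 'P3', 'P4'],
    'CASE_CONTROL': ['Case', 'Control', 'Other', 'Case'],
    'sex': ['Male', 'Female', 'Female', None],
})

# Phenotype: 2 = case, 1 = control, anything else is treated as missing (-9)
toy['PHENO'] = np.select(
    [toy['CASE_CONTROL'] == 'Case', toy['CASE_CONTROL'] == 'Control'],
    [2, 1], default=-9).astype(np.int64)

# Sex: 1 = male, 2 = female, missing values fall back to 0 (unknown)
toy['SEX'] = np.select(
    [toy['sex'] == 'Male', toy['sex'] == 'Female'],
    [1, 2], default=0).astype(np.int64)

print(toy[['ID', 'SEX', 'PHENO']])
```

Participants with a diagnosis other than "Case" or "Control" end up with `PHENO = -9`, which PLINK excludes from case/control comparisons.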

## Extract SH3GL2

In [49]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# SH3GL2: gene positions on hg38 (from https://useast.ensembl.org/index.html)
/home/jupyter/tools/plink2 \
--pfile chr9 \
--chr 9 \
--from-bp 17579066 \
--to-bp 17797124 \
--make-bed \
--max-alleles 2 \
--out temp_AMPPD

PLINK v2.00a3.3LM AVX2 Intel (3 Jun 2022)      www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to temp_AMPPD.log.
Options in effect:
  --chr 9
  --from-bp 17579066
  --make-bed
  --max-alleles 2
  --out temp_AMPPD
  --pfile chr9
  --to-bp 17797124

Start time: Thu Nov 23 12:26:28 2023
7450 MiB RAM detected; reserving 3725 MiB for main workspace.
Using up to 2 compute threads.
10418 samples (0 females, 0 males, 10418 ambiguous; 10418 founders) loaded from
chr9.psam.
6246283 out of 6777486 variants loaded from chr9.pvar.
Note: No phenotype data present.
14592 variants remaining after main filters.
Writing temp_AMPPD.fam ... done.
Writing temp_AMPPD.bim ... done.
Writing temp_AMPPD.bed ... 0%15%done.
End time: Thu Nov 23 12:27:47 2023


In [50]:
%%bash
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head temp_AMPPD.bim

9	.	0	17579072	A	G
9	.	0	17579076	A	T
9	.	0	17579077	G	T
9	.	0	17579086	A	G
9	.	0	17579091	G	C
9	.	0	17579098	T	C
9	.	0	17579102	G	A
9	.	0	17579107	T	C
9	.	0	17579112	T	C
9	.	0	17579119	C	G


In [51]:
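# .bim columns: chromosome, variant ID, genetic distance (cM), base-pair position, allele 1, allele 2
bim = pd.read_csv("/home/jupyter/SH3GL2/temp_AMPPD.bim",sep="\t",names=['chr','rsID','cM','bp','A1','A2'])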
bim['chr'] = bim['chr'].astype(str)
bim['bp'] = bim['bp'].astype(str)

In [52]:
bim['rsID'] = bim.chr.str.cat(bim.bp, sep=':')

In [55]:
bim.to_csv(f'{WORK_DIR}/temp_AMPPD.bim',sep="\t",index=False,header=None)
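
Replacing the missing variant IDs (`.`) with `chr:bp` labels can be illustrated on two rows copied from the `.bim` preview above. This is a sketch, not the notebook's own cell; the duplicate check at the end is an added assumption about what is worth verifying, since two variants at the same position would receive the same label.

```python
from io import StringIO

import pandas as pd

# Two rows from the .bim preview: chromosome, variant ID, cM, bp, allele 1, allele 2
toy_bim_text = "9\t.\t0\t17579072\tA\tG\n9\t.\t0\t17579076\tA\tT\n"
toy_bim = pd.read_csv(StringIO(toy_bim_text), sep='\t', header=None,
                      names=['chr', 'rsID', 'cM', 'bp', 'A1', 'A2'],
                      dtype={'chr': str, 'bp': str})

# Build chr:bp identifiers from the chromosome and position columns
toy_bim['rsID'] = toy_bim['chr'].str.cat(toy_bim['bp'], sep=':')

# Count identifiers that occur more than once
n_dup = int(toy_bim['rsID'].duplicated().sum())
print(toy_bim['rsID'].tolist(), n_dup)
```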

In [56]:
%%bash
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head temp_AMPPD.fam
#Store the first column of IDs in a new file
awk '{ print $1 }' temp_AMPPD.fam > IDs.txt

HB-PD_INVAB109VHC	HB-PD_INVAB109VHC	0	0	0	-9
HB-PD_INVAB289LG3	HB-PD_INVAB289LG3	0	0	0	-9
HB-PD_INVAC488AAF	HB-PD_INVAC488AAF	0	0	0	-9
HB-PD_INVAE296YP8	HB-PD_INVAE296YP8	0	0	0	-9
HB-PD_INVAJ549VWD	HB-PD_INVAJ549VWD	0	0	0	-9
HB-PD_INVAK106DV5	HB-PD_INVAK106DV5	0	0	0	-9
HB-PD_INVAK293NGR	HB-PD_INVAK293NGR	0	0	0	-9
HB-PD_INVAM835KNR	HB-PD_INVAM835KNR	0	0	0	-9
HB-PD_INVAM864YC9	HB-PD_INVAM864YC9	0	0	0	-9
HB-PD_INVAN082DKK	HB-PD_INVAN082DKK	0	0	0	-9


In [57]:
sex = pd.read_csv(f'{WORK_DIR}/SEX.txt')
pheno = pd.read_csv(f'{WORK_DIR}/PHENO.txt')
ids = pd.read_csv(f'{WORK_DIR}/IDs.txt',names=['FID'])

In [58]:
sex_new = pd.merge(sex,ids,on="FID")
sex_new.to_csv(f'{WORK_DIR}/SEX_new.txt',index=False, sep="\t")

In [59]:
pheno_new = pd.merge(pheno,ids,on="FID")
pheno_new.to_csv(f'{WORK_DIR}/PHENO_new.txt',index=False, sep="\t")
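
The inner merges above silently drop any genotyped sample that has no row in the sex or phenotype table. One way to see which samples those are is a left merge with `indicator=True`; the sketch below uses made-up IDs and is only an illustration of that check, not part of the original workflow.

```python
import pandas as pd

# Made-up sample tables for illustration
sex_toy = pd.DataFrame({'FID': ['A', 'B', 'C'], 'IID': ['A', 'B', 'C'], 'SEX': [1, 2, 1]})
ids_toy = pd.DataFrame({'FID': ['A', 'B', 'D']})

# Keep every genotyped ID and record whether a matching sex row was found
checked = ids_toy.merge(sex_toy, on='FID', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'FID'].tolist()
print(missing)
```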

In [60]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

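# Update sex and attach the case/control phenotype
# (the header row of SEX_new.txt is probably what PLINK reports below as "1 ID not present")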
/home/jupyter/tools/plink \
--bfile temp_AMPPD \
--update-sex SEX_new.txt  \
--pheno PHENO_new.txt \
--make-bed \
--out pheno_SH3GL2

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2.log.
Options in effect:
  --bfile temp_AMPPD
  --make-bed
  --out pheno_SH3GL2
  --pheno PHENO_new.txt
  --update-sex SEX_new.txt

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (0 males, 0 females, 10418 ambiguous) loaded from .fam.
Ambiguous sex IDs written to pheno_SH3GL2.nosex .
7512 phenotype values present after --pheno.
--update-sex: 10418 people updated, 1 ID not present.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38

### Inspect the PLINK .bim and .fam files

In [61]:
%%bash

# Visualize bim file
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head pheno_SH3GL2.bim

9	9:17579072	0	17579072	A	G
9	9:17579076	0	17579076	A	T
9	9:17579077	0	17579077	G	T
9	9:17579086	0	17579086	A	G
9	9:17579091	0	17579091	G	C
9	9:17579098	0	17579098	T	C
9	9:17579102	0	17579102	G	A
9	9:17579107	0	17579107	T	C
9	9:17579112	0	17579112	T	C
9	9:17579119	0	17579119	C	G


In [62]:
%%bash

# Visualize fam file
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head pheno_SH3GL2.fam

HB-PD_INVAB109VHC HB-PD_INVAB109VHC 0 0 2 1
HB-PD_INVAB289LG3 HB-PD_INVAB289LG3 0 0 2 1
HB-PD_INVAC488AAF HB-PD_INVAC488AAF 0 0 1 1
HB-PD_INVAE296YP8 HB-PD_INVAE296YP8 0 0 2 1
HB-PD_INVAJ549VWD HB-PD_INVAJ549VWD 0 0 2 1
HB-PD_INVAK106DV5 HB-PD_INVAK106DV5 0 0 1 1
HB-PD_INVAK293NGR HB-PD_INVAK293NGR 0 0 1 1
HB-PD_INVAM835KNR HB-PD_INVAM835KNR 0 0 2 1
HB-PD_INVAM864YC9 HB-PD_INVAM864YC9 0 0 2 1
HB-PD_INVAN082DKK HB-PD_INVAN082DKK 0 0 2 1
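
The `.fam` preview can also be summarised in pandas. The sketch below parses three rows copied from the output above and counts the sex and phenotype codes; the column names are the conventional PLINK ones and are assigned here because `.fam` files have no header.

```python
from io import StringIO

import pandas as pd

# Three rows copied from the .fam preview above
fam_text = """HB-PD_INVAB109VHC HB-PD_INVAB109VHC 0 0 2 1
HB-PD_INVAB289LG3 HB-PD_INVAB289LG3 0 0 2 1
HB-PD_INVAC488AAF HB-PD_INVAC488AAF 0 0 1 1
"""
fam = pd.read_csv(StringIO(fam_text), sep=r'\s+', header=None,
                  names=['FID', 'IID', 'PAT', 'MAT', 'SEX', 'PHENO'])

# Translate PLINK codes into labels; unexpected codes fall into a catch-all group
sex_counts = fam['SEX'].map({1: 'male', 2: 'female'}).fillna('unknown').value_counts().to_dict()
pheno_counts = fam['PHENO'].map({1: 'control', 2: 'case'}).fillna('missing').value_counts().to_dict()
print(sex_counts, pheno_counts)
```

On the full file, pointing `read_csv` at `pheno_SH3GL2.fam` instead of the inline text should reproduce the male/female totals that PLINK prints in its log.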


In [63]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# Turn binary files into VCF
/home/jupyter/tools/plink \
--bfile pheno_SH3GL2 \
--recode vcf-fid \
--allow-no-sex \
--out pheno_SH3GL2

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --bfile pheno_SH3GL2
  --out pheno_SH3GL2
  --recode vcf-fid

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%

## Annotate SH3GL2 variants

In [64]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR
export PATH=$PATH:/home/jupyter/tools/rvtests/third/tabix-0.2.6/

### Bgzip and Tabix
# -f overwrites a pheno_SH3GL2.vcf.gz left over from an earlier run
bgzip -f pheno_SH3GL2.vcf

tabix -f -p vcf pheno_SH3GL2.vcf.gz


In [65]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

### Annotate variants using ANNOVAR: https://annovar.openbioinformatics.org/en/latest/ 
perl /home/jupyter/tools/annovar/table_annovar.pl pheno_SH3GL2.vcf.gz /home/jupyter/tools/annovar/humandb/ -buildver hg38 \
-out pheno_SH3GL2.annovar \
-remove -protocol refGene,clinvar_20140902 \
-operation g,f \
--nopolish \
-nastring . \
-vcfinput


NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 pheno_SH3GL2.vcf.gz > pheno_SH3GL2.annovar.avinput>
NOTICE: Finished reading 14599 lines from VCF file
NOTICE: A total of 14592 locus in VCF file passed QC threshold, representing 13709 SNPs (8549 transitions and 5160 transversions) and 883 indels/substitutions
NOTICE: Finished writing allele frequencies based on 142820362 SNP genotypes (89063482 transitions and 53756880 transversions) and 9199094 indels/substitutions for 10418 samples

NOTICE: Running with system command </home/jupyter/tools/annovar/table_annovar.pl pheno_SH3GL2.annovar.avinput /home/jupyter/tools/annovar/humandb/ -buildver hg38 -outfile pheno_SH3GL2.annovar -remove -protocol refGene,clinvar_20140902 -operation g,f --nopolish -nastring . -otherinfo>
-----------------------------------------------------------------
NOTICE: Processing operation=g protocol=refGene

NOTICE: Running with system command <annotate_variati

## Extract coding and non-syn variants

In [5]:
# Load the ANNOVAR multianno file
SH3GL2 = pd.read_csv(f'{WORK_DIR}/pheno_SH3GL2.annovar.hg38_multianno.txt', sep = '\t')

In [6]:
SH3GL2['Chr_bp'] = SH3GL2['Chr'].astype(str) + ':' + SH3GL2['Start'].astype(str) + ':' + SH3GL2['End'].astype(str)

In [7]:
SH3GL2

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,Otherinfo10422,Otherinfo10423,Otherinfo10424,Otherinfo10425,Otherinfo10426,Otherinfo10427,Otherinfo10428,Otherinfo10429,Otherinfo10430,Chr_bp
0,9,17579072,17579072,G,A,UTR5,SH3GL2,NM_003026:c.-171G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579072:17579072
1,9,17579076,17579076,T,A,UTR5,SH3GL2,NM_003026:c.-167T>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579076:17579076
2,9,17579077,17579077,T,G,UTR5,SH3GL2,NM_003026:c.-166T>G,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579077:17579077
3,9,17579086,17579086,G,A,UTR5,SH3GL2,NM_003026:c.-157G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579086:17579086
4,9,17579091,17579091,C,G,UTR5,SH3GL2,NM_003026:c.-152C>G,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579091:17579091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14587,9,17797040,17797040,G,A,UTR3,SH3GL2,NM_003026:c.*1297G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797040:17797040
14588,9,17797052,17797052,T,A,UTR3,SH3GL2,NM_003026:c.*1309T>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797052:17797052
14589,9,17797079,17797079,C,T,UTR3,SH3GL2,NM_003026:c.*1336C>T,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797079:17797079
14590,9,17797093,17797093,G,A,UTR3,SH3GL2,NM_003026:c.*1350G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797093:17797093


In [8]:
SH3GL2 = SH3GL2.drop_duplicates(subset=['Chr_bp'])

In [9]:
SH3GL2

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,Otherinfo10422,Otherinfo10423,Otherinfo10424,Otherinfo10425,Otherinfo10426,Otherinfo10427,Otherinfo10428,Otherinfo10429,Otherinfo10430,Chr_bp
0,9,17579072,17579072,G,A,UTR5,SH3GL2,NM_003026:c.-171G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579072:17579072
1,9,17579076,17579076,T,A,UTR5,SH3GL2,NM_003026:c.-167T>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579076:17579076
2,9,17579077,17579077,T,G,UTR5,SH3GL2,NM_003026:c.-166T>G,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579077:17579077
3,9,17579086,17579086,G,A,UTR5,SH3GL2,NM_003026:c.-157G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579086:17579086
4,9,17579091,17579091,C,G,UTR5,SH3GL2,NM_003026:c.-152C>G,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17579091:17579091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14587,9,17797040,17797040,G,A,UTR3,SH3GL2,NM_003026:c.*1297G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797040:17797040
14588,9,17797052,17797052,T,A,UTR3,SH3GL2,NM_003026:c.*1309T>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797052:17797052
14589,9,17797079,17797079,C,T,UTR3,SH3GL2,NM_003026:c.*1336C>T,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797079:17797079
14590,9,17797093,17797093,G,A,UTR3,SH3GL2,NM_003026:c.*1350G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17797093:17797093


In [10]:
# Keep all variants annotated to SH3GL2
all_sh3gl2 = SH3GL2[SH3GL2['Gene.refGene'] == 'SH3GL2']
all_sh3gl2.count()

Chr               14590
Start             14590
End               14590
Ref               14590
Alt               14590
                  ...  
Otherinfo10427    14590
Otherinfo10428    14590
Otherinfo10429    14590
Otherinfo10430    14590
Chr_bp            14590
Length: 10442, dtype: int64

In [11]:
all_sh3gl2["Func.refGene"].value_counts()

intronic    14425
UTR3           81
exonic         47
UTR5           37
Name: Func.refGene, dtype: int64

In [12]:
# Filter exonic variants
coding = SH3GL2[SH3GL2['Func.refGene'] == 'exonic']
coding.count()

Chr               47
Start             47
End               47
Ref               47
Alt               47
                  ..
Otherinfo10427    47
Otherinfo10428    47
Otherinfo10429    47
Otherinfo10430    47
Chr_bp            47
Length: 10442, dtype: int64

In [13]:
coding["ExonicFunc.refGene"].value_counts()

synonymous SNV       23
nonsynonymous SNV    22
stopgain              2
Name: ExonicFunc.refGene, dtype: int64

In [14]:
# Filter exonic and stopgain
coding_stopgain = SH3GL2[(SH3GL2['Func.refGene'] == 'exonic') & (SH3GL2['ExonicFunc.refGene'] == 'stopgain')]

In [15]:
# Filter exonic and synonymous
coding_synonymous = SH3GL2[(SH3GL2['Func.refGene'] == 'exonic') & (SH3GL2['ExonicFunc.refGene'] == 'synonymous SNV')]

In [16]:
# Filter exonic and non-syn 
coding_nonsynonymous = SH3GL2[(SH3GL2['Func.refGene'] == 'exonic') & (SH3GL2['ExonicFunc.refGene'] == 'nonsynonymous SNV')]

In [17]:
coding_nonsynonymous.to_csv(f'{WORK_DIR}/coding_nonsynonymous.txt', sep = '\t', index = False)
coding_nonsynonymous

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,Otherinfo10422,Otherinfo10423,Otherinfo10424,Otherinfo10425,Otherinfo10426,Otherinfo10427,Otherinfo10428,Otherinfo10429,Otherinfo10430,Chr_bp
11378,9,17747120,17747120,A,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A100C:p.K34Q,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17747120:17747120
11379,9,17747126,17747126,A,T,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A106T:p.M36L,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17747126:17747126
13886,9,17786478,17786478,G,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon4:c.G285C:p.E95D,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17786478:17786478
13887,9,17786494,17786494,G,A,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon4:c.G301A:p.G101R,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17786494:17786494
13964,9,17787437,17787437,C,T,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon5:c.C389T:p.S130F,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17787437:17787437
13966,9,17787457,17787457,C,G,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon5:c.C409G:p.Q137E,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17787457:17787457
13967,9,17787498,17787498,T,A,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon5:c.T450A:p.D150E,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17787498:17787498
14101,9,17789477,17789477,G,A,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon6:c.G551A:p.R184H,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17789477:17789477
14231,9,17791290,17791290,G,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon7:c.G684C:p.Q228H,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17791290:17791290
14233,9,17791300,17791300,A,G,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon7:c.A694G:p.I232V,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,9:17791300:17791300
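
The `AAChange.refGene` strings in the table above pack gene, transcript, exon, cDNA change and protein change into one colon-separated field. A minimal sketch for splitting them follows; `parse_aachange` and the regular expression are hypothetical helpers written for this illustration, and they assume the single-letter amino-acid notation seen in this output.

```python
import re

# gene:transcript:exonN:c.<cDNA change>:p.<ref><position><alt>
AA_PATTERN = re.compile(
    r'^(?P<gene>[^:]+):(?P<transcript>[^:]+):(?P<exon>exon\d+):'
    r'c\.(?P<cdna>[^:]+):p\.(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z*])$')

def parse_aachange(entry):
    """Split one AAChange.refGene entry into its parts (hypothetical helper)."""
    # ANNOVAR can list several transcripts separated by commas; use the first one
    first = entry.split(',')[0]
    match = AA_PATTERN.match(first)
    return match.groupdict() if match else None

parsed = parse_aachange('SH3GL2:NM_003026:exon2:c.A100C:p.K34Q')
print(parsed)
```

Entries that do not follow this layout, including the `.` placeholder used for non-exonic variants, return `None` rather than raising an error.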


In [18]:
# Filter intronic variants
intronic = SH3GL2[SH3GL2['Func.refGene'] == 'intronic']

In [19]:
# Filter UTR3 variants
UTR3 = SH3GL2[SH3GL2['Func.refGene'] == 'UTR3']

In [20]:
# Filter UTR5 variants
UTR5 = SH3GL2[SH3GL2['Func.refGene'] == 'UTR5']
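
The filters above all follow the same pattern: match `Func.refGene`, and optionally `ExonicFunc.refGene`. As a sketch, they could be wrapped in one function; `filter_annotations` is a hypothetical helper and the rows below are made up to mimic the two ANNOVAR columns.

```python
import pandas as pd

# Made-up annotation rows that mimic the two ANNOVAR columns used above
toy_anno = pd.DataFrame({
    'Func.refGene': ['exonic', 'exonic', 'intronic', 'UTR3'],
    'ExonicFunc.refGene': ['nonsynonymous SNV', 'synonymous SNV', '.', '.'],
})

def filter_annotations(df, func=None, exonic_func=None):
    """Return rows matching the requested annotation values (hypothetical helper)."""
    mask = pd.Series(True, index=df.index)
    if func is not None:
        mask &= df['Func.refGene'] == func
    if exonic_func is not None:
        mask &= df['ExonicFunc.refGene'] == exonic_func
    return df[mask]

n_exonic = len(filter_annotations(toy_anno, func='exonic'))
n_nonsyn = len(filter_annotations(toy_anno, func='exonic', exonic_func='nonsynonymous SNV'))
print(n_exonic, n_nonsyn)
```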

## Calculate frequency of coding non-synonymous variants in cases versus controls

In [21]:
reduced_coding_nonsynonymous = coding_nonsynonymous[["Chr", "Start", "End", "Gene.refGene"]]
reduced_coding_nonsynonymous.to_csv(f'{WORK_DIR}/reduced_coding_nonsynonymous.txt', sep = '\t', index = False, header= False)

In [22]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head reduced_coding_nonsynonymous.txt

9	17747120	17747120	SH3GL2
9	17747126	17747126	SH3GL2
9	17786478	17786478	SH3GL2
9	17786494	17786494	SH3GL2
9	17787437	17787437	SH3GL2
9	17787457	17787457	SH3GL2
9	17787498	17787498	SH3GL2
9	17789477	17789477	SH3GL2
9	17791290	17791290	SH3GL2
9	17791300	17791300	SH3GL2
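
The file previewed above has four tab-separated columns with no header: chromosome, start position, end position and a label. That is the layout PLINK's `--extract range` expects. The sketch below rebuilds two of those lines from a small DataFrame and checks the layout in memory; the sanity check on start and end positions is an addition for illustration.

```python
import pandas as pd

# Two variants taken from the preview above
variants = pd.DataFrame({
    'Chr': [9, 9],
    'Start': [17747120, 17747126],
    'End': [17747120, 17747126],
    'Gene.refGene': ['SH3GL2', 'SH3GL2'],
})

# A range must not end before it starts
ranges_ok = bool((variants['Start'] <= variants['End']).all())

# Tab-separated, no header, no index: chromosome, start, end, label
range_lines = variants.to_csv(sep='\t', index=False, header=False).splitlines()
print(ranges_ok, range_lines)
```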


In [30]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# calculate freq for ALL variants

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --assoc --out assoc_pheno_SH3GL2 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to assoc_pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile pheno_SH3GL2
  --out assoc_pheno_SH3GL2

14998 MB RAM detected; reserving 7499 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%

In [31]:
# Load and display the association results for ALL variants
all_SH3GL2_freq = pd.read_csv(f'{WORK_DIR}/assoc_pheno_SH3GL2.assoc', sep=r'\s+')
all_SH3GL2_freq

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR
0,9,9:17579072,17579072,A,0.000000,0.000120,G,0.8089,0.3685,0.000
1,9,9:17579076,17579076,A,0.000149,0.000000,T,1.2360,0.2662,
2,9,9:17579077,17579077,G,0.000000,0.000120,T,0.8089,0.3685,0.000
3,9,9:17579086,17579086,A,0.000298,0.000120,G,0.5849,0.4444,2.473
4,9,9:17579091,17579091,G,0.000000,0.000120,C,0.8089,0.3685,0.000
...,...,...,...,...,...,...,...,...,...,...
14587,9,9:17797040,17797040,A,0.000000,0.000120,G,0.8089,0.3685,0.000
14588,9,9:17797052,17797052,A,0.000000,0.000000,T,,,
14589,9,9:17797079,17797079,T,0.000149,0.000000,C,1.2360,0.2662,
14590,9,9:17797093,17797093,A,0.000149,0.000482,G,1.2360,0.2663,0.309
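
The `F_A`, `F_U`, `CHISQ`, `P` and `OR` columns can be reproduced by hand from a 2x2 table of allele counts. The sketch below does this for 9:17579086, assuming 3,359 cases and 4,153 controls (the denominators used later in this notebook) with no missing genotypes, so that the reported frequencies correspond to 2 of 6,718 case alleles and 1 of 8,306 control alleles. These counts are back-calculated assumptions, and `allelic_test` is a hypothetical helper rather than PLINK's own implementation.

```python
import math


def allelic_test(case_a1, case_total, ctrl_a1, ctrl_total):
    """Basic allelic chi-square test (1 df) on a 2x2 table of allele counts."""
    a, b = case_a1, case_total - case_a1   # cases: A1 alleles, A2 alleles
    c, d = ctrl_a1, ctrl_total - ctrl_a1   # controls: A1 alleles, A2 alleles
    n = a + b + c + d
    chisq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chisq / 2))
    # The odds ratio is undefined when its denominator (b * c) is zero
    odds_ratio = (a * d) / (b * c) if b * c else float("nan")
    return {
        "F_A": a / case_total,
        "F_U": c / ctrl_total,
        "CHISQ": chisq,
        "P": p_value,
        "OR": odds_ratio,
    }


# Assumed allele totals: 2 * 3359 in cases, 2 * 4153 in controls
result = allelic_test(case_a1=2, case_total=2 * 3359, ctrl_a1=1, ctrl_total=2 * 4153)
print({key: round(value, 4) for key, value in result.items()})
```

Under these assumptions the result agrees with the row reported above for 9:17579086 (CHISQ 0.5849, P 0.4444, OR 2.473). It also shows why `OR` is empty when the allele is absent in controls: the denominator of the odds ratio is zero.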


In [32]:
# Keep ALL variants with nominal P <= 0.05 in the control (HC) vs. case (PD) comparison
all_SH3GL2_freq_p_0_05 = all_SH3GL2_freq[all_SH3GL2_freq['P'] <= 0.05]
all_SH3GL2_freq_p_0_05

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR
77,9,9:17579692,17579692,T,0.336400,0.370800,G,19.190,0.000012,0.8602
87,9,9:17579765,17579765,G,0.293500,0.324500,C,16.590,0.000046,0.8651
109,9,9:17580008,17580008,C,0.000893,0.000000,T,7.421,0.006446,
148,9,9:17580618,17580618,T,0.328400,0.360600,A,17.030,0.000037,0.8670
166,9,9:17581041,17581041,A,0.326400,0.311200,G,3.965,0.046470,1.0730
...,...,...,...,...,...,...,...,...,...,...
14018,9,9:17788305,17788305,T,0.000893,0.002288,C,4.347,0.037080,0.3899
14040,9,9:17788598,17788598,G,0.000893,0.002288,C,4.347,0.037080,0.3899
14139,9,9:17789907,17789907,A,0.000893,0.000120,G,4.762,0.029090,7.4240
14336,9,9:17792820,17792820,C,0.001191,0.000241,T,5.040,0.024770,4.9500


In [33]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --extract range reduced_coding_nonsynonymous.txt --assoc --out coding_nonsynonymous_pheno_SH3GL2 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile pheno_SH3GL2
  --extract range reduced_coding_nonsynonymous.txt
  --out coding_nonsynonymous_pheno_SH3GL2

14998 MB RAM detected; reserving 7499 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
--extract range: 14570 variants excluded.
--extract range: 22 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%

In [34]:
# Load the association results for the coding/non-synonymous variants
SH3GL2_freq = pd.read_csv(f'{WORK_DIR}/coding_nonsynonymous_pheno_SH3GL2.assoc', sep=r'\s+')
SH3GL2_freq

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR
0,9,9:17747120,17747120,C,0.0,0.0,A,,,
1,9,9:17747126,17747126,T,0.000149,0.0,A,1.236,0.2662,
2,9,9:17786478,17786478,C,0.0,0.00012,G,0.8089,0.3685,0.0
3,9,9:17786494,17786494,A,0.0,0.00012,G,0.8089,0.3685,0.0
4,9,9:17787437,17787437,T,0.0,0.0,C,,,
5,9,9:17787457,17787457,G,0.0,0.00012,C,0.8089,0.3685,0.0
6,9,9:17787498,17787498,A,0.0,0.00012,T,0.8089,0.3685,0.0
7,9,9:17789477,17789477,A,0.0,0.0,G,,,
8,9,9:17791290,17791290,C,0.0,0.0,G,,,
9,9,9:17791300,17791300,G,0.000149,0.0,A,1.236,0.2662,


In [35]:
# Keep non-synonymous variants with nominal P <= 0.05 in the control (HC) vs. case (PD) comparison
nonsyn_SH3GL2_freq_p_0_05 = SH3GL2_freq[SH3GL2_freq['P'] <= 0.05]
nonsyn_SH3GL2_freq_p_0_05

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR


In [36]:
reduced_intronic = intronic[["Chr", "Start", "End", "Gene.refGene"]]
reduced_intronic.to_csv(f'{WORK_DIR}/reduced_intronic.txt', sep = '\t', index = False, header= False)

In [37]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --extract range reduced_intronic.txt --recode A --out intronic_pheno_SH3GL2 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to intronic_pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --bfile pheno_SH3GL2
  --extract range reduced_intronic.txt
  --out intronic_pheno_SH3GL2
  --recode A

14998 MB RAM detected; reserving 7499 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
--extract range: 751 variants excluded.
--extract range: 13841 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%4

In [38]:
# Load the additive (0/1/2) genotype coding for the intronic variants
SH3GL2_intronic = pd.read_csv(f'{WORK_DIR}/intronic_pheno_SH3GL2.raw', sep=r'\s+')
SH3GL2_intronic.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17579314_A,9:17579333_C,9:17579344_G,9:17579348_T,...,9:17795344_T,9:17795347_G,9:17795381_C,9:17795467_T,9:17795472_A,9:17795484_A,9:17795497_G,9:17795505_G,9:17795527_T,9:17795533_G
0,HB-PD_INVAB109VHC,HB-PD_INVAB109VHC,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HB-PD_INVAB289LG3,HB-PD_INVAB289LG3,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HB-PD_INVAC488AAF,HB-PD_INVAC488AAF,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HB-PD_INVAE296YP8,HB-PD_INVAE296YP8,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HB-PD_INVAJ549VWD,HB-PD_INVAJ549VWD,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
# Explore homozygotes for 9:17579692_T in cases
SH3GL2_hom_cases_nalls_rs13294100 = SH3GL2_intronic[(SH3GL2_intronic['9:17579692_T'] == 2) & (SH3GL2_intronic['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_nalls_rs13294100.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17579314_A,9:17579333_C,9:17579344_G,9:17579348_T,...,9:17795344_T,9:17795347_G,9:17795381_C,9:17795467_T,9:17795472_A,9:17795484_A,9:17795497_G,9:17795505_G,9:17795527_T,9:17795533_G
4899,LC-100008,LC-100008,0,0,2,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4926,LC-1080006,LC-1080006,0,0,2,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4935,LC-1120006,LC-1120006,0,0,1,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4973,LC-1370006,LC-1370006,0,0,2,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5055,LC-230005,LC-230005,0,0,2,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
SH3GL2_hom_cases_nalls_rs13294100.shape

(393, 13847)

In [41]:
'Calculate frequency against the total N of cases: {:f}'.format(393/3359)

'Calculate frequency against the total N of cases: 0.116999'

In [42]:
# Explore homozygotes for 9:17579692_T in controls
SH3GL2_hom_controls_nalls_rs13294100 = SH3GL2_intronic[(SH3GL2_intronic['9:17579692_T'] == 2) & (SH3GL2_intronic['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_nalls_rs13294100.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17579314_A,9:17579333_C,9:17579344_G,9:17579348_T,...,9:17795344_T,9:17795347_G,9:17795381_C,9:17795467_T,9:17795472_A,9:17795484_A,9:17795497_G,9:17795505_G,9:17795527_T,9:17795533_G
11,HB-PD_INVAR650VWG,HB-PD_INVAR650VWG,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,HB-PD_INVBD604UJD,HB-PD_INVBD604UJD,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20,HB-PD_INVBF956HZC,HB-PD_INVBF956HZC,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23,HB-PD_INVBL268JTC,HB-PD_INVBL268JTC,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33,HB-PD_INVBX181DJ7,HB-PD_INVBX181DJ7,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
SH3GL2_hom_controls_nalls_rs13294100.shape

(581, 13847)

In [44]:
'Calculate frequency against the total N of controls: {:f}'.format(581/4153)

'Calculate frequency against the total N of controls: 0.139899'
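
The two frequencies above rely on hard-coded denominators (3,359 cases and 4,153 controls). As a hedged alternative, the sketch below derives both the homozygote count and the denominator from the genotype records themselves, so that samples with a missing genotype are left out of the denominator. `homozygote_frequency` and the toy records are illustrative and are not taken from the AMP PD data. The coding (phenotype 1 = control, 2 = case, -9 = unknown; dosage 0/1/2 = copies of the counted allele) follows the PLINK `.raw` output shown in this notebook.

```python
from collections import Counter


def homozygote_frequency(records):
    """Return {phenotype: (n_homozygous, n_genotyped, frequency)} from (phenotype, dosage) pairs."""
    genotyped = Counter()
    homozygous = Counter()
    for phenotype, dosage in records:
        if dosage is None:  # missing genotype: exclude from the denominator
            continue
        genotyped[phenotype] += 1
        if dosage == 2:
            homozygous[phenotype] += 1
    return {
        phenotype: (homozygous[phenotype], n, homozygous[phenotype] / n)
        for phenotype, n in genotyped.items()
    }


# Toy (phenotype, dosage) records, made up for illustration
records = [(2, 2), (2, 1), (2, 0), (2, None), (1, 2), (1, 0), (1, 0), (1, 1), (-9, 2)]
summary = homozygote_frequency(records)
for phenotype, (n_hom, n_total, freq) in sorted(summary.items()):
    print(f"phenotype {phenotype}: {n_hom}/{n_total} homozygous ({freq:f})")
```

With a pandas DataFrame, the same records could be built from the `PHENOTYPE` column and the variant column, mapping `NaN` dosages to `None`.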

In [45]:
# Homozygotes with unknown phenotype (-9)
SH3GL2_hom_U_nalls_rs13294100 = SH3GL2_intronic[(SH3GL2_intronic['9:17579692_T'] == 2) & (SH3GL2_intronic['PHENOTYPE'] == -9)]
SH3GL2_hom_U_nalls_rs13294100.shape

(355, 13847)

In [46]:
# Explore homozygotes for 9:17727067_G in cases
SH3GL2_hom_cases_nalls_rs10756907 = SH3GL2_intronic[(SH3GL2_intronic['9:17727067_G'] == 2) & (SH3GL2_intronic['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_nalls_rs10756907.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17579314_A,9:17579333_C,9:17579344_G,9:17579348_T,...,9:17795344_T,9:17795347_G,9:17795381_C,9:17795467_T,9:17795472_A,9:17795484_A,9:17795497_G,9:17795505_G,9:17795527_T,9:17795533_G
4949,LC-120008,LC-120008,0,0,2,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4962,LC-130004,LC-130004,0,0,2,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4967,LC-1310006,LC-1310006,0,0,2,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5010,LC-1620001,LC-1620001,0,0,1,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5033,LC-1970001,LC-1970001,0,0,1,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
SH3GL2_hom_cases_nalls_rs10756907.shape

(190, 13847)

In [48]:
'Calculate frequency against the total N of cases: {:f}'.format(190/3359)

'Calculate frequency against the total N of cases: 0.056564'

In [49]:
# Explore homozygotes for 9:17727067_G in controls
SH3GL2_hom_controls_nalls_rs10756907 = SH3GL2_intronic[(SH3GL2_intronic['9:17727067_G'] == 2) & (SH3GL2_intronic['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_nalls_rs10756907.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17579314_A,9:17579333_C,9:17579344_G,9:17579348_T,...,9:17795344_T,9:17795347_G,9:17795381_C,9:17795467_T,9:17795472_A,9:17795484_A,9:17795497_G,9:17795505_G,9:17795527_T,9:17795533_G
29,HB-PD_INVBT337PCG,HB-PD_INVBT337PCG,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45,HB-PD_INVCR113DZ8,HB-PD_INVCR113DZ8,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
78,HB-PD_INVEW090MBW,HB-PD_INVEW090MBW,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
171,HB-PD_INVNK524UDW,HB-PD_INVNK524UDW,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
174,HB-PD_INVNP408GYU,HB-PD_INVNP408GYU,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
SH3GL2_hom_controls_nalls_rs10756907.shape

(190, 13847)

In [51]:
'Calculate frequency against the total N of controls: {:f}'.format(190/4153)

'Calculate frequency against the total N of controls: 0.045750'

In [52]:
# Homozygotes with unknown phenotype (-9)
SH3GL2_hom_U_nalls_rs10756907 = SH3GL2_intronic[(SH3GL2_intronic['9:17727067_G'] == 2) & (SH3GL2_intronic['PHENOTYPE'] == -9)]
SH3GL2_hom_U_nalls_rs10756907.shape

(170, 13847)

In [53]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --extract range reduced_coding_nonsynonymous.txt --recode A --out coding_nonsynonymous_pheno_SH3GL2 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --bfile pheno_SH3GL2
  --extract range reduced_coding_nonsynonymous.txt
  --out coding_nonsynonymous_pheno_SH3GL2
  --recode A

14998 MB RAM detected; reserving 7499 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
--extract range: 14570 variants excluded.
--extract range: 22 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34

In [54]:
# Load the additive (0/1/2) genotype coding for the coding/non-synonymous variants
SH3GL2_recode = pd.read_csv(f'{WORK_DIR}/coding_nonsynonymous_pheno_SH3GL2.raw', sep=r'\s+')
SH3GL2_recode.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17747120_C,9:17747126_T,9:17786478_C,9:17786494_A,...,9:17793459_G,9:17793465_T,9:17793468_C,9:17793471_T,9:17795552_G,9:17795571_A,9:17795582_A,9:17795598_G,9:17795649_G,9:17795705_G
0,HB-PD_INVAB109VHC,HB-PD_INVAB109VHC,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HB-PD_INVAB289LG3,HB-PD_INVAB289LG3,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HB-PD_INVAC488AAF,HB-PD_INVAC488AAF,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HB-PD_INVAE296YP8,HB-PD_INVAE296YP8,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HB-PD_INVAJ549VWD,HB-PD_INVAJ549VWD,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:
# Explore homozygotes for 9:17793465_T in cases
SH3GL2_hom_cases_G276V = SH3GL2_recode[(SH3GL2_recode['9:17793465_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_G276V.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17747120_C,9:17747126_T,9:17786478_C,9:17786494_A,...,9:17793459_G,9:17793465_T,9:17793468_C,9:17793471_T,9:17795552_G,9:17795571_A,9:17795582_A,9:17795598_G,9:17795649_G,9:17795705_G


In [56]:
SH3GL2_hom_cases_G276V.shape

(0, 28)

In [57]:
'Calculate frequency against the total N of cases: {:f}'.format(0/3359)

'Calculate frequency against the total N of cases: 0.000000'

In [58]:
# Explore homozygotes for 9:17793465_T in controls
SH3GL2_hom_controls_G276V = SH3GL2_recode[(SH3GL2_recode['9:17793465_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_G276V.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17747120_C,9:17747126_T,9:17786478_C,9:17786494_A,...,9:17793459_G,9:17793465_T,9:17793468_C,9:17793471_T,9:17795552_G,9:17795571_A,9:17795582_A,9:17795598_G,9:17795649_G,9:17795705_G


In [59]:
SH3GL2_hom_controls_G276V.shape

(0, 28)

In [60]:
'Calculate frequency against the total N of controls: {:f}'.format(0/4153)

'Calculate frequency against the total N of controls: 0.000000'

In [61]:
# Homozygotes with unknown phenotype (-9)
SH3GL2_hom_U_G276V = SH3GL2_recode[(SH3GL2_recode['9:17793465_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == -9)]
SH3GL2_hom_U_G276V.shape

(1, 28)

## Pearson correlation and conditional analyses of rs13294100 (Nalls et al. 2019) and rs150543523 (AMP PD coding variant)

In [62]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --chr 9 --from-bp 17579692 --to-bp 17579692 --recode A --out pheno_SH3GL2_rs13294100_nalls --double-id

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2_rs13294100_nalls.log.
Options in effect:
  --bfile pheno_SH3GL2
  --chr 9
  --double-id
  --from-bp 17579692
  --out pheno_SH3GL2_rs13294100_nalls
  --recode A
  --to-bp 17579692

14998 MB RAM detected; reserving 7499 MB for main workspace.
1 out of 14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%

In [63]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --chr 9 --from-bp 17793465 --to-bp 17793465 --recode A --out pheno_SH3GL2_rs150543523_G276V --double-id

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2_rs150543523_G276V.log.
Options in effect:
  --bfile pheno_SH3GL2
  --chr 9
  --double-id
  --from-bp 17793465
  --out pheno_SH3GL2_rs150543523_G276V
  --recode A
  --to-bp 17793465

14998 MB RAM detected; reserving 7499 MB for main workspace.
1 out of 14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%4

In [64]:
rs13294100 = pd.read_csv("/home/jupyter/SH3GL2/pheno_SH3GL2_rs13294100_nalls.raw",sep=" ")
rs150543523 = pd.read_csv("/home/jupyter/SH3GL2/pheno_SH3GL2_rs150543523_G276V.raw",sep=" ")

In [65]:
rs13294100

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17579692_T
0,HB-PD_INVAB109VHC,HB-PD_INVAB109VHC,0,0,2,1,0
1,HB-PD_INVAB289LG3,HB-PD_INVAB289LG3,0,0,2,1,1
2,HB-PD_INVAC488AAF,HB-PD_INVAC488AAF,0,0,1,1,0
3,HB-PD_INVAE296YP8,HB-PD_INVAE296YP8,0,0,2,1,0
4,HB-PD_INVAJ549VWD,HB-PD_INVAJ549VWD,0,0,2,1,0
...,...,...,...,...,...,...,...
10413,SY-PDZV652BE6,SY-PDZV652BE6,0,0,1,2,0
10414,SY-PDZX554VNB,SY-PDZX554VNB,0,0,1,2,1
10415,SY-PDZX943HWN,SY-PDZX943HWN,0,0,1,2,0
10416,SY-PDZY968RFA,SY-PDZY968RFA,0,0,1,2,1


In [66]:
rs150543523

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17793465_T
0,HB-PD_INVAB109VHC,HB-PD_INVAB109VHC,0,0,2,1,0
1,HB-PD_INVAB289LG3,HB-PD_INVAB289LG3,0,0,2,1,0
2,HB-PD_INVAC488AAF,HB-PD_INVAC488AAF,0,0,1,1,0
3,HB-PD_INVAE296YP8,HB-PD_INVAE296YP8,0,0,2,1,0
4,HB-PD_INVAJ549VWD,HB-PD_INVAJ549VWD,0,0,2,1,0
...,...,...,...,...,...,...,...
10413,SY-PDZV652BE6,SY-PDZV652BE6,0,0,1,2,0
10414,SY-PDZX554VNB,SY-PDZX554VNB,0,0,1,2,0
10415,SY-PDZX943HWN,SY-PDZX943HWN,0,0,1,2,0
10416,SY-PDZY968RFA,SY-PDZY968RFA,0,0,1,2,0


In [67]:
total = pd.merge(rs13294100, rs150543523, on="FID")

In [68]:
total = total[(total['PHENOTYPE_x']==1) | (total['PHENOTYPE_x']==2)]

In [69]:
total = total[["FID",'PHENOTYPE_x',"9:17579692_T","9:17793465_T"]]

In [70]:
total.columns = ["FID",'PHENOTYPE',"Nalls_9_17579692_T","G276V_9_17793465_T"]
total

Unnamed: 0,FID,PHENOTYPE,Nalls_9_17579692_T,G276V_9_17793465_T
0,HB-PD_INVAB109VHC,1,0,0
1,HB-PD_INVAB289LG3,1,1,0
2,HB-PD_INVAC488AAF,1,0,0
3,HB-PD_INVAE296YP8,1,0,0
4,HB-PD_INVAJ549VWD,1,0,0
...,...,...,...,...
10413,SY-PDZV652BE6,2,0,0
10414,SY-PDZX554VNB,2,1,0
10415,SY-PDZX943HWN,2,0,0
10416,SY-PDZY968RFA,2,1,0


In [71]:
total.to_csv(f'{WORKSPACE_BUCKET}/AMPPD_WGS_Nalls_G276V_rs13294100.csv')

In [72]:
%load_ext rpy2.ipython

In [73]:
%%R

BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
WORKSPACE_NAMESPACE <- Sys.getenv('WORKSPACE_NAMESPACE')
WORKSPACE_NAME <- Sys.getenv('WORKSPACE_NAME')
WORKSPACE_BUCKET <- Sys.getenv('WORKSPACE_BUCKET')

In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :
  libraries ‘/home/jupyter/packages’, ‘/usr/lib/R/site-library’ contain no packages


In [74]:
%%R

gcs_read_file <- function(path) {
    pipe(str_glue('gsutil -u {BILLING_PROJECT_ID} cat {path}'))
}

In [75]:
%%R

shell_do <- function(command) {
    print(paste('Executing: ', command))
    system(command, intern = TRUE)
}

In [76]:
%%R
library(stringr)

totalPath <- file.path({WORKSPACE_BUCKET}, 'AMPPD_WGS_Nalls_G276V_rs13294100.csv')
tmp <- gcs_read_file(totalPath)
tmp
total <- read.csv(tmp,header=TRUE)
head(total)

  X               FID PHENOTYPE Nalls_9_17579692_T G276V_9_17793465_T
1 0 HB-PD_INVAB109VHC         1                  0                  0
2 1 HB-PD_INVAB289LG3         1                  1                  0
3 2 HB-PD_INVAC488AAF         1                  0                  0
4 3 HB-PD_INVAE296YP8         1                  0                  0
5 4 HB-PD_INVAJ549VWD         1                  0                  0
6 5 HB-PD_INVAK106DV5         1                  1                  0


In [77]:
%%R

# Linear regression of one genotype dosage on the other, used here as a correlation check
correlation.model <- lm(Nalls_9_17579692_T ~ G276V_9_17793465_T, data = total)
summary(correlation.model)


Call:
lm(formula = Nalls_9_17579692_T ~ G276V_9_17793465_T, data = total)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9535 -0.7095  0.2905  0.2905  1.2905 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.709466   0.007887  89.951   <2e-16 ***
G276V_9_17793465_T 0.244023   0.104249   2.341   0.0193 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6816 on 7510 degrees of freedom
Multiple R-squared:  0.0007291,	Adjusted R-squared:  0.000596 
F-statistic: 5.479 on 1 and 7510 DF,  p-value: 0.01927
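
The heading refers to a Pearson correlation, while the cell above fits a simple linear regression. For a single predictor the two are linked: the regression's multiple R-squared equals the squared Pearson correlation, so the reported R-squared of 0.0007291 corresponds to |r| of roughly 0.027. The sketch below checks this identity on made-up dosage vectors. The function names are hypothetical, and pure Python is used here in place of R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_r_squared(x, y):
    """R-squared of the least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up allele dosages (0/1/2) for two variants in eight samples
rare_variant = [0, 0, 0, 1, 0, 0, 1, 0]
common_variant = [0, 1, 2, 2, 1, 0, 1, 1]

r = pearson_r(rare_variant, common_variant)
r2 = ols_r_squared(rare_variant, common_variant)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, regression R^2 = {r2:.4f}")
```

The sign of the correlation is not recoverable from R-squared alone; it follows the sign of the regression slope.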



In [78]:
%%R
# Conditional model: phenotype on both variants (no family given, so glm() defaults to Gaussian)
dep_corr <- glm(PHENOTYPE ~ Nalls_9_17579692_T + G276V_9_17793465_T, data=total)
summary(dep_corr)


Call:
glm(formula = PHENOTYPE ~ Nalls_9_17579692_T + G276V_9_17793465_T, 
    data = total)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5467  -0.4361  -0.3993   0.5271   0.6007  

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.472899   0.008283 177.830  < 2e-16 ***
Nalls_9_17579692_T -0.036815   0.008407  -4.379 1.21e-05 ***
G276V_9_17793465_T  0.073832   0.075982   0.972    0.331    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.246652)

    Null deviance: 1857.0  on 7511  degrees of freedom
Residual deviance: 1852.1  on 7509  degrees of freedom
AIC: 10808

Number of Fisher Scoring iterations: 2



In [79]:
%%R
summary(dep_corr)$coefficients

                      Estimate  Std. Error     t value     Pr(>|t|)
(Intercept)         1.47289902 0.008282644 177.8295677 0.000000e+00
Nalls_9_17579692_T -0.03681502 0.008407444  -4.3788602 1.209197e-05
G276V_9_17793465_T  0.07383158 0.075982402   0.9716931 3.312345e-01


In [80]:
%%R
# Exponentiate the G276V coefficient from the Gaussian model above
exp(0.07383158)

[1] 1.076625
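
Because the model above is Gaussian, its coefficients are on the scale of the 1/2-coded phenotype (a change in the proportion of cases per allele), and exponentiating them does not give an odds ratio. The sketch below illustrates the difference in the simplest setting, a single binary carrier indicator, where the linear slope equals the difference in case proportions and the logistic coefficient equals the log of the 2x2 odds ratio. The counts are hypothetical and `compare_effect_scales` is an illustrative helper. The notebook's model has two dosage predictors, so this is a simplification.

```python
import math


def compare_effect_scales(carrier_cases, carrier_controls, noncarrier_cases, noncarrier_controls):
    """Contrast the linear-model slope with the odds ratio for one binary predictor."""
    p_carrier = carrier_cases / (carrier_cases + carrier_controls)
    p_noncarrier = noncarrier_cases / (noncarrier_cases + noncarrier_controls)
    # With one binary predictor, the least-squares slope is the difference in group means,
    # i.e. the difference in case proportions (the 1/2 coding only shifts the intercept).
    linear_slope = p_carrier - p_noncarrier
    # The logistic-regression coefficient for the same predictor is the log odds ratio.
    odds_ratio = (carrier_cases * noncarrier_controls) / (carrier_controls * noncarrier_cases)
    return {
        "linear_slope": linear_slope,
        "exp_linear_slope": math.exp(linear_slope),
        "log_odds_ratio": math.log(odds_ratio),
        "odds_ratio": odds_ratio,
    }


# Hypothetical counts, chosen only to make the contrast visible
scales = compare_effect_scales(carrier_cases=30, carrier_controls=20,
                               noncarrier_cases=400, noncarrier_controls=550)
for name, value in scales.items():
    print(f"{name}: {value:.4f}")
```

To obtain odds ratios directly in R, one option would be to recode the phenotype to 0/1 and fit `glm(..., family = binomial)`; the coefficients of that model are log odds ratios.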


## Pearson correlation and conditional analyses of rs10756907 (Nalls et al. 2019) and rs150543523 (AMP PD coding variant)

In [81]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --chr 9 --from-bp 17727067 --to-bp 17727067 --recode A --out pheno_SH3GL2_rs10756907_nalls --double-id

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2_rs10756907_nalls.log.
Options in effect:
  --bfile pheno_SH3GL2
  --chr 9
  --double-id
  --from-bp 17727067
  --out pheno_SH3GL2_rs10756907_nalls
  --recode A
  --to-bp 17727067

14998 MB RAM detected; reserving 7499 MB for main workspace.
1 out of 14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%

In [82]:
rs10756907 = pd.read_csv("/home/jupyter/SH3GL2/pheno_SH3GL2_rs10756907_nalls.raw",sep=" ")
rs150543523 = pd.read_csv("/home/jupyter/SH3GL2/pheno_SH3GL2_rs150543523_G276V.raw",sep=" ")

In [83]:
total = pd.merge(rs10756907, rs150543523, on="FID")

In [84]:
total = total[(total['PHENOTYPE_x']==1) | (total['PHENOTYPE_x']==2)]
total

Unnamed: 0,FID,IID_x,PAT_x,MAT_x,SEX_x,PHENOTYPE_x,9:17727067_G,IID_y,PAT_y,MAT_y,SEX_y,PHENOTYPE_y,9:17793465_T
0,HB-PD_INVAB109VHC,HB-PD_INVAB109VHC,0,0,2,1,1,HB-PD_INVAB109VHC,0,0,2,1,0
1,HB-PD_INVAB289LG3,HB-PD_INVAB289LG3,0,0,2,1,0,HB-PD_INVAB289LG3,0,0,2,1,0
2,HB-PD_INVAC488AAF,HB-PD_INVAC488AAF,0,0,1,1,0,HB-PD_INVAC488AAF,0,0,1,1,0
3,HB-PD_INVAE296YP8,HB-PD_INVAE296YP8,0,0,2,1,1,HB-PD_INVAE296YP8,0,0,2,1,0
4,HB-PD_INVAJ549VWD,HB-PD_INVAJ549VWD,0,0,2,1,1,HB-PD_INVAJ549VWD,0,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10413,SY-PDZV652BE6,SY-PDZV652BE6,0,0,1,2,1,SY-PDZV652BE6,0,0,1,2,0
10414,SY-PDZX554VNB,SY-PDZX554VNB,0,0,1,2,1,SY-PDZX554VNB,0,0,1,2,0
10415,SY-PDZX943HWN,SY-PDZX943HWN,0,0,1,2,1,SY-PDZX943HWN,0,0,1,2,0
10416,SY-PDZY968RFA,SY-PDZY968RFA,0,0,1,2,0,SY-PDZY968RFA,0,0,1,2,0


In [85]:
total = total[["FID","PHENOTYPE_x","9:17727067_G","9:17793465_T"]]

In [86]:
total.columns = ["FID","Phenotype","Nalls_9_17727067_G","G276V_9_17793465_T"]
total

Unnamed: 0,FID,Phenotype,Nalls_9_17727067_G,G276V_9_17793465_T
0,HB-PD_INVAB109VHC,1,1,0
1,HB-PD_INVAB289LG3,1,0,0
2,HB-PD_INVAC488AAF,1,0,0
3,HB-PD_INVAE296YP8,1,1,0
4,HB-PD_INVAJ549VWD,1,1,0
...,...,...,...,...
10413,SY-PDZV652BE6,2,1,0
10414,SY-PDZX554VNB,2,1,0
10415,SY-PDZX943HWN,2,1,0
10416,SY-PDZY968RFA,2,0,0


In [87]:
total.to_csv(f'{WORKSPACE_BUCKET}/AMPPD_WGS_Nalls_G276V_rs10756907.csv')

In [88]:
%%R
library(stringr)

totalPath <- file.path({WORKSPACE_BUCKET}, 'AMPPD_WGS_Nalls_G276V_rs10756907.csv')
tmp <- gcs_read_file(totalPath)
tmp
total <- read.csv(tmp,header=TRUE)
head(total)

  X               FID Phenotype Nalls_9_17727067_G G276V_9_17793465_T
1 0 HB-PD_INVAB109VHC         1                  1                  0
2 1 HB-PD_INVAB289LG3         1                  0                  0
3 2 HB-PD_INVAC488AAF         1                  0                  0
4 3 HB-PD_INVAE296YP8         1                  1                  0
5 4 HB-PD_INVAJ549VWD         1                  1                  0
6 5 HB-PD_INVAK106DV5         1                  1                  0


In [89]:
%%R

# Linear regression of one genotype dosage on the other, used here as a correlation check
correlation.model <- lm(Nalls_9_17727067_G ~ G276V_9_17793465_T, data = total)
summary(correlation.model)


Call:
lm(formula = Nalls_9_17727067_G ~ G276V_9_17793465_T, data = total)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4442 -0.4442 -0.4442  0.5558  1.5558 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.444236   0.006824  65.103   <2e-16 ***
G276V_9_17793465_T -0.211678   0.090189  -2.347   0.0189 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5897 on 7510 degrees of freedom
Multiple R-squared:  0.000733,	Adjusted R-squared:  0.0005999 
F-statistic: 5.509 on 1 and 7510 DF,  p-value: 0.01895



In [90]:
%%R
# Conditional model: phenotype on both variants (no family given, so glm() defaults to Gaussian)
dep_corr <- glm(Phenotype ~ Nalls_9_17727067_G + G276V_9_17793465_T, data=total)
summary(dep_corr)


Call:
glm(formula = Phenotype ~ Nalls_9_17727067_G + G276V_9_17793465_T, 
    data = total)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5271  -0.4379  -0.4379   0.5621   0.5621  

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.437856   0.007195 199.850   <2e-16 ***
Nalls_9_17727067_G 0.020089   0.009728   2.065   0.0389 *  
G276V_9_17793465_T 0.069100   0.076058   0.909   0.3636    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.2471415)

    Null deviance: 1857.0  on 7511  degrees of freedom
Residual deviance: 1855.8  on 7509  degrees of freedom
AIC: 10823

Number of Fisher Scoring iterations: 2



In [91]:
%%R
summary(dep_corr)$coefficients

                     Estimate  Std. Error     t value   Pr(>|t|)
(Intercept)        1.43785556 0.007194687 199.8496335 0.00000000
Nalls_9_17727067_G 0.02008945 0.009727724   2.0651750 0.03894069
G276V_9_17793465_T 0.06910038 0.076057904   0.9085233 0.36363096


In [92]:
%%R
# Exponentiate the G276V coefficient from the Gaussian model above
exp(0.06910038)

[1] 1.071544


## Save out results

In [None]:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp -r {WORK_DIR} {WORKSPACE_BUCKET}')