## __Single gene analysis__

```GP2 ❤️ Open Science 😍```

### Description

Using individual-level data, this notebook performs the following steps:

__0. Getting Started__
- Loading Python libraries
- Defining functions
- Installing packages

__1. Copy data from workspace to cloud environment__

__2. Extract SH3GL2__

__3. Annotate SH3GL2 variants__

__4. Extract coding/non-synonymous variants__

__5. Calculate frequency in cases versus controls__

__6. Calculate frequency (homozygotes) in cases versus controls__

__7. Save out results__



### Loading Python libraries

In [1]:
# Use the os package to interact with the environment
import os

# Bring in Pandas for Dataframe functionality
import pandas as pd

# NumPy for numerical operations
import numpy as np

# Use StringIO for working with file contents
from io import StringIO

# Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

# Enable interaction with the FireCloud API
from firecloud import api as fapi

# Import the IPython HTML rendering for displaying links to Google Cloud Console
# (IPython.display is the public module; IPython.core.display is deprecated for this use)
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

# BigQuery for querying data
from google.cloud import bigquery

# Import sys for writing messages to stderr
import sys

### Defining functions

In [2]:
# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}', file=sys.stderr)
    !$command
    
def shell_return(command):
    print(f'Executing: {command}', file=sys.stderr)
    output = !$command
    return '\n'.join(output)

# Utility routine for printing a query before executing it
def bq_query(query):
    print(f'Executing: {query}', file=sys.stderr)
    return pd.read_gbq(query, project_id=BILLING_PROJECT_ID, dialect='standard')

# Utility routine for displaying a message and a link
def display_html_link(description, link_text, url):
    html = f'''
    <p>
    </p>
    <p>
    {description}
    <a target=_blank href="{url}">{link_text}</a>.
    </p>
    '''

    display(HTML(html))

# Utility routines for reading files from Google Cloud Storage
def gcs_read_file(path):
    """Return the contents of a file in GCS"""
    contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}
    return '\n'.join(contents)
    
def gcs_read_csv(path, sep=None):
    """Return a DataFrame from the contents of a delimited file in GCS"""
    return pd.read_csv(StringIO(gcs_read_file(path)), sep=sep, engine='python')

# Utility routine for displaying a message and link to Cloud Console
def link_to_cloud_console_gcs(description, link_text, gcs_path):
    url = '{}?{}'.format(
        os.path.join('https://console.cloud.google.com/storage/browser',
                     gcs_path.replace("gs://","")),
        urllib.parse.urlencode({'userProject': BILLING_PROJECT_ID}))

    display_html_link(description, link_text, url)
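
The URL that `link_to_cloud_console_gcs` assembles can also be built and inspected outside a notebook. The following is a minimal sketch, assuming a hypothetical helper name `console_url` and placeholder bucket and project names; it uses `posixpath.join` so the separator is always `/`.

```python
import posixpath
import urllib.parse

def console_url(gcs_path, billing_project):
    """Build a Cloud Console storage-browser URL for a gs:// path (hypothetical helper)."""
    base = 'https://console.cloud.google.com/storage/browser'
    # Strip the gs:// scheme; posixpath.join always joins with '/'
    path = posixpath.join(base, gcs_path.replace('gs://', '', 1))
    query = urllib.parse.urlencode({'userProject': billing_project})
    return f'{path}?{query}'

# Placeholder bucket and project, for illustration only
url = console_url('gs://example-bucket/some/folder', 'example-project')
print(url)
```

Printing the URL before rendering it as HTML makes it easy to check that the `userProject` parameter carries the intended billing project.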

### Set paths

In [3]:
# Set up billing project and data path variables
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

## Print the information to check we are in the proper release and billing 
## This will be different for you, the user, depending on the billing project your workspace is on
print('Billing and Workspace')
print(f'Workspace Name: {WORKSPACE_NAME}')
print(f'Billing Project: {BILLING_PROJECT_ID}')
print(f'Workspace Bucket, where you can upload and download data: {WORKSPACE_BUCKET}')
print('')


## GP2 v5.0
## Explicitly define release v5.0 path 
GP2_RELEASE_PATH = 'gs://gp2tier2/release5_11052023'
GP2_CLINICAL_RELEASE_PATH = f'{GP2_RELEASE_PATH}/clinical_data'
GP2_RAW_GENO_PATH = f'{GP2_RELEASE_PATH}/raw_genotypes'
GP2_IMPUTED_GENO_PATH = f'{GP2_RELEASE_PATH}/imputed_genotypes'
print('GP2 v5.0')
print(f'Path to GP2 v5.0 Clinical Data: {GP2_CLINICAL_RELEASE_PATH}')
print(f'Path to GP2 v5.0 Raw Genotype Data: {GP2_RAW_GENO_PATH}')
print(f'Path to GP2 v5.0 Imputed Genotype Data: {GP2_IMPUTED_GENO_PATH}')


## AMP-PD v3.0
## Explicitly define release v3.0 path 
AMP_RELEASE_PATH = 'gs://amp-pd-data/releases/2022_v3release_1115'
AMP_CLINICAL_RELEASE_PATH = f'{AMP_RELEASE_PATH}/clinical'

AMP_WGS_RELEASE_PATH = 'gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS'
AMP_WGS_RELEASE_PLINK_PATH = os.path.join(AMP_WGS_RELEASE_PATH, 'plink')

print('AMP-PD v3.0')
print(f'Path to AMP-PD v3.0 Clinical Data: {AMP_CLINICAL_RELEASE_PATH}')
print(f'Path to AMP-PD v3.0 WGS Data: {AMP_WGS_RELEASE_PLINK_PATH}')
print('')

Billing and Workspace
Workspace Name: Endophilin-A
Billing Project: terra-18d8e41c
Workspace Bucket, where you can upload and download data: gs://fc-f7a400c1-827e-48f8-b7b6-90c488a000a4

GP2 v5.0
Path to GP2 v5.0 Clinical Data: gs://gp2tier2/release5_11052023/clinical_data
Path to GP2 v5.0 Raw Genotype Data: gs://gp2tier2/release5_11052023/raw_genotypes
Path to GP2 v5.0 Imputed Genotype Data: gs://gp2tier2/release5_11052023/imputed_genotypes
AMP-PD v3.0
Path to AMP-PD v3.0 Clinical Data: gs://amp-pd-data/releases/2022_v3release_1115/clinical
Path to AMP-PD v3.0 WGS Data: gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink



### Install packages

### Install Plink 1.9 and Plink 2.0

In [5]:
%%bash

mkdir -p ~/tools
cd ~/tools

if test -e /home/jupyter/tools/plink; then
echo "Plink1.9 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink \n    -------"
wget -N http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip 
unzip -o plink_linux_x86_64_20190304.zip
echo -e "\n plink downloaded and unzipped in /home/jupyter/tools \n "

fi


if test -e /home/jupyter/tools/plink2; then
echo "Plink2 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink2 \n    -------"
wget -N https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_avx2_20220603.zip
unzip -o plink2_linux_avx2_20220603.zip
echo -e "\n plink2 downloaded and unzipped in /home/jupyter/tools \n "

fi

Plink1.9 is already installed in /home/jupyter/tools/
Plink2 is already installed in /home/jupyter/tools/


In [6]:
%%bash
ls /home/jupyter/tools/

annovar
annovar.latest.tar.gz
LICENSE
plink
plink2
plink2_linux_avx2_20220603.zip
plink_linux_x86_64_20190304.zip
prettify
toy.map
toy.ped


### Remove restrictions

In [7]:
%%bash

# chmod plink 1.9 
chmod u+x /home/jupyter/tools/plink

In [8]:
%%bash

# chmod plink 2.0
chmod u+x /home/jupyter/tools/plink2

### Install ANNOVAR

In [9]:
%%bash

# Install ANNOVAR:
# https://www.openbioinformatics.org/annovar/annovar_download_form.php

if test -e /home/jupyter/tools/annovar; then

echo "annovar is already installed in /home/jupyter/tools/"
else
echo "annovar is not installed"
cd /home/jupyter/tools/

wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz

tar xvfz annovar.latest.tar.gz

fi

annovar is already installed in /home/jupyter/tools/


In [10]:
%%bash
ls /home/jupyter/tools/

annovar
annovar.latest.tar.gz
LICENSE
plink
plink2
plink2_linux_avx2_20220603.zip
plink_linux_x86_64_20190304.zip
prettify
toy.map
toy.ped


### Install ANNOVAR: Download sources of annotation

In [11]:
%%bash

cd /home/jupyter/tools/annovar/

perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar clinvar_20140902 humandb/
#perl annotate_variation.pl -buildver hg38 -downdb cytoBand humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ensGene humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar exac03 humandb/ 
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp147 humandb/ 
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp30a humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad211_genome humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ljb26_all humandb/


NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGeneMrna.fa.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGeneVersion.txt.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg38 build version, with files saved at the 'humandb' directory
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20140902.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20140902.txt.idx.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished d

In [12]:
%%bash
ls /home/jupyter/tools/annovar/

annotate_variation.pl
coding_change.pl
convert2annovar.pl
example
humandb
retrieve_seq_from_fasta.pl
table_annovar.pl
variants_reduction.pl


## Copy data from AMP-PD bucket to workspace

In [4]:
# Make a directory
print("Making a working directory")
WORK_DIR = '/home/jupyter/SH3GL2/'
shell_do(f'mkdir -p {WORK_DIR}')  # f-string: the expression in {braces} is evaluated and inserted into the command

Making a working directory


Executing: mkdir -p /home/jupyter/SH3GL2/


In [14]:
# Check directory where AMP-PD data is
shell_do(f'gsutil -u {BILLING_PROJECT_ID} ls {AMP_WGS_RELEASE_PLINK_PATH}/pfiles')

Executing: gsutil -u terra-18d8e41c ls gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles


gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.log
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.pgen
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.psam
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/all_chrs_merged.pvar
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr1.pgen
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr1.psam
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr1.pvar
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr10.pgen
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr10.psam
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr10.pvar
gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr11.pgen
gs://am

In [15]:
#Copy data
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {AMP_WGS_RELEASE_PLINK_PATH}/pfiles/chr9* {WORK_DIR}')

Executing: gsutil -u terra-18d8e41c -m cp -r gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9* /home/jupyter/SH3GL2/


Copying gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9.pgen...
Copying gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9.psam...
Copying gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles/chr9.pvar...
/ [3/3 files][  8.8 GiB/  8.8 GiB] 100% Done   1.8 MiB/s ETA 00:00:00           
Operation completed over 3 objects/8.8 GiB.                                      


### Create a covariate file with AMP-PD data

In [16]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {AMP_RELEASE_PATH}/amp_pd_case_control.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {AMP_CLINICAL_RELEASE_PATH}/Demographics.csv {WORK_DIR}')

Executing: gsutil -u terra-18d8e41c -m cp -r gs://amp-pd-data/releases/2022_v3release_1115/amp_pd_case_control.csv /home/jupyter/SH3GL2/


Copying gs://amp-pd-data/releases/2022_v3release_1115/amp_pd_case_control.csv...
/ [1/1 files][735.1 KiB/735.1 KiB] 100% Done                                    
Operation completed over 1 objects/735.1 KiB.                                    


Executing: gsutil -u terra-18d8e41c -m cp -r gs://amp-pd-data/releases/2022_v3release_1115/clinical/Demographics.csv /home/jupyter/SH3GL2/


Copying gs://amp-pd-data/releases/2022_v3release_1115/clinical/Demographics.csv...
/ [1/1 files][622.2 KiB/622.2 KiB] 100% Done                                    
Operation completed over 1 objects/622.2 KiB.                                    


In [17]:
pd_case_control_df = pd.read_csv(f'{WORK_DIR}/amp_pd_case_control.csv')
pd_case_control_df.head()

Unnamed: 0,participant_id,diagnosis_at_baseline,diagnosis_latest,case_control_other_at_baseline,case_control_other_latest
0,BF-1001,No PD Nor Other Neurological Disorder,No PD Nor Other Neurological Disorder,Control,Control
1,BF-1002,Idiopathic PD,Idiopathic PD,Case,Case
2,BF-1003,Idiopathic PD,Idiopathic PD,Case,Case
3,BF-1004,Idiopathic PD,Idiopathic PD,Case,Case
4,BF-1005,No PD Nor Other Neurological Disorder,No PD Nor Other Neurological Disorder,Control,Control


In [18]:
pd_case_control_latest_df = pd_case_control_df[['participant_id', 'diagnosis_latest', 'case_control_other_latest']].copy()

In [19]:
pd_case_control_latest_df.columns = ['ID', 'LATEST_DX', 'CASE_CONTROL']

In [20]:
case_con_reduced = pd_case_control_latest_df.copy()
case_con_reduced.drop_duplicates(subset=['ID'], inplace=True)

In [21]:
conditions = [
    (case_con_reduced['CASE_CONTROL'] == "Case"),
    (case_con_reduced['CASE_CONTROL'] == "Control")]

In [22]:
choices = [2,1]
case_con_reduced['PHENO'] = np.select(conditions, choices, default=-9).astype(np.int64)

In [23]:
case_con_reduced.reset_index(inplace=True)
case_con_reduced.drop(columns=["index"], inplace=True)

In [24]:
case_con_reduced.drop(columns=['CASE_CONTROL'], inplace=True)
case_con_reduced.head()

Unnamed: 0,ID,LATEST_DX,PHENO
0,BF-1001,No PD Nor Other Neurological Disorder,1
1,BF-1002,Idiopathic PD,2
2,BF-1003,Idiopathic PD,2
3,BF-1004,Idiopathic PD,2
4,BF-1005,No PD Nor Other Neurological Disorder,1


In [25]:
demographics_df = pd.read_csv(f'{WORK_DIR}/Demographics.csv')
demographics_df.head()

Unnamed: 0,participant_id,GUID,visit_name,visit_month,age_at_baseline,sex,ethnicity,race,education_level_years
0,BF-1001,PDNW781VHY,M0,0,55,Male,Not Hispanic or Latino,White,12-16 years
1,BF-1002,PDCB969UGG,M0,0,66,Female,Not Hispanic or Latino,White,12-16 years
2,BF-1003,PDLW805AHT,M0,0,61,Male,Not Hispanic or Latino,White,12-16 years
3,BF-1004,PDKW284DYW,M0,0,62,Male,Not Hispanic or Latino,White,12-16 years
4,BF-1005,PDTM274KX6,M0,0,61,Female,Not Hispanic or Latino,White,12-16 years


In [26]:
demographics_df.rename(columns={'participant_id': 'ID',
                                'age_at_baseline': 'BASELINE_AGE',
                                'race': 'RACE'}, inplace=True)

In [27]:
demographics_baseline_df = demographics_df \
.sort_values('visit_month', ascending=True) \
.drop_duplicates('ID').sort_index()
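
Sorting by `visit_month` and dropping duplicate IDs keeps the earliest visit for each participant. A minimal sketch of that selection in plain Python, using made-up visit records, may help when checking the result by hand:

```python
# Toy visit records: (participant ID, visit month, age at baseline); values are made up
visits = [
    ('P-01', 12, 60),
    ('P-01', 0, 60),
    ('P-02', 0, 71),
    ('P-02', 6, 71),
]

baseline = {}
for pid, month, age in visits:
    # Keep the record with the smallest visit month for each participant
    if pid not in baseline or month < baseline[pid][1]:
        baseline[pid] = (pid, month, age)

baseline_rows = sorted(baseline.values())
print(baseline_rows)
```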

In [28]:
# Merge the one-row-per-participant baseline table with case/control status
demographics_df_casecon = demographics_baseline_df.merge(case_con_reduced, on='ID', how='outer')
demographics_df_casecon.head()

Unnamed: 0,ID,GUID,visit_name,visit_month,BASELINE_AGE,sex,ethnicity,RACE,education_level_years,LATEST_DX,PHENO
0,BF-1001,PDNW781VHY,M0,0,55,Male,Not Hispanic or Latino,White,12-16 years,No PD Nor Other Neurological Disorder,1
1,BF-1002,PDCB969UGG,M0,0,66,Female,Not Hispanic or Latino,White,12-16 years,Idiopathic PD,2
2,BF-1003,PDLW805AHT,M0,0,61,Male,Not Hispanic or Latino,White,12-16 years,Idiopathic PD,2
3,BF-1004,PDKW284DYW,M0,0,62,Male,Not Hispanic or Latino,White,12-16 years,Idiopathic PD,2
4,BF-1005,PDTM274KX6,M0,0,61,Female,Not Hispanic or Latino,White,12-16 years,No PD Nor Other Neurological Disorder,1


In [29]:
conditions = [
     (demographics_df_casecon['sex'] == "Male"),
     (demographics_df_casecon['sex'] == "Female")]

In [30]:
choices = [1,2]
# Use 0 (PLINK's code for unknown sex) when sex is missing, e.g. for IDs added by the outer merge
demographics_df_casecon['SEX'] = np.select(conditions, choices, default=0).astype(np.int64)

In [31]:
demographics_df_casecon.drop(columns=['sex'], inplace=True)
demographics_df_casecon.head()

Unnamed: 0,ID,GUID,visit_name,visit_month,BASELINE_AGE,ethnicity,RACE,education_level_years,LATEST_DX,PHENO,SEX
0,BF-1001,PDNW781VHY,M0,0,55,Not Hispanic or Latino,White,12-16 years,No PD Nor Other Neurological Disorder,1,1
1,BF-1002,PDCB969UGG,M0,0,66,Not Hispanic or Latino,White,12-16 years,Idiopathic PD,2,2
2,BF-1003,PDLW805AHT,M0,0,61,Not Hispanic or Latino,White,12-16 years,Idiopathic PD,2,1
3,BF-1004,PDKW284DYW,M0,0,62,Not Hispanic or Latino,White,12-16 years,Idiopathic PD,2,1
4,BF-1005,PDTM274KX6,M0,0,61,Not Hispanic or Latino,White,12-16 years,No PD Nor Other Neurological Disorder,1,2


In [32]:
demographics_df_casecon_toKeep_sex = demographics_df_casecon[['ID', 'ID', 'SEX'
                                                          ]].copy()
demographics_df_casecon_toKeep_sex.columns = ['FID','IID', 'SEX']
demographics_df_casecon_toKeep_sex.head()
demographics_df_casecon_toKeep_sex.to_csv(f'{WORK_DIR}/SEX.txt',index=False)

In [33]:
demographics_df_casecon_toKeep_pheno = demographics_df_casecon[['ID', 'ID', 'PHENO'
                                                          ]].copy()
demographics_df_casecon_toKeep_pheno.columns = ['FID','IID', 'PHENO']
demographics_df_casecon_toKeep_pheno.head()
demographics_df_casecon_toKeep_pheno.to_csv(f'{WORK_DIR}/PHENO.txt',index=False)

## Extract SH3GL2

In [92]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# SH3GL2: gene positions on hg38 (from https://useast.ensembl.org/index.html)
/home/jupyter/tools/plink2 \
--pfile chr9 \
--chr 9 \
--from-bp 17579066 \
--to-bp 17797124 \
--make-bed \
--max-alleles 2 \
--out temp_AMPPD

PLINK v2.00a3.3LM AVX2 Intel (3 Jun 2022)      www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to temp_AMPPD.log.
Options in effect:
  --chr 9
  --from-bp 17579066
  --make-bed
  --max-alleles 2
  --out temp_AMPPD
  --pfile chr9
  --to-bp 17797124

Start time: Mon Jul  3 10:41:08 2023
7450 MiB RAM detected; reserving 3725 MiB for main workspace.
Using up to 2 compute threads.
10418 samples (0 females, 0 males, 10418 ambiguous; 10418 founders) loaded from
chr9.psam.
6246283 out of 6777486 variants loaded from chr9.pvar.
Note: No phenotype data present.
14592 variants remaining after main filters.
Writing temp_AMPPD.fam ... done.
Writing temp_AMPPD.bim ... done.
Writing temp_AMPPD.bed ... done.
End time: Mon Jul  3 10:42:17 2023


In [93]:
%%bash
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head temp_AMPPD.bim

9	.	0	17579072	A	G
9	.	0	17579076	A	T
9	.	0	17579077	G	T
9	.	0	17579086	A	G
9	.	0	17579091	G	C
9	.	0	17579098	T	C
9	.	0	17579102	G	A
9	.	0	17579107	T	C
9	.	0	17579112	T	C
9	.	0	17579119	C	G
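
A quick sanity check is to confirm that the extracted positions fall inside the requested window. The sketch below is only an illustration: it hard-codes the first three `.bim` rows printed above rather than reading the file.

```python
# Window passed to --from-bp / --to-bp in the PLINK command (hg38)
FROM_BP, TO_BP = 17579066, 17797124

# First rows of the extracted .bim, copied from the output above
bim_text = """9\t.\t0\t17579072\tA\tG
9\t.\t0\t17579076\tA\tT
9\t.\t0\t17579077\tG\tT"""

# Column 4 of a .bim file is the base-pair position
positions = [int(line.split('\t')[3]) for line in bim_text.splitlines()]
outside = [bp for bp in positions if not FROM_BP <= bp <= TO_BP]
print(f'{len(positions)} variants checked, {len(outside)} outside the window')
```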


In [100]:
bim = pd.read_csv("/home/jupyter/SH3GL2/temp_AMPPD.bim",sep="\t",names=['chr','rsID','pos','bp','A1','A2'])
bim['chr'] = bim['chr'].astype(str)
bim['bp'] = bim['bp'].astype(str)

In [104]:
bim['rsID'] = bim.chr.str.cat(bim.bp, sep=':')

In [112]:
bim.head()

Unnamed: 0,chr,rsID,pos,bp,A1,A2
0,9,9:17579072,0,17579072,A,G
1,9,9:17579076,0,17579076,A,T
2,9,9:17579077,0,17579077,G,T
3,9,9:17579086,0,17579086,A,G
4,9,9:17579091,0,17579091,G,C


In [125]:
# Overwrite the .bim with chr:bp variant IDs (tab-separated, no header row)
bim.to_csv(f'{WORK_DIR}/temp_AMPPD.bim', sep="\t", index=False, header=False)
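
IDs of the form `chr:bp` are not guaranteed to be unique: two variants at the same position (for example a SNP and an indel) would receive the same ID. Whether that happens in this dataset is not shown here, so the following is a minimal sketch of a duplicate check on made-up rows:

```python
from collections import Counter

# Toy .bim rows: chromosome, variant ID, cM, bp, A1, A2 (made-up values)
rows = [
    ['9', '.', '0', '17579072', 'A', 'G'],
    ['9', '.', '0', '17579076', 'A', 'T'],
    ['9', '.', '0', '17579076', 'AT', 'A'],  # second variant at the same position
]

for row in rows:
    row[1] = f'{row[0]}:{row[3]}'  # replace '.' with chr:bp

id_counts = Counter(row[1] for row in rows)
duplicates = [vid for vid, n in id_counts.items() if n > 1]
print(duplicates)
```

If duplicates do turn up, appending the alleles to the ID (chr:bp:A1:A2) is one common way to make the IDs unique.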

In [126]:
%%bash
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head temp_AMPPD.fam
#Store the first column of IDs in a new file
awk '{ print $1 }' temp_AMPPD.fam > IDs.txt

HB-PD_INVAB109VHC	HB-PD_INVAB109VHC	0	0	0	-9
HB-PD_INVAB289LG3	HB-PD_INVAB289LG3	0	0	0	-9
HB-PD_INVAC488AAF	HB-PD_INVAC488AAF	0	0	0	-9
HB-PD_INVAE296YP8	HB-PD_INVAE296YP8	0	0	0	-9
HB-PD_INVAJ549VWD	HB-PD_INVAJ549VWD	0	0	0	-9
HB-PD_INVAK106DV5	HB-PD_INVAK106DV5	0	0	0	-9
HB-PD_INVAK293NGR	HB-PD_INVAK293NGR	0	0	0	-9
HB-PD_INVAM835KNR	HB-PD_INVAM835KNR	0	0	0	-9
HB-PD_INVAM864YC9	HB-PD_INVAM864YC9	0	0	0	-9
HB-PD_INVAN082DKK	HB-PD_INVAN082DKK	0	0	0	-9


In [127]:
sex = pd.read_csv(f'{WORK_DIR}/SEX.txt')
pheno = pd.read_csv(f'{WORK_DIR}/PHENO.txt')
ids = pd.read_csv(f'{WORK_DIR}/IDs.txt',names=['FID'])

In [128]:
sex_new = pd.merge(sex,ids,on="FID")
sex_new.to_csv(f'{WORK_DIR}/SEX_new.txt',index=False, sep="\t")

In [129]:
pheno_new = pd.merge(pheno,ids,on="FID")
pheno_new.to_csv(f'{WORK_DIR}/PHENO_new.txt',index=False, sep="\t")
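
`pd.merge` defaults to an inner join, so samples missing from either table are dropped silently. A minimal sketch, using made-up IDs, of how the unmatched samples on each side could be listed before merging:

```python
# Made-up IDs for illustration
fam_ids = ['S1', 'S2', 'S3', 'S4']      # samples with genotypes
covariate_ids = ['S2', 'S3', 'S5']      # samples with sex/phenotype records

fam_set, cov_set = set(fam_ids), set(covariate_ids)
matched = sorted(fam_set & cov_set)
genotyped_without_covariates = sorted(fam_set - cov_set)
covariates_without_genotypes = sorted(cov_set - fam_set)
print(matched, genotyped_without_covariates, covariates_without_genotypes)
```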

In [130]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# Update sex and add the case/control phenotype
/home/jupyter/tools/plink \
--bfile temp_AMPPD \
--update-sex SEX_new.txt  \
--pheno PHENO_new.txt \
--make-bed \
--out pheno_SH3GL2

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2.log.
Options in effect:
  --bfile temp_AMPPD
  --make-bed
  --out pheno_SH3GL2
  --pheno PHENO_new.txt
  --update-sex SEX_new.txt

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (0 males, 0 females, 10418 ambiguous) loaded from .fam.
Ambiguous sex IDs written to pheno_SH3GL2.nosex .
7512 phenotype values present after --pheno.
--update-sex: 10418 people updated, 1 ID not present.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38

### Visualize plink files bim, fam and bed

In [131]:
%%bash

# Visualize bim file
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head pheno_SH3GL2.bim

9	9:17579072	0	17579072	A	G
9	9:17579076	0	17579076	A	T
9	9:17579077	0	17579077	G	T
9	9:17579086	0	17579086	A	G
9	9:17579091	0	17579091	G	C
9	9:17579098	0	17579098	T	C
9	9:17579102	0	17579102	G	A
9	9:17579107	0	17579107	T	C
9	9:17579112	0	17579112	T	C
9	9:17579119	0	17579119	C	G


In [132]:
%%bash

# Visualize fam file
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head pheno_SH3GL2.fam

HB-PD_INVAB109VHC HB-PD_INVAB109VHC 0 0 2 1
HB-PD_INVAB289LG3 HB-PD_INVAB289LG3 0 0 2 1
HB-PD_INVAC488AAF HB-PD_INVAC488AAF 0 0 1 1
HB-PD_INVAE296YP8 HB-PD_INVAE296YP8 0 0 2 1
HB-PD_INVAJ549VWD HB-PD_INVAJ549VWD 0 0 2 1
HB-PD_INVAK106DV5 HB-PD_INVAK106DV5 0 0 1 1
HB-PD_INVAK293NGR HB-PD_INVAK293NGR 0 0 1 1
HB-PD_INVAM835KNR HB-PD_INVAM835KNR 0 0 2 1
HB-PD_INVAM864YC9 HB-PD_INVAM864YC9 0 0 2 1
HB-PD_INVAN082DKK HB-PD_INVAN082DKK 0 0 2 1


In [133]:
%%bash

# Visualize bed file (binary genotype data, so the output of head is not human-readable)
WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head pheno_SH3GL2.bed

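
Because the `.bed` file is binary, it has to be decoded rather than printed. In the PLINK 1 format the file starts with the bytes `0x6c 0x1b 0x01` (variant-major mode), and each following byte packs four samples at two bits each. The sketch below decodes a tiny hand-made byte string; the function name `decode_bed` is hypothetical and the toy data are not taken from `pheno_SH3GL2.bed`.

```python
import math

def decode_bed(data, n_samples, n_variants):
    """Decode a variant-major PLINK 1 .bed byte string into per-variant genotype codes.

    Codes: 0 = homozygous A1, 1 = heterozygous, 2 = homozygous A2, None = missing.
    """
    if data[:3] != b'\x6c\x1b\x01':
        raise ValueError('not a variant-major PLINK 1 .bed file')
    lookup = {0b00: 0, 0b01: None, 0b10: 1, 0b11: 2}
    bytes_per_variant = math.ceil(n_samples / 4)
    genotypes = []
    for v in range(n_variants):
        start = 3 + v * bytes_per_variant
        block = data[start:start + bytes_per_variant]
        calls = []
        for i in range(n_samples):
            # Each byte holds four samples, lowest two bits first
            two_bits = (block[i // 4] >> (2 * (i % 4))) & 0b11
            calls.append(lookup[two_bits])
        genotypes.append(calls)
    return genotypes

# Toy file: 1 variant, 4 samples, packed as 11 10 01 00 (sample 4 down to sample 1)
toy = b'\x6c\x1b\x01' + bytes([0b11100100])
decoded = decode_bed(toy, n_samples=4, n_variants=1)
print(decoded)
```

For a real file, the sample and variant counts would come from the number of lines in the matching `.fam` and `.bim` files.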

In [134]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# Turn binary files into VCF
/home/jupyter/tools/plink \
--bfile pheno_SH3GL2 \
--recode vcf-fid \
--allow-no-sex \
--out pheno_SH3GL2

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --bfile pheno_SH3GL2
  --out pheno_SH3GL2
  --recode vcf-fid

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%

## Annotate SH3GL2 variants

In [135]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR
export PATH=$PATH:/home/jupyter/tools/rvtests/third/tabix-0.2.6/

# Compress with bgzip and index with tabix
# -f overwrites a pheno_SH3GL2.vcf.gz left over from an earlier run
bgzip -f pheno_SH3GL2.vcf

tabix -f -p vcf pheno_SH3GL2.vcf.gz


In [136]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

### Annotate variants using ANNOVAR: https://annovar.openbioinformatics.org/en/latest/ 
perl /home/jupyter/tools/annovar/table_annovar.pl pheno_SH3GL2.vcf.gz /home/jupyter/tools/annovar/humandb/ -buildver hg38 \
-out pheno_SH3GL2.annovar \
-remove -protocol refGene,clinvar_20140902 \
-operation g,f \
--nopolish \
-nastring . \
-vcfinput


NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 pheno_SH3GL2.vcf.gz > pheno_SH3GL2.annovar.avinput>
NOTICE: Finished reading 14599 lines from VCF file
NOTICE: A total of 14592 locus in VCF file passed QC threshold, representing 13709 SNPs (8549 transitions and 5160 transversions) and 883 indels/substitutions
NOTICE: Finished writing allele frequencies based on 142820362 SNP genotypes (89063482 transitions and 53756880 transversions) and 9199094 indels/substitutions for 10418 samples

NOTICE: Running with system command </home/jupyter/tools/annovar/table_annovar.pl pheno_SH3GL2.annovar.avinput /home/jupyter/tools/annovar/humandb/ -buildver hg38 -outfile pheno_SH3GL2.annovar -remove -protocol refGene,clinvar_20140902 -operation g,f --nopolish -nastring . -otherinfo>
-----------------------------------------------------------------
NOTICE: Processing operation=g protocol=refGene

NOTICE: Running with system command <annotate_variati

## Extract coding and non-synonymous variants

In [5]:
# Visualize multianno file
SH3GL2 = pd.read_csv(f'{WORK_DIR}/pheno_SH3GL2.annovar.hg38_multianno.txt', sep = '\t')
SH3GL2

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,Otherinfo10421,Otherinfo10422,Otherinfo10423,Otherinfo10424,Otherinfo10425,Otherinfo10426,Otherinfo10427,Otherinfo10428,Otherinfo10429,Otherinfo10430
0,9,17579072,17579072,G,A,UTR5,SH3GL2,NM_003026:c.-171G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
1,9,17579076,17579076,T,A,UTR5,SH3GL2,NM_003026:c.-167T>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
2,9,17579077,17579077,T,G,UTR5,SH3GL2,NM_003026:c.-166T>G,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
3,9,17579086,17579086,G,A,UTR5,SH3GL2,NM_003026:c.-157G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
4,9,17579091,17579091,C,G,UTR5,SH3GL2,NM_003026:c.-152C>G,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14587,9,17797040,17797040,G,A,UTR3,SH3GL2,NM_003026:c.*1297G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
14588,9,17797052,17797052,T,A,UTR3,SH3GL2,NM_003026:c.*1309T>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
14589,9,17797079,17797079,C,T,UTR3,SH3GL2,NM_003026:c.*1336C>T,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
14590,9,17797093,17797093,G,A,UTR3,SH3GL2,NM_003026:c.*1350G>A,.,.,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
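
The trailing `Otherinfo` columns in this table hold genotype strings such as `0/0`. As a minimal sketch of how such strings could be tallied, assuming diploid VCF-style genotypes and using a hypothetical helper `genotype_counts` on made-up data:

```python
from collections import Counter

def genotype_counts(genotypes):
    """Tally diploid VCF-style genotype strings into hom_ref, het, hom_alt and missing."""
    counts = Counter()
    for gt in genotypes:
        alleles = gt.replace('|', '/').split('/')
        if '.' in alleles:
            counts['missing'] += 1
        elif alleles == ['0', '0']:
            counts['hom_ref'] += 1
        elif alleles[0] == alleles[1]:
            counts['hom_alt'] += 1
        else:
            counts['het'] += 1
    return counts

toy = ['0/0', '0/1', '1/1', './.', '0/0']  # made-up genotypes
counts = genotype_counts(toy)
called = len(toy) - counts['missing']
# Alternate-allele frequency among called genotypes
alt_allele_freq = (counts['het'] + 2 * counts['hom_alt']) / (2 * called)
print(dict(counts), alt_allele_freq)
```

Running the same tally separately for cases and controls would give the carrier and homozygote counts that the later steps of the notebook compare.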


In [7]:
# Keep all variants annotated to SH3GL2
all_sh3gl2 = SH3GL2[SH3GL2['Gene.refGene'] == 'SH3GL2']
all_sh3gl2.count()

Chr               14592
Start             14592
End               14592
Ref               14592
Alt               14592
                  ...  
Otherinfo10426    14592
Otherinfo10427    14592
Otherinfo10428    14592
Otherinfo10429    14592
Otherinfo10430    14592
Length: 10441, dtype: int64

In [8]:
# Filter exonic variants
coding = SH3GL2[SH3GL2['Func.refGene'] == 'exonic']
coding.count()

Chr               47
Start             47
End               47
Ref               47
Alt               47
                  ..
Otherinfo10426    47
Otherinfo10427    47
Otherinfo10428    47
Otherinfo10429    47
Otherinfo10430    47
Length: 10441, dtype: int64

In [9]:
coding.head()

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,Otherinfo10421,Otherinfo10422,Otherinfo10423,Otherinfo10424,Otherinfo10425,Otherinfo10426,Otherinfo10427,Otherinfo10428,Otherinfo10429,Otherinfo10430
37,9,17579272,17579272,C,T,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon1:c.C30T:p.F10F,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
38,9,17579287,17579287,G,A,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon1:c.G45A:p.Q15Q,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
11378,9,17747120,17747120,A,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A100C:p.K34Q,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
11379,9,17747126,17747126,A,T,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A106T:p.M36L,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
12306,9,17761487,17761487,T,A,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon3:c.T165A:p.I55I,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
12307,9,17761502,17761502,C,T,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon3:c.C180T:p.P60P,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13881,9,17786418,17786418,A,T,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon4:c.A225T:p.S75S,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13882,9,17786424,17786424,C,T,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon4:c.C231T:p.I77I,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13883,9,17786430,17786430,C,T,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon4:c.C237T:p.G79G,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13884,9,17786442,17786442,G,T,exonic,SH3GL2,.,synonymous SNV,SH3GL2:NM_003026:exon4:c.G249T:p.G83G,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0


In [10]:
# Filter exonic and stopgain
coding_stopgain = SH3GL2[(SH3GL2['Func.refGene'] == 'exonic') & (SH3GL2['ExonicFunc.refGene'] == 'stopgain')]
coding_stopgain.count()

Chr               2
Start             2
End               2
Ref               2
Alt               2
                 ..
Otherinfo10426    2
Otherinfo10427    2
Otherinfo10428    2
Otherinfo10429    2
Otherinfo10430    2
Length: 10441, dtype: int64

In [11]:
# Filter exonic and synonymous
coding_synonymous = SH3GL2[(SH3GL2['Func.refGene'] == 'exonic') & (SH3GL2['ExonicFunc.refGene'] == 'synonymous SNV')]
coding_synonymous.count()

Chr               23
Start             23
End               23
Ref               23
Alt               23
                  ..
Otherinfo10426    23
Otherinfo10427    23
Otherinfo10428    23
Otherinfo10429    23
Otherinfo10430    23
Length: 10441, dtype: int64

In [12]:
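# Filter exonic and missense
# (returns 0 rows: ANNOVAR's refGene annotation labels missense changes 'nonsynonymous SNV', not 'missense SNV')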
coding_missense = SH3GL2[(SH3GL2['Func.refGene'] == 'exonic') & (SH3GL2['ExonicFunc.refGene'] == 'missense SNV')]
coding_missense.count()

Chr               0
Start             0
End               0
Ref               0
Alt               0
                 ..
Otherinfo10426    0
Otherinfo10427    0
Otherinfo10428    0
Otherinfo10429    0
Otherinfo10430    0
Length: 10441, dtype: int64

In [9]:
# Filter exonic and nonsynonymous
coding_nonsynonymous = SH3GL2[(SH3GL2['Func.refGene'] == 'exonic') & (SH3GL2['ExonicFunc.refGene'] == 'nonsynonymous SNV')]
coding_nonsynonymous.count()

Chr               22
Start             22
End               22
Ref               22
Alt               22
                  ..
Otherinfo10426    22
Otherinfo10427    22
Otherinfo10428    22
Otherinfo10429    22
Otherinfo10430    22
Length: 10441, dtype: int64

In [10]:
coding_nonsynonymous.head()

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,Otherinfo10421,Otherinfo10422,Otherinfo10423,Otherinfo10424,Otherinfo10425,Otherinfo10426,Otherinfo10427,Otherinfo10428,Otherinfo10429,Otherinfo10430
11378,9,17747120,17747120,A,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A100C:p.K34Q,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
11379,9,17747126,17747126,A,T,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A106T:p.M36L,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13886,9,17786478,17786478,G,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon4:c.G285C:p.E95D,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13887,9,17786494,17786494,G,A,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon4:c.G301A:p.G101R,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13964,9,17787437,17787437,C,T,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon5:c.C389T:p.S130F,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13966,9,17787457,17787457,C,G,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon5:c.C409G:p.Q137E,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13967,9,17787498,17787498,T,A,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon5:c.T450A:p.D150E,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
14101,9,17789477,17789477,G,A,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon6:c.G551A:p.R184H,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
14231,9,17791290,17791290,G,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon7:c.G684C:p.Q228H,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
14233,9,17791300,17791300,A,G,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon7:c.A694G:p.I232V,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0


In [12]:
coding_nonsynonymous.to_csv(f'{WORK_DIR}/coding_nonsynonymous.txt', sep = '\t', index = False)
coding_nonsynonymous.head()

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,Otherinfo10421,Otherinfo10422,Otherinfo10423,Otherinfo10424,Otherinfo10425,Otherinfo10426,Otherinfo10427,Otherinfo10428,Otherinfo10429,Otherinfo10430
11378,9,17747120,17747120,A,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A100C:p.K34Q,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
11379,9,17747126,17747126,A,T,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon2:c.A106T:p.M36L,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13886,9,17786478,17786478,G,C,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon4:c.G285C:p.E95D,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13887,9,17786494,17786494,G,A,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon4:c.G301A:p.G101R,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
13964,9,17787437,17787437,C,T,exonic,SH3GL2,.,nonsynonymous SNV,SH3GL2:NM_003026:exon5:c.C389T:p.S130F,...,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0


In [None]:
# Filter intronic variants
intronic = SH3GL2[SH3GL2['Func.refGene'] == 'intronic']
intronic.count()

In [6]:
# Filter UTR3 variants
UTR3 = SH3GL2[SH3GL2['Func.refGene'] == 'UTR3']
UTR3.count()

Chr               81
Start             81
End               81
Ref               81
Alt               81
                  ..
Otherinfo10426    81
Otherinfo10427    81
Otherinfo10428    81
Otherinfo10429    81
Otherinfo10430    81
Length: 10441, dtype: int64

In [7]:
# Filter UTR5 variants
UTR5 = SH3GL2[SH3GL2['Func.refGene'] == 'UTR5']
UTR5.count()

Chr               37
Start             37
End               37
Ref               37
Alt               37
                  ..
Otherinfo10426    37
Otherinfo10427    37
Otherinfo10428    37
Otherinfo10429    37
Otherinfo10430    37
Length: 10441, dtype: int64
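
The per-category filters above can also be tallied in one pass. The following is a minimal sketch on a toy table, assuming the same ANNOVAR column names (`Func.refGene`, `ExonicFunc.refGene`); the rows are hypothetical and are not taken from the SH3GL2 data.

```python
import pandas as pd

# Toy stand-in for the annotated table (hypothetical rows, not real data).
annotated = pd.DataFrame({
    "Func.refGene": ["exonic", "exonic", "exonic", "intronic", "UTR3", "UTR5"],
    "ExonicFunc.refGene": ["nonsynonymous SNV", "synonymous SNV", "stopgain", ".", ".", "."],
})

# Number of variants per region class (exonic, intronic, UTR3, ...).
by_region = annotated["Func.refGene"].value_counts()

# Within exonic variants, number of variants per consequence class.
exonic = annotated[annotated["Func.refGene"] == "exonic"]
by_consequence = exonic["ExonicFunc.refGene"].value_counts()

print(by_region)
print(by_consequence)
```

Listing the labels that are actually present also makes it easy to spot a filter string that matches nothing.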

## Calculate frequency of coding nonsynonymous variants in cases versus controls

In [13]:
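# Build a PLINK range file (chromosome, start bp, end bp, set ID) for --extract range;
# it is written tab-separated and without a header row.
reduced_coding_nonsynonymous = coding_nonsynonymous[["Chr", "Start", "End", "Gene.refGene"]]
reduced_coding_nonsynonymous.to_csv(f'{WORK_DIR}/reduced_coding_nonsynonymous.txt', sep='\t', index=False, header=False)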

In [14]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

head reduced_coding_nonsynonymous.txt

9	17747120	17747120	SH3GL2
9	17747126	17747126	SH3GL2
9	17786478	17786478	SH3GL2
9	17786494	17786494	SH3GL2
9	17787437	17787437	SH3GL2
9	17787457	17787457	SH3GL2
9	17787498	17787498	SH3GL2
9	17789477	17789477	SH3GL2
9	17791290	17791290	SH3GL2
9	17791300	17791300	SH3GL2


In [15]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

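# Run a case/control allelic association test (--assoc) for ALL variants;
# the .assoc output reports the A1 allele frequency in cases (F_A) and controls (F_U)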

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --assoc --out assoc_pheno_SH3GL2 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to assoc_pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile pheno_SH3GL2
  --out assoc_pheno_SH3GL2

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%6

In [16]:
# Load the association results for ALL variants (whitespace-delimited PLINK output)
all_SH3GL2_freq = pd.read_csv(f'{WORK_DIR}/assoc_pheno_SH3GL2.assoc', sep=r'\s+')
all_SH3GL2_freq.head()

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR
0,9,9:17579072,17579072,A,0.0,0.00012,G,0.8089,0.3685,0.0
1,9,9:17579076,17579076,A,0.000149,0.0,T,1.236,0.2662,
2,9,9:17579077,17579077,G,0.0,0.00012,T,0.8089,0.3685,0.0
3,9,9:17579086,17579086,A,0.000298,0.00012,G,0.5849,0.4444,2.473
4,9,9:17579091,17579091,G,0.0,0.00012,C,0.8089,0.3685,0.0


In [17]:
# Keep variants with nominal P <= 0.05 in the PD case versus healthy control (HC) comparison
all_SH3GL2_freq_p_0_05 = all_SH3GL2_freq[all_SH3GL2_freq['P']  <= 0.05]
all_SH3GL2_freq_p_0_05

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR
77,9,9:17579692,17579692,T,0.336400,0.370800,G,19.190,0.000012,0.8602
87,9,9:17579765,17579765,G,0.293500,0.324500,C,16.590,0.000046,0.8651
109,9,9:17580008,17580008,C,0.000893,0.000000,T,7.421,0.006446,
148,9,9:17580618,17580618,T,0.328400,0.360600,A,17.030,0.000037,0.8670
166,9,9:17581041,17581041,A,0.326400,0.311200,G,3.965,0.046470,1.0730
...,...,...,...,...,...,...,...,...,...,...
14018,9,9:17788305,17788305,T,0.000893,0.002288,C,4.347,0.037080,0.3899
14040,9,9:17788598,17788598,G,0.000893,0.002288,C,4.347,0.037080,0.3899
14139,9,9:17789907,17789907,A,0.000893,0.000120,G,4.762,0.029090,7.4240
14336,9,9:17792820,17792820,C,0.001191,0.000241,T,5.040,0.024770,4.9500
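
The `OR` column can be checked by hand from `F_A` and `F_U`. The sketch below assumes the allelic odds ratio is the odds of A1 in cases divided by the odds of A1 in controls; the function names are hypothetical, and because the displayed frequencies are rounded, the result only approximates PLINK's value. It also shows why `OR` is blank when `F_U` is 0 and 0 when `F_A` is 0.

```python
import math

def odds(freq: float) -> float:
    """Odds corresponding to an allele frequency."""
    return math.inf if freq == 1 else freq / (1 - freq)

def allelic_odds_ratio(f_a: float, f_u: float) -> float:
    """Odds of allele A1 in cases (f_a) divided by its odds in controls (f_u)."""
    odds_a, odds_u = odds(f_a), odds(f_u)
    if odds_u == 0 or (math.isinf(odds_a) and math.isinf(odds_u)):
        return math.nan  # undefined: zero control odds, or both odds infinite
    return odds_a / odds_u

# Row 9:17579692 from the table above: F_A = 0.3364, F_U = 0.3708, reported OR = 0.8602
print(round(allelic_odds_ratio(0.3364, 0.3708), 4))
# A variant absent in controls has an undefined odds ratio
print(allelic_odds_ratio(0.000893, 0.0))
```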


In [18]:
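# Variants with zero A1 frequency in controls (F_U == 0). This also keeps variants that are
# absent in cases as well (F_A == 0), so it is a superset of the variants seen only in PD cases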
all_SH3GL2_freq_cases_only = all_SH3GL2_freq[all_SH3GL2_freq['F_U'] == 0]
all_SH3GL2_freq_cases_only.count()

CHR      5953
SNP      5953
BP       5953
A1       5953
F_A      5953
F_U      5953
A2       5953
CHISQ    3478
P        3478
OR          0
dtype: int64

In [19]:
all_SH3GL2_freq_cases_only.head()

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR
1,9,9:17579076,17579076,A,0.000149,0.0,T,1.236,0.2662,
5,9,9:17579098,17579098,T,0.0,0.0,C,,,
6,9,9:17579102,17579102,G,0.0,0.0,A,,,
8,9,9:17579112,17579112,T,0.000149,0.0,C,1.236,0.2662,
9,9,9:17579119,17579119,C,0.000149,0.0,G,1.236,0.2662,
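
Rows such as `9:17579098` have `F_A` and `F_U` both equal to 0, so they are absent in controls without being observed in cases. A minimal sketch of a stricter filter, using the five rows shown above as a toy table (the variable names are hypothetical):

```python
import pandas as pd

# Rows copied from the preview above (F_A = cases, F_U = controls).
assoc = pd.DataFrame({
    "SNP": ["9:17579076", "9:17579098", "9:17579102", "9:17579112", "9:17579119"],
    "F_A": [0.000149, 0.0, 0.0, 0.000149, 0.000149],
    "F_U": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Absent in controls (the filter used in the notebook).
absent_in_controls = assoc[assoc["F_U"] == 0]

# Absent in controls AND observed in at least one case.
case_only = assoc[(assoc["F_U"] == 0) & (assoc["F_A"] > 0)]

print(len(absent_in_controls), len(case_only))
```

On these five rows the stricter filter keeps three variants instead of five.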


In [20]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

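# Subset to RARE variants (MAF <= 0.05) and write a new binary fileset (--make-bed);
# no frequencies or association statistics are calculated in this step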

/home/jupyter/tools/plink --bfile pheno_SH3GL2 --make-bed --max-maf 0.05 --out assoc_pheno_SH3GL2_rare_0.05 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to assoc_pheno_SH3GL2_rare_0.05.log.
Options in effect:
  --allow-no-sex
  --bfile pheno_SH3GL2
  --make-bed
  --max-maf 0.05
  --out assoc_pheno_SH3GL2_rare_0.05

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%

In [21]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# Association test restricted to the coding nonsynonymous positions listed in the range file
/home/jupyter/tools/plink --bfile pheno_SH3GL2 --extract range reduced_coding_nonsynonymous.txt --assoc --out coding_nonsynonymous_pheno_SH3GL2 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile pheno_SH3GL2
  --extract range reduced_coding_nonsynonymous.txt
  --out coding_nonsynonymous_pheno_SH3GL2

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
--extract range: 14570 variants excluded.
--extract range: 22 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%

In [22]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

# Write a binary fileset of the coding nonsynonymous variants with MAF <= 0.05
/home/jupyter/tools/plink --bfile pheno_SH3GL2 --extract range reduced_coding_nonsynonymous.txt --make-bed --max-maf 0.05 --out coding_nonsynonymous_rare_0.05 --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_rare_0.05.log.
Options in effect:
  --allow-no-sex
  --bfile pheno_SH3GL2
  --extract range reduced_coding_nonsynonymous.txt
  --make-bed
  --max-maf 0.05
  --out coding_nonsynonymous_rare_0.05

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
--extract range: 14570 variants excluded.
--extract range: 22 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%

In [23]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR
 
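# Repeat the association test with 95% confidence intervals (--ci 0.95) and multiple-testing
# adjusted p-values (--adjust); the --out prefix matches the earlier --assoc run, so its files are overwritten
/home/jupyter/tools/plink --bfile pheno_SH3GL2 --extract range reduced_coding_nonsynonymous.txt --assoc --ci 0.95 --adjust --out coding_nonsynonymous_pheno_SH3GL2 --allow-no-sex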

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_pheno_SH3GL2.log.
Options in effect:
  --adjust
  --allow-no-sex
  --assoc
  --bfile pheno_SH3GL2
  --ci 0.95
  --extract range reduced_coding_nonsynonymous.txt
  --out coding_nonsynonymous_pheno_SH3GL2

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
--extract range: 14570 variants excluded.
--extract range: 22 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%3

In [24]:
# Load the association results for the coding nonsynonymous variants
SH3GL2_freq = pd.read_csv(f'{WORK_DIR}/coding_nonsynonymous_pheno_SH3GL2.assoc', sep=r'\s+')
SH3GL2_freq

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR,SE,L95,U95
0,9,9:17747120,17747120,C,0.0,0.0,A,,,,,,
1,9,9:17747126,17747126,T,0.000149,0.0,A,1.236,0.2662,,,,
2,9,9:17786478,17786478,C,0.0,0.00012,G,0.8089,0.3685,0.0,inf,0.0,
3,9,9:17786494,17786494,A,0.0,0.00012,G,0.8089,0.3685,0.0,inf,0.0,
4,9,9:17787437,17787437,T,0.0,0.0,C,,,,,,
5,9,9:17787457,17787457,G,0.0,0.00012,C,0.8089,0.3685,0.0,inf,0.0,
6,9,9:17787498,17787498,A,0.0,0.00012,T,0.8089,0.3685,0.0,inf,0.0,
7,9,9:17789477,17789477,A,0.0,0.0,G,,,,,,
8,9,9:17791290,17791290,C,0.0,0.0,G,,,,,,
9,9,9:17791300,17791300,G,0.000149,0.0,A,1.236,0.2662,,,,


In [25]:
# Load the multiple-testing adjusted p-values written by --adjust
SH3GL2_bonf = pd.read_csv(f'{WORK_DIR}/coding_nonsynonymous_pheno_SH3GL2.assoc.adjusted', sep=r'\s+')
SH3GL2_bonf

Unnamed: 0,CHR,SNP,UNADJ,GC,BONF,HOLM,SIDAK_SS,SIDAK_SD,FDR_BH,FDR_BY
0,9,9:17793459,0.2034,0.4399,1,1,0.9791,0.9791,0.4176,1
1,9,9:17747126,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
2,9,9:17791300,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
3,9,9:17793419,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
4,9,9:17795552,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
5,9,9:17795571,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
6,9,9:17795598,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
7,9,9:17795649,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
8,9,9:17795705,0.2662,0.4995,1,1,0.9948,0.9929,0.4176,1
9,9,9:17786478,0.3685,0.5849,1,1,0.9996,0.9929,0.4176,1
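
The `BONF` and `SIDAK_SS` columns above can be reproduced by hand. This is a sketch under one assumption: the number of tests is taken to be 17, a value inferred here from the displayed `SIDAK_SS` figures (presumably the variants with a defined p-value) and not stated anywhere in the notebook. The function names are hypothetical.

```python
def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)

def sidak_single_step(p: float, n_tests: int) -> float:
    """Sidak single-step adjusted p-value."""
    return 1.0 - (1.0 - p) ** n_tests

N_TESTS = 17  # assumption: inferred from the table above, not given in the notebook

for p_unadjusted in (0.2034, 0.2662, 0.3685):
    print(p_unadjusted,
          bonferroni(p_unadjusted, N_TESTS),
          round(sidak_single_step(p_unadjusted, N_TESTS), 4))
```

With so few, very rare variants, every unadjusted p-value is far above 0.05, so all Bonferroni-adjusted values are capped at 1.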


## Calculate frequency of homozygotes (HMZ) in cases versus controls

In [26]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2/'
cd $WORK_DIR

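# Export additive genotype coding (--recode A): the .raw file holds 0/1/2 copies of the A1 allele per sample
/home/jupyter/tools/plink --bfile pheno_SH3GL2 --extract range reduced_coding_nonsynonymous.txt --recode A --out coding_nonsynonymous_pheno_SH3GL2 --allow-no-sex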

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_pheno_SH3GL2.log.
Options in effect:
  --allow-no-sex
  --bfile pheno_SH3GL2
  --extract range reduced_coding_nonsynonymous.txt
  --out coding_nonsynonymous_pheno_SH3GL2
  --recode A

7450 MB RAM detected; reserving 3725 MB for main workspace.
14592 variants loaded from .bim file.
10418 people (5782 males, 4636 females) loaded from .fam.
7512 phenotype values loaded from .fam.
--extract range: 14570 variants excluded.
--extract range: 22 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 10418 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%

In [27]:
# Load the additive (.raw) genotype matrix: one row per sample, one column per variant
SH3GL2_recode = pd.read_csv(f'{WORK_DIR}/coding_nonsynonymous_pheno_SH3GL2.raw', sep=r'\s+')
SH3GL2_recode.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,9:17747120_C,9:17747126_T,9:17786478_C,9:17786494_A,...,9:17793459_G,9:17793465_T,9:17793468_C,9:17793471_T,9:17795552_G,9:17795571_A,9:17795582_A,9:17795598_G,9:17795649_G,9:17795705_G
0,HB-PD_INVAB109VHC,HB-PD_INVAB109VHC,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HB-PD_INVAB289LG3,HB-PD_INVAB289LG3,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HB-PD_INVAC488AAF,HB-PD_INVAC488AAF,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HB-PD_INVAE296YP8,HB-PD_INVAE296YP8,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HB-PD_INVAJ549VWD,HB-PD_INVAJ549VWD,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,HB-PD_INVAK106DV5,HB-PD_INVAK106DV5,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,HB-PD_INVAK293NGR,HB-PD_INVAK293NGR,0,0,1,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,HB-PD_INVAM835KNR,HB-PD_INVAM835KNR,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,HB-PD_INVAM864YC9,HB-PD_INVAM864YC9,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,HB-PD_INVAN082DKK,HB-PD_INVAN082DKK,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
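
The additive coding ties directly back to the `F_A` and `F_U` columns of the `.assoc` file: an allele frequency is the sum of dosages divided by twice the number of genotyped samples. For example, a single heterozygous carrier among the 3359 cases gives 1 / (2 × 3359) ≈ 0.000149. The sketch below illustrates this on a toy `.raw`-style table; the samples and genotypes are hypothetical, and `allele_freq_by_group` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Toy .raw-style table (hypothetical samples): each value counts copies of the coded allele.
raw = pd.DataFrame({
    "IID": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "PHENOTYPE": [2, 2, 2, 1, 1, -9],  # 2 = case, 1 = control, -9 = unknown
    "9:17793465_T": [1, 0, 0, 0, 2, 1],
})

def allele_freq_by_group(df: pd.DataFrame, column: str) -> pd.Series:
    """Coded-allele frequency per phenotype group: sum of dosages / (2 * non-missing samples)."""
    grouped = df.groupby("PHENOTYPE")[column]
    return grouped.sum() / (2 * grouped.count())

freq = allele_freq_by_group(raw, "9:17793465_T")
print(freq)

# Single heterozygous carrier among 3359 cases, as in the F_A column
print(round(1 / (2 * 3359), 6))
```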


In [28]:
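# Explore homozygotes for chr9:17793459 in cases.
# Genotype columns in the .raw file loaded above are named '<variant ID>_<counted allele>',
# so this variant's column is '9:17793459_G'. The rs-ID column names used in the following
# cells match the headers shown in their saved outputs.
SH3GL2_hom_cases_01 = SH3GL2_recode[(SH3GL2_recode['9:17793459_G'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_01.head()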

In [25]:
SH3GL2_hom_cases_01.shape

(0, 28)

In [26]:
'Calculate frequency against the total N of cases: {:f}'.format(0/3359)

'Calculate frequency against the total N of cases: 0.000000'

In [27]:
# Explore HMZs for chr9:17793459 in controls
SH3GL2_hom_controls_01 = SH3GL2_recode[(SH3GL2_recode['rs142575982_G'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_01.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4
2985,LB-06152,LB-06152,0,0,2,1,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0


In [28]:
SH3GL2_hom_controls_01.shape

(1, 28)

In [29]:
'Calculate frequency against the total N of controls: {:f}'.format(1/4153)

'Calculate frequency against the total N of controls: 0.000241'

In [30]:
# Unknown phenotype (-9)
SH3GL2_hom_U_01 = SH3GL2_recode[(SH3GL2_recode['rs142575982_G'] == 2) & (SH3GL2_recode['PHENOTYPE'] == -9)]
SH3GL2_hom_U_01.shape

(0, 28)

In [31]:
# Explore homozygotes for chr9:17793465 in cases
SH3GL2_hom_cases_02 = SH3GL2_recode[(SH3GL2_recode['rs150543523_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_02.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [32]:
SH3GL2_hom_cases_02.shape

(0, 28)

In [33]:
'Calculate frequency against the total N of cases: {:f}'.format(0/3359)

'Calculate frequency against the total N of cases: 0.000000'

In [34]:
# Explore HMZs for chr9:17793465 in controls
SH3GL2_hom_controls_02 = SH3GL2_recode[(SH3GL2_recode['rs150543523_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_02.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [35]:
SH3GL2_hom_controls_02.shape

(0, 28)

In [36]:
'Calculate frequency against the total N of controls: {:f}'.format(0/4153)

'Calculate frequency against the total N of controls: 0.000000'

In [37]:
# Unknown phenotype (-9)
SH3GL2_hom_U_02 = SH3GL2_recode[(SH3GL2_recode['rs150543523_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == -9)]
SH3GL2_hom_U_02.shape

(1, 28)

In [38]:
# Explore homozygotes for chr9:17793471 in cases
SH3GL2_hom_cases_03 = SH3GL2_recode[(SH3GL2_recode['rs200824412_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_03.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [39]:
SH3GL2_hom_cases_03.shape

(0, 28)

In [40]:
'Calculate frequency against the total N of cases: {:f}'.format(0/3359)

'Calculate frequency against the total N of cases: 0.000000'

In [41]:
# Explore HMZs for chr9:17793471 in controls
SH3GL2_hom_controls_03 = SH3GL2_recode[(SH3GL2_recode['rs200824412_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_03.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [42]:
SH3GL2_hom_controls_03.shape

(0, 28)

In [43]:
'Calculate frequency against the total N of controls: {:f}'.format(0/4153)

'Calculate frequency against the total N of controls: 0.000000'

In [44]:
# Unknown phenotype (-9)
SH3GL2_hom_U_03 = SH3GL2_recode[(SH3GL2_recode['rs200824412_T'] == 2) & (SH3GL2_recode['PHENOTYPE'] == -9)]
SH3GL2_hom_U_03.shape

(0, 28)

In [45]:
# Explore homozygotes for chr9:17795552  in cases
SH3GL2_hom_cases_04 = SH3GL2_recode[(SH3GL2_recode['rs370568983_G'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_04.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [46]:
SH3GL2_hom_cases_04.shape

(0, 28)

In [47]:
'Calculate frequency against the total N of cases: {:f}'.format(0/3359)

'Calculate frequency against the total N of cases: 0.000000'

In [48]:
# Explore HMZs for chr9:17795552  in controls
SH3GL2_hom_controls_04 = SH3GL2_recode[(SH3GL2_recode['rs370568983_G'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_04.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [49]:
SH3GL2_hom_controls_04.shape

(0, 28)

In [50]:
'Calculate frequency against the total N of controls: {:f}'.format(0/4153)

'Calculate frequency against the total N of controls: 0.000000'

In [51]:
# Unknown phenotype (-9)
SH3GL2_hom_U_04 = SH3GL2_recode[(SH3GL2_recode['rs370568983_G'] == 2) & (SH3GL2_recode['PHENOTYPE'] == -9)]
SH3GL2_hom_U_04.shape

(0, 28)

In [52]:
# Explore homozygotes for chr9:17795571 in cases
SH3GL2_hom_cases_05 = SH3GL2_recode[(SH3GL2_recode['rs370370071_A'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_05.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [53]:
SH3GL2_hom_cases_05.shape

(0, 28)

In [54]:
'Calculate frequency against the total N of cases: {:f}'.format(0/3359)

'Calculate frequency against the total N of cases: 0.000000'

In [55]:
# Explore HMZs for chr9:17795571 in controls
SH3GL2_hom_controls_05 = SH3GL2_recode[(SH3GL2_recode['rs370370071_A'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_05.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [56]:
SH3GL2_hom_controls_05.shape

(0, 28)

In [57]:
'Calculate frequency against the total N of controls: {:f}'.format(0/4153)

'Calculate frequency against the total N of controls: 0.000000'

In [58]:
# Unknown phenotype (-9)
SH3GL2_hom_U_05 = SH3GL2_recode[(SH3GL2_recode['rs370370071_A'] == 2) & (SH3GL2_recode['PHENOTYPE'] == -9)]
SH3GL2_hom_U_05.shape

(0, 28)

In [59]:
# Explore homozygotes for chr9:17795582 in cases
SH3GL2_hom_cases_06 = SH3GL2_recode[(SH3GL2_recode['rs139383722_A'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 2)]
SH3GL2_hom_cases_06.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [60]:
SH3GL2_hom_cases_06.shape

(0, 28)

In [61]:
'Calculate frequency against the total N of cases: {:f}'.format(0/3359)

'Calculate frequency against the total N of cases: 0.000000'

In [62]:
# Explore HMZs for chr9:17795582 in controls
SH3GL2_hom_controls_06 = SH3GL2_recode[(SH3GL2_recode['rs139383722_A'] == 2) & (SH3GL2_recode['PHENOTYPE'] == 1)]
SH3GL2_hom_controls_06.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4


In [63]:
SH3GL2_hom_controls_06.shape

(0, 28)

In [64]:
'Calculate frequency against the total N of controls: {:f}'.format(0/4153)

'Calculate frequency against the total N of controls: 0.000000'

In [65]:
# Unknown phenotype (-9)
SH3GL2_hom_U_06 = SH3GL2_recode[(SH3GL2_recode['rs139383722_A'] == 2) & (SH3GL2_recode['PHENOTYPE'] == -9)]
SH3GL2_hom_U_06.shape

(0, 28)
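
The per-variant cells above repeat the same three filters (cases, controls, unknown). A more compact alternative is to tally homozygotes for every variant column in one step and divide by the group sizes taken from the data rather than typed in. The following is only a sketch on a toy table with hypothetical samples; the column names follow the `<variant ID>_<allele>` pattern of the `.raw` header.

```python
import pandas as pd

# Toy .raw-style table (hypothetical samples and genotypes).
raw = pd.DataFrame({
    "IID": ["s1", "s2", "s3", "s4", "s5"],
    "PHENOTYPE": [2, 2, 1, 1, -9],
    "9:17793459_G": [0, 0, 2, 0, 0],
    "9:17793465_T": [0, 1, 0, 0, 2],
})
variant_cols = ["9:17793459_G", "9:17793465_T"]
labels = {2: "case", 1: "control", -9: "unknown"}

group = raw["PHENOTYPE"].map(labels)

# A dosage of 2 marks a homozygote for the counted allele.
hom_counts = (raw[variant_cols] == 2).groupby(group).sum()

# Frequency of homozygotes = count / number of samples in that phenotype group.
group_sizes = group.value_counts()
hom_freq = hom_counts.div(group_sizes, axis=0)

print(hom_counts)
print(hom_freq)
```

Deriving the denominators from the table avoids hard-coding the case and control totals in each cell.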

## Compound heterozygotes (comp het)

In [66]:
# Identify individuals heterozygous for the counted allele at ANY of the six variants
het = SH3GL2_recode[(SH3GL2_recode['rs142575982_G'] == 1) | (SH3GL2_recode['rs150543523_T'] == 1) | (SH3GL2_recode['rs200824412_T'] == 1) | (SH3GL2_recode['rs370568983_G'] == 1) | (SH3GL2_recode['rs370370071_A'] == 1) | (SH3GL2_recode['rs139383722_A'] == 1)]
# Drop rows with missing values; copy so that adding a column later does not act on a slice
het = het.dropna().copy()
print(het.shape)

(70, 28)


In [67]:
het.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs142575982_G,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4
6,HB-PD_INVAK293NGR,HB-PD_INVAK293NGR,0,0,1,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
238,HB-PD_INVUZ481JFB,HB-PD_INVUZ481JFB,0,0,2,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
311,HB-PD_INVZX511PA8,HB-PD_INVZX511PA8,0,0,2,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
472,LB-00265,LB-00265,0,0,1,-9,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
907,LB-00771,LB-00771,0,0,1,-9,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [68]:

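# Sum the dosages across the six variants; a total of 2 flags carriers of two heterozygous variants
het['compound'] = het['rs142575982_G'] + het['rs150543523_T'] + het['rs200824412_T'] + het['rs370568983_G'] + het['rs370370071_A'] + het['rs139383722_A']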
het['compound'].value_counts()

1    70
Name: compound, dtype: int64

In [69]:
het.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4,compound
6,HB-PD_INVAK293NGR,HB-PD_INVAK293NGR,0,0,1,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
238,HB-PD_INVUZ481JFB,HB-PD_INVUZ481JFB,0,0,2,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
311,HB-PD_INVZX511PA8,HB-PD_INVZX511PA8,0,0,2,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
472,LB-00265,LB-00265,0,0,1,-9,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
907,LB-00771,LB-00771,0,0,1,-9,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1


In [70]:
comp_het_info_cases = het[(het['PHENOTYPE'] == 2) & (het['compound'] == 2)]
print(comp_het_info_cases.shape)

(0, 29)


In [71]:
comp_het_info_control = het[(het['PHENOTYPE'] == 1) & (het['compound'] == 2)]
print(comp_het_info_control.shape)

(0, 29)


In [72]:
comp_het_info_cases.to_csv('/home/jupyter/SH3GL2/comp_het_info_cases.txt', sep = '\t', index=True)
comp_het_info_cases.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4,compound


In [73]:
comp_het_info_unknown = het[(het['PHENOTYPE'] == -9) & (het['compound'] == 2)]
print(comp_het_info_unknown.shape)

(0, 29)


In [74]:
comp_het_info_unknown.to_csv('/home/jupyter/SH3GL2/comp_het_info_unknown.txt', sep = '\t', index=True)
comp_het_info_unknown.head()

Unnamed: 0,FID,IID,PAT,MAT,SEX,PHENOTYPE,._C,._T,._C.1,._A,...,rs150543523_T,._C.3,rs200824412_T,rs370568983_G,rs370370071_A,rs139383722_A,._G.2,._G.3,._G.4,compound
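
An alternative way to flag candidate compound heterozygotes is to count, per individual, the number of variants with a dosage of exactly 1, which keeps homozygous genotypes from inflating the total. The sketch below uses a toy table with hypothetical samples. Note that unphased dosage data cannot show whether two heterozygous variants sit on different haplotypes, so these are candidates only.

```python
import pandas as pd

# Toy .raw-style table (hypothetical samples and genotypes).
raw = pd.DataFrame({
    "IID": ["s1", "s2", "s3", "s4"],
    "PHENOTYPE": [2, 1, 2, -9],
    "9:17793459_G": [1, 1, 0, 2],
    "9:17793465_T": [1, 0, 0, 0],
    "9:17795552_G": [0, 0, 0, 1],
})
variant_cols = ["9:17793459_G", "9:17793465_T", "9:17795552_G"]

# Number of heterozygous sites per individual (dosage exactly 1).
n_het_sites = (raw[variant_cols] == 1).sum(axis=1)

# Candidate compound heterozygotes carry at least two heterozygous variants.
candidates = raw[n_het_sites >= 2]
print(candidates[["IID", "PHENOTYPE"]])
```

Here `s4` is not flagged even though its dosages sum to 3, because it carries one homozygous and only one heterozygous variant.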


## Save out results

In [121]:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp -r {WORK_DIR} {WORKSPACE_BUCKET}')

Executing: gsutil -mu terra-18d8e41c cp -r /home/jupyter/SH3GL2/ gs://fc-f7a400c1-827e-48f8-b7b6-90c488a000a4


Copying file:///home/jupyter/SH3GL2/amp_pd_case_control.csv [Content-Type=text/csv]...
Copying file:///home/jupyter/SH3GL2/assoc_pheno_SH3GL2_rare_0.05.bim [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/SH3GL2/comp_het_info_unknown.txt [Content-Type=text/plain]...
Copying file:///home/jupyter/SH3GL2/chr9.psam [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/SH3GL2/assoc_pheno_PTPA.log [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/SH3GL2/comp_het_info_cases.txt [Content-Type=text/plain]...
Copying file:///home/jupyter/SH3GL2/pheno_SH3GL2.fam [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/SH3GL2/reduced_coding_nonsynonymous.txt [Content-Type=text/plain]...
Copying file:///home/jupyter/SH3GL2/IDs.txt [Content-Type=text/plain]...        
Copying file:///home/jupyter/SH3GL2/temp.log [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/SH3GL2/SEX.txt [Content-Type=text/plain]