# SH3GL2 - Single gene analysis in GP2 data

## Description

This notebook analyses variants in a single gene, SH3GL2, using individual-level data from GP2 release 5.

### 0. Getting Started

- Loading Python libraries
- Defining functions
- Installing packages

### 1. Copy data from GP2 bucket to workspace

### 2. Extract SH3GL2

### 3. Annotate SH3GL2 variants

### 4. Extract coding/nonsynonymous variants

### 5. Calculate frequency in cases versus controls

### 6. Calculate frequency (homozygotes) in cases versus controls

### 7. Save out results

## Getting Started

### Loading Python libraries

In [1]:
# Use the os package to interact with the environment
import os

# Bring in Pandas for Dataframe functionality
import pandas as pd

# Numpy for basics
import numpy as np

# Use StringIO for working with file contents
from io import StringIO

# Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

# Enable interaction with the FireCloud API
from firecloud import api as fapi

# Import the IPython HTML rendering for displaying links to Google Cloud Console
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

# BigQuery for querying data
from google.cloud import bigquery

# Import sys for writing messages to stderr
import sys

### Defining functions

In [2]:
# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}', file=sys.stderr)
    !$command
    
def shell_return(command):
    print(f'Executing: {command}', file=sys.stderr)
    output = !$command
    return '\n'.join(output)

# Utility routine for printing a query before executing it
def bq_query(query):
    print(f'Executing: {query}', file=sys.stderr)
    return pd.read_gbq(query, project_id=BILLING_PROJECT_ID, dialect='standard')

# Utility routine for displaying a message and a link
def display_html_link(description, link_text, url):
    html = f'''
    <p>
    </p>
    <p>
    {description}
    <a target=_blank href="{url}">{link_text}</a>.
    </p>
    '''

    display(HTML(html))

# Utility routines for reading files from Google Cloud Storage
def gcs_read_file(path):
    """Return the contents of a file in GCS"""
    contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}
    return '\n'.join(contents)
    
def gcs_read_csv(path, sep=None):
    """Return a DataFrame from the contents of a delimited file in GCS"""
    return pd.read_csv(StringIO(gcs_read_file(path)), sep=sep, engine='python')

# Utility routine for displaying a message and link to Cloud Console
def link_to_cloud_console_gcs(description, link_text, gcs_path):
    url = '{}?{}'.format(
        os.path.join('https://console.cloud.google.com/storage/browser',
                     gcs_path.replace("gs://","")),
        urllib.parse.urlencode({'userProject': BILLING_PROJECT_ID}))

    display_html_link(description, link_text, url)
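
The `shell_do` and `shell_return` helpers above rely on IPython's `!` syntax, so they only work inside a notebook. As a minimal sketch, assuming the same commands ever need to run from a plain Python script on Linux, the standard-library `subprocess` module could be used instead. The name `shell_return_subprocess` is hypothetical and not part of the original notebook.

```python
import shlex
import subprocess
import sys


def shell_return_subprocess(command):
    """Run a command without a shell and return its stdout as a string."""
    print(f'Executing: {command}', file=sys.stderr)
    # shlex.split tokenises the command like a POSIX shell; no shell is spawned,
    # so local wildcards such as * are NOT expanded here.
    result = subprocess.run(
        shlex.split(command), capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip('\n')
```

Because `check=True` is set, a failing command raises `subprocess.CalledProcessError` instead of silently returning partial output.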

### Set paths

In [3]:
# Set up billing project and data path variables
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

## Print the information to check we are in the proper release and billing 
## This will be different for you, the user, depending on the billing project your workspace is on
print('Billing and Workspace')
print(f'Workspace Name: {WORKSPACE_NAME}')
print(f'Billing Project: {BILLING_PROJECT_ID}')
print(f'Workspace Bucket, where you can upload and download data: {WORKSPACE_BUCKET}')
print('')


## GP2 v5.0
## Explicitly define release v5.0 path 
GP2_RELEASE_PATH = 'gs://gp2tier2/release5_11052023'
GP2_CLINICAL_RELEASE_PATH = f'{GP2_RELEASE_PATH}/clinical_data'
GP2_META_RELEASE_PATH = f'{GP2_RELEASE_PATH}/meta_data'
GP2_RAW_GENO_PATH = f'{GP2_RELEASE_PATH}/raw_genotypes'
GP2_IMPUTED_GENO_PATH = f'{GP2_RELEASE_PATH}/imputed_genotypes'
print('GP2 v5.0')
print(f'Path to GP2 v5.0 Clinical Data: {GP2_CLINICAL_RELEASE_PATH}')
print(f'Path to GP2 v5.0 Raw Genotype Data: {GP2_RAW_GENO_PATH}')
print(f'Path to GP2 v5.0 Imputed Genotype Data: {GP2_IMPUTED_GENO_PATH}')


## AMP-PD v3.0
## Explicitly define release v3.0 path 
AMP_RELEASE_PATH = 'gs://amp-pd-data/releases/2022_v3release_1115'
AMP_CLINICAL_RELEASE_PATH = f'{AMP_RELEASE_PATH}/clinical'

AMP_WGS_RELEASE_PATH = 'gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS'
AMP_WGS_RELEASE_PLINK_PATH = os.path.join(AMP_WGS_RELEASE_PATH, 'plink')
AMP_WGS_RELEASE_PLINK_PFILES = os.path.join(AMP_WGS_RELEASE_PLINK_PATH, 'pfiles')

print('AMP-PD v3.0')
print(f'Path to AMP-PD v3.0 Clinical Data: {AMP_CLINICAL_RELEASE_PATH}')
print(f'Path to AMP-PD v3.0 WGS Data: {AMP_WGS_RELEASE_PLINK_PATH}')
print(f'Path to AMP-PD v3.0 WGS Data: {AMP_WGS_RELEASE_PLINK_PFILES}')
print('')

Billing and Workspace
Workspace Name: Endophilin-A
Billing Project: terra-18d8e41c
Workspace Bucket, where you can upload and download data: gs://fc-f7a400c1-827e-48f8-b7b6-90c488a000a4

GP2 v5.0
Path to GP2 v5.0 Clinical Data: gs://gp2tier2/release5_11052023/clinical_data
Path to GP2 v5.0 Raw Genotype Data: gs://gp2tier2/release5_11052023/raw_genotypes
Path to GP2 v5.0 Imputed Genotype Data: gs://gp2tier2/release5_11052023/imputed_genotypes
AMP-PD v3.0
Path to AMP-PD v3.0 Clinical Data: gs://amp-pd-data/releases/2022_v3release_1115/clinical
Path to AMP-PD v3.0 WGS Data: gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink
Path to AMP-PD v3.0 WGS Data: gs://amp-pd-genomics/releases/2022_v3release_1115/wgs-WB-DWGS/plink/pfiles



### Install packages

#### Install Plink 1.9 and Plink 2.0

In [4]:
%%bash

mkdir -p ~/tools
cd ~/tools

if test -e /home/jupyter/tools/plink; then
echo "Plink1.9 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink \n    -------"
wget -N http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip 
unzip -o plink_linux_x86_64_20190304.zip
echo -e "\n plink downloaded and unzipped in /home/jupyter/tools \n "

fi


if test -e /home/jupyter/tools/plink2; then
echo "Plink2 is already installed in /home/jupyter/tools/"

else
echo -e "Downloading plink2 \n    -------"
wget -N https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_avx2_20220603.zip
unzip -o plink2_linux_avx2_20220603.zip
echo -e "\n plink2 downloaded and unzipped in /home/jupyter/tools \n "

fi

Plink1.9 is already installed in /home/jupyter/tools/
Plink2 is already installed in /home/jupyter/tools/


In [5]:
%%bash
ls /home/jupyter/tools/

annovar
annovar.latest.tar.gz
LICENSE
plink
plink2
plink2_linux_avx2_20220603.zip
plink_linux_x86_64_20190304.zip
prettify
toy.map
toy.ped


#### Remove restrictions

In [6]:
%%bash

# chmod plink 1.9 
chmod u+x /home/jupyter/tools/plink

In [7]:
%%bash

# chmod plink 2.0
chmod u+x /home/jupyter/tools/plink2
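
Before calling the binaries in later cells, it can help to confirm that they exist and are executable. The following is a small sketch using only the standard library; the helper name `check_executables` is hypothetical.

```python
import os


def check_executables(paths):
    """Map each path to True if it is a regular file the current user may execute."""
    return {p: os.path.isfile(p) and os.access(p, os.X_OK) for p in paths}
```

In this notebook it could be called as `check_executables(['/home/jupyter/tools/plink', '/home/jupyter/tools/plink2'])`.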

### Install ANNOVAR

In [8]:
%%bash

# Install ANNOVAR:
# https://www.openbioinformatics.org/annovar/annovar_download_form.php

if test -e /home/jupyter/tools/annovar; then

echo "annovar is already installed in /home/jupyter/tools/"
else
echo "annovar is not installed"
cd /home/jupyter/tools/

wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz

tar xvfz annovar.latest.tar.gz

fi

annovar is already installed in /home/jupyter/tools/


In [9]:
%%bash
ls /home/jupyter/tools/

annovar
annovar.latest.tar.gz
LICENSE
plink
plink2
plink2_linux_avx2_20220603.zip
plink_linux_x86_64_20190304.zip
prettify
toy.map
toy.ped


#### Install ANNOVAR: Download sources of annotation 

In [10]:
%%bash

cd /home/jupyter/tools/annovar/

perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar clinvar_20140902 humandb/
#perl annotate_variation.pl -buildver hg38 -downdb cytoBand humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ensGene humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar exac03 humandb/ 
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp147 humandb/ 
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp30a humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad211_genome humandb/
#perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ljb26_all humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGeneMrna.fa.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_refGeneVersion.txt.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg38 build version, with files saved at the 'humandb' directory
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20140902.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20140902.txt.idx.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished d

In [11]:
%%bash
ls /home/jupyter/tools/annovar/

annotate_variation.pl
coding_change.pl
convert2annovar.pl
example
humandb
retrieve_seq_from_fasta.pl
table_annovar.pl
variants_reduction.pl


## AMR

## Copy data from GP2 bucket to workspace

In [4]:
# Make a directory
print("Making a working directory")
WORK_DIR = '/home/jupyter/SH3GL2_GP2/'
shell_do(f'mkdir -p {WORK_DIR}') # f-string: {WORK_DIR} is replaced by the value of the variable

Making a working directory


Executing: mkdir -p /home/jupyter/SH3GL2_GP2/


In [5]:
# Check directory where GP2 data is
shell_do(f'gsutil -u {BILLING_PROJECT_ID} ls {GP2_IMPUTED_GENO_PATH}/AMR/')

Executing: gsutil -u terra-18d8e41c ls gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/


gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5.log
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5.pgen
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5.psam
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5.pvar
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5_no_mac_filter.log
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5_no_mac_filter.pgen
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5_no_mac_filter.psam
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr10_AMR_release5_no_mac_filter.pvar
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr11_AMR_release5.log
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr11_AMR_release5.pgen
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr11_AMR_release5.psam
gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr11_AMR_release5.pvar
gs://gp2tie

In [6]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {GP2_IMPUTED_GENO_PATH}/AMR/chr9_AMR_release5_no_mac_filter* {WORK_DIR}')

Executing: gsutil -u terra-18d8e41c -m cp -r gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr9_AMR_release5_no_mac_filter* /home/jupyter/SH3GL2_GP2/


Copying gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr9_AMR_release5_no_mac_filter.log...
Copying gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr9_AMR_release5_no_mac_filter.pgen...
Copying gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr9_AMR_release5_no_mac_filter.psam...
Copying gs://gp2tier2/release5_11052023/imputed_genotypes/AMR/chr9_AMR_release5_no_mac_filter.pvar...
\ [4/4 files][948.2 MiB/948.2 MiB] 100% Done  28.1 MiB/s ETA 00:00:00           
Operation completed over 4 objects/948.2 MiB.                                    


### Create a covariate file with GP2 data

In [7]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} ls {GP2_CLINICAL_RELEASE_PATH}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp -r {GP2_CLINICAL_RELEASE_PATH}/master_key_release5_final.csv {WORK_DIR}')

Executing: gsutil -u terra-18d8e41c ls gs://gp2tier2/release5_11052023/clinical_data


gs://gp2tier2/release5_11052023/clinical_data/master_key_release5_final.csv
gs://gp2tier2/release5_11052023/clinical_data/release5_data_dictionary.csv


Executing: gsutil -u terra-18d8e41c -m cp -r gs://gp2tier2/release5_11052023/clinical_data/master_key_release5_final.csv /home/jupyter/SH3GL2_GP2/


Copying gs://gp2tier2/release5_11052023/clinical_data/master_key_release5_final.csv...
/ [1/1 files][  2.5 MiB/  2.5 MiB] 100% Done                                    
Operation completed over 1 objects/2.5 MiB.                                      


In [8]:
cov = pd.read_csv(f'{WORK_DIR}/master_key_release5_final.csv')
cov.columns

Index(['GP2ID', 'GP2sampleID', 'manifest_id', 'phenotype', 'pheno_for_qc',
       'other_pheno', 'sex_for_qc', 'age', 'age_of_onset', 'age_at_diagnosis',
       'age_at_death', 'race_for_qc', 'family_history', 'region_for_qc',
       'study', 'pruned', 'pruned_reason', 'label', 'related'],
      dtype='object')

In [9]:
cov_reduced = cov[['GP2sampleID','GP2sampleID','sex_for_qc', 'age', 'phenotype']]
cov_reduced.columns = ['FID','IID', 'SEX','AGE','PHENO']
#cov_reduced

In [10]:
conditions = [
    (cov_reduced['PHENO'] == "PD"),
    (cov_reduced['PHENO'] == "Control")]

In [11]:
choices = [2,1]
cov_reduced['PHENOTYPE'] = np.select(conditions, choices, default=-9).astype(np.int64)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [12]:
cov_reduced.reset_index(inplace=True)
cov_reduced.drop(columns=["index"], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [13]:
cov_reduced.drop(columns=['PHENO'], inplace=True)

In [14]:
sex = cov_reduced[['FID','IID','SEX']]
sex.to_csv(f'{WORK_DIR}/SEX.txt',index=False, sep="\t")

In [15]:
pheno = cov_reduced[['FID','IID','PHENOTYPE']]
pheno.to_csv(f'{WORK_DIR}/PHENO.txt',index=False, sep="\t")
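
The `SettingWithCopyWarning` messages above arise because `cov_reduced` is derived from a selection of `cov` and then modified in place. A sketch that builds the same sex and phenotype tables on an explicit copy is shown here; the function name `build_sex_pheno` is hypothetical, and the coding (2 = PD, 1 = Control, -9 = anything else) follows the cells above.

```python
import pandas as pd


def build_sex_pheno(master_key):
    """Return (sex, pheno) tables in PLINK FID/IID layout from a GP2 master key DataFrame."""
    # .copy() gives an independent frame, so later assignments do not trigger the warning
    cov = master_key[['GP2sampleID', 'sex_for_qc', 'phenotype']].copy()
    cov['FID'] = cov['GP2sampleID']
    cov['IID'] = cov['GP2sampleID']
    cov['SEX'] = cov['sex_for_qc']
    # PLINK case/control coding: 2 = case (PD), 1 = control, -9 = missing or other
    cov['PHENOTYPE'] = (
        cov['phenotype'].map({'PD': 2, 'Control': 1}).fillna(-9).astype('int64')
    )
    return cov[['FID', 'IID', 'SEX']], cov[['FID', 'IID', 'PHENOTYPE']]
```

The two returned frames can then be written with `to_csv(..., index=False, sep="\t")` exactly as in the cells above.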

## Extract SH3GL2

In [16]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

# SH3GL2: gene positions on hg38 (from https://useast.ensembl.org/index.html)
/home/jupyter/tools/plink2 \
--pfile chr9_AMR_release5_no_mac_filter \
--chr 9 \
--from-bp 17579066  \
--to-bp 17797124 \
--make-bed \
--out pheno_SH3GL2_AMR

PLINK v2.00a3.3LM AVX2 Intel (3 Jun 2022)      www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2_AMR.log.
Options in effect:
  --chr 9
  --from-bp 17579066
  --make-bed
  --out pheno_SH3GL2_AMR
  --pfile chr9_AMR_release5_no_mac_filter
  --to-bp 17797124

Start time: Thu Nov 16 11:57:54 2023
7450 MiB RAM detected; reserving 3725 MiB for main workspace.
Using up to 2 compute threads.
573 samples (249 females, 324 males; 573 founders) loaded from
chr9_AMR_release5_no_mac_filter.psam.
8856327 variants loaded from chr9_AMR_release5_no_mac_filter.pvar.
1 binary phenotype loaded (345 cases, 205 controls).
23310 variants remaining after main filters.
Writing pheno_SH3GL2_AMR.fam ... done.
Writing pheno_SH3GL2_AMR.bim ... done.
Writing pheno_SH3GL2_AMR.bed ... 0%done.
End time: Thu Nov 16 11:57:55 2023


### Visualize plink files bim, fam and bed

In [17]:
%%bash

# Visualize bim file
WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

head pheno_SH3GL2_AMR.bim

9	chr9:17579083:C:T	0	17579083	T	C
9	chr9:17579085:A:G	0	17579085	G	A
9	chr9:17579090:C:A	0	17579090	A	C
9	chr9:17579090:C:T	0	17579090	T	C
9	chr9:17579091:C:A	0	17579091	A	C
9	chr9:17579092:G:A	0	17579092	A	G
9	chr9:17579096:C:A	0	17579096	A	C
9	chr9:17579097:C:T	0	17579097	T	C
9	chr9:17579099:G:C	0	17579099	C	G
9	chr9:17579102:A:G	0	17579102	G	A
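
Each `.bim` row holds six fields: chromosome, variant ID, genetic distance, base-pair position, and two allele columns. As a sanity check on the extraction step, the sketch below lists any variants whose position falls outside a requested window. The helper name `positions_outside_window` is hypothetical, and the code assumes the file is tab-delimited, as in the output above.

```python
import csv


def positions_outside_window(bim_path, start_bp, end_bp):
    """Return variant IDs in a PLINK .bim file whose position is outside [start_bp, end_bp]."""
    outside = []
    with open(bim_path, newline='') as handle:
        for row in csv.reader(handle, delimiter='\t'):
            # .bim columns: chromosome, variant ID, cM, bp position, allele 1, allele 2
            variant_id, position = row[1], int(row[3])
            if not start_bp <= position <= end_bp:
                outside.append(variant_id)
    return outside
```

Called with the path to `pheno_SH3GL2_AMR.bim` and the window `17579066` to `17797124` used above, it would be expected to return an empty list if the filter behaved as intended.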


In [18]:
%%bash

# Visualize fam file
WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

head pheno_SH3GL2_AMR.fam

# Store the second column (IID) of the .fam file in a new IDs.txt file
#awk '{print $2}' pheno_SH3GL2_AMR.fam > IDs.txt

# Count lines in the .fam file
wc -l pheno_SH3GL2_AMR.fam

0	APGS_000157_s1	0	0	1	2
0	APGS_000540_s1	0	0	1	2
0	APGS_000551_s1	0	0	1	2
0	APGS_000609_s1	0	0	1	2
0	APGS_000724_s1	0	0	1	2
0	APGS_000981_s1	0	0	1	2
0	APGS_001003_s1	0	0	2	2
0	BBDP_000070_s1	0	0	2	-9
0	BBDP_000167_s1	0	0	1	2
0	BBDP_000219_s1	0	0	1	1
573 pheno_SH3GL2_AMR.fam
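
The sixth column of the `.fam` file holds the phenotype, which in PLINK 1.9 coding is 2 for cases, 1 for controls, and -9 (or 0) for missing. A minimal sketch for tallying these values with the standard library follows; the helper name `count_fam_phenotypes` is hypothetical.

```python
from collections import Counter


def count_fam_phenotypes(fam_path):
    """Tally the phenotype column (6th field) of a PLINK .fam file."""
    labels = {'2': 'case', '1': 'control'}
    counts = Counter()
    with open(fam_path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Anything other than 1 or 2 (e.g. -9 or 0) is counted as missing
            counts[labels.get(fields[5], 'missing')] += 1
    return dict(counts)
```

For `pheno_SH3GL2_AMR.fam`, the tally would be expected to agree with the PLINK log above (345 cases and 205 controls among 573 samples).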


In [19]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

# Turn binary files into VCF
/home/jupyter/tools/plink \
--bfile pheno_SH3GL2_AMR \
--recode vcf-fid \
--out pheno_SH3GL2_AMR_nomacfilter

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pheno_SH3GL2_AMR_nomacfilter.log.
Options in effect:
  --bfile pheno_SH3GL2_AMR
  --out pheno_SH3GL2_AMR_nomacfilter
  --recode vcf-fid

7450 MB RAM detected; reserving 3725 MB for main workspace.
23310 variants loaded from .bim file.
573 people (324 males, 249 females) loaded from .fam.
550 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 573 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%

## Annotate SH3GL2 variants

In [20]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR
export PATH=$PATH:/home/jupyter/tools/rvtests/third/tabix-0.2.6/

### Bgzip and Tabix
bgzip pheno_SH3GL2_AMR_nomacfilter.vcf

tabix -f -p vcf pheno_SH3GL2_AMR_nomacfilter.vcf.gz

In [21]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

### Annotate variants using ANNOVAR: https://annovar.openbioinformatics.org/en/latest/ 
perl /home/jupyter/tools/annovar/table_annovar.pl pheno_SH3GL2_AMR_nomacfilter.vcf.gz /home/jupyter/tools/annovar/humandb/ -buildver hg38 \
-out pheno_SH3GL2_AMR.annovar \
-remove -protocol refGene,clinvar_20140902 \
-operation g,f \
--nopolish \
-nastring . \
-vcfinput


NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 pheno_SH3GL2_AMR_nomacfilter.vcf.gz > pheno_SH3GL2_AMR.annovar.avinput>
NOTICE: Finished reading 23317 lines from VCF file
NOTICE: A total of 23310 locus in VCF file passed QC threshold, representing 21994 SNPs (13068 transitions and 8926 transversions) and 1316 indels/substitutions
NOTICE: Finished writing allele frequencies based on 12602562 SNP genotypes (7487964 transitions and 5114598 transversions) and 754068 indels/substitutions for 573 samples

NOTICE: Running with system command </home/jupyter/tools/annovar/table_annovar.pl pheno_SH3GL2_AMR.annovar.avinput /home/jupyter/tools/annovar/humandb/ -buildver hg38 -outfile pheno_SH3GL2_AMR.annovar -remove -protocol refGene,clinvar_20140902 -operation g,f --nopolish -nastring . -otherinfo>
-----------------------------------------------------------------
NOTICE: Processing operation=g protocol=refGene

NOTICE: Running with system c

## Extract coding and nonsynonymous variants

In [5]:
# Load the ANNOVAR multianno file
SH3GL2_AMR = pd.read_csv(f'{WORK_DIR}/pheno_SH3GL2_AMR.annovar.hg38_multianno.txt', sep = '\t')

In [6]:
SH3GL2_AMR['Chr_bp'] = SH3GL2_AMR['Chr'].astype(str) + ':' + SH3GL2_AMR['Start'].astype(str) + ':' + SH3GL2_AMR['End'].astype(str)

In [7]:
SH3GL2_AMR = SH3GL2_AMR.drop_duplicates(subset=['Chr_bp'])
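
Note that the `Chr_bp` key contains only chromosome, start, and end, so two different alternate alleles at the same position (for example `chr9:17579090:C:A` and `chr9:17579090:C:T` in the `.bim` file above) share one key, and `drop_duplicates` keeps only the first of them. If multi-allelic sites should be retained, one option is to include `Ref` and `Alt` in the key. This is sketched here as an assumption, not as the notebook's method, and the name `dedupe_by_allele` is hypothetical.

```python
import pandas as pd


def dedupe_by_allele(df):
    """Drop exact duplicate variants while keeping distinct alleles at the same position."""
    out = df.copy()
    out['Chr_bp_alleles'] = (
        out['Chr'].astype(str) + ':' + out['Start'].astype(str) + ':'
        + out['Ref'].astype(str) + ':' + out['Alt'].astype(str)
    )
    return out.drop_duplicates(subset=['Chr_bp_alleles'])
```

Using this key would change the downstream variant counts, so any comparison with the numbers reported below should account for that.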

In [8]:
SH3GL2_AMR = SH3GL2_AMR[SH3GL2_AMR['Gene.refGene']=='SH3GL2']

In [9]:
SH3GL2_AMR.count()

Chr             22496
Start           22496
End             22496
Ref             22496
Alt             22496
                ...  
Otherinfo582    22496
Otherinfo583    22496
Otherinfo584    22496
Otherinfo585    22496
Chr_bp          22496
Length: 597, dtype: int64

In [10]:
reduced_SH3GL2_AMR = SH3GL2_AMR[["Chr_bp", "Gene.refGene", 'Func.refGene','ExonicFunc.refGene']]
reduced_SH3GL2_AMR

| | Chr_bp | Gene.refGene | Func.refGene | ExonicFunc.refGene |
| --- | --- | --- | --- | --- |
| 0 | 9:17579083:17579083 | SH3GL2 | UTR5 | . |
| 1 | 9:17579085:17579085 | SH3GL2 | UTR5 | . |
| 2 | 9:17579090:17579090 | SH3GL2 | UTR5 | . |
| 4 | 9:17579091:17579091 | SH3GL2 | UTR5 | . |
| 5 | 9:17579092:17579092 | SH3GL2 | UTR5 | . |
| ... | ... | ... | ... | ... |
| 23304 | 9:17797040:17797040 | SH3GL2 | UTR3 | . |
| 23305 | 9:17797047:17797047 | SH3GL2 | UTR3 | . |
| 23306 | 9:17797052:17797052 | SH3GL2 | UTR3 | . |
| 23307 | 9:17797079:17797079 | SH3GL2 | UTR3 | . |


In [11]:
reduced_SH3GL2_AMR.to_csv(f'{WORKSPACE_BUCKET}/reduced_SH3GL2_AMR.txt', sep = '\t', index = False, header= True)

In [10]:
SH3GL2_AMR['Func.refGene'].value_counts()

intronic    22251
UTR3          130
exonic         62
UTR5           53
Name: Func.refGene, dtype: int64

In [11]:
SH3GL2_AMR['ExonicFunc.refGene'].value_counts()

.                    22434
nonsynonymous SNV       37
synonymous SNV          25
Name: ExonicFunc.refGene, dtype: int64

In [12]:
# Filter exonic variants
coding_AMR = SH3GL2_AMR[SH3GL2_AMR['Func.refGene'] == 'exonic']
coding_AMR.count()

Chr             62
Start           62
End             62
Ref             62
Alt             62
                ..
Otherinfo582    62
Otherinfo583    62
Otherinfo584    62
Otherinfo585    62
Chr_bp          62
Length: 597, dtype: int64

In [13]:
# Filter exonic, nonsynonymous variants
coding_nonsynonymous_AMR = SH3GL2_AMR[(SH3GL2_AMR['Func.refGene'] == 'exonic') & (SH3GL2_AMR['ExonicFunc.refGene'] == 'nonsynonymous SNV')]
coding_nonsynonymous_AMR.count()

Chr             37
Start           37
End             37
Ref             37
Alt             37
                ..
Otherinfo582    37
Otherinfo583    37
Otherinfo584    37
Otherinfo585    37
Chr_bp          37
Length: 597, dtype: int64

In [14]:
coding_nonsynonymous_AMR.to_csv(f'{WORK_DIR}/coding_nonsynonymous.txt', sep = '\t', index = False)
coding_nonsynonymous_AMR.head()

| | Chr | Start | End | Ref | Alt | Func.refGene | Gene.refGene | GeneDetail.refGene | ExonicFunc.refGene | AAChange.refGene | ... | Otherinfo577 | Otherinfo578 | Otherinfo579 | Otherinfo580 | Otherinfo581 | Otherinfo582 | Otherinfo583 | Otherinfo584 | Otherinfo585 | Chr_bp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 61 | 9 | 17579258 | 17579258 | C | T | exonic | SH3GL2 | . | nonsynonymous SNV | SH3GL2:NM_003026:exon1:c.C16T:p.L6F | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 9:17579258:17579258 |
| 18054 | 9 | 17747115 | 17747115 | A | C | exonic | SH3GL2 | . | nonsynonymous SNV | SH3GL2:NM_003026:exon2:c.A95C:p.D32A | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 9:17747115:17747115 |
| 19657 | 9 | 17761500 | 17761500 | C | G | exonic | SH3GL2 | . | nonsynonymous SNV | SH3GL2:NM_003026:exon3:c.C178G:p.P60A | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 9:17761500:17761500 |
| 22210 | 9 | 17786465 | 17786465 | C | T | exonic | SH3GL2 | . | nonsynonymous SNV | SH3GL2:NM_003026:exon4:c.C272T:p.A91V | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 9:17786465:17786465 |
| 22212 | 9 | 17786478 | 17786478 | G | C | exonic | SH3GL2 | . | nonsynonymous SNV | SH3GL2:NM_003026:exon4:c.G285C:p.E95D | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 9:17786478:17786478 |
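
The `AAChange.refGene` column packs gene, transcript, exon, cDNA change, and protein change into colon-separated fields, with entries for additional transcripts separated by commas. A short sketch for splitting the first transcript entry into named fields follows; the function name `parse_aachange` and the field names are hypothetical.

```python
def parse_aachange(value):
    """Split the first transcript entry of an ANNOVAR AAChange string into named fields."""
    first = value.split(',')[0]
    parts = first.split(':')
    if len(parts) != 5:
        return None  # e.g. '.' or an unexpected layout
    keys = ('gene', 'transcript', 'exon', 'cdna_change', 'protein_change')
    return dict(zip(keys, parts))
```

Applied to the first row above, `parse_aachange('SH3GL2:NM_003026:exon1:c.C16T:p.L6F')` yields `p.L6F` as the protein change.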


In [15]:
# Calculate frequency in cases versus controls

In [16]:
reduced_coding_nonsynonymous_AMR = coding_nonsynonymous_AMR[["Chr", "Start", "End", "Gene.refGene"]]
reduced_coding_nonsynonymous_AMR.to_csv(f'{WORK_DIR}/reduced_coding_nonsynonymous_AMR.txt', sep = '\t', index = False, header= False)

In [17]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

head reduced_coding_nonsynonymous_AMR.txt

9	17579258	17579258	SH3GL2
9	17747115	17747115	SH3GL2
9	17761500	17761500	SH3GL2
9	17786465	17786465	SH3GL2
9	17786478	17786478	SH3GL2
9	17786519	17786519	SH3GL2
9	17787464	17787464	SH3GL2
9	17789425	17789425	SH3GL2
9	17789473	17789473	SH3GL2
9	17789476	17789476	SH3GL2


In [18]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2_AMR --extract range reduced_coding_nonsynonymous_AMR.txt --assoc --ci 0.95 --out coding_nonsynonymous_pheno_SH3GL2_AMR --allow-no-sex

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_pheno_SH3GL2_AMR.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile pheno_SH3GL2_AMR
  --ci 0.95
  --extract range reduced_coding_nonsynonymous_AMR.txt
  --out coding_nonsynonymous_pheno_SH3GL2_AMR

14998 MB RAM detected; reserving 7499 MB for main workspace.
23310 variants loaded from .bim file.
573 people (324 males, 249 females) loaded from .fam.
550 phenotype values loaded from .fam.
--extract range: 23273 variants excluded.
--extract range: 37 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 573 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31

In [19]:
SH3GL2_AMR_freq = pd.read_csv(f'{WORK_DIR}/coding_nonsynonymous_pheno_SH3GL2_AMR.assoc', sep=r'\s+')
SH3GL2_AMR_freq

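| | CHR | SNP | BP | A1 | F_A | F_U | A2 | CHISQ | P | OR | SE | L95 | U95 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9 | chr9:17579258:C:T | 17579258 | T | 0.0 | 0 | C | | | | | | |
| 1 | 9 | chr9:17747115:A:C | 17747115 | C | 0.0 | 0 | A | | | | | | |
| 2 | 9 | chr9:17761500:C:G | 17761500 | G | 0.0 | 0 | C | | | | | | |
| 3 | 9 | chr9:17786465:C:T | 17786465 | T | 0.0 | 0 | C | | | | | | |
| 4 | 9 | chr9:17786478:G:C | 17786478 | C | 0.0 | 0 | G | | | | | | |
| 5 | 9 | chr9:17786519:A:G | 17786519 | G | 0.0 | 0 | A | | | | | | |
| 6 | 9 | chr9:17787464:T:G | 17787464 | G | 0.0 | 0 | T | | | | | | |
| 7 | 9 | chr9:17789425:G:T | 17789425 | T | 0.0 | 0 | G | | | | | | |
| 8 | 9 | chr9:17789473:C:G | 17789473 | G | 0.0 | 0 | C | | | | | | |
| 9 | 9 | chr9:17789476:C:T | 17789476 | T | 0.0 | 0 | C | | | | | | |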
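
In the `.assoc` output, `F_A` and `F_U` are the frequencies of allele `A1` in affected (case) and unaffected (control) samples. Rows where both are zero, like those shown above, are monomorphic in this cohort, which would explain the empty test statistics. A sketch for keeping only variants observed at least once follows; the function name `polymorphic_only` is hypothetical.

```python
import pandas as pd


def polymorphic_only(assoc):
    """Keep rows of a PLINK .assoc table where A1 was observed in cases or controls."""
    observed = (assoc['F_A'] > 0) | (assoc['F_U'] > 0)
    return assoc.loc[observed].reset_index(drop=True)
```

It could be applied as `polymorphic_only(SH3GL2_AMR_freq)` to focus on variants that carry any signal in this ancestry group.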


In [22]:
%%bash

WORK_DIR='/home/jupyter/SH3GL2_GP2/'
cd $WORK_DIR

/home/jupyter/tools/plink --bfile pheno_SH3GL2_AMR --extract range reduced_coding_nonsynonymous_AMR.txt --recode A --out coding_nonsynonymous_pheno_SH3GL2_AMR

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to coding_nonsynonymous_pheno_SH3GL2_AMR.log.
Options in effect:
  --bfile pheno_SH3GL2_AMR
  --extract range reduced_coding_nonsynonymous_AMR.txt
  --out coding_nonsynonymous_pheno_SH3GL2_AMR
  --recode A

14998 MB RAM detected; reserving 7499 MB for main workspace.
23310 variants loaded from .bim file.
573 people (324 males, 249 females) loaded from .fam.
550 phenotype values loaded from .fam.
--extract range: 23273 variants excluded.
--extract range: 37 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 573 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%

In [24]:

SH3GL2_AMR_recode = pd.read_csv(f'{WORK_DIR}/coding_nonsynonymous_pheno_SH3GL2_AMR.raw', sep=r'\s+')
SH3GL2_AMR_recode.head()

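| | FID | IID | PAT | MAT | SEX | PHENOTYPE | chr9:17579258:C:T_T | chr9:17747115:A:C_C | chr9:17761500:C:G_G | chr9:17786465:C:T_T | ... | chr9:17793471:C:T_T | chr9:17793474:A:T_T | chr9:17793477:C:T_T | chr9:17795552:A:G_G | chr9:17795560:G:T_T | chr9:17795571:G:A_A | chr9:17795624:G:A_A | chr9:17795649:A:G_G | chr9:17795682:A:G_G | chr9:17795685:G:A_A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | APGS_000157_s1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | APGS_000540_s1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | APGS_000551_s1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | APGS_000609_s1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | APGS_000724_s1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |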
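
With `--recode A`, each variant column in the `.raw` file counts copies of the named allele (0, 1, or 2, with missing genotypes as NA). Step 6 of the outline asks for homozygote frequencies in cases versus controls; the sketch below derives carrier counts from this table under the assumption that `PHENOTYPE` uses the 2 = case, 1 = control coding set earlier. The function name `genotype_counts_by_status` is hypothetical.

```python
import pandas as pd


def genotype_counts_by_status(raw):
    """Count heterozygous (1) and homozygous (2) carriers per variant, split by case/control."""
    meta_cols = ['FID', 'IID', 'PAT', 'MAT', 'SEX', 'PHENOTYPE']
    variant_cols = [c for c in raw.columns if c not in meta_cols]
    rows = []
    for label, code in (('case', 2), ('control', 1)):
        # Samples with a missing phenotype (-9) match neither code and are excluded
        subset = raw.loc[raw['PHENOTYPE'] == code, variant_cols]
        for variant in variant_cols:
            rows.append({
                'variant': variant,
                'status': label,
                'n_het': int((subset[variant] == 1).sum()),
                'n_hom': int((subset[variant] == 2).sum()),
                'n_genotyped': int(subset[variant].notna().sum()),
            })
    return pd.DataFrame(rows)
```

Dividing `n_hom` by `n_genotyped` within each status group gives the homozygote frequency for that variant.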


## Save out results

In [55]:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp -r {WORK_DIR} {WORKSPACE_BUCKET}')

Executing: gsutil -mu terra-18d8e41c cp -r /home/jupyter/SH3GL2_GP2/ gs://fc-f7a400c1-827e-48f8-b7b6-90c488a000a4


Copying file:///home/jupyter/SH3GL2_GP2/pheno_SH3GL2_EUR.annovar.avinput [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/SH3GL2_GP2/pheno_SH3GL2_AJ.annovar.hg38_multianno.vcf [Content-Type=text/vcard]...
Copying file:///home/jupyter/SH3GL2_GP2/chr9_AJ_release5.pgen [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of c