# Intermediate Bioinformatics for Parkinson’s Disease Genetics

- **Module III:** Exploring copy number variation and runs of homozygosity from genotyping data

- **Authors:** Julie Lake on behalf of the Global Parkinson's Genetics Program (GP2) from Aligning Science Across Parkinson's (ASAP)

- **Estimated Computation and Runtime:**
    - **Estimated Specifications:** 4 CPUs, 15 GB memory, 250 GB Persistent Disk
    - **Estimated Runtime:** 1 h total
    
    
- **Date Last Updated:** 13-MARCH-2022

- **Update Description:** Updated notebook

## Quick Description:

Compared to single nucleotide polymorphisms (SNPs), copy number variants (CNVs) often encompass larger genomic regions, which can change gene dosage, regulation, or structure with important functional consequences. Although most genetic discoveries in Parkinson’s disease have focused on single nucleotide variants, several studies have identified both inherited and de novo CNVs that are associated with Parkinson’s disease. 

This notebook shows how to infer probabilistic estimates for the presence of copy number variation from genotyping data by assessing two parameters: the log R ratio (LRR) and the B allele frequency (BAF). 
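
To make the two parameters concrete: the log R ratio is the log2 ratio of observed to expected total probe intensity, and the B allele frequency is the estimated fraction of the signal that comes from the B allele. The sketch below computes the idealized values for a given copy number. The function names are illustrative and not part of any GP2 tooling, and real array data is noisier, with LRR values typically compressed relative to these theoretical ones.

```python
import math

def ideal_lrr(copy_number, normal_copies=2):
    """Idealized log R ratio: log2(observed copies / expected copies)."""
    if copy_number <= 0:
        return float('-inf')  # homozygous deletion: no signal in the ideal case
    return math.log2(copy_number / normal_copies)

def ideal_baf_clusters(copy_number):
    """Idealized BAF clusters: each possible fraction of B alleles among the copies."""
    if copy_number <= 0:
        return []  # BAF is undefined when no copies are present
    return [round(b / copy_number, 2) for b in range(copy_number + 1)]

# Heterozygous deletion (1 copy), normal diploid (2 copies), duplication (3 copies)
for cn in (1, 2, 3):
    print(cn, round(ideal_lrr(cn), 2), ideal_baf_clusters(cn))
```

Under these idealized assumptions, a duplication adds BAF clusters near 1/3 and 2/3 and raises the LRR above zero, while a heterozygous deletion removes the 0.5 cluster and lowers the LRR. This is consistent with the BAF windows (0.15 to 0.35 and 0.65 to 0.85) and the LRR cutoffs (below -0.2, above 0.2) used later in this notebook.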


## Background/Motivation:

What is copy number variation? A CNV occurs when there is an increase or decrease in the number of chromosomal copies in a given region of the genome that is greater than one thousand base pairs in length. CNVs represent an important form of human genetic variation and, as you can see in this diagram, can vary widely in size. 

Some CNVs are limited to the duplication or deletion of a single gene or part of a gene, while others can result in more severe changes such as gene triplications or changes that encompass several genes. Some CNVs also span regions where no known genes exist.


## Workflow Summary:

0. Getting started
1. Extract SNP metrics from raw genotype IDAT files
2. Data quality control analyses
3. Extract specific genes and identify SNPs within ranges of Log R Ratio and B-Allele Frequency
4. Calculate a probabilistic CNV dosage and plot 

## Workflow:

### [0. Getting Started](#0)

This section goes through:
* Setting up Python libraries, data path variables, and functions
* Copying data to workspace
* Checking the format of the imported file

### [1. Extract SNP metrics from raw genotype files](#1)

This section goes through:
* Making a working directory
* Extracting metrics from raw genotype data

### [2. Identify genes within ranges of Log R Ratio and B-Allele Frequency](#2)

This section goes through:
* Breaking down Log R Ratio and B-Allele Frequency per gene

### [3. Calculate a probabilistic CNV dosage and plot the data](#3)

This section goes through:
* Getting the max dosage per category of CNV
* Extracting the sample with the highest Log R Ratio deletion dosage
* Plotting B-Allele Frequencies and Log R Ratios

## 0. Getting started
<a id="0"></a>

### Set up environment

In [2]:
# Use the os package to interact with the environment
import os

# Bring in Pandas for Dataframe functionality
import pandas as pd

# numpy for basics
import numpy as np

# Use StringIO for working with file contents
from io import StringIO

# Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

# Enable interaction with the FireCloud API
from firecloud import api as fapi

# Import the IPython HTML rendering for displaying links to Google Cloud Console
# (IPython.display is the public module; importing display from IPython.core.display is deprecated)
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

# BigQuery for querying data
from google.cloud import bigquery

In [None]:
# %pip installs into the environment of the running kernel
%pip install plotly

In [None]:
import plotly.express as px
import plotly.graph_objects as go

### Set up billing project and data path variables

In [None]:
# Set up billing project and data path variables
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

## AMP-PD v2.5
## Explicitly define release v2.5 path 
AMP_RELEASE_PATH = 'gs://amp-pd-data/releases/2021_v2-5release_0510'
AMP_CLINICAL_RELEASE_PATH = f'{AMP_RELEASE_PATH}/clinical'

AMP_WGS_RELEASE_PATH = 'gs://amp-pd-genomics/releases/2021_v2-5release_0510/wgs'
AMP_WGS_RELEASE_PLINK_PATH = os.path.join(AMP_WGS_RELEASE_PATH, 'plink')
AMP_WGS_RELEASE_GATK_PATH = os.path.join(AMP_WGS_RELEASE_PATH, 'gatk')

## Print the information to check we are in the proper release and billing 
## This will be different for you, the user, depending on the billing project your workspace is on
print('Billing and Workspace')
print(f'Workspace Name: {WORKSPACE_NAME}')
print(f'Billing Project: {BILLING_PROJECT_ID}')
print(f'Workspace Bucket, where you can upload and download data: {WORKSPACE_BUCKET}')
print('')

print('AMP-PD v2.5')
print(f'Path to AMP-PD v2.5 Clinical Data: {AMP_CLINICAL_RELEASE_PATH}')
print(f'Path to AMP-PD v2.5 WGS Data: {AMP_WGS_RELEASE_PLINK_PATH}')
print('')

## GP2 v2.0
GP2_RELEASE_PATH = 'gs://gp2tier2/release2_06052022'
GP2_CLINICAL_RELEASE_PATH = f'{GP2_RELEASE_PATH}/clinical_data'
GP2_META_RELEASE_PATH = f'{GP2_RELEASE_PATH}/meta_data'
GP2_SUMSTAT_RELEASE_PATH = f'{GP2_RELEASE_PATH}/summary_statistics'

GP2_RAW_GENO_PATH = f'{GP2_RELEASE_PATH}/raw_genotypes'
GP2_IMPUTED_GENO_PATH = f'{GP2_RELEASE_PATH}/imputed_genotypes'
print('GP2 v2.0')
print(f'Path to GP2 v2.0 Clinical Data: {GP2_CLINICAL_RELEASE_PATH}')
print(f'Path to GP2 v2.0 Raw Genotype Data: {GP2_RAW_GENO_PATH}')
print(f'Path to GP2 v2.0 Imputed Genotype Data: {GP2_IMPUTED_GENO_PATH}')


### Set up functions

In [4]:
# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}')
    !$command

# Utility routines for reading files from Google Cloud Storage
def gcs_read_file(path):
    """Return the contents of a file in GCS"""
    contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}
    return '\n'.join(contents)
    
def gcs_read_csv(path, sep=None):
    """Return a DataFrame from the contents of a delimited file in GCS"""
    return pd.read_csv(StringIO(gcs_read_file(path)), sep=sep, engine='python')

# Utility routine for displaying a message and a link
def display_html_link(description, link_text, url):
    html = f'''
    <p>
    </p>
    <p>
    {description}
    <a target=_blank href="{url}">{link_text}</a>.
    </p>
    '''

    display(HTML(html))
    
# Utility routine for displaying a message and link to Cloud Console
def link_to_cloud_console_gcs(description, link_text, gcs_path):
    url = '{}?{}'.format(
        os.path.join('https://console.cloud.google.com/storage/browser',
                     gcs_path.replace("gs://","")),
        urllib.parse.urlencode({'userProject': BILLING_PROJECT_ID}))

    display_html_link(description, link_text, url)
    
# Get the data from a query
def bq_query(query):
    """Return the contents of a query against BigQuery"""
    return pd.read_gbq(
        query,
        project_id=BILLING_PROJECT_ID,
        dialect='standard')

## 1. Extract SNP metrics from raw genotype data
<a id="1"></a>

In [None]:
# Create a folder on your workspace
print("Making a working directory")
WORK_DIR = '/home/jupyter/CNVs/'
shell_do(f'mkdir -p {WORK_DIR}') # the f prefix marks an f-string, which evaluates expressions inside curly braces

In [None]:
# Copy the SNP metrics file from the Terra workspace bucket into the notebook working directory
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/park2_snp_metrics.csv {WORK_DIR}')

In [9]:
metrics_path = os.path.join(WORK_DIR, 'park2_snp_metrics.csv')
gene_df = pd.read_csv(metrics_path)

# Window bounds: the PARK2 (PRKN) coordinates used here, padded by 250 kb on each side
gene_start = 161347417 - 250000
gene_end = 162727766 + 250000
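
The Getting Started outline mentions checking the format of the imported file. A minimal sketch of such a check follows, assuming the column names that the later cells rely on (`Sample_ID`, `position`, `BAlleleFreq`, `LogRRatio`). The helper name `check_snp_metrics` and the toy DataFrame are hypothetical and only stand in for the real metrics file.

```python
import pandas as pd

REQUIRED_COLUMNS = ['Sample_ID', 'position', 'BAlleleFreq', 'LogRRatio']

def check_snp_metrics(df, start, end):
    """Validate the required columns and summarize rows inside [start, end]."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f'Missing expected columns: {missing}')
    in_window = df['position'].between(start, end)
    baf = df['BAlleleFreq'].dropna()
    return {
        'n_rows': len(df),
        'n_samples': df['Sample_ID'].nunique(),
        'n_in_window': int(in_window.sum()),
        # BAF is a frequency, so values outside [0, 1] point to a parsing problem
        'baf_out_of_range': int((~baf.between(0, 1)).sum()),
    }

# Toy data standing in for park2_snp_metrics.csv (all values are made up)
toy = pd.DataFrame({
    'Sample_ID': ['S1_a', 'S1_a', 'S2_b'],
    'position': [161_100_000, 161_500_000, 163_500_000],
    'BAlleleFreq': [0.0, 0.52, 1.0],
    'LogRRatio': [0.01, -0.35, 0.10],
})
summary = check_snp_metrics(toy, 161347417 - 250000, 162727766 + 250000)
print(summary)
```

In the notebook itself, the equivalent call would be `check_snp_metrics(gene_df, gene_start, gene_end)`.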

## 2. Identify genes within ranges of Log R Ratio and B-Allele Frequency
<a id="2"></a>

In [None]:
# Normally, we would loop through a list of genes; here we work through a single gene as an example
# Break down L2R and BAF per gene.
min_variants = 10

print(f"Remember, we are only calling CNVs for genes with at least {min_variants} variants.")
results = []

for sample in gene_df['Sample_ID'].unique():
    code = sample.split('_')[0]

    df = gene_df.loc[gene_df['Sample_ID']==sample].copy()
    
    if df.shape[0] < min_variants:
        print(f"This gene in sample {sample} does not meet the minimum variant count requirement.")
        results.append((sample, 'PARK2', df.shape[0], np.nan, np.nan, np.nan))
    else:
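        # Flag variants whose BAF falls in the ranges suggestive of a copy number gain
        df['BAF_insertion'] = np.where( (df['BAlleleFreq'].between(0.65, 0.85, inclusive='neither')) | (df['BAlleleFreq'].between(0.15, 0.35, inclusive='neither')), 1, 0)
        # Flag variants whose Log R Ratio suggests a loss (< -0.2) or a gain (> 0.2)
        df['L2R_deletion'] = np.where( df['LogRRatio'] < -0.2, 1, 0)
        df['L2R_insertion'] = np.where( df['LogRRatio'] > 0.2, 1, 0)
        # The mean of each 0/1 flag is a fraction between 0 and 1, not a percentage
        PERCENT_BAF_INSERTION = df['BAF_insertion'].mean()
        PERCENT_L2R_DELETION = df['L2R_deletion'].mean()
        PERCENT_L2R_INSERTION = df['L2R_insertion'].mean()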
        results.append((sample, 'PARK2', df.shape[0], PERCENT_BAF_INSERTION, PERCENT_L2R_DELETION, PERCENT_L2R_INSERTION))

output = pd.DataFrame(results, columns=('Sample_ID', 'GENE', 'NUM_VARIANTS', 'PERCENT_BAF_INSERTION', 'PERCENT_L2R_DELETION','PERCENT_L2R_INSERTION'))
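
The per-sample loop above can also be expressed with a single `groupby`, which tends to scale better when many samples or genes are processed. The following is a sketch under the same thresholds, not a replacement for the notebook's method. The function name `summarize_cnv_evidence` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

FRACTION_COLUMNS = ['PERCENT_BAF_INSERTION', 'PERCENT_L2R_DELETION', 'PERCENT_L2R_INSERTION']

def summarize_cnv_evidence(df, gene='PARK2', min_variants=10):
    """Per-sample fraction of variants that fall in each CNV-suggestive range."""
    baf = df['BAlleleFreq']
    flags = pd.DataFrame({
        'Sample_ID': df['Sample_ID'],
        'PERCENT_BAF_INSERTION': (baf.between(0.65, 0.85, inclusive='neither')
                                  | baf.between(0.15, 0.35, inclusive='neither')).astype(float),
        'PERCENT_L2R_DELETION': (df['LogRRatio'] < -0.2).astype(float),
        'PERCENT_L2R_INSERTION': (df['LogRRatio'] > 0.2).astype(float),
    })
    grouped = flags.groupby('Sample_ID', sort=False)
    out = grouped.mean()                       # mean of 0/1 flags = fraction flagged
    out.insert(0, 'NUM_VARIANTS', grouped.size())
    out.insert(0, 'GENE', gene)
    # Mask samples with too few variants, mirroring the np.nan rows in the loop
    too_few = out['NUM_VARIANTS'] < min_variants
    out.loc[too_few, FRACTION_COLUMNS] = np.nan
    return out.reset_index()

# Toy data (made-up values); min_variants is lowered so the example stays small
toy = pd.DataFrame({
    'Sample_ID': ['S1', 'S1', 'S1', 'S1', 'S2'],
    'BAlleleFreq': [0.0, 0.3, 0.7, 1.0, 0.5],
    'LogRRatio': [0.3, 0.25, 0.1, -0.5, 0.0],
})
result = summarize_cnv_evidence(toy, min_variants=2)
print(result)
```

The returned table has the same columns as `output`, so the downstream cells could consume it unchanged.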


## 3.  Calculate a probabilistic CNV dosage and plot the data
<a id="3"></a>

In [None]:
# Get max dosage per category of CNV
baf_ins_max = output['PERCENT_BAF_INSERTION'].max()
l2r_ins_max = output['PERCENT_L2R_INSERTION'].max()
l2r_del_max = output['PERCENT_L2R_DELETION'].max()


print(f'BAF Insertion Max: {baf_ins_max}')
print(f'L2R Insertion Max: {l2r_ins_max}')
print(f'L2R Deletion Max: {l2r_del_max}')

# Extract the sample with the highest L2R deletion dosage
plot_sample = output.loc[output['PERCENT_L2R_DELETION']==l2r_del_max, 'Sample_ID'].values[0]
plot_sample
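
Selecting with an equality filter and `.values[0]` returns only the first sample at the maximum, and it raises an `IndexError` when every dosage is missing, because no row compares equal to `NaN`. The sketch below shows one alternative that keeps tied samples visible. The toy table is hypothetical and only mimics the shape of `output`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `output` table (values are made up)
toy_output = pd.DataFrame({
    'Sample_ID': ['S1', 'S2', 'S3', 'S4'],
    'PERCENT_L2R_DELETION': [0.10, 0.62, np.nan, 0.62],
})

# idxmax skips NaN and returns the label of the first row holding the maximum
top_row = toy_output['PERCENT_L2R_DELETION'].idxmax()
top_sample = toy_output.loc[top_row, 'Sample_ID']

# nlargest lists the strongest candidates, so tied samples are not overlooked
candidates = toy_output.nlargest(2, 'PERCENT_L2R_DELETION')['Sample_ID'].tolist()
print(top_sample, candidates)
```

Reviewing several top candidates rather than a single sample can help when more than one individual carries a deletion in the region.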


In [12]:
# View the sample with the largest CNV dosage (highest probability of a CNV)
plot_df = gene_df.loc[gene_df['Sample_ID']==plot_sample].copy()

In [None]:
# Plot B Allele frequencies and Log R ratios

gene_label = 'PARK2'
low_X = gene_start
high_X = gene_end
BAF_fig = px.scatter(plot_df, x='position', y='BAlleleFreq', color='LogRRatio', color_continuous_scale='IceFire')
BAF_fig.update_xaxes(range=[low_X, high_X])

BAF_fig.add_shape(type="line",
    x0=gene_start, y0=0.5, x1=gene_end, y1=0.5,
    line=dict(color="Black",width=3)
)

annot_x = (gene_end + gene_start)/2
annotation = {
    # x -> horizontal position of the label (midpoint of the window)
    'x': annot_x,
    # y -> vertical position of the label
    'y': 0.55,
    'text': gene_label,  # label text
    'showarrow': True,  # whether to draw an arrow
    'arrowhead': 3,  # arrowhead style
    'font': {'size': 10, 'color': 'black'}  # font style
}

BAF_fig.add_annotation(annotation)
BAF_fig.update_layout(title='BAF', width=1200, height=450)
BAF_fig.show()

low_X = gene_start
high_X = gene_end 
LRR_fig = px.scatter(plot_df, x='position', y='LogRRatio', color='BAlleleFreq', color_continuous_scale='Twilight')
LRR_fig.update_xaxes(range=[low_X, high_X])

LRR_fig.add_shape(type="line",
    x0=gene_start, y0=0.0, x1=gene_end, y1=0.0,
    line=dict(color="Black",width=3)
)

annot_x = (gene_end + gene_start)/2
annotation = {
    # x -> horizontal position of the label (midpoint of the window)
    'x': annot_x,
    # y -> vertical position of the label
    'y': 0.1,
    'text': gene_label,  # label text
    'showarrow': True,  # whether to draw an arrow
    'arrowhead': 3,  # arrowhead style
    'font': {'size': 10, 'color': 'black'}  # font style
}

LRR_fig.add_annotation(annotation)
LRR_fig.update_layout(title='LRR', width=1200, height=450)
LRR_fig.show()