# Intermediate Bioinformatics for Parkinson’s Disease Genetics


 - **Module II:** Investigating the cumulative effect of genetic variation on Parkinson’s disease (Haplotypes and Rare variant analyses)
 
 - **Author(s):** Kajsa Brolin on behalf of the Global Parkinson's Genetics Program (GP2) from Aligning Science Across Parkinson's (ASAP)
 
 - **Estimated Computation and Runtime:**
   - Estimated Specifications: 4 CPUs/15 GB, persistent disk size 300 GB
   - Estimated Runtime: 8.5 h total. Identifying haplotype blocks for the different populations takes the most time (~7 h). Precomputed output files are provided that can be used to reduce the runtime to about 1.5 h.
   
 - **Application configuration:** Legacy R/Bioconductor (R 4.1.1, Bioconductor 3.13, Python 3.7.10)
 
 - **Date Last Updated:** 15-AUGUST-2022
     - Update Description: The --kernel flag was removed from the pathway-based analysis. 

## Quick Description:

This notebook shows how to identify haplotype blocks and haplotypes in your dataset and how to analyse their association with PD, focusing on haplotypes in SNCA as an example. It also sets up gene-based rare variant analyses (again with SNCA as the example) and pathway-based burden analyses, using two lysosomal pathways as an example.

## Background/Motivation:

Haplotype analyses have been crucial for detecting genetic risk factors in PD. Haplotypes vary across populations, and this variation is key for fine-mapping because it can help narrow down loci of interest or even uncover rare variants linked to PD. Identifying rare variants associated with PD is difficult for several reasons (for example, very large sample sizes are needed). Given these limitations, current methods explore the cumulative effect of multiple rare variants at the gene, multi-gene, or region level. Examples are burden tests and the non-burden sequence kernel association test (SKAT), both covered in this course.

!NOTE!
Information from https://www.cog-genomics.org/plink/1.9/ld regarding haplotypes in PLINK:
The .blocks file is valid input for PLINK 1.07's --hap command. However, the --hap... family of flags has not been reimplemented in PLINK 1.9 due to poor phasing accuracy (and, consequently, inferior haplotype likelihood/frequency estimates) relative to other software; for now, we recommend using BEAGLE 3.3.2 instead of PLINK for case/control haplotype association analysis. 

Here, we will look at some examples of haplotype and rare variant analyses.

## Workflow summary

  0. Get started - Set up environment, download needed software, and import files
  
**Haplotype analyses**
  1. Identifying haplotype blocks and haplotypes in your dataset
  2. Analyzing associations between identified haplotypes in SNCA and PD risk in your dataset
  3. Comparing the size of haplotype blocks at PD risk loci across different populations
  
**Rare variant analyses**

  4. Gene/Region-based analysis (including annotation using ANNOVAR)
  5. Pathway-based analysis

## Workflow

### [0. Get started - Set up environment, download needed software, and import files](#0)

This section goes through:
* Setting up Python libraries, data path variables, and functions
* Setting up billing variables
* Installing PLINK v1.9
* Installing rpy2 to run R in Python
* Installing RVTESTS
* Installing ANNOVAR
* Downloading the files that ANNOVAR needs
* Copying the needed data/files to your workspace

### [1. Haplotype analyses - Identifying haplotype blocks and haplotypes in your dataset](#1)

This section goes through:
* How to extract the region of interest
* Calculate haplotype blocks

### [2. Haplotype analyses - Analyzing associations between identified haplotypes in SNCA and PD risk in your dataset](#2)

This section goes through:
* Recode your data to ped and map format
* Set up libraries and load packages for R analyses
* Add SNP name to the PED file
* Create haplotype files
* Run association analyses (both non-adjusted and adjusted) for each haplotype

### [3. Haplotype analyses - Comparing the size of haplotype blocks at PD risk loci across different populations](#3)

This section goes through:
* Extract cases from both the 1000G YRI and ACB populations
* Extract common SNPs between both datasets
* Calculate haplotype blocks in both cohorts (this step takes quite a long time)
* Compare the length and number of SNPs of haplotypes at the 92 risk loci between the two cohorts

### [4. Rare variant analyses - Gene/Region-based analysis](#4)

This section goes through:
* How to extract the region of interest (again...different data though!)
* Recode the PLINK files to vcf (+bgzip and tabix the vcf file)
* Annotate the vcf file using ANNOVAR
* Clean up the output file, generate a list of non-coding variants and extract them from the data
* Recode to generate a .ped file with phenotype info to use in RVTESTS
* Bgzip, tabix and run rare variant analyses in RVTESTS

### [5. Rare variant analyses - Pathway-based analysis](#5)

This section goes through:
* Run gene-set based pathway test in RVTESTS
* Recode the PLINK format file to vcf and bgzip
* Run the pathway-based burden analysis in RVTESTS

## 0. Getting started
<a id="0"></a>

Set up cells copied from the GP2 Beginners Bioinformatics course, https://github.com/GP2-TNC-WG/GP2-Bioinformatics-course/blob/master/Introduction.md

In [1]:
# Use the os package to interact with the environment
import os

# Bring in Pandas for Dataframe functionality
import pandas as pd

# numpy for basics
import numpy as np

# Use StringIO for working with file contents
from io import StringIO

# Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

# Enable interaction with the FireCloud API
from firecloud import api as fapi

# Import the IPython HTML rendering for displaying links to Google Cloud Console
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

# BigQuery for querying data
from google.cloud import bigquery

Set up billing project and data path variables

In [2]:
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

print(BILLING_PROJECT_ID)

terra-9b559320


Set up useful functions

In [3]:
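# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}')
    !$command

# Utility routine for reading a GCS file into a string.
# NOTE: gcs_read_csv below calls this helper, but the original cell did not define it.
# This version is an assumed reconstruction based on `gsutil cat`.
def gcs_read_file(path):
    """Return the contents of a file in GCS"""
    contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}
    return '\n'.join(contents)

def gcs_read_csv(path, sep=None):
    """Return a DataFrame from the contents of a delimited file in GCS"""
    return pd.read_csv(StringIO(gcs_read_file(path)), sep=sep, engine='python')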

# Utility routine for displaying a message and a link
def display_html_link(description, link_text, url):
    html = f'''
    <p>
    </p>
    <p>
    {description}
    <a target=_blank href="{url}">{link_text}</a>.
    </p>
    '''

    display(HTML(html))

**Haplotype analyses:**
* Programs that you need:
    - PLINK v1.9
    - R
* Files that you will need:
    - QC:ed pre-imputation PLINK binary files containing cases and controls (.fam, .bim, .bed files)
    - Covariate file: covariateFile_Haplo.txt
    - QC:ed pre-imputation PLINK binary files containing cases of different ancestries (.fam, .bim, .bed files)
    - PD loci file, txt file containing SNP, CHR, and BP for 92 reported PD risk loci: 92SNPs_haplotype.txt
    
**Rare variant analyses:**
* Programs that you need:
    - PLINK v1.9
    - RVTESTS
    - ANNOVAR
* Files that you will need:
    - Imputed QC:ed PLINK binary files with case control status (IMPUTED.HARDCALLS.Demo_formatted.bed, bim and fam files). These have been filtered for MAF > 0.001 and Rsq (imputation quality score) > 0.8 (Genome build GRCh37)
    - RefFlat file for RVTESTS: refFlat_hg19.txt. Can be downloaded from: http://qbrc.swmed.edu/zhanxw/seqminer/data/refFlat_hg19.txt.gz (for GRCh37)
    - Covariate file (we will use the same as for the haplotype analyses): covariateFile_Haplo.txt
    - SetFile: Needed to run the pathway-based burden analysis. Each pathway has its own line, and the gene coordinates follow in this format (comma-separated, with no spaces): 16:68245303-68261058,15:82659280-82709946
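
The coordinate format described above can be sketched in Python as follows. This is a minimal illustration, not part of the original notebook: the set name `PATHWAY_A` is hypothetical, and the layout of a set name followed by whitespace and then the ranges is an assumption that should be checked against `lysosomal-setfile-example.txt`.

```python
def format_set_line(set_name, regions):
    """Build one set-file line: set name, a tab, then comma-joined CHR:START-END ranges."""
    ranges = ",".join(f"{chrom}:{start}-{end}" for chrom, start, end in regions)
    return f"{set_name}\t{ranges}"

def parse_set_line(line):
    """Split a set-file line back into its name and a list of (chrom, start, end)."""
    set_name, ranges = line.split(None, 1)
    regions = []
    for item in ranges.strip().split(","):
        chrom, span = item.split(":")
        start, end = span.split("-")
        regions.append((chrom, int(start), int(end)))
    return set_name, regions

# Hypothetical pathway name; the two ranges are the ones quoted in the text above
example = format_set_line(
    "PATHWAY_A",
    [("16", 68245303, 68261058), ("15", 82659280, 82709946)],
)
print(example)
print(parse_set_line(example))
```

Parsing the line back is a cheap way to catch stray spaces or malformed ranges before handing the file to RVTESTS.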

**Install PLINK v1.9**

In [4]:
%%capture
%%bash

if test -e /home/jupyter/plink; then

echo "Plink is already installed in /home/jupyter/"
else
echo "Plink is not installed"
cd /home/jupyter

wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip 

unzip -o plink_linux_x86_64_20190304.zip

fi

**Install rpy2 to run R in Python**

In [5]:
!pip install rpy2



In [6]:
#Import rpy2
import rpy2.rinterface

In [4]:
%load_ext rpy2.ipython

**Install RVTESTS**

In [8]:
%%bash

if test -e /home/jupyter/rvtests; then

echo "rvtests is already installed"
else
echo "rvtests is not installed"
cd /home/jupyter

git clone https://github.com/zhanxw/rvtests

fi

rvtests is not installed


Cloning into 'rvtests'...


In [9]:
%%capture
%%bash

cd /home/jupyter/rvtests
make

**Install ANNOVAR**

Important! You need to add the download link after registration on the annovar website: https://www.openbioinformatics.org/annovar/annovar_download_form.php

In [10]:
%%capture
%%bash

if test -e /home/jupyter/annovar; then

echo "annovar is already installed in /home/jupyter"
else
echo "annovar is not installed"
cd /home/jupyter/

wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz

tar xvfz annovar.latest.tar.gz

fi

* Download files that are needed in ANNOVAR:
- These files are quite large and take some time to download

In [11]:
%%capture
%%bash
cd /home/jupyter/annovar/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb cytoBand humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ensGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar exac03 humandb/ 
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp147 humandb/ 
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar dbnsfp30a humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad211_genome humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb26_all humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar clinvar_20140902 humandb/

**Check that you can see all the installed programs**

In [12]:
%%bash
cd /home/jupyter/
ls

annovar
annovar.latest.tar.gz
GP2 Bioinformatics Course 2
jupyter.log
LICENSE
lost+found
packages
plink
plink_linux_x86_64_20190304.zip
prettify
rvtests
toy.map
toy.ped
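
The same check can be done programmatically. The sketch below is an optional addition, not part of the original notebook; the helper name `missing_paths` is hypothetical, and the three paths are the install locations used in the cells above.

```python
import os

def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Install locations used earlier in this notebook
expected = [
    "/home/jupyter/plink",
    "/home/jupyter/rvtests",
    "/home/jupyter/annovar",
]

missing = missing_paths(expected)
print("Missing:", missing if missing else "none")
```

An empty list means all three programs are where the later cells expect them.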


**Copy the needed data/files to your workspace**

General tip: save data to your Persistent Disk (either /home/jupyter-user/ or /home/jupyter/) - you can create directories within this. That way you can delete and recreate your Cloud Environment if there are any issues, and the data saved on the Persistent Disk will be retained. See: https://support.terra.bio/hc/en-us/articles/360047318551-Detachable-Persistent-Disks-

The files needed for the tutorial are found in this module. They need to be uploaded manually to your bucket before importing them to the workspace.

The files have previously undergone QC and been randomly assigned case/control status. The data in the covariate file have also been generated randomly and do not represent true data. We will import all files needed for the different analyses at this step.

* Make directories for the files

In [5]:
%%bash
mkdir -p /home/jupyter/1000G_TEST/
mkdir -p /home/jupyter/Burden/

* Copy files into your workspace > Data > Files
* Import files from the workspace bucket into the workspace

In [6]:
#HAPLOTYPE ANALYSES
#QC:ed case/control files (non-imputed)
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/FILTERED.test_formatted.* /home/jupyter/1000G_TEST/')
#Covariate file
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/covariateFile_Haplo.txt /home/jupyter/1000G_TEST/')
#QC:ed case files (non-imputed) with different ancestry (ACB population)
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/ACB_ALL_1000.* /home/jupyter/1000G_TEST/')
#QC:ed case files (non-imputed) with different ancestry (YRI population)
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/YRI_ALL_1000g.* /home/jupyter/1000G_TEST/')
#PD risk loci file:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/92SNPs_haplotype.txt /home/jupyter/1000G_TEST/')

#RARE VARIANT ANALYSES
#Imputed QC:ed PLINK binary files with case control status (IMPUTED.HARDCALLS.Demo_formatted)
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/IMPUTED.HARDCALLS.Demo_formatted.* /home/jupyter/Burden/')
#RefFlat txt file:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/refFlat_hg19.txt /home/jupyter/Burden/')
#Covariate file
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/covariateFile_Haplo.txt /home/jupyter/Burden/')
#SetFile
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/lysosomal-setfile-example.txt /home/jupyter/Burden/')


Executing: gsutil -mu terra-9b559320 cp gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/FILTERED.test_formatted.* /home/jupyter/1000G_TEST/
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/FILTERED.test_formatted.bim...
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/FILTERED.test_formatted.bed...
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/FILTERED.test_formatted.fam...
\ [3/3 files][108.0 MiB/108.0 MiB] 100% Done                                    
Operation completed over 3 objects/108.0 MiB.                                    
Executing: gsutil -mu terra-9b559320 cp gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/covariateFile_Haplo.txt /home/jupyter/1000G_TEST/
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/covariateFile_Haplo.txt...
/ [1/1 files][ 55.9 KiB/ 55.9 KiB] 100% Done                                    
Operation completed over 1 objects/55.9 KiB.                                     
Executing: gsutil -mu terra-9b559320 cp gs://fc-c04486b2-8d7e-4359-a60

* Check that you can see the data in the folders:

In [7]:
%%bash
ls /home/jupyter/1000G_TEST/

92SNPs_haplotype.txt
ACB_ALL_1000.bed
ACB_ALL_1000.bim
ACB_ALL_1000.fam
covariateFile_Haplo.txt
FILTERED.test_formatted.bed
FILTERED.test_formatted.bim
FILTERED.test_formatted.fam
YRI_ALL_1000g.bed
YRI_ALL_1000g.bim
YRI_ALL_1000g.fam


In [8]:
%%bash
ls /home/jupyter/Burden/

covariateFile_Haplo.txt
IMPUTED.HARDCALLS.Demo_formatted.bed
IMPUTED.HARDCALLS.Demo_formatted.bim
IMPUTED.HARDCALLS.Demo_formatted.fam
lysosomal-setfile-example.txt
refFlat_hg19.txt


## 1. Haplotype analysis - Identifying haplotype blocks and haplotypes in your dataset
<a id="1"></a>

* For this, we will focus on a specific genomic region and will use the region of the gene SNCA as an example

* SNCA position on GRCh37/hg19: chr4:90645250-90759466 (Ensembl)

* First, we create a new folder for the analyses

In [9]:
%%bash
mkdir -p /home/jupyter/1000G_TEST/SNCA_hap

**Extract the region of interest**

In [10]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/FILTERED.test_formatted \
--chr 4 \
--from-bp 90645250 \
--to-bp 90759466 \
--maf 0.01 \
--make-bed \
--out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/FILTERED.test_formatted --chr 4 --from-bp 90645250 --to-bp 90759466 --maf 0.01 --make-bed --out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA.log.
Options in effect:
  --bfile /home/jupyter/1000G_TEST/FILTERED.test_formatted
  --chr 4
  --from-bp 90645250
  --maf 0.01
  --make-bed
  --out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA
  --to-bp 90759466

15000 MB RAM detected; reserving 7500 MB for main workspace.
43 out of 869296 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 404 founders and 0 nonfoun
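
To preview which variants fall inside a region before running PLINK, the positional part of the filter can be sketched with pandas. This is an illustrative addition under stated assumptions: the helper name `variants_in_region` and the toy `.bim` rows are made up, and the sketch does not apply the `--maf` filter, so its counts can differ from PLINK's.

```python
from io import StringIO
import pandas as pd

# Column layout of a PLINK .bim file (the file itself has no header line)
BIM_COLUMNS = ["CHR", "SNP", "CM", "BP", "A1", "A2"]

def variants_in_region(bim, chrom, start_bp, end_bp):
    """Return the .bim rows on `chrom` with start_bp <= BP <= end_bp."""
    df = pd.read_csv(bim, sep=r"\s+", header=None, names=BIM_COLUMNS,
                     dtype={"CHR": str})
    in_region = (df["CHR"] == str(chrom)) & df["BP"].between(start_bp, end_bp)
    return df[in_region]

# Toy data (made up): one variant before the region, two inside it
# (one exactly on the end boundary), and one on another chromosome
toy_bim = StringIO(
    "4\t4:90640000\t0\t90640000\tA\tG\n"
    "4\t4:90645674\t0\t90645674\tC\tT\n"
    "4\t4:90759466\t0\t90759466\tG\tA\n"
    "5\t5:90700000\t0\t90700000\tT\tC\n"
)

hits = variants_in_region(toy_bim, 4, 90645250, 90759466)
print(len(hits), "variants in region")
```

With real data, pass the path to the `.bim` file instead of the `StringIO` object.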

**Calculate haplotype blocks**

The flag --blocks in PLINK estimates haplotype blocks via Haploview's interpretation of the block definition suggested by Gabriel S et al. (2002) The Structure of Haplotype Blocks in the Human Genome. Each block's variant IDs are written to plink.blocks, and a longer report with position information is written to plink.blocks.det.

In [11]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA \
--maf 0.01 \
--blocks \
--out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA_blocks')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA --maf 0.01 --blocks --out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA_blocks
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA_blocks.log.
Options in effect:
  --bfile /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA
  --blocks
  --maf 0.01
  --out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA_blocks

15000 MB RAM detected; reserving 7500 MB for main workspace.
40 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 404 founders and 0 nonfounders present.
Calculating allele frequencies... 101112131415161

Two files will be generated (besides the log file):
* FILTERED.test_formatted.SNCA_blocks.blocks
* FILTERED.test_formatted.SNCA_blocks.blocks.det

.blocks:
The identified blocks. Contains one line per block, each starting with an asterisk followed by the variant IDs, e.g.,
* 4:90645674 4:90653134 4:90663542 4:90663670 4:90668019

.blocks.det:
A more detailed file about the blocks. It has a header line, followed by one line per block with the following six fields:

* CHR	Chromosome code
* BP1	First base-pair coordinate
* BP2	Last base-pair coordinate
* KB	Block length in kb
* NSNPS	Number of variants in block
* SNPS '|'-delimited variant IDs
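
As a minimal sketch of how the `.blocks.det` file could be inspected in Python (not part of the original notebook), the six fields above can be loaded with pandas and the `|`-delimited SNPS column split into a list. The helper name `read_blocks_det` and the toy rows are made up for illustration.

```python
from io import StringIO
import pandas as pd

def read_blocks_det(det):
    """Read a whitespace-delimited .blocks.det file and split SNPS into a list."""
    df = pd.read_csv(det, sep=r"\s+")
    df["SNP_LIST"] = df["SNPS"].apply(lambda ids: ids.split("|"))
    return df

# Toy data (made up) with the six documented fields
toy_det = StringIO(
    "CHR BP1 BP2 KB NSNPS SNPS\n"
    "4 1000 6000 5.001 3 rsA|rsB|rsC\n"
    "4 9000 9500 0.501 2 rsD|rsE\n"
)

blocks = read_blocks_det(toy_det)

# Sanity check: NSNPS should equal the number of IDs listed in SNPS
counts_match = (blocks["SNP_LIST"].apply(len) == blocks["NSNPS"]).all()

print(len(blocks), "blocks; mean length (kb):", blocks["KB"].mean())
print("NSNPS matches SNPS list:", counts_match)
```

With real output, pass the path to `FILTERED.test_formatted.SNCA_blocks.blocks.det` instead of the `StringIO` object.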


## 2. Haplotype analysis - Analyzing associations between identified haplotypes in SNCA and PD risk in your dataset
<a id="2"></a>

* Recode your data to ped and map format (we need those files for the analyses in R)

In [12]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA \
--recode \
--out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA --recode --out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA.log.
Options in effect:
  --bfile /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA
  --out /home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA
  --recode

15000 MB RAM detected; reserving 7500 MB for main workspace.
40 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 404 founders and 0 nonfounders present.
Calculating allele frequencies... 101112131415161718192021222324252627282930313233343536373839
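
The recoded files have a simple layout: each `.map` line describes one variant, and each `.ped` line holds six leading fields (family ID, individual ID, father, mother, sex, phenotype) followed by two alleles per variant. The sketch below is an optional sanity check under that assumption; the helper name `check_ped_against_map` and the toy data are made up.

```python
from io import StringIO

def check_ped_against_map(ped_lines, map_lines):
    """Return (n_variants, bad_rows), where bad_rows lists the 1-based .ped rows
    whose field count is not 6 + 2 * n_variants."""
    n_variants = sum(1 for line in map_lines if line.strip())
    expected_fields = 6 + 2 * n_variants
    bad_rows = [
        row for row, line in enumerate(ped_lines, start=1)
        if line.strip() and len(line.split()) != expected_fields
    ]
    return n_variants, bad_rows

# Toy data (made up): two variants, three samples
toy_map = StringIO("4 rsA 0 1000\n4 rsB 0 2000\n")
toy_ped = StringIO(
    "F1 I1 0 0 1 2 A G C C\n"
    "F2 I2 0 0 2 1 A A C T\n"
    "F3 I3 0 0 1 2 A G C\n"  # deliberately one allele short
)

n_variants, bad_rows = check_ped_against_map(toy_ped, toy_map)
print("Variants:", n_variants, "| malformed .ped rows:", bad_rows)
```

With real files, open the `.ped` and `.map` files and pass the file handles in place of the `StringIO` objects.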

* Set up libraries for R analyses:
- Make directory R_packages if it does not already exist

In [13]:
%%bash
mkdir -p /home/jupyter/R_packages

In [14]:
%%R
pack <- "/home/jupyter/R_packages"

install.packages("dplyr", lib = pack)
install.packages("data.table", lib = pack)
install.packages("arsenal", lib = pack)
install.packages("haplo.stats", lib = pack)

** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (mvtnorm)
* installing *source* package ‘TH.data’ ...
** package ‘TH.data’ successfully unpacked and MD5 sums checked
** using staged installation
** data
*** moving datasets to lazyload DB
** demo
** inst
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of tempo

gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c all_missing.c -o all_missing.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c any_infinite.c -o any_infinite.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c any_missing.c -o any_missing.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c any_nan.c -o any_nan.o
gc

installing to /home/jupyter/R_packages/00LOCK-checkmate/00new/checkmate/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (checkmate)
* installing *source* package ‘SparseM’ ...
** package ‘SparseM’ successfully unpacked and MD5 sums checked
** using staged installation
** libs


gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c bckslv.f -o bckslv.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c chol.f -o chol.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c chol2csr.f -o chol2csr.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c cholesky.f -o cholesky.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c csr.f -o csr.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c extract.f -o extract.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpi

installing to /home/jupyter/R_packages/00LOCK-SparseM/00new/SparseM/libs
** R
** data
** demo
** inst
** byte-compile and prepare package for lazy loading


Creating a generic function for ‘diag’ from package ‘base’ in package ‘SparseM’
Creating a generic function for ‘diag<-’ from package ‘base’ in package ‘SparseM’
Creating a generic function for ‘norm’ from package ‘base’ in package ‘SparseM’
Creating a new generic function for ‘backsolve’ in package ‘SparseM’
Creating a generic function for ‘forwardsolve’ from package ‘base’ in package ‘SparseM’
Creating a generic function for ‘model.response’ from package ‘stats’ in package ‘SparseM’


** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (SparseM)
* installing *source* package ‘polspline’ ...
** package ‘polspline’ successfully unpacked and MD5 sums checked
** using staged installation
** libs


gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c allpack.f -o allpack.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c hareall.c -o hareall.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c heftall.c -o heftall.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c lsdall.c -o lsdall.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base

installing to /home/jupyter/R_packages/00LOCK-polspline/00new/polspline/libs
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (polspline)
* installing *source* package ‘quantreg’ ...
** package ‘quantreg’ successfully unpacked and MD5 sums checked
** using staged installation
** libs


gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c akj.f -o akj.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c boot.f -o boot.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c bound.f -o bound.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c boundc.f -o boundc.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c brute.f -o brute.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c chlfct.f -o chlfct.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/

installing to /home/jupyter/R_packages/00LOCK-quantreg/00new/quantreg/libs
** R
** data
** demo
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (quantreg)
* installing *source* package ‘multcomp’ ...
** package ‘multcomp’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** demo
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loa

gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c Hmisc.c -o Hmisc.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c cidxcn.f -o cidxcn.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c cidxcp.f -o cidxcp.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c hoeffd.f -o hoeffd.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c init.c -o init.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -

installing to /home/jupyter/R_packages/00LOCK-Hmisc/00new/Hmisc/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (Hmisc)
* installing *source* package ‘rms’ ...
** package ‘rms’ successfully unpacked and MD5 sums checked
** using staged installation
** libs


gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c init.c -o init.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c lrmfit.f -o lrmfit.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c mlmats.f -o mlmats.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c ormuv.f -o ormuv.o
gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong  -c robcovf.f -o robcovf.o
gcc -std=gnu99 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o rms.so init.o lrmfit.o mlmats.o ormuv.o robcovf.o -lgf

installing to /home/jupyter/R_packages/00LOCK-rms/00new/rms/libs
** R
** demo
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (rms)
* installing *source* package ‘haplo.stats’ ...
** package ‘haplo.stats’ successfully unpacked and MD5 sums checked
** using staged installation
** libs


gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c groupsum.c -o groupsum.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c haplo.stats_init.c -o haplo.stats_init.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c haplo_em_pin.c -o haplo_em_pin.o
gcc -std=gnu99 -I"/usr/share/R/include" -DNDEBUG      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-5XUBcI/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c louis_info.c -o louis

installing to /home/jupyter/R_packages/00LOCK-haplo.stats/00new/haplo.stats/libs
** R
** data
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (haplo.stats)
R[write to console]: 

R[write to console]: 
R[write to console]: The downloaded source packages are in
	‘/tmp/RtmpVYkzrU/downloaded_packages’
R[write to console]: 
R[write to console]: 



- Load packages:

In [15]:
%%R
pack <- "/home/jupyter/R_packages"
suppressPackageStartupMessages(library(dplyr, lib.loc = pack))
suppressPackageStartupMessages(library(data.table, lib.loc = pack))
suppressPackageStartupMessages(library(arsenal, lib.loc = pack))
suppressPackageStartupMessages(library(haplo.stats, lib.loc = pack))

* Add SNP names to the PED file:
- The .ped file contains 6+2V fields (where V is the number of variants), but the variant names are stored in the .map file. We therefore need to create a list of rsIDs/SNP names with two entries per SNP, one for each allele.
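
To make the 6+2V layout concrete, here is a minimal Python sketch of the naming step that the R cells below perform. The .map lines are hypothetical and only illustrate the format (chromosome, variant ID, cM, bp position); the function name `allele_column_names` is an assumption for this example and not part of the notebook.

```python
# Hypothetical .map content: chromosome, variant ID, cM, bp position (illustrative values only)
map_text = """4 4:90653134 0 90653134
4 4:90645674 0 90645674
4 4:90663542 0 90663542"""

def allele_column_names(map_text):
    """Return two column names per variant (suffixes _1 and _2), ordered by position."""
    variants = []
    for line in map_text.strip().splitlines():
        chrom, snp, cm, pos = line.split()
        variants.append((int(pos), snp))
    variants.sort()  # mirrors the notebook's sort on genomic position
    names = []
    for _, snp in variants:
        names.extend([f"{snp}_1", f"{snp}_2"])
    return names

# Six fixed .ped columns followed by two allele columns per variant
ped_header = ["FID", "IID", "PAT", "MAT", "SEX", "PHENO"] + allele_column_names(map_text)
print(len(ped_header))  # 6 + 2 * 3 = 12
```

With V variants, the genotype columns therefore occupy positions 7 to 6+2V, which is why a block of 5 SNPs is later extracted as columns 7:16.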

In [16]:
%%R
PED <- fread("/home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA.ped")

In [17]:
%%R
#Read the .map file and create one name per allele for each SNP (suffixes _1 and _2)
FILTERED.SNCA_1 <- read.table("/home/jupyter/1000G_TEST/SNCA_hap/FILTERED.test_formatted.SNCA.map", header=FALSE)
FILTERED.SNCA_2 <- FILTERED.SNCA_1
FILTERED.SNCA_1$V2 <- paste0(FILTERED.SNCA_1$V2, "_1")
FILTERED.SNCA_2$V2 <- paste0(FILTERED.SNCA_2$V2, "_2")

In [18]:
%%R
#Set column names
colnames(FILTERED.SNCA_1) <- c("CHR", "SNP", "CM", "POS")
colnames(FILTERED.SNCA_2) <- c("CHR", "SNP", "CM", "POS")
FILTERED.SNCA_2alleles <- rbind(FILTERED.SNCA_1, FILTERED.SNCA_2)

In [19]:
%%R
#Sort file and create geno matrix:
#The variants should appear in order of genomic position in the ped file; therefore, we sort the alleles based on position
SNCA_pos <- FILTERED.SNCA_2alleles[order(FILTERED.SNCA_2alleles$POS),]
SNCA_alleles <- SNCA_pos[,c("SNP")]
write.table(SNCA_alleles, file="/home/jupyter/1000G_TEST/SNCA_hap/SNCA_alleles.txt", quote = F, sep = "\t", row.names = F, col.names = F)

#Add allele names to the PED file
colnames(PED) <- c("FID", "IID", "PAT","MAT", "SEX", "PHENO", SNCA_alleles)
write.table(PED, file="/home/jupyter/1000G_TEST/SNCA_hap/Geno_matrix_SNCA.tab", quote = F, row.names = F, sep = '\t')

In [20]:
%%R
#Import covariate file and set which variables you would like to adjust for
SampleInfo_Adjustment <- read.delim("/home/jupyter/1000G_TEST/covariateFile_Haplo.txt")
adj <- data.frame(SampleInfo_Adjustment[,c("SEX","AGE","PC1", "PC2", "PC3", "PC4", "PC5")])

**Create haplotype block files (N=4 blocks)**

Information on the different variants in each haplotype block was retrieved from the .blocks file

* H1 (Haplotype block 1)
- SNPs in this haplotype block (N=5): 4:90645674, 4:90653134, 4:90663542, 4:90663670, 4:90668019

In [21]:
%%R
Geno_matrix_SNCA <- read.delim("/home/jupyter/1000G_TEST/SNCA_hap/Geno_matrix_SNCA.tab", check.names = FALSE)
#Colnames:
SNCA_alleles <- read.table("/home/jupyter/1000G_TEST/SNCA_hap/SNCA_alleles.txt", quote="\"", comment.char="")
H1_a <- SNCA_alleles[grepl("4:90645674|4:90653134|4:90663542|4:90663670|4:90668019", SNCA_alleles$V1), ]
H1_SNCA <- Geno_matrix_SNCA[,c("FID", "IID", "PAT", "MAT", "SEX", "PHENO", H1_a)]

In [22]:
%%R
#Extract only genotype data (columns 7-16 hold the two alleles for each of the 5 SNPs):
geno <- data.frame(H1_SNCA[,c(7:16)], check.names = FALSE)

In [23]:
%%R
#Set variables for running the association analyses in haplo.stats

#Label the SNPs:
label <- c("4_90645674", "4_90653134", "4_90663542", "4_90663670", "4_90668019")
#Set binary pheno (0=control, 1=patient):
H1_SNCA$PHENO_01 <- H1_SNCA$PHENO-1
y.bin <- 1*(H1_SNCA$PHENO_01=="1")
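
The recoding above turns PLINK phenotype codes (1 = control, 2 = case) into 0/1. One thing worth keeping in mind is that the R expression maps any other code, such as a missing phenotype of -9, to 0 as well. The Python sketch below shows the same recoding with missing values kept separate; the function name `recode_pheno` and the input values are hypothetical.

```python
def recode_pheno(plink_pheno):
    """Map PLINK phenotype codes (1 = control, 2 = case) to 0/1; any other code becomes None."""
    mapping = {1: 0, 2: 1}
    return [mapping.get(p) for p in plink_pheno]

# Hypothetical phenotype column, including one missing value (-9)
y_bin = recode_pheno([1, 2, 2, 1, -9])
print(y_bin)  # [0, 1, 1, 0, None]
```

In the test data used here all individuals have a case or control status, so the distinction does not change the results, but it may matter for other data sets.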

**Run association analyses (both non-adjusted and adjusted) for each haplotype**

In [24]:
%%R
#Non-adjusted:
H1 <- haplo.cc(y=y.bin, geno=geno, locus.label= label, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H1, nlines=10, digits=2)
#Sort the output on p-value:
H1_cc.df <- H1$cc.df
H1_cc.df_sort <- H1_cc.df[order(H1_cc.df$`p-val`),]

-------------------------------------------------------------------------------- 
                            Global Score Statistics                              
-------------------------------------------------------------------------------- 
global-stat = 2.6, df = 4, p-val = 0.64

-------------------------------------------------------------------------------- 
                         Counts for Cases and Controls                           
-------------------------------------------------------------------------------- 
control    case 
    202     202 


   4_90645674 4_90653134 4_90663542 4_90663670 4_90668019 Hap-Score p-val
1           C          C          C          T          A     -0.77  0.44
2           C          C          T          T          A     -0.15  0.88
8           T          C          T          G          G      0.15  0.88
7           T          C          T          G          A      0.91  0.36
4           C          T          T          T          A    

In [25]:
%%R
#Adjusted for age, sex, PC1-5:
H1_adj <- haplo.cc(y=y.bin, geno=geno, locus.label= label, x.adj=adj, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H1_adj, nlines=10, digits=2)
H1_adj_cc.df <- H1_adj$cc.df
H1_adj_cc.df_sort <- H1_adj_cc.df[order(H1_adj_cc.df$`p-val`),]

-------------------------------------------------------------------------------- 
                            Global Score Statistics                              
-------------------------------------------------------------------------------- 
global-stat = 0.4, df = 4, p-val = 0.98

-------------------------------------------------------------------------------- 
                         Counts for Cases and Controls                           
-------------------------------------------------------------------------------- 
control    case 
    202     202 


   4_90645674 4_90653134 4_90663542 4_90663670 4_90668019 Hap-Score p-val
4           C          T          T          T          A     -0.34  0.73
2           C          C          T          T          A     -0.16  0.87
8           T          C          T          G          G      0.10  0.92
1           C          C          C          T          A      0.23  0.82
7           T          C          T          G          A    

Here we tested the association between PD and the identified haplotypes in haplotype block 1, first non-adjusted and then adjusted for age, sex and PC1-5. Each row represents a haplotype in the block and shows the base it carries at each SNP. A minimum haplotype frequency of 1% was used (haplo.freq.min = 0.01), so some haplotypes were not included in the analyses (rows containing NAs). control.hf = haplotype frequency in the control group, case.hf = haplotype frequency in the patient group.
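
As a rough illustration of what the frequency columns and the 1% cut-off mean, the Python sketch below counts haplotype frequencies per group and drops rare haplotypes. It rests on two simplifying assumptions: the haplotypes are already phased and known (haplo.stats instead estimates them from unphased genotypes), and the threshold is applied to the pooled frequency. The haplotype strings and counts are made up for the example.

```python
from collections import Counter

def haplotype_frequencies(haplotypes):
    """Return the relative frequency of each haplotype in a list of haplotype strings."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}

# Hypothetical phased haplotypes (two per individual, pooled within each group)
controls = ["CCCTA"] * 120 + ["CCTTA"] * 78 + ["TCTGG"] * 2
cases = ["CCCTA"] * 110 + ["CCTTA"] * 89 + ["TCTGG"] * 1

control_hf = haplotype_frequencies(controls)
case_hf = haplotype_frequencies(cases)
pooled_hf = haplotype_frequencies(controls + cases)

# Keep only haplotypes with a pooled frequency of at least 1%
kept = sorted(hap for hap, freq in pooled_hf.items() if freq >= 0.01)
print(kept)  # ['CCCTA', 'CCTTA']
```

In this toy example TCTGG has a pooled frequency of 3/400 = 0.75% and is therefore dropped.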

We will continue with the rest of the haplotype blocks below:

* H2 (Haplotype block 2)
- SNPs in this haplotype block (N=7): 4:90668614, 4:90672457, 4:90678541, 4:90681236, 4:90684122, 4:90690329, 4:90709741

In [26]:
%%R
#Run the same analyses as above for haplotype block 2:

#Colnames:
SNCA_alleles <- read.table("/home/jupyter/1000G_TEST/SNCA_hap/SNCA_alleles.txt", quote="\"", comment.char="")
H2_a <- SNCA_alleles[grepl("4:90668614|4:90672457|4:90678541|4:90681236|4:90684122|4:90690329|4:90709741", SNCA_alleles$V1), ]
H2_SNCA <- Geno_matrix_SNCA[,c("FID", "IID", "PAT", "MAT", "SEX", "PHENO", H2_a)]

#Extract only genotype data:
geno <- data.frame(H2_SNCA[,c(7:20)], check.names = FALSE)

#Label the SNPs:
label <- c("4_90668614","4_90672457","4_90678541","4_90681236","4_90684122","4_90690329","4_90709741")

#Set binary pheno (0 ctrl, 1 pat):
H2_SNCA$PHENO_01 <- H2_SNCA$PHENO-1
y.bin <- 1*(H2_SNCA$PHENO_01=="1")

#Non-adjusted:
H2 <- haplo.cc(y=y.bin, geno=geno, locus.label= label, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H2, nlines=10, digits=2)

#Sort the output on p-value:
H2_cc.df <- H2$cc.df
H2_cc.df_sort <- H2_cc.df[order(H2_cc.df$`p-val`),]

#Adjusted for age, sex, PC1-5:
H2_adj <- haplo.cc(y=y.bin, geno=geno, locus.label= label, x.adj=adj, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H2_adj, nlines=10, digits=2)
H2_adj_cc.df <- H2_adj$cc.df
H2_adj_cc.df_sort <- H2_adj_cc.df[order(H2_adj_cc.df$`p-val`),]

-------------------------------------------------------------------------------- 
                            Global Score Statistics                              
-------------------------------------------------------------------------------- 
global-stat = 3.8, df = 5, p-val = 0.58

-------------------------------------------------------------------------------- 
                         Counts for Cases and Controls                           
-------------------------------------------------------------------------------- 
control    case 
    202     202 


   4_90668614 4_90672457 4_90678541 4_90681236 4_90684122 4_90690329 4_90709741
4           C          A          A          C          A          C          G
12          T          T          G          A          A          C          A
2           C          A          A          A          A          C          G
10          T          A          G          A          G          T          A
7           T          A       

* H3 (Haplotype block 3)
- SNPs in this haplotype block (N=7): 4:90716852, 4:90718995, 4:90722145, 4:90734535, 4:90736006, 4:90736113, 4:90739539

In [27]:
%%R

#Colnames:
SNCA_alleles <- read.table("/home/jupyter/1000G_TEST/SNCA_hap/SNCA_alleles.txt", quote="\"", comment.char="")
H3_a <- SNCA_alleles[grepl("4:90716852|4:90718995|4:90722145|4:90734535|4:90736006|4:90736113|4:90739539", SNCA_alleles$V1), ]
H3_SNCA <- Geno_matrix_SNCA[,c("FID", "IID", "PAT", "MAT", "SEX", "PHENO", H3_a)]

#Extract only genotype data:
geno <- data.frame(H3_SNCA[,c(7:20)], check.names = FALSE)

#Label the SNPs:
label <- c("4_90716852","4_90718995","4_90722145","4_90734535","4_90736006","4_90736113","4_90739539")

#Set binary pheno (0 ctrl, 1 pat):
H3_SNCA$PHENO_01 <- H3_SNCA$PHENO-1
y.bin <- 1*(H3_SNCA$PHENO_01=="1")

#Non-adjusted:
H3 <- haplo.cc(y=y.bin, geno=geno, locus.label= label, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H3, nlines=10, digits=2)

#Sort the output on p-value:
H3_cc.df <- H3$cc.df
H3_cc.df_sort <- H3_cc.df[order(H3_cc.df$`p-val`),]

#Adjusted for age, sex, PC1-5:
H3_adj <- haplo.cc(y=y.bin, geno=geno, locus.label= label, x.adj=adj, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H3_adj, nlines=10, digits=2)
H3_adj_cc.df <- H3_adj$cc.df
H3_adj_cc.df_sort <- H3_adj_cc.df[order(H3_adj_cc.df$`p-val`),]

-------------------------------------------------------------------------------- 
                            Global Score Statistics                              
-------------------------------------------------------------------------------- 
global-stat = 3.9, df = 5, p-val = 0.56

-------------------------------------------------------------------------------- 
                         Counts for Cases and Controls                           
-------------------------------------------------------------------------------- 
control    case 
    202     202 


   4_90716852 4_90718995 4_90722145 4_90734535 4_90736006 4_90736113 4_90739539
8           T          C          A          T          A          T          C
2           C          C          C          T          A          C          C
6           T          C          A          G          C          C          A
9           T          C          A          T          C          C          A
10          T          C       

* H4 (Haplotype block 4)
- SNPs in this haplotype block (N=8): 4:90753339, 4:90754292, 4:90755939, 4:90757272, 4:90757735, 4:90757840, 4:90757947, 4:90758945

In [28]:
%%R

#Colnames:
SNCA_alleles <- read.table("/home/jupyter/1000G_TEST/SNCA_hap/SNCA_alleles.txt", quote="\"", comment.char="")
H4_a <- SNCA_alleles[grepl("4:90753339|4:90754292|4:90755939|4:90757272|4:90757735|4:90757840|4:90757947|4:90758945", SNCA_alleles$V1), ]
H4_SNCA <- Geno_matrix_SNCA[,c("FID", "IID", "PAT", "MAT", "SEX", "PHENO", H4_a)]

#Extract only genotype data:
geno <- data.frame(H4_SNCA[,c(7:22)], check.names = FALSE)

#Label the SNPs:
label <- c("4_90753339","4_90754292","4_90755939","4_90757272","4_90757735","4_90757840","4_90757947","4_90758945")

#Set binary pheno (0 ctrl, 1 pat):
H4_SNCA$PHENO_01 <- H4_SNCA$PHENO-1
y.bin <- 1*(H4_SNCA$PHENO_01=="1")

#Non-adjusted:
H4 <- haplo.cc(y=y.bin, geno=geno, locus.label= label, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H4, nlines=10, digits=2)

#Sort the output on p-value:
names(H4)
H4_cc.df <- H4$cc.df
H4_cc.df_sort <- H4_cc.df[order(H4_cc.df$`p-val`),]

#Adjusted for age, sex, PC1-5:
H4_adj <- haplo.cc(y=y.bin, geno=geno, locus.label= label, x.adj=adj, control = haplo.glm.control(haplo.freq.min = 0.01))
print(H4_adj, nlines=10, digits=2)
H4_adj_cc.df <- H4_adj$cc.df
H4_adj_cc.df_sort <- H4_adj_cc.df[order(H4_adj_cc.df$`p-val`),]

-------------------------------------------------------------------------------- 
                            Global Score Statistics                              
-------------------------------------------------------------------------------- 
global-stat = 2.9, df = 7, p-val = 0.89

-------------------------------------------------------------------------------- 
                         Counts for Cases and Controls                           
-------------------------------------------------------------------------------- 
control    case 
    202     202 


   4_90753339 4_90754292 4_90755939 4_90757272 4_90757735 4_90757840 4_90757947
7           A          T          A          G          C          T          T
9           A          T          G          G          A          C          T
6           A          T          A          G          C          C          T
4           A          C          A          G          C          T          T
13          T          T       

As you might notice, none of the haplotypes in any of the haplotype blocks was significantly associated with PD here. This is not surprising, since case and control status was randomized in this data :)

## 3. Haplotype analysis - Comparing the size of haplotype blocks at PD risk loci across different populations
<a id="3"></a>

* As an example, we will use the 1000G YRI and ACB populations

**Extract cases from both the 1000G YRI and ACB populations:**

In [29]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/ACB_ALL_1000 \
--maf 0.01 \
--filter-cases \
--allow-no-sex \
--make-bed \
--out /home/jupyter/1000G_TEST/CASES.0.01.ACB')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/ACB_ALL_1000 --maf 0.01 --filter-cases --allow-no-sex --make-bed --out /home/jupyter/1000G_TEST/CASES.0.01.ACB
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/CASES.0.01.ACB.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/1000G_TEST/ACB_ALL_1000
  --filter-cases
  --maf 0.01
  --make-bed
  --out /home/jupyter/1000G_TEST/CASES.0.01.ACB

15000 MB RAM detected; reserving 7500 MB for main workspace.
34580269 variants loaded from .bim file.
96 people (0 males, 0 females, 96 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /home/jupyter/1000G_TEST/CASES.0.01.ACB.nosex .
96 phenotype values loaded from .fam.
1 person removed due to case/control status (--filter-cases).
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 95 found

In [30]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/YRI_ALL_1000g \
--maf 0.01 \
--filter-cases \
--allow-no-sex \
--make-bed \
--out /home/jupyter/1000G_TEST/CASES.0.01.YRI')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/YRI_ALL_1000g --maf 0.01 --filter-cases --allow-no-sex --make-bed --out /home/jupyter/1000G_TEST/CASES.0.01.YRI
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/CASES.0.01.YRI.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/1000G_TEST/YRI_ALL_1000g
  --filter-cases
  --maf 0.01
  --make-bed
  --out /home/jupyter/1000G_TEST/CASES.0.01.YRI

15000 MB RAM detected; reserving 7500 MB for main workspace.
34580269 variants loaded from .bim file.
108 people (0 males, 0 females, 108 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /home/jupyter/1000G_TEST/CASES.0.01.YRI.nosex .
108 phenotype values loaded from .fam.
0 people removed due to case/control status (--filter-cases).
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 108

As you can see, only one individual in the ACB dataset was a control (good for us!)

**Extract common SNPs between both datasets:**
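
A note on what "common SNPs" means in practice: a PLINK 1.9 merge by default keeps the union of variants, with missing genotypes for samples from the file set that lacks a variant, so it appears to be the missingness filter (--geno) applied afterwards that leaves only variants present in both populations. To see the shared set directly, one could intersect the variant IDs of the two .bim files. The Python sketch below does this on hypothetical .bim lines; the function name `variant_ids` is an assumption for this example.

```python
def variant_ids(bim_text):
    """Collect variant IDs (second column) from .bim-formatted text."""
    return {line.split()[1] for line in bim_text.strip().splitlines()}

# Hypothetical .bim content: chromosome, variant ID, cM, bp position, allele 1, allele 2
acb_bim = """4 4:90645674 0 90645674 T C
4 4:90653134 0 90653134 T C
4 4:90663542 0 90663542 C T"""
yri_bim = """4 4:90653134 0 90653134 T C
4 4:90663542 0 90663542 C T
4 4:90668019 0 90668019 G A"""

# Variants present in both populations
shared = sorted(variant_ids(acb_bim) & variant_ids(yri_bim))
print(shared)  # ['4:90653134', '4:90663542']
```
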

In [31]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/CASES.0.01.ACB \
--bmerge /home/jupyter/1000G_TEST/CASES.0.01.YRI \
--allow-no-sex \
--make-bed \
--out /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/CASES.0.01.ACB --bmerge /home/jupyter/1000G_TEST/CASES.0.01.YRI --allow-no-sex --make-bed --out /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/1000G_TEST/CASES.0.01.ACB
  --bmerge /home/jupyter/1000G_TEST/CASES.0.01.YRI
  --make-bed
  --out /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged

15000 MB RAM detected; reserving 7500 MB for main workspace.
95 people loaded from /home/jupyter/1000G_TEST/CASES.0.01.ACB.fam.
108 people to be merged from /home/jupyter/1000G_TEST/CASES.0.01.YRI.fam.
Of these, 108 are new, while 0 are present in the base dataset.
15946782 markers loaded from /home/jupyter/1000G_TEST/CASES.0.01.ACB.bim.
14

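Apply a genotype missingness threshold of 1% (`--geno 0.01`), keep only autosomal variants (`--autosome`) and exclude regions to speed up the upcoming haplotype block analyses. The MAF filter of 1% was already applied when the cases were extracted. For this, import a file with regions to exclude: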

In [32]:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/regions_exclude_haplotypes_size.txt /home/jupyter/1000G_TEST/')

Executing: gsutil -mu terra-9b559320 cp gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/regions_exclude_haplotypes_size.txt /home/jupyter/1000G_TEST/
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/regions_exclude_haplotypes_size.txt...
/ [1/1 files][   1010 B/   1010 B] 100% Done                                    
Operation completed over 1 objects/1010.0 B.                                     


In [33]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged \
--geno 0.01 \
--make-bed \
--autosome \
--exclude range /home/jupyter/1000G_TEST/regions_exclude_haplotypes_size.txt \
--out /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged --geno 0.01 --make-bed --autosome --exclude range /home/jupyter/1000G_TEST/regions_exclude_haplotypes_size.txt --out /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01.log.
Options in effect:
  --autosome
  --bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged
  --exclude range /home/jupyter/1000G_TEST/regions_exclude_haplotypes_size.txt
  --geno 0.01
  --make-bed
  --out /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01

15000 MB RAM detected; reserving 7500 MB for main workspace.
16977600 variants loaded from .bim file.
203 people (0 males, 0 females, 203 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/home/jupyter/1
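
PLINK's `--geno 0.01` removes variants whose missing call rate exceeds 1%. In a merged dataset, a variant that was present in only one of the two input files is missing for every individual from the other file, so this filter presumably also restricts the data to variants shared by both cohorts. The following is a minimal Python sketch of that filtering idea; the variant names, genotypes and helper functions are made up for illustration and are not part of PLINK.

```python
# Toy genotype table: one entry per variant, one value per sample.
# None marks a missing genotype call. All values are made up for illustration.
genotypes = {
    "variant_A": [0, 1, 2, 0],        # fully genotyped
    "variant_B": [0, None, None, 1],  # missing in half of the samples
    "variant_C": [1, 1, 0, 2],        # fully genotyped
}


def missing_rate(calls):
    """Fraction of samples without a genotype call for one variant."""
    return sum(call is None for call in calls) / len(calls)


def filter_by_missingness(genos, max_missing=0.01):
    """Keep variants whose missing rate does not exceed max_missing."""
    return [name for name, calls in genos.items()
            if missing_rate(calls) <= max_missing]


kept = filter_by_missingness(genotypes, max_missing=0.01)
print(kept)
```

With a threshold this strict and only a few hundred samples, a handful of missing calls is enough to drop a variant.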

In [34]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01 \
--keep /home/jupyter/1000G_TEST/CASES.0.01.ACB.fam \
--make-bed \
--allow-no-sex \
--out /home/jupyter/1000G_TEST/CASES_ACB_TOHAPLO')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01 --keep /home/jupyter/1000G_TEST/CASES.0.01.ACB.fam --make-bed --allow-no-sex --out /home/jupyter/1000G_TEST/CASES_ACB_TOHAPLO
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/CASES_ACB_TOHAPLO.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01
  --keep /home/jupyter/1000G_TEST/CASES.0.01.ACB.fam
  --make-bed
  --out /home/jupyter/1000G_TEST/CASES_ACB_TOHAPLO

15000 MB RAM detected; reserving 7500 MB for main workspace.
6933222 variants loaded from .bim file.
203 people (0 males, 0 females, 203 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /home/jupyter/1000G_TEST/CASES_ACB_TOHAPLO.nosex .
203 phenotype values loaded from .fam.
--keep: 95 people remaining.
Using 1 thre

In [35]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01 \
--keep /home/jupyter/1000G_TEST/CASES.0.01.YRI.fam \
--make-bed \
--allow-no-sex \
--out /home/jupyter/1000G_TEST/CASES_YRI_TOHAPLO')

Executing: /home/jupyter/plink --bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01 --keep /home/jupyter/1000G_TEST/CASES.0.01.YRI.fam --make-bed --allow-no-sex --out /home/jupyter/1000G_TEST/CASES_YRI_TOHAPLO
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/1000G_TEST/CASES_YRI_TOHAPLO.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/1000G_TEST/variants.CASES_ACB_YRI_merged_geno0.01
  --keep /home/jupyter/1000G_TEST/CASES.0.01.YRI.fam
  --make-bed
  --out /home/jupyter/1000G_TEST/CASES_YRI_TOHAPLO

15000 MB RAM detected; reserving 7500 MB for main workspace.
6933222 variants loaded from .bim file.
203 people (0 males, 0 females, 203 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /home/jupyter/1000G_TEST/CASES_YRI_TOHAPLO.nosex .
203 phenotype values loaded from .fam.
--keep: 108 people remaining.
Using 1 thr

**Calculate haplotype blocks in both cohorts**

This step is usually done for all the genotype data. However, since this takes a lot of time, we previously excluded some regions so that the analyses run faster (they still take a couple of hours though).

***If you do not want to run this step and prefer to save some time, there are output files uploaded for you. You can retrieve the files by running the cell below:***

In [36]:
#Retrieve output haplotype files for the ACB population:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/ACB_blocks.* /home/jupyter/1000G_TEST/')

#Retrieve output haplotype files for the YRI population:
shell_do(f'gsutil -mu {BILLING_PROJECT_ID} cp {WORKSPACE_BUCKET}/YRI_blocks.* /home/jupyter/1000G_TEST/')

Executing: gsutil -mu terra-9b559320 cp gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/ACB_blocks.* /home/jupyter/1000G_TEST/
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/ACB_blocks.blocks...
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/ACB_blocks.blocks.det...   
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/ACB_blocks.log...          
\ [3/3 files][ 86.5 MiB/ 86.5 MiB] 100% Done                                    
Operation completed over 3 objects/86.5 MiB.                                     
Executing: gsutil -mu terra-9b559320 cp gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/YRI_blocks.* /home/jupyter/1000G_TEST/
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/YRI_blocks.blocks...
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/YRI_blocks.blocks.det...   
Copying gs://fc-c04486b2-8d7e-4359-a607-63643e9a7914/YRI_blocks.log...          
\ [3/3 files][ 90.2 MiB/ 90.2 MiB] 100% Done                                    
Operation completed over 3 objects/9

***If you instead prefer to run the commands yourself, run the following two cells. Otherwise, skip these steps and use the uploaded files!***

In [None]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/CASES_ACB_TOHAPLO \
--blocks \
--allow-no-sex \
--out /home/jupyter/1000G_TEST/ACB_blocks')

In [None]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/1000G_TEST/CASES_YRI_TOHAPLO \
--blocks \
--allow-no-sex \
--out /home/jupyter/1000G_TEST/YRI_blocks')

**Compare the length and number of SNPs of haplotypes at the 92 risk loci between the two populations:**

In [38]:
%%R
#Read the haplotype files:
ACB_Haplo <- read.csv("/home/jupyter/1000G_TEST/ACB_blocks.blocks.det", sep="")
YRI_Haplo <- read.csv("/home/jupyter/1000G_TEST/YRI_blocks.blocks.det", sep="")

In [39]:
%%R
#PD loci file:
PD_loci <- read.delim("/home/jupyter/1000G_TEST/92SNPs_haplotype.txt")

In [40]:
%%bash
head -10 /home/jupyter/1000G_TEST/ACB_blocks.blocks.det

 CHR          BP1          BP2           KB  NSNPS SNPS
   1    154398961    154399353        0.393      2 rs111266048|rs78186112
   1    154400015    154400320        0.306      2 rs4845618|rs6687726
   1    154400600    154409719         9.12     32 rs80219185|rs6427658|rs73014271|rs7537306|rs58281895|rs79150137|rs28730665|rs2228144|rs2229237|rs6694817|rs28730733|rs73014279|rs74709806|rs7549250|rs7549338|rs7553796|rs114766856|rs56383622|rs4845619|rs114645658|rs77676154|rs78307575|rs59632925|rs4845620|rs7518199|rs7521458|rs115321795|rs4845371|rs57502626|rs6667434|rs55826755|rs149274188
   1    154409814    154410481        0.668      2 rs181088116|rs113588579
   1    154410482    154410490        0.009      2 rs61812592|rs61812593
   1    154411419    154416969        5.551     15 rs4845622|rs59502179|rs4393147|rs4453032|rs6664201|rs61812596|rs115093244|rs201335322|rs7529670|rs4845372|rs4845623|rs6676117|rs182632843|rs12753254|rs12730036
   1    154418749    154422067        3.319    

In [41]:
%%R
PD_loci$ACB.HapLength.KB <- NA
PD_loci$ACB.HapSnps.N <- NA
PD_loci$ACB.HapSnps.RS <- NA
PD_loci$YRI.HapLength.KB <- NA
PD_loci$YRI.HapSnps.N <- NA
PD_loci$YRI.HapSnps.RS <- NA
for(i in 1:length(PD_loci$SNP))
{
  thisSnp <- PD_loci$SNP[i]
  thisChr <- PD_loci$CHR[i]
  thisBp <- PD_loci$BP[i]
  ACBHap <- subset(ACB_Haplo, CHR == thisChr & BP1 <= thisBp & BP2 >= thisBp)
  YRIHap <- subset(YRI_Haplo, CHR == thisChr & BP1 <= thisBp & BP2 >= thisBp)
  if(length(ACBHap$KB) > 0) 
  {
    PD_loci$ACB.HapLength.KB[i] <- ACBHap$KB
    PD_loci$ACB.HapSnps.N[i] <- ACBHap$NSNPS
    PD_loci$ACB.HapSnps.RS[i] <- ACBHap$SNPS
  }
  if(length(YRIHap$KB) > 0) 
  {
    PD_loci$YRI.HapLength.KB[i] <- YRIHap$KB
    PD_loci$YRI.HapSnps.N[i] <- YRIHap$NSNPS
    PD_loci$YRI.HapSnps.RS[i] <- YRIHap$SNPS
  }
}
PD_loci_sort <- PD_loci[order(PD_loci$`ACB.HapSnps.N`),]
fwrite(PD_loci_sort, "/home/jupyter/1000G_TEST/PD_lociHaplo.tab", quote = F, sep = "\t", row.names = F, na = NA)

In [42]:
%%bash
head -4 /home/jupyter/1000G_TEST/PD_lociHaplo.tab

SNP	CHR	BP	ACB.HapLength.KB	ACB.HapSnps.N	ACB.HapSnps.RS	YRI.HapLength.KB	YRI.HapSnps.N	YRI.HapSnps.RS
rs62333164	4	170583157	0.514	2	rs72694777|rs62333164	0.514	2	rs72694777|rs62333164
rs11950533	5	134199105	0.673	2	rs73282857|rs111519842	0.673	2	rs73282857|rs111519842
rs3802920	11	133787001	0.009	2	rs3802921|rs3802920	1.201	6	rs75125504|rs3824992|rs3802923|rs3802922|rs75952732|rs3741104


The output file PD_lociHaplo.tab contains, for both populations, the length of the haplotype block in kb (HapLength.KB), the number of SNPs in the block (HapSnps.N) and the rsIDs of those SNPs (HapSnps.RS), mapped to the PD risk loci (SNP) listed in the input file 92SNPs_haplotype.txt. This makes it possible to compare the haplotype blocks at the risk loci between the two populations.
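
The lookup performed in the R loop (find the block whose BP1 to BP2 interval contains the risk SNP) can be sketched in Python as follows. The block coordinates are copied from the truncated ACB output shown earlier; the two query positions and the function name `find_block` are hypothetical and only serve as an illustration.

```python
from typing import Optional

# A few blocks in the layout of a PLINK .blocks.det file (CHR, BP1, BP2, KB, NSNPS).
# Coordinates come from the ACB output shown earlier; the list is incomplete.
blocks = [
    {"CHR": 1, "BP1": 154398961, "BP2": 154399353, "KB": 0.393, "NSNPS": 2},
    {"CHR": 1, "BP1": 154400600, "BP2": 154409719, "KB": 9.12, "NSNPS": 32},
    {"CHR": 1, "BP1": 154411419, "BP2": 154416969, "KB": 5.551, "NSNPS": 15},
]


def find_block(block_list, chrom, bp) -> Optional[dict]:
    """Return the first block on `chrom` whose interval contains `bp`, else None."""
    for block in block_list:
        if block["CHR"] == chrom and block["BP1"] <= bp <= block["BP2"]:
            return block
    return None


hit = find_block(blocks, 1, 154405000)   # hypothetical position inside the second block
miss = find_block(blocks, 1, 154410000)  # hypothetical position between two listed blocks
print(hit)
print(miss)
```

A SNP that falls between blocks returns `None`, which corresponds to the `NA` entries left in PD_lociHaplo.tab by the R code.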

## 4. Rare variant analyses - Gene/Region-based analysis (including annotation using ANNOVAR)
<a id="4"></a>

* We will focus on a specific region of interest and as an example, we will again look at the region of the gene SNCA

* SNCA position on GRCh37/hg19: chr4:90645250-90759466 (Ensembl)

* First, we create a new folder for the analyses

In [43]:
%%bash
mkdir -p /home/jupyter/Burden/SNCA_burden

**Extract the region of interest**

In [44]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted \
--chr 4 \
--from-bp 90645250 \
--to-bp 90759466 \
--make-bed \
--out /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA')

Executing: /home/jupyter/plink --bfile /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted --chr 4 --from-bp 90645250 --to-bp 90759466 --make-bed --out /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA.log.
Options in effect:
  --bfile /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted
  --chr 4
  --from-bp 90645250
  --make-bed
  --out /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA
  --to-bp 90759466

15000 MB RAM detected; reserving 7500 MB for main workspace.
600 out of 12405097 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 404 founders 

**Recode the PLINK files to vcf**

In [45]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA \
--recode vcf-fid \
--out /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA')

Executing: /home/jupyter/plink --bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA --recode vcf-fid --out /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA.log.
Options in effect:
  --bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA
  --out /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA
  --recode vcf-fid

15000 MB RAM detected; reserving 7500 MB for main workspace.
600 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 404 founders and 0 nonfounders present.
Calculating allele frequencies... 101

**bgzip and tabix the vcf file**

In [46]:
%%bash
export PATH=$PATH:/home/jupyter/rvtests/third/tabix-0.2.6/
bgzip /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA.vcf
tabix -f -p vcf /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA.vcf.gz

**Annotate the vcf file using ANNOVAR**

In [47]:
%%capture
%%bash

perl /home/jupyter/annovar/table_annovar.pl /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA.vcf.gz /home/jupyter/annovar/humandb/ -buildver hg19 \
-out /home/jupyter/Burden/SNCA_burden/SNCA.annovar \
-remove -protocol refGene,ljb26_all,gnomad211_genome,clinvar_20140902 \
-operation g,f,f,f \
-nastring . \
-vcfinput

**Clean up the output file**

In [48]:
%%bash
head -1 /home/jupyter/Burden/SNCA_burden/SNCA.annovar.hg19_multianno.txt > /home/jupyter/Burden/SNCA_burden/header.txt
colct="$(wc -w /home/jupyter/Burden/SNCA_burden/header.txt| cut -f1 -d' ')"
cut -f1-$colct /home/jupyter/Burden/SNCA_burden/SNCA.annovar.hg19_multianno.txt > /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.txt

**Generate list of non-coding variants**

The same procedure can be applied to all variants or to other functional categories, for example coding variants or variants selected by CADD score (CADD scores the deleteriousness of variants). Here we look at the non-coding variants as an example, because only one coding variant is present in this dataset. Note that the command below keeps only the variants annotated as intronic.

In [49]:
%%bash

awk '$6=="intronic" {print}' /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.txt > /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.txt
awk '{print $1" "$2" "$2" "$7}' /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.txt > /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt

Below, we will have a look at the non-coding variants in the region in our dataset:

In [50]:
%%bash

head -5 /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt

4 90647865 90647865 SNCA
4 90648185 90648185 SNCA
4 90648482 90648482 SNCA
4 90648686 90648686 SNCA
4 90648952 90648952 SNCA
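
For readers more comfortable with Python than awk, the same filtering step can be sketched as below: keep rows whose functional annotation is `intronic` and write them as `chr start end label` lines, the layout used here for `--extract range`. The table is a made-up excerpt with ANNOVAR-style column names; the alleles and the UTR3 row are hypothetical, and only the two intronic positions are taken from the output above.

```python
import csv
import io

# Made-up excerpt using ANNOVAR multianno column names. Alleles and the UTR3 row
# are hypothetical; the two intronic positions appear in the output above.
annotation = io.StringIO(
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n"
    "4\t90647865\t90647865\tA\tG\tintronic\tSNCA\n"
    "4\t90648000\t90648000\tC\tT\tUTR3\tSNCA\n"
    "4\t90648185\t90648185\tG\tA\tintronic\tSNCA\n"
)


def intronic_ranges(handle):
    """Return 'chr start end gene' lines for rows annotated as intronic."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        f'{row["Chr"]} {row["Start"]} {row["Start"]} {row["Gene.refGene"]}'
        for row in reader
        if row["Func.refGene"] == "intronic"
    ]


range_lines = intronic_ranges(annotation)
print("\n".join(range_lines))
```

Selecting by column name rather than by column number makes the filter less sensitive to changes in the annotation protocol.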


**Extract the non-coding variants from the data:**

In [51]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA \
--extract range /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt \
--recode vcf-fid \
--out /home/jupyter/Burden/SNCA_burden/SNCA.noncoding')

Executing: /home/jupyter/plink --bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA --extract range /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt --recode vcf-fid --out /home/jupyter/Burden/SNCA_burden/SNCA.noncoding
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/Burden/SNCA_burden/SNCA.noncoding.log.
Options in effect:
  --bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA
  --extract range /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt
  --out /home/jupyter/Burden/SNCA_burden/SNCA.noncoding
  --recode vcf-fid

15000 MB RAM detected; reserving 7500 MB for main workspace.
600 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
--extract range: 19 variant

**Recode to generate a .ped file with phenotype info to use in RVTESTS**

The 6th column of the .ped file (PLINK format) is used as the phenotype. RVTESTS automatically detects whether the phenotype is a binary or a quantitative trait. For binary traits, the recommended coding is 1 for controls, 2 for cases, and -9 or 0 for missing phenotypes.

In [52]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA \
--extract range /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt \
--recode \
--out /home/jupyter/Burden/SNCA_burden/SNCA.noncoding')

Executing: /home/jupyter/plink --bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA --extract range /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt --recode --out /home/jupyter/Burden/SNCA_burden/SNCA.noncoding
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/Burden/SNCA_burden/SNCA.noncoding.log.
Options in effect:
  --bfile /home/jupyter/Burden/SNCA_burden/IMPUTED.HARDCALLS.Demo_formatted.SNCA
  --extract range /home/jupyter/Burden/SNCA_burden/SNCA.trimmed.annotation.noncoding.variants.SNPs.txt
  --out /home/jupyter/Burden/SNCA_burden/SNCA.noncoding
  --recode

15000 MB RAM detected; reserving 7500 MB for main workspace.
600 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
--extract range: 19 variants excluded.
--ex
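
Before running RVTESTS it can be useful to confirm that the phenotype column follows the coding described above (1 = control, 2 = case, -9 or 0 = missing). The sketch below counts the values in the 6th column of a .ped file; the four example lines and the function name `count_phenotypes` are made up for illustration.

```python
from collections import Counter
import io

# Made-up .ped lines: FID IID PAT MAT SEX PHENO followed by genotype columns.
ped = io.StringIO(
    "FAM1 IND1 0 0 1 2 A G\n"
    "FAM2 IND2 0 0 2 1 A A\n"
    "FAM3 IND3 0 0 1 2 G G\n"
    "FAM4 IND4 0 0 2 -9 A G\n"
)

LABELS = {"1": "control", "2": "case", "0": "missing", "-9": "missing"}


def count_phenotypes(handle):
    """Count cases, controls and missing phenotypes in the 6th .ped column."""
    counts = Counter()
    for line in handle:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip empty or malformed lines
        counts[LABELS.get(fields[5], "other")] += 1
    return counts


counts = count_phenotypes(ped)
print(dict(counts))
```

To check the real file, replace the `io.StringIO` object with `open("/home/jupyter/Burden/SNCA_burden/SNCA.noncoding.ped")`. Any value counted as `other` would suggest a quantitative phenotype or an unexpected coding.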

**Run rare variant analyses in RVTESTS**

RVTESTS is convenient because we can run all the tests we are interested in at the same time:
* Burden tests (cmc,zeggini,mb,fp,cmcWald)
* Non-burden sequence kernel association test (SKAT)
* Optimal SKAT (SKAT-O)

cmc, zeggini, mb, fp and cmcWald are all burden tests that use different algorithms. The RVTESTS website has clear explanations of all the analyses, together with references, if you are interested in learning more: http://zhanxw.github.io/rvtests/#burden-tests
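
To make the idea of "collapsing" concrete, here is a minimal Python sketch of the first step shared by many burden tests: the rare variants in a gene are summarised per individual, either as a carrier indicator (the idea behind CMC-style collapsing) or as a count of rare alleles. This is not the RVTESTS implementation; it omits covariates and the actual test statistic, and all genotypes are made up.

```python
# Rows are individuals, columns are rare variants in one gene; each entry is the
# number of minor alleles carried (0, 1 or 2). All values are made up.
genotypes = [
    [0, 0, 1],  # case
    [0, 0, 0],  # case
    [1, 0, 1],  # case
    [0, 0, 0],  # control
    [0, 2, 0],  # control
    [0, 0, 0],  # control
]
is_case = [True, True, True, False, False, False]


def carrier_indicator(row):
    """Indicator collapsing: 1 if the individual carries any rare allele, else 0."""
    return int(any(g > 0 for g in row))


def allele_count(row):
    """Count collapsing: total number of rare alleles carried."""
    return sum(row)


carriers = [carrier_indicator(row) for row in genotypes]
allele_counts = [allele_count(row) for row in genotypes]

case_carriers = sum(c for c, case in zip(carriers, is_case) if case)
control_carriers = sum(c for c, case in zip(carriers, is_case) if not case)
print(carriers, allele_counts, case_carriers, control_carriers)
```

A burden test then asks whether the collapsed score differs between cases and controls, which works best when most rare variants in the gene act in the same direction. SKAT instead allows effects in different directions.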

**Bgzip and index the vcf file**

RVTESTS supports VCF files in both plain text and gzipped format. However, group-based rare variant tests (which we are going to run) require the VCF files to be bgzipped and indexed with tabix.

In [53]:
%%bash
export PATH=$PATH:/home/jupyter/rvtests/third/tabix-0.2.6/
bgzip /home/jupyter/Burden/SNCA_burden/SNCA.noncoding.vcf
tabix -f -p vcf /home/jupyter/Burden/SNCA_burden/SNCA.noncoding.vcf.gz

**Run the analysis**
Here we use an upper minor allele frequency of 1% (--freqUpper 0.01).

In [54]:
%%capture
%%bash
export PATH=$PATH:/home/jupyter/rvtests/executable

rvtest --noweb \
--inVcf /home/jupyter/Burden/SNCA_burden/SNCA.noncoding.vcf.gz \
--pheno /home/jupyter/Burden/SNCA_burden/SNCA.noncoding.ped \
--covar /home/jupyter/Burden/covariateFile_Haplo.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--kernel skat,skato \
--burden cmc,zeggini,mb,fp,cmcWald \
--geneFile /home/jupyter/Burden/refFlat_hg19.txt \
--freqUpper 0.01 \
--out /home/jupyter/Burden/SNCA_burden/BURDEN.SNCA

# --out : prefix for the output files
# --burden X --kernel X : tests to run
# --inVcf : path to the bgzipped and indexed VCF file
# --pheno : path to the phenotype file (here the .ped file)
# --pheno-name : column name of the phenotype (not set here; the 6th column of the .ped file is used)
# --covar : path to the covariate file
# --freqUpper : upper MAF cut-off
# --covar-name : covariates, listed by column name and separated by commas (no spaces)
# --geneFile : path to the refFlat file

In [55]:
%%bash
cd /home/jupyter/Burden/SNCA_burden/
ls

BURDEN.SNCA.CMC.assoc
BURDEN.SNCA.CMCWald.assoc
BURDEN.SNCA.Fp.assoc
BURDEN.SNCA.log
BURDEN.SNCA.MadsonBrowning.assoc
BURDEN.SNCA.Skat.assoc
BURDEN.SNCA.SkatO.assoc
BURDEN.SNCA.Zeggini.assoc
header.txt
IMPUTED.HARDCALLS.Demo_formatted.SNCA.bed
IMPUTED.HARDCALLS.Demo_formatted.SNCA.bim
IMPUTED.HARDCALLS.Demo_formatted.SNCA.fam
IMPUTED.HARDCALLS.Demo_formatted.SNCA.log
IMPUTED.HARDCALLS.Demo_formatted.SNCA.vcf.gz
IMPUTED.HARDCALLS.Demo_formatted.SNCA.vcf.gz.tbi
SNCA.annovar.avinput
SNCA.annovar.hg19_multianno.txt
SNCA.annovar.hg19_multianno.vcf
SNCA.noncoding.log
SNCA.noncoding.map
SNCA.noncoding.ped
SNCA.noncoding.vcf.gz
SNCA.noncoding.vcf.gz.tbi
SNCA.trimmed.annotation.noncoding.variants.SNPs.txt
SNCA.trimmed.annotation.noncoding.variants.txt
SNCA.trimmed.annotation.txt


In [56]:
%%bash
cd /home/jupyter/Burden/SNCA_burden/
head BURDEN.SNCA.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SNCA	4:90645249-90758350,4:90645249-90758127,4:90645249-90758127,4:90645249-90759447	404	193	193	17325.8	0.3	0.366207


As an example, the output file for the SKAT-O analysis is shown above. The p-value is approximately 0.37, so there is no evidence that the cumulative effect of rare (MAF < 1%) non-coding variants in SNCA is associated with PD in our data.
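
The `.assoc` files are tab-delimited, so they are easy to read programmatically when many genes are tested. The sketch below parses the SKAT-O output shown above from an inline string; the 0.05 threshold and the function name `read_assoc` are assumptions made for this illustration.

```python
import csv
import io

# Content copied from the BURDEN.SNCA.SkatO.assoc output shown above.
assoc_text = (
    "Gene\tRANGE\tN_INFORMATIVE\tNumVar\tNumPolyVar\tQ\trho\tPvalue\n"
    "SNCA\t4:90645249-90758350,4:90645249-90758127,"
    "4:90645249-90758127,4:90645249-90759447"
    "\t404\t193\t193\t17325.8\t0.3\t0.366207\n"
)

ALPHA = 0.05  # assumed significance threshold, for illustration only


def read_assoc(handle):
    """Read a tab-delimited RVTESTS .assoc file into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_assoc(io.StringIO(assoc_text))
for row in rows:
    p_value = float(row["Pvalue"])
    verdict = "below" if p_value < ALPHA else "not below"
    print(f'{row["Gene"]}: {row["NumVar"]} variants, p = {p_value:.3f} ({verdict} {ALPHA})')
```

When several genes or several tests are analysed, the threshold should be corrected for multiple testing.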

## 5. Rare variant analyses - Pathway-based analysis
<a id="5"></a>

**Run gene-set based pathway test in RVTESTS**

For this, we need a SetFile which is a tab-delimited file with the following format:
* Each pathway has its own line (each pathway is a set), and the gene coordinates follow in this format, separated by commas without spaces: 16:68245303-68261058,15:82659280-82709946
* PATHWAYNAME1 gene1coordinates,gene2coordinates etc. E.g., KEGG_LYSOSOME 16:68245303-68261058,16:5024843-5034141,16:71728999-71809201, ....
* PATHWAYNAME2 gene1coordinates,gene2coordinates etc.
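
A SetFile line in the format described above could be assembled from a list of gene coordinates with a small helper such as the one below. The function name `format_set_line` is hypothetical, and the two ranges are the ones used as an example in the text.

```python
def format_set_line(pathway_name, ranges):
    """Build one SetFile line: pathway name, a tab, then comma-separated ranges.

    `ranges` is a list of (chromosome, start, end) tuples.
    """
    coordinates = ",".join(f"{chrom}:{start}-{end}" for chrom, start, end in ranges)
    return f"{pathway_name}\t{coordinates}"


example_line = format_set_line(
    "KEGG_LYSOSOME",
    [("16", 68245303, 68261058), ("15", 82659280, 82709946)],
)
print(example_line)
```

Writing one such line per pathway to a text file gives a file that can be passed to `--setFile`.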

**Recode the PLINK format file to vcf and bgzip:**

In [57]:
shell_do(f'/home/jupyter/plink \
--bfile /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted \
--recode vcf-fid \
--out /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted')

Executing: /home/jupyter/plink --bfile /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted --recode vcf-fid --out /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted
PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted.log.
Options in effect:
  --bfile /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted
  --out /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted
  --recode vcf-fid

15000 MB RAM detected; reserving 7500 MB for main workspace.
12405097 variants loaded from .bim file.
404 people (202 males, 202 females) loaded from .fam.
404 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 404 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505

In [61]:
%%bash
export PATH=$PATH:/home/jupyter/rvtests/third/tabix-0.2.6/
bgzip /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted.vcf
tabix -f -p vcf /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted.vcf.gz

In [62]:
%%bash
cd /home/jupyter/Burden/
ls

covariateFile_Haplo.txt
IMPUTED.HARDCALLS.Demo_formatted.bed
IMPUTED.HARDCALLS.Demo_formatted.bim
IMPUTED.HARDCALLS.Demo_formatted.fam
IMPUTED.HARDCALLS.Demo_formatted.log
IMPUTED.HARDCALLS.Demo_formatted.vcf.gz
IMPUTED.HARDCALLS.Demo_formatted.vcf.gz.tbi
lysosomal_burden.log
lysosomal-setfile-example.txt
refFlat_hg19.txt
SNCA_burden


**Run the pathway-based burden analysis in RVTESTS**

In [1]:
%%bash
export PATH=$PATH:/home/jupyter/rvtests/executable

rvtest --noweb --hide-covar \
--out /home/jupyter/Burden/lysosomal_burden \
--burden cmc \
--inVcf /home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted.vcf.gz \
--pheno /home/jupyter/Burden/covariateFile_Haplo.txt \
--pheno-name PHENO \
--covar /home/jupyter/Burden/covariateFile_Haplo.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01 \
--setFile /home/jupyter/Burden/lysosomal-setfile-example.txt

# --out : Name of output 
# --burden cmc : test to run 
# --inVcf : path to VCF file 
# --pheno : path to pheno+covariate file
# --pheno-name : column name with phenotype in file
# --covar : path to pheno+covariate file
# --freqUpper : optional, MAF cut-off
# --covar-name : covariates, listed by column name, separated by commas (no spaces between commas)
# --set : optional, run individual set 
# --setFile : setFile path

Thank you for using rvtests (version: 20190205, git: 8defd6fcbba91ae5187ea8fbce6ccc9c944bb4cb)
  For documentations, refer to http://zhanxw.github.io/rvtests/
  For questions and comments, plase send to Xiaowei Zhan <zhanxw@umich.edu>
  For bugs and feature requests, please submit at: https://github.com/zhanxw/rvtests/issues



The following parameters are available.  Ones with "[]" are in effect:

Available Options
      Basic Input/Output:
                          --inVcf [/home/jupyter/Burden/IMPUTED.HARDCALLS.Demo_formatted.vcf.gz]
                          --inBgen [], --inBgenSample [], --inKgg []
                          --out [/home/jupyter/Burden/lysosomal_burden]
                          --outputRaw
       Specify Covariate: --covar [/home/jupyter/Burden/covariateFile_Haplo.txt]
                          --covar-name [SEX,AGE,PC1,PC2,PC3,PC4,PC5], --sex
       Specify Phenotype: --pheno [/home/jupyter/Burden/covariateFile_Haplo.txt]
                          --inverseNormal, --useResidualAsPhenotype, --mpheno []
                          --pheno-name [PHENO], --qtl, --multiplePheno []
        Specify Genotype: --dosage [], --multipleAllele
    Chromosome X Options: --xLabel [], --xParRegion []
           People Filter: --peopleIncludeID [], --peopleIncludeFile []
                          --peopl

In [2]:
%%bash
cd /home/jupyter/Burden/
head lysosomal_burden.CMC.assoc

Range	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	NonRefSite	Pvalue
KEGG_LYSOSOME	16:68245303-68261058,16:5024843-5034141,16:71728999-71809201,16:23463541-23521995,16:88813733-88856970,16:2513951-2520218,16:1351930-1364113,16:67438013-67481181,16:28474110-28495575,15:82659280-82709946,15:78921057-78949574,15:50908671-51005895,15:89830598-89894638,15:72340923-72376420,22:37608724-37633564,22:19179472-19291719,22:29327679-29423179,22:50622753-50628173,22:42058333-42070842,2:218382028-218396894,2:223751685-223838027,2:20032649-20051628,1:40071460-40097727,1:43974486-43978295,1:150730187-150765792,1:30732468-30757774,1:113894193-113905201,1:150796207-150808260,1:206009263-206023895,1:155234451-155244699,1:84398531-84415018,1:109309567-109397918,1:23845076-23868294,4:127917798-127966034,4:155924117-155953866,4:177430773-177442437,4:7430284-7434930,4:986996-1004564,4:102630769-102760994,4:76158736-76234536,20:58995184-59007254,20:45890143-45898820,6:159969098-160113507,6:109366513-109382467,6:3215

Here, the pathway appears to be significantly associated with PD (p=0.0026), but again, case-control status was randomized in this data.

Even though this analysis can give us an indication regarding the association of pathway-specific genetic variation on PD risk, it is important to know that RVTESTS has some limitations for pathway analyses:
* RVTESTS relies on genomic coordinates, so it is prone to errors where genes overlap
* RVTESTS does not take strand direction into account
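
Because sets are defined purely by coordinates, it can help to check a SetFile for overlapping ranges before interpreting the results. The sketch below compares every pair of ranges within one set. The first two ranges come from the KEGG_LYSOSOME example above; the third is hypothetical and was added so that an overlap is reported. The function names are also made up for this illustration.

```python
from itertools import combinations


def parse_range(text):
    """Split 'chr:start-end' into (chromosome, start, end)."""
    chrom, span = text.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def find_overlaps(range_strings):
    """Return all pairs of ranges on the same chromosome that share a position."""
    parsed = [parse_range(r) for r in range_strings]
    return [
        (a, b)
        for a, b in combinations(parsed, 2)
        if a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
    ]


ranges = [
    "16:68245303-68261058",
    "16:5024843-5034141",
    "16:68260000-68270000",  # hypothetical range overlapping the first one
]
overlaps = find_overlaps(ranges)
print(overlaps)
```

Variants in an overlapping stretch fall into more than one gene range of the set, which is worth keeping in mind when reading the `NumVar` column.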

If you are interested in learning more about functional pathways in PD, please have a look at the paper by Bandres-Ciga et al.: Large-scale pathway specific polygenic risk and transcriptomic community network analysis identifies novel functional pathways in Parkinson disease, 2020, doi: 10.1007/s00401-020-02181-3