<a href="https://colab.research.google.com/github/AndreMacedo88/Genomics-Notebooks/blob/main/NGS_Datasets_Chr22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up the machine

## Pre-work

The pre-work instructions are taken from the HTSlib workflow page: http://www.htslib.org/workflow/#mapping_to_variant

The sequencer delivers the reads as FASTQ files, which must then be mapped to a reference genome.

### Mapping


To prepare the reference for mapping, first index it with the following command, where `<ref.fa>` is the path to your reference file:

```
bwa index <ref.fa>
```

This may take several hours as it prepares the Burrows Wheeler Transform index for the reference, allowing the aligner to locate where your reads map within that reference.
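To give a feel for what is being indexed, here is a toy sketch of the Burrows-Wheeler Transform itself. It sorts every rotation of the sequence and keeps the last column, which is only practical for tiny strings; `bwa index` uses far more efficient construction algorithms and stores extra data structures on top. The function name `bwt` and the `$` sentinel are illustrative choices, not part of bwa.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("text must not already contain the sentinel")
    s = text + sentinel  # the sentinel marks the end of the sequence
    # Build every rotation of s and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("GATTACA"))
```

The transformed string tends to group identical characters together, which is what makes the index compact and searchable.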

Once you have finished preparing your indexed reference you can map your reads to the reference:

```
bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' <ref.fa> <read1.fq> <read2.fq> > lane.sam
```

Typically your reads will be supplied to you in two files written in the FASTQ format. It is particularly important to ensure that the @RG information here is correct as this information is used by later tools. The SM field must be set to the name of the sample being processed, and LB field to the library. The resulting mapped reads will be delivered to you in a mapping format known as SAM.
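Since a wrong `@RG` line silently propagates to later tools, it can help to build it programmatically. The sketch below assembles the string passed to `bwa mem -R` from its three fields; the helper name `read_group` and its validation rules are assumptions made for illustration.

```python
def read_group(rg_id: str, sample: str, library: str) -> str:
    """Build the @RG string for `bwa mem -R` (hypothetical helper)."""
    for value in (rg_id, sample, library):
        # Tabs or newlines inside a field would corrupt the SAM header line.
        if not value or "\t" in value or "\n" in value:
            raise ValueError(f"invalid read-group field: {value!r}")
    # bwa expects the two-character escape "\t" between fields, not a real tab.
    return "\\t".join(["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}"])


print(read_group("foo", "bar", "library1"))
```

The result can be passed as the `-R` argument, quoted so the shell does not interpret the backslashes.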

Because BWA can sometimes leave unusual FLAG information on SAM records, it is helpful when working with many tools to first clean up read pairing information and flags:

```
samtools fixmate -O bam <lane.sam> <lane_fixmate.bam>
```

To sort them from name order into coordinate order:

```
samtools sort -O bam -o <lane_sorted.bam> -T </tmp/lane_temp> <lane_fixmate.bam>
```

### Improving

In order to reduce the number of miscalls of INDELs in your data it is helpful to realign your raw gapped alignment with the Broad’s GATK Realigner.

```
java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R <ref.fa> -I <lane.bam> -o <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf>
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R <ref.fa> -I <lane.bam> -targetIntervals <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf> -o <lane_realigned.bam>
```

Base Quality Score Recalibration (BQSR) from the Broad’s GATK lets you reduce the effects of analysis artefacts produced by your sequencing machines. It works in two steps: the first analyses your data to detect covariates, and the second compensates for those covariates by adjusting the quality scores.

```
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R <ref.fa> -knownSites <bundle/b38/dbsnp_142.b38.vcf> -I <lane.bam> -o <lane_recal.table>
java -Xmx2g -jar GenomeAnalysisTK.jar -T PrintReads -R <ref.fa> -I <lane.bam> --BQSR <lane_recal.table> -o <lane_recal.bam>
```

It is helpful at this point to compile all of the reads from each library together into one BAM, which can be done at the same time as marking PCR and optical duplicates. To identify duplicates we currently recommend the use of either the Picard or biobambam’s mark duplicates tool.

```
java -Xmx2g -jar MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=<lane_1.bam> INPUT=<lane_2.bam> INPUT=<lane_3.bam> OUTPUT=<library.bam>
```

Once this is done you can perform another merge step to produce your sample BAM files.
```
samtools merge <sample.bam> <library1.bam> <library2.bam> <library3.bam>
samtools index <sample.bam>
```

If you have the computational time and resources available, it is helpful to realign your INDELs again:

```
java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R <ref.fa> -I <sample.bam> -o <sample.intervals> --known <bundle/b38/Mills1000G.b38.vcf>
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R <ref.fa> -I <sample.bam> -targetIntervals <sample.intervals> --known <bundle/b38/Mills1000G.b38.vcf> -o <sample_realigned.bam>
```

Lastly we index our BAM using samtools:

```
samtools index <sample_realigned.bam>
```

### Variant calling

To convert your BAM file into genomic positions we first use mpileup to produce a BCF file that contains all of the locations in the genome. We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call.

You can do this using a pipe as shown here:

```
bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | bcftools call -vmO z -o <study.vcf.gz>
```
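Inside a notebook it may be more convenient to drive the same pipe from Python. The following is a minimal sketch using `subprocess`, assuming `bcftools` is on `PATH`; the function names `build_commands` and `run_calling` are hypothetical, and the arguments simply mirror the command above.

```python
import shutil
import subprocess


def build_commands(ref, bams, out_vcf):
    """Return the two argument lists for `mpileup | call` (mirrors the command above)."""
    mpileup = ["bcftools", "mpileup", "-Ou", "-f", ref, *bams]
    call = ["bcftools", "call", "-vmO", "z", "-o", out_vcf]
    return mpileup, call


def run_calling(ref, bams, out_vcf):
    """Run mpileup piped into call, without going through a shell."""
    if shutil.which("bcftools") is None:
        raise RuntimeError("bcftools not found on PATH")
    mpileup_cmd, call_cmd = build_commands(ref, bams, out_vcf)
    mpileup = subprocess.Popen(mpileup_cmd, stdout=subprocess.PIPE)
    call = subprocess.Popen(call_cmd, stdin=mpileup.stdout)
    mpileup.stdout.close()  # so mpileup gets SIGPIPE if call exits early
    call_rc = call.wait()
    mpileup_rc = mpileup.wait()
    if mpileup_rc != 0 or call_rc != 0:
        raise RuntimeError(f"pipeline failed: mpileup={mpileup_rc}, call={call_rc}")


# Inspect the commands without running anything (file names are placeholders).
print(build_commands("ref.fa", ["sample1.bam", "sample2.bam"], "study.vcf.gz"))
```

Passing argument lists instead of a shell string avoids quoting problems with unusual file names.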

Alternatively if you need to see why a specific site was not called by examining the BCF, or wish to spread the load slightly you can break it down into two steps as follows:

```
bcftools mpileup -Ob -o <study.bcf> -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam>
bcftools call -vmO z -o <study.vcf.gz> <study.bcf>
```

To prepare our VCF for querying we next index it using tabix:

```
tabix -p vcf <study.vcf.gz>
```

Additionally you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:

```
bcftools stats -F <ref.fa> -s - <study.vcf.gz> > <study.vcf.gz.stats>
mkdir plots
plot-vcfstats -p plots/ <study.vcf.gz.stats>
```

Finally you will probably need to filter your data using commands such as:

```
bcftools filter -O z -o <study_filtered.vcf.gz> -s LOWQUAL -i'%QUAL>10' <study.vcf.gz>
```

Variant filtration is a subject worthy of an article in itself and the exact filters you will need to use will depend on the purpose of your study and quality and depth of the data used to call the variants.
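To make the effect of a soft filter concrete, here is a simplified pure-Python sketch of the `QUAL>10` rule shown earlier: records that pass are marked `PASS` and the rest are labelled `LOWQUAL` rather than removed. It is an approximation, not a reimplementation of `bcftools filter`: it overwrites any existing FILTER value, does not add a `##FILTER` header line, and treats a missing QUAL (`.`) as failing, which is an assumption. The function name `soft_filter` is hypothetical.

```python
def soft_filter(lines, min_qual=10.0, label="LOWQUAL"):
    """Yield VCF lines with FILTER set from a QUAL threshold (simplified sketch)."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        qual = fields[5]  # QUAL is the 6th VCF column
        passed = qual != "." and float(qual) > min_qual
        fields[6] = "PASS" if passed else label  # FILTER is the 7th column
        yield "\t".join(fields)


# Made-up records, for illustration only.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "22\t100\t.\tA\tG\t50\t.\t.",
    "22\t200\t.\tC\tT\t3.2\t.\t.",
]
for out in soft_filter(records):
    print(out)
```

Keeping failed records with a label, instead of dropping them, makes it easy to revisit the threshold later.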

## Setup

First, we need to install some command-line tools, such as tabix and bgzip, and the main Python packages.

In [2]:
# Used to time how long this notebook takes to run
import time
start_time = time.time()

In [3]:
# check if runtime is clean
!ls
# get the htslib tools
!wget -c https://github.com/samtools/htslib/releases/download/1.11/htslib-1.11.tar.bz2

sample_data
--2020-10-09 13:00:28--  https://github.com/samtools/htslib/releases/download/1.11/htslib-1.11.tar.bz2
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/4339773/5bc1ce80-fcf0-11ea-8ec0-9a6f894b8e1c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201009%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201009T130029Z&X-Amz-Expires=300&X-Amz-Signature=d5beb0783dd52f9ddd369b31d6d8f7890bf12d5ce3a210ebd16669230dbf6965&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=4339773&response-content-disposition=attachment%3B%20filename%3Dhtslib-1.11.tar.bz2&response-content-type=application%2Foctet-stream [following]
--2020-10-09 13:00:29--  https://github-production-release-asset-2e65be.s3.amazonaws.com/4339773/5bc1ce80-fcf0-11ea-8ec0-9a6f894b8e1c?X-Amz-Algorithm=AWS4-HM

In [4]:
# install the tools
!tar -xjf htslib-1.11.tar.bz2
%cd htslib-1.11
!./configure --prefix=/htslib
!make
!make install
%cd

/content/htslib-1.11
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for ranlib... ranlib
checking for grep that handles long lines and -e... /bin/grep
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
checking shared library type for unknown-Linux... plain .so
checking whether the compiler accepts -fvisibility=hidden... yes
checking how to run the C preprocessor... gcc -E
checking for egrep... /bin/grep -E
checking for ANSI C header files... ye

In [5]:
# add the tools to PATH
import os
os.environ["PATH"] += os.pathsep + "/htslib/bin"
# test if tabix and bgzip are installed correctly
!tabix
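
A quieter way to check that both tools are reachable is to ask Python where they resolve on `PATH`. This is a small sketch using the standard library; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools=("tabix", "bgzip")):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# An empty list means every tool was found.
print(missing_tools())
```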

### Install Python packages

In [6]:
%%time
# Install general packages
!pip install matplotlib
!pip install numpy

!pip install biopython
!pip install PyVCF

Collecting biopython
  Downloading https://files.pythonhosted.org/packages/76/02/8b606c4aa92ff61b5eda71d23b499ab1de57d5e818be33f77b01a6f435a8/biopython-1.78-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
     |████████████████████████████████| 2.3MB 2.8MB/s 
Installing collected packages: biopython
Successfully installed biopython-1.78
Collecting PyVCF
  Downloading https://files.pythonhosted.org/packages/20/b6/36bfb1760f6983788d916096193fc14c83cce512c7787c93380e09458c09/PyVCF-0.6.8.tar.gz
Building wheels for collected packages: PyVCF
  Building wheel for PyVCF (setup.py) ... done
  Created wheel for PyVCF: filename=PyVCF-0.6.8-cp36-cp36m-linux_x86_64.whl size=121984 sha256=0162d358fbd8d9487448c2416f0c1f7e306405a5edbef61fb7a20e9389c0179e
  Stored in directory: /root/.cache/pip/wheels/81/91/41/3272543c0b9c61da9c525f24ee35bae6fe8f60d4858c66805d
Successfully built PyVCF
Installing collected packages: PyVCF
Successfully installed PyVCF-0.6.8
CPU times: user 64.9 ms, sys: 29 

# Analyzing variant calls

After running the pre-work pipeline above, or another genotype caller (for example, GATK or samtools mpileup), you will have a Variant Call Format (VCF) file describing genomic variation such as SNPs (single-nucleotide polymorphisms), InDels (insertions/deletions), and CNVs (copy number variations), among others.
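
As a rough illustration of how these variant types show up in a VCF record, the sketch below classifies a single REF/ALT pair by comparing allele lengths. It is deliberately simplified: it handles one ALT allele at a time, and treats any ALT written in angle brackets (such as `<CN0>`) as a symbolic allele. The function name `classify` and the category labels are illustrative choices.

```python
def classify(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (simplified sketch)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"  # e.g. <CN0> or <DEL>, used for structural variants
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"  # multi-nucleotide substitution
    return "insertion" if len(alt) > len(ref) else "deletion"


for ref, alt in [("A", "G"), ("A", "AT"), ("ATG", "A"), ("A", "<CN0>")]:
    print(ref, alt, classify(ref, alt))
```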

In [7]:
# perform a partial download of the VCF file for chromosome 22 (up to 17 Mbp) of the 1000 genomes project. Then, bgzip will compress it.
!tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000|bgzip -c > genotypes.vcf.gz
# create a tabix index file for the bgzip-compressed VCF, which we will need for direct access to a section of the genome.
!tabix -p vcf genotypes.vcf.gz
# The tabix command appends .tbi to the .vcf.gz filename, creating a binary index file named .vcf.gz.tbi with which genomic coordinates can quickly be translated into file offsets in .vcf.gz.

[E::easy_errno] Libcurl reported error 78 (Remote file not found)
[E::easy_errno] Libcurl reported error 78 (Remote file not found)


In [23]:
# check that the files were downloaded and processed correctly
!ls
!gzip -dk genotypes.vcf.gz
!head -20 genotypes.vcf
!rm genotypes.vcf

ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz.tbi
genotypes.vcf.gz
genotypes.vcf.gz.tbi
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20140730
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##contig=<ID=5,assembly=b37,length=180915260>
##contig=<ID=6,assembly=b37,length=171115067>
##contig=<ID=7,assembly=b37,length=159138663>
##contig=<ID=8,assembly=b37,length=146364022>
##contig=<ID=9,assembly=b37,length=141213431>
##contig=<ID=10,assembly=b37,length=135534747>
##contig=<ID=11,assembly=b37,length=135006516>
##contig=<ID=12,assembly=b37,length=133851895>
##contig=<ID=13,assembly=b37,length=115169878>
##contig=<ID=14,assem
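
The `##contig` header lines shown above can be turned into a lookup table of chromosome lengths with a few lines of standard-library Python. This is a minimal sketch: the helper name `parse_contigs` is hypothetical, and splitting on commas assumes that no header value contains a quoted comma, which holds for the contig lines here but not for every VCF header line.

```python
import re

CONTIG_RE = re.compile(r"^##contig=<(.+)>$")


def parse_contigs(lines):
    """Map contig ID -> length from ##contig header lines (simplified sketch)."""
    contigs = {}
    for line in lines:
        match = CONTIG_RE.match(line.strip())
        if not match:
            continue  # skip every line that is not a ##contig entry
        fields = dict(item.split("=", 1) for item in match.group(1).split(","))
        contigs[fields["ID"]] = int(fields["length"])
    return contigs


# Header lines copied from the output above.
header = [
    "##fileformat=VCFv4.1",
    "##contig=<ID=1,assembly=b37,length=249250621>",
    "##contig=<ID=2,assembly=b37,length=243199373>",
]
print(parse_contigs(header))
```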

# Final consideration

In [9]:
# Running time of the notebook
print("{:.2f} minutes".format((time.time()-start_time)/60))

2.08 minutes
