Tutorial

What you will learn

Build a basic understanding of Princess workflow
Analyzing long-reads like PacBio HiFi or Oxford Nanopore (ONT)
Sequence data QC
Call Single Nucleotide Variations (SNPs), short Insertion and Deletions (Indels)
Call Structural Variants (SVs)
Phase SNPs and SVs
Get familiar with VCF files

Problems that you will be able to solve

Starting with a capture sample, we will identify SNVs, indels, SVs, and their haplotype with minimum installation and no prior expertise with bioinformatics tools optimization.

Requirements

Basic experience with the Linux system and command line will be helpful.

Installation

Install Conda

Princess was tested on CentOS release 6.7, and Conda version 4.7.12 is installed: for more information about Installing Conda press here. To download the same Conda version here..

Create a working directory

mkdir princess_tutorial
cd princess_tutorial

Create an environment

conda create --name princess_env python=3.7

Install Snakemake and the tools we will need

conda activate princess_env
conda install snakemake=5.7.1
conda install pyyaml
conda install bcftools # for SNVs and SVs in-depth analysis
conda install bedtools # for bed files intersection

Download Princess

git clone https://github.com/MeHelmy/princess.git
chmod +x install.sh
./install.sh

Test installation

python princess -h

Downloading data

downloading the fastq file wget https://bcm.box.com/shared/static/hqv7ghnncroxcvfzz1k45z12du99v4i0 --output-document MdaMb231_brPanel_MinION.12.fastq.gz
In case you didn't manage to run Princess and finalize the analysis, here is a copy of the Princess final output, you can download and follow up with the tutorial wget https://bcm.box.com/shared/static/x9ena1gd41p4x12etk60h6j2l8qg29sb --output-document analysis_finall.tar.gz
Extracting files
tar -xf analysis_finall.tar.gz

Running the analysis

Note: We used GRCh37 reference.
python princess all -r ont -d analysis -f hs37d5_mainchr.fa -s $PWD/MdaMb231_brPanel_MinION.12.fastq.gz -e --printshellcmds --dry-run

Questions

How many folders can you see in the analysis directory, what are they called?

benchmark log PrincessLog.txt result snake_log statistics

In the `benchmark/align/minimap` directory, can you identify the run time of the mapping process?

awk '{print $2}' stat.benchmark.txt

In the directory `benchmark/sv/minimap` can you detect the run time of Sniffles and the CPU time?

awk '{print $2}' sv.benchmark.txt

awk '{print $(NF)}' sv.benchmark.txt

What do you learn from the statistics output?

How many reads, bases, mean read length, and max read?

cd statistics/raw_reads
cat reads_stat.txt

Reads: 28396
Bases: 409038723
Mean read length: 14404.800781800253
Median: 7733.5
Max: 178579
N50: 29994

Aligning statistics

cd statitics/minimap
cat data.stat| grep ^SN | cut -f 2-

What is the number of “reads mapped”? Compare to the number of raw reads from statistics/raw_reads/reads_stat.txt.

What is the “average quality” of mapped reads? The number of reads with a mapping quality of zero, “reads MQ0”?

SV basic statistics

cd statitics/sv

What is the distribution of SV sizes, and how many Deletion (DEL) with sizes between 1000-10000bp

cat data.stat

How many deletions detected in chromosome 5?

cat data.stat_CHR

SNVs statistics

What is the number of SNPs, Indels, and how many records (variants)?

grep "^SN" snp.txt

Do we have “multiallelic SNP sites”?

So far all the analysis we got is directly from Princess, now let us use other tools to get more information about our results.

Select the heterozygote SNPs, How many are they?

cd analysis/result
bcftools view -v snps -i 'GT="het"' minimap.phased.SNVs.vcf.gz | wc -l # ~25079

Number of Phased SNPs

bcftools view -p minimap.phased.SNVs.vcf.gz | wc -l

Are there any SNPs in gene BRAF?

Note: gene is located on chromosome 16 starts at 67952369 and ends at 67976758

bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz

How many of those we could not phase?

bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools view -H -P | wc -l

For the same gene, How many phase blocks are there? And what is the phase block number?

bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools query -f '[%PS]\n' | sort | uniq -c # 15 67965575

Do we identify an SV in this location `5:57679993-57686103`?

bcftools view -r 5:57679993-57686103 minimap.SVs.phased.vcf.gz | less

what is the SV type, is it phased?

Bonus: If you identified SV in the previous location, try to take a look in the IGV using the bam file minimap.hap.bam and color reads by tag HP

So far, what have we learned?

Installing Princess
Running Princess for a comprehensive sample analysis
QC for raw reads (number of reads, bases, max read length, N50, etc...)
Variant QC (Number of SVs, SNVs, and Indels)
Calling variants, SVs, SNVs, and indels
Phasing variants
Count heterozygote vs. homozygous variant
Select only phased variant
The number of variants per gene and how many phased variants are there.
Detecting SVs per region/gene