## `01_get_phased_reads`: Generating Phased Read Sets

This notebook walks through haplotype phasing with 10x Genomics data, Oxford Nanopore (ONT) reads, and PacBio CCS reads. It also runs several experiments that phase with different combinations of these data sets.

## Getting GIAB data and selecting reads covering the MHC region

The data was downloaded from the following links:

PacBio CCS reads from the human pangenomics project
 - aws s3 cp s3://human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/15kb/m64012_190920_173625.Q20.fastq 15kb/
 - aws s3 cp s3://human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/15kb/m64012_190921_234837.Q20.fastq 15kb/
 - aws s3 cp s3://human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/20kb/m64011_190830_220126.Q20.fastq 15kb/
 - aws s3 cp s3://human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/20kb/m64011_190901_095311.Q20.fastq 15kb/

ONT "ultralong" 
 - http://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.2.4_2020-01-22/HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam 
 - ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.2.4_2020-01-22/HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam.bai    


### To run this notebook, set up a conda environment and an IPython kernel from a Unix shell console.
```
conda create -n whatshap
conda activate whatshap
conda install -y minimap2 samtools bcftools pysam bamtools whatshap -c bioconda

apt-get install samtools bcftools bamtools bzip2 tabix
#pip install git+https://bitbucket.org/whatshap/whatshap@split

```
After the installation, reconnect the notebook to the new IPython kernel so that the installed tools are available.

In [None]:
%%bash
dx download -f /GIAB_BAM_files/HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam
dx download -f /GIAB_BAM_files/HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam.bai

In [None]:
%%bash
apt-get install -y pigz
#pip install git+https://bitbucket.org/whatshap/whatshap@split

For the PacBio CCS Sequel 15 kb, Sequel II 11 kb, and ONT "ultra-long" data sets, the reads covering the MHC were selected for haplotype 1 and haplotype 2 with the following commands (with the BAM file name updated for each data set):

This script was run with a directory structure that includes the subdirectories "reads", "cram", "ref", "vcf", and "whatshap".

In [None]:
%%writefile filter_script.json
{ "filters" : [ { "id" : "inHP1", "tag" : "HP:1" }, { "id" : "inHP2", "tag" : "HP:2" } ], "rule" : "!(inHP1 | inHP2)" }

filter_script.json:
{
    "filters" : [
    { "id" : "inHP1", "tag" : "HP:1" },
    { "id" : "inHP2", "tag" : "HP:2" }
    ],
    "rule" : "!(inHP1 | inHP2)"
}
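
The filter script above keeps reads that carry neither `HP:1` nor `HP:2`. As a rough illustration of that rule, the sketch below applies the same boolean logic to plain Python dictionaries. The function name `is_unpartitioned` and the example reads are hypothetical and are not part of bamtools or of this notebook.

```python
# Sketch of the rule "!(inHP1 | inHP2)" applied to per-read tag dictionaries.
# `is_unpartitioned` and the example reads are hypothetical illustrations.

def is_unpartitioned(tags):
    """Return True when a read carries neither HP:1 nor HP:2."""
    in_hp1 = tags.get("HP") == 1
    in_hp2 = tags.get("HP") == 2
    return not (in_hp1 or in_hp2)

reads = {
    "read_a": {"HP": 1, "PS": 28477797},
    "read_b": {"HP": 2, "PS": 28477797},
    "read_c": {},  # no haplotype assignment at all
}

# Names of the reads that the rule would let through.
unpartitioned = sorted(name for name, tags in reads.items() if is_unpartitioned(tags))
print(unpartitioned)
```

Only reads without a haplotype assignment pass, which matches the `UnPartitioned` output file written by the third command in the next cell.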

In [None]:
%%bash

samtools view -bh HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam 6:28477797-33448354 | \
    bamtools filter -in stdin -tag HP:1 | \
        samtools bam2fq - | \
        bgzip -c > HG002.ultra-long-ont_hs37d5_phased_reheader.HP1.fastq.gz

samtools view -bh HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam 6:28477797-33448354 | \
    bamtools filter -in stdin -tag HP:2 | \
        samtools bam2fq - | \
        bgzip -c > HG002.ultra-long-ont_hs37d5_phased_reheader.HP2.fastq.gz

samtools view -bh HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam 6:28477797-33448354 | \
    bamtools filter -in stdin -script filter_script.json | \
    samtools bam2fq - | \
    bgzip -c > HG002.ultra-long-ont_hs37d5_phased_reheader.MHConly.UnPartitioned.fastq.gz

## Generate haplotype-binned reads with various tools in the WhatsHap toolset

Getting the input files

In [None]:
%%bash
#wget --quiet ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/10XGenomics_ChromiumGenome_LongRanger2.2_Supernova2.0.1_04122018/GRCh37/NA24385_300G/NA24385.GRCh37.phased_variants.vcf.gz
#wget --quiet https://storage.cloud.google.com/genomics-public-data/references/hs37d5/hs37d5.fa.gz

dx download -f /20200316_asm_for_revision/data/NA24385.GRCh37.phased_variants.vcf.gz
dx download -f /20200316_asm_for_revision/data/hs37d5.fa.gz


In [None]:
%%bash
mkdir -p ref
mv hs37d5.fa.gz ref

In [None]:
%%bash
dx download -f /20200316_asm_for_revision/data/mapped_reads.fa
grep -c ">" mapped_reads.fa

Merge the ONT FASTQ files again, and compress mapped_reads.fa into reads/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.MHConly.fastq.gz

In [None]:
%%bash
apt-get install -y pigz
mkdir -p reads
zcat HG002.ultra-long-ont_hs37d5_phased_reheader.*.fastq.gz | pigz > reads/HG002_ultra-long-ont_hs37d5_phased_reheader.MHConly.fastq.gz
cat mapped_reads.fa | pigz > reads/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.MHConly.fastq.gz
#cat HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.MHConly.fastq |  pigz > reads/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.MHConly.fastq.gz
#zcat HG002.SequelII.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.MHConly.*.fastq.gz | pigz >> reads/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.MHConly.fastq.gz
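
The cell above compresses `mapped_reads.fa` unchanged, so the resulting `.fastq.gz` file still holds FASTA records. minimap2 accepts both FASTA and FASTQ input, so the mapping step is unaffected. If a downstream tool insists on real FASTQ, a conversion along the following lines could be used. This is a minimal sketch: the function `fasta_to_fastq` and the constant placeholder quality `I` are assumptions for illustration, and the placeholder carries no real base-quality information.

```python
import io

def fasta_to_fastq(fasta_text, placeholder_qual="I"):
    """Convert FASTA text to FASTQ text with a constant placeholder quality."""
    records = []
    name, chunks = None, []
    for line in io.StringIO(fasta_text):
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequence may be wrapped over several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return "".join(
        f"@{n}\n{s}\n+\n{placeholder_qual * len(s)}\n" for n, s in records
    )

example = ">read1\nACGT\nAC\n>read2\nGG\n"
fastq = fasta_to_fastq(example)
print(fastq, end="")
```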

Mapping CCS reads

In [None]:
%%bash
mkdir -p cram

/opt/conda/envs/whatshap/bin/minimap2 -t 32 -R '@RG\tID:1\tSM:HG002' -ax asm20 ref/hs37d5.fa.gz \
reads/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.MHConly.fastq.gz | \
samtools view -bS - > cram/HG002.PacBio.15kbCCS.bam

In [None]:
%%bash
samtools sort -@ 32 -o cram/HG002.PacBio.15kbCCS.sorted.bam cram/HG002.PacBio.15kbCCS.bam
samtools index cram/HG002.PacBio.15kbCCS.sorted.bam

Mapping ONT reads

In [None]:
%%bash
/opt/conda/envs/whatshap/bin/minimap2 -t 32 -R '@RG\tID:1\tSM:HG002' -ax map-ont ref/hs37d5.fa.gz \
reads/HG002_ultra-long-ont_hs37d5_phased_reheader.MHConly.fastq.gz | \
samtools view -bS - > cram/HG002.ultra-long-ont_hs37d5_phased_reheader.bam

In [None]:
%%bash
samtools sort  -@ 32  -o cram/HG002.ultra-long-ont_hs37d5_phased_reheader.sorted.bam cram/HG002.ultra-long-ont_hs37d5_phased_reheader.bam 
samtools index cram/HG002.ultra-long-ont_hs37d5_phased_reheader.sorted.bam

In [None]:
%%bash
dx download -f /HG002_reanalysis/pacbio-15kb-hapsort-wgs.vcf.gz
mkdir -p vcf/deepvariant/
mv pacbio-15kb-hapsort-wgs.vcf.gz vcf/deepvariant/

In [None]:
%%bash
tabix -p vcf vcf/deepvariant/pacbio-15kb-hapsort-wgs.vcf.gz

Select het SNVs from the DeepVariant call set. The goal is to use these as a starting point for constructing a high-confidence set of heterozygous SNVs that we can use to partition reads.

In [None]:
%%bash
/opt/conda/envs/whatshap/bin/bcftools view -g het -v snps vcf/deepvariant/pacbio-15kb-hapsort-wgs.vcf.gz 6:28477797-33448354 > vcf/deepvariant/mhc.het.snps.vcf
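
To make the selection criterion concrete, the sketch below applies a similar test to individual VCF records in plain Python: every allele must be a single base, and the two genotype alleles must differ. The function `is_het_snv` and the example records are hypothetical. The sketch is stricter than `bcftools view -g het -v snps` in some edge cases (for example multi-allelic sites), so treat it as an illustration rather than a replacement.

```python
BASES = set("ACGT")

def is_het_snv(vcf_line):
    """Return True for a single-sample record that is a heterozygous SNV."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    # Every allele must be exactly one base; this excludes indels and '*'.
    if any(allele not in BASES for allele in [ref] + alts):
        return False
    # Locate GT through the FORMAT column instead of assuming it comes first.
    gt_index = fields[8].split(":").index("GT")
    genotype = fields[9].split(":")[gt_index]
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]

records = [
    "6\t28477800\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:40",   # het SNV
    "6\t28477900\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t1/1:40",   # hom-alt SNV
    "6\t28478000\t.\tAT\tA\t50\tPASS\t.\tGT:GQ\t0/1:40",  # het deletion
    "6\t28478100\t.\tC\tT\t50\tPASS\t.\tGT\t1|0",         # phased het SNV
]

kept = [int(line.split("\t")[1]) for line in records if is_het_snv(line)]
print(kept)
```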

Re-genotype these SNVs from the ONT reads

In [None]:
%%bash
zcat ref/hs37d5.fa.gz > ref/hs37d5.fa
samtools faidx ref/hs37d5.fa

In [None]:
%%bash
mkdir -p whatshap/genotype/
/opt/conda/envs/whatshap/bin/whatshap genotype --ignore-read-groups --chromosome 6 \
--reference ref/hs37d5.fa \
-o whatshap/genotype/HG002.ultra-long-ont.vcf \
vcf/deepvariant/mhc.het.snps.vcf \
cram/HG002.ultra-long-ont_hs37d5_phased_reheader.sorted.bam > whatshap/genotype/HG002.ultra-long-ont.log 2>&1

Re-genotype these SNVs from the CCS reads

In [None]:
%%bash
/opt/conda/envs/whatshap/bin/whatshap genotype --ignore-read-groups --chromosome 6 \
--reference ref/hs37d5.fa \
-o whatshap/genotype/HG002.PacBio.15kbCCS.vcf \
vcf/deepvariant/mhc.het.snps.vcf \
cram/HG002.PacBio.15kbCCS.sorted.bam > whatshap/genotype/HG002.PacBio.15kbCCS.log 2>&1

Find the intersection, i.e. retain only those SNVs that are het according to DV/CCS, WhatsHap/ONT, and WhatsHap/CCS

In [None]:
%%bash
bcftools view -g het whatshap/genotype/HG002.ultra-long-ont.vcf > whatshap/genotype/HG002.ultra-long-ont.hetonly.vcf
bgzip -f whatshap/genotype/HG002.ultra-long-ont.hetonly.vcf
rm -f whatshap/genotype/HG002.ultra-long-ont.hetonly.vcf.gz.tbi
tabix -p vcf -f whatshap/genotype/HG002.ultra-long-ont.hetonly.vcf.gz

bcftools view -g het whatshap/genotype/HG002.PacBio.15kbCCS.vcf > whatshap/genotype/HG002.PacBio.15kbCCS.hetonly.vcf
bgzip -f whatshap/genotype/HG002.PacBio.15kbCCS.hetonly.vcf
rm -f whatshap/genotype/HG002.PacBio.15kbCCS.hetonly.vcf.gz.tbi
tabix -p vcf -f whatshap/genotype/HG002.PacBio.15kbCCS.hetonly.vcf.gz

mkdir -p whatshap/genotype-intersect
bcftools isec -p whatshap/genotype-intersect whatshap/genotype/HG002.ultra-long-ont.hetonly.vcf.gz whatshap/genotype/HG002.PacBio.15kbCCS.hetonly.vcf.gz 
cp whatshap/genotype-intersect/0003.vcf whatshap/confident-hets.vcf
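
With two input files, `bcftools isec -p` writes the records shared by both to `0002.vcf` (as they appear in the first file) and `0003.vcf` (as they appear in the second file); the cell above keeps `0003.vcf`. The set logic behind that step can be sketched in plain Python as follows. The function `het_sites` and the tiny in-memory VCF snippets are hypothetical, and matching on chromosome, position, REF, and ALT is a simplifying assumption about how records are compared.

```python
def het_sites(vcf_text):
    """Collect (chrom, pos, ref, alt) keys from the records of a VCF text."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites

ont_vcf = (
    "##fileformat=VCFv4.2\n"
    "6\t28477800\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n"
    "6\t28477900\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n"
)
ccs_vcf = (
    "##fileformat=VCFv4.2\n"
    "6\t28477900\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n"
    "6\t28478000\t.\tG\tA\t.\tPASS\t.\tGT\t0/1\n"
)

# Sites that are het in both the ONT-based and the CCS-based genotyping.
confident = sorted(het_sites(ont_vcf) & het_sites(ccs_vcf))
print(confident)
```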

Extract 10x phased blocks produced by LongRanger 

In [None]:
%%bash
mkdir -p 10x
cp NA24385.GRCh37.phased_variants.vcf.gz 10x/
cd 10x
rm -f NA24385.GRCh37.phased_variants.vcf.gz.tbi
tabix -p vcf -f NA24385.GRCh37.phased_variants.vcf.gz
bcftools view NA24385.GRCh37.phased_variants.vcf.gz chr6:28477797-33448354 | \
    awk 'BEGIN {OFS="\t"} $1 == "chr6" {$1="6"} $1=="#CHROM" {$10="HG002"} {print}'| \
    bgzip > NA24385.GRCh37.phased_variants.mhc.vcf.gz
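
The `awk` step above does two things: it renames the contig `chr6` to `6` so that it matches hs37d5 naming, and it renames the sample column to `HG002` so that it matches the read group of the BAM files. A line-by-line Python equivalent might look like the sketch below. The function `rename_line` is hypothetical, and it assumes tab-separated, single-sample VCF lines. Like the `awk` command, it leaves `##contig` header lines untouched.

```python
def rename_line(line, old_chrom="chr6", new_chrom="6", sample="HG002"):
    """Rename the contig on data lines and the sample on the #CHROM line."""
    if line.startswith("##"):
        return line  # meta-information lines pass through unchanged
    fields = line.split("\t")
    if fields[0] == old_chrom:
        fields[0] = new_chrom
    elif fields[0] == "#CHROM" and len(fields) >= 10:
        fields[9] = sample  # column 10 holds the first sample name
    return "\t".join(fields)

vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA24385",
    "chr6\t28477800\t.\tA\tG\t50\tPASS\t.\tGT:PS\t0|1:28477800",
]

renamed = [rename_line(line) for line in vcf_lines]
print("\n".join(renamed))
```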

Now run phasing from multiple combinations of data sources

In [None]:
%%bash
rm -rf whatshap/phase/
mkdir -p whatshap/phase/

/opt/conda/envs/whatshap/bin/whatshap phase --ignore-read-groups --chromosome 6 \
--reference ref/hs37d5.fa \
-o whatshap/phase/HG002.PacBio.15kbCCS.vcf \
whatshap/confident-hets.vcf \
cram/HG002.PacBio.15kbCCS.sorted.bam > whatshap/phase/HG002.PacBio.15kbCCS.log 2>&1

bgzip -f whatshap/phase/HG002.PacBio.15kbCCS.vcf
tabix -p vcf -f whatshap/phase/HG002.PacBio.15kbCCS.vcf.gz

/opt/conda/envs/whatshap/bin/whatshap phase --ignore-read-groups --chromosome 6 \
--reference ref/hs37d5.fa -o whatshap/phase/HG002.ultra-long-ont.vcf \
whatshap/confident-hets.vcf \
cram/HG002.ultra-long-ont_hs37d5_phased_reheader.sorted.bam > whatshap/phase/HG002.ultra-long-ont.log 2>&1

bgzip -f whatshap/phase/HG002.ultra-long-ont.vcf
tabix -p vcf -f  whatshap/phase/HG002.ultra-long-ont.vcf.gz

/opt/conda/envs/whatshap/bin/whatshap phase --chromosome 6 \
--reference ref/hs37d5.fa \
-o whatshap/phase/HG002.ultra-long-ont+10x.vcf \
whatshap/confident-hets.vcf \
cram/HG002.ultra-long-ont_hs37d5_phased_reheader.sorted.bam \
10x/NA24385.GRCh37.phased_variants.mhc.vcf.gz > whatshap/phase/HG002.ultra-long-ont+10x.log  2>&1

bgzip -f whatshap/phase/HG002.ultra-long-ont+10x.vcf
tabix -p vcf -f  whatshap/phase/HG002.ultra-long-ont+10x.vcf.gz

/opt/conda/envs/whatshap/bin/whatshap phase --ignore-read-groups --chromosome 6 \
--reference ref/hs37d5.fa \
-o whatshap/phase/HG002.PacBio+10x.vcf \
whatshap/confident-hets.vcf cram/HG002.PacBio.15kbCCS.sorted.bam \
10x/NA24385.GRCh37.phased_variants.mhc.vcf.gz > whatshap/phase/HG002.PacBio+10x.log 2>&1

bgzip -f whatshap/phase/HG002.PacBio+10x.vcf
tabix -p vcf -f whatshap/phase/HG002.PacBio+10x.vcf.gz



In [None]:
%%bash
/opt/conda/envs/whatshap/bin/whatshap phase --ignore-read-groups --chromosome 6 \
--reference ref/hs37d5.fa \
-o whatshap/phase/HG002.PacBio+ultra-long-ont.vcf \
whatshap/confident-hets.vcf cram/HG002.PacBio.15kbCCS.sorted.bam \
whatshap/phase/HG002.ultra-long-ont.vcf.gz > whatshap/phase/HG002.PacBio+ultra-long-ont.log 2>&1
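
WhatsHap marks phased genotypes with `|` and, by default, records the phase block in the `PS` FORMAT tag. To compare the phasing runs above at a glance, one could count how many variants fall into each phase set. The sketch below does this on VCF text in plain Python. The function `phase_set_sizes` and the example records are hypothetical, and the sketch is only a rough stand-in for the fuller report produced by `whatshap stats`.

```python
from collections import Counter

def phase_set_sizes(vcf_text):
    """Count phased variants per phase set (PS) in a single-sample VCF text."""
    sizes = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Pair FORMAT keys with the sample's values, e.g. GT -> 0|1.
        values = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if "|" in values.get("GT", "") and values.get("PS", ".") != ".":
            sizes[values["PS"]] += 1
    return sizes

phased_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "6\t100\t.\tA\tG\t.\tPASS\t.\tGT:PS\t0|1:100",
    "6\t250\t.\tC\tT\t.\tPASS\t.\tGT:PS\t1|0:100",
    "6\t400\t.\tG\tA\t.\tPASS\t.\tGT\t0/1",        # left unphased
    "6\t500\t.\tT\tC\t.\tPASS\t.\tGT:PS\t0|1:500",
])

sizes = phase_set_sizes(phased_vcf)
print(dict(sizes))
```

Fewer, larger phase sets across the MHC region would suggest that a given combination of data sources produces more contiguous phasing.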

In [None]:
%%bash
/opt/conda/envs/whatshap/bin/whatshap haplotag --ignore-read-groups \
--reference ref/hs37d5.fa \
-o cram/HG002.PacBio.15kbCCS.sorted.tagged-newsplit.bam \
whatshap/phase/HG002.ultra-long-ont+10x.vcf.gz \
cram/HG002.PacBio.15kbCCS.sorted.bam

In [None]:
%%bash
samtools view cram/HG002.PacBio.15kbCCS.sorted.tagged-newsplit.bam | grep HP:i:1 | awk '{print $1}' | sort -u  > H1_reads &
samtools view cram/HG002.PacBio.15kbCCS.sorted.tagged-newsplit.bam | grep HP:i:2 | awk '{print $1}' | sort -u  > H2_reads &
samtools view cram/HG002.PacBio.15kbCCS.sorted.tagged-newsplit.bam | grep -v HP:i:2 | grep -v HP:i:1 | awk '{print $1}' | sort -u > tmp_reads &
# Wait for the three background pipelines so the output files are complete before the next cell reads them
wait

In [None]:
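# Read names tagged HP:1 or HP:2; splitlines() avoids a trailing empty entry.
with open("H1_reads") as f1, open("H2_reads") as f2:
    phased_reads = set(f1.read().splitlines()) | set(f2.read().splitlines())
# Read names seen on at least one alignment without an HP tag.
with open("tmp_reads") as f:
    tmp_reads = set(f.read().splitlines())
# A read counts as unphased only if none of its alignments carried an HP tag.
unphased_reads = tmp_reads - phased_reads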
with open("unphased_reads", "w") as f:
    for r in unphased_reads:
        print(r, file=f)
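
The `grep`-based selection above matches `HP:i:1` anywhere in a SAM line, which in principle includes the quality string. A stricter alternative is to inspect only the optional fields (column 12 onward). The sketch below shows that idea on SAM text lines in plain Python. The functions `haplotype_of` and `bin_read_names` and the example alignments are hypothetical; in practice a library such as pysam, which the conda environment above installs, would read the BAM directly.

```python
def haplotype_of(sam_line):
    """Return the integer HP tag of a SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("HP:i:"):
            return int(tag[5:])
    return None

def bin_read_names(sam_lines):
    """Split read names into haplotype 1, haplotype 2, and unphased sets."""
    bins = {1: set(), 2: set(), None: set()}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        name = line.split("\t", 1)[0]
        bins.setdefault(haplotype_of(line), set()).add(name)
    # Unphased means that no alignment of the read carried HP:1 or HP:2.
    unphased = bins[None] - bins[1] - bins[2]
    return bins[1], bins[2], unphased

sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\t6\t28477800\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:1\tPS:i:28477797",
    "r1\t2048\t6\t28479000\t60\t4M\t*\t0\t0\tACGT\tIIII",  # untagged extra alignment
    "r2\t0\t6\t28477900\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:2\tPS:i:28477797",
    "r3\t0\t6\t28478000\t60\t6M\t*\t0\t0\tACGTAC\tHP:i:1",  # quality string, not a tag
]

h1, h2, unphased = bin_read_names(sam_lines)
print(sorted(h1), sorted(h2), sorted(unphased))
```

Here `r3` has a quality string that happens to read `HP:i:1`; a plain `grep` would count it as haplotype 1, while the field-based check leaves it unphased.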