# Pedigree informed micro-haplotype calling in an autopolyploid bi-parental cross

*(Last updated for MCHap version 0.10.0)*

This notebook demonstrates pedigree informed micro-haplotype calling in a small bi-parental cross.
It uses a ***development release of MCHap***.
The following topics are introduced:

- **Pooled de novo haplotype assembly with `mchap assemble`**
  - Input files
  - Sample pooling
- **Re-calling genotypes with `mchap call`**
  - Specifying a population prior on allele frequencies
- **Re-calling genotypes with `mchap call-pedigree`**
  - Specifying pedigree to inform haplotype calling

**Software requirements:**

This notebook uses the [bash-kernel](https://github.com/takluyver/bash_kernel) for Jupyter which can be installed with `pip install bash_kernel`.
Alternatively, the code from this notebook may be run in a unix-like bash environment.

In addition to using the MCHap software, these examples also require the `bgzip` and `tabix` tools which are part of [`htslib`](https://github.com/samtools/htslib).

**Data sources:**

The bi-parental population used within this notebook is a small subset of the population published by:
J Tahir, C Brendolise, S Hoyte, M Lucas, S Thomson, K Hoeata, C McKenzie, A Wotton, K Funnell, E Morgan, D Hedderley, D Chagné, P M Bourke, J McCallum, S E Gardiner, and L Gea
“QTL Mapping for Resistance to Cankers Induced by Pseudomonas syringae pv. actinidiae (Psa) in a Tetraploid Actinidia chinensis Kiwifruit Population” 
Pathogens 2020, 9, 967; doi:10.3390/pathogens9110967

- Raw sequences: [10.5281/zenodo.4285665](https://zenodo.org/record/4285666) and [10.5281/zenodo.4287636](https://zenodo.org/record/4287637)
- Reference genome: DOI [10.5281/zenodo.5717386](https://zenodo.org/record/5717387) (chromosome 1 only)


## De novo haplotype assembly with `mchap assemble`

*For more background information on* `mchap assemble` *see the online [documentation](https://github.com/PlantandFoodResearch/MCHap/blob/master/docs/assemble.rst).*

### Input files

The required input files have been organised by file type:

- `input/bam`: BAM alignment files for the parents and progeny samples
- `input/bed`: Target loci for haplotype assembly
- `input/vcf`: VCF file of "basis" SNVs
- `input/fasta`: Reference genome

In this example we use one BAM file per sample:

In [1]:
ls input/bam

convert_to_cram.sh       progeny006.loci.bam      progeny013.loci.bam.bai
parent1.loci.bam         progeny006.loci.bam.bai  progeny014.loci.bam
parent1.loci.bam.bai     progeny007.loci.bam      progeny014.loci.bam.bai
parent2.loci.bam         progeny007.loci.bam.bai  progeny015.loci.bam
parent2.loci.bam.bai     progeny008.loci.bam      progeny015.loci.bam.bai
progeny001.loci.bam      progeny008.loci.bam.bai  progeny016.loci.bam
progeny001.loci.bam.bai  progeny009.loci.bam      progeny016.loci.bam.bai
progeny002.loci.bam      progeny009.loci.bam.bai  progeny017.loci.bam
progeny002.loci.bam.bai  progeny010.loci.bam      progeny017.loci.bam.bai
progeny003.loci.bam   

The BED file specifies the genomic coordinates of assembly targets. The fourth column (loci ID) of the BED file is optional but, if present, the loci IDs will be included in the output VCF file:

In [2]:
cat input/bed/targets4.bed

chr1	17590	17709	locus001
chr1	568848	568967	locus012
chr1	684808	684927	locus016
chr1	809104	809223	locus019
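
BED intervals are 0-based and half-open, whereas VCF positions are 1-based, so a target starting at BED position 17590 is reported at VCF position 17591. The sketch below illustrates that conversion on two of the target lines shown above. The `bed_summary` function is a hypothetical helper written for this example and is not part of MCHap.

```shell
# Summarise BED targets as: locus ID, 1-based start, end, length in bp.
# bed_summary is a hypothetical helper name, not an MCHap command.
bed_summary() {
    awk 'BEGIN { FS = OFS = "\t" }
         { print $4, $2 + 1, $3, $3 - $2 }' "$@"
}

# Two of the targets shown above, supplied inline so the example is self-contained
printf 'chr1\t17590\t17709\tlocus001\nchr1\t568848\t568967\tlocus012\n' | bed_summary
```

With the real input this could be run as `bed_summary input/bed/targets4.bed`.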


The input VCF file is used to specify basis SNVs for haplotype alleles. These may be multi-allelic SNVs. Any sample data within this VCF file will be ignored.

### Identify putative SNVs

MCHap includes the `find-snvs` tool for identifying putative SNVs.
This is a fast but simplistic approach to identifying potential basis variants for assembling into haplotypes.

In [3]:
mchap find-snvs \
    --bam input/bam/*.bam \
    --reference input/fasta/chr1.fa.gz \
    --targets input/bed/targets4.bed \
    --ind-maf 0.1 \
    --ind-mad 3 \
    --min-ind 2 \
    | bgzip > putative_snvs.vcf.gz
tabix -p vcf putative_snvs.vcf.gz

Notes:
- The `--targets` parameter should be a BED file defining the genomic intervals to search for putative SNVs
- The `--ind-mad` parameter specifies a per-sample minor allele depth threshold
- The `--ind-maf` parameter specifies a per-sample minor allele frequency threshold (calculated from allele depths)
- The `--min-ind` parameter specifies the minimum number of samples required to meet the above conditions
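
To make the interaction between these three thresholds concrete, the following sketch counts how many samples pass both per-sample conditions at a single site. It is a simplified illustration and not MCHap's implementation: it assumes a bi-allelic site and assumes the thresholds are applied to the alternate allele within each sample. The function name `count_supporting_samples` and the depths are made up for the example.

```shell
# Count samples whose alternate allele passes both per-sample thresholds.
# Input: tab-separated lines of sample, reference depth, alternate depth.
# count_supporting_samples is a hypothetical helper, not part of MCHap.
count_supporting_samples() {
    # $1 = frequency threshold (as for --ind-maf)
    # $2 = depth threshold (as for --ind-mad)
    awk -v maf="$1" -v mad="$2" '
        BEGIN { FS = "\t"; n = 0 }
        {
            total = $2 + $3
            if (total > 0 && $3 >= mad && $3 / total >= maf) n++
        }
        END { print n }'
}

# Made-up depths for three samples (not taken from the example data)
printf 's1\t20\t5\ns2\t30\t2\ns3\t12\t4\n' | count_supporting_samples 0.1 3
```

Under these assumptions, a site would be retained when the printed count is at least the value given to `--min-ind`.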

The output of `mchap find-snvs` is a VCF without genotype calls.
Instead, sample allele depths are reported.
The total allele depths and mean of individual frequencies are also reported in an `INFO` field:

In [4]:
zcat putative_snvs.vcf.gz | head -n 20

##fileformat=VCFv4.3
##fileDate=20240910
##source=mchap v0.9.4.dev72+g719498f.d20240710
##commandline="/home/cfltxm/mambaforge/envs/mchap/bin/mchap find-snvs --bam input/bam/parent1.loci.bam input/bam/parent2.loci.bam input/bam/progeny001.loci.bam input/bam/progeny002.loci.bam input/bam/progeny003.loci.bam input/bam/progeny004.loci.bam input/bam/progeny005.loci.bam input/bam/progeny006.loci.bam input/bam/progeny007.loci.bam input/bam/progeny008.loci.bam input/bam/progeny009.loci.bam input/bam/progeny010.loci.bam input/bam/progeny011.loci.bam input/bam/progeny012.loci.bam input/bam/progeny013.loci.bam input/bam/progeny014.loci.bam input/bam/progeny015.loci.bam input/bam/progeny016.loci.bam input/bam/progeny017.loci.bam input/bam/progeny018.loci.bam input/bam/progeny019.loci.bam input/bam/progeny020.loci.bam --reference input/fasta/chr1.fa.gz --targets input/bed/targets4.bed --ind-maf 0.1 --ind-mad 3 --min-ind 2"
##reference=file:input/fasta/chr1.fa.gz
##contig=<ID=chr1,length=21898217>


### Pooled assembly

For this tutorial we will jump straight into a pooled assembly. A more beginner-friendly example can be found in the standard MCHap [bi-parental example notebook](https://github.com/PlantandFoodResearch/MCHap/blob/master/docs/example/bi-parental.ipynb).

Sample pools can be defined using a tabular file:

In [5]:
cat input/pools/sample-pools.txt

parent1	POOL
parent2	POOL
progeny001	POOL
progeny002	POOL
progeny003	POOL
progeny004	POOL
progeny005	POOL
progeny006	POOL
progeny007	POOL
progeny008	POOL
progeny009	POOL
progeny010	POOL
progeny011	POOL
progeny012	POOL
progeny013	POOL
progeny014	POOL
progeny015	POOL
progeny016	POOL
progeny017	POOL
progeny018	POOL
progeny019	POOL
progeny020	POOL


In the file shown above we assign all of the samples to a pool called "POOL".
However, you can imagine a more complex scheme if our example data contained more than one bi-parental cross.
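
A pool file like the one above can be generated rather than written by hand. The sketch below is one possible approach, assuming the sample names are already available as a plain list; the helper name `make_pool_file` is hypothetical. Calling it once per cross with a different pool name would give the more complex scheme described above.

```shell
# Write a two-column, tab-separated pool file from sample names on stdin.
# make_pool_file is a hypothetical helper; the pool name is its first argument.
make_pool_file() {
    awk -v pool="$1" 'NF { print $1 "\t" pool }'
}

# Three of the sample names from this notebook, supplied inline
printf 'parent1\nparent2\nprogeny001\n' | make_pool_file POOL
```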

We can then run the pooled assembly with:

In [6]:
mchap assemble \
        --bam input/bam/*.bam \
        --targets input/bed/targets4.bed \
        --variants putative_snvs.vcf.gz \
        --reference input/fasta/chr1.fa.gz \
        --sample-pool input/pools/sample-pools.txt \
        --ploidy 8 \
        --report AFP AOP \
        | bgzip > pooled_assembly.vcf.gz
tabix -p vcf pooled_assembly.vcf.gz

Notes:
- The `--bam` argument may be a file containing a list of BAM paths ([documentation](https://github.com/PlantandFoodResearch/MCHap/blob/master/docs/assemble.rst#analyzing-many-samples))
- The `--ploidy` and `--inbreeding` values can be specified per sample using a simple tabular file ([documentation](https://github.com/PlantandFoodResearch/MCHap/blob/master/docs/assemble.rst#sample-parameters))
- The `--inbreeding` argument defaults to `0`, which may be unrealistic; it is often better to guess a "sensible" value ([documentation](https://github.com/PlantandFoodResearch/MCHap/blob/master/docs/assemble.rst#sample-parameters))
- The `--sample-pool` argument is (optionally) used to define sample pools ([documentation](https://github.com/PlantandFoodResearch/MCHap/blob/master/docs/assemble.rst#sample-pooling))
- The `--report AFP AOP` argument tells MCHap to report posterior allele frequencies (`AFP`) and posterior allele occurrence (`AOP`)
- The output of `mchap assemble` is piped into `bgzip` and the resulting compressed VCF file is saved as `pooled_assembly.vcf.gz`
- The `tabix` tool is then used to index the compressed VCF file

Look at the VCF header information:

In [7]:
zcat pooled_assembly.vcf.gz | grep "^#"

##fileformat=VCFv4.3
##fileDate=20240910
##source=mchap v0.9.4.dev72+g719498f.d20240710
##phasing=None
##commandline="/home/cfltxm/mambaforge/envs/mchap/bin/mchap assemble --bam input/bam/parent1.loci.bam input/bam/parent2.loci.bam input/bam/progeny001.loci.bam input/bam/progeny002.loci.bam input/bam/progeny003.loci.bam input/bam/progeny004.loci.bam input/bam/progeny005.loci.bam input/bam/progeny006.loci.bam input/bam/progeny007.loci.bam input/bam/progeny008.loci.bam input/bam/progeny009.loci.bam input/bam/progeny010.loci.bam input/bam/progeny011.loci.bam input/bam/progeny012.loci.bam input/bam/progeny013.loci.bam input/bam/progeny014.loci.bam input/bam/progeny015.loci.bam input/bam/progeny016.loci.bam input/bam/progeny017.loci.bam input/bam/progeny018.loci.bam input/bam/progeny019.loci.bam input/bam/progeny020.loci.bam --targets input/bed/targets4.bed --variants putative_snvs.vcf.gz --reference input/

Note that the above VCF contains a single sample named "POOL".

Look at `locus001`:

In [8]:
zcat pooled_assembly.vcf.gz | grep "locus001"

chr1	17591	locus001	GTTATTGGACAGTGACGATGGAGTGATTGCTGGCGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGTACAC	ATTATTGGACAATGACGATGGGGTGGTTGCTGGCCCAGTCCGCCAGCACCACCACCACCAAGTCAACATGTCGGACATTTATGGGGTGGTGCCACGAAACCTGATCACAAATGGCGCAC,GTTATTGGACAGTGACGATGGAGTGATTGCTGGTGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGCACAC	.	PASS	AN=8;UAN=3;AC=1,1;NS=1;MCI=0;DP=619;RCOUNT=847;END=17709;NVAR=13;SNVPOS=1,12,22,26,34,35,39,65,73,96,104,115,116;AFP=0.75,0.125,0.125;AOP=1,1,1	GT:GQ:SQ:DP:RCOUNT:RCALLS:MEC:MECP:GPM:SPM:MCI:AFP:AOP	0/0/0/0/0/0/1/2:60:60:619:847:8019:17:0.002:1:1:0:0.75,0.125,0.125:1,1,1
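
In this record the reported `AFP` values (0.75, 0.125, 0.125) match the allele frequencies implied by the called octoploid genotype `0/0/0/0/0/0/1/2`. The sketch below shows that arithmetic by counting allele copies in a genotype string. It is only an illustration: `genotype_freqs` is a hypothetical helper, and the posterior frequencies reported by MCHap may differ from this simple count when the genotype call is less certain.

```shell
# Convert a VCF genotype string into per-allele frequencies.
# genotype_freqs is a hypothetical helper: $1 = genotype, $2 = number of alleles.
genotype_freqs() {
    printf '%s\n' "$1" | awk -v n_alleles="$2" '
        {
            # The number of allele copies in the genotype is the ploidy
            ploidy = split($0, alleles, "/")
            for (i = 1; i <= ploidy; i++) count[alleles[i]]++
            for (a = 0; a < n_alleles; a++)
                printf "%s%s", count[a] / ploidy, (a < n_alleles - 1 ? "," : "\n")
        }'
}

# Genotype of the "POOL" sample at locus001 (reference plus two alternate alleles)
genotype_freqs "0/0/0/0/0/0/1/2" 3
```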


### Individual genotype calling

We will start by calling genotypes using a prior on allele frequencies derived from the population:

In [9]:
mchap call \
        --bam input/bam/*.loci.bam \
        --haplotypes pooled_assembly.vcf.gz \
        --ploidy 4 \
        --inbreeding 0.01 \
        --prior-frequencies AFP \
        | bgzip > individual_calling.vcf.gz
tabix -p vcf individual_calling.vcf.gz

Notes:
- The `--prior-frequencies AFP` argument tells MCHap to use the posterior allele frequencies from the assembly (`AFP`) as the priors for calling genotypes

In [10]:
zcat individual_calling.vcf.gz | grep "locus001"

chr1	17591	locus001	GTTATTGGACAGTGACGATGGAGTGATTGCTGGCGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGTACAC	ATTATTGGACAATGACGATGGGGTGGTTGCTGGCCCAGTCCGCCAGCACCACCACCACCAAGTCAACATGTCGGACATTTATGGGGTGGTGCCACGAAACCTGATCACAAATGGCGCAC,GTTATTGGACAGTGACGATGGAGTGATTGCTGGTGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGCACAC	.	PASS	AN=88;UAN=3;AC=22,12;NS=22;MCI=0;DP=619;RCOUNT=847;END=17709;NVAR=13;SNVPOS=1,12,22,26,34,35,39,65,73,96,104,115,116	GT:GQ:SQ:DP:RCOUNT:RCALLS:MEC:MECP:GPM:SPM:MCI	0/0/1/2:60:60:30:43:383:2:0.005:1:1:0	0/0/0/1:60:60:22:28:280:2:0.007:1:1:0	0/0/1/2:60:60:31:41:401:1:0.002:1:1:0	0/0/1/2:23:60:14:26:176:0:0:0.995:1:0	0/0/1/2:60:60:25:31:320:0:0:1:1:0	0/0/1/2:60:60:31:46:408:0:0:1:1:0	0/0/0/1:60:60:22:33:291:0:0:1:1:0	0/0/1/2:30:60:21:26:272:1:0.004:0.999:1:0	0/0/1/2:60:60:38:50:497:1:0.002:1:1:0	0/0/0/1:60:60:33:46:424:1:0.002:1:1:0	0/0/0/1:24:60:28:41:358:4:0.011:0.996:1:0	0/0/0/1:28:60:19:2
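
A long VCF record like this is easier to read when the genotype calls are tabulated. The sketch below counts each distinct genotype among the sample columns, using the first five sample fields from the record above as inline input. The helper name `count_genotypes` is hypothetical.

```shell
# Tabulate genotype calls (the GT sub-field) from tab-separated sample columns.
# count_genotypes is a hypothetical helper that reads sample fields on stdin.
count_genotypes() {
    tr '\t' '\n' | cut -d: -f1 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# First five sample fields of locus001 from the output above
printf '%s\t%s\t%s\t%s\t%s\n' \
    '0/0/1/2:60:60:30:43:383:2:0.005:1:1:0' \
    '0/0/0/1:60:60:22:28:280:2:0.007:1:1:0' \
    '0/0/1/2:60:60:31:41:401:1:0.002:1:1:0' \
    '0/0/1/2:23:60:14:26:176:0:0:0.995:1:0' \
    '0/0/1/2:60:60:25:31:320:0:0:1:1:0' | count_genotypes
```

On the full file, the same helper could be applied with `zcat individual_calling.vcf.gz | grep "locus001" | cut -f10- | count_genotypes`.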

## Pedigree informed genotype calling

**WARNING: The `call-pedigree` program is highly experimental!**

Next we look at pedigree informed genotype calling.
In MCHap, a pedigree is defined with a simple tabular format:

In [11]:
cat input/pedigree/pedigree.txt

parent1	.	.
parent2	.	.
progeny001	parent1	parent2
progeny002	parent1	parent2
progeny003	parent1	parent2
progeny004	parent1	parent2
progeny005	parent1	parent2
progeny006	parent1	parent2
progeny007	parent1	parent2
progeny008	parent1	parent2
progeny009	parent1	parent2
progeny010	parent1	parent2
progeny011	parent1	parent2
progeny012	parent1	parent2
progeny013	parent1	parent2
progeny014	parent1	parent2
progeny015	parent1	parent2
progeny016	parent1	parent2
progeny017	parent1	parent2
progeny018	parent1	parent2
progeny019	parent1	parent2
progeny020	parent1	parent2
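
Before running a pedigree-based analysis it can be useful to check the pedigree file for typos. The sketch below lists any parent named in the second or third column that never appears as a sample in the first column. This is only a sanity check suggested here, under the assumption that the file is tab-separated with `.` marking an unknown parent as shown above; it is not a documented MCHap requirement, and `check_pedigree` is a hypothetical helper.

```shell
# Report parents (columns 2-3) that never appear as a sample in column 1.
# check_pedigree is a hypothetical helper; "." marks an unknown parent.
check_pedigree() {
    awk 'BEGIN { FS = "\t" }
         { samples[$1] = 1; parents[$2] = 1; parents[$3] = 1 }
         END {
             for (p in parents)
                 if (p != "." && p != "" && !(p in samples)) {
                     print "missing: " p
                     bad = 1
                 }
             exit bad + 0
         }' "$@"
}

# A small, consistent pedigree supplied inline
printf 'parent1\t.\t.\nparent2\t.\t.\nprogeny001\tparent1\tparent2\n' \
    | check_pedigree && echo "pedigree looks consistent"
```

With the real input this could be run as `check_pedigree input/pedigree/pedigree.txt`.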


We can run `mchap call-pedigree` with:

In [12]:
mchap call-pedigree \
        --bam input/bam/*.loci.bam \
        --haplotypes pooled_assembly.vcf.gz \
        --ploidy 4 \
        --sample-parents input/pedigree/pedigree.txt \
        --gamete-error 0.1 \
        --prior-frequencies AFP \
        | bgzip > pedigree_calling.vcf.gz
tabix -p vcf pedigree_calling.vcf.gz



In [13]:
zcat pedigree_calling.vcf.gz | grep "#CHROM"
zcat pedigree_calling.vcf.gz | grep "locus001"

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	parent1	parent2	progeny001	progeny002	progeny003	progeny004	progeny005	progeny006	progeny007	progeny008	progeny009	progeny010	progeny011	progeny012	progeny013	progeny014	progeny015	progeny016	progeny017	progeny018	progeny019	progeny020
chr1	17591	locus001	GTTATTGGACAGTGACGATGGAGTGATTGCTGGCGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGTACAC	ATTATTGGACAATGACGATGGGGTGGTTGCTGGCCCAGTCCGCCAGCACCACCACCACCAAGTCAACATGTCGGACATTTATGGGGTGGTGCCACGAAACCTGATCACAAATGGCGCAC,GTTATTGGACAGTGACGATGGAGTGATTGCTGGTGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGCACAC	.	PASS	AN=88;UAN=3;AC=22,12;NS=22;MCI=0;DP=619;RCOUNT=847;END=17709;NVAR=13;SNVPOS=1,12,22,26,34,35,39,65,73,96,104,115,116	GT:GQ:SQ:DP:RCOUNT:RCALLS:MEC:MECP:GPM:SPM:MCI:PEDERR	0/0/1/2:60:60:30:43:383:2:0.005:1:1:0:0	0/0/0/1:60:60:22:28:280:2:0.007:1:1:0:0	0/0/1/2:60:60:31:41:401:1:0.002:1:1:0:0	0/0/1/

### Pedigree imputation

If the pedigree names an individual that does not have an alignment file then MCHap will attempt to impute that individual's genotype.

We can create a list of alignment files that does not include `parent1`:

In [14]:
ls input/bam/*.bam | tail -n 21 > bams.txt
cat bams.txt

input/bam/parent2.loci.bam
input/bam/progeny001.loci.bam
input/bam/progeny002.loci.bam
input/bam/progeny003.loci.bam
input/bam/progeny004.loci.bam
input/bam/progeny005.loci.bam
input/bam/progeny006.loci.bam
input/bam/progeny007.loci.bam
input/bam/progeny008.loci.bam
input/bam/progeny009.loci.bam
input/bam/progeny010.loci.bam
input/bam/progeny011.loci.bam
input/bam/progeny012.loci.bam
input/bam/progeny013.loci.bam
input/bam/progeny014.loci.bam
input/bam/progeny015.loci.bam
input/bam/progeny016.loci.bam
input/bam/progeny017.loci.bam
input/bam/progeny018.loci.bam
input/bam/progeny019.loci.bam
input/bam/progeny020.loci.bam
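
The `tail -n 21` approach above relies on `parent1` sorting first among the BAM files. A sketch of an alternative that excludes a sample by name, independent of sort order, is given below. It assumes file names of the form `<sample>.<suffix>` as used in this notebook, and `exclude_sample` is a hypothetical helper.

```shell
# Drop every path whose file name starts with "<sample>.".
# exclude_sample is a hypothetical helper; it reads paths on stdin.
exclude_sample() {
    grep -v "/$1\.[^/]*$"
}

# Three of the BAM paths from this notebook, supplied inline
printf '%s\n' \
    'input/bam/parent1.loci.bam' \
    'input/bam/parent2.loci.bam' \
    'input/bam/progeny001.loci.bam' | exclude_sample parent1
```

With the real input this could be run as `ls input/bam/*.bam | exclude_sample parent1 > bams.txt`.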


We then run `mchap call-pedigree` using that list of BAM files:

In [15]:
mchap call-pedigree \
        --bam bams.txt \
        --haplotypes pooled_assembly.vcf.gz \
        --ploidy 4 \
        --sample-parents input/pedigree/pedigree.txt \
        --gamete-error 0.1 \
        --prior-frequencies AFP \
        | bgzip > pedigree_imputing.vcf.gz
tabix -p vcf pedigree_imputing.vcf.gz



We can see that `parent1` has been added as the last sample in the list:

In [16]:
zcat pedigree_imputing.vcf.gz | grep "#CHROM"
zcat pedigree_imputing.vcf.gz | grep "locus001"

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	parent2	progeny001	progeny002	progeny003	progeny004	progeny005	progeny006	progeny007	progeny008	progeny009	progeny010	progeny011	progeny012	progeny013	progeny014	progeny015	progeny016	progeny017	progeny018	progeny019	progeny020	parent1
chr1	17591	locus001	GTTATTGGACAGTGACGATGGAGTGATTGCTGGCGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGTACAC	ATTATTGGACAATGACGATGGGGTGGTTGCTGGCCCAGTCCGCCAGCACCACCACCACCAAGTCAACATGTCGGACATTTATGGGGTGGTGCCACGAAACCTGATCACAAATGGCGCAC,GTTATTGGACAGTGACGATGGAGTGATTGCTGGTGCAGGCCGCCAGCACCACCACCACCAAGTCGACATGTCCGACATTTATGGGGTGGTGCCACAAAACCTGCTCACAAATGGCACAC	.	PASS	AN=88;UAN=3;AC=22,12;NS=22;MCI=0;DP=589;RCOUNT=804;END=17709;NVAR=13;SNVPOS=1,12,22,26,34,35,39,65,73,96,104,115,116	GT:GQ:SQ:DP:RCOUNT:RCALLS:MEC:MECP:GPM:SPM:MCI:PEDERR	0/0/0/1:16:60:22:28:280:2:0.007:0.976:1:0:0	0/0/1/2:60:60:31:41:401:1:0.002:1:1:0:0	0/0/1/2:28:60:14:26:176:0:0:0.998:1:0:0	0/

Note that the imputed genotype for `parent1` is the same as above, but its quality scores are much lower, which is unsurprising given that no reads are available for that sample:

In [17]:
echo individual_calling.vcf.gz
zcat individual_calling.vcf.gz | grep "#CHROM" | cut -f10
zcat individual_calling.vcf.gz | grep "locus001" | cut -f10
echo pedigree_calling.vcf.gz
zcat pedigree_calling.vcf.gz | grep "#CHROM" | cut -f10
zcat pedigree_calling.vcf.gz | grep "locus001" | cut -f10
echo pedigree_imputing.vcf.gz
zcat pedigree_imputing.vcf.gz | grep "#CHROM" | cut -f31
zcat pedigree_imputing.vcf.gz | grep "locus001" | cut -f31

individual_calling.vcf.gz
parent1
0/0/1/2:60:60:30:43:383:2:0.005:1:1:0
pedigree_calling.vcf.gz
parent1
0/0/1/2:60:60:30:43:383:2:0.005:1:1:0:0
pedigree_imputing.vcf.gz
parent1
0/0/1/2:3:3:0:0:0:0:.:0.528:0.537:0:0
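
The colon-separated sample fields are easier to compare when each value is paired with its key from the `FORMAT` column. The sketch below does this for the imputed `parent1` field shown above, using the `FORMAT` string from the `call-pedigree` output. The helper name `format_pairs` is hypothetical.

```shell
# Print KEY=VALUE pairs for one sample, given the FORMAT string and sample field.
# format_pairs is a hypothetical helper, not part of MCHap.
format_pairs() {
    awk -v keys="$1" -v vals="$2" 'BEGIN {
        n = split(keys, k, ":")
        split(vals, v, ":")
        for (i = 1; i <= n; i++) print k[i] "=" v[i]
    }'
}

# Imputed parent1 field from pedigree_imputing.vcf.gz, showing a few keys only
format_pairs "GT:GQ:SQ:DP:RCOUNT:RCALLS:MEC:MECP:GPM:SPM:MCI:PEDERR" \
             "0/0/1/2:3:3:0:0:0:0:.:0.528:0.537:0:0" | grep -E '^(GT|GQ|DP|GPM)='
```

For the imputed sample this shows a read depth (`DP`) of 0 and a genotype quality (`GQ`) of 3, compared with a `GQ` of 60 when the reads for `parent1` were included.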
