GitHub - EmaPajic/Variant-Calling: Implementation of Variant Calling algorithm for Computational Genomics course

Binomial Variant Caller from Pileup

This repo contains an implementation of a simple Variant Calling algorithm that assumes that nucleotide counts at one pileup position follow the same Binomial distribution and uses Maximum Likelihood Estimation (MLE) for calling the variants.

Youtube presentation

How to use

The algorithm takes an input file in the Pileup format, and produces a .vcf output file in the Variant Call format.

To run the tool:

Install pysam, numpy, scipy. TODO: Create requirements.txt
Run python main.py --input-file=PATH_TO_INPUT_FILE --output-file=PATH_TO_SAVE_OUTPUT --p=0.8.

To see other possible parameters, call python main.py --help.

What is Variant Calling?

Variant calling is the process of finding differences between a reference genome and an observed sample.
By comparing an observed genome with a well known reference genome, we can identify, store and query someone's genome easier.
Genomic variants can be simple, one example are single nucleotide variants (SNVs), and they can be complex, one example includes translocations.

Before identifying variants, we need to read the actual genome we want to analyze. Unfortunately, there isn't a method to read the whole genome in one go, so DNA is read by fragmenting the genome into millions of molecules and reading shorter reads instead. The reads are then aligned to reconstruct the whole genome.

The algorithm implemented in this repo identifies variants by looking at only one position at a time, for what the pileup format is perfect for.
Pileup format gives us how many of each nucleotide (A, G, C, T) are at every position, and also how many insertion/deletition occurencies are at that position. It also gives us the information what the reference genome had, against what we will compare.

Algorithm

After we have constrained the analysis to only one sequence position, we want to count how many of each AGCT letter and each INDEL (insertion/deletition) is at that position.
We will assume diploidy, hence there are at most two variants at one position. We will select between all letters and indels two with the largest count, remember their counts and discard others.

To simplify, let's call them letters a and b, with counts n_a and n_b. We will use the Maximum Likelihood (MLE) estimation to select the variants.

P(variant|counts) = P(variant) * P(counts|variant) / P(counts)

MLE ignores P(variant), and chooses the most probable one by maximizing P(counts|variant), the likelihood function.

All nucleotides at one position are either correct given the variant, or an error due to read extraction process. We will assume that all nucleotides are correct with probability p and are independent of each other. There are three cases:

Variant is aa, n_a nucleotides are correct. Likelihood function is given by Binomial distribution with parameters n_a + n_b and p, at position n_a.
Variant is bb, n_b nucleotides are correct. Likelihood function is given by Binomial distribution with parameters n_a + n_b and p, at position n_b.
Variant is ab, all nucleotides are correct. We found that it is suboptimal to blend a and b into one correct letter and have the likelihood function be equal to the same Binomial distribution at position n_a + n_b, because this option would be picked too often. To reduce that, we assumed that a and b should appear with equal probability 1/2 and calculated the likelihood as probability that the counts are n_a and n_b in that setting.

After calculating the likelihoods, we will pick the most likely variant! That variant is compared to the referent genome, to determine the genotype and the 'alts' field to store in the VCF 4.2 format output file.

Results

We used merged-normal.bam from Seven Bridges Cancer Genomics Cloud to test the algorithm. We produced merged-normal.pileup using Bcftools Mpileup tool and compared our called variants with variants produces by Bcftools Call tool.

Having or not having a variant at one position in the Bcftools Call tool output will be our true/false labels, and comparing the output of our algorithm gives us true positives, true negatives, false positives and false negatives, which are used to calculate standard metrics.

We ran the algorithm for probabilities between 0.5 and 1, at every 0.05 increment, and the results are:

The confusion matrix for p = 0.8 (best f1 score) is as follows:

Acknowlegments

This project was done for Computational Genomics course at School of Electrical Engineering, University of Belgrade.

You can check the course github for useful learning materials.

Bcftools Mpileup tool and Bcftools Call tool were ran on Seven Bridges Cancer Genomics Cloud , where we also downloaded the reference human genome and merged-normal.bam test file.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
images		images
test_data		test_data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
metrics.py		metrics.py
pileup_reader.py		pileup_reader.py
test.py		test.py
variant_caller.py		variant_caller.py
vcf_writer.py		vcf_writer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Binomial Variant Caller from Pileup

Table of content

How to use

What is Variant Calling?

Algorithm

Results

Acknowlegments

Licence

About

Releases

Packages

Contributors 2

Languages

License

EmaPajic/Variant-Calling

Folders and files

Latest commit

History

Repository files navigation

Binomial Variant Caller from Pileup

Table of content

How to use

What is Variant Calling?

Algorithm

Results

Acknowlegments

Licence

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages