
title: MEPI group meeting, 2017
subtitle:
author: Mukarram Hossain
job: University of Cambridge
framework: revealjs
highlighter: highlight.js
hitheme: zenburn
widgets:
mode: selfcontained
revealjs:
theme: serif
transition: cube
center: false
knit: slidify::knit2slides

Alignment-free subtyping of HIV sequences

Mukarram Hossain

Department of Veterinary Medicine
University of Cambridge

MEPI group meeting, March 2017

Subtype classification

Viruses are often grouped into subtypes.
Subtype assignments have wide implications for the following kinds of virus studies:
- clinical
- epidemiological
- structural
- functional
Existing classification techniques mostly rely on alignments followed by phylogenetic and/or statistical algorithms.

Alignment uncertainty

Alignment-free classification

Lossless compression techniques have shown promising results for biological sequence classification:
- Protein family prediction (Begleiter et al., 2004)
- Protein structure prediction (Ferragina et al., 2007)
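
As a rough illustration of why compression can classify sequences: a query should compress better when appended to a related reference than to an unrelated one. The sketch below uses `zlib` from the Python standard library as a stand-in compressor; it is not the method of the cited papers, and the random toy sequences and the function names are hypothetical.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Size in bytes of the zlib-compressed sequence."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def extra_cost(reference: str, query: str) -> int:
    """Extra bytes needed to compress the query once the reference is known."""
    return compressed_size(reference + query) - compressed_size(reference)


def classify(query: str, references: dict) -> str:
    """Pick the reference label under which the query compresses best."""
    return min(references, key=lambda label: extra_cost(references[label], query))


# Toy data (hypothetical): two unrelated random "references" and a query
# that is a lightly mutated fragment of reference "A".
rng = random.Random(0)
references = {
    "A": "".join(rng.choice("ACGT") for _ in range(2000)),
    "B": "".join(rng.choice("ACGT") for _ in range(2000)),
}
fragment = list(references["A"][500:1000])
for pos in range(50, 500, 100):  # five point substitutions
    fragment[pos] = "A" if fragment[pos] != "A" else "C"
query = "".join(fragment)

print(classify(query, references))
```

No alignment is computed at any point; the compressor's internal model of repeated substrings does the work.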

COMET

COMET is an ultrafast, alignment-free subtyping tool
Uses Prediction by Partial Matching (PPM)
Initially designed for HIV-1
COMET was tested on both synthetic (1,090,698) and clinical (10,625) HIV datasets
Sensitivity and specificity were comparable to or higher than:
- REGA (de Oliveira et al., 2005)
- SCUEAL (Pond et al., 2009)
Detected and identified new recombinant forms

COMET algorithm

Builds a variable-order Markov model for each reference sequence
Given a query, COMET calculates the log-likelihood of observing the base at each position
This results in a matrix of likelihood values
The subtype call is made using a decision tree
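
The first three steps can be sketched with a small count-based context model. This is a simplified stand-in, not COMET's implementation: it backs off to shorter contexts and uses add-one smoothing instead of PPM escape probabilities, and the class name, `max_order` value, reference labels, and toy sequences are all hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class ContextModel:
    """Count-based variable-order Markov model over nucleotides (simplified)."""

    def __init__(self, max_order: int = 3):
        self.max_order = max_order
        # counts[context][base] = times `base` followed `context` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seq: str) -> None:
        for i, base in enumerate(seq):
            for order in range(self.max_order + 1):
                if i - order < 0:
                    break
                self.counts[seq[i - order:i]][base] += 1

    def prob(self, context: str, base: str) -> float:
        """Back off to the longest suffix of `context` seen in training."""
        context = context[-self.max_order:] if self.max_order else ""
        while context and context not in self.counts:
            context = context[1:]
        seen = self.counts.get(context, {})
        total = sum(seen.values())
        # Add-one smoothing keeps unseen bases at a non-zero probability.
        return (seen.get(base, 0) + 1) / (total + len(BASES))


def log_likelihood_matrix(models: dict, query: str) -> dict:
    """One row per reference model, one column per query position."""
    matrix = {}
    for name, model in models.items():
        row = []
        for i, base in enumerate(query):
            context = query[max(0, i - model.max_order):i]
            row.append(math.log(model.prob(context, base)))
        matrix[name] = row
    return matrix


# Toy references (hypothetical, not real HIV sequences).
models = {"subtype_1": ContextModel(), "subtype_2": ContextModel()}
models["subtype_1"].train("ACGT" * 20)
models["subtype_2"].train("GGCCTTAA" * 10)

query = "ACGTACGT"
matrix = log_likelihood_matrix(models, query)
for name, row in matrix.items():
    print(name, round(sum(row), 2))
```

The decision tree that turns this matrix into a subtype call is not reproduced here; summing each row is only a quick way to inspect the toy output.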

The decision tree

Classification using Neural Networks

Neural networks are computational systems that loosely mimic biological brains
They consist of clusters of neural units organised in layers

ANN: design

The input layer consists of 32 neurons:
- gets values from the fixed context
The hidden layer consists of N neurons:
- processes inputs coming from the input layer using weights and biases
The output layer consists of 4 neurons:
- uses the softmax function to generate probabilities for the nucleotide bases A, C, G, T
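
A forward pass through such a network could look like the sketch below. Several details are assumptions not stated on the slide: the 32 inputs are taken to be a one-hot encoding of 8 preceding bases, the hidden layer uses sigmoid units, `N_HIDDEN = 16` is an arbitrary choice for N, and the weights are random (untrained).

```python
import math
import random

BASES = "ACGT"
CONTEXT_LEN = 8  # assumption: 8 bases x 4 one-hot values = 32 inputs
N_HIDDEN = 16    # hypothetical value for N


def one_hot_context(context: str) -> list:
    """Encode a fixed-length context as a flat one-hot vector."""
    assert len(context) == CONTEXT_LEN
    vec = []
    for base in context:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs: list) -> list:
    shift = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights: list, biases: list, inputs: list) -> list:
    """Compute W.x + b for a layer stored as a list of weight rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_out: int, n_in: int, rng: random.Random):
    """Random Gaussian weights (scaled by sqrt(n_in)) and biases."""
    weights = [[rng.gauss(0.0, 1.0) / math.sqrt(n_in) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return weights, biases


def forward(context: str, hidden, output) -> dict:
    """Return P(next base | context) from the (untrained) toy network."""
    x = one_hot_context(context)
    h = [sigmoid(z) for z in affine(hidden[0], hidden[1], x)]
    probs = softmax(affine(output[0], output[1], h))
    return dict(zip(BASES, probs))


rng = random.Random(1)
hidden = init_layer(N_HIDDEN, 4 * CONTEXT_LEN, rng)
output = init_layer(4, N_HIDDEN, rng)
probs = forward("ACGTACGT", hidden, output)
print(probs)
```

The softmax output always sums to one, so the four values can be read directly as a distribution over the next base.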

ANN: implementation

Based on the example code from the book 'Neural networks and deep learning' by Michael Nielsen
Written in Python 3

ANN: training

We use the reference sequence set used in COMET to train the ANN
Cross-validation is done by randomly removing one sequence from the training set
A cross-entropy cost function is used to update the network weights and biases
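
The two training ingredients above can be sketched as follows, assuming a softmax output layer: with a cross-entropy cost, the gradient with respect to the pre-softmax activations reduces to "predicted minus target", which the snippet checks numerically. The helper names, the example activations, and the placeholder sequence labels are hypothetical.

```python
import math
import random


def softmax(zs):
    shift = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Cost for one example whose true base has index `target_index`."""
    return -math.log(probs[target_index])


def output_delta(probs, target_index):
    """Gradient of the cost w.r.t. the pre-softmax activations: p - y."""
    return [p - (1.0 if i == target_index else 0.0)
            for i, p in enumerate(probs)]


def hold_out_one(sequences, rng):
    """Randomly remove one sequence; return (training set, held-out sequence)."""
    held = rng.randrange(len(sequences))
    return sequences[:held] + sequences[held + 1:], sequences[held]


z = [0.2, -1.0, 0.5, 0.1]  # hypothetical activations for A, C, G, T
target = 2                 # the observed base is G
delta = output_delta(softmax(z), target)

# Central finite differences should agree with the analytic gradient.
eps = 1e-6
numeric = []
for i in range(len(z)):
    up, down = list(z), list(z)
    up[i] += eps
    down[i] -= eps
    numeric.append((cross_entropy(softmax(up), target)
                    - cross_entropy(softmax(down), target)) / (2 * eps))

# One gradient-descent step on the activations lowers the cost.
cost_before = cross_entropy(softmax(z), target)
z_new = [zi - 0.5 * d for zi, d in zip(z, delta)]
cost_after = cross_entropy(softmax(z_new), target)

training, held_out = hold_out_one(["ref_a", "ref_b", "ref_c", "ref_d"],
                                  random.Random(0))
print(cost_before, cost_after, held_out)
```

In a full network the same delta is propagated backwards to obtain the weight and bias updates.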

ANN: subtyping

For each nucleotide position in the query sequence:
- the ANN for each subtype generates the probability of seeing that nucleotide given the preceding context
The decision tree used in COMET is then used to predict the subtype of the query sequence
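
The scoring loop can be sketched independently of how each subtype model is built. Here each model is any callable `(context, base) -> probability`; the two toy predictors, the 8-base context length, and `naive_call` are hypothetical. In particular, `naive_call` simply picks the highest total log-likelihood and is only a stand-in for COMET's decision tree, which is not reproduced here.

```python
import math

CONTEXT_LEN = 8  # assumption, matching a 32-input one-hot encoding


def score_query(query, predictors, context_len=CONTEXT_LEN):
    """Per-position log-probabilities of the query under each subtype model.

    `predictors` maps a subtype label to a callable (context, base) -> probability.
    Positions without a full preceding context are skipped.
    """
    scores = {}
    for subtype, predict in predictors.items():
        scores[subtype] = [
            math.log(predict(query[i - context_len:i], query[i]))
            for i in range(context_len, len(query))
        ]
    return scores


def naive_call(scores):
    """Stand-in for the decision tree: highest total log-likelihood wins."""
    return max(scores, key=lambda subtype: sum(scores[subtype]))


# Toy predictors (hypothetical): one expects a period-4 repeat, one is uniform.
def periodic(context, base):
    return 0.7 if base == context[-4] else 0.1


def uniform(context, base):
    return 0.25


query = "ACGT" * 5
scores = score_query(query, {"periodic": periodic, "uniform": uniform})
print(naive_call(scores))
```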

Cross-validation

Test datasets

- Reference
- Vanderbilt
- PR-RT

Accuracy comparison

Future direction

Optimise neural network parameters
Larger context size?
Recurrent Neural Networks (RNNs)?
Report breakpoints for potential novel recombinants
Implement using TensorFlow