title subtitle author job framework highlighter hitheme widgets mode revealjs theme transition center knit
MEPI group meeting, 2017
Mukarram Hossain
University of Cambridge
revealjs
highlight.js
zenburn
selfcontained
serif
cube
false
slidify::knit2slides

Alignment-free subtyping of HIV sequences



Mukarram Hossain

Department of Veterinary Medicine
University of Cambridge

MEPI group meeting, March 2017

 


Subtype classification

  • Viruses are often grouped into subtypes.
  • Subtypes have wide implications for studies of viruses, including:
    • clinical
    • epidemiological
    • structural
    • functional
  • Existing classification techniques mostly rely on alignments followed by phylogenetic and/or statistical algorithms.

Alignment uncertainty

 


Alignment-free classification

  • Lossless compression techniques have shown promising results for biological sequence classification:
    • Protein family prediction (Begleiter et al., 2004)
    • Protein structure prediction (Ferragina et al., 2007)
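
To illustrate the general idea of compression-based classification, here is a minimal sketch. It assumes `zlib` (DEFLATE) as a convenient stand-in compressor; it is not the method of the cited papers, and the names `compressed_size` and `classify_by_compression` and the toy sequences are hypothetical. A query is assigned to the reference that needs the fewest extra compressed bytes once the query is appended to it.

```python
import random
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def classify_by_compression(query, references):
    """Return the label whose reference grows least when the query is appended."""
    best_label, best_cost = None, float("inf")
    query_bytes = query.encode("ascii")
    for label, reference in references.items():
        reference_bytes = reference.encode("ascii")
        # Extra bytes needed to describe the query given this reference.
        cost = compressed_size(reference_bytes + query_bytes) - compressed_size(reference_bytes)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Toy data (hypothetical): two random references and a query copied from reference "A".
rng = random.Random(0)
references = {
    name: "".join(rng.choice("ACGT") for _ in range(400)) for name in ("A", "B")
}
query = references["A"][100:220]
predicted = classify_by_compression(query, references)
print(predicted)
```

No alignment is computed at any point: the compressor's own model of the reference decides how cheaply the query can be encoded.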



 


COMET

  • COMET is an ultrafast alignment-free subtyping tool
  • Uses Prediction by Partial Matching (PPM)
  • Initially designed for HIV-1
  • COMET was tested on both synthetic (1,090,698) and clinical (10,625) HIV datasets
  • Sensitivity and specificity were comparable to or higher than:
    • REGA (de Oliveira et al., 2005) and
    • SCUEAL (Pond et al., 2009)
  • Detected and identified new recombinant forms

COMET algorithm

  • Builds variable-order Markov models for each reference sequence
  • Given a query, COMET calculates the log-likelihood of observing the base at each position
  • This results in a matrix of likelihood values
  • Subtype call is done using a decision tree
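
The first three steps can be sketched roughly as follows. This is a simplified illustration, not COMET's implementation: it assumes a maximum context length of 3, backs off to the longest context seen in training, and uses add-one smoothing in place of PPM's escape mechanism. The names `train_counts` and `log_likelihoods` and the toy references are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"
MAX_ORDER = 3  # assumption: longest context used in this sketch


def train_counts(sequences, max_order=MAX_ORDER):
    """Count next-base occurrences for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, base in enumerate(seq):
            for order in range(min(i, max_order) + 1):
                counts[seq[i - order:i]][base] += 1
    return counts


def log_likelihoods(query, counts, max_order=MAX_ORDER):
    """Log-probability of each query base, given the longest context seen in training."""
    result = []
    for i, base in enumerate(query):
        seen = {}
        for order in range(min(i, max_order), -1, -1):
            context = query[i - order:i]
            if context in counts:
                seen = counts[context]
                break
        total = sum(seen.values())
        # Add-one smoothing keeps unseen bases at a non-zero probability.
        result.append(math.log((seen.get(base, 0) + 1) / (total + len(BASES))))
    return result


# Toy reference sets (hypothetical), one model per subtype.
models = {
    "A": train_counts(["ACGT" * 20]),
    "B": train_counts(["AACC" * 20]),
}
query = "ACGTACGT"
# One row per subtype, one column per query position.
matrix = {subtype: log_likelihoods(query, model) for subtype, model in models.items()}
```

Each row of `matrix` holds the per-position log-likelihoods under one subtype model, which is the kind of input a downstream decision rule would work from.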

The decision tree

 


Classification using Neural Networks

  • Neural networks are computational systems that mimic the biological brain
  • They consist of clusters of neural units organised in layers



 


ANN: design

  • The input layer consists of 32 neurons:
    • gets values from the fixed context
  • Hidden layer consists of N neurons
    • processes inputs coming from the input layer using weights and biases
  • Output layer consists of 4 neurons
    • uses the softmax function to generate probabilities for the nucleotide bases A, C, G, T
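
A forward pass through a network of this shape might look like the sketch below. Several details are assumptions rather than facts from the slides: the 32 inputs are taken to be a one-hot encoding of the 8 preceding bases, the hidden layer uses sigmoid units, `HIDDEN_NEURONS = 30` stands in for the unspecified N, and the weights are random rather than trained. All function names are hypothetical.

```python
import math
import random

BASES = "ACGT"
CONTEXT_LENGTH = 8   # assumption: 8 preceding bases x 4 one-hot values = 32 inputs
HIDDEN_NEURONS = 30  # assumption: stands in for the unspecified N


def one_hot_context(context):
    """Encode a context of CONTEXT_LENGTH bases as a flat list of 32 zeros and ones."""
    if len(context) != CONTEXT_LENGTH:
        raise ValueError("context must contain exactly CONTEXT_LENGTH bases")
    return [1.0 if base == b else 0.0 for base in context for b in BASES]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Random weights scaled by 1/sqrt(n_in), zero biases."""
    weights = [[rng.gauss(0.0, 1.0) / math.sqrt(n_in) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, inputs):
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(context, hidden_layer, output_layer):
    """Map a fixed context to a probability for each of A, C, G, T."""
    hidden = [sigmoid(z) for z in affine(hidden_layer, one_hot_context(context))]
    return dict(zip(BASES, softmax(affine(output_layer, hidden))))


rng = random.Random(1)
hidden_layer = init_layer(HIDDEN_NEURONS, 4 * CONTEXT_LENGTH, rng)
output_layer = init_layer(len(BASES), HIDDEN_NEURONS, rng)
probabilities = forward("ACGTACGT", hidden_layer, output_layer)
print(probabilities)
```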

ANN: implementation

  • Based on the example code from the book 'Neural networks and deep learning' by Michael Nielsen
  • Written in Python 3

ANN: training

  • We use the reference sequence set used in COMET to train the ANN
  • Cross-validation is done by randomly removing one sequence from the training set
  • Cross-entropy cost function is used to update network weights and biases
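
The two training ingredients can be sketched as below, as an illustration only. It assumes a softmax output layer, for which the gradient of the cross-entropy cost with respect to the pre-softmax scores is simply the predicted probabilities minus the one-hot target. The helper names, the learning rate, and the toy sequences are hypothetical, and only a bias update is shown.

```python
import math
import random


def hold_one_out(sequences, rng):
    """Randomly remove one sequence for validation and train on the rest."""
    held_index = rng.randrange(len(sequences))
    training = [s for i, s in enumerate(sequences) if i != held_index]
    return training, sequences[held_index]


def softmax(zs):
    peak = max(zs)
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_index):
    """Cost of the prediction when the true base is at target_index."""
    return -math.log(probabilities[target_index])


def output_delta(probabilities, target_index):
    """Gradient of the cross-entropy cost w.r.t. the pre-softmax scores: p - y."""
    return [p - (1.0 if i == target_index else 0.0)
            for i, p in enumerate(probabilities)]


# Hold one toy sequence out of a hypothetical reference set.
rng = random.Random(0)
sequences = ["AAAA", "CCCC", "GGGG"]
training, held_out = hold_one_out(sequences, rng)

# One gradient step on the output biases alone, for a target base at index 2 (G).
biases = [0.0, 0.0, 0.0, 0.0]
learning_rate = 0.5  # assumption
target = 2
cost_before = cross_entropy(softmax(biases), target)
delta = output_delta(softmax(biases), target)
biases = [b - learning_rate * d for b, d in zip(biases, delta)]
cost_after = cross_entropy(softmax(biases), target)
print(cost_before, cost_after)
```

In a full network the same `delta` would be propagated back through the hidden layer to update its weights and biases as well.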

ANN: subtyping

  • For each nucleotide position in the query sequence:
    • the ANN for each subtype generates the probability of seeing the nucleotide given the preceding context
  • The decision tree used in COMET is used to predict the subtype of the query sequence
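
The per-position scoring loop could be sketched as follows. The slides do not describe the decision tree itself, so this sketch substitutes a much simpler rule (highest total log-probability) purely as a placeholder. The context length of 8, the function names, and the two fixed-distribution toy "models" are assumptions; in practice each model would be a trained per-subtype ANN.

```python
import math

CONTEXT_LENGTH = 8  # assumption: number of preceding bases fed to each model


def score_query(query, models, context_length=CONTEXT_LENGTH):
    """Return {subtype: [log p(base | preceding context)]} over scorable positions."""
    matrix = {subtype: [] for subtype in models}
    for i in range(context_length, len(query)):
        context, base = query[i - context_length:i], query[i]
        for subtype, predict in models.items():
            matrix[subtype].append(math.log(predict(context)[base]))
    return matrix


def call_subtype(matrix):
    """Placeholder for the decision tree: pick the highest total log-probability."""
    return max(matrix, key=lambda subtype: sum(matrix[subtype]))


# Toy stand-ins for trained networks (hypothetical): they ignore the context.
def at_rich(context):
    return {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}


def gc_rich(context):
    return {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}


models = {"AT-rich": at_rich, "GC-rich": gc_rich}
matrix = score_query("ATATTAATATAT", models)
call = call_subtype(matrix)
print(call)
```

The first `CONTEXT_LENGTH` positions are skipped here because they lack a full context; how such positions are handled in the actual tool is not stated in the slides.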

Cross-validation

 


Test datasets

  • Reference
  • Vanderbilt
  • PR-RT

Accuracy comparison

 


Future direction

  • Optimise neural network parameters
  • Larger context size?
  • Recurrent neural networks (RNNs)?
  • Report breakpoints for potential novel recombinants
  • Implement using TensorFlow