Mukarram Hossain
Department of Veterinary Medicine
University of Cambridge
MEPI group meeting, March 2017
- Viruses are often grouped into subtypes.
- Subtypes have wide implications for several kinds of virus studies:
    - clinical
    - epidemiological
    - structural
    - functional
- Existing classification techniques mostly rely on alignments followed by phylogenetic and/or statistical algorithms.
- Lossless compression techniques have shown promising results for biological sequence classification:
    - Protein family prediction (Begleiter et al., 2004)
    - Protein structure prediction (Ferragina et al., 2007)
- COMET is an ultrafast, alignment-free subtyping tool
    - Uses Prediction by Partial Matching (PPM)
    - Initially designed for HIV-1
- COMET was tested on both synthetic (1,090,698) and clinical (10,625) HIV datasets
- Sensitivity and specificity were comparable to or higher than those of:
    - REGA (de Oliveira et al., 2005) and
    - SCUEAL (Pond et al., 2009)
- Detected and identified new recombinant forms
- COMET builds variable-order Markov models for each reference sequence
- Given a query, COMET calculates the log-likelihood of observing the base at each position
- This results in a matrix of likelihood values
- The subtype call is made using a decision tree
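
The per-position scoring step can be sketched as follows. This is a simplified illustration, not COMET's implementation: it swaps PPM's variable-order models for a fixed-order context model with add-one smoothing, and the reference sequences, subtype names, and function names are made up for the example.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_context_counts(reference, order=3):
    """Count how often each base follows each length-`order` context."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(reference)):
        context = reference[i - order:i]
        counts[context][reference[i]] += 1
    return counts


def position_log_likelihoods(query, counts, order=3):
    """Return log P(base | preceding context) for each scorable position."""
    scores = []
    for i in range(order, len(query)):
        context = query[i - order:i]
        seen = counts.get(context, {})
        total = sum(seen.values())
        # Add-one smoothing so unseen bases never get probability zero.
        prob = (seen.get(query[i], 0) + 1) / (total + len(BASES))
        scores.append(math.log(prob))
    return scores


# Toy reference sequences (made up), one per hypothetical subtype.
references = {"A": "ACGTACGTACGTACGT", "B": "AACCGGTTAACCGGTT"}
models = {name: train_context_counts(seq) for name, seq in references.items()}

query = "ACGTACGTAC"
# One row per reference model, one column per scored query position.
matrix = {name: position_log_likelihoods(query, m) for name, m in models.items()}
```

Each row of `matrix` holds the log-likelihoods under one reference model; a model that resembles the query gives values closer to zero.
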
- Neural networks are computational systems that mimic biological brains
- They consist of clusters of neural units organised in layers
- The input layer consists of 32 neurons:
    - gets values from the fixed context
- The hidden layer consists of N neurons
    - processes inputs coming from the input layer using weights and biases
- The output layer consists of 4 neurons
    - uses the softmax function to generate probabilities for the nucleotide bases A, C, G, T
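
A forward pass through a network of this shape might look like the sketch below. Several details are assumptions rather than facts from the slides: the 32 inputs are taken to be a one-hot encoding of an 8-base context, the hidden layer uses sigmoid activations with `HIDDEN = 16` standing in for N, and the weights are random, so the output is not a trained prediction.

```python
import math
import random

BASES = "ACGT"
CONTEXT = 8   # assumed: 8 bases x 4 one-hot values = 32 inputs
HIDDEN = 16   # stand-in for N, which the slides leave unspecified


def one_hot(context):
    """Encode a base string as a flat 0/1 vector, four values per base."""
    return [1.0 if base == b else 0.0 for base in context for b in BASES]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    peak = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def dense(weights, biases, inputs):
    """Weighted sum plus bias for every neuron in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_out, n_in, rng):
    """Random Gaussian weights scaled by 1/sqrt(n_in), zero biases."""
    weights = [[rng.gauss(0, 1) / math.sqrt(n_in) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = init_layer(HIDDEN, 4 * CONTEXT, rng)
w2, b2 = init_layer(len(BASES), HIDDEN, rng)


def predict(context):
    """Return P(next base) for A, C, G, T given the preceding context."""
    hidden = [sigmoid(z) for z in dense(w1, b1, one_hot(context))]
    return dict(zip(BASES, softmax(dense(w2, b2, hidden))))


probs = predict("ACGTTGCA")
```

Softmax guarantees that the four outputs are positive and sum to 1, so they can be read directly as probabilities of the next base.
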
- The implementation is based on the example code from the book 'Neural Networks and Deep Learning' by Michael Nielsen
- Written in Python 3
- We use the reference sequence set used in COMET to train the ANN
- Cross-validation is done by randomly removing one sequence from the training set
- A cross-entropy cost function is used to update the network weights and biases
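
As an illustration of the cross-entropy update, the sketch below applies plain gradient descent to a softmax output layer for a single training example. It relies on the standard result that, for softmax with cross-entropy, the gradient with respect to each output pre-activation is the predicted probability minus the target. The learning rate, layer size, and hidden activations are made-up values, and the hidden-layer update is omitted.

```python
import math

BASES = "ACGT"


def softmax(zs):
    peak = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cost for one example: minus the log-probability of the true base."""
    return -math.log(probs[BASES.index(target)])


def sgd_step(weights, biases, hidden, target, eta=0.5):
    """One gradient-descent update of the output layer for a single example.

    Returns the cost measured before the update.
    """
    zs = [sum(w * h for w, h in zip(row, hidden)) + b
          for row, b in zip(weights, biases)]
    probs = softmax(zs)
    for j, base in enumerate(BASES):
        # For softmax with cross-entropy, dC/dz_j = p_j - y_j.
        delta = probs[j] - (1.0 if base == target else 0.0)
        biases[j] -= eta * delta
        weights[j] = [w - eta * delta * h for w, h in zip(weights[j], hidden)]
    return cross_entropy(probs, target)


# Toy output layer: 4 neurons reading 3 hidden activations (made-up values).
weights = [[0.0] * 3 for _ in BASES]
biases = [0.0] * 4
hidden = [0.2, 0.9, 0.5]

costs = [sgd_step(weights, biases, hidden, "G") for _ in range(20)]
```

With all-zero starting weights the first cost equals log 4 (a uniform guess over four bases), and it falls as the layer learns to favour the target base.
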
- For each nucleotide position in the query sequence:
    - the ANN for each subtype generates the probability of seeing the nucleotide given the preceding context
- The decision tree used in COMET is then used to predict the subtype of the query sequence
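
The scoring loop can be sketched as below, assuming each subtype model is exposed as a function that maps a context to a probability for each base. The two predictors, the subtype names, and the context length of 3 are hypothetical placeholders. COMET's decision tree is not described here, so picking the highest total log-likelihood is only a stand-in for the final call.

```python
import math

BASES = "ACGT"
CONTEXT = 3  # short context to keep the toy example readable


def score_query(query, models, context=CONTEXT):
    """Build {subtype: [log P(base | context) per position]} for a query."""
    matrix = {name: [] for name in models}
    for i in range(context, len(query)):
        preceding, base = query[i - context:i], query[i]
        for name, predict in models.items():
            matrix[name].append(math.log(predict(preceding)[base]))
    return matrix


def uniform_model(context):
    """Hypothetical predictor with no preference among bases."""
    return {b: 0.25 for b in BASES}


def gc_rich_model(context):
    """Hypothetical predictor that favours G and C."""
    return {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}


models = {"subtype_X": uniform_model, "subtype_Y": gc_rich_model}
matrix = score_query("ACGGCGCCGG", models)

# Stand-in for the decision tree: pick the highest total log-likelihood.
best = max(matrix, key=lambda name: sum(matrix[name]))
```
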
- Reference
- Vanderbilt
- PR-RT
- Optimise neural network parameters
- Larger context size?
- Recurrent Neural Networks (RNN)?
- Report breakpoints for potential novel recombinants
- Implement using TensorFlow