---
title : MEPI group meeting, 2017
subtitle :
author : Mukarram Hossain
job : University of Cambridge
framework : revealjs # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js # {highlight.js, prettify, highlight}
hitheme : zenburn #
widgets : [] # {mathjax, quiz, bootstrap}
mode : selfcontained # {standalone, draft}
revealjs:
  theme: serif
  transition: cube
  center: "false"
knit : slidify::knit2slides
---
## Alignment-free subtyping of HIV sequences
<br></br>
**Mukarram Hossain**
<br></br>
Department of Veterinary Medicine
University of Cambridge
<br></br>
MEPI group meeting, March 2017
<br></br>
<img src="assets/img/uc-colour-reveresed.png" width="30%" style="border: 0px">
---
### Subtype classification
- Viruses are often grouped into subtypes.
- Subtype has implications for several kinds of virus studies:
    - clinical
    - epidemiological
    - structural
    - functional
- Existing classification techniques mostly rely on alignments, followed by phylogenetic and/or statistical algorithms.
---
### Alignment uncertainty
<img src="assets/img/ha.sp.png" width="65%" style="border: 0px">
---
### Alignment-free classification
- Lossless compression techniques have shown promising results for biological sequence classification:
    - Protein family prediction (Begleiter *et al.*, 2004)
    - Protein structure prediction (Ferragina *et al.*, 2007)
---
<br></br>
<img src="assets/img/comet_title.png" width="100%" style="border: 0px">
---
### COMET
- COMET is an ultrafast, alignment-free subtyping tool
- Uses **P**rediction by **P**artial **M**atching (PPM)
- Initially designed for HIV-1
- Tested on both synthetic (1,090,698) and clinical (10,625) HIV datasets
- Sensitivity and specificity were comparable to or higher than those of:
    - REGA (de Oliveira *et al.*, 2005)
    - SCUEAL (Pond *et al.*, 2009)
- Detected and identified new recombinant forms
---
### COMET algorithm
- Builds a variable-order Markov model for each reference sequence
- Given a query, COMET calculates the log-likelihood of observing the base at each position under each model
- This results in a matrix of likelihood values
- The subtype call is made using a decision tree
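
A minimal sketch of the likelihood matrix, assuming a simple longest-seen-context back-off with add-one smoothing in place of PPM's escape mechanism. The function names, `max_order=3` and the toy references are hypothetical and are not taken from COMET.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_context_model(sequence, max_order=3):
    """Count which base follows every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, base in enumerate(sequence):
        for order in range(min(i, max_order) + 1):
            counts[sequence[i - order:i]][base] += 1
    return counts

def base_probability(counts, context, base):
    """Back off from the longest context seen in training to shorter ones."""
    for start in range(len(context) + 1):
        seen = counts.get(context[start:])
        if seen:
            total = sum(seen.values())
            # Add-one smoothing keeps every probability above zero.
            return (seen.get(base, 0) + 1) / (total + len(ALPHABET))
    return 1 / len(ALPHABET)

def log_likelihoods(counts, query, max_order=3):
    """Log-probability of each query base given its preceding context."""
    return [
        math.log(base_probability(counts, query[max(0, i - max_order):i], base))
        for i, base in enumerate(query)
    ]

# Toy references (made up for illustration, not real HIV sequences).
references = {"A": "ACGTACGTACGTACGT", "B": "AAGGAAGGAAGGAAGG"}
models = {name: train_context_model(seq) for name, seq in references.items()}
query = "ACGTACGT"
# One row per reference model, one column per query position.
matrix = {name: log_likelihoods(m, query) for name, m in models.items()}
```

Each row of `matrix` holds the per-position values for one reference; the decision tree would then operate on these rows.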
---
### The decision tree
<img src="assets/img/decision_tree.png" width="550px" height="600px" style="border: 0px">
---
### Classification using Neural Networks
- *Neural networks* are computational systems loosely modelled on biological brains
- They consist of neural units organised in layers
<br></br>
<img src="assets/img/ann.jpg" width="40%" style="border: 0px">
---
### ANN: design
- The input layer consists of 32 neurons
    - gets values from the fixed context
- The hidden layer consists of **N** neurons
    - processes inputs coming from the input layer using weights and biases
- The output layer consists of 4 neurons
    - uses the *softmax function* to generate probabilities for the nucleotide bases A, C, G, T
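
A sketch of one forward pass through this layout. It assumes the 32 inputs are a one-hot encoding of an 8-base context (8 x 4) and that the hidden layer uses sigmoid units; neither is stated on the slide, and `HIDDEN_SIZE = 16` is an arbitrary stand-in for **N**.

```python
import math
import random

BASES = "ACGT"
CONTEXT_SIZE = 8    # assumption: 8 bases x 4 one-hot values = 32 inputs
HIDDEN_SIZE = 16    # hypothetical value for N

def one_hot_context(context):
    """Encode each base as four 0/1 values and concatenate them."""
    return [1.0 if base == b else 0.0 for base in context for b in BASES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn raw output activations into probabilities that sum to 1."""
    peak = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.gauss(0, 1) / math.sqrt(n_in) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine(weights, biases, inputs):
    """Weighted sum of the inputs plus bias, for every neuron in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(network, context):
    """Input -> sigmoid hidden layer -> softmax output over A, C, G, T."""
    (w1, b1), (w2, b2) = network
    hidden = [sigmoid(z) for z in affine(w1, b1, one_hot_context(context))]
    return softmax(affine(w2, b2, hidden))

rng = random.Random(0)
network = [init_layer(4 * CONTEXT_SIZE, HIDDEN_SIZE, rng),
           init_layer(HIDDEN_SIZE, len(BASES), rng)]
probabilities = forward(network, "ACGTACGT")
print(dict(zip(BASES, probabilities)))
```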
---
### ANN: implementation
- Based on the example code from the book 'Neural Networks and Deep Learning' by Michael Nielsen
- Written in Python 3
---
### ANN: training
- We use the reference sequence set used in **COMET** to train the ANN
- Cross-validation is done by randomly holding out one sequence from the training set
- A cross-entropy cost function is used to update the network weights and biases
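
To illustrate how a cross-entropy cost drives the updates, here is a sketch of gradient descent on a softmax output layer alone. The hidden activations, learning rate and number of steps are made-up values, and the full training code also back-propagates into the hidden layer, which is omitted here.

```python
import math

def softmax(zs):
    """Turn raw output activations into probabilities that sum to 1."""
    peak = max(zs)
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target_index):
    """Cost for one example: minus the log-probability of the true base."""
    return -math.log(probabilities[target_index])

def output_step(weights, biases, hidden, target_index, learning_rate=0.5):
    """One gradient-descent update of a softmax output layer (in place)."""
    zs = [sum(w * h for w, h in zip(row, hidden)) + b
          for row, b in zip(weights, biases)]
    probabilities = softmax(zs)
    cost = cross_entropy(probabilities, target_index)
    for k, p in enumerate(probabilities):
        # With softmax and cross-entropy, dCost/dz_k = p_k - y_k.
        delta = p - (1.0 if k == target_index else 0.0)
        biases[k] -= learning_rate * delta
        weights[k] = [w - learning_rate * delta * h
                      for w, h in zip(weights[k], hidden)]
    return cost

hidden = [0.2, 0.9, 0.5]                 # made-up hidden-layer activations
weights = [[0.0] * 3 for _ in range(4)]  # 4 output neurons: A, C, G, T
biases = [0.0] * 4
# Repeatedly train on one example whose true next base is G (index 2).
costs = [output_step(weights, biases, hidden, target_index=2) for _ in range(20)]
```

The cost recorded at each step is computed before that step's update, so `costs` should fall as the layer learns to predict the target base.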
---
### ANN: subtyping
- For each nucleotide position in the query sequence:
    - the ANN for each subtype generates the probability of seeing the nucleotide given the preceding context
- The decision tree used in COMET is then used to predict the subtype of the query sequence
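
A sketch of how the per-subtype probabilities could be collected into a matrix. The `predict` callables are hypothetical stand-ins for trained ANNs, positions without a full 8-base context are skipped as a simplifying assumption, and `best_total` is only a placeholder for the COMET decision tree.

```python
import math

def likelihood_matrix(predictors, query, context_size=8):
    """Per-subtype list of log-probabilities, one per scored query position."""
    matrix = {}
    for subtype, predict in predictors.items():
        row = []
        for i in range(context_size, len(query)):
            probabilities = predict(query[i - context_size:i])
            row.append(math.log(probabilities[query[i]]))
        matrix[subtype] = row
    return matrix

def best_total(matrix):
    """Placeholder for the decision tree: highest total log-likelihood wins."""
    return max(matrix, key=lambda subtype: sum(matrix[subtype]))

def favours(base):
    """Hypothetical predictor that always gives `base` probability 0.7."""
    def predict(context):
        return {b: 0.7 if b == base else 0.1 for b in "ACGT"}
    return predict

predictors = {"A": favours("A"), "B": favours("G")}
matrix = likelihood_matrix(predictors, "ACGTACGTAAAAAAAA")
call = best_total(matrix)
```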
---
### Cross-validation
<img src="assets/img/accuracy_8_500_train.png" width="70%" style="border: 0px">
---
### Test datasets
- Reference
- Vanderbilt
- PR-RT
---
### Accuracy comparison
<img src="assets/img/compare.png" width="60%" style="border: 0px">
---
### Future direction
- Optimise neural network parameters
    - Larger context size?
    - Recurrent neural networks (RNN)?
- Report breakpoints for potential novel recombinants
- Implement using TensorFlow
---
<br></br>
<img src="assets/img/questions.jpg" width="600px" class="centred" style="margin: 10px 10px" />