-
Notifications
You must be signed in to change notification settings - Fork 3
/
extract_rdrp.Rmd
55 lines (41 loc) · 1.68 KB
/
extract_rdrp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
get accession number from Figure 3G of <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3647518/>, stored in [betaCov_acc.txt](data/betaCov).
download [genbank files](data/betaCov/gb) and extract the sequences, [betaCov.fasta](data/betaCov/betaCov.fasta).
extract a [partial RdRp sequence](data/betaCov/query.fas) and do BLAST to extract partial RdRp sequences.
```
makeblastdb -in betaCov.fasta -dbtype nucl -out betaCov
blastn -query query.fas -db betaCov -outfmt 11 -out "rdrq.blastn.asn"
blast_formatter -archive "rdrp.blastn.asn" -outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids salltitles" > "rdrp.blastn.tab"
```
```{r}
xx = readLines("rdrp.blastn.tab")
strsplit(gsub(" ", "", sub("# Fields: ", "", xx[4])), split=",")[[1]] -> hh
x = read.delim2("rdrp.blastn.tab", comment="#", header=F)
colnames(x) = hh
start = x$s.start
end = x$s.end
ss = ss[x$subjecttitles]
subseq(ss, start, end) -> rdrp
## clean up sequence name and then exported to file
writeXStringSet(rdrp, filepath="rdrp.fas")
```
```
muscle -in rdrp.fas -out rdrp_aln.fas
iqtree -s rdrp_aln.fas
```
## visualization
```{r}
library(ggtree)
x = read.tree("rdrp_aln.fas.treefile")
d = data.frame(label = x$tip.label)
d$acc = sub(".*_(\\w+)$", "\\1", d$label)
ggtree(x, ladderize=F) %<+% d +
geom_tiplab(aes(color=acc %in% c("JX869059", "KC164505"))) +
xlim(0, 4) +
theme(legend.position="none") +
scale_color_manual(values=c("black", "red"))
ggtree(x, ladderize=F, branch.length='none') %<+% d +
geom_tiplab(aes(color=acc %in% c("JX869059", "KC164505"))) +
theme(legend.position="none") +
scale_color_manual(values=c("black", "red")) +
xlim(0, 18)
```