Propagate sequencing uncertainty in phylogenetic analysis.
Sequencing is a multi-step process which is prone to errors. If the output for the sequencing of a biological sample gives ATTGCTATGC
, what is the error probability associated with this result, at each position (for example, what is the probability that the second base is indeed a T
)? How can we propagate this uncertainty in downstream phylogenic analysis?
- Raw short read files (e.g. SAM)
- Sequences are read little bits at a time. Each read is recorded and stored in a file (along with its alignment to a reference sequence).
- The sequence reads include information on the read quality. This is encoded as a Phred score, which uses unicode characters to encode uncertainty about a given base call.
- The called base is assigned a probability according to the Phred score. The remaining probability is assigned equally to the remaining possible bases.
- For example, if P(T) = 0.7 from the Phred score, then P(A) = P(C) = P(G) = 0.1.
- Given many short reads, the probabilities for each base pair at each site are added together.
- If, say, T is always 100% certain, then the the sum of the probabilities will represent the number of reads.
- Consensus FASTQ Files
- SAM files aren't always practical (or available), so often FASTQ files are used instead.
- These contain a Phred score for each base call. This phred score is calculated similarly to the process for SAM files, but only the uncertainty for the most probable base is reported.
- Consensus FASTA files
- Contain no information about sequence uncertainty.
- However, uncertainty can be generated around FASTA to investigate how sensitive the methods are to the assumption of certainty in the conseq.
Dependencies:
- sratoolkit
- pangolin
- R packages:
install.packages(c("here", "dplyr", "ggplot2", "ape", "gtools", "tidyr", "rmarkdown", "patchwork", "scales", "stringr", "xtable"))
# Make required binaries (custom C program by Gopi Gugan)
cd src/parse-sam-c
make
cd ../..
# Download and parse SAM files into uncertainty matrices
Rscript src/downloader.R # Downloads the SAM files
bash src/run_all.sh # runs parse-sam-c on all SAM files
# Sample from the uncertainty matrix, produce reports
bash scripts/sample-S.sh