Benchmarking statistics for divergence simulation tests #27

Closed
victorlin opened this issue Apr 7, 2020 · 28 comments · May be fixed by #56
Assignees: victorlin
Labels: BASH Scripting, Bioinformatics, Python

Comments

@victorlin (Collaborator)

Given cov0r.fa and an incoming stream of aligned reads (BAM format), count the total number of reads that match each sequence. Counting is to be done on the fly until EOF.

cov0r.fa consists of sequences along with non-complement reverse entries. The counts for each sequence+reverse pair can be used to calculate TP/FP/TN/FN.

@victorlin added the Bioinformatics label Apr 7, 2020
@victorlin self-assigned this Apr 7, 2020
@ababaian (Owner) commented Apr 7, 2020

In the current workflow, we have unaligned RNA-seq reads in FASTQ format which are being aligned to a reference genome cov0r.fa. The output is a SAM file on STDOUT, which is piped into samtools view -b - to compress it to a BAM file.

This counter should read all header entries in cov0r.fa. There are two types of sequences here: >ACCESSION_1 ... >ACCESSION_33000 and the respective >REVERSE_ACCESSION.N entries, which are the reverse non-complement sequences.

This tool sits after alignment and before compression: it should read the SAM on STDIN, increment a counter for the >ACCESSION to which each read maps, then pass the read through in SAM to STDOUT for compression. At EOF it should output its own TSV file with the header name of each sequence in cov0r.fa and the count of how many reads mapped to that sequence.
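
A minimal sketch of that pass-through counter in pure Python (the output name counts.tsv is a placeholder, not a decided detail):

import sys

counts = {}
for line in sys.stdin:
    sys.stdout.write(line)              # pass the SAM through for compression
    if line.startswith("@"):            # skip header lines
        continue
    rname = line.split("\t", 4)[2]      # column 3 = RNAME
    if rname != "*":                    # '*' means unmapped
        counts[rname] = counts.get(rname, 0) + 1

with open("counts.tsv", "w") as out:    # placeholder output path
    for rname, n in counts.items():
        out.write(f"{rname}\t{n}\n")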

@victorlin (Collaborator, Author) commented Apr 12, 2020

I've been trying various methods to tackle this. Since speed is a concern and cov0r.fa should be constant, much of the precomputation can be done on those sequences ahead of time to reduce work during runtime. There are some optimizations that can be done given the nature of the data:

  1. cov0r.fa
    • generate suffix trees for faster subsequence searching (python implementation)
    • compute for forward sequences only (we can reverse the incoming SAM reads to check for reverse matches)
    • compute for unique sequences only (out of 33,296 forward sequences, only 24,865 are unique)
  2. SAM reads
    • generate cache for previously seen reads, since this is a stream of data. For example, take SRR11454613.bt2.sam - out of 499,524 reads, 207,120 are exact duplicates.
    • the above could be extended to subsequence matches within reads (e.g. if ATATC has been checked, ATAT will match exactly the same sequences)

That said, I don't seem to have enough computing power on my personal MacBook to handle this, mainly the SuffixTrees. I've tried using pickle to save batches of SuffixTrees to disk, but it fails due to recursion limits (Python error: Cannot recover from stack overflow 😕). @ababaian is there some computing resource that I can test this on?

@ababaian (Owner) commented Apr 12, 2020

So I think you may have a few critically important points here.

  1. cov0r.fa.

I may be misunderstanding, but you shouldn't have to do any sub-sequence searching; the aligner already does that, and the output SAM stream contains the alignment data in the form of a CIGAR string. The only thing that needs to be 'counted' is the number of aligned reads and which reference name, or RNAME (column 3 in the SAM specification), they correspond to. The counter would be a list like this:

Accession_1 0
Accession_2 73
Accession_3 800034
...
Accession_33296 23
REVERSE_Accession_1 3
REVERSE_Accession_2 0
REVERSE_Accession_3 12
...
REVERSE_Accession_33296 0

If you're talking about creating an index for the reference sequence NAMES (Accession ID), then I agree this is the correct approach and will be the fastest way to identify exact matches. In this case there are only 33,000 sequence names, which I don't think should be difficult to build an index for.

> compute for unique sequences only (out of 33,296 forward sequences, only 24,865 are unique)

How are you defining 'unique sequences'? Are there duplicated identical sequences in cov0r? This is very important, as we can update the pan-genome processing step to reduce the search space for the aligner and make things run faster. Great catch!

  2. SAM reads.

This is a good point and something we can investigate as a follow-up. If a read is a true (PCR) duplicate and has mapped to the pan-genome, we can heuristically check for and remove PCR duplicates in the final aligned data using samtools rmdup. The prerequisite is that the data would already have to be sorted and indexed. The question of whether we should count PCR duplicates twice for a viral genome/transcript is important and I haven't considered it yet. I'll have to think about it.

For computing resources, if needed I can give you permissions to spool up a bigger CPU on AWS; just message me in Slack. As a rule of thumb, every instance we're running in this pipeline is a c4.large at the moment, so 2 vCPUs with 8 GB of memory. If you can't run it on your MacBook, then it is likely too resource-heavy to run in the pipeline.

As one additional note, we've bumped up to cov1r now which you can download at s3://serratus-public/seq/cov1r/cov1r.fa

@victorlin (Collaborator, Author)

Wow, seems like I misunderstood the task. So all the SAM reads reference sequences from cov0r.fa? If that's the case, then the problem is much easier than I thought - there's no need to even look at the sequences. Using a Python Counter on each read's reference_name (from pysam) should do the trick. It should be fairly efficient as is - we can run without any optimization tricks and apply some if needed. Where should I put the Python script?
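
A minimal sketch of that idea, assuming pysam is installed ("-" reads SAM from STDIN):

import pysam
from collections import Counter

counts = Counter()
with pysam.AlignmentFile("-", "r") as sam:
    for read in sam:
        if not read.is_unmapped:
            counts[read.reference_name] += 1

for name, n in counts.most_common():
    print(f"{name}\t{n}")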

Yes, there are quite a few duplicate sequences in cov0r.fa (also in cov1r.fa - 8,431 duplicates out of 33,293 forward sequences). Let me know if you'd like me to do some further analysis.

@ababaian (Owner)

NP, good to clarify things :)

I'd consider a pysam-independent method too: just Counter on column 3 of each row of STDIN. Performance is really key here; we want to do this without slowing the pipe. With respect to testing performance, can you objectively measure how this performs by timing the SAM-to-BAM conversion with and without the Python counter? Something like triplicate measurements of:

Using a test input prepared with:

samtools view -h SRR11454613.bt2.sam > test.input.sam

(time cat test.input.sam | samtools view -b - > test.output.bam)

vs.

(time cat test.input.sam | python_counter - | samtools view -b - > test.output.bam)


How are you measuring duplicates in cov1r.fa?

For the Python script, this falls under samtools_merge, so serratus/scripts/serratus_merge would be where we want it, eventually as part of samtools_merge.sh.

@victorlin (Collaborator, Author)

re: duplicates, I'm looking at the full sequence entries in the FASTA. Here's some sample code:

from Bio import SeqIO

seqs_cov1r_fwd = [str(seq_record.seq)
                  for seq_record in SeqIO.parse("cov1r.fa", "fasta")
                  if 'REVERSE' not in seq_record.id]

print(f'{len(seqs_cov1r_fwd)} forward sequences')
print(f'{len(set(seqs_cov1r_fwd))} unique sequences')
print(f'{len(seqs_cov1r_fwd) - len(set(seqs_cov1r_fwd))} duplicate sequences')

Output:

33293 forward sequences
24862 unique sequences
8431 duplicate sequences

@victorlin (Collaborator, Author)

Is it necessary to maintain the order of accession IDs from cov0r.fa?

Accession_1 0
Accession_2 73
Accession_3 800034
...
Accession_33296 23
REVERSE_Accession_1 3
REVERSE_Accession_2 0
REVERSE_Accession_3 12
...
REVERSE_Accession_33296 0

If it's acceptable to have the results unsorted and to exclude 0-counts, like this:

Accession_3 800034
Accession_2 73
REVERSE_Accession_3 12
...
Accession_33296 23
REVERSE_Accession_1 3

...then it can be achieved with an awk one-liner on the SAM STDIN, without even referencing cov0r.fa:

awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}'

The counts can be written to a file counts.txt (or CSV/TSV formatted) using tee:

cat SRR11454613.sam \
  | tee >(awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}' > SRR11454613_counts.txt) \
  | samtools view -b - > SRR11454613.bam

@ababaian (Owner)

That looks good. It will for sure be faster if we exclude zero counts (most will be zero), so yes. How is the performance?

@victorlin (Collaborator, Author)

Quickly testing 3 runs of SRR11454613.bt2.sam | samtools view vs. .sam | counter | samtools view, it seems like ~9.5s vs. ~12.8s on my local machine.

Here is a full summary:

$ (time cat SRR11454613.bt2.sam | samtools view -b - > test.output.bam)
    
    #1
    real    0m9.461s
    user    0m9.328s
    sys     0m0.238s

    #2
    real    0m9.412s
    user    0m9.301s
    sys     0m0.236s

    #3
    real    0m9.786s
    user    0m9.588s
    sys     0m0.304s

$ (time cat SRR11454613.bt2.sam | tee >(awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}' > SRR11454613_counts.txt) | samtools view -b - > test.output.bam)

    #1
    real    0m12.432s
    user    0m9.474s
    sys     0m0.420s

    #2
    real    0m12.914s
    user    0m9.866s
    sys     0m0.535s

    #3
    real    0m12.886s
    user    0m9.851s
    sys     0m0.532s

@victorlin added the BASH Scripting label Apr 13, 2020
@ababaian (Owner) commented Apr 13, 2020

I think that should be OK performance. Now we just need to calculate TP/FP/FN/TN values using an extended counts.txt.

Take a look at this simulated data-set (CoV simulations), which you can download from s3://serratus-public/notebook/200411/. You can make a similar simulated data-set for testing ROC, and make it a bit deeper too.

Brainstorming, I can think of two ways to do this; maybe you can think of a better experiment.

  1. Take the wuCoV reference sequence (NC_045512v2). Create a 5% divergence 'positive control' sequence and a 100% divergence 'negative control' sequence. Map these reads against the original wuCoV sequence or the reverse non-complement of the sequence, and calculate the ROC stats based on whether the + sequences match the reference sequence or the REVERSE, and whether the - sequences map to either (false positive). There is no measurement of True Negatives with this, though.

  2. Same as above, but the negative simulated sequences are 5% divergent from the REVERSE non-complement sequences.

edit: performance is much less of an issue here, as this is only for benchmarking.

@victorlin (Collaborator, Author)

I feel like I'm missing an important concept here. How do the counts from earlier play into the TP/FP/FN/TN calculation? It seems that all the measurement is done on simulated positive/negative sequences.

@ababaian (Owner)

When we're aligning sequences from a given accession, we have no idea if the library is CoV+ or CoV-. The count will be used to report how much alignment there is in the library to CoV forward or REVERSE sequences. We will classify libraries with a high forward count and a high forward:reverse ratio as 'potentially CoV+ libraries' and re-analyze those in more detail using transcriptome assembly.
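
For illustration, that screen could be as simple as the following sketch (the thresholds and the counts.txt name are placeholders, not agreed values):

fwd = rev = 0
with open("counts.txt") as counts:          # output of the awk counter above
    for line in counts:
        name, n = line.split()
        if name.startswith("REVERSE_"):
            rev += int(n)
        else:
            fwd += int(n)

# placeholder thresholds for flagging a library as potentially CoV+
if fwd >= 100 and fwd / max(rev, 1) >= 10:
    print("potentially CoV+ library")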

The TP/FP/FN/TN rates with simulated reads are purely to establish benchmarks, so we know how well the experimental system works in theory.

@ababaian added this to Task In Progress in TODO List Apr 14, 2020
@victorlin (Collaborator, Author) commented Apr 17, 2020

I've tried some preliminary simulations based on the notebook referenced above. To me, it seems that all TP/FP/FN/TN values can be derived from experiment 1, outlined below.

Definitions:

  • cov : genome
  • rev_cov : reverse non-complement of cov
  • sim_pos : simulated reads from 5% divergence from cov
  • sim_neg : simulated reads from 100% divergence from cov

Statistics:

  • TP : number of reads in sim_pos that align to cov only
  • FN : number of reads in sim_pos that align to rev_cov or remain unmapped
  • FP : number of reads in sim_neg that align to cov only
  • TN : number of reads in sim_neg that align to rev_cov or remain unmapped

Simulation

Here are bowtie2 counts based on cov=NC_045512v2:

                 aligned concordantly:
                 0 times   exactly 1 time   >1 times
sim_pos~cov      175       7300             0
sim_pos~rev_cov  7475      0                0
sim_neg~cov      7475      0                0
sim_neg~rev_cov  7475      0                0

Results would be:

  • TP = 7300
  • FN = 175
  • FP = 0
  • TN = 7475

The calculation for these results was trivial. However, if the alignment of sim_pos~rev_cov had successfully mapped some reads, we would have to know the outcome of both cov and rev_cov alignments on a per-read basis.
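
A sketch of that per-read bookkeeping, joining the two bowtie2 SAM outputs by query name (the file names here are hypothetical placeholders):

def read_outcomes(sam_path):
    # Map each query name to True if any of its alignments is mapped.
    outcomes = {}
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):
                continue
            qname, flag = line.split("\t", 2)[:2]
            mapped = not (int(flag) & 0x4)       # 0x4 = segment unmapped
            outcomes[qname] = outcomes.get(qname, False) or mapped
    return outcomes

cov = read_outcomes("sim_pos_vs_cov.sam")        # hypothetical file name
rev = read_outcomes("sim_pos_vs_rev_cov.sam")    # hypothetical file name
TP = sum(1 for q in cov if cov[q] and not rev.get(q, False))   # cov only
FN = len(cov) - TP    # mapped to rev_cov, to both, or unmapped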

@ababaian (Owner)

This is very clear-headed. Good job. A few notes:

> Statistics:
> TP : number of reads in sim_pos that align to either cov or rev_cov

The reads that align to rev_cov would be FN; rev_cov is a bait or control sequence to measure how many of the reads that would align to cov1 are simply background.

> TABLE
> • FP = 175
> • FN = 0

I think these are switched; there are no FP, only 175 FN.

> The calculation for these results was trivial...

The point is not just to do this for this one data-set, but to use it to benchmark/test a wide variety of bowtie2 (or other aligner) settings across a wide range of % divergence. That's why I was saying that if you can make this compile the data neatly, we'll be running it something like 200x across many conditions to 'explore' the alignment algorithm space.

@victorlin (Collaborator, Author)

Thanks for the clarification regarding rev_cov, and good catch - I've updated my comment accordingly. The main difference to note is that TP/FP should check for alignment to cov only (alignment to both should be categorized as FN/TN).

I'll have to look into a way to track the outcome of both cov and rev_cov alignments on a per-read basis, since we can't get these numbers by just looking at the bowtie2 output, but in the end we should be able to have a table like this:

           TP     FN    FP   TN
GENOME1    7300   175   0    7475
GENOME2    ...    ...   ...  ...
GENOME3    ...    ...   ...  ...
...        ...    ...   ...  ...

I'm guessing cov0 can be used as a starter for the list of genomes?

@ababaian (Owner)

Negative, use cov1r. You can parse out all the reverse sequences, as their RNAMEs will all begin with REVERSE_ :) exactly for this use case.

@victorlin (Collaborator, Author)

ohh perfect. Think I have what's needed to get started then!

@victorlin (Collaborator, Author)

I have a preliminary Python script for this. Here's some sample output:

GENOME       TP    FN   FP   TN
X81024.1     172   3    0    175
KX092228.1   99    1    0    100
KX092227.1   98    2    0    100
KX092226.1   99    1    0    100
KX092225.1   100   0    0    100
KX092224.1   8     0    0    8
KX092223.1   75    0    0    75

I'm thinking it might be better to normalize the counts as proportions relative to the number of reads, like this:

GENOME       TP        FN      FP      TN
X81024.1     172/175   3/175   0/175   175/175
KX092228.1   99/100    1/100   0/100   100/100
KX092227.1   98/100    2/100   0/100   100/100
KX092226.1   99/100    1/100   0/100   100/100
KX092225.1   100/100   0/100   0/100   100/100
KX092224.1   8/8       0/8     0/8     8/8
KX092223.1   75/75     0/75    0/75    75/75

@rcedgar (Collaborator) commented Apr 20, 2020

"ROC" implies plotting as a function of a confidence score, which in this case would be the MAPQ value in the SAM/BAM file. For example, if the mapper reports MAPQ=0 (i.e., P_error=1), do we consider the hit or not? This is tricky because MAPQ doesn't mean what we want it to mean. We want "error probability that the read is derived from this genome", but MAPQ actually means "assuming the read does come from this genome, probability the location is wrong".

@ababaian (Owner)

This is a little off-topic: for actual data, we likely won't use MAPQ as a confidence function, but at the end of the day we do need some function to classify a library as potentially CoV+ or CoV-. See #41 for details of the false positives encountered so far. What they have in common is that even when they had a very high number/percentage of mapped reads, the mapping was localized to a very small seed region with clipping on either side. We can 'mask' these regions out of the pan-genome, but I feel like it might be whack-a-mole with false positives this way. Instead, we should consider: if an RNAME has above a threshold number of mapped reads in a library (say 100), we calculate how many bases of the reference sequence are 'covered'. If it's something like 50 nt / 5000 nt, then it's likely a false positive or mismapping; if it's broader, like 400 nt / 5000 nt, then maybe we have something. This has obvious limitations, so we'll need to think it through more carefully.
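
A rough sketch of that coverage-breadth heuristic (plain-SAM parsing on STDIN; it ignores CIGAR and approximates each read's footprint as POS..POS+len(SEQ), and the threshold is the placeholder value suggested above):

import sys
from collections import defaultdict

MIN_READS = 100                 # placeholder threshold
covered = defaultdict(set)      # RNAME -> covered reference positions
reads = defaultdict(int)        # RNAME -> mapped read count

for line in sys.stdin:
    if line.startswith("@"):
        continue
    f = line.rstrip("\n").split("\t")
    rname, pos, seq = f[2], int(f[3]), f[9]
    if rname == "*":
        continue
    reads[rname] += 1
    covered[rname].update(range(pos, pos + len(seq)))

for rname, n in reads.items():
    if n >= MIN_READS:
        # a narrow footprint (e.g. 50 nt) suggests a localized false positive
        print(f"{rname}\t{n} reads\t{len(covered[rname])} nt covered")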

On topic: this looks great, Victor. If we have a quick py-script to calculate and report these values, then we can apply it to larger optimizations, i.e. get it into the hands of @charlescongxu for his bowtie2 optimization.

victorlin added a commit that referenced this issue Apr 21, 2020
@victorlin (Collaborator, Author)

@ababaian @charlescongxu I've added the script + docs in a new directory named benchmarking - feel free to move it somewhere more fitting. Let me know if you have any trouble running it in your environment.

@ababaian (Owner) commented Apr 21, 2020

Hey @victorlin, this looks like a good backbone for benchmarking. Can you break apart some of the functionality in this? We'll want it to be a bit more flexible than what you have:

  1. -x Set the reference genome to align against (make this independent of input sequence)
  2. -i Specify the input "positive" sequence (TP)
  3. -n Specify the input "negative" sequence (FP). Default if unset would be to use the same as -i
  4. Print to STDOUT instead of a file.
  5. -bt2 optional parameter: pass this string as a parameter to bowtie2, e.g. -bt2 "--very-sensitive-local" to run bowtie2 --very-sensitive-local instead of the default.

Then in the README, add the definitions of TP/FP/TN/FN you made above, so we can close this issue and not lose important information :)

Edit:

  6. Also a bypass for positive and negative sequence input as fastq files.

@victorlin (Collaborator, Author) commented Apr 22, 2020

Sure, more flexibility is always nice :)

Currently the script works on the entire pan-genome (cov1r.fa). The proposed changes seem to act on each genome entry (right now each genome has different sim_pos, sim_neg). Two approaches here:

  1. Repurpose the script to generate one result (TP/FP/TN/FN) at a time, rather than all. Call this using another script that will define parameters and aggregate results into an output table.
  2. Inputs for sim_pos and sim_neg would be in a FASTA file which has 1 entry per genome in cov1r.fa.

Option 2 would require fewer changes, but either is doable. Ultimately it's up to the use case and how these additional parameters are determined.

Other questions:

  1. Would it be desirable to pass parameters for msbar and art_illumina as well? This would make the run command quite lengthy, but allows for more control over variables that are currently hard-coded.
  2. What's the difference between 2/3 and 6? Are 2/3 about the divergent genomes (FASTA) and 6 about the simulated reads from those (FASTQ)?

@ababaian (Owner)

  1. Yes, I think having the output be a single set of TP/FP/TN/FN is going to be the move, with all of this wrapped in another script to vary the parameters. One of the key experiments we need is to also compare alignment rates between individual CoV genomes (i.e. how does MERS1 align against SARS-CoV-2).

  2. If an input is a multi-FASTA file, then the simulations should be based on all entries of that input. Consider the experiment of aligning the cov1r pan-genome to the human genome hg38, or the reverse.

  3. Yes, parameters for every command would be great; again, use the defaults we used before but allow for overriding them.

  4. Correct, 2/3 are for FASTA input sequences. 6 would be providing pre-simulated reads, so we don't have to recalculate every time and so we can use real fastq data files (e.g. real human transcriptome vs cov1r, or real SARS-CoV-2 against the human genome).

@victorlin (Collaborator, Author) commented Apr 23, 2020

Thought I'd hash out the high-level CLI first and get your opinion on it.
Note: I've changed the method of simulating negative reads from 100% divergence of cov to 5% divergence of rev_cov. I think it's more consistent with the way we are using rev_cov (which will be the default for --neg_seq, --neg_align_seq) as a background filter, plus it makes the parameters easier to understand.

$ python cov_benchmark.py -h
usage: cov_benchmark.py [-h] [--pos_seq SEQ] [--neg_seq SEQ] [--prop_pos PROP] [--prop_neg PROP] [--pos_reads FQ_PREFIX] [--neg_reads FQ_PREFIX] [--pos_align_seq SEQ]
                        [--neg_align_seq SEQ] [--msbar_params STR] [--art_illumina_params STR] [--bowtie2_params STR]
                        input

Runs divergence tests and generates summary statistics.
Outputs values in format: "TP,FN,FP,TN"

positional arguments:
  input                 Raw input sequence. Used for pos/neg read sets and read alignment, unless other parameters specified.

optional arguments:
  --pos_seq SEQ         Sequence used for simulation of positive read set.
                        Default: input sequence.
  --neg_seq SEQ         Sequence used for simulation of negative read set.
                        Default: reverse non-complement of input sequence.
  --prop_pos PROP       Proportion of pos_seq to mutate for simulation of positive read set.
                        Default: 0.05 (5% divergence).
  --prop_neg PROP       Proportion of neg_seq to mutate for simulation of negative read set.
                        Default: 0.05 (5% divergence).
  --pos_reads FQ_PREFIX
                        Positive read set. Paired-end files are {{FQ_PREFIX}}1.fq and {{FQ_PREFIX}}2.fq.
                        If specified, ignores pos_seq and prop_pos.
                        Default: reads derived from pos_seq with prop_pos applied.
  --neg_reads FQ_PREFIX
                        Negative read set. Paired-end files are {{FQ_PREFIX}}1.fq and {{FQ_PREFIX}}2.fq.
                        If specified, ignores neg_seq and prop_neg.
                        Default: reads derived from neg_seq with prop_neg applied.
  --pos_align_seq SEQ   Reference sequence for read alignment.
                        Reads mapped to this SEQ only will be classified as TP/FP.
                        Default: input sequence.
  --neg_align_seq SEQ   Reference sequence for read alignment.
                        Reads mapped to this SEQ will be classified as TN/FN, along with unmapped reads.
                        Default: reverse non-complement of input sequence.
  --msbar_params STR    Additional parameters to pass to the msbar command.
  --art_illumina_params STR
                        Additional parameters to pass to the art_illumina command.
  --bowtie2_params STR  Additional parameters to pass to the bowtie2 command.

Example usages:

1. All default behavior

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT

What this does: simulate positive/negative read sets using 5% divergence from input and reverse(input). Align to input and reverse(input). Calculate and print TP,FN,FP,TN.

  • input : AAAAAAAAAATTTTTTTTTT
  • pos_seq : input
  • neg_seq : reverse(input)
  • prop_pos : 0.05
  • prop_neg : 0.05
  • pos_reads : simulated using 5% divergence from input
  • neg_reads : simulated using 5% divergence from reverse(input)
  • pos_align_seq : input
  • neg_align_seq : reverse(input)
  • msbar_params : -point 4 -block 0 -codon 0
  • art_illumina_params : --seqSys HS20 --paired --len 100 --mflen 300 --sdev 1 --fcov 50 --rndSeed 666 --noALN
  • bowtie2_params : --very-sensitive-local

2. Configure positive/negative alignment sequences

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT \
          --pos_align_seq CAAAAAAAAATTTTTTTTTG \
          --neg_align_seq GTTTTTTTTTAAAAAAAAAC

3. Configure sequences for simulating positive/negative read sets

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT \
                --pos_seq AAAAAAAAAATTTTTTTTTG \
                --neg_seq GTTTTTTTTTAAAAAAAAAA

4. Configure pre-simulated reads

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT \
              --pos_reads path/to/pos_prefix \
              --neg_reads path/to/neg_prefix

This can result in lengthy commands. All the SEQ inputs can be single-entry FASTA files instead if that's easier to work with.

@ababaian (Owner)

This looks like it's what we need; the only thing that needs to be included is allowing for multi-FASTA files in addition to regular FASTA files. We can then put a wrapper around this bad boy and generate all the statistics we need for all the different 'experiments' we can think of.

@victorlin (Collaborator, Author)

@ababaian so all the parameters taking a SEQ as input should take FASTA files instead (each with n records), and the command will run n experiments with the other parameter values held constant?

@ababaian (Owner)

No, a multi-FASTA file will be treated as a single entity to generate reads from or align against. Each run of the command should yield a single output.

@victorlin changed the title from "ROC statistics on BAM file/stream for control sequence" to "Benchmarking statistics for divergence simulation tests" Apr 24, 2020
@mathemage linked a pull request Apr 26, 2020 that will close this issue
@ababaian closed this as completed Dec 9, 2020
TODO List automation moved this from Task In Progress to Completed Tasks Dec 9, 2020
@ababaian removed this from Completed Tasks in TODO List Dec 9, 2020