Benchmarking statistics for divergence simulation tests #27

Closed
victorlin opened this issue Apr 7, 2020 · 28 comments · May be fixed by #56
Assignees: victorlin
Labels: BASH Scripting, Bioinformatics, Python

Comments

@victorlin (Collaborator)

Given cov0r.fa and an incoming stream of aligned reads (BAM format), count the total number of reads that match each sequence. Counting is to be done on the fly until EOF.

cov0r.fa consists of sequences along with non-complement reverse entries. The counts for each sequence+reverse pair can be used to calculate TP/FP/TN/FN.

@victorlin added the Bioinformatics label Apr 7, 2020
@victorlin self-assigned this Apr 7, 2020
@ababaian (Owner) commented Apr 7, 2020

In the current workflow, we have unaligned RNA-seq reads in FASTQ format which are being aligned to a reference genome cov0r.fa. The output is a SAM file on STDOUT, which is piped into samtools view -b - to compress it to a BAM file.

This counter should read all header entries in cov0r.fa. There are two types of sequences here: >ACCESSION_1 ... >ACCESSION_33000 and the respective >REVERSE_ACCESSION.N entries, which are the reverse non-complement sequences.

This tool sits after alignment and before compression: it should read the SAM on STDIN, increment a counter for the >ACCESSION to which each read maps, then pass the read through in SAM to STDOUT for compression. At EOF it should output its own TSV file with the header name of each sequence in cov0r.fa and the count of how many reads mapped to that sequence.
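
A minimal sketch of that pass-through counter in pure Python (the output name counts.tsv is a placeholder, not a decided detail):

import sys

counts = {}
for line in sys.stdin:
    sys.stdout.write(line)              # pass the SAM through for compression
    if line.startswith("@"):            # skip header lines
        continue
    rname = line.split("\t", 4)[2]      # column 3 = RNAME
    if rname != "*":                    # '*' means unmapped
        counts[rname] = counts.get(rname, 0) + 1

with open("counts.tsv", "w") as out:    # placeholder output path
    for rname, n in counts.items():
        out.write(f"{rname}\t{n}\n")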

@victorlin (Collaborator, Author) commented Apr 12, 2020

I've been trying various methods to tackle this. Since speed is a concern and cov0r.fa should be constant, much of the precomputation can be done on those sequences ahead of time to reduce work during runtime. There are some optimizations that can be done given the nature of the data:

  1. cov0r.fa
    • generate suffix trees for faster subsequence searching (python implementation)
    • compute for forward sequences only (we can reverse the incoming SAM reads to check for reverse matches)
    • compute for unique sequences only (out of 33,296 forward sequences, only 24,865 are unique)
  2. SAM reads
    • generate cache for previously seen reads, since this is a stream of data. For example, take SRR11454613.bt2.sam - out of 499,524 reads, 207,120 are exact duplicates.
    • the above could be extended to subsequence matches within reads (e.g. if ATATC has been checked, ATAT will match exactly the same sequences)

That said, I don't seem to have enough computing power on my personal MacBook to handle this, mainly the SuffixTrees. I've tried using pickle to save batches of SuffixTrees to disk, but it fails due to recursion limits (Python error: Cannot recover from stack overflow 😕). @ababaian is there some computing resource that I can test this on?

@ababaian (Owner) commented Apr 12, 2020

So I think you may have a few critically important points here.

  1. cov0r.fa.

I may be misunderstanding, but you shouldn't have to do any sub-sequence searching; the aligner already does that, and the output SAM stream contains the alignment data in the form of a CIGAR string. The only thing that needs to be 'counted' is the number of aligned reads and which reference name, or RNAME (column 3 in the SAM specification), they correspond to. The counter would be a list like this:

Accession_1 0
Accession_2 73
Accession_3 800034
...
Accession_33296 23
REVERSE_Accession_1 3
REVERSE_Accession_2 0
REVERSE_Accession_3 12
...
REVERSE_Accession_33296 0

If you're talking about creating an index for the reference sequence NAMES (Accession ID), then I agree this is the correct approach and will be the fastest way to identify exact matches. In this case there are only 33,000 sequence names, which I don't think should be difficult to build an index for.

> compute for unique sequences only (out of 33,296 forward sequences, only 24,865 are unique)

How are you defining 'unique sequences'? Are there duplicated identical sequences in cov0r? This is very important, as we can update the pan-genome processing step to reduce the search space for the aligner and make things run faster. Great catch!

  2. SAM reads.

This is a good point and something we can investigate as a follow-up. If a read is a true (PCR) duplicate and has mapped to the pan-genome, we can heuristically check for and remove PCR duplicates in the final aligned data using samtools rmdup. The prerequisite is that the data would already have to be sorted and indexed. The question of whether we should count PCR duplicates twice for a viral genome/transcript is important and I haven't considered it yet. I'll have to think about it.

For computing resources, if needed I can give you permissions to spool up a bigger CPU on AWS; just message me in Slack. As a rule of thumb, every instance we're running in this pipeline is a c4.large at the moment, so 2 vCPUs with 8 GB of memory. If you can't run it on your MacBook, then it is likely too resource-heavy to run in the pipeline.

As one additional note, we've bumped up to cov1r now which you can download at s3://serratus-public/seq/cov1r/cov1r.fa

@victorlin (Collaborator, Author)

Wow, seems like I misunderstood the task. So all the SAM reads reference sequences from cov0r.fa? If that's the case, then the problem is much easier than I thought - there's no need to even look at the sequences. Using a Python Counter on each read's reference_name (from pysam) should do the trick. It should be fairly efficient as is - we can run without any optimization tricks and apply some if needed. Where should I put the Python script?
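
A minimal sketch of that idea, assuming pysam is installed ("-" reads SAM from STDIN):

import pysam
from collections import Counter

counts = Counter()
with pysam.AlignmentFile("-", "r") as sam:
    for read in sam:
        if not read.is_unmapped:
            counts[read.reference_name] += 1

for name, n in counts.most_common():
    print(f"{name}\t{n}")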

Yes, there are quite a few duplicate sequences in cov0r.fa (also in cov1r.fa - 8,431 duplicates out of 33,293 forward sequences). Let me know if you'd like me to do some further analysis.

@ababaian (Owner)

NP, good to clarify things :)

I'd consider a pysam-independent method too: just Counter on column 3 of each row of STDIN. Performance is really key here; we want to do this without slowing the pipe. With respect to testing performance, can you objectively measure how this performs by timing the SAM-to-BAM conversion with and without the Python counter? Something like triplicate measurements of:

Using a test input prepared with:

samtools view -h SRR11454613.bt2.sam > test.input.sam

(time cat test.input.sam | samtools view -b - > test.output.bam)

vs.

(time cat test.input.sam | python_counter - | samtools view -b - > test.output.bam)


How are you measuring duplicates in cov1r.fa?

For the Python script, this falls under samtools_merge, so serratus/scripts/serratus_merge would be where we want it, eventually as part of samtools_merge.sh.

@victorlin (Collaborator, Author)

re: duplicates, I'm looking at the full sequence entries in the FASTA. Here's some sample code:

from Bio import SeqIO

seqs_cov1r_fwd = [str(seq_record.seq)
                  for seq_record in SeqIO.parse("cov1r.fa", "fasta")
                  if 'REVERSE' not in seq_record.id]

print(f'{len(seqs_cov1r_fwd)} forward sequences')
print(f'{len(set(seqs_cov1r_fwd))} unique sequences')
print(f'{len(seqs_cov1r_fwd) - len(set(seqs_cov1r_fwd))} duplicate sequences')

Output:

33293 forward sequences
24862 unique sequences
8431 duplicate sequences

@victorlin (Collaborator, Author)

Is it necessary to maintain the order of accession IDs from cov0r.fa?

Accession_1 0
Accession_2 73
Accession_3 800034
...
Accession_33296 23
REVERSE_Accession_1 3
REVERSE_Accession_2 0
REVERSE_Accession_3 12
...
REVERSE_Accession_33296 0

If it's acceptable to have the results unsorted and to exclude 0-counts, like this:

Accession_3 800034
Accession_2 73
REVERSE_Accession_3 12
...
Accession_33296 23
REVERSE_Accession_1 3

...then it can be achieved with an awk one-liner on the SAM STDIN, without even referencing cov0r.fa:

awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}'

The counts can be written to a file counts.txt (or CSV/TSV formatted) using tee:

cat SRR11454613.sam \
  | tee >(awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}' > SRR11454613_counts.txt) \
  | samtools view -b - > SRR11454613.bam

@ababaian (Owner)

That looks good. It will for sure be faster if we exclude zero counts (most will be zero), so yes. How is the performance?

@victorlin (Collaborator, Author)

Quickly testing 3 runs of SRR11454613.bt2.sam | samtools view vs. .sam | counter | samtools view, it seems like ~9.5s vs. ~12.8s on my local machine.

Here is a full summary:

$ (time cat SRR11454613.bt2.sam | samtools view -b - > test.output.bam)
    
    #1
    real    0m9.461s
    user    0m9.328s
    sys     0m0.238s

    #2
    real    0m9.412s
    user    0m9.301s
    sys     0m0.236s

    #3
    real    0m9.786s
    user    0m9.588s
    sys     0m0.304s

$ (time cat SRR11454613.bt2.sam | tee >(awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}' > SRR11454613_counts.txt) | samtools view -b - > test.output.bam)

    #1
    real    0m12.432s
    user    0m9.474s
    sys     0m0.420s

    #2
    real    0m12.914s
    user    0m9.866s
    sys     0m0.535s

    #3
    real    0m12.886s
    user    0m9.851s
    sys     0m0.532s

@victorlin added the BASH Scripting label Apr 13, 2020
@ababaian (Owner) commented Apr 13, 2020

I think that should be OK performance. Now we just need to calculate TP/FP/FN/TN values using an extended counts.txt.

Take a look at this simulated data-set (CoV simulations), which you can download from s3://serratus-public/notebook/200411/. You can make a similar simulated data-set for testing ROC, and make it a bit deeper too.

Brainstorming, I can think of two ways to do this; maybe you can think of a better experiment.

  1. Take the wuCoV reference sequence (NC_045512v2). Create a 5% divergence 'positive control' sequence and a 100% divergence 'negative control' sequence. Map these reads against the original wuCoV sequence or the reverse non-complement of the sequence, and calculate the ROC stats based on whether the + sequences match the reference sequence or the REVERSE, and whether the - sequences map to either (false positive). There is no measurement of True Negatives with this, though.

  2. Same as above, but the negative simulated sequences are 5% divergent from the REVERSE non-complement sequences.

edit: performance is much less of an issue here, as this is only for benchmarking.

@victorlin (Collaborator, Author)

I feel like I'm missing an important concept here. How do the counts from earlier play into the TP/FP/FN/TN calculation? It seems that all the measurement is done on simulated positive/negative sequences.

@ababaian (Owner)

When we're aligning sequences from a given accession, we have no idea if the library is CoV+ or CoV-. The count will be used to report how much alignment there is in the library to CoV forward or REVERSE sequences. We will classify libraries with a high forward count and a high forward:reverse ratio as 'potentially CoV+ libraries' and re-analyze those in more detail using transcriptome assembly.
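
For illustration, that screen could be as simple as the following sketch (the thresholds and the counts.txt name are placeholders, not agreed values):

fwd = rev = 0
with open("counts.txt") as counts:          # output of the awk counter above
    for line in counts:
        name, n = line.split()
        if name.startswith("REVERSE_"):
            rev += int(n)
        else:
            fwd += int(n)

# placeholder thresholds for flagging a library as potentially CoV+
if fwd >= 100 and fwd / max(rev, 1) >= 10:
    print("potentially CoV+ library")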

The TP/FP/FN/TN rates with simulated reads are purely to establish benchmarks, so we know how well the experimental system works in theory.

@ababaian added this to Task In Progress in TODO List Apr 14, 2020
@victorlin (Collaborator, Author) commented Apr 17, 2020

I've tried some preliminary simulations based on the notebook referenced above. To me, it seems that all TP/FP/FN/TN values can be derived from experiment 1, outlined below.

Definitions:

  • cov : genome
  • rev_cov : reverse non-complement of cov
  • sim_pos : simulated reads from 5% divergence from cov
  • sim_neg : simulated reads from 100% divergence from cov

Statistics:

  • TP : number of reads in sim_pos that align to cov only
  • FN : number of reads in sim_pos that align to rev_cov or remain unmapped
  • FP : number of reads in sim_neg that align to cov only
  • TN : number of reads in sim_neg that align to rev_cov or remain unmapped

Simulation

Here are bowtie2 counts based on cov=NC_045512v2:

                 aligned concordantly:
                 0 times   exactly 1 time   >1 times
sim_pos~cov      175       7300             0
sim_pos~rev_cov  7475      0                0
sim_neg~cov      7475      0                0
sim_neg~rev_cov  7475      0                0

Results would be:

  • TP = 7300
  • FN = 175
  • FP = 0
  • TN = 7475

The calculation for these results was trivial. However, if the alignment of sim_pos~rev_cov had successfully mapped some reads, we would have to know the outcome of both cov and rev_cov alignments on a per-read basis.
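
A sketch of that per-read bookkeeping, joining the two bowtie2 SAM outputs by query name (the file names here are hypothetical placeholders):

def read_outcomes(sam_path):
    # Map each query name to True if any of its alignments is mapped.
    outcomes = {}
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):
                continue
            qname, flag = line.split("\t", 2)[:2]
            mapped = not (int(flag) & 0x4)       # 0x4 = segment unmapped
            outcomes[qname] = outcomes.get(qname, False) or mapped
    return outcomes

cov = read_outcomes("sim_pos_vs_cov.sam")        # hypothetical file name
rev = read_outcomes("sim_pos_vs_rev_cov.sam")    # hypothetical file name
TP = sum(1 for q in cov if cov[q] and not rev.get(q, False))   # cov only
FN = len(cov) - TP    # mapped to rev_cov, to both, or unmapped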

@ababaian (Owner)

This is very clear-headed. Good job. A few notes:

> Statistics:
> TP : number of reads in sim_pos that align to either cov or rev_cov

The reads that align to rev_cov would be FN; rev_cov is a bait or control sequence to measure how many of the reads that would align to cov1 are simply background.

> TABLE
> • FP = 175
> • FN = 0

I think these are switched; there are no FP, only 175 FN.

> The calculation for these results was trivial...

The point is not just to do this for this one data-set, but to use it to benchmark/test a wide variety of bowtie2 (or other aligner) settings across a wide range of % divergence. That's why I was saying that if you can make this compile the data neatly, we'll be running it something like 200x across many conditions to 'explore' the alignment algorithm space.

@victorlin (Collaborator, Author)

Thanks for the clarification regarding rev_cov, and good catch - I've updated my comment accordingly. The main difference to note is that TP/FP should check for alignment to cov only (alignment to both should be categorized as FN/TN).

I'll have to look into a way to track the outcome of both cov and rev_cov alignments on a per-read basis, since we can't get these numbers by just looking at the bowtie2 output, but in the end we should be able to have a table like this:

           TP     FN    FP   TN
GENOME1    7300   175   0    7475
GENOME2    ...    ...   ...  ...
GENOME3    ...    ...   ...  ...
...        ...    ...   ...  ...

I'm guessing cov0 can be used as a starter for the list of genomes?

@ababaian (Owner)

Negative, use cov1r. You can parse out all the reverse sequences, as their RNAMEs will all begin with REVERSE_ :) exactly for this use case.

@victorlin (Collaborator, Author)

ohh perfect. Think I have what's needed to get started then!

@victorlin (Collaborator, Author)

I have a preliminary Python script for this. Here's some sample output:

GENOME       TP    FN   FP   TN
X81024.1     172   3    0    175
KX092228.1   99    1    0    100
KX092227.1   98    2    0    100
KX092226.1   99    1    0    100
KX092225.1   100   0    0    100
KX092224.1   8     0    0    8
KX092223.1   75    0    0    75

I'm thinking it might be better to normalize the counts as proportions relative to the number of reads, like this:

GENOME       TP        FN      FP      TN
X81024.1     172/175   3/175   0/175   175/175
KX092228.1   99/100    1/100   0/100   100/100
KX092227.1   98/100    2/100   0/100   100/100
KX092226.1   99/100    1/100   0/100   100/100
KX092225.1   100/100   0/100   0/100   100/100
KX092224.1   8/8       0/8     0/8     8/8
KX092223.1   75/75     0/75    0/75    75/75

@rcedgar (Collaborator) commented Apr 20, 2020

"ROC" implies plotting as a function of a confidence score, which in this case would be the MAPQ value in the SAM/BAM file. For example, if the mapper reports MAPQ=0 (i.e., P_error=1), do we consider the hit or not? This is tricky because MAPQ doesn't mean what we want it to mean. We want "error probability that the read is derived from this genome", but MAPQ actually means "assuming the read does come from this genome, probability the location is wrong".

@ababaian (Owner)

This is a little off-topic: for actual data, we likely won't use MAPQ as a confidence function, but at the end of the day we do need some function to classify a library as potentially CoV+ or CoV-. See #41 for details of the false positives encountered so far. What they have in common is that even when they had a very high number/percentage of mapped reads, the mapping was localized to a very small seed region with clipping on either side. We can 'mask' these regions out of the pan-genome, but I feel like it might be whack-a-mole with false positives this way. Instead, we should consider: if an RNAME has above a threshold number of mapped reads in a library (say 100), we calculate how many bases of the reference sequence are 'covered'. If it's something like 50 nt / 5000 nt, then it's likely a false positive or mismapping; if it's broader, like 400 nt / 5000 nt, then maybe we have something. This has obvious limitations, so we'll need to think it through more carefully.
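
A rough sketch of that coverage-breadth heuristic (plain-SAM parsing on STDIN; it ignores CIGAR and approximates each read's footprint as POS..POS+len(SEQ), and the threshold is the placeholder value suggested above):

import sys
from collections import defaultdict

MIN_READS = 100                 # placeholder threshold
covered = defaultdict(set)      # RNAME -> covered reference positions
reads = defaultdict(int)        # RNAME -> mapped read count

for line in sys.stdin:
    if line.startswith("@"):
        continue
    f = line.rstrip("\n").split("\t")
    rname, pos, seq = f[2], int(f[3]), f[9]
    if rname == "*":
        continue
    reads[rname] += 1
    covered[rname].update(range(pos, pos + len(seq)))

for rname, n in reads.items():
    if n >= MIN_READS:
        # a narrow footprint (e.g. 50 nt) suggests a localized false positive
        print(f"{rname}\t{n} reads\t{len(covered[rname])} nt covered")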

On topic: this looks great, Victor. If we have a quick py-script to calculate and report these values, then we can apply it to larger optimizations, i.e. get it into the hands of @charlescongxu for his bowtie2 optimization.

victorlin added a commit that referenced this issue Apr 21, 2020
@victorlin (Collaborator, Author)

@ababaian @charlescongxu I've added the script + docs in a new directory named benchmarking - feel free to move it somewhere more fitting. Let me know if you have any trouble running it in your environment.

@ababaian (Owner) commented Apr 21, 2020

Hey @victorlin, this looks like a good backbone for benchmarking. Can you break apart some of the functionality in this? We'll want it to be a bit more flexible than what you have:

  1. -x Set the reference genome to align against (make this independent of input sequence)
  2. -i Specify the input "positive" sequence (TP)
  3. -n Specify the input "negative" sequence (FP). Default if unset would be to use the same as -i
  4. Print to STDOUT instead of a file.
  5. -bt2 optional parameter: pass this string as a parameter to bowtie2, e.g. -bt2 "--very-sensitive-local" to run bowtie2 --very-sensitive-local instead of the default.

Then in the README, add the definitions of TP/FP/TN/FN you made above, so we can close this issue and not lose important information :)

Edit:

  6. Also a bypass for positive and negative sequence input as fastq files.

@victorlin (Collaborator, Author) commented Apr 22, 2020

Sure, more flexibility is always nice :)

Currently the script works on the entire pan-genome (cov1r.fa). The proposed changes seem to act on each genome entry (right now each genome has different sim_pos, sim_neg). Two approaches here:

  1. Repurpose the script to generate one result (TP/FP/TN/FN) at a time, rather than all. Call this using another script that will define parameters and aggregate results into an output table.
  2. Inputs for sim_pos and sim_neg would be in a FASTA file which has 1 entry per genome in cov1r.fa.

Option 2 would require fewer changes, but either is doable. Ultimately it's up to the use case and how these additional parameters are determined.

Other questions:

  1. Would it be desirable to pass parameters for msbar and art_illumina as well? This would make the run command quite lengthy, but allows for more control over variables that are currently hard-coded.
  2. What's the difference between 2/3 and 6? Are 2/3 about the divergent genomes (FASTA) and 6 about the simulated reads from those (FASTQ)?

@ababaian (Owner)

  1. Yes, I think having the output be a single set of TP/FP/TN/FN is going to be the move, with all of this wrapped in another script to vary the parameters. One of the key experiments we need is to also compare alignment rates between individual CoV genomes (i.e. how does MERS1 align against SARS-CoV-2).

  2. If an input is a multi-FASTA file, then the simulations should be based on all entries of that input. Consider the experiment of aligning the cov1r pan-genome to the human genome hg38, or the reverse.

  3. Yes, parameters for every command would be great; again, use the defaults we used before but allow for overriding them.

  4. Correct, 2/3 are for FASTA input sequences. 6 would be providing pre-simulated reads, so we don't have to recalculate every time and so we can use real fastq data files (e.g. real human transcriptome vs cov1r, or real SARS-CoV-2 against the human genome).

@victorlin (Collaborator, Author) commented Apr 23, 2020

Thought I'd hash out the high-level CLI first and get your opinion on it.
Note: I've changed the method of simulating negative reads from 100% divergence of cov to 5% divergence of rev_cov. I think it's more consistent with the way we are using rev_cov (which will be the default for --neg_seq, --neg_align_seq) as a background filter, plus it makes the parameters easier to understand.

$ python cov_benchmark.py -h
usage: cov_benchmark.py [-h] [--pos_seq SEQ] [--neg_seq SEQ] [--prop_pos PROP] [--prop_neg PROP] [--pos_reads FQ_PREFIX] [--neg_reads FQ_PREFIX] [--pos_align_seq SEQ]
                        [--neg_align_seq SEQ] [--msbar_params STR] [--art_illumina_params STR] [--bowtie2_params STR]
                        input

Runs divergence tests and generates summary statistics.
Outputs values in format: "TP,FN,FP,TN"

positional arguments:
  input                 Raw input sequence. Used for pos/neg read sets and read alignment, unless other parameters specified.

optional arguments:
  --pos_seq SEQ         Sequence used for simulation of positive read set.
                        Default: input sequence.
  --neg_seq SEQ         Sequence used for simulation of negative read set.
                        Default: reverse non-complement of input sequence.
  --prop_pos PROP       Proportion of pos_seq to mutate for simulation of positive read set.
                        Default: 0.05 (5% divergence).
  --prop_neg PROP       Proportion of neg_seq to mutate for simulation of negative read set.
                        Default: 0.05 (5% divergence).
  --pos_reads FQ_PREFIX
                        Positive read set. Paired-end files are {{FQ_PREFIX}}1.fq and {{FQ_PREFIX}}2.fq.
                        If specified, ignores pos_seq and prop_pos.
                        Default: reads derived from pos_seq with prop_pos applied.
  --neg_reads FQ_PREFIX
                        Negative read set. Paired-end files are {{FQ_PREFIX}}1.fq and {{FQ_PREFIX}}2.fq.
                        If specified, ignores neg_seq and prop_neg.
                        Default: reads derived from neg_seq with prop_neg applied.
  --pos_align_seq SEQ   Reference sequence for read alignment.
                        Reads mapped to this SEQ only will be classified as TP/FP.
                        Default: input sequence.
  --neg_align_seq SEQ   Reference sequence for read alignment.
                        Reads mapped to this SEQ will be classified as TN/FN, along with unmapped reads.
                        Default: reverse non-complement of input sequence.
  --msbar_params STR    Additional parameters to pass to the msbar command.
  --art_illumina_params STR
                        Additional parameters to pass to the art_illumina command.
  --bowtie2_params STR  Additional parameters to pass to the bowtie2 command.

Example usages:

1. All default behavior

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT

What this does: simulate positive/negative read sets using 5% divergence from input and reverse(input). Align to input and reverse(input). Calculate and print TP,FN,FP,TN.

  • input : AAAAAAAAAATTTTTTTTTT
  • pos_seq : input
  • neg_seq : reverse(input)
  • prop_pos : 0.05
  • prop_neg : 0.05
  • pos_reads : simulated using 5% divergence from input
  • neg_reads : simulated using 5% divergence from reverse(input)
  • pos_align_seq : input
  • neg_align_seq : reverse(input)
  • msbar_params : -point 4 -block 0 -codon 0
  • art_illumina_params : --seqSys HS20 --paired --len 100 --mflen 300 --sdev 1 --fcov 50 --rndSeed 666 --noALN
  • bowtie2_params : --very-sensitive-local

2. Configure positive/negative alignment sequences

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT \
          --pos_align_seq CAAAAAAAAATTTTTTTTTG \
          --neg_align_seq GTTTTTTTTTAAAAAAAAAC

3. Configure sequences for simulating positive/negative read sets

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT \
                --pos_seq AAAAAAAAAATTTTTTTTTG \
                --neg_seq GTTTTTTTTTAAAAAAAAAA

4. Configure pre-simulated reads

$ python cov_benchmark.py AAAAAAAAAATTTTTTTTTT \
              --pos_reads path/to/pos_prefix \
              --neg_reads path/to/neg_prefix

This can result in lengthy commands. All the SEQ inputs can be single-entry FASTA files instead if that's easier to work with.

@ababaian (Owner)

This looks like it's what we need; the only thing that needs to be included is allowing for multi-FASTA files in addition to regular FASTA files. We can then put a wrapper around this bad boy and generate all the statistics we need for all the different 'experiments' we can think of.

@victorlin (Collaborator, Author)

@ababaian so all the parameters taking a SEQ as input should take FASTA files instead (each with n records), and the command will run n experiments with the other parameter values held constant?

@ababaian (Owner)

No, a multi-FASTA file will be treated as a single entity to generate reads from or align against. Each run of the command should yield a single output.

@victorlin changed the title from "ROC statistics on BAM file/stream for control sequence" to "Benchmarking statistics for divergence simulation tests" Apr 24, 2020
@mathemage linked a pull request Apr 26, 2020 that will close this issue
@ababaian closed this as completed Dec 9, 2020
TODO List automation moved this from Task In Progress to Completed Tasks Dec 9, 2020
@ababaian removed this from Completed Tasks in TODO List Dec 9, 2020