
Filtering reads based on low-quality base ratio #699

Open
gilstel opened this issue Apr 27, 2023 · 5 comments

Comments

@gilstel

gilstel commented Apr 27, 2023

Is there a way to filter reads based on low-quality base ratio of the read?
Many thanks,
Gil

@rhpvorderman
Collaborator

Would --max-expected-errors be sufficient for your use case? https://cutadapt.readthedocs.io/en/stable/guide.html#filtering
@marcelm I sped up the underlying calculation a bit: #700

I personally feel a --max-error-rate flag would be very helpful, as it would also work for variable length reads. Currently I have this implemented in fastq-filter but I'd rather use cutadapt than add another tool to the pipeline.
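The distinction between the two filters can be sketched in a few lines of Python. This is an illustration of the underlying math, not Cutadapt's or fastq-filter's actual implementation; the function names are made up for this example.

```python
def expected_errors(qualities):
    """Sum of per-base error probabilities for a list of Phred scores.

    This is what a --max-expected-errors style filter thresholds on;
    it grows with read length.
    """
    return sum(10 ** (-q / 10) for q in qualities)


def error_rate(qualities):
    """Expected errors divided by read length.

    Length-independent, so the same threshold works for variable-length
    reads -- the motivation for a --max-error-rate style flag.
    """
    return expected_errors(qualities) / len(qualities)


# A 100 bp read of uniform Q20: each base has error probability 0.01,
# so 1.0 expected errors and an error rate of 0.01. A 1000 bp read of
# the same quality would have 10 expected errors but the same rate.
read = [20] * 100
print(expected_errors(read))  # ~1.0
print(error_rate(read))       # ~0.01
```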

@marcelm
Owner

marcelm commented Apr 28, 2023

> I personally feel a --max-error-rate flag would be very helpful, as it would also work for variable length reads.

Is that the -e option in fastq-filter? That would be the number of expected errors (as computed by --max-error-rate) divided by the read length IIUC. I think I’d be fine with adding that.

I read https://www.drive5.com/usearch/manual/avgq.html back when I added --max-expected-errors; that appears to be the same motivation for -e in fastq-filter.

@rhpvorderman
Collaborator

> Is that the -e option in fastq-filter? That would be the number of expected errors (as computed by --max-error-rate) divided by the read length IIUC.

That is exactly it. It is a bit more generic as it does not care about the length and therefore works well on variable-length sequences.

The major mistake I made in fastq-filter is that there is also a median filter. That one is quite terrible because the median is not an informative metric. There is also a Q-score filter that correctly does the average (it converts the scores to error rates internally), but I wish I hadn't added that either, because Q-scores are massively confusing. Q10 vs Q20 vs Q30 are massive differences and the Q-scores simply don't convey that. Error rate is much better, as it does what it says on the tin.

It was a good exercise in pedantic performance optimization though; the lookup table is really fast.

> I read https://www.drive5.com/usearch/manual/avgq.html back when I added --max-expected-errors; that appears to be the same motivation for -e in fastq-filter.

Yes, exactly. I got it from https://gigabaseorgigabyte.wordpress.com/2017/06/26/averaging-basecall-quality-scores-the-right-way/. Once seen, this can't be unseen. I hate that almost every tool (FastQC, fastp) does this the wrong way, and even after being informed about it, they haven't fixed it.
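The point made in that blog post can be demonstrated in a short sketch (illustrative code, not from any of the tools mentioned): averaging Phred scores directly overstates quality, because Phred is a log scale; the correct average converts to error probabilities first, averages those, and converts back.

```python
import math


def naive_mean_q(qualities):
    """The common-but-wrong approach: arithmetic mean of Phred scores."""
    return sum(qualities) / len(qualities)


def correct_mean_q(qualities):
    """Convert scores to error probabilities, average, convert back."""
    mean_p = sum(10 ** (-q / 10) for q in qualities) / len(qualities)
    return -10 * math.log10(mean_p)


# Half Q10 (p = 0.1) and half Q30 (p = 0.001): the naive mean reports
# Q20 (p = 0.01), but the true mean error probability is 0.0505,
# which corresponds to roughly Q13.
quals = [10] * 50 + [30] * 50
print(naive_mean_q(quals))    # 20.0
print(correct_mean_q(quals))  # ~12.97
```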

@gilstel
Author

gilstel commented Apr 28, 2023

Hi

I tried to read the information in the links you provided (e.g. the information on the --max-expected-errors option), but I am still confused about how to actually use it (which threshold to choose).
If, for example, I would like to filter out any reads whose low-quality base ratio (bases with quality less than or equal to 5) is more than 50%, what is the best way to perform this filtering step?

Many thanks,
Gil

@marcelm
Owner

marcelm commented May 3, 2023

> If, for example, I would like to filter out any reads whose low-quality base ratio (base quality less than or equal to 5) is more than 50%.

I am not familiar with this ratio. Is this something that is used in other tools? In any case, it is not directly supported in Cutadapt.

You can choose the --max-expected-errors threshold such that it is at least as strict as your filtering criterion, but it will also discard some reads that your filter would keep.

Let’s assume you have a read length of $n=100$ bp. If I understand correctly, using your filter, the read will be discarded if it has more than 50 bases of quality 5. Let’s say it has exactly 50 bases of quality 5 and 50 bases of quality infinity (so we can ignore their error probabilities). A base of quality 5 has an error probability of $10^{-\frac{5}{10}}\approx0.316$. Since the number of expected errors for a read is the sum of the error probabilities, we get $50\times0.316=15.8$ expected errors for this read.

So for this read length, using --max-expected-errors with a threshold of 15.8 or lower would guarantee that reads exceeding your low-quality base threshold of >50% are filtered out, but it filters out more: for example, a read with 50 quality scores of 5 and one quality score of 6 (and 49 infinite qualities) will also be filtered out by --max-expected-errors 15.8, but not by your criterion.
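The arithmetic above can be verified in a few lines of Python (a standalone check, not Cutadapt code):

```python
# 50 bases at Q5 plus 50 "perfect" bases whose error probability we
# treat as zero. Expected errors = sum of per-base error probabilities.
p_q5 = 10 ** (-5 / 10)   # error probability of a Q5 base, ~0.316
expected = 50 * p_q5     # expected errors for the whole read, ~15.8
print(round(p_q5, 3), round(expected, 1))  # 0.316 15.8
```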

That said, a threshold of 15.8 appears to be very large anyway. https://www.drive5.com/usearch/manual/faq_emax.html recommends using a threshold of 1.
