Limiting the search to the first n characters #709
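Barcodes in data I have worked with were usually directly at the 5' end, and Cutadapt offers [anchored 5' adapters](https://cutadapt.readthedocs.io/en/stable/guide.html#anchored-5adapters) for these. If you need to be a bit more flexible, you could use a [non-internal adapter](https://cutadapt.readthedocs.io/en/stable/guide.html#non-internal) with a couple of `N` characters at the beginning and set the minimum overlap such that only full occurrences are allowed.
For example, if you have barcode `ACGTACGT` (length 8) and you want to allow up to 5 bases preceding it:
`cutadapt -g 'XN{5}ACGTACGT;min_overlap=8' ...`
If you allow a large number of `N` bases like this, this is a bit slower than it could be, so please let me know if that is the case and if it is a bottleneck and I could have a look into optimizing this.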
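To make the idea of a bounded search concrete, here is a minimal Python sketch of looking for a complete barcode that may be preceded by at most a few bases. It is not how Cutadapt works internally: the function name `find_barcode_near_start` and its parameters are hypothetical, and it counts only substitutions, whereas Cutadapt's error-tolerant matching also handles insertions and deletions.

```python
from typing import Optional


def find_barcode_near_start(
    read: str,
    barcode: str,
    max_offset: int,
    max_mismatches: int = 0,
) -> Optional[int]:
    """Return the smallest offset in 0..max_offset at which the complete
    barcode occurs with at most max_mismatches substitutions, else None.

    Hypothetical helper for illustration only; not part of Cutadapt.
    """
    for offset in range(max_offset + 1):
        candidate = read[offset:offset + len(barcode)]
        if len(candidate) < len(barcode):
            # The read is too short for a full occurrence at this offset.
            break
        mismatches = sum(a != b for a, b in zip(candidate, barcode))
        if mismatches <= max_mismatches:
            return offset
    return None


# Barcode preceded by two extra bases: found, because 2 <= max_offset.
hit = find_barcode_near_start("TTACGTACGTGGGG", "ACGTACGT", max_offset=5)

# Barcode occurring only further inside the read: ignored.
miss = find_barcode_near_start("GGGGGGGGGGACGTACGT", "ACGTACGT", max_offset=5)
```

Here `max_offset=5` plays the role of `N{5}` in the command, and requiring a full-length candidate corresponds to setting `min_overlap` to the barcode length.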
Thanks Marcel for your quick response, I'll give it a try. I agree
that with Illumina datasets the adapter commonly starts at the first
bases. I have often used the anchoring option and it has worked
well. However, I have recently come across two fairly common cases
in which there is some uncertainty about where the barcode is:
- Illumina datasets in which a variable number of Ns has been added
before the barcodes to increase within-cycle variability and hence
quality. I have found cases in which the number of Ns is not exactly
as predicted, so I thought that looking for the actual adapter, but
only within the first 20 bp, would be quicker and more precise.
- Oxford Nanopore datasets, in which the starting point of the sequence
varies quite a bit, and in which the higher error rate makes it easy
to find the adapter in the wrong place.
Hopefully -g XN{4} will solve both of my issues (I didn't know about the
X option and have never used it). I will let you know how it goes, if you
are interested.
Best Regards
On Mon, 29 May 2023 at 13:10, Marcel Martin wrote:
Barcodes in data I have worked with were usually directly at the 5' end,
and Cutadapt offers anchored 5' adapters
<https://cutadapt.readthedocs.io/en/stable/guide.html#anchored-5adapters>
for these. If you need to be a bit more flexible, you could use a non-internal
adapter
<https://cutadapt.readthedocs.io/en/stable/guide.html#non-internal> with
a couple of N characters at the beginning and set the minimum overlap
such that only full occurrences are allowed.
For example, if you have barcode ACGTACGT (length 8) and you want to
allow up to 5 bases preceding it:
cutadapt -g 'XN{5}ACGTACGT;min_overlap=8' ...
If you allow a large number of N bases like this, this is a bit slower
than it could be, so please let me know if that is the case and if it is a
bottleneck and I could have a look into optimizing this.
Hi there,
I was wondering if there is a way of limiting the search for adapters to the first n characters (or the last n characters) of each sequence. That would be particularly useful when demultiplexing: if there are a considerable number of barcodes to match, a common problem is that one of the barcodes matches somewhere in the middle of the read. As many sequencing experiments return data with a known structure, one can expect the demultiplexing information to be located in the first n characters, so finding it would be more precise and quicker if the search could be limited to those first characters.
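As a rough illustration of the problem described above, the following Python sketch assigns a read to a sample by looking for barcodes only in the first `window` characters. Everything in it is made up for the example (the function `assign_barcode`, the sample names, the barcode sequences and the read), and it uses exact substring matching rather than any error-tolerant search.

```python
from typing import Dict, Optional


def assign_barcode(read: str, barcodes: Dict[str, str], window: int) -> Optional[str]:
    """Return the name of the single barcode found within the first
    `window` characters of the read, or None if zero or several match.

    Hypothetical helper for illustration only.
    """
    prefix = read[:window]
    hits = [name for name, seq in barcodes.items() if seq in prefix]
    return hits[0] if len(hits) == 1 else None


# Made-up barcodes and read: the real barcode sits near the start, while a
# second barcode sequence happens to occur in the middle of the read.
barcodes = {"sample_a": "ACGTACGT", "sample_b": "TTGGCCAA"}
read = "TG" + "ACGTACGT" + "CCCC" + "TTGGCCAA" + "GGGG"

# Searching only the first 12 characters gives an unambiguous assignment.
windowed = assign_barcode(read, barcodes, window=12)

# Searching the whole read finds both barcodes, so the read is ambiguous.
whole_read = assign_barcode(read, barcodes, window=len(read))
```

Restricting the window also means fewer characters are compared per read, which is the speed-up hoped for in the question.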