barcode and umi opposite sides and whitelist optimizations #44
Comments
Hi! Thanks for your kind words.

At the moment, your first flexiplex command is in 'discovery' mode and so won't work when piped directly into another run, as it doesn't produce a .fastq output - instead, it produces a table of barcode counts. However, if you had a whitelist passed in with

When it comes to your second run of Flexiplex, the structure that the tool is looking for reads from left to right, with the adapter/BC/UMI sequence on the left and the cDNA on the right. This isn't an issue in practice as Flexiplex scans both the forward and reverse complement, but it does mean that if you expect something like:
in your forward strand, to demultiplex with Flexiplex you would want to search for the reverse complement of that:
Furthermore, as per the docs, if you only want to extract the UMI and know that there should be no barcode, you would also pass in
You may find it helpful to use the

As for your second question, you should be able to configure the barcode error tolerance to be zero with the

Cheers |
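To illustrate the reverse-complement point above, here is a minimal Python sketch. It is not part of Flexiplex; `reverse_complement` is a hypothetical helper, and the only sequence used is the adapter from the command in this issue. The `?` wildcard is simply passed through unchanged, which is an assumption about how you might want to handle barcode/UMI placeholders when flipping a pattern by hand.

```python
# Hypothetical helper for illustration only; not a Flexiplex function.
# 'N' and the '?' placeholder are mapped to themselves.
_COMPLEMENT = str.maketrans("ACGTNacgtn?", "TGCANtgcan?")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Adapter taken from the second flexiplex call in this issue.
adapter = "CAGACGTGTGCTCTTCCGATCT"
print(reverse_complement(adapter))  # AGATCGGAAGAGCACACGTCTG
```

When a whole flanking pattern is reversed this way, the order of its elements flips as well, so an adapter that sits to the right of a UMI on one strand ends up to the left of it on the other.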
I should add as an aside: just earlier today, I had a colleague ask me the same question about a BC + cDNA + UMI situation! It might be a good idea to add this to the documentation... |
Hi Luuk,

Thanks for sending so much detail about your use case! As a simple workaround for your first issue of not having the barcodes reported, I think you could treat the barcodes as UMIs. So run something like:

However, this won't do any error correction of the barcodes using the whitelist, which is essentially your second issue/question. If you would like to error correct to the Illumina list, my only thought would be to split your reads up into multiple subfiles and run them in parallel, providing the whitelist to the -k parameter. Flexiplex has an option for multithreading as well, which you could try, but in our experience this tends to be slower than file splitting when many threads are used. Either way, my guess is that it will be very slow with millions of barcodes which are 32bp long.

This is a really nice scenario for us to think about how we can improve the performance of the tool. So thanks for bringing it to our attention!

All the best, |
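The file-splitting idea mentioned above could be sketched roughly as follows. This is an illustrative Python helper, not something shipped with Flexiplex: the function name `split_fastq`, the output naming scheme, and the round-robin assignment are all assumptions, and it only handles FASTQ files with exactly four lines per record.

```python
import gzip
from itertools import islice


def split_fastq(path, n_chunks, out_prefix):
    """Distribute 4-line FASTQ records round-robin over n_chunks files.

    Hypothetical helper: output files are named <out_prefix>.<i>.fastq.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    out_paths = [f"{out_prefix}.{i}.fastq" for i in range(n_chunks)]
    outputs = [open(p, "w") for p in out_paths]
    try:
        with opener(path, "rt") as reads:
            index = 0
            while True:
                record = list(islice(reads, 4))  # one FASTQ record
                if not record:
                    break
                if len(record) != 4:
                    raise ValueError("truncated FASTQ record at end of file")
                outputs[index % n_chunks].writelines(record)
                index += 1
    finally:
        for handle in outputs:
            handle.close()
    return out_paths
```

Each resulting chunk could then be given to its own flexiplex process, with the whitelist supplied through -k as described above, and the per-chunk outputs concatenated afterwards.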
Hi Nadia,

Unfortunately we just missed each other during your visit to Leuven! Thanks for the response and workaround suggestion! This does indeed seem to work perfectly and it shouldn't be a problem to extract the barcode in this way.

Regarding the error correction, I have an implementation to do 1-mismatch correction (written in C++), but in this particular dataset it does not rescue many reads. Correcting using the Levenshtein distance would of course be better but is computationally not feasible for this dataset. That's why, for the moment, I'm not too bothered about performing the mismatch correction.

I'm definitely going to use this approach over my current implementation, a hacked variant of blaze, since it seems to perform a lot better (in speed, in the number of reads it retains, and in simplicity of running it).

Thanks again and I'm more than happy to help test things if you're looking to improve the performance on datasets like mine.

Best, |
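For readers wondering what a 1-mismatch correction against a large whitelist can look like, here is a minimal Python sketch. It is not the C++ implementation mentioned above and not how Flexiplex matches barcodes; `correct_barcode` is a hypothetical function. It assumes substitutions only (Hamming distance 1, so no insertions or deletions) and rejects ambiguous cases where more than one whitelisted barcode is one mismatch away.

```python
def correct_barcode(observed, whitelist):
    """Return the whitelisted barcode for `observed`, or None.

    Hypothetical sketch: exact hits are returned as-is; otherwise every
    single-base substitution is tried and accepted only if exactly one
    whitelisted barcode is reachable.
    """
    if observed in whitelist:
        return observed
    hits = set()
    for i, base in enumerate(observed):
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = observed[:i] + alt + observed[i + 1:]
            if candidate in whitelist:
                hits.add(candidate)
    return hits.pop() if len(hits) == 1 else None


whitelist = {"ACGTACGT", "TTTTCCCC"}
print(correct_barcode("ACGTACGA", whitelist))  # ACGTACGT
```

Because the whitelist is a set, each read costs at most 3 × (barcode length) lookups regardless of how many barcodes are whitelisted, which is what keeps this tractable where an edit-distance comparison against every barcode would not be.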
Hi,
First off, great tool! This is exactly what I was looking for.
I have two questions. The first is about how to run flexiplex when your barcode and UMI are on different sides.
My read structure is as follows:
Do I understand it correctly that the best way to do this is as follows:
gunzip -c input.fastq.gz | flexiplex -x CTACACGACGCTCTTCCGATCT -b "???????????????????????????????" -u "" -f 4 | flexiplex -x TTTTTTTT -b "" -u "????????" -x CAGACGTGTGCTCTTCCGATCT -f 4
My second question arises from the fact that my whitelist contains millions of barcodes (determined with short-read Illumina sequencing). The combination of the long barcodes and the millions of whitelisted barcodes will make the barcode matching impossible. I initially tried running it in discovery mode on a smaller subset of my data to obtain a smaller whitelist using the flexiplex_filter script, but this still results in >200K barcodes (likely still over a million with the full dataset). Do you have any suggestions or possible workarounds? Searching only for perfect matches against the whitelist could be a solution, but I don't think this is currently implemented(?).
Best,
Luuk
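The "perfect matches only" idea raised in the second question can be sketched outside of Flexiplex as a plain set lookup. The Python below is an illustration under assumptions: the whitelist is taken to be one barcode per line, the barcodes are taken to have been extracted from the reads already, and `load_whitelist` and `count_exact_matches` are hypothetical names.

```python
from collections import Counter


def load_whitelist(lines):
    """Build a set of barcodes from an iterable of lines (one barcode per line)."""
    return {line.strip() for line in lines if line.strip()}


def count_exact_matches(barcodes, whitelist):
    """Count only those observed barcodes that occur verbatim in the whitelist."""
    return Counter(bc for bc in barcodes if bc in whitelist)


# Toy data standing in for a whitelist file and extracted barcodes.
whitelist = load_whitelist(["ACGTACGT\n", "TTTTCCCC\n", "\n"])
observed = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "ACGTACGT"]
counts = count_exact_matches(observed, whitelist)
print(counts["ACGTACGT"], counts["TTTTCCCC"], len(counts))  # 2 1 2
```

Set membership is a hash lookup, so the cost per read does not grow with the size of the whitelist; the trade-off is that any barcode with even a single sequencing error is discarded.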