Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

barcode and umi opposite sides and whitelist optimizations #44

Open
ljwharbers opened this issue Jun 24, 2024 · 5 comments
Open

barcode and umi opposite sides and whitelist optimizations #44

ljwharbers opened this issue Jun 24, 2024 · 5 comments

Comments

@ljwharbers
Copy link

Hi,

First off, great tool! This is exactly what I was looking for.

I have two questions, the first is regarding how to run flexiplex when your barcode and UMI are on different sides.
My read structure is as follows:

Adaptor_seq1 -- BC (32bp) -- spacer_seq -- polyA -- cDNA -- UMI -- Adaptor_seq2

Do I understand it correctly that the best way to do this is as follows:
gunzip -c input.fastq.gz | flexiplex -x CTACACGACGCTCTTCCGATCT -b "???????????????????????????????" -u "" -f 4 | flexiplex -x TTTTTTTT -b "" -u "????????" -x CAGACGTGTGCTCTTCCGATCT -f 4

My second question arises from the fact that my whitelist is millions of barcodes (determined with short read illumina sequencing). The combination of the long barcodes and the millions of whitelisted barcodes will make the barcode matching impossible. I tried to run it on a smaller subset of my data initially in discovery mode and obtain a smaller whitelist using the flexiplex_filter script, but this will still result in >200K barcodes (likely still over a million with the full dataset). Do you have any suggestions or possible workarounds. Potentially only searching for perfect matches with the whitelist would be a solution, but I don't think this is currently implemented(?).

Best,
Luuk

@olliecheng
Copy link
Member

olliecheng commented Jun 24, 2024

Hi! Thanks for your kind words.

At the moment, your first flexiplex command is in 'discovery' mode and so won't work when piped directly into another run, as it doesn't produce a .fastq output - instead, it produces a table of barcode counts. However, if you had a whitelist passed in with -k, you would be able to pipe it in just fine.

When it comes to your second run of Flexiplex, the structure that the tool is looking for reads from left to right, with the adapter/BC/UMI sequence on the left and the cDNA on the right. This isn't an issue in practice as Flexiplex scans both the forward and reverse complement, but it does mean that if you expect something like:

cDNA + "TTTTTTTT" + umi + "CAGACGTGTGCTCTTCCGATCT"

in your forward strand, to demultiplex with Flexiplex you would want to search for the reverse complement of that:

AGATCGGAAGAGCACACGTCTG + umi_rev_cmp + "AAAAAAAA" + cDNA_rev_cmp

Furthermore, as per the docs, if you only want to extract the UMI and know that there should be no barcode, you would also pass in -b "" -k "?". So, I think the correct usage of Flexiplex in this situation would be:

flexiplex -x "AGATCGGAAGAGCACACGTCTG" -u "????????" -x "TTTTTTTT" -b "" -k "?" -f 4

You may find it helpful to use the -n flag to set different prefixes for the output files produced by both stages, otherwise they'll start overwriting each other.

As for your second question, you should be able to configure the barcode error tolerance to be zero with the -e 0 flag, but if you have a high number of barcodes, this would still be pretty slow. Alternatively, if you'd prefer not to set a barcode list, I think -k "?" should work in a more performant way... kind of. It will allow anything which matches the format of -b, but I don't think the read header contains the actual barcode as a result. This information may be able to be extracted from the flexiplex_reads_barcodes.txt, though... Is the ideal solution for you to have an option to extract barcodes without a whitelist (as if the whitelist contained every possible kmer I guess), so you can clean up the data downstream?

Cheers
OIllie

@olliecheng
Copy link
Member

I should add as an aside: just earlier today, I had a colleague ask me the same question about a BC + cDNA + UMI situation! It might be a good idea to add this to the documentation...

@ljwharbers
Copy link
Author

Hi Ollie,

Thanks for the quick reply!

With your pointers I do think I now managed to get both barcode and the UMI.

For reference, a (really bad) illustration of my read setup and my final command:
image

gunzip -c input.fastq.gz | flexiplex -x CTACACGACGCTCTTCCGATCT -b "???????????????????????????????" -u "" -f 4 -k "?" -n barcode | flexiplex -x "AGATCGGAAGAGCACACGTCTG" -u "????????" -x "?" -b "" -k "?" -f 4 -n umi > test.fastq

Since the UMI should be right before the adaptor, I think giving just the revcomp directly followed by UMI should work. This gives me the following output, which seemed really good to me:

1 million reads processed..
Number of reads processed: 1000000
Number of reads where a barcode was found: 915079
Number of reads where more than one barcode was found: 21953
All done!
Number of reads processed: 937636
Number of reads where a barcode was found: 758725
Number of reads where more than one barcode was found: 6297
All done!

However, it seems that setting -k "?" results in both the barcode not being appended to the read and not being included in the *_reads_barcodes.txt (unless I messed something up, which could very well be the case).
head barcode_reads_barcodes.txt

Read CellBarcode FlankEditDist BarcodeEditDist UMI
2a02b969-f655-48a5-b38e-d509201710db ? 0 1
a73dcab7-4506-4136-9a6c-86e60d0b3407 ? 0 1
b38787e9-3446-414a-9b48-319bf8331bb5 ? 0 1
ae642a7f-3e69-43e8-b20f-5d8d004d17e6 ? 2 1
2ca79e43-aaae-458c-9c6c-4deb9a456cee ? 0 1
2ca79e43-aaae-458c-9c6c-4deb9a456cee ? 0 1
f029c3b4-6309-4eef-b8e6-f6a220721b6d ? 3 1
a13c222c-968d-4829-a7dc-c163f9a8ffa0 ? 2 1
fb097d90-703b-4fc4-af15-23af58b4bb19 ? 0 1

To clarify my whitelist a bit further, it is not all possible kmer but it is the result from Illumina presequencing so the majority should actually also be present in our long read dataset. But there will be very few reads associated with each barcode. I've tried running it with the whitelist and setting -e 0 but this would still take days/week to run on my actual dataset.

So I guess the ideal result for me would be to run it with -k "?" but that the barcodes get appended to the read name. If the barcode would only be reported in *reads_barcodes.txt it is also fine but I will have to append them to the read name with a quick script myself afterwards.

Thanks a lot again and sorry for the very specific usecase :')

Best,
Luuk

@nadiadavidson
Copy link
Contributor

Hi Luuk,

Thanks for sending so much detail about your use case!

As a simple work around for your first issue of not having the barcodes reported. I think you could treat the barcodes as UMIs. So run something like:
gunzip -c input.fastq.gz | flexiplex -x CTACACGACGCTCTTCCG -b "ATCT" -u "???????????????????????????????" -f 3 -e 1 -k "ATCT" -n barcode | flexiplex -x "AGATCGGAAGAGCACACGTCTG" -u "????????" -x "?" -b "" -k "?" -f 4 -n umi > test.fastq
Then you'll get the the barcode in the read, prefaces with the ATCT_ which hopefully is easy to ignore in your down stream analysis.

However, this won't do any error correction of the barcodes using the whitelist, which is essential your second issue/question. If you would like to error correct to the Ilumina list, my only thought would be to spit your reads up into multiple subfiles and run them in parallel, providing the whitelist to the -k parameter. Flexiplex has an option for multithreads as well, which you could try, but this tends to be slower than file splitting for large threads in our experience. Either way, my guess it that it will be very slow with millions of barcodes which are 32bp long.

This is a really nice scenario for us to think about how we can improve the performance of the tool. So thanks for bringing it to our attention!

All the best,
Nadia.

@ljwharbers
Copy link
Author

Hi Nadia,

Unfortunately we just missed eachother during your visit to Leuven!

Thanks for the response and workaround suggestion! This does indeed seem to work perfectly and it shouldn't be a problem to extract the barcode in this way.

Regarding the error correction, I have an implementation to do 1 mismatch correction (written in c++), but in this particular dataset it does not rescue many read. Correcting using the levenshtein distance would of course be better but computationally not feasible for this dataset. That's why, for the moment, I'm not too bothered to perform the mismatch correction.

I'm definitely going to use this approach over my current implementation, a hacked variant of blaze, since it seems to perform a lot better compared to my hacked variant of blaze (both in spee, number of reads it retains, and simplicity of running it).

Thanks again and I'm more than happy to help testing things if you're looking to improve the performance on datasets like mine.

Best,
Luuk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants