Shouldn't we correct cell-barcode errors before identifying core barcodes? #79

Open
dylkot opened this issue Apr 3, 2019 · 4 comments

Comments


dylkot commented Apr 3, 2019

I am noticing that I am losing over 50% of my reads because they are tagged with cell-barcodes that are neither "core" nor within 1 edit distance of a "core" cell-barcode. Some of the non-core cell-barcodes that are getting tossed out have more reads than the core ones, and I am wondering why these aren't becoming core barcodes instead. I currently have my expected number of cell barcodes set to 4000, which is perhaps too low and is part of the problem. But I'm also wondering if some of the issue might be related to the order of the pipeline as it currently stands. Currently the order is:

Aligned.out.bam --> Aligned.merged.bam --> Aligned.repaired.bam --> gene_exon_tagged.bam --> gene_exon_tagged_bead_sub.bam --> final.bam

The set of core barcodes is set by running umi_tools whitelist on trimmed_repaired_R1.fastq.gz. But wouldn't it be better to run it on the corrected barcodes (i.e. following DetectBeadSubstitutionErrors and DetectBeadSynthesisErrors)?
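To make the "core or within 1 edit distance" rule concrete, here is a minimal sketch of that kind of assignment. It is not the pipeline's actual code: the function names and toy counts are hypothetical, and it assumes "edit distance" means substitutions only (Hamming distance) and that ambiguous matches are left unassigned.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_to_core(barcode_counts, core):
    """Fold read counts into core barcodes (hypothetical helper).

    A barcode is kept if it is core, or if exactly one core barcode lies at
    Hamming distance 1. Everything else ends up in the unassigned counter,
    no matter how many reads it carries.
    """
    corrected = Counter()
    unassigned = Counter()
    for barcode, n_reads in barcode_counts.items():
        if barcode in core:
            corrected[barcode] += n_reads
            continue
        matches = [c for c in core
                   if len(c) == len(barcode) and hamming(barcode, c) == 1]
        if len(matches) == 1:
            corrected[matches[0]] += n_reads
        else:
            unassigned[barcode] += n_reads
    return corrected, unassigned


# Toy data: GGGG has the most reads but is not core and not near a core barcode.
counts = {"AAAA": 100, "AAAT": 5, "CCCC": 80, "GGGG": 120}
core = {"AAAA", "CCCC"}
corrected, unassigned = assign_to_core(counts, core)
print(corrected)
print(unassigned)
```

In this toy case the barcode with the most reads is discarded simply because it was not in the core set to begin with, which is the situation described above.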

The repair_barcodes step doesn't seem to be in the most recent version of the Drop-seq_Alignment_Cookbook.pdf so I am just wondering what your logic is for doing it the way you are currently doing it. Thanks!

Owner

Hoohm commented Apr 3, 2019

Hello @dylkot

When umi_tools was added for the whitelist, I got better results (more reads) than the standard drop_seq tools and by combining both we should get the best of the two worlds. That was back with the old drop_seq tools.

At this time, umi_tools gives you a whitelist based on the total reads which might be skewed towards cells that have a high number of amplification duplicates. The correction from drop_seq tools would not be affected by this because we correct those barcodes before it happens and hence you should still get those corrected properly. Although, you might recover fewer cells because of the preselection done by umi_tools since the output is based on this list.
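As an illustration of how a read-based ranking can be skewed by amplification duplicates, the sketch below ranks cell barcodes once by total reads and once by distinct UMIs. This is not how umi_tools is implemented; the function name and the toy records are assumptions made for the example.

```python
from collections import defaultdict


def rank_barcodes(records, top_n):
    """Rank cell barcodes by read count and by distinct UMI count.

    `records` is an iterable of (cell_barcode, umi) pairs, one per read.
    Ties are broken alphabetically so the result is deterministic.
    """
    reads = defaultdict(int)
    umis = defaultdict(set)
    for cell, umi in records:
        reads[cell] += 1
        umis[cell].add(umi)
    by_reads = sorted(reads, key=lambda c: (-reads[c], c))[:top_n]
    by_umis = sorted(umis, key=lambda c: (-len(umis[c]), c))[:top_n]
    return by_reads, by_umis


# AAAA: 10 reads but a single UMI (heavy duplication).
# CCCC: 3 reads, 3 UMIs. GGGG: 2 reads, 2 UMIs.
records = (
    [("AAAA", "U1")] * 10
    + [("CCCC", "U1"), ("CCCC", "U2"), ("CCCC", "U3")]
    + [("GGGG", "U1"), ("GGGG", "U2")]
)
by_reads, by_umis = rank_barcodes(records, top_n=2)
print(by_reads)  # ['AAAA', 'CCCC']
print(by_umis)   # ['CCCC', 'GGGG']
```

The heavily duplicated barcode makes the read-based list but drops out of the UMI-based one, so the two selections can disagree about which barcodes count as "core".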

I'm trying to figure out if changing the order would make a difference. I guess the only way to know is to test it out. I don't have time right now to make that happen so it might take a week or two before getting you a definite answer.

Could you post your knee plot here so that we can see how bad it looks and how the curve fits those cell barcodes that are not collected in the end?

I'm not sure I understand what you mean by the repair_barcodes step doesn't seem to be in the most recent cookbook.

Author

dylkot commented Apr 3, 2019

Thanks for the response!

I see, that is all useful context. I can certainly try umi_tools with the --method umis flag to see if that helps. But actually I'm just looking at barcode_mapping_counts.pkl to see how many reads are getting tossed out because they are assigned to barcodes in the "unknown" class. And I was surprised to see that some of those discarded barcodes have more reads than barcodes we are keeping, even after combining reads that get corrected to the same barcode from the "0" and "1" class.

Here is the knee plot:

knee_plot.pdf

And re your last question, I was just stating the obvious that repair_barcodes is your addition to the pipeline and is not in the current iteration of drop_seq tools.

Owner

Hoohm commented Apr 3, 2019

From your plot I would consider the selection to be ok. As we don't see a clear knee/bend, I would rather stay conservative.

I would look at the violin plots as well; you will probably see that the cells past the top 4500-5000 are going to be discarded later on anyway.

My assumption is that those cells will have low complexity (aka high top50) and they won't contribute much to your downstream analysis.

And yes, there is a lot of added functionality on top of drop_seq tools in the pipeline. That's the beauty of making pipelines: you can easily add cool features without any fuss to existing tools :)

Author

dylkot commented Apr 3, 2019

OK, thanks for the feedback! I won't worry too much about those cells then for the time being. I suppose I was just surprised by the cumulative fraction only being 50% but I agree that there isn't really a knee to speak of. Thanks again :).
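For reference, the cumulative fraction discussed here can be computed from per-barcode read counts roughly as follows. This is a sketch with made-up numbers and a hypothetical function name, not the pipeline's knee-plot code.

```python
def cumulative_fraction(read_counts, top_n):
    """Fraction of all reads carried by the top_n barcodes with the most reads."""
    ordered = sorted(read_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:top_n]) / total


# Toy counts: the top 2 barcodes hold 80 of 100 reads.
read_counts = [5, 50, 10, 30, 5]
fraction = cumulative_fraction(read_counts, top_n=2)
print(fraction)  # 0.8
```

A knee plot shows this value as top_n increases; without a clear bend in the curve, any cutoff is a judgment call.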
