-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Steps for single cell (drop-seq) protocol #97
Comments
Hi @Hoohm. Arguably, you could use the a single barcode (cell + UMI) for deduplication as this would correct for errors in both the cell barcode and UMI simultaneously. However, unless I'm mistaken, the two barcodes are subtly different. Whilst both barcodes are random, the cell barcode should be one of n discrete sequence, where n is the number of cells sequenced. Due to sequencing errors (and possibly multiple cell barcodes in a single droplet for drop-seq?), the number of cell barcodes observed is > n. We would therefore like to correct the cell barcode against a whitelist of the real n barcodes. This is different to the UMIs where we don't have a whitelist of allowed sequences and hence we use networks to distinguish real UMIs from erroneous ones. Is this the case with your drop-seq data too? |
I wasn't aware that drop-seq had white listed cell barcodes? I would be templated to extract the full 20bp as a single barcode/UMI, perform the deduplication on this, and then split the barcode into cell barcode and umi. The only thing you might want to watch out for is that at 20bp, you might want up the edit_distance threshold. What do you think @TomSmithCGAT? In theory, you could do the two separately, but what you'd need to do is this:
Currently we don't provide mechanisms to do this. (not saying its impossible, but its not in the current version, and would have to go on a to do list). |
The way we consider valid cell barcodes is by checking the cumulative
fraction of reads per cell barcodes on a so-called knee-plot. This allows
to determine this white list of barcodes that you select as "whole cells".
The "bend" is considered as the moment where the fraction of reads drops
significantly and hence is not a whole cell anymore but rather floating
RNA. (knee_plot attached)
We already have a method for doing this using a protocol given by the
Macosko group, but I wanted to compare it with your tool.
I've also tried some manual selection, with simple alignement, allowing one
mismatch/deletion/insertion.
I could also use the whitelist from the other method and combine it with
umi_tools for the UMI dedup.
2017-03-21 15:46 GMT+01:00 Tom Smith <notifications@github.com>:
… Hi @Hoohm <https://github.com/Hoohm>. Arguably, you could use the a
single barcode (cell + UMI) for deduplication as this would correct for
errors in both the cell barcode and UMI simultaneously. However, unless I'm
mistaken, the two barcodes are subtly different. Whilst both barcodes are
random, the cell barcode should be one of *n* discrete sequence, where *n*
is the number of cells sequenced. Due to sequencing errors (and possibly
multiple cell barcodes in a single droplet for drop-seq?), the number of
cell barcodes observed is > *n*. We would therefore like to correct the
cell barcode against a whitelist of the real *n* barcodes. This is
different to the UMIs where we don't have a whitelist of allowed sequences
and hence we use networks to distinguish real UMIs from erroneous ones.
Is this the case with your drop-seq data too?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#97 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABNXaD4YuHPoa5RmHRvnzRo5cS1EeYN5ks5rn-KsgaJpZM4MjpmK>
.
--
*Roelli Patrick Division of Animal Physiology and Immunology TUM School of
Life Sciences Weihenstephan Technische Universität München Weihenstephaner
Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment
Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland*
*https://github.com/Hoohm <https://github.com/Hoohm>*
|
Hi @Hoohm. The knee-plot approach sounds identical to the method used in the Klein et al inDrop paper which came out around the same time. When I re-analysed the inDrop data, I first made a cell barcode whitelist, then I parsed the raw data and split the reads into single end fastqs for each whitelisted cell barcode, discarding reads which did not match the whitelist. At the same time I corrected cell barcodes which were one mismatch away from a single whitelisted cell barcode, and extracted the UMIs and appended this to the read name. I think this approach would be the best way to go here too. This is a relatively bespoke extraction procedure since it involves simulatenous cell barcode extraction and correction, raw data splitting and UMI extraction, hence it is not supported by UMI-tools extract. I'm happy to share my code for the above if you would like? |
@IanSudbery Should we perhaps consider supporting a drop-seq/inDrop UMI extraction using the method above. We could leave it up to the user to generate a cell barcode white list as they wish. It wouldn't take much effort to add this to UMI-tools but my concern is how stable the drop-seq/inDrop methods are? However, presumably any droplet based method will have a similar issue with the cell barcodes? |
@TomSmithCGAT In the long run, I think we should provide the ability to pass I'm also interested in whether our methods could act as a replacement for the knee method. Either by deduping on the whole barcode (cell + UMI) or the two step porcess I outlined above. |
@IanSudbery Good idea. How would you see this working generically with any aligner? We could fall back on the method of appending the cell barcode to the read id, i.e: My guess is that the knee method should be near optimal but it would definitely be interesting to compare the results. One of the advantages of the knee method is that it guarantees the cell barcodes are the n most abundant even if this includes cell barcodes which are a single edit distance apart. This might not necessarily be the most accurate answer but it's perhaps more appealing than merging together two highly abundant cell barcodes which are 1 edit distance apart whilst retaining very lowly abundant cell barcode, as could occur with a network-based method. Merging together two highly abundant cell barcodes incorrectly could be lead to apparently interesting but ultimately erroneous findings from one's scRNA-Seq - I'm thinking 'dual-state cell' type findings arising from cell doublets. An issue we would currently have with the network-based method is that the 'true' cell barcodes should be the same across the whole data set and we're currently treating each positions independently. We could solve this with a two pass method, e.g: Pass 1: Tally counts for all cell barcodes across all reads, form networks and resolve to identify the n 'true' barcodes across sample, and crucially, the barcodes which are in the same cell group. For pass 1 I'd suggest the directional method, although the threshold may need to be stricter and/or the edit distance increased? Whichever method we use, we may need to also introduce a minimal counts threshold for a 'true' barcode to be accepted to avoid issues with very low abundance cell barcodes in a single node network. |
I think you are right, merging two barcodes that are close but abundant
should be avoided.
Finding the top n cell barcodes for the whitelist should be used to
collapse the rest of the reads but avoided inside the whitelist.
Although, I'm pretty sure the chance of two high abundance read for two cell barcodes that are one bp away is pretty low. If it happens it might be the same, but this should be checked in post-processing.
@TomSmithCGAT I would be really interested in the code you used.
2017-03-21 18:28 GMT+01:00 Tom Smith <notifications@github.com>:
… @IanSudbery <https://github.com/IanSudbery> Good idea. How would you see
this working generically with any aligner? We could fall back on the method
of appending the cell barcode to the read id, i.e:
@HISEQ:190:H5VNVBCXY:1:1101:1380:2019_AGACTAGGAG;AGCGAGCG
where the barcodes are ;-separeted. If no cell barcodes are extracted, the
cell barcode portion of the bundle key would be empty.
My guess is that the knee method should be near optimal but it would
definitely be interesting to compare the results. One of the advantages of
the knee method is that it guarantees the cell barcodes are the *n* most
abundant even if this includes cell barcodes which are a single edit
distance apart. This might not necessarily be the most accurate answer but
it's perhaps more appealing than merging together two highly abundant cell
barcodes which are 1 edit distance apart whilst retaining very lowly
abundant cell barcode, as could occur with a network-based method. Merging
together two highly abundant cell barcodes incorrectly could be lead to
apparently interesting but ultimately erroneous findings from one's
scRNA-Seq - I'm thinking 'dual-state cell' type findings arising from cell
doublets.
An issue we would currently have with the network-based method is that the
'true' cell barcodes should be the same across the whole data set and we're
currently treating each positions independently. We could solve this with a
two pass method, e.g:
Pass 1: Tally counts for all cell barcodes across all reads, form networks
and resolve to identify the *n* 'true' barcodes across sample, and
crucially, the barcodes which are in the same *cell group*.
Pass 2: Deduplicate per position, per *cell group*
For pass 1 I'd suggest the directional method, although the threshold may
need to be stricter and/or the edit distance increased? Whichever method we
use, we may need to also introduce a minimal counts threshold for a 'true'
barcode to be accepted to avoid issues with very low abundance cell
barcodes in a single node network.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#97 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABNXaKVuIaSUOhwRQXEo8W0l815AhYLpks5roAjBgaJpZM4MjpmK>
.
--
*Roelli Patrick Division of Animal Physiology and Immunology TUM School of
Life Sciences Weihenstephan Technische Universität München Weihenstephaner
Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment
Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland*
*https://github.com/Hoohm <https://github.com/Hoohm>*
|
I feel like the first step would be to allow The next step would be to implement methods for getting that tag onto the BAM record. I think the would have to be some sort of extra field on the fastq record. Could be: Thus the reads would still be compatible the current format where split("_")[-1] is taken to be the UMI. Finally some method for collapsing cell barcodes. Is the "knee" for the "knee" automatically calculated, or done by hand? I suspect that this would be a separate tool. The "knee" method sounds remarkably similar to the "percentile" method for umi deduplicating (except applied across the whole sample rather than just at one position). |
@IanSudbery There's an example of the 'knee' plot on page 12 of the drop-seq guide here. It appears that the 'knee' is not automatically calculated. I think the easiest way to do it automatically would be to make a density plot for counts per cell barcode and then find where the second derivative = 0. This would represent the point between the distribution of counts for the real cell barcodes and the errors. |
@Hoohm. I'll dig out the relevant code tomorrow afternoon |
@TomSmithCGAT That would be the easiest way to do it. Although, some people are struggling to have a clear signal and don't get this "bend" on the plot, making it complicated sometimes to define where the limit is. |
@Hoohm I've attached a python function which will take the paired end fastqs from drop-seq and perform the UMI extraction. Github wont let me attach a .py file so I saved it as .txt: This code is modified from the
|
Hello, I just wanted to chime in that I'm also interested in applying UMI Tools to dropseq data. Have you made any progress on this front? I'm using Dropseq Tools from the McCarroll lab (under "Computational resources" here) to process my data. The output of this will be a BAM with the cell barcode in the XC tag, UMI in the XM tag, and gene in the GE tag. If I wanted to fix UMI errors within cell barcodes, I would first need to split my BAM into one BAM per cell barcode and then run a command like this for each BAM: Is this the correct way of doing this? Is there some way to prevent UMIs from different cells being compared against each other so that I do not have to split the BAM file? Additionally, it would be very nice if I was able to use UMI Tools to correct cell barcode errors. For context, we generally use tens of thousands of beads but get millions of unique cell barcodes. The majority of these cell barcodes have only one or two reads in them but there's a significant population with many reads, presumably from PCR errors. Recovering these reads would be very helpful in further analysis. I know @TomSmithCGAT mentioned that he did this with indrop data, would you mind detailing how exactly you did this? |
I haven't put more work into it right now I was busy pushing a new version
of my dropseq pipeline <https://github.com/Hoohm/dropSeqPipe>.
I will add the UMI tools in there once I have tested it against the
correction/fusion method proposed by the McCarroll lab.
I think the dedup method is a good one, although there is one thing that
might get you to loose some cell barcodes in the process. If I'm not
mistaken, the DetectBeadSynthesisErrors is trying to correct for errors in
the cell barcodes, so splitting your data before this step will make you
loose some reads.
I would hope we can make use of UMI-tools to check for cell barcodes also,
but I'm not sure it accounts for production biases as mentioned in the
cookbook:
*In June 2015, we noticed that a recently purchased batch of ChemGenes
beads generated apopulation of cell barcodes (about 1020%) with sequences
that shared the first 11 bases, but differedat the last base. These same
cell barcodes also had a very high percentage of the base “T” at the
lastposition of the UMI. Based on these observations, we concluded that a
percentage of beads in the lothad not undergone all twelve splitandpool
bases (perhaps they had stuck to some piece ofequipment or container, and
the been reintroduced after the missing synthesis cycle). Thus, the20bp
Read 1 contained a mixed base at base 12 (in actuality, the first base of
the UMI) and a fixedTbase at base 20 (in actuality, the first base of the
polyT segment).*
I'll keep you posted when I tested it.
2017-05-03 21:38 GMT+02:00 b-dawes <notifications@github.com>:
… Hello,
I just wanted to chime in that I'm also interested in applying UMI Tools
to dropseq data. Have you made any progress on this front?
I'm using Dropseq Tools from the McCarroll lab (under "Computational
resources" here <http://mccarrolllab.com/dropseq/>) to process my data.
The output of this will be a BAM with the cell barcode in the XC tag, UMI
in the XM tag, and gene in the GE tag.
If I wanted to fix UMI errors within cell barcodes, I would first need to
split my BAM into one BAM per cell barcode and then run a command like this
for each BAM:
umi_tools dedup -I in.bam --extract-umi-method tag --umi-tag XM
--gene-tag=GE -S out.bam
Is this the correct way of doing this? Is there some way to prevent UMIs
from different cells being compared against each other so that I do not
have to split the BAM file?
Additionally, it would be very nice if I was able to use UMI Tools to
correct cell barcode errors. For context, we generally use tens of
thousands of beads but get millions of unique cell barcodes. The majority
of these cell barcodes have only one or two reads in them but there's a
significant population with many reads, presumably from PCR errors.
Recovering these reads would be very helpful in further analysis. I know
@TomSmithCGAT <https://github.com/TomSmithCGAT> mentioned that he did
this with indrop data, would you mind detailing how exactly you did this?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#97 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABNXaOxVBpQrqL_so_RdgUZQmMPslnQsks5r2NfGgaJpZM4MjpmK>
.
--
*Roelli Patrick Division of Animal Physiology and Immunology TUM School of
Life Sciences Weihenstephan Technische Universität München Weihenstephaner
Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment
Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland*
*https://github.com/Hoohm <https://github.com/Hoohm>*
|
As far as I can tell, currently the McCarroll pipeline maps to the genome, while at the moment we only support per_gene deduplicating if reads are mapped to the transcriptome. We are working on ways around this that don't involve a large increase in memory usage (the problem is we never know when we've seen all the reads associated with a gene and so don't know when to do the UMI collapsing. This means we have to remember the entire file and do all the dedupling at the end, which is very memory intensive). At the moment it is also necessary to split the file into one file per cell. We are working on ways around this as this shouldn't be too hard to fix. Look out for all these things in the next major feature (i.e. non-bugfix) release. |
Hi @b-dawes. I just like to echo @IanSudbery's comments. We're actively working on implementing an approach whereby you can retain the reads from all cells in a single BAM and use UMI-tools to both error correct the cell barcode and deduplicate the reads in an error-aware manner. In answer to your question about indrop data, as you'll see from my comment above yours, I had to do this in a more laborious, naive manner which involved knowing the number of cell barcodes beforehand, splitting the reads into one file per cell barcode and then deduplicating. We'll update this issue when we have a better solution |
Hello, I'm looking for potential updates on UMI-tools to work with drop-seq/10x/etc protocols. |
Hi @Hoohm, you can checkout the {ts}_support_cell_barcodes branch, which we give no guarantees on at the moment, but is pretty close to ready and should work as far as we know. We are also working on a tutorial for analysing 10x/drop-seq/inDrop using umi_tools. |
@IanSudbery Cool, I'll look into it then! |
Hey, I'm already coming back with one question.
So, I guess the whitelist is empty for some reason. I'll test it out a bit more, but if you have an idea, I'm listening. |
Ok, same error with 40 000 000 but it works with 20 000 000.
In the meantime I'll use the cell threshold given by a partial run and run it again with it as cell_count. |
Wow. That is a lot of reads!!! I'm guessing its not finding a local minima. @TomSmithCGAT Might now more about this than me, but on those error runs, did it get as far as outputting the distributional plot? (You'll need to have run it with What sort of order of magnitude of cells were you expecting? I suspect we need to catch this and tell users that they need to examine the plots and pick a threshold manually. |
Hi @Hoohm - Thanks for reporting the error. The selection of the local minima is currently a bit dirty - since there are often local minima in the region of very high read/UMI counts per cell, there are some cut-offs to make sure these minima are rejected. What's happened here is that the local minima that were detected (194 & 898) were both rejected. I'm finding that these manual cut-offs are not suitable for across all data sets (and in your case, different numbers of input reads) so we'll will probably need to try and alternative approach. In the mean time, I'll make a patch to the branch so that the counts for the local mins are still output even when the whitelist generation fails. Following @IanSudbery's suggestion, you can then run Alternatively, you can use the whitelist generated using the first 20M reads as is plenty to identify the "true" cell barcodes (you can use the Apologies for the inelegant solutions. |
Just to clarify, the |
Perfect, I have a new run to test it on :) |
Just in case you don't already know, you can use the |
Hi @Hoohm. Just to let you know that v0.5 is now out, including a guide on using UMI-tools for scRNA-Seq: https://github.com/CGATOxford/UMI-tools/blob/master/doc/Single_cell_tutorial.md. I'll close this issue now but feel free to re-open if you have any more questions for your analysis |
Hello,
I'm trying to use your tool to handle our single cell drop-seq data.
The method is using paired reads but the pairing is not on the gene itself, meaning we're not doing paired end sequencing, but rather having the barcode + UMI on R1 and the sequence on the R2.
As an example:
R1:
@HISEQ:190:H5VNVBCXY:1:1101:1380:2019 1:N:0:TAAGGCGA
NCTGACGCAAGGACAGGGAG
+
#<<GGGGIIIIGIIGGIAGG
R2:
@HISEQ:190:H5VNVBCXY:1:1101:1380:2019 2:N:0:TAAGGCGA
ATCTGCTCCTGGAGGTGGTAACAAGGTTCCACAGAAAAAAAGTAAAACTTGATGAAGATGATGAGGACGATGATGAGGACGATGAGGATGATGAGGATGA
+
AAGGGIIGGAGGGG<G.AGGIGIGAAGIIIGGGGGGGGAGIGIGGGIGGGGGGGGGGGGGIIGGGIIIIAGIGG.<G.GGGAGGGGGGGIAGGGAGGGGG
R1 is based on a 12bp of cell Barcode and a 8bp of UMI.
The 12bp Barcode is "random" but constant for one cell.
The 8bp UMI is random.
Would it be possible to use your tool to resolve the cell Barcodes as well as the UMI.
The easiest way would maybe be to extract twice R1 with a paired option so that we append both the BC and the UMI in the name of the read in the R2.
Then we could dedup the aligned file first on the BC and secondly on the UMI.
Please tell me if it seems doable and/or makes sense.
The text was updated successfully, but these errors were encountered: