Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Steps for single cell (drop-seq) protocol #97

Closed
Hoohm opened this issue Mar 21, 2017 · 30 comments
Closed

Steps for single cell (drop-seq) protocol #97

Hoohm opened this issue Mar 21, 2017 · 30 comments

Comments

@Hoohm
Copy link
Contributor

Hoohm commented Mar 21, 2017

Hello,
I'm trying to use your tool to handle our single cell drop-seq data.
The method is using paired reads but the pairing is not on the gene itself, meaning we're not doing paired end sequencing, but rather having the barcode + UMI on R1 and the sequence on the R2.

As an example:
R1:
@HISEQ:190:H5VNVBCXY:1:1101:1380:2019 1:N:0:TAAGGCGA
NCTGACGCAAGGACAGGGAG
+
#<<GGGGIIIIGIIGGIAGG

R2:
@HISEQ:190:H5VNVBCXY:1:1101:1380:2019 2:N:0:TAAGGCGA
ATCTGCTCCTGGAGGTGGTAACAAGGTTCCACAGAAAAAAAGTAAAACTTGATGAAGATGATGAGGACGATGATGAGGACGATGAGGATGATGAGGATGA
+
AAGGGIIGGAGGGG<G.AGGIGIGAAGIIIGGGGGGGGAGIGIGGGIGGGGGGGGGGGGGIIGGGIIIIAGIGG.<G.GGGAGGGGGGGIAGGGAGGGGG

R1 is based on a 12bp of cell Barcode and a 8bp of UMI.
The 12bp Barcode is "random" but constant for one cell.
The 8bp UMI is random.

Would it be possible to use your tool to resolve the cell Barcodes as well as the UMI.
The easiest way would maybe be to extract twice R1 with a paired option so that we append both the BC and the UMI in the name of the read in the R2.
Then we could dedup the aligned file first on the BC and secondly on the UMI.

Please tell me if it seems doable and/or makes sense.

@TomSmithCGAT
Copy link
Member

Hi @Hoohm. Arguably, you could use the a single barcode (cell + UMI) for deduplication as this would correct for errors in both the cell barcode and UMI simultaneously. However, unless I'm mistaken, the two barcodes are subtly different. Whilst both barcodes are random, the cell barcode should be one of n discrete sequence, where n is the number of cells sequenced. Due to sequencing errors (and possibly multiple cell barcodes in a single droplet for drop-seq?), the number of cell barcodes observed is > n. We would therefore like to correct the cell barcode against a whitelist of the real n barcodes. This is different to the UMIs where we don't have a whitelist of allowed sequences and hence we use networks to distinguish real UMIs from erroneous ones.

Is this the case with your drop-seq data too?

@IanSudbery
Copy link
Member

I wasn't aware that drop-seq had white listed cell barcodes?

I would be templated to extract the full 20bp as a single barcode/UMI, perform the deduplication on this, and then split the barcode into cell barcode and umi. The only thing you might want to watch out for is that at 20bp, you might want up the edit_distance threshold. What do you think @TomSmithCGAT?

In theory, you could do the two separately, but what you'd need to do is this:

  1. Deduplicate on gene + umi (method = unique) + cell barcode (method = directional)
  2. Deduplicate the result of 1 on gene + umi (method= directional) + cell barcode(method = unique)

Currently we don't provide mechanisms to do this. (not saying its impossible, but its not in the current version, and would have to go on a to do list).

@Hoohm
Copy link
Contributor Author

Hoohm commented Mar 21, 2017 via email

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Mar 21, 2017

Hi @Hoohm. The knee-plot approach sounds identical to the method used in the Klein et al inDrop paper which came out around the same time. When I re-analysed the inDrop data, I first made a cell barcode whitelist, then I parsed the raw data and split the reads into single end fastqs for each whitelisted cell barcode, discarding reads which did not match the whitelist. At the same time I corrected cell barcodes which were one mismatch away from a single whitelisted cell barcode, and extracted the UMIs and appended this to the read name. I think this approach would be the best way to go here too.

This is a relatively bespoke extraction procedure since it involves simulatenous cell barcode extraction and correction, raw data splitting and UMI extraction, hence it is not supported by UMI-tools extract. I'm happy to share my code for the above if you would like?

@TomSmithCGAT
Copy link
Member

@IanSudbery Should we perhaps consider supporting a drop-seq/inDrop UMI extraction using the method above. We could leave it up to the user to generate a cell barcode white list as they wish. It wouldn't take much effort to add this to UMI-tools but my concern is how stable the drop-seq/inDrop methods are? However, presumably any droplet based method will have a similar issue with the cell barcodes?

@IanSudbery
Copy link
Member

@TomSmithCGAT In the long run, I think we should provide the ability to pass dedup a tag that contains a cell barcode and this should be added to the bundle key so that splitting into separate files wasn't necessary.

I'm also interested in whether our methods could act as a replacement for the knee method. Either by deduping on the whole barcode (cell + UMI) or the two step porcess I outlined above.

@TomSmithCGAT
Copy link
Member

@IanSudbery Good idea. How would you see this working generically with any aligner? We could fall back on the method of appending the cell barcode to the read id, i.e:
@HISEQ:190:H5VNVBCXY:1:1101:1380:2019_AGACTAGGAG;AGCGAGCG
where the barcodes are ;-separeted. If no cell barcodes are extracted, the cell barcode portion of the bundle key would be empty.

My guess is that the knee method should be near optimal but it would definitely be interesting to compare the results. One of the advantages of the knee method is that it guarantees the cell barcodes are the n most abundant even if this includes cell barcodes which are a single edit distance apart. This might not necessarily be the most accurate answer but it's perhaps more appealing than merging together two highly abundant cell barcodes which are 1 edit distance apart whilst retaining very lowly abundant cell barcode, as could occur with a network-based method. Merging together two highly abundant cell barcodes incorrectly could be lead to apparently interesting but ultimately erroneous findings from one's scRNA-Seq - I'm thinking 'dual-state cell' type findings arising from cell doublets.

An issue we would currently have with the network-based method is that the 'true' cell barcodes should be the same across the whole data set and we're currently treating each positions independently. We could solve this with a two pass method, e.g:

Pass 1: Tally counts for all cell barcodes across all reads, form networks and resolve to identify the n 'true' barcodes across sample, and crucially, the barcodes which are in the same cell group.
Pass 2: Deduplicate per position, per cell group

For pass 1 I'd suggest the directional method, although the threshold may need to be stricter and/or the edit distance increased? Whichever method we use, we may need to also introduce a minimal counts threshold for a 'true' barcode to be accepted to avoid issues with very low abundance cell barcodes in a single node network.

@Hoohm
Copy link
Contributor Author

Hoohm commented Mar 21, 2017 via email

@IanSudbery
Copy link
Member

IanSudbery commented Mar 21, 2017

I feel like the first step would be to allow dedup to do the per position, per cell group deduplication based on a user specified BAM tag. This step would be useful for pretty much any multiplexed single cell sequencing technique I reckon.

The next step would be to implement methods for getting that tag onto the BAM record. I think the would have to be some sort of extra field on the fastq record. Could be:
@HISEQ:190:H5VNVBCXY:1:1101:1380:2019_<CELL_BC>_<UMI>

Thus the reads would still be compatible the current format where split("_")[-1] is taken to be the UMI.

Finally some method for collapsing cell barcodes. Is the "knee" for the "knee" automatically calculated, or done by hand? I suspect that this would be a separate tool. The "knee" method sounds remarkably similar to the "percentile" method for umi deduplicating (except applied across the whole sample rather than just at one position).

@TomSmithCGAT
Copy link
Member

@IanSudbery There's an example of the 'knee' plot on page 12 of the drop-seq guide here. It appears that the 'knee' is not automatically calculated. I think the easiest way to do it automatically would be to make a density plot for counts per cell barcode and then find where the second derivative = 0. This would represent the point between the distribution of counts for the real cell barcodes and the errors.

@TomSmithCGAT
Copy link
Member

@Hoohm. I'll dig out the relevant code tomorrow afternoon

@Hoohm
Copy link
Contributor Author

Hoohm commented Mar 21, 2017

@TomSmithCGAT That would be the easiest way to do it. Although, some people are struggling to have a clear signal and don't get this "bend" on the plot, making it complicated sometimes to define where the limit is.
Automatic is fine, but there should be a possibility to decide manually.

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Mar 22, 2017

@Hoohm I've attached a python function which will take the paired end fastqs from drop-seq and perform the UMI extraction. Github wont let me attach a .py file so I saved it as .txt:
parse_dropseq.txt

This code is modified from the extractUMIsAndFilterFastqGSE65525 function in PipelineScRNASeq.py which I used to re-analyse the inDrop data for the UMI-tools publication.

  • The code here is simpler as inDrop has an unusual arrangement where the cell barcode is not a consistent length and can be one of 384^2 possible sequences rather than being entirely random. Is the cell barcode for drop-seq completely random, i.e are all 4^12 = 16777216 cell barcodes possible?
  • I now also remember than I didn't error correct the cell barcodes either since the run time to identify barcodes which could only be corrected to a single barcode was very long for only a minimal gain in rescued reads. You may of course want to add this back in?
  • This function assumes you know the number of cell barcodes beforehand (as I did with inDrop since the data had already been analysed in the Klein et al paper). You could split this function at the point at which the cell barcodes are written out, make the 'knee' plot to determine the number of barcodes and then run the second half of the function
  • The function utilises the CGAT.IOTools module (https://github.com/CGATOxford/cgat) in order to write out to a multiple fastqs at once. If we follow Ian's suggestion above and append both the UMI barcode and cell barcode to the read name this wouldn't be necessary as we could dedup all the reads together from a single BAM. This won't happen immediately though so you may be best of splitting the reads by cell barcode for now in order to make the downstream BAMs compatible with UMI-tools.

@b-dawes
Copy link

b-dawes commented May 3, 2017

Hello,

I just wanted to chime in that I'm also interested in applying UMI Tools to dropseq data. Have you made any progress on this front?

I'm using Dropseq Tools from the McCarroll lab (under "Computational resources" here) to process my data. The output of this will be a BAM with the cell barcode in the XC tag, UMI in the XM tag, and gene in the GE tag.

If I wanted to fix UMI errors within cell barcodes, I would first need to split my BAM into one BAM per cell barcode and then run a command like this for each BAM:
umi_tools dedup -I in.bam --extract-umi-method tag --umi-tag XM --gene-tag=GE -S out.bam

Is this the correct way of doing this? Is there some way to prevent UMIs from different cells being compared against each other so that I do not have to split the BAM file?

Additionally, it would be very nice if I was able to use UMI Tools to correct cell barcode errors. For context, we generally use tens of thousands of beads but get millions of unique cell barcodes. The majority of these cell barcodes have only one or two reads in them but there's a significant population with many reads, presumably from PCR errors. Recovering these reads would be very helpful in further analysis. I know @TomSmithCGAT mentioned that he did this with indrop data, would you mind detailing how exactly you did this?

@Hoohm
Copy link
Contributor Author

Hoohm commented May 4, 2017 via email

@IanSudbery
Copy link
Member

As far as I can tell, currently the McCarroll pipeline maps to the genome, while at the moment we only support per_gene deduplicating if reads are mapped to the transcriptome. We are working on ways around this that don't involve a large increase in memory usage (the problem is we never know when we've seen all the reads associated with a gene and so don't know when to do the UMI collapsing. This means we have to remember the entire file and do all the dedupling at the end, which is very memory intensive).

At the moment it is also necessary to split the file into one file per cell. We are working on ways around this as this shouldn't be too hard to fix.

Look out for all these things in the next major feature (i.e. non-bugfix) release.

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented May 4, 2017

Hi @b-dawes. I just like to echo @IanSudbery's comments. We're actively working on implementing an approach whereby you can retain the reads from all cells in a single BAM and use UMI-tools to both error correct the cell barcode and deduplicate the reads in an error-aware manner.

In answer to your question about indrop data, as you'll see from my comment above yours, I had to do this in a more laborious, naive manner which involved knowing the number of cell barcodes beforehand, splitting the reads into one file per cell barcode and then deduplicating.

We'll update this issue when we have a better solution

@Hoohm
Copy link
Contributor Author

Hoohm commented Aug 11, 2017

Hello,

I'm looking for potential updates on UMI-tools to work with drop-seq/10x/etc protocols.
I would be keen on trying any new implementation you are still testing out.

@IanSudbery
Copy link
Member

Hi @Hoohm, you can checkout the {ts}_support_cell_barcodes branch, which we give no guarantees on at the moment, but is pretty close to ready and should work as far as we know.

We are also working on a tutorial for analysing 10x/drop-seq/inDrop using umi_tools.

@Hoohm
Copy link
Contributor Author

Hoohm commented Aug 11, 2017

@IanSudbery Cool, I'll look into it then!

@Hoohm
Copy link
Contributor Author

Hoohm commented Aug 11, 2017

Hey, I'm already coming back with one question.
If I run a partial run (first 400 000) lines, everything works fine.
If I run the whole file (~100'000'000 reads) I get this error:

TypeError: 'NoneType' object is not iterable


umi_methods.py", line 330, in getErrorCorrectMapping
    whitelist = set([str(x).encode("utf-8") for x in whitelist]

So, I guess the whitelist is empty for some reason. I'll test it out a bit more, but if you have an idea, I'm listening.

@Hoohm
Copy link
Contributor Author

Hoohm commented Aug 11, 2017

Ok, same error with 40 000 000 but it works with 20 000 000.
The local_min is not found for some reason and it returns None, hence it crashes.
Here are a few values I printed out while testing:

  • threshold = 134.29
  • local_mins = [194,898]

In the meantime I'll use the cell threshold given by a partial run and run it again with it as cell_count.

@IanSudbery
Copy link
Member

Wow. That is a lot of reads!!!

I'm guessing its not finding a local minima. @TomSmithCGAT Might now more about this than me, but on those error runs, did it get as far as outputting the distributional plot? (You'll need to have run it with --plot-prefix).

What sort of order of magnitude of cells were you expecting?

I suspect we need to catch this and tell users that they need to examine the plots and pick a threshold manually.

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Aug 15, 2017

Hi @Hoohm - Thanks for reporting the error.

The selection of the local minima is currently a bit dirty - since there are often local minima in the region of very high read/UMI counts per cell, there are some cut-offs to make sure these minima are rejected. What's happened here is that the local minima that were detected (194 & 898) were both rejected.

I'm finding that these manual cut-offs are not suitable for across all data sets (and in your case, different numbers of input reads) so we'll will probably need to try and alternative approach. In the mean time, I'll make a patch to the branch so that the counts for the local mins are still output even when the whitelist generation fails. Following @IanSudbery's suggestion, you can then run whitelist with --plot-prefix to obtain the counts and then re-run whitelist with the --set-cell-number option to manually set the number of accepted cell barcodes.

Alternatively, you can use the whitelist generated using the first 20M reads as is plenty to identify the "true" cell barcodes (you can use the --subset-reads option if you aren't doing so already). I'd definitely recommend visually assessing the plots to check you are happy that the number of cell barcodes detected is reasonable though?

Apologies for the inelegant solutions.

@TomSmithCGAT
Copy link
Member

Just to clarify, the local_min values relate to the x-axis on the density graph, not the # cell barcodes or # reads/UMIs per barcode - the values are meaningless by themselves without reference to the distribution of counts per cell barcode. threshold is the minimum number of reads/UMIs per cell barcode for it to be included in the density plot.

@TomSmithCGAT
Copy link
Member

@Hoohm - If you want to run whitelist and inspect the plots this is now possible even where the selection of the local minima fails (143f1ac). The cell-barcodes branch has now also been merged into the master branch so al the single cell functionalities are available on the master branch.

@Hoohm
Copy link
Contributor Author

Hoohm commented Aug 15, 2017

Perfect, I have a new run to test it on :)
@TomSmithCGAT I think the idea of running the sample once with a subset of reads to create an estimated number is great. I'll do that and might add a 5% on top just to be sure. I'll filter them out later on downstream analysis.
@IanSudbery If I remember correctly, the plot didn't work either on the whole run. We expected from 4 to 5k cells.

@IanSudbery
Copy link
Member

Just in case you don't already know, you can use the --subset-reads option to use only a sub-sample of reads to generate the whitelist.

@TomSmithCGAT
Copy link
Member

@Hoohm - Until 143f1ac the plotting would have failed whenever the whitelist generation failed. That shouldn't be the case any more

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Aug 21, 2017

Hi @Hoohm. Just to let you know that v0.5 is now out, including a guide on using UMI-tools for scRNA-Seq: https://github.com/CGATOxford/UMI-tools/blob/master/doc/Single_cell_tutorial.md. I'll close this issue now but feel free to re-open if you have any more questions for your analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants