
Comparison of emptyDrops to emptyDropsCellRanger on Alevin-fry unfiltered output #128

Merged · 5 commits · Sep 8, 2021

Conversation

allyhawkins
Member

In this PR, I am addressing the first point in #105: whether the DropletUtils::emptyDropsCellRanger() function provides an alternative option for filtering empty droplets from the unfiltered output of Alevin-fry run with the --unfiltered-pl option. Previously, we found that DropletUtils::emptyDrops() retained many more cells than Cell Ranger reported for the same samples, and that cells kept by emptyDrops() but filtered out by Cell Ranger pulled down the UMI/cell and genes/cell distributions and medians. DropletUtils::emptyDropsCellRanger() aims to mimic the filtering used by Cell Ranger and STARsolo, so here I compare it to emptyDrops() and look at the number of cells and the distributions of UMI/cell and genes/cell after filtering.

A few things to note for this notebook:

  • The version of DropletUtils I am using is from the cellranger branch: Further edits on the CellRanger-like emptyDrops MarioniLab/DropletUtils#66
  • This version has errors that I found when installing it, so it cannot be installed from GitHub directly. I had to clone the repo, fix documentation errors and a misnamed variable in an internal function, then build and install it locally.
  • Once I was able to run emptyDropsCellRanger(), I didn't see a big difference in the cells being filtered out: the cell numbers and distributions looked very similar to emptyDrops() with default parameters.
  • After doing some more reading, I chose to test increasing the lower parameter of emptyDrops(), which raises the UMI count below which all barcodes are considered empty droplets. These low-count barcodes are also used to build the ambient RNA expression profile, which in turn is used to calculate the likelihood that each barcode is empty. In doing this, I found that lower=200 gave a cell number and distributions of UMI/cell and genes/cell much closer to Cell Ranger than the other filtering methods. I only tested lower=100 (the default) and lower=200.
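A minimal sketch of that comparison, assuming `sce` is a SingleCellExperiment built from the unfiltered Alevin-fry output; the object name and the FDR cutoff of 0.01 are illustrative, not taken from the notebook:

```r
library(DropletUtils)

set.seed(42)  # emptyDrops() computes Monte Carlo p-values, so set a seed

# Empty-droplet calls at the default and raised `lower` thresholds
res_default <- emptyDrops(counts(sce), lower = 100)  # default
res_raised  <- emptyDrops(counts(sce), lower = 200)

# Barcodes retained at FDR <= 0.01 under each threshold
# (barcodes at or below `lower` get NA, hence na.rm = TRUE)
n_default <- sum(res_default$FDR <= 0.01, na.rm = TRUE)
n_raised  <- sum(res_raised$FDR <= 0.01, na.rm = TRUE)
```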

For reviewers, are there any other metrics that would be helpful to see? Or any other modifications of parameters for the filtering methods?

@DongzeHE

Hi @allyhawkins,

I noticed that this PR mentions the emptyDropsCellRanger() function we are working on. To be honest, I would not expect its results to be very different from those of emptyDrops() for single-cell experiments.

The reason we implemented that function is that, for single-nucleus datasets, the knee-finding method of alevin-fry sometimes returns far fewer high-quality barcodes than STARsolo and Cell Ranger. Single-nucleus experiments usually capture fewer reads than single-cell ones, which makes the UMI count rank plot very flat, so the knee-finding method becomes difficult and conservative. The same thing happens when applying emptyDrops() to the unfiltered count matrix. However, it does not affect STARsolo and Cell Ranger much, because they use some hard thresholds (the same ones we expose as arguments to emptyDropsCellRanger()). As there is no standard way to predict high-quality nuclei for single-nucleus datasets, we decided to implement this function so that Alevin-fry can at least get results similar to other tools.
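As an aside for readers, the flat-rank-plot behavior described here can be inspected with DropletUtils::barcodeRanks(); `m` below stands in for the unfiltered count matrix:

```r
library(DropletUtils)

# Barcode rank statistics; on a very flat UMI rank curve the knee and
# inflection points become unstable, which is the conservatism described above
br <- barcodeRanks(m)
metadata(br)$knee        # UMI count at the knee point
metadata(br)$inflection  # UMI count at the inflection point
```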

The version of emptyDropsCellRanger() you used is not the finalized version, but I think the final version will come out soon.

Sincerely,
Alevin-fry team

@jashapiro jashapiro self-requested a review August 30, 2021 13:42
@allyhawkins
Member Author

Hi @DongzeHE, thanks so much for your reply and this insight! We too had noticed that the knee method in Alevin-fry was filtering out many more cells from single-nucleus samples than Cell Ranger would have, which was part of our motivation to use the --unfiltered-pl approach rather than rely on the knee filtering. Here, we are testing this method on both single-cell and single-nucleus samples, and currently I'm seeing the same effect on both of them. We will be on the lookout for the updated version of emptyDropsCellRanger.

I'm curious whether you have found that different hard thresholds are needed for single-nucleus and single-cell samples, since single nuclei are expected to have lower counts?

@jashapiro
Member

It looks from https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md#emptydrop-like-filtering like the Cell Ranger lower bound on UMIs is 500 (umiMin), though it is a bit hard to tell from this exactly how it is applied, or how the various variables interact. In particular, I can imagine that umiMinFracMedian sets a minimum bound based on a fraction of the median UMI count, but which takes priority in the interaction between umiMin and umiMinFracMedian? Is it always the larger or the smaller? I can imagine arguments for either choice!

Either way, it seems to me that a higher threshold is a quite reasonable adjustment to make. I do not think there is likely to be much value in a 500 UMI cell or nucleus. I would be curious to know if EmptyDrops was eliminating most of those cells anyway with the lower (200) threshold.

@DongzeHE

Hi @allyhawkins,

The emptyDropsCellRanger function in the cellranger branch is ready. We still need to improve the description, but the function itself does what it is expected to do.

As for your question, we think the answer is yes, you do need different hard thresholds for single-cell and single-nucleus samples. This is because the default values of those hard thresholds are just heuristics, chosen by 10X and then propagated to STARsolo, so there is no reason to believe they should apply equally well to all experiments. However, we don't know what the best thresholds are, as they may vary from experiment to experiment. If testing different values is a part of your analysis, we will look forward to hearing your conclusion!

As for how the parameters interact, here is the relevant part in STARsolo, and here is how we implement it in R.

Sincerely,
Dongze

@allyhawkins
Member Author

In particular, I can imagine that umiMinFracMedian sets a minimum bound based on a fraction of the median UMI count, but which takes priority in the interaction between umiMin and umiMinFracMedian? Is it always the larger or the smaller? I can imagine arguments for either choice!

Based on going through some of the R code, it looks like the priority is going to be the smaller of the two.

Either way, it seems to me that a higher threshold is a quite reasonable adjustment to make. I do not think there is likely to be much value in a 500 UMI cell or nucleus. I would be curious to know if EmptyDrops was eliminating most of those cells anyway with the lower (200) threshold.

So from what I can tell it looks like the Alevin-fry paper used a lower=500 for emptyDrops and a umiMin=500 for the emptyDropsCellRanger function (which is the default). We could try lower=500 as well, but I tend to agree that I'm sure many of those closer to the lower bound are going to be thrown out regardless. I was trying to be consistent with Cell Ranger without being overly stringent.

@jashapiro
Member

Based on going through some of the R code, it looks like the priority is going to be the smaller of the two.

Based on the code here, I drew the other conclusion: it is the larger of the specified minimum and the value calculated from umi.min.frac.median (which uses the median of the top set of cells, not the whole population, which can lead to other complexity depending on how that top set gets defined!).
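Under that reading, the effective floor would be computed roughly like this (a sketch of the logic with illustrative names, not the actual DropletUtils internals):

```r
# umi.min: the hard minimum (STARsolo's umiMin, default 500)
# umi.min.frac.median: a fraction (e.g. 0.01) of the median UMI count of the "top" cells
# top_umis: UMI counts of the top-ranked barcodes; how "top" is defined matters
effective_min <- max(umi.min, umi.min.frac.median * median(top_umis))
```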

In any case, I would like to see the effect of the cutoff at 500, since that is what is used elsewhere. A higher cutoff may make the calculations a bit more efficient as well, since it will require fewer tests.

@jashapiro
Member

I was trying to be consistent with Cell Ranger without being overly stringent.

I think we can err on the side of consistency with common parameters as long as it seems to make sense. As always, we have the option of telling people to go use the unfiltered data and do their own filtering.

That said, I would also like to see a "bad" example here: do we have some snRNA-seq data with a knee plot that looks wonky, so we can look at what happens in that case? I'd be curious what happens with low count data, and the examples here don't seem to cover that situation very well.

@allyhawkins
Member Author

In any case, I would like to see the effect of the cutoff at 500, since that is what is used elsewhere. A higher cutoff may make the calculations a bit more efficient as well, since it will require fewer tests.

Sounds reasonable, I will go ahead and add in lower=500 to the comparisons.

That said, I would also like to see a "bad" example here: do we have some snRNA-seq data with a knee plot that looks wonky, so we can look at what happens in that case? I'd be curious what happens with low count data, and the examples here don't seem to cover that situation very well.

We do have two examples of bad snRNA-seq data that I can add here. I haven't yet processed them with the alevin-fry cr-like-em resolution since we switched to that and the most recent version of Alevin-fry, but I don't think that will be a confounder here, as the filtering was still done using --unfiltered-pl.

@jaclyn-taroni
Member

I just looked at the HTML file to get a quick idea of these results, and one thing that jumped out at me is that the plots tell me nothing about the samples aside from their identifiers. I can guess which are single-cell and which are single-nuclei, but we don't want folks to have to guess if they happen to look at this!

@allyhawkins
Member Author

@jashapiro I went ahead and updated this PR with the addition of 2 "bad" single-nuclei samples (SCPCR000118 and SCPCR000119), added in testing of emptyDrops with lower=500, and also added a tag for either cell or nuclei to the sample names so that it is present in all of the plots.

One thing to note: in updating this PR, I used the most recent version of emptyDropsCellRanger, which updates the way the ambient profile is calculated in this commit:
MarioniLab/DropletUtils@ebf214d
Because of this, the results have changed, and we see that emptyDropsCellRanger is more similar to Cell Ranger and to emptyDrops with the higher thresholds used for lower.

Member

@jashapiro jashapiro left a comment


I think this is good as an overview of the different settings. I still have a few presentation comments, the biggest being a suggested switch to a bar chart, because I find the jitter very hard to read. (This is one place where a bar chart makes sense to me!)

There also seems to be at least one bug that crept in when making the html file, so I am not sure the results I was looking at are up to date.
The other comment is on the results in general: I suspect the most relevant change in emptyDropsCellRanger is that it dynamically lowers the threshold for defining the ambient cells when the "top" cells have low expression. This is something we could implement outside the function if we wanted to, but I'm not convinced it is worth it. How much value do we place on retaining a bunch of nuclei with UMI counts of 100-200?

Comment on lines 326 to 327
geom_jitter() +
geom_sina(size = 4) +
Member


Do we need jitter and sina here? I'm not sure we want either, but maybe a very small jitter? I found it very hard to read this plot in its current form. This might actually be a place for a bar chart (crazy talk).

Suggested change
geom_jitter() +
geom_sina(size = 4) +
geom_col(position = "dodge") +

Here you can see that `emptyDropsCellRanger()` does appear to give cell numbers more comparable to Cell Ranger than `emptyDrops()`.
Additionally, as expected, using the knee-based method decreases the numbers of cells across the board.
It appears that `emptyDrops()` with `lower=200` or `lower=500` is the most comparable to Cell Ranger for all samples and we see that those cell numbers tend to be fairly similar.
Member


Except for #118, where emptyDrops does do much "worse" than emptyDropsCellRanger in that it calls many fewer cells. Are there other emptyDrops parameters we might be able to adjust to deal with pathological cases like that? I just looked through and can't really find anything. This does suggest we might actually want to think a bit more about emptyDropsCellRanger, but it will depend on how quickly it becomes something other people can use.

I'm not actually convinced that retaining those low UMI count cells is worthwhile, though, so I might not be too upset with the "stricter" emptyDrops filtering. We can be a bit more lenient with lower = 200, but nothing is necessarily going to save a sample like 118...

Member Author


I would agree that retaining the low UMI count cells is not worthwhile, and I'm inclined to go with lower=500 since those cells should be filtered out anyway. I also don't think anything would save a sample like 118; that's a case where it is just a poor sample, so I think we should design for the standard sample that we expect to get.



```{r fig.height=10, fig.width=10}
ggplot(coldata_df, aes(x = tool, y = sum, fill = filtering_method)) +
```
Member


As before, I don't find these plots super helpful: all of the alevin samples are based on the same starting distribution, and the median shifts are mostly because we cut off the lower end of the distribution at different points.

That said, there is something here: emptyDropsCellRanger seems to have a stricter cutoff at times, resulting in some more dramatic shifts (126 & 127, in particular).

Member


I just looked back at these, and now I think they don't make sense at least not for 118. For 118, emptyDropsCellRanger retains many more cells, which I would expect to result in lower UMI counts on average (there just aren't many high UMI cells!)

Maybe the plots in the html file aren't updated? It looks like there was an error in the code, so these plots and others below might be old:
[Screenshot: error shown in the rendered HTML, Sep 3, 2021]

Member Author


I updated the html to make sure they reflect the correct results, but I actually disagree and think these plots are useful because they show where the cutoffs are and what cells are getting thrown out. We want to make sure we are throwing out cells with low counts, which wasn't necessarily the case with emptyDrops lower=100. I think with the higher thresholds, we now see a more similar distribution to Cell Ranger, and we are removing cells that we likely don't want to keep in the first place.

Member


Thanks for updating the figures: they do look more reasonable now. I still don't think I understand why in the case of 118 the median count decreases with lower=500, but I am not too worried about it.

My point with the boxplots is partly just me being ornery about box plots in general. If the distributions are not normal-ish, they tend to present what I think is a misleading picture. In this case there is a lot of wasted ink drawing "outliers" that are identical across most of the comparison distributions, and the upper part of the distributions is irrelevant to the questions we are interested in. What we are really interested in is the bottom of the distribution and where the cutoffs end up functionally, and that is hard to see from these plots. Ultimately, it doesn't matter much for this exploration, but I just want to register my reaction.

```r
percent_filter_overlap)
```
In contrast, only 50% of the barcodes that are output from Alevin-fry filtered by `emptyDrops` with `lower=100` are found in Cell Ranger.
It appears that lowering the threshold of minimum UMI count helps retain a more similar group of cells to Cell Ranger.
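The overlap metric being quoted here could be computed with something like this (the barcode vector names are illustrative, not taken from the notebook):

```r
# Percent of barcodes retained by the Alevin-fry + emptyDrops filtering
# that are also present in the Cell Ranger filtered output
percent_filter_overlap <-
  mean(alevinfry_barcodes %in% cellranger_barcodes) * 100
```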
Member

@jashapiro jashapiro Sep 3, 2021


That was not my interpretation; to me, 200 looked much better than 100, if I am interpreting the text correctly. There is one case (118) where Cell Ranger is retaining many more cells, but as previously discussed, I don't think that data is useful anyway. Otherwise, there is usually better overlap at the 200 level.

Edit: I'm not sure the results I saw are correct, given the error noted here: #128 (comment)

Member Author


This was a typo, I agree that both 200 and 500 looked much better than 100. I fixed the text accordingly.

Member

@jashapiro jashapiro left a comment


This looks ready to be done with, even if I am still working through my emotions around box plots (nope, still don't like them).

The one thing I think this leaves a bit unresolved is a recommended action plan. Based on my read of the results, I would probably go with lower=200, as I might rather be a bit more lenient than less, but I could probably be convinced to go the other way as well! If emptyDropsCellRanger were stable and released, I might go with that, but as it is not, I don't think it is the right choice at this time. We may want to revisit if that changes before we process everything.

@allyhawkins
Member Author

The one thing I think this leaves a bit unresolved is a recommended action plan. Based on my read of the results, I would probably go with lower=200, as I might rather be a bit more lenient than less, but I could probably be convinced to go the other way as well! If emptyDropsCellRanger were stable and released, I might go with that, but as it is not, I don't think it is the right choice at this time. We may want to revisit if that changes before we process everything.

I went back and looked at this again, and it looks like lower=200 usually gives a cell number either equal to Cell Ranger or a little higher, while lower=500 is either equal or a little lower. So it really is a question of how lenient we want to be. My only argument for going with lower=500 is that it is what the Alevin-fry paper uses, and it may be slightly more consistent with emptyDropsCellRanger (which I agree I would pick in general if it were slightly more developed). I think I could really be convinced either way.

@jashapiro
Member

The one thing I think this leaves a bit unresolved is a recommended action plan. Based on my read of the results, I would probably go with lower=200, as I might rather be a bit more lenient than less, but I could probably be convinced to go the other way as well! If emptyDropsCellRanger were stable and released, I might go with that, but as it is not, I don't think it is the right choice at this time. We may want to revisit if that changes before we process everything.

I went back and looked at this again, and it looks like lower=200 usually gives a cell number either equal to Cell Ranger or a little higher, while lower=500 is either equal or a little lower. So it really is a question of how lenient we want to be. My only argument for going with lower=500 is that it is what the Alevin-fry paper uses, and it may be slightly more consistent with emptyDropsCellRanger (which I agree I would pick in general if it were slightly more developed). I think I could really be convinced either way.

I agree. My point is mostly that it would be worth including these overall thoughts in the document, as that is what we will likely go back to first rather than the series of github comments on the PR!

@allyhawkins
Member Author

I agree. My point is mostly that it would be worth including these overall thoughts in the document, as that is what we will likely go back to first rather than the series of github comments on the PR!

Good point! I went ahead and added a section with conclusions, summarizing the points in this discussion so we have it there for reference.
