-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comparison of Alevin-fry to Cellranger for 5' libraries #137
Conversation
Ugh, these do not look good. I suspect that just switching to I'm worried that alevin may be reverse complementing the cell barcodes, such that they would no longer match the provided list. This wouldn't have been a problem until the introduction of the unfiltered mode, as cell barcodes were only checked within libraries, so RC everything would be fine. One test would be to run alevin in knee mode, which would allow it to choose its own barcodes. Then we could check if those barcodes match reverse complements of the CR barcodes. If that works, we could provide a reverse complemented barcode list for 5' samples, which we would then have to reverse after import. Which isn't the worst thing in the world, but does complicate matters. It is also possible that the ability to specify read geometries added in salmon 1.4 might give us a solution, if we are allowed to use reversed start and finishes. So we might be able to specify Not sure if that is supported, but we could ask. (I think it would be worth diving a bit deeper to see if we can find if this is the problem before we did that though.) I'm also looking a bit at https://salmon.readthedocs.io/en/latest/library_type.html, and wondering if MSF might actually be the read format we want for 10xv2 5prime? I'm probably getting myself confused. Sort of curious what alevin would choose if given auto as an option. We could also try |
So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the @jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case? |
@allyhawkins & @jashapiro : one more thought I had — what does it look like if you use |
Yeah, that seems to have been my mistake. I was thinking that if the mapping was strand-specific, then cell barcode mapping might be too. Some of our comparisons between mappers start by filtering to shared barcodes, so if that was the problem we could end up filtering out many of the "real" cells. But it doesn't look like that was the issue.
Ah! We had missed (or forgotten about) this! We didn't change the orientation!
|
@rob-p Thank you so much for the help on this one! Changing from using the @jashapiro I went ahead and updated the notebook to include the samples run using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good.
I think it may need a rerender, but the only other major thing I noticed was the fact that the mito % was so much higher for the alevin-fry processing. Was that a pattern we saw before? I looked at 04-cell-level-benchmarking-metrics.nb.html, and it doesn't seem so: we saw higher mito percent for sketch mode with nuclei, but not for cells, and not for SA mode. So I would file that under mild curiosity, but nothing that should prevent us from proceeding.
In order to see if there are any differences between Alevin-fry and Cell Ranger in the ability to assign reads to each cell barcode, we first are looking at all cells before restricting to shared cells only. | ||
|
||
```{r} | ||
# mito content in shared cells |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth commenting about that the mito percent is so much higher for alevin-fry? My first thought was that this might be a pre-filtering artifact, but it seems to hold post-filtereing too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jashapiro, one quick thought here — not sure if it's worth digging into or not. CellRanger is mapping against their custom reference, and so it has the entire mt chromosome (including genes on there that may not be part of the standard annotation). In this case, reads will map to the best location, which may be an unannotated (or at least not part of the restricted annotation) gene on the mt chromosome. In that case, it won't get counted toward an annotated mt gene and won't show up in the QC.
On the other hand, alevin-fry knowns only about the mt transcripts that are explicitly included in the reference. I know for a fact that there are mito transcripts missing from the standard annotations (like MT-RNR2 and MT-RNR1). Possibly then, reads coming from similar mt transcripts could be mapped instead to some annotated mt transcript and counted as mt expression. In this case, the STAR/CellRanger approach would actually have similar mito expression, but it just wouldn't show up in the count matrix since the mt chromosome is essentially acting as a "decoy" for these reads (because the transcripts of origin aren't annotated in the count matrix). It might be worth looking in the CellRanger BAM and what kind of mapping rate you get to the mt chromosome, and does that signify more reads arising from there than are accounted for in the reported gene counts? This may not be what is going on of course, but this is a thought I've had on this a few times before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clear explanation @rob-p. I think you bring up an interesting point and potentially something to look into. Mitochondrial genes also tend to have a lot of homology as well correct? I wonder if this is something that is more commonly seen in using the cr-like-em
resolution rather than just the cr-like
resolution and that some of the increased detection is due to multimapped reads not being thrown away here.
ggplot(qc_common_threshold, aes(x = tool, y = detected, fill = tool)) + | ||
geom_boxplot() + | ||
facet_grid(~ sample) + | ||
theme_classic() + | ||
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + | ||
ylab("Genes detected/cell") + | ||
xlab("") + | ||
ylim(c(0,1000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that the genes per cell drops so precipitously here is interesting, but I guess that probably suggests the data isn't particularly saturated?
ggplot(qc_common_threshold, aes(x = tool, y = sum, fill = tool)) + | ||
geom_boxplot() + | ||
facet_grid(~ sample) + | ||
theme_classic() + | ||
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + | ||
ylab("UMI/cell") + | ||
xlab("") + | ||
ylim(c(0,30000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that this chunk and the one below do not appear in the rendered output (the plots do). Probably needs a fresh rerender?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. I fixed the html file so these chunks should now be present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I think we can leave the mitochondrial percentages as a bit of a mystery for now...
Closes #133. This PR adds a comparison of using Cell Ranger to Alevin-fry for pre processing 5' libraries. Here, I am analyzing 2 10Xv2_5prime samples, SCPCR000286 and SCPCR000287, that have been processed with Cell Ranger and Alevin-fry. For Cell Ranger, the orientation is automatically detected based on how many reads are in the antisense direction (this was confirmed for these samples by looking at the
web_summary.html
). For Alevin-fry, I am using the same configurations that we are using for 3', selective alignment with the splici index, unfiltered permit list, and cr-like-em resolution, but with the use of theISF
flag for library type, indicating the orientation of the libary.For this analysis, I generated the per cell and per gene metrics using the
benchmarking_generate_qc_df.R
script which creates thequant_info.tsv
,coldata_df.tsv
, androwdata_df.tsv
used for inputs to the analysis outlined in this notebook. In running that script again, I noticed an error in use ofscpcaTools::import_quant_data()
and had to add in themtx_format
option (this is described in AlexsLemonade/scpcaTools#46). This function is utilized in themake_sce_list.R
benchmarking function and in order to successfully run usingmapply
I had to add in that argument.After generating the necessary qc dataframes needed for plotting, I went ahead and restricted the analysis to only cells that were identified in both tools. I then compared the following:
Interestingly, these results are not as good as we have seen previously and the alevin-fry quantifications are showing much lower UMI/cell and genes detected/cell than with Cell Ranger. Also the correlation coefficients of mean gene expression are in the range of 0.5-0.6, whereas with the 3' data we were seeing coefficients above 0.98. I double checked the log files to ensure that both of these were run with the correct configurations using the
ISF
library type with Alevin-fry, but it appears that these are not as consistent as we would have hoped.One thought I had is that the 5' data doesn't do as well with the splici index or the other configurations that we are using so we can try out other options with Alevin-fry, but I am open to other suggestions or ideas.