Comparison of Alevin-fry to Cellranger for 5' libraries #137

allyhawkins · 2021-09-14T19:59:28Z

Closes #133. This PR adds a comparison of using Cell Ranger to Alevin-fry for pre processing 5' libraries. Here, I am analyzing 2 10Xv2_5prime samples, SCPCR000286 and SCPCR000287, that have been processed with Cell Ranger and Alevin-fry. For Cell Ranger, the orientation is automatically detected based on how many reads are in the antisense direction (this was confirmed for these samples by looking at the web_summary.html). For Alevin-fry, I am using the same configurations that we are using for 3', selective alignment with the splici index, unfiltered permit list, and cr-like-em resolution, but with the use of the ISF flag for library type, indicating the orientation of the libary.

For this analysis, I generated the per cell and per gene metrics using the benchmarking_generate_qc_df.R script which creates the quant_info.tsv, coldata_df.tsv, and rowdata_df.tsv used for inputs to the analysis outlined in this notebook. In running that script again, I noticed an error in use of scpcaTools::import_quant_data() and had to add in the mtx_format option (this is described in AlexsLemonade/scpcaTools#46). This function is utilized in the make_sce_list.R benchmarking function and in order to successfully run using mapply I had to add in that argument.

After generating the necessary qc dataframes needed for plotting, I went ahead and restricted the analysis to only cells that were identified in both tools. I then compared the following:

mitochondrial content/cell
UMI/cell
genes detected/cell
mean gene expression
overlap of genes detected across the tools

Interestingly, these results are not as good as we have seen previously and the alevin-fry quantifications are showing much lower UMI/cell and genes detected/cell than with Cell Ranger. Also the correlation coefficients of mean gene expression are in the range of 0.5-0.6, whereas with the 3' data we were seeing coefficients above 0.98. I double checked the log files to ensure that both of these were run with the correct configurations using the ISF library type with Alevin-fry, but it appears that these are not as consistent as we would have hoped.

One thought I had is that the 5' data doesn't do as well with the splici index or the other configurations that we are using so we can try out other options with Alevin-fry, but I am open to other suggestions or ideas.

jashapiro · 2021-09-14T21:34:37Z

Ugh, these do not look good. I suspect that just switching to ISR may not be sufficient.

I'm worried that alevin may be reverse complementing the cell barcodes, such that they would no longer match the provided list. This wouldn't have been a problem until the introduction of the unfiltered mode, as cell barcodes were only checked within libraries, so RC everything would be fine.

One test would be to run alevin in knee mode, which would allow it to choose its own barcodes. Then we could check if those barcodes match reverse complements of the CR barcodes.

If that works, we could provide a reverse complemented barcode list for 5' samples, which we would then have to reverse after import. Which isn't the worst thing in the world, but does complicate matters.

It is also possible that the ability to specify read geometries added in salmon 1.4 might give us a solution, if we are allowed to use reversed start and finishes. So we might be able to specify
--read-geometry 2[end-1] --bc-geometry 1[1-16] --umi-geometry 1[17-26]
rather than the default
--read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]

Not sure if that is supported, but we could ask. (I think it would be worth diving a bit deeper to see if we can find if this is the problem before we did that though.)

I'm also looking a bit at https://salmon.readthedocs.io/en/latest/library_type.html, and wondering if MSF might actually be the read format we want for 10xv2 5prime? I'm probably getting myself confused. Sort of curious what alevin would choose if given auto as an option. We could also try IU for no strand preference on the 5' read?

rob-p · 2021-09-16T03:49:16Z

So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the generate-permit-list step. So, the filtering by read orientation should happen in the fry step.

@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?

rob-p · 2021-09-16T03:53:12Z

@allyhawkins & @jashapiro : one more thought I had — what does it look like if you use —sketch mode here instead. While selective alignment provides higher precision, it does not tolerate a lot of divergence (at least with the default parameters in alevin). If the 5’ reads are often softclipped, this could lead to a lower alignment rate in this data using selective alignment.

jashapiro · 2021-09-16T12:35:23Z

@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?

Yeah, that seems to have been my mistake. I was thinking that if the mapping was strand-specific, then cell barcode mapping might be too. Some of our comparisons between mappers start by filtering to shared barcodes, so if that was the problem we could end up filtering out many of the "real" cells. But it doesn't look like that was the issue.

So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the generate-permit-list step. So, the filtering by read orientation should happen in the fry step.

Ah! We had missed (or forgotten about) this! We didn't change the orientation!
@allyhawkins: we need to update this line to either rc or both

alsf-scpca/workflows/alevin-quant/run-alevin-fry.nf

Line 112 in 2011c30

--expected-ori fw \

allyhawkins · 2021-09-16T21:40:45Z

@rob-p Thank you so much for the help on this one! Changing from using the fw option to rc for the read orientation while leaving the library type alone seemed to do the trick!

@jashapiro I went ahead and updated the notebook to include the samples run using rc and have removed the knee filtering. In doing so, I did choose to keep the comparisons where I looked at all cells without restricting to shared cells to show how similar the distributions are even before restricting to shared cells. The only caveat here is that it looks like Alevin-fry may be reporting some of the counts and number of genes/cell a bit higher than Cell Ranger, but that seems to decrease a little bit when restricting the analysis to only genes found in > 5% of cells. I am not really that concerned about these differences and feel pretty confident that we should be able to implement use of this workflow for our 5' samples.

jashapiro

This looks good.

I think it may need a rerender, but the only other major thing I noticed was the fact that the mito % was so much higher for the alevin-fry processing. Was that a pattern we saw before? I looked at 04-cell-level-benchmarking-metrics.nb.html, and it doesn't seem so: we saw higher mito percent for sketch mode with nuclei, but not for cells, and not for SA mode. So I would file that under mild curiosity, but nothing that should prevent us from proceeding.

jashapiro · 2021-09-17T17:26:35Z

analysis/quantifier-comparisons/11-five-prime-benchmarking.Rmd

+In order to see if there are any differences between Alevin-fry and Cell Ranger in the ability to assign reads to each cell barcode, we first are looking at all cells before restricting to shared cells only. 
+
+```{r}
+# mito content in shared cells


Is it worth commenting about that the mito percent is so much higher for alevin-fry? My first thought was that this might be a pre-filtering artifact, but it seems to hold post-filtereing too.

@jashapiro, one quick thought here — not sure if it's worth digging into or not. CellRanger is mapping against their custom reference, and so it has the entire mt chromosome (including genes on there that may not be part of the standard annotation). In this case, reads will map to the best location, which may be an unannotated (or at least not part of the restricted annotation) gene on the mt chromosome. In that case, it won't get counted toward an annotated mt gene and won't show up in the QC.

On the other hand, alevin-fry knowns only about the mt transcripts that are explicitly included in the reference. I know for a fact that there are mito transcripts missing from the standard annotations (like MT-RNR2 and MT-RNR1). Possibly then, reads coming from similar mt transcripts could be mapped instead to some annotated mt transcript and counted as mt expression. In this case, the STAR/CellRanger approach would actually have similar mito expression, but it just wouldn't show up in the count matrix since the mt chromosome is essentially acting as a "decoy" for these reads (because the transcripts of origin aren't annotated in the count matrix). It might be worth looking in the CellRanger BAM and what kind of mapping rate you get to the mt chromosome, and does that signify more reads arising from there than are accounted for in the reported gene counts? This may not be what is going on of course, but this is a thought I've had on this a few times before.

Thanks for the clear explanation @rob-p. I think you bring up an interesting point and potentially something to look into. Mitochondrial genes also tend to have a lot of homology as well correct? I wonder if this is something that is more commonly seen in using the cr-like-em resolution rather than just the cr-like resolution and that some of the increased detection is due to multimapped reads not being thrown away here.

jashapiro · 2021-09-17T17:31:03Z

analysis/quantifier-comparisons/11-five-prime-benchmarking.Rmd

+ggplot(qc_common_threshold, aes(x = tool, y = detected, fill = tool)) + 
+  geom_boxplot() + 
+  facet_grid(~ sample) +
+  theme_classic() + 
+  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
+  ylab("Genes detected/cell") + 
+  xlab("") + 
+  ylim(c(0,1000))


The fact that the genes per cell drops so precipitously here is interesting, but I guess that probably suggests the data isn't particularly saturated?

jashapiro · 2021-09-17T17:32:52Z

analysis/quantifier-comparisons/11-five-prime-benchmarking.Rmd

+ggplot(qc_common_threshold, aes(x = tool, y = sum, fill = tool)) + 
+  geom_boxplot() + 
+  facet_grid(~ sample) +
+  theme_classic() + 
+  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
+  ylab("UMI/cell") + 
+  xlab("") + 
+  ylim(c(0,30000))


Noting that this chunk and the one below do not appear in the rendered output (the plots do). Probably needs a fresh rerender?

Sorry about that. I fixed the html file so these chunks should now be present.

jashapiro

LGTM. I think we can leave the mitochondrial percentages as a bit of a mystery for now...

allyhawkins added 3 commits September 14, 2021 15:32

modify import function based on scpcaTools changes

4ce5a8a

add comparison of 5' preprocessing methods

d041a4d

fix typo

bffd7b5

allyhawkins requested a review from jashapiro September 14, 2021 20:00

allyhawkins added 2 commits September 15, 2021 15:02

fix filter error in quant_info

e354ea1

add knee comparison

d7985a6

jashapiro mentioned this pull request Sep 16, 2021

add orientation switch to af permit list step #138

Merged

re-analyzed using rc option

c71ed19

jashapiro reviewed Sep 17, 2021

View reviewed changes

fix html file

3fc82cc

allyhawkins requested a review from jashapiro September 27, 2021 13:48

jashapiro approved these changes Sep 27, 2021

View reviewed changes

allyhawkins merged commit ded706a into main Sep 27, 2021

jaclyn-taroni mentioned this pull request Sep 27, 2021

Support 5' libraries in production AlexsLemonade/scpca-nf#39

Closed

jeremymsimon mentioned this pull request Aug 2, 2023

Update documentation to include recommended processing for 10x scRNA 5' V2 COMBINE-lab/alevin-fry#118

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparison of Alevin-fry to Cellranger for 5' libraries #137

Comparison of Alevin-fry to Cellranger for 5' libraries #137

allyhawkins commented Sep 14, 2021

jashapiro commented Sep 14, 2021 •

edited

rob-p commented Sep 16, 2021

rob-p commented Sep 16, 2021

jashapiro commented Sep 16, 2021

allyhawkins commented Sep 16, 2021

jashapiro left a comment

jashapiro Sep 17, 2021

rob-p Sep 17, 2021 •

edited

allyhawkins Sep 20, 2021

jashapiro Sep 17, 2021

jashapiro Sep 17, 2021

allyhawkins Sep 20, 2021

jashapiro left a comment

Comparison of Alevin-fry to Cellranger for 5' libraries #137

Comparison of Alevin-fry to Cellranger for 5' libraries #137

Conversation

allyhawkins commented Sep 14, 2021

jashapiro commented Sep 14, 2021 • edited

rob-p commented Sep 16, 2021

rob-p commented Sep 16, 2021

jashapiro commented Sep 16, 2021

allyhawkins commented Sep 16, 2021

jashapiro left a comment

Choose a reason for hiding this comment

jashapiro Sep 17, 2021

Choose a reason for hiding this comment

rob-p Sep 17, 2021 • edited

Choose a reason for hiding this comment

allyhawkins Sep 20, 2021

Choose a reason for hiding this comment

jashapiro Sep 17, 2021

Choose a reason for hiding this comment

jashapiro Sep 17, 2021

Choose a reason for hiding this comment

allyhawkins Sep 20, 2021

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment

jashapiro commented Sep 14, 2021 •

edited

rob-p Sep 17, 2021 •

edited