Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison of Alevin-fry to Cellranger for 5' libraries #137

Merged
merged 7 commits into from Sep 27, 2021

Conversation

allyhawkins
Copy link
Member

Closes #133. This PR adds a comparison of using Cell Ranger to Alevin-fry for pre processing 5' libraries. Here, I am analyzing 2 10Xv2_5prime samples, SCPCR000286 and SCPCR000287, that have been processed with Cell Ranger and Alevin-fry. For Cell Ranger, the orientation is automatically detected based on how many reads are in the antisense direction (this was confirmed for these samples by looking at the web_summary.html). For Alevin-fry, I am using the same configurations that we are using for 3', selective alignment with the splici index, unfiltered permit list, and cr-like-em resolution, but with the use of the ISF flag for library type, indicating the orientation of the libary.

For this analysis, I generated the per cell and per gene metrics using the benchmarking_generate_qc_df.R script which creates the quant_info.tsv, coldata_df.tsv, and rowdata_df.tsv used for inputs to the analysis outlined in this notebook. In running that script again, I noticed an error in use of scpcaTools::import_quant_data() and had to add in the mtx_format option (this is described in AlexsLemonade/scpcaTools#46). This function is utilized in the make_sce_list.R benchmarking function and in order to successfully run using mapply I had to add in that argument.

After generating the necessary qc dataframes needed for plotting, I went ahead and restricted the analysis to only cells that were identified in both tools. I then compared the following:

  • mitochondrial content/cell
  • UMI/cell
  • genes detected/cell
  • mean gene expression
  • overlap of genes detected across the tools

Interestingly, these results are not as good as we have seen previously and the alevin-fry quantifications are showing much lower UMI/cell and genes detected/cell than with Cell Ranger. Also the correlation coefficients of mean gene expression are in the range of 0.5-0.6, whereas with the 3' data we were seeing coefficients above 0.98. I double checked the log files to ensure that both of these were run with the correct configurations using the ISF library type with Alevin-fry, but it appears that these are not as consistent as we would have hoped.

One thought I had is that the 5' data doesn't do as well with the splici index or the other configurations that we are using so we can try out other options with Alevin-fry, but I am open to other suggestions or ideas.

@jashapiro
Copy link
Member

jashapiro commented Sep 14, 2021

Ugh, these do not look good. I suspect that just switching to ISR may not be sufficient.

I'm worried that alevin may be reverse complementing the cell barcodes, such that they would no longer match the provided list. This wouldn't have been a problem until the introduction of the unfiltered mode, as cell barcodes were only checked within libraries, so RC everything would be fine.

One test would be to run alevin in knee mode, which would allow it to choose its own barcodes. Then we could check if those barcodes match reverse complements of the CR barcodes.

If that works, we could provide a reverse complemented barcode list for 5' samples, which we would then have to reverse after import. Which isn't the worst thing in the world, but does complicate matters.

It is also possible that the ability to specify read geometries added in salmon 1.4 might give us a solution, if we are allowed to use reversed start and finishes. So we might be able to specify
--read-geometry 2[end-1] --bc-geometry 1[1-16] --umi-geometry 1[17-26]
rather than the default
--read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]

Not sure if that is supported, but we could ask. (I think it would be worth diving a bit deeper to see if we can find if this is the problem before we did that though.)

I'm also looking a bit at https://salmon.readthedocs.io/en/latest/library_type.html, and wondering if MSF might actually be the read format we want for 10xv2 5prime? I'm probably getting myself confused. Sort of curious what alevin would choose if given auto as an option. We could also try IU for no strand preference on the 5' read?

@rob-p
Copy link

rob-p commented Sep 16, 2021

So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the generate-permit-list step. So, the filtering by read orientation should happen in the fry step.

@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?

@rob-p
Copy link

rob-p commented Sep 16, 2021

@allyhawkins & @jashapiro : one more thought I had — what does it look like if you use —sketch mode here instead. While selective alignment provides higher precision, it does not tolerate a lot of divergence (at least with the default parameters in alevin). If the 5’ reads are often softclipped, this could lead to a lower alignment rate in this data using selective alignment.

@jashapiro
Copy link
Member

@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?

Yeah, that seems to have been my mistake. I was thinking that if the mapping was strand-specific, then cell barcode mapping might be too. Some of our comparisons between mappers start by filtering to shared barcodes, so if that was the problem we could end up filtering out many of the "real" cells. But it doesn't look like that was the issue.

So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the generate-permit-list step. So, the filtering by read orientation should happen in the fry step.

Ah! We had missed (or forgotten about) this! We didn't change the orientation!
@allyhawkins: we need to update this line to either rc or both

@allyhawkins
Copy link
Member Author

@rob-p Thank you so much for the help on this one! Changing from using the fw option to rc for the read orientation while leaving the library type alone seemed to do the trick!

@jashapiro I went ahead and updated the notebook to include the samples run using rc and have removed the knee filtering. In doing so, I did choose to keep the comparisons where I looked at all cells without restricting to shared cells to show how similar the distributions are even before restricting to shared cells. The only caveat here is that it looks like Alevin-fry may be reporting some of the counts and number of genes/cell a bit higher than Cell Ranger, but that seems to decrease a little bit when restricting the analysis to only genes found in > 5% of cells. I am not really that concerned about these differences and feel pretty confident that we should be able to implement use of this workflow for our 5' samples.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

I think it may need a rerender, but the only other major thing I noticed was the fact that the mito % was so much higher for the alevin-fry processing. Was that a pattern we saw before? I looked at 04-cell-level-benchmarking-metrics.nb.html, and it doesn't seem so: we saw higher mito percent for sketch mode with nuclei, but not for cells, and not for SA mode. So I would file that under mild curiosity, but nothing that should prevent us from proceeding.

In order to see if there are any differences between Alevin-fry and Cell Ranger in the ability to assign reads to each cell barcode, we first are looking at all cells before restricting to shared cells only.

```{r}
# mito content in shared cells
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth commenting about that the mito percent is so much higher for alevin-fry? My first thought was that this might be a pre-filtering artifact, but it seems to hold post-filtereing too.

Copy link

@rob-p rob-p Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jashapiro, one quick thought here — not sure if it's worth digging into or not. CellRanger is mapping against their custom reference, and so it has the entire mt chromosome (including genes on there that may not be part of the standard annotation). In this case, reads will map to the best location, which may be an unannotated (or at least not part of the restricted annotation) gene on the mt chromosome. In that case, it won't get counted toward an annotated mt gene and won't show up in the QC.

On the other hand, alevin-fry knowns only about the mt transcripts that are explicitly included in the reference. I know for a fact that there are mito transcripts missing from the standard annotations (like MT-RNR2 and MT-RNR1). Possibly then, reads coming from similar mt transcripts could be mapped instead to some annotated mt transcript and counted as mt expression. In this case, the STAR/CellRanger approach would actually have similar mito expression, but it just wouldn't show up in the count matrix since the mt chromosome is essentially acting as a "decoy" for these reads (because the transcripts of origin aren't annotated in the count matrix). It might be worth looking in the CellRanger BAM and what kind of mapping rate you get to the mt chromosome, and does that signify more reads arising from there than are accounted for in the reported gene counts? This may not be what is going on of course, but this is a thought I've had on this a few times before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clear explanation @rob-p. I think you bring up an interesting point and potentially something to look into. Mitochondrial genes also tend to have a lot of homology as well correct? I wonder if this is something that is more commonly seen in using the cr-like-em resolution rather than just the cr-like resolution and that some of the increased detection is due to multimapped reads not being thrown away here.

Comment on lines +218 to +225
ggplot(qc_common_threshold, aes(x = tool, y = detected, fill = tool)) +
geom_boxplot() +
facet_grid(~ sample) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Genes detected/cell") +
xlab("") +
ylim(c(0,1000))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that the genes per cell drops so precipitously here is interesting, but I guess that probably suggests the data isn't particularly saturated?

Comment on lines +208 to +215
ggplot(qc_common_threshold, aes(x = tool, y = sum, fill = tool)) +
geom_boxplot() +
facet_grid(~ sample) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("UMI/cell") +
xlab("") +
ylim(c(0,30000))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that this chunk and the one below do not appear in the rendered output (the plots do). Probably needs a fresh rerender?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that. I fixed the html file so these chunks should now be present.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think we can leave the mitochondrial percentages as a bit of a mystery for now...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Quick comparison of alevin-fry to CellRanger for 5' libraries
3 participants