Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify a threshold without reprocessing all data? #4

Open
liz-is opened this issue Apr 20, 2022 · 2 comments
Open

Identify a threshold without reprocessing all data? #4

liz-is opened this issue Apr 20, 2022 · 2 comments

Comments

@liz-is
Copy link

liz-is commented Apr 20, 2022

Hello,

I think the BgeeCall approach to using intergenic regions to select a reasonable threshold to call a gene "expressed" is an excellent idea, and I'm keen to try it out. However it's not clear to me from the README/vignette whether I can run this approach using BgeeCall without re-analysing all my data from fastq files.

I have human scRNA-seq data (10x), as a Seurat object. I also have .bam files with alignments to the genome, which should enable extraction of expression for the human reference intergenic regions. Is it possible to use this processed data as input into your workflow?

This seems like it would be a pretty common use case, so it would be really helpful if you could add some guidance to the vignette/README about how to use your approach on already-processed data.

Thanks in advance for your help!

@jwollbrett
Copy link
Contributor

Hello @liz-is ,

Thanks for your question. For now BgeeCall has only been used to generate present/absent expression calls for bulk RNASeq and full length single-cell RNASeq.

We are currently investigating the best approach to make it compatible to target base single-cell RNASeq.

For now in BgeeCall we are using Kallisto for both mapping and quantification steps. The use case you are talking about (bam file + seurat object) is one of the approach we investigate but is not compatible with Kallisto. In this scenario you already have the mapping. So we need to use a different tool taking into consideration the mapping and the barcodes for the quantification step.
In case we decide to go this direction we were thinking on using FeatureCounts or UMI-tools. None of them are implemented in R.

Please let us know if you have name of other tools in mind.
Do not hesitate to contact us again if you have other use cases related to BgeeCall that could be useful for your analyzes.

@liz-is
Copy link
Author

liz-is commented May 9, 2022

Oh, I thought based on the preprint that you had also used this on 10X data already?

I actually already tried to implement something on the data I have. I subsetted the CellRanger bam files to the cluster level using the subset_bam tool from 10X, extracted the reference intergenic regions from your package and converted them to a .gtf, then used HTSeq on the command line to quantify expression of the reference intergenic regions per cluster. (It would probably be possible to do this within R if you have a list of barcodes per cluster and use scanBam filtering on appropriate tags, or something along those lines... but I expect this would be pretty slow.)

I pooled the gene and intergenic region expression counts across all clusters, converted to CPM, and used the p-value approach (I couldn't figure out how to calculate the q-value approach decsribed in the preprint - been a while since I had to use integrals...):
image

This gave me ~30% of genes expressed which seems consistent with the results shown for scRNA-seq in the preprint.
image

I didn't get sensible results if I tried to apply this approach to individual clusters, though (0-80% of genes expressed, median 0.4% across 18 clusters!). That's probably at least partially due to some sample-specific weirdnesses, as well as limited cell numbers / counts in some clusters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants