Identify a threshold without reprocessing all data? #4

liz-is · 2022-04-20T12:32:17Z

Hello,

I think the BgeeCall approach to using intergenic regions to select a reasonable threshold to call a gene "expressed" is an excellent idea, and I'm keen to try it out. However it's not clear to me from the README/vignette whether I can run this approach using BgeeCall without re-analysing all my data from fastq files.

I have human scRNA-seq data (10x), as a Seurat object. I also have .bam files with alignments to the genome, which should enable extraction of expression for the human reference intergenic regions. Is it possible to use this processed data as input into your workflow?

This seems like it would be a pretty common use case, so it would be really helpful if you could add some guidance to the vignette/README about how to use your approach on already-processed data.

Thanks in advance for your help!

jwollbrett · 2022-05-04T14:44:50Z

Hello @liz-is ,

Thanks for your question. For now BgeeCall has only been used to generate present/absent expression calls for bulk RNASeq and full length single-cell RNASeq.

We are currently investigating the best approach to make it compatible to target base single-cell RNASeq.

For now in BgeeCall we are using Kallisto for both mapping and quantification steps. The use case you are talking about (bam file + seurat object) is one of the approach we investigate but is not compatible with Kallisto. In this scenario you already have the mapping. So we need to use a different tool taking into consideration the mapping and the barcodes for the quantification step.
In case we decide to go this direction we were thinking on using FeatureCounts or UMI-tools. None of them are implemented in R.

Please let us know if you have name of other tools in mind.
Do not hesitate to contact us again if you have other use cases related to BgeeCall that could be useful for your analyzes.

liz-is · 2022-05-09T11:16:26Z

Oh, I thought based on the preprint that you had also used this on 10X data already?

I actually already tried to implement something on the data I have. I subsetted the CellRanger bam files to the cluster level using the subset_bam tool from 10X, extracted the reference intergenic regions from your package and converted them to a .gtf, then used HTSeq on the command line to quantify expression of the reference intergenic regions per cluster. (It would probably be possible to do this within R if you have a list of barcodes per cluster and use scanBam filtering on appropriate tags, or something along those lines... but I expect this would be pretty slow.)

I pooled the gene and intergenic region expression counts across all clusters, converted to CPM, and used the p-value approach (I couldn't figure out how to calculate the q-value approach decsribed in the preprint - been a while since I had to use integrals...):

This gave me ~30% of genes expressed which seems consistent with the results shown for scRNA-seq in the preprint.

I didn't get sensible results if I tried to apply this approach to individual clusters, though (0-80% of genes expressed, median 0.4% across 18 clusters!). That's probably at least partially due to some sample-specific weirdnesses, as well as limited cell numbers / counts in some clusters.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Identify a threshold without reprocessing all data? #4

Identify a threshold without reprocessing all data? #4

liz-is commented Apr 20, 2022

jwollbrett commented May 4, 2022

liz-is commented May 9, 2022

Identify a threshold without reprocessing all data? #4

Identify a threshold without reprocessing all data? #4

Comments

liz-is commented Apr 20, 2022

jwollbrett commented May 4, 2022

liz-is commented May 9, 2022