Question about running the scDECAF #2

Closed
NianzhenGu opened this issue Jan 2, 2024 · 18 comments

Comments

@NianzhenGu

Hi! I'm working on a single-cell RNA project that compares single-cell transcriptomic data of embryonic and adult mouse colons to identify embryonic-specific gene signatures and use these genes to score colon cancer single-cell data. I find the scDECAF algorithm is suitable for this project.

I have some problems understanding the inputs to the algorithm in the Quick Start part.

  • For the variable x, can I put the SingleCellExperiment object?
  • I don't know the meaning of geneset and the HM_geneset. Does the HM_geneset represent the human geneset that can be downloaded online? How about the mouse geneset?

What I have now is the gene signature, a vector of gene ids, like "ENSMUSG00000031957" "ENSMUSG00000069893" "ENSMUSG00000055827"...; the data I want to score: a SingleCellExperiment object; a list of highly variable genes (hvg).

Very much appreciate it if you could give me some instructions! Thanks!

@soroorh
Contributor

soroorh commented Jan 2, 2024

Hi! Thank you for using the issue tracker!

  • We only support standard matrices at the moment, so you'd have to supply the matrix of normalised expression values of the hvg genes.
  • HM_geneset, or more generally the input to genesetlist in pruneGenesets(), is a list of gene sets, i.e. each element of the list is a list of genes (names or ids, depending on the row names of x). This is the set of all gene sets one is interested in, and it must contain more than one gene set (say, the C2 collection from MSigDB). The name reflects our own application (we actually meant Hallmark gene sets) and does not imply a species. I encourage you to look up the documentation for any function of interest by installing the package and running ?functionname

The vector space representation computed by scDECAF requires more than one gene signature, so I'd recommend adding other gene sets to run the model. For your project, for example, differentially expressed genes in differentially abundant neighbourhoods, which you can get from miloDE, will provide you with a sufficient number of gene sets to use as input to scDECAF.

Link to miloDE https://github.com/MarioniLab/miloDE

Hope this helps.
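
A minimal sketch of the first point above, using base R on toy data. The matrix, the gene IDs and the `hvg` vector are all made up for illustration; with a real SingleCellExperiment the matrix would typically come from `as.matrix(logcounts(sce))`.

```r
# Toy stand-in for a log-normalised expression matrix (genes x cells).
set.seed(1)
genes <- paste0("ENSMUSG", sprintf("%011d", 1:20))
cells <- paste0("cell", 1:6)
logexpr <- matrix(rexp(20 * 6), nrow = 20, dimnames = list(genes, cells))

# Hypothetical highly variable genes; the last ID is deliberately absent
# from the matrix to show why the intersect() step is useful.
hvg <- c(genes[c(1, 3, 5, 7, 9, 11)], "ENSMUSG99999999999")

# Keep only HVGs that are actually present as row names, then subset.
hvg_present <- intersect(hvg, rownames(logexpr))
x <- logexpr[hvg_present, , drop = FALSE]

# x is now a plain numeric matrix: HVGs in rows, cells in columns.
stopifnot(is.matrix(x), is.numeric(x))
dim(x)
```

The resulting `x` keeps gene IDs as row names, which is what the gene sets are matched against.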

@NianzhenGu
Author

So for example, if I have two gene signatures, s1 = [a, b, c], s2 = [d, e, f]. I can create a geneset like [s1, s2]. Then I run the pruneGenesets() and genesets2ids() before the scDECAF() right?

@soroorh
Contributor

soroorh commented Jan 2, 2024

So, the genesetlist has to be a named list. I suggest:

gslist = list()
gslist[['gs1']] <- s1
gslist[['gs2']] <- s2

As I mentioned, due to the nature of the model we generally need more than 2 gene sets. You got the order of running the functions right, but if your full gene set list has fewer than 10 gene sets, pruning via pruneGenesets() might not be required. That is why I suggested obtaining additional gene sets from a miloDE analysis, for example.

I suggest you also check out our tutorials from the reproducibility repo
https://github.com/DavisLaboratory/scDECAF-reproducibility/blob/master/kang_pbmc/kang_pbmc.ipynb
https://github.com/DavisLaboratory/scDECAF-reproducibility/blob/master/cite_pbmc/TotalVI_scDECAF_analysis-addMilo.ipynb

Hope this helps
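
As a rough sketch of the named-list advice above: the snippet builds the list in one call and runs a few sanity checks before anything is handed to pruneGenesets() or genesets2ids(). The signatures `s1` to `s3` and the `x_genes` vector are illustrative placeholders, and filtering to genes present in the matrix is a precaution added here, not a step the maintainers asked for.

```r
# Hypothetical signatures: IDs from this thread plus placeholder IDs.
s1 <- c("ENSMUSG00000031957", "ENSMUSG00000069893")
s2 <- c("ENSMUSG00000055827", "ENSMUSG00000000001")
s3 <- c("ENSMUSG00000000002", "ENSMUSG00000000003")

# Equivalent to filling gslist[['gs1']], gslist[['gs2']], ... one by one.
gslist <- list(gs1 = s1, gs2 = s2, gs3 = s3)

# Toy row names of the expression matrix the gene sets will be matched to.
x_genes <- c("ENSMUSG00000031957", "ENSMUSG00000069893",
             "ENSMUSG00000055827", "ENSMUSG00000000002")

# Keep only genes found in the matrix, then drop sets left empty.
gslist <- lapply(gslist, intersect, x_genes)
gslist <- gslist[lengths(gslist) > 0]

# The list must be named and should hold more than 2 gene sets.
stopifnot(!is.null(names(gslist)), all(nzchar(names(gslist))))
stopifnot(length(gslist) > 2)
```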

@NianzhenGu
Author

Great! Thanks for your suggestion! I will try it in the coming days.

@NianzhenGu
Author

Hi! I still have a problem. The picture shows my command for running scDECAF. I'm not sure what I should use for the embedding.
[Screenshot: scDECAF command]

The rest of the data: merged_counts is my original data, with genes as rows and samples as columns. The dim of this data is 25904 x 4.
[Screenshot: merged_counts]

target is the result obtained from genesets2ids(), with genes as rows and gene sets as columns. The dim of this data is 106 x 8.
[Screenshot: target]

hvg_union is the list of hvg; its length is 8698.
[Screenshot: hvg_union]

For the embedding = reducedDims(tumor_sce)[["UMAP"]], the tumor_sce is the SingleCellExperiment object. I'm not sure whether my inputs are correct for the data I showed above.

Appreciate it if you could find the problem! Thanks!

@soroorh
Contributor

soroorh commented Jan 5, 2024

Hi. So the error is suggesting that dim(embedding) != dim(merged_counts). Can you please verify that? Also, I suggest you use log-normalised gene expression rather than raw counts, e.g. scTransformed data.

For embedding, you can use umap as you're doing here, but can also consider any other embedding (PCA, PHATE etc with > 2 dimensions).

Hope this helps!

@NianzhenGu
Author

The merged_counts is log normalized. I tried the PCA but still got the same error:

[Two screenshots, not preserved]

@soroorh
Contributor

soroorh commented Jan 5, 2024

OK, thanks. Are the row names set for the embedding matrix? scDECAF at some point matches column names in merged_counts with row names in the embedding. Hopefully that fixes it?

@NianzhenGu
Author

Sorry, I'm not sure about what you mean. The row name of embedding is the gene name and the column name is PC1, PC2, PC3, PC4. The row name of merged_counts is the gene name with the same order and the column name is four sample names.

@soroorh
Contributor

soroorh commented Jan 5, 2024

Ah, then I see what's going wrong. The embedding is a cell embedding, i.e. it has dims n_cells x n_D, where n_D is the number of dimensions in the dimension reduction space, whereas you are providing a gene embedding. Your initial code was correct because you had reducedDims(tumor_sce)[["UMAP"]]. Please just check the row names there and verify nrow(reducedDims(tumor_sce)[["UMAP"]]) == ncol(merged_counts).

@NianzhenGu
Author

So merged_counts should be n_gene x n_cells and the embedding should be n_cells x n_D. But then dim(merged_counts) will not be equal to dim(embedding)? Also, the column names of merged_counts should be the same as the row names of the embedding, which are the cell names, right?

@soroorh
Contributor

soroorh commented Jan 5, 2024

Correct. dim(embedding) != dim(merged_counts) is always true; I actually meant that the error is suggesting nrow(embedding) != ncol(merged_counts), or that row names are not set in the embedding. Apologies for the confusion.
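
The checks discussed above can be sketched as a small pre-flight function in base R. `check_embedding()` is a hypothetical helper, not part of scDECAF, and the matrices are toy data. It assumes, as described in this thread, that the expression matrix is genes x cells and the embedding is cells x dimensions.

```r
set.seed(1)
cells <- paste0("cell", 1:5)

# Toy expression matrix (genes x cells) and cell embedding (cells x dims).
x <- matrix(rnorm(4 * 5), nrow = 4,
            dimnames = list(paste0("gene", 1:4), cells))
embedding <- matrix(rnorm(5 * 2), nrow = 5,
                    dimnames = list(cells, c("UMAP1", "UMAP2")))

# Hypothetical helper: stops with a message if the embedding cannot be
# matched to the columns of x, otherwise returns TRUE invisibly.
check_embedding <- function(x, embedding) {
  if (nrow(embedding) != ncol(x)) {
    stop("nrow(embedding) must equal ncol(x): one row per cell")
  }
  if (is.null(rownames(embedding))) {
    stop("embedding has no row names")
  }
  if (!setequal(rownames(embedding), colnames(x))) {
    stop("row names of embedding must be the cell names in colnames(x)")
  }
  invisible(TRUE)
}

check_embedding(x, embedding)

# Optional precaution: put the embedding rows in the same order as the cells in x.
embedding <- embedding[colnames(x), , drop = FALSE]
```

A gene embedding (genes in rows) fails the first or third check, which is the situation described earlier in the thread.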

@soroorh
Contributor

soroorh commented Jan 5, 2024

Also, since you only have 8 gene sets, k should be < 8 (you have 10 now). I also updated the README with more specifications.
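
A small sketch of that constraint, assuming `target` is the genes x gene sets matrix returned by genesets2ids() (a toy 106 x 8 matrix stands in for it here). Clamping to one below the number of gene sets is only an illustration of the upper bound, not a recommended default.

```r
# Toy stand-in for the genesets2ids() output: genes x gene sets.
target <- matrix(0, nrow = 106, ncol = 8,
                 dimnames = list(NULL, paste0("gs", 1:8)))

k <- 10                     # value originally requested
n_genesets <- ncol(target)  # 8 gene sets in this thread

# Keep k strictly below the number of gene sets.
if (k >= n_genesets) {
  k <- n_genesets - 1
}
```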

@NianzhenGu
Author

Thanks! Will try it.

@Jade0904

Jade0904 commented Jan 5, 2024

Hi, I'm NianzhenGu's teammate. I still cannot run scDECAF successfully. Here's the error:

[Screenshot: error message]

"merged_logcounts" was defined by "merged_logcounts <- logcounts(merged_sce)" and the logcounts assay was generated by "merged_sce <- logNormCounts(merged_sce)".

"target" was defined by "target <- genesets2ids(merged_logcounts, gene_signature)", where "gene_signature" was a list of gene sets, as below:

[Screenshot: gene_signature list]

hvg_union was a vector of highly variable genes we chose.

Reduced dimensions were generated by:
merged_sce <- runPCA(merged_sce)
merged_sce <- runUMAP(merged_sce, dimred = "PCA")
I tried both of them (UMAP and PCA) in scDECAF() but it threw the same error.

Do you have any idea about what could possibly be the problems? Thanks a lot :)

@soroorh
Contributor

soroorh commented Jan 5, 2024

Hey :).
Since data is log-transformed, please set standardize=FALSE, as per example code on README. Hope this helps.

@Jade0904

Jade0904 commented Jan 5, 2024

> Hey :). Since data is log-transformed, please set standardize=FALSE, as per example code on README. Hope this helps.

Yes, it worked! Thank you so much for your help!

@soroorh
Contributor

soroorh commented Jan 5, 2024

No worries. please close the issue, if this is done!


3 participants