Add process for building cellranger index to build-index.nf #66

Merged on Dec 9, 2021 (7 commits)


allyhawkins

Closes #65. This PR adds the cellranger_index process to build-index.nf, the workflow for building indices, so that all of our indices are built from the same assembly. I followed the same setup previously used to generate the cellranger index in alsf-scpca/workflows/rnaseq-ref-index/build-cellranger-index.nf.

This was pretty straightforward; I tested it and everything ran successfully, producing a new index with Ensembl 104. My only question is how we want to name this index. I used the assembly name that we use in the splici index and appended _cdna_cellranger, but let me know if a different naming convention would be preferred.


@jashapiro jashapiro left a comment


This looks good. I made a couple of suggestions below, which might make this index more comparable to the one we use for salmon.

I was going to make an additional suggestion about naming params, but ended up putting it in a separate issue: #67

build-index.nf Outdated
cellranger mkgtf \
genome.gtf \
filtered.gtf \
--attribute=gene_biotype:protein_coding
Member


Is this going to be too stringent? I would think we should include ncRNA and such; the default cellranger index includes much more: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references#mkgtf

(I just looked at the full list of gene_biotypes in the reference we are using, which appears below (generated with grep -o 'gene_biotype [^;]*' Homo_sapiens.GRCh38.104.gtf | sort | uniq), and noted that the code used by 10X might miss lncRNA, as it is looking for lincRNA according to that page.)

Alternatively, we could probably just not filter the gtf file? This would probably be more comparable to what we have for the salmon index, which I do not think we filter at all (hence 60K "genes").

gene_biotype "IG_C_gene"
gene_biotype "IG_C_pseudogene"
gene_biotype "IG_D_gene"
gene_biotype "IG_J_gene"
gene_biotype "IG_J_pseudogene"
gene_biotype "IG_V_gene"
gene_biotype "IG_V_pseudogene"
gene_biotype "IG_pseudogene"
gene_biotype "Mt_rRNA"
gene_biotype "Mt_tRNA"
gene_biotype "TEC"
gene_biotype "TR_C_gene"
gene_biotype "TR_D_gene"
gene_biotype "TR_J_gene"
gene_biotype "TR_J_pseudogene"
gene_biotype "TR_V_gene"
gene_biotype "TR_V_pseudogene"
gene_biotype "lncRNA"
gene_biotype "miRNA"
gene_biotype "misc_RNA"
gene_biotype "polymorphic_pseudogene"
gene_biotype "processed_pseudogene"
gene_biotype "protein_coding"
gene_biotype "pseudogene"
gene_biotype "rRNA"
gene_biotype "rRNA_pseudogene"
gene_biotype "ribozyme"
gene_biotype "sRNA"
gene_biotype "scRNA"
gene_biotype "scaRNA"
gene_biotype "snRNA"
gene_biotype "snoRNA"
gene_biotype "transcribed_processed_pseudogene"
gene_biotype "transcribed_unitary_pseudogene"
gene_biotype "transcribed_unprocessed_pseudogene"
gene_biotype "translated_processed_pseudogene"
gene_biotype "translated_unprocessed_pseudogene"
gene_biotype "unitary_pseudogene"
gene_biotype "unprocessed_pseudogene"
gene_biotype "vault_RNA"
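
To see how much a protein_coding-only filter would drop, it can help to count gene records per biotype rather than only list the distinct labels. The following is a minimal sketch, assuming an Ensembl-style GTF with tab-separated columns and the feature type in column 3; the function name `count_biotypes` is hypothetical and not part of build-index.nf.

```shell
# Count "gene" records per gene_biotype in an Ensembl-style GTF.
# Usage: count_biotypes FILE.gtf
# Output: one "<count> <biotype>" line per biotype, largest count first.
count_biotypes() {
  # Keep only rows whose feature type (column 3) is "gene",
  # extract the quoted gene_biotype value, then tally the values.
  awk -F'\t' '$3 == "gene"' "$1" \
    | grep -o 'gene_biotype "[^"]*"' \
    | cut -d'"' -f2 \
    | sort | uniq -c | sort -k1,1nr -k2,2 \
    | awk '{print $1, $2}'
}
```

Running `count_biotypes Homo_sapiens.GRCh38.104.gtf` would print one line per biotype in the list above, which makes it easy to compare the protein_coding count against the total number of genes.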

Member Author


We definitely don't do any filtering like this for the salmon index, so I can remove this to make it more comparable.

allyhawkins and others added 2 commits December 8, 2021 14:48
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins

I went ahead and removed filtering and it worked fine, creating a larger index this time.


@jashapiro jashapiro left a comment


Looks good, with two small suggestions I made, and one that I can't actually put in, which is to add

  cellranger_index = "s3://nextflow-ccdl-data/reference/homo_sapiens/ensembl-104/cellranger_index/Homo_sapiens.GRCh38.104_cellranger_full"

to the params in nextflow.config

build-index.nf Outdated
Comment on lines 84 to 91
cellranger mkgtf \
genome.gtf \
filtered.gtf

cellranger mkref \
--genome=${cellranger_index} \
--fasta=genome.fasta \
--genes=filtered.gtf \
Member


I think you can actually completely skip mkgtf as I don't think it is doing anything here.

Suggested change
- cellranger mkgtf \
- genome.gtf \
- filtered.gtf
-
- cellranger mkref \
- --genome=${cellranger_index} \
- --fasta=genome.fasta \
- --genes=filtered.gtf \
+ cellranger mkref \
+ --genome=${cellranger_index} \
+ --fasta=genome.fasta \
+ --genes=genome.gtf \
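
For reference, here is a small sketch of what a standalone mkref call on the unfiltered GTF might look like outside the Nextflow process. The wrapper name `build_cellranger_index`, the `DRY_RUN` switch, and the default thread and memory values are hypothetical placeholders, not values taken from build-index.nf; `--nthreads` and `--memgb` are optional mkref flags, and the right memory value depends on the genome.

```shell
# Build (or, with DRY_RUN=1, only print) a cellranger mkref command that
# passes the GTF through unfiltered, i.e. without a prior mkgtf step.
# Usage: build_cellranger_index NAME FASTA GTF [THREADS] [MEM_GB]
build_cellranger_index() {
  index_name="$1"; fasta="$2"; gtf="$3"
  threads="${4:-8}"   # placeholder default, an assumption
  mem_gb="${5:-32}"   # placeholder default, an assumption
  # Reuse the positional parameters to hold the full command line.
  set -- cellranger mkref \
    "--genome=${index_name}" \
    "--fasta=${fasta}" \
    "--genes=${gtf}" \
    "--nthreads=${threads}" \
    "--memgb=${mem_gb}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # show the command without running it
  else
    "$@"
  fi
}
```

Calling it with `DRY_RUN=1` prints the command for review before committing compute time to a full index build.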

@allyhawkins

@jashapiro I made the edits you suggested and also had to increase the memory for the process. With the lower memory setting, the process completed without error but wrote only some of the index files. I did not notice the index was partial until I tried to run spaceranger with it later and found files missing, so I bumped up the memory requirement too.
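
Since the process exited successfully even with a partial index, one possible safeguard is an explicit check after mkref that fails loudly when expected files are absent. This is only a sketch: the helper name `check_index_files` is hypothetical, and the list of expected paths is supplied by the caller because the exact layout of the index should be taken from a known-complete index rather than assumed.

```shell
# Verify that every expected file exists and is non-empty under an index
# directory. Prints each problem path and returns non-zero if any is missing.
# Usage: check_index_files INDEX_DIR REL_PATH [REL_PATH ...]
check_index_files() {
  index_dir="$1"; shift
  missing=0
  for rel_path in "$@"; do
    # -s is true only for files that exist and have size greater than zero
    if [ ! -s "${index_dir}/${rel_path}" ]; then
      echo "missing or empty: ${rel_path}"
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `check_index_files "$index_dir" reference.json fasta/genome.fa star/SA` (paths here are an assumption about the mkref output layout) placed at the end of the process script would turn a silently incomplete index into a failed task.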

Copy link
Member

@jashapiro jashapiro left a comment


LGTM!

@allyhawkins allyhawkins merged commit 5486b6b into main Dec 9, 2021
@allyhawkins allyhawkins deleted the allyhawkins/cellranger-index branch March 23, 2023 19:20

Successfully merging this pull request may close these issues.

Include cellranger index in build-index.nf