Add process for building cellranger index to build-index.nf #66

allyhawkins · 2021-12-08T18:49:21Z

Closes #65. This PR adds the cellranger_index process to build-index.nf, the workflow for building the indices, so that all of our indices are built from the same assembly. I followed the same setup that was previously used to generate the cellranger index in alsf-scpca/workflows/rnaseq-ref-index/build-cellranger-index.nf.

This was pretty straightforward: I was able to test it and everything ran successfully, producing a new index with Ensembl 104. The only question I had was how we want to name this index. I used the assembly name that we use in the splici index and appended _cdna_cellranger, but let me know if a different naming convention would be preferred.

jashapiro

This looks good. I made a couple of suggestions below, which might make this index more comparable to the one we use for salmon.

I was going to make an additional suggestion about naming params, but ended up putting it in a separate issue: #67

jashapiro · 2021-12-08T20:04:43Z

build-index.nf

+    cellranger mkgtf \
+      genome.gtf \
+      filtered.gtf \
+      --attribute=gene_biotype:protein_coding


Is this going to be too stringent? I would think we should include ncRNA and such. The default cellranger index includes much more: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references#mkgtf

(I just looked at the full list of gene_biotypes in the reference we are using, which appears below (generated with grep -o 'gene_biotype [^;]*' Homo_sapiens.GRCh38.104.gtf |sort | uniq) and noted that the code used by 10X might miss lncRNA as it is looking for lincRNA, according to that page.)

Alternatively, we could probably just not filter the gtf file? This would probably be more comparable to what we have for the salmon index, which I do not think we filter at all (hence 60K "genes").

gene_biotype "IG_C_gene"
gene_biotype "IG_C_pseudogene"
gene_biotype "IG_D_gene"
gene_biotype "IG_J_gene"
gene_biotype "IG_J_pseudogene"
gene_biotype "IG_V_gene"
gene_biotype "IG_V_pseudogene"
gene_biotype "IG_pseudogene"
gene_biotype "Mt_rRNA"
gene_biotype "Mt_tRNA"
gene_biotype "TEC"
gene_biotype "TR_C_gene"
gene_biotype "TR_D_gene"
gene_biotype "TR_J_gene"
gene_biotype "TR_J_pseudogene"
gene_biotype "TR_V_gene"
gene_biotype "TR_V_pseudogene"
gene_biotype "lncRNA"
gene_biotype "miRNA"
gene_biotype "misc_RNA"
gene_biotype "polymorphic_pseudogene"
gene_biotype "processed_pseudogene"
gene_biotype "protein_coding"
gene_biotype "pseudogene"
gene_biotype "rRNA"
gene_biotype "rRNA_pseudogene"
gene_biotype "ribozyme"
gene_biotype "sRNA"
gene_biotype "scRNA"
gene_biotype "scaRNA"
gene_biotype "snRNA"
gene_biotype "snoRNA"
gene_biotype "transcribed_processed_pseudogene"
gene_biotype "transcribed_unitary_pseudogene"
gene_biotype "transcribed_unprocessed_pseudogene"
gene_biotype "translated_processed_pseudogene"
gene_biotype "translated_unprocessed_pseudogene"
gene_biotype "unitary_pseudogene"
gene_biotype "unprocessed_pseudogene"
gene_biotype "vault_RNA"
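To judge how much a biotype filter would drop, it can help to see how many genes carry each gene_biotype, not just which values exist. The following is a minimal sketch, not part of this PR: the function name count_gene_biotypes and the three-line sample GTF are made up for illustration, and it assumes the GTF has one "gene" feature line per gene with a quoted gene_biotype attribute.

```shell
# Count genes per gene_biotype in a GTF file.
# Only lines whose third column is "gene" are counted, so each gene
# contributes once rather than once per transcript or exon.
count_gene_biotypes() {
  awk -F '\t' '$3 == "gene"' "$1" \
    | grep -o 'gene_biotype "[^"]*"' \
    | sort \
    | uniq -c
}

# Tiny made-up GTF for illustration only (not real Ensembl records).
sample_gtf="$(mktemp)"
{
  printf '1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
  printf '1\tdemo\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
  printf '1\tdemo\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";\n'
} > "$sample_gtf"

count_gene_biotypes "$sample_gtf"
rm -f "$sample_gtf"
```

Pointing the function at Homo_sapiens.GRCh38.104.gtf instead of the sample file would give per-biotype gene counts for the reference discussed here.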

We definitely don't do any filtering like this for the salmon index, so I can remove this to make it more comparable.

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins · 2021-12-08T21:29:04Z

I went ahead and removed filtering and it worked fine, creating a larger index this time.

jashapiro

Looks good. I made two small suggestions inline, plus one that I can't actually put in as a suggestion, which is to add

  cellranger_index = "s3://nextflow-ccdl-data/reference/homo_sapiens/ensembl-104/cellranger_index/Homo_sapiens.GRCh38.104_cellranger_full"

to the params in nextflow.config

jashapiro · 2021-12-08T22:22:26Z

build-index.nf

+    cellranger mkgtf \
+      genome.gtf \
+      filtered.gtf
+
+    cellranger mkref \
+      --genome=${cellranger_index} \
+      --fasta=genome.fasta \
+      --genes=filtered.gtf \


I think you can actually completely skip mkgtf as I don't think it is doing anything here.

Suggested change

-    cellranger mkgtf \
-      genome.gtf \
-      filtered.gtf
-
-    cellranger mkref \
-      --genome=${cellranger_index} \
-      --fasta=genome.fasta \
-      --genes=filtered.gtf \
+    cellranger mkref \
+      --genome=${cellranger_index} \
+      --fasta=genome.fasta \
+      --genes=genome.gtf \
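One way to check the expectation that mkgtf without an --attribute filter changes nothing is to compare the non-comment records of genome.gtf and filtered.gtf from a run that still includes the mkgtf step. This is only a sketch: the helper name gtf_records_match is hypothetical, and the thread does not establish whether mkgtf rewrites attribute formatting, so a mismatch would need inspection before concluding that records were dropped.

```shell
# Return 0 when two GTF files hold the same non-comment records,
# ignoring "#" header lines and record order; return non-zero otherwise.
gtf_records_match() {
  local tmp status
  tmp="$(mktemp)"
  # Sorted records of the second file go to a temp file ...
  grep -v '^#' "$2" | sort > "$tmp"
  # ... and are compared byte-for-byte with the sorted records of the first.
  grep -v '^#' "$1" | sort | cmp -s - "$tmp"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example call (file names taken from the diff above):
#   gtf_records_match genome.gtf filtered.gtf && echo "mkgtf kept every record"
```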

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins · 2021-12-09T16:10:02Z

@jashapiro I made the edits you suggested and also had to increase the memory for the process. With the original memory setting, the process completed without error but produced only some of the index files. I didn't realize the index was partial until I later tried to run spaceranger with it and found that files were missing, so I bumped up the memory requirement as well.
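Since the process exited successfully while leaving a partial index, an explicit check after mkref could surface the problem earlier. The sketch below is not part of this PR, and the helper name check_index_complete is hypothetical. The caller supplies the list of files a complete index should contain, because this thread does not enumerate them.

```shell
# Fail when any expected file is missing or empty under the index directory.
# Usage: check_index_complete <index_dir> <relative_path>...
check_index_complete() {
  local index_dir rel missing
  index_dir="$1"
  shift
  missing=0
  for rel in "$@"; do
    # -s is true only for files that exist and have non-zero size.
    if [ ! -s "$index_dir/$rel" ]; then
      echo "missing or empty: $index_dir/$rel" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call. The file names are assumptions about the index layout and
# should be replaced with the contents of a known-good index:
#   check_index_complete "$index_dir" reference.json fasta/genome.fa star/SA
```

Run as the last command of the process script, a non-zero return would make the task fail instead of publishing an incomplete index.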

jashapiro

LGTM!

allyhawkins added 2 commits December 8, 2021 12:45

add cellranger index process

5f19042

add cellranger docker container to config

d96d14a

allyhawkins requested a review from jashapiro December 8, 2021 18:49

jashapiro reviewed Dec 8, 2021

allyhawkins and others added 2 commits December 8, 2021 14:48

apply suggestions from code review

6ac28d1

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

remove filtering

b43a607

allyhawkins requested a review from jashapiro December 8, 2021 21:29

jashapiro reviewed Dec 8, 2021

jashapiro reviewed Dec 8, 2021

allyhawkins and others added 3 commits December 8, 2021 17:18

apply code review suggestions

6858647

increase memory for cellranger index building

75edd23

Apply suggestions from code review

f2b3e54

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins requested a review from jashapiro December 9, 2021 16:10

jashapiro approved these changes Dec 9, 2021

allyhawkins merged commit 5486b6b into main Dec 9, 2021

jashapiro mentioned this pull request Dec 14, 2021

Prepare for new workflow release (0.1.3) #70

Closed

allyhawkins deleted the allyhawkins/cellranger-index branch March 23, 2023 19:20
