scProcessor processes scRNAseq datasets for the IMMUcan scDB. It runs in R and is mostly based on the Seurat package. It covers the following steps:
- Quality control
- Measure and correct batch effects (harmony)
- Clustering optimization
- Supervised annotation (CHETAH)
- CNA calling (copyKat)
- Cell ontology (ebi.ac.uk/ols/ontologies/cl)
- Differential expression
- Universal output files (sceasy)
- Follow install instructions for sceasy (https://github.com/cellgeni/sceasy)
- Get CHETAH_reference_updatedAnnotation.RData from the IMMUcan Teams channel
- Install the following R packages:
install.packages(c("Seurat", "tidyverse", "readxl", "patchwork", "devtools", "data.table", "BiocManager", "remotes", "openxlsx", "pheatmap", "plyr", "DescTools", "future", "jsonlite"))
BiocManager::install(c("CHETAH", "SingleCellExperiment"))
devtools::install_github("mahmoudibrahim/genesorteR")
devtools::install_github("immunogenomics/harmony")
devtools::install_github("navinlabcode/copykat")
remotes::install_github("mojaveazure/seurat-disk")
Change the file paths defined in the script:
- cellMarker_path = PATH to TME_markerGenes.xlsx
- chetahClassifier_path = PATH to CHETAH_reference_updatedAnnotation.RData
- cellOntology_path = PATH to cell_ontology.xlsx
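Before running the pipeline, it can help to confirm that the three files exist at the locations you configured. The following is a minimal sketch, not part of scProcessor: `check_files` is a hypothetical helper and the `/path/to/...` locations are placeholders for the paths you set in the script.

```shell
# Report which of the given files are missing; returns non-zero if any is.
check_files() {
  missing=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder locations (assumptions); use the same paths you set in the script.
check_files \
  "/path/to/TME_markerGenes.xlsx" \
  "/path/to/CHETAH_reference_updatedAnnotation.RData" \
  "/path/to/cell_ontology.xlsx" \
  || echo "Fix the paths above before running scProcessor."
```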
The core of scProcessor consists of three processing scripts.
- check_seurat.R takes a Seurat object as input (support for other file formats is planned)
- This step is optional: if data.json is already filled in, you can run scProcessor_1 directly
- Check validity of the Seurat object
- Estimate batch variable
- Return QC plots (in temp)
Rscript check_seurat.R [SEURAT] [BATCH]
- [SEURAT]: path to the Seurat object (if the directory contains only one .rds file, the script finds it on its own)
- [BATCH]: only needed if you already know your batch variable
- The scProcessor Rscripts take no command-line arguments, so they need an input file, data.json, that specifies the processing variables. check_seurat generates this file automatically; review it to make sure scProcessor_1 processes the data the way you want.
- Overview of data.json (NA is written as null in JSON):
  - object_path: full path where the Seurat object is stored
  - batch: e.g. patient
  - norm: boolean indicating whether the data is already normalized, e.g. false
  - QC_feature_min: minimum number of detected genes per cell, e.g. 250
  - QC_mt_max: maximum percentage of mitochondrial reads per cell, e.g. 20
  - pca_dims: number of PCA dimensions to use for further processing, e.g. 30
  - features_var: number of highly variable features to use for further processing, e.g. 2000
  - nSample: number of cells to use for computationally intensive steps and for the final cellxgene.h5ad, e.g. 10000
  - cluster_resolution: a sequence of cluster resolutions from which scProcessor selects the optimal one, e.g. 0.5, 1, 1.5
  - malignant: boolean indicating whether malignant cell prediction is needed, e.g. true
  - normal_cells: cell type used as normal cells to increase the confidence of malignant cell prediction, e.g. null (macrophages are used by default) or false (no normal cells are used)
  - annotation: columns in meta.data that contain annotation information
  - metadata: other important columns in the meta.data slot, e.g. biopsy, sample_id, treatment ...
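To make the field list concrete, here is a sketch of what a filled-in file might look like, written to `data.example.json` so that it does not overwrite a real data.json. The field names come from the list above; the value shapes (arrays for cluster_resolution, annotation and metadata), the object path and the `cell_type` column name are assumptions, so compare against the file that check_seurat actually generates.

```shell
# Hypothetical example values; compare with the data.json written by check_seurat.R.
cat > data.example.json <<'EOF'
{
  "object_path": "/path/to/seurat_object.rds",
  "batch": "patient",
  "norm": false,
  "QC_feature_min": 250,
  "QC_mt_max": 20,
  "pca_dims": 30,
  "features_var": 2000,
  "nSample": 10000,
  "cluster_resolution": [0.5, 1, 1.5],
  "malignant": true,
  "normal_cells": null,
  "annotation": ["cell_type"],
  "metadata": ["biopsy", "sample_id", "treatment"]
}
EOF
```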
- QC
- Batch integration and clustering
- Supervised classification and CNA calling
- Create marker gene plots
- Save summary statistics in misc
- Check plots in temp/plots:
  - marker gene plots
  - dotplot
- In out/annotation.xlsx, fill in cell types as defined in the abbreviation column of cell_ontology.xlsx
- Links cell ontology
- Differential expression
- Creates output files for the SIB scRNAseq interface:
  - AverageExpression matrices and DE_results per annotation level
  - geneIndex.tsv
  - Metadata.tsv
  - cellCount.tsv
  - harmony.rds
  - cellxgene.h5ad
On the terminal:
zip -r AML_UNB_SW_GSE116256.zip AML_UNB_SW_GSE116256
md5sum AML_UNB_SW_GSE116256.zip
mv AML_UNB_SW_GSE116256.zip AML_UNB_SW_GSE116256_-_###PASTE_MD5SUM_OUTPUT_HERE###.zip
Log in to SIB through sftp and transfer the zip file.
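The zip, checksum and rename steps above can be combined so that the md5 sum does not have to be pasted by hand. This is a minimal sketch, assuming `zip` and `md5sum` are installed and that the dataset directory sits in the current working directory; `package_dataset` is a hypothetical helper, not part of scProcessor.

```shell
# Zip a dataset directory and append the archive's md5 checksum to its name.
package_dataset() {
  dir="${1%/}"                                  # strip a trailing slash, if any
  zip -r -q "$dir.zip" "$dir" || return 1       # create <dir>.zip
  sum=$(md5sum "$dir.zip" | cut -d ' ' -f 1)    # keep only the hash, drop the file name
  mv "$dir.zip" "${dir}_-_${sum}.zip"           # rename the archive, not the directory
  echo "${dir}_-_${sum}.zip"                    # print the final file name
}

# Usage, run from the directory that contains the dataset folder:
# package_dataset AML_UNB_SW_GSE116256
```

The function prints the final archive name, which is the file to transfer over sftp.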