- Phillipe Loher (http://cm.jefferson.edu/phillipe-loher/)
- Nestoras Karathanasis (http://cm.jefferson.edu/nestoras-karathanasis/)
We are part of the Computational Medicine Center at Thomas Jefferson University. To find out more about our research, please visit https://cm.jefferson.edu.
This is a technical report on the techniques and protocols we used for our analysis in the Single Cell Transcriptomics challenge; see https://www.frontiersin.org/articles/10.3389/fgene.2020.612840/full. Please follow the steps below to rerun our analysis.
- Unix System
- Python 3.5+ with the packages PyTorch, numpy and pandas (pickle, csv, sys and random are part of the Python standard library).
- R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
# Install required packages for Lasso-TopX workflow
# Start R
# Set working directory to SingleCell-DREAM_FOLDER
setwd("PATH_TO_SingleCell-DREAM_FOLDER/")
# Installing required dependencies
source("LASSO_topX_workflow/R/Dependencies.R")
# Setting Things up
# - Create folders tree
# - Download data from:
# https://shiny.mdc-berlin.de/DVEX/
source("R_Common/SettingThingsUp.R")
# Create NestedCV folds
# The created 10 folds are generated randomly and are used only for illustration purposes.
# The results in our paper are based on the 10 cross validation folds provided by the challenge's organizers, which are available here: https://www.life-science-alliance.org/content/3/11/e202000867
source("R_Common/Create_NestedCV_folds.R")
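For readers who want to see what a random 10-fold split looks like, here is a minimal Python sketch. It is an illustration only and is not the logic of `Create_NestedCV_folds.R`: the function name `make_folds`, the seed, and the use of integer cell ids are assumptions, and it builds only the outer train/test folds, without any inner folds.

```python
import random


def make_folds(cell_ids, n_folds=10, seed=0):
    """Shuffle the cell ids and deal them into n_folds train/test splits."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)

    folds = []
    for k in range(n_folds):
        # Fold k holds out every n_folds-th cell, starting at offset k.
        test = shuffled[k::n_folds]
        held_out = set(test)
        train = [c for c in shuffled if c not in held_out]
        folds.append({"train": train, "test": test})
    return folds


# Hypothetical usage: 1297 cells identified by their index.
folds = make_folds(range(1297), n_folds=10, seed=42)
print([len(f["test"]) for f in folds])
```

Each cell lands in exactly one test fold, so the test folds partition the data set while every training set contains the remaining cells.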
We employed three methods to perform the feature selection step, namely Random, a modified LASSO workflow and Deep Neural Nets.
We randomly selected the desired number of genes to baseline our feature selection algorithms.
# Randomly select genes
source(file = "Baseline/Baseline.R")
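The random baseline amounts to sampling gene names without replacement. The Python sketch below shows the idea under stated assumptions: `select_random_genes`, the placeholder gene names, and the seed are hypothetical and do not come from `Baseline/Baseline.R`.

```python
import random


def select_random_genes(gene_names, n_genes, seed=None):
    """Return n_genes distinct gene names drawn uniformly at random."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no gene is picked twice.
    return rng.sample(list(gene_names), n_genes)


# Hypothetical usage with placeholder gene names.
all_genes = ["gene_{}".format(i) for i in range(100)]
baseline_genes = select_random_genes(all_genes, n_genes=20, seed=7)
print(baseline_genes)
```

Fixing the seed makes a baseline draw repeatable; repeating the draw with different seeds gives a distribution of baseline scores to compare against.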
To generate the cells' 3D positions, which we used as labels, we ran DistMap using the code provided online. We then extracted the “mcc.scores” object from DistMap's output and identified, for each cell, the bins with the maximum MCC score. Of the 1297 cells, 287 are not uniquely mapped to a single bin. We used only the 1010 uniquely mapped cells in our feature selection process.
# Run Distmap to identify 3d cell locations
source(file = "R_Common/ReproducePaperResults.R")
# Identify Cell positions
source(file = "LASSO_topX_workflow/R/ExtractRNAseqLocations.R")
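The filtering described above (keep a cell only if a single bin attains its maximum MCC score) can be sketched in Python as follows. This is an illustration and not the code in `ExtractRNAseqLocations.R`; the data layout (a dictionary mapping each cell to a list of per-bin scores), the function names, and the toy values are assumptions.

```python
def best_bins(scores):
    """Return the indices of all bins that attain the maximum score."""
    top = max(scores)
    return [i for i, s in enumerate(scores) if s == top]


def uniquely_mapped(mcc_scores):
    """Map each cell to its best bin, dropping cells whose maximum is tied.

    mcc_scores: dict of cell id -> list of MCC scores, one per bin (assumed layout).
    """
    positions = {}
    for cell, scores in mcc_scores.items():
        bins = best_bins(scores)
        if len(bins) == 1:             # exactly one bin has the maximum score
            positions[cell] = bins[0]
    return positions


# Toy example: cell_b has two bins tied at the maximum, so it is dropped.
toy_scores = {
    "cell_a": [0.1, 0.7, 0.3],
    "cell_b": [0.5, 0.5, 0.2],
    "cell_c": [0.0, 0.2, 0.9],
}
positions = uniquely_mapped(toy_scores)
print(positions)
```

The sketch compares scores for exact equality, which is appropriate when ties arise from identical computed values; a tolerance would be needed if near-ties should also count as ambiguous.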
We implemented Lasso-TopX by modifying the LASSO workflow, as described in our publication, https://www.frontiersin.org/articles/10.3389/fgene.2020.612840/full.
Our code and respective documentation can be found in
LASSO_topX_workflow/R/glmnetExtensionLibrary.R
and are provided as an extension of the glmnet package.
# Reproduce our gene selection using LASSO-topX
# - To select only from the inSitu genes set: useOnlyInSitu <- TRUE.
# - To select across all genes set: useOnlyInSitu <- FALSE.
# Run feature selection process - Lasso-TopX
source(file = "LASSO_topX_workflow/R/LASSO_features_CVs_train.R")
# Plot error evolution and selected features of one of the provided Nested cross validation folds.
# Variable `fileCV` in the USER INPUT section of the script can be used to specify the results file that you want to use.
source(file = "LASSO_topX_workflow/R/Plot_Lasso_Features.R")
Please go to the sub-directory "NeuralNetworks/DuringChallenge_Subchallenge2/" and then run the three steps below.
- Run: Rscript Step1_GenTrainingDataMatrix.R
- Run: python EvalData.py
- Run: bash Step3_GenerateRankedLists.sh
- Go to directory: NeuralNetworks/PostChallenge/
- Run: python EvalData.py
- Run: bash Step3_GenerateRankedLists.sh
After selecting the most informative genes using Random, Lasso-TopX and Deep Neural Nets, we predicted the 10 locations per cell using a modified version of DistMap, as described in our publication, https://www.frontiersin.org/articles/10.3389/fgene.2020.612840/full. The modified version of DistMap uses only the cells in the training set to calculate all DistMap parameters, and predicts the cell locations in both the training and test sets.
The modified version of DistMap can be found here
R_Common/distmap/R/myDistMap.R
# Predict Locations using the modified version of DistMap
source(file = "R_Common/LocationPredictions_DistMap_TrainTest.R")
# Predict Locations using the provided binarized table
source(file = "R_Common/LocationPredictions_DistMapOnTest_ProvidedBinarizedData.R")
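Once each cell has a score for every candidate bin, reporting 10 locations per cell reduces to a top-k selection. The Python sketch below shows only that final step, assuming that a higher score means a better match; `top_k_bins` and the toy scores are hypothetical, and the sketch does not reproduce how the modified DistMap computes its scores.

```python
def top_k_bins(scores, k=10):
    """Return the indices of the k highest-scoring bins, best first."""
    # Sort bin indices by descending score and keep the first k.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy example: 12 candidate bins with distinct scores for one cell.
cell_scores = [5, 1, 9, 3, 7, 2, 8, 0, 6, 4, 11, 10]
predicted = top_k_bins(cell_scores, k=10)
print(predicted)
```

Applying the same function to every cell in the training and test sets yields one ranked list of 10 bins per cell.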
We scored our predictions using our in-house blind metric and the challenge organizers' scoring functions, which are available online at https://github.com/dream-sctc.
For details see publication - https://www.frontiersin.org/articles/10.3389/fgene.2020.612840/full
# Score predictions - blind metric
source(file = "R_Common/Evaluate_Predictions_Blind_Metric.R")
# Score predictions - organizers score functions
# Open R script "R_Common/GenerateSubmissionFiles.R"
# - Uncomment lines 9 and 10 to generate the submission files needed using the Provided Binarized table.
source(file = "R_Common/GenerateSubmissionFiles.R")
# - Uncomment lines 13 and 14 to generate the submission files needed using the location predictions from the modified DistMap.
source(file = "R_Common/GenerateSubmissionFiles.R")
# Finally, generate scores and figures for both approaches by running:
source(file = "R_Common/Evaluate_Predictions_Organizers_Scores.R")
Please use the GitHub Issues page for questions, suggestions and bug reports.