# Repository accompanying "Understanding and evaluating ambiguity in single-cell and single-nucleus RNA-sequencing"

This repository accompanies the paper:

> Understanding and evaluating ambiguity in single-cell and single-nucleus RNA-sequencing
> Dongze He, Charlotte Soneson, Rob Patro
> bioRxiv 2023.01.04.522742; doi: https://doi.org/10.1101/2023.01.04.522742
Note: This repository uses git submodules; please clone it with

```shell
git clone --recurse-submodules https://github.com/COMBINE-lab/scrna-ambiguity
cd scrna-ambiguity
```
To run the whole pipeline, please set all the paths in the `run_me.config` file, and run `run_me.sh` in a terminal by calling `. run_me.sh`. Running the classification experiment and the analysis of the two mouse brain nuclei datasets will take approximately 450 GB of disk space. If you also choose to run the analysis of the STARsolo simulation, please make sure there is another 200 GB of free space left on your disk.
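A quick way to check how much space is available on the disk that will hold the working directory is GNU `df` (run it from the directory you plan to use):

```shell
# Report the free space, in GB, on the filesystem containing the current directory.
df -BG --output=avail .
```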
We recommend running the pipeline in a conda environment. To reduce the environment-creation time, we recommend using mamba.

```shell
mamba create -n scrna-ambiguity r-essentials=4.1 r-doParallel=1.0.17 bioconductor-genomicfeatures=1.46.1 bioconductor-biostrings=2.62.0 bioconductor-bsgenome=1.62.0 r-ggplot2=3.4.0 star=2.7.10b kb-python=0.27.3 simpleaf=0.8.1 -y
conda activate scrna-ambiguity
```

If you want to use conda instead, simply replace the `mamba` in the above command with `conda`. The `conda_env.yml` file in the GitHub repository can be used to create the exact conda environment we used to run the pipeline. Note: The included conda environment was created under Linux, and, in general, the code in this repository is unlikely to work directly under other operating systems.
Some packages are not available on conda, so they need to be manually installed.
BBMap is used to simulate the sequencing reads for the classification experiment. We used BBMap 39.01 in the analysis, which can be downloaded directly from this link. The path to the decompressed folder of the downloaded file should be specified in the `run_me.config` file. To download the file and print the path to the decompressed folder, please run the following bash commands.

```shell
wget https://versaweb.dl.sourceforge.net/project/bbmap/BBMap_39.01.tar.gz && tar -xzvf BBMap_39.01.tar.gz
cd bbmap && pwd
cd ..
```
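The resulting entry in `run_me.config` would then look something like the following (the variable name and path here are illustrative; match them to the entries actually present in your `run_me.config`):

```shell
# Hypothetical example -- substitute the `pwd` output from above.
bbmap="/home/user/tools/bbmap"
```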
`empirical_splice_status` is used to obtain the matching-based true splicing status of the simulated reads in the classification experiment. `empirical_splice_status` is a Rust program; to install it, please make sure Rust is installed.

The source code of `empirical_splice_status` will be included in this repository if you followed the `git clone --recurse-submodules` command above to clone this repository recursively. Then you can run the following command without worrying about the `empirical_splice_status` variable in the `run_me.config` file.

```shell
cargo build --release --manifest-path=empirical_splice_status/Cargo.toml
```
If the repository was not cloned recursively, just clone `empirical_splice_status` before building it. If you build it somewhere else, please use the path to the `release` folder printed by the following bash commands as the `empirical_splice_status` variable in the `run_me.config` file.

```shell
rm -r empirical_splice_status
git clone https://github.com/COMBINE-lab/empirical_splice_status.git
cargo build --release --manifest-path=empirical_splice_status/Cargo.toml
cd empirical_splice_status/target/release
pwd
cd ../../..
```
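After a successful build, the corresponding `run_me.config` entry would look something like this (the path is illustrative; use the `pwd` output from above):

```shell
# Hypothetical example -- substitute the `pwd` output from above.
empirical_splice_status="/home/user/scrna-ambiguity/empirical_splice_status/target/release"
```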
`piscem` is the newest mapping tool in the alevin-fry ecosystem. It offers a more concise and memory-frugal index than any other method tested in this work. To install it, you might need to set your `CXX` and `CC` paths. Please use the path to the `piscem` executable printed by the following bash commands as the `piscem` variable in the `run_me.config` file.

```shell
git clone https://github.com/COMBINE-lab/piscem.git
cd piscem
cargo build --release
cd target/release && find ${PWD} -name "piscem"
cd ../../..
```
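If the build fails because `cargo` cannot find a C/C++ compiler, one way to set `CC` and `CXX` is to export them before building (the compiler paths below are illustrative; adjust them to your system):

```shell
# Point the build at explicit C and C++ compilers; adjust the paths to your system.
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
```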
Please use the path printed by the following bash commands as the `HSHMP_2022` variable in the `run_me.config` file. In this work, we followed the code at commit cfd6958be6ab65fc340bf1df5f1b0e77f19ff967.

```shell
git clone https://github.com/pachterlab/HSHMP_2022.git
cd HSHMP_2022
git checkout cfd6958be6ab65fc340bf1df5f1b0e77f19ff967
pwd
cd ..
```
Please follow the instructions in the `kallisto` GitHub repository to install it, and use the path to the `kallisto` executable as the `kallistod` variable in the `run_me.config` file. In this work, we used commit 26daa940c25ca1787de8aab8d9c8dda1802006ea.
Please follow the instructions in the `bustools` GitHub repository to install it, and use the path to the `bustools` executable as the `bustools` variable in the `run_me.config` file. In this work, we used commit 89a8a5ddee4ec1c3bb95dfb42e13f2d2ba86e9a1.
The `root_dir` variable in the `run_me.config` file serves as the root working directory of the whole pipeline. After specifying a path, please make sure that there is at least 450 GB of free space on the disk.
After you have set all the paths in the `run_me.config` file and made sure that there is enough free disk space, the pipeline can be run by calling the `run_me.sh` file. Notice that for all variants of alevin-fry, mapping was performed using pseudoalignment with structural constraints, corresponding to the `--sketch` flag in `salmon alevin` (likewise, this is the mapping algorithm implemented in `piscem` so far).

```shell
chmod +x *.sh
chmod +x *.R
. run_me.sh
```
The figures used in the manuscript can be generated by following the Python notebooks in the `notebooks` directory of this repository. You might need to install the Python packages used by the notebooks.

- The `classification_experiment.ipynb` notebook is used to generate Figures 2 and 3.
- The `sn_mouse_analysis_adult.ipynb` and `sn_mouse_analysis_E18.ipynb` notebooks are used to generate Figures 4, 5, 6, 7, and 8. Notice that the `sn_mouse_analysis_adult.ipynb` notebook has to be run before `sn_mouse_analysis_E18.ipynb`, as it outputs a file that the latter reads.
Notice that the filtered cellular barcode list of the E18 mouse brain nuclei dataset can be downloaded from this link. As the filtered count matrix is not provided on the dataset's webpage, the filtered barcodes were obtained by loading the Loupe Browser visualization (CLOUPE) file from the dataset's webpage into the Loupe Browser and exporting the clustering. The final barcode list file includes only the barcodes, not the cluster assignments.
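One way to strip the cluster column from the exported clustering is with `cut`, assuming Loupe exported a comma-separated file with a header row and the barcode in the first column (the file names below are illustrative):

```shell
# Keep only the barcode column (column 1) and drop the header row.
cut -d',' -f1 clusters.csv | tail -n +2 > filtered_barcodes.txt
```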
For the adult mouse nuclei dataset, the filtered barcode list TSV file can be downloaded from this link. As the filtered barcode list is included in the filtered count matrix file, Feature / cell matrix (filtered), the barcodes included in the filtered count matrix are used as the filtered barcode list.
To run the analysis of the STARsolo simulation, please comment out lines 47-49 in the `run_me.sh` file. As this experiment is not mentioned in the manuscript, this analysis is optional.
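Editing `run_me.sh` by hand works; alternatively, `sed` can toggle a leading `#` on those lines in place (this assumes the script uses `#` comment markers; back up the file first):

```shell
# Prepend '#' to lines 47-49 of run_me.sh (comment them out)...
sed -i '47,49 s/^/#/' run_me.sh
# ...or remove a leading '#' from those lines instead:
# sed -i '47,49 s/^#//' run_me.sh
```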