# Single-cell RNA-seq analysis using Python  
## Practicals 01: Raw reads to expression matrix

Adapted from:  
Single-cell best practices  
www.sc-best-practices.org

## 1. Raw data processing
The commands in this section are meant to be executed in a command-line shell (terminal).  Lines that begin with a hash/pound sign (`#`) are comments and are not meant to be run.  

Needs conda env `af`. 

### 1.1 Prepare the environment  
The environment should already have everything we need, but below is how you set up the tools.  You should have conda installed (via anaconda, miniconda, or mamba).

```
# For a LINUX machine
conda create -n af -y -c bioconda simpleaf
conda activate af
```

```
# For Apple silicon-based machine
CONDA_SUBDIR=osx-64 conda create -n af -y -c bioconda simpleaf   # create a new environment
conda activate af
conda env config vars set CONDA_SUBDIR=osx-64  # subsequent commands use intel packages

# To make your changes take effect please reactivate your environment
conda deactivate
conda activate af
```

*Dev note: Make this work, or remove `pyroe`*   
```
# It seems that pyroe is no longer a part of simpleaf, and should be installed separately.
# While inside the activated env, install pyroe via conda.
conda install -c bioconda pyroe
# Not yet sure if this will work because environment solving takes a long time and does not seem to complete
```

### 1.2 Get the data  
The data for this training are already provided.

```
# Create a working dir and go to the working directory
## The && operator helps execute two commands using a single line of code.
mkdir af_xmpl_run && cd af_xmpl_run

# Fetch the example dataset and CB permit list and decompress them
## The pipe operator (|) passes the output of the wget command to the tar command.
## The dash (-) after `tar xzf` tells tar to read the archive from standard input, i.e. the output of the wget command.
## - example dataset
wget -qO- https://umd.box.com/shared/static/lx2xownlrhz3us8496tyu9c4dgade814.gz | tar xzf - --strip-components=1 -C .
## The fetched folder containing the fastq files is called toy_read_fastq.
fastq_dir="toy_read_fastq"
## The fetched folder containing the human ref files is called toy_human_ref.
ref_dir="toy_human_ref"

# Fetch CB permit list
## the right chevron (>) redirects the STDOUT to a file.
wget -qO- https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/3M-february-2018.txt.gz | gunzip - > 3M-february-2018.txt
```

### 1.3 Generate a single-cell matrix with `simpleaf`
*Dev note: Test as is, recommend the Galaxy training here for a more detailed hands-on:*  
https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_alevin/tutorial.html

```
# simpleaf needs the environment variable ALEVIN_FRY_HOME to store configuration and data.
# For example, the paths to the underlying programs it uses and the CB permit list
mkdir alevin_fry_home && export ALEVIN_FRY_HOME='alevin_fry_home'

# The simpleaf set-paths command finds the paths to the required tools and writes a configuration JSON file in the ALEVIN_FRY_HOME folder.
simpleaf set-paths

# simpleaf index
# Usage: simpleaf index -o out_dir [-f genome_fasta -g gene_annotation_GTF|--refseq transcriptome_fasta] -r read_length -t number_of_threads
## The -r read_length is the number of sequencing cycles performed by the sequencer to generate biological reads (read2 in Illumina).
## Publicly available datasets usually have the read length in the description. Sometimes they are called the number of cycles.
simpleaf index \
-o simpleaf_index \
-f toy_human_ref/fasta/genome.fa \
-g toy_human_ref/genes/genes.gtf \
-r 90 \
-t 8
```

The `simpleaf index` command above should finish in a few seconds.  
Its standard output should look something like the following.  

```
2023-08-10T07:25:52.640381Z  INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
2023-08-10T07:25:52.647140Z  INFO grangers::reader::gtf: Finished parsing the input file. Found 3 comments and 2439 records.
2023-08-10T07:25:52.647874Z  INFO roers: Built the Grangers object for 2439 records
2023-08-10T07:25:52.652823Z  WARN grangers::grangers_info: The exon_number column contains null values. Will compute the exon number from exon start position .
2023-08-10T07:25:52.656748Z  INFO roers: Proceed 2148 exon records from 271 transcripts
2023-08-10T07:25:52.850408Z  INFO roers: Processing 1877 intronic records
2023-08-10T07:25:53.088741Z  INFO roers: Done!
2023-08-10T07:25:53.088977Z  INFO simpleaf::simpleaf_commands::indexing: salmon index cmd : /opt/anaconda3/envs/af/bin/salmon index -k 31 -i simpleaf_index/index -t simpleaf_index/ref/roers_ref.fa --threads 8
2023-08-10T07:25:54.089091Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
```

```
# Collecting sequencing read files
## The reads1 and reads2 variables are defined by finding the filenames with the pattern "_R1_" and "_R2_" from the toy_read_fastq directory.
reads1_pat="_R1_"
reads2_pat="_R2_"

## The read files must be sorted and separated by a comma.
### The find command finds the files in the fastq_dir with the name pattern
### The sort command sorts the file names
### The awk command and the paste command together convert the file names into a comma-separated string.
reads1="$(find -L ${fastq_dir} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd,)"
reads2="$(find -L ${fastq_dir} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd,)"
### If the above doesn't work, try the version below (a trailing - is added to the paste command, which non-GNU paste on macOS may require)
reads1="$(find -L ${fastq_dir} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
reads2="$(find -L ${fastq_dir} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"

# simpleaf quant
## Usage: simpleaf quant -c chemistry -t threads -1 reads1 -2 reads2 -i index -u [unspliced permit list] -r resolution -m t2g_3col -o output_dir
simpleaf quant \
-c 10xv3 -t 8 \
-1 $reads1 -2 $reads2 \
-i simpleaf_index/index \
-u -r cr-like \
-m simpleaf_index/index/t2g_3col.tsv \
-o simpleaf_quant
```
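
The `find | sort | paste` pipeline above can also be reproduced in Python, which sidesteps the GNU versus non-GNU `paste` difference. The following is a minimal sketch, not part of the original workflow: the helper name `collect_reads` and the toy file names are hypothetical, and symlink handling may differ from `find -L`.

```python
import tempfile
from pathlib import Path


def collect_reads(fastq_dir, pattern, sep=","):
    """Return the sorted paths of files whose name contains `pattern`, joined by `sep`."""
    files = sorted(
        str(p) for p in Path(fastq_dir).rglob(f"*{pattern}*") if p.is_file()
    )
    return sep.join(files)


# Toy directory standing in for toy_read_fastq (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s_L002_R1_001.fastq", "s_L001_R1_001.fastq", "s_L001_R2_001.fastq"):
        (Path(tmp) / name).touch()
    reads1 = collect_reads(tmp, "_R1_")
    reads2 = collect_reads(tmp, "_R2_")
    # Keep only the file names so the result does not depend on the temp path.
    reads1_names = [Path(p).name for p in reads1.split(",")]
    reads2_names = [Path(p).name for p in reads2.split(",")]

print(reads1_names)
print(reads2_names)
```

Passing `sep=" "` would give the space-separated form instead of the comma-separated one.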

The `simpleaf quant` command above should take a few seconds to run.  
Its standard output should look something like the following.

```
2023-08-10T08:52:33.893470Z  INFO simpleaf::simpleaf_commands::quant: salmon alevin cmd : /opt/anaconda3/envs/af/bin/salmon alevin --index simpleaf_index/index -l A -1 toy_read_fastq/selected_R1_reads.fastq -2 toy_read_fastq/selected_R2_reads.fastq --chromiumV3 --threads 8 -o simpleaf_quant/af_map --sketch
2023-08-10T08:52:35.061947Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-08-10T08:52:35.062094Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry generate-permit-list cmd : /opt/anaconda3/envs/af/bin/alevin-fry generate-permit-list -i simpleaf_quant/af_map -d fw --unfiltered-pl alevin_fry_home/plist/10x_v3_permit.txt --min-reads 10 -o simpleaf_quant/af_quant
2023-08-10T08:52:37.178490Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-08-10T08:52:37.178539Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry collate cmd : /opt/anaconda3/envs/af/bin/alevin-fry collate -i simpleaf_quant/af_quant -r simpleaf_quant/af_map -t 8
2023-08-10T08:52:37.196041Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-08-10T08:52:37.196077Z  INFO simpleaf::simpleaf_commands::quant: cmd : "/opt/anaconda3/envs/af/bin/alevin-fry" "quant" "-i" "simpleaf_quant/af_quant" "-o" "simpleaf_quant/af_quant" "-t" "8" "-m" "simpleaf_index/index/t2g_3col.tsv" "-r" "cr-like"
2023-08-10T08:52:37.213916Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
```

View the resulting files. Commands are preceded by `$` (the dollar sign should not be copied), and results are shown right below the commands.

```
# Each line in `quants_mat.mtx` represents
# a non-zero entry in the format row column entry
# values you see may be different
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat.mtx
138 58 1
139 9 1
139 37 1

# Each line in `quants_mat_cols.txt` is a splice status
# of a gene in the format (gene name)-(splice status)
# values you see may be different
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat_cols.txt
ENSG00000120705-A
ENSG00000198961-A
ENSG00000245526-A

# Each line in `quants_mat_rows.txt` is a corrected
# (and, potentially, filtered) cell barcode
# values you see may be different
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat_rows.txt
TTCGATTTCTGAATCG
TGCTCGTGTTCGAAGG
ACTGTGAAGAAATTGC
```
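
To make the three file formats above concrete, here is a small sketch that parses them with plain Python. It assumes the standard Matrix Market coordinate layout (a `%%` banner, one size line, then 1-based `row column value` triples). The helper names are hypothetical, the toy header dimensions are invented for illustration, and treating a name without a `-U`/`-A` suffix as spliced is our assumption about the naming scheme.

```python
def parse_mtx_entries(lines):
    """Parse Matrix Market coordinate lines into {(row, col): value} with 0-based indices."""
    entries = {}
    size_line_seen = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and the banner/comments
        if not size_line_seen:
            size_line_seen = True  # first data line holds: n_rows n_cols n_nonzero
            continue
        row, col, value = line.split()
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    return entries


def split_feature_name(name):
    """Split 'GENE-A' style names into (gene, status); no suffix is assumed to mean spliced."""
    gene, sep, status = name.rpartition("-")
    if sep and status in {"U", "A"}:
        return gene, status
    return name, "S"


# Toy input: the size line is hypothetical, the triples are taken from the listing above.
mtx_text = """%%MatrixMarket matrix coordinate real general
139 60 3
138 58 1
139 9 1
139 37 1
"""
entries = parse_mtx_entries(mtx_text.splitlines())
print(entries[(138, 8)])  # entry for row 139 (a barcode), column 9 (a feature)
print(split_feature_name("ENSG00000120705-A"))
```

Rows index the barcodes in `quants_mat_rows.txt` and columns index the features in `quants_mat_cols.txt`.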

*Dev note: Check if this is necessary. If not and pyroe cannot be installed, scrap below*

In [None]:
# import pyroe

# quant_dir = 'simpleaf_quant/af_quant'
# adata_sa = pyroe.load_fry(quant_dir)

In [None]:
# import pyroe

# quant_dir = 'simpleaf_quant/af_quant'
# adata_usa = pyroe.load_fry(quant_dir, output_format={'X' : ['U','S','A']})

### 1.4 Generate a single-cell matrix with the `alevin-fry` pipeline

#### 1.4.1 Using the SC book method

Build the reference index

```
# make splici reference
## Usage: pyroe make-splici genome_file gtf_file read_length out_dir
## The read_length is the number of sequencing cycles performed by the sequencer. Ask your technician if you are not sure about it.
## Publicly available datasets usually have the read length in the description.
pyroe make-splici \
${ref_dir}/fasta/genome.fa \
${ref_dir}/genes/genes.gtf \
90 \
splici_rl90_ref

# Index the reference
## Usage: salmon index -t extend_txome.fa -i idx_out_dir -p num_threads
## The $() expression runs the command inside and puts the output in place.
## Please ensure that only one file ends with ".fa" in the `splici_rl90_ref` folder.
salmon index \
-t $(ls splici_rl90_ref/*\.fa) \
-i salmon_index \
-p 8
```

Each of the commands above should take only seconds to run.  
Note: running them required switching between conda envs, because `pyroe` is not installed alongside `simpleaf` via conda.

Mapping and quantification

```
# Collect FASTQ files
## The filenames are sorted and separated by space.
reads1="$(find -L $fastq_dir -name "*$reads1_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ')"
reads2="$(find -L $fastq_dir -name "*$reads2_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ')"
## Non-GNU paste (OSX)
reads1="$(find -L $fastq_dir -name "*$reads1_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ' -)"
reads2="$(find -L $fastq_dir -name "*$reads2_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ' -)"

# Mapping
## Usage: salmon alevin -i index_dir -l library_type -1 reads1_files -2 reads2_files -p num_threads -o output_dir
## The variable reads1 and reads2 defined above are passed in using ${}.
salmon alevin \
-i salmon_index \
-l ISR \
-1 ${reads1} \
-2 ${reads2} \
-p 8 \
-o salmon_alevin \
--chromiumV3 \
--sketch
```

```
# Cell barcode correction
## Usage: alevin-fry generate-permit-list -u CB_permit_list -d expected_orientation -i alevin_map_dir -o gpl_out_dir
## Here, the reads that map to the reverse complement strand of transcripts are filtered out by specifying `-d fw`.
alevin-fry generate-permit-list \
-u 3M-february-2018.txt \
-d fw \
-i salmon_alevin \
-o alevin_fry_gpl
```

Output of generating permit list

```
2023-08-11 10:55:18 INFO number of unfiltered bcs read = 6,794,880
2023-08-11 10:55:18 INFO paired : false, ref_count : 337, num_chunks : 7
2023-08-11 10:55:18 INFO read 2 file-level tags
2023-08-11 10:55:18 INFO read 2 read-level tags
2023-08-11 10:55:18 INFO read 1 alignemnt-level tags
2023-08-11 10:55:18 INFO File-level tag values FileTags { bclen: 16, umilen: 12 }
2023-08-11 10:55:18 INFO observed 33,206 reads (18,985 orientation consistent) in 7 chunks --- max ambiguity read occurs in 24 refs
2023-08-11 10:55:18 INFO minimum num reads for barcode pass = 10
2023-08-11 10:55:18 INFO num_passing = 139
2023-08-11 10:55:18 INFO found 139 cells with non-trivial number of reads by exact barcode match
2023-08-11 10:55:18 INFO There were 893 distinct unmatched barcodes, and 151 that can be recovered
2023-08-11 10:55:18 INFO Matching unmatched barcodes to retained barcodes took 175.209µs
2023-08-11 10:55:18 INFO Of the unmatched barcodes
============
2023-08-11 10:55:18 INFO 	195 had exactly 1 single-edit neighbor in the retained list
2023-08-11 10:55:18 INFO 	49 had >1 single-edit neighbor in the retained list
2023-08-11 10:55:18 INFO 	1,586 had no neighbor in the retained list
2023-08-11 10:55:18 INFO total number of distinct corrected barcodes : 151
```
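
The log above reports how many unmatched barcodes have a single-edit neighbor in the retained list. The sketch below illustrates that idea on toy 4-base barcodes. It is not alevin-fry's implementation: it assumes that "single edit" means one substitution and that barcodes with zero or several candidate neighbors are dropped, and both function names are hypothetical.

```python
def hamming_neighbors(barcode, alphabet="ACGT"):
    """Yield every barcode that differs from `barcode` by exactly one substitution."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def correct_barcodes(unmatched, retained):
    """Map each unmatched barcode to its unique single-substitution neighbor, if any."""
    retained = set(retained)
    corrections = {}
    for bc in unmatched:
        hits = {nb for nb in hamming_neighbors(bc) if nb in retained}
        if len(hits) == 1:  # ambiguous (>1 hit) and orphan (0 hits) barcodes are dropped
            corrections[bc] = hits.pop()
    return corrections


retained = ["AAAA", "CCCC", "AAAC"]
unmatched = ["AAAT", "CCCG", "GGGG"]
corrections = correct_barcodes(unmatched, retained)
print(corrections)
```

Here `AAAT` is one substitution away from both `AAAA` and `AAAC`, so it is ambiguous, and `GGGG` has no neighbor at all; only `CCCG` is corrected.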

```
# Filter mapping information
## Usage: alevin-fry collate -i gpl_out_dir -r alevin_map_dir -t num_threads
alevin-fry collate \
-i alevin_fry_gpl \
-r salmon_alevin \
-t 8
```

Output for filter mapping:  

```
2023-08-11 10:56:58 INFO filter_type = Unfiltered
2023-08-11 10:56:58 INFO collated rad file will not be compressed
2023-08-11 10:56:58 INFO paired : false, ref_count : 337, num_chunks : 7, expected_ori : Forward
2023-08-11 10:56:58 INFO read 2 file-level tags
2023-08-11 10:56:58 INFO read 2 read-level tags
2023-08-11 10:56:58 INFO read 1 alignemnt-level tags
2023-08-11 10:56:58 INFO File-level tag values FileTags { bclen: 16, umilen: 12 }
2023-08-11 10:56:58 INFO deserialized correction map of length : 290
2023-08-11 10:56:58 INFO Generated 1 temporary buckets.
  [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]       7/7       partitioned records into temporary files.
  [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]       1/1       gathered all temp files.
2023-08-11 10:56:58 INFO writing num output chunks (139) to header
2023-08-11 10:56:58 INFO expected number of output chunks 139
2023-08-11 10:56:58 INFO finished collating input rad file "salmon_alevin/map.rad".
```

```
# UMI resolution + quantification
## Usage: alevin-fry quant -r resolution -m txp_to_gene_mapping -i gpl_out_dir -o quant_out_dir -t num_threads
## The file ending with `3col.tsv` in the `splici_rl90_ref` folder is passed to the -m argument.
## Please ensure that there is only one such file in the `splici_rl90_ref` folder.
alevin-fry quant -r cr-like \
-m $(ls splici_rl90_ref/*3col.tsv) \
-i alevin_fry_gpl \
-o alevin_fry_quant \
-t 8
```

Output of above command:  

```
2023-08-11 10:57:48 INFO quantifying from uncompressed, collated RAD file File { fd: 4, path: "/Users/irisyu/Library/CloudStorage/GoogleDrive-irisyu@ebi.ac.uk/My Drive/Notes/Notebooks/trainings/scrnaseq_python_2023/af_xmpl_run/alevin_fry_gpl/map.collated.rad", read: true, write: false }
2023-08-11 10:57:48 INFO paired : false, ref_count : 337, num_chunks : 139
2023-08-11 10:57:48 INFO tg-map contained 20 genes mapping to 337 transcripts.
2023-08-11 10:57:48 INFO read 2 file-level tags
2023-08-11 10:57:48 INFO read 2 read-level tags
2023-08-11 10:57:48 INFO read 1 alignemnt-level tags
2023-08-11 10:57:48 INFO File-level tag values FileTags { bclen: 16, umilen: 12 }
  [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]     139/139     finished quantifying 139 cells.
2023-08-11 10:57:48 INFO processed 17,350 total read records
```
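
The `-r cr-like` option used above selects the UMI resolution strategy. As a rough sketch of the idea, assuming the rule as we understand it from the alevin-fry documentation: within one cell, a UMI seen with a single gene is assigned to that gene, a UMI seen with several genes goes to the gene with the most reads, and ties are discarded. The function name and the toy records are hypothetical, and reads that map to more than one gene are left out for simplicity.

```python
from collections import Counter, defaultdict


def resolve_umis_cr_like(records):
    """records: iterable of (umi, gene) pairs for ONE cell barcode.

    Returns {gene: number of distinct UMIs assigned to it}.
    """
    reads_per_umi = defaultdict(Counter)
    for umi, gene in records:
        reads_per_umi[umi][gene] += 1

    gene_counts = Counter()
    for umi, genes in reads_per_umi.items():
        ranked = genes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie for the best-supported gene: discard this UMI
        gene_counts[ranked[0][0]] += 1
    return dict(gene_counts)


records = [
    ("U1", "G1"), ("U1", "G1"), ("U1", "G2"),  # U1: G1 wins with 2 reads vs 1
    ("U2", "G2"),                              # U2: only seen with G2
    ("U3", "G1"), ("U3", "G2"),                # U3: tie, discarded
]
umi_counts = resolve_umis_cr_like(records)
print(umi_counts)
```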

#### 1.4.2 Using the Atlas pipeline  
Try and replicate the Atlas pipeline here. Below are the references:  
[Tutorial](https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_alevin/tutorial.html)  
[Galaxy workflow](https://humancellatlas.usegalaxy.eu/published/workflow?id=bd7bd54eaf932585)

#### Convert from mtx to 10x  
Below is a script that also uses pyroe in USA mode (use the conda env `pyroe` or `parse_alevin_fry` on codon to run it).
On codon, the script has been copied to `/homes/irisyu/training/scrnaseq_python_2023/bin`; add that directory to your `PATH`:

`export PATH=/homes/irisyu/training/scrnaseq_python_2023/bin:$PATH`

```
(pyroe) % alevinFryMtxTo10x.py alevin_fry_quant alevin_fry_parsed single_cell 
USA mode: True
Using pre-defined output format: scrna
Will populate output field X with sum of counts frorm ['S', 'A'].
Will combine ['U'] into output layer unspliced.
```
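
The script output above says that `X` is populated with the sum of the spliced (`S`) and ambiguous (`A`) counts, while the unspliced (`U`) counts go into a separate layer. A minimal sketch of that combination for a single cell follows. The function name and the dictionary layout are hypothetical and chosen for readability; they are not how pyroe stores the data.

```python
def combine_usa(counts_by_gene):
    """counts_by_gene: {gene: {"U": n, "S": n, "A": n}} for one cell.

    Returns (x, unspliced) where x = S + A and unspliced = U, per gene.
    """
    x, unspliced = {}, {}
    for gene, c in counts_by_gene.items():
        x[gene] = c.get("S", 0) + c.get("A", 0)
        unspliced[gene] = c.get("U", 0)
    return x, unspliced


# Toy counts for one cell (values are made up).
cell = {
    "ENSG00000120705": {"U": 2, "S": 5, "A": 1},
    "ENSG00000198961": {"U": 0, "S": 0, "A": 3},
}
x, unspliced = combine_usa(cell)
print(x)
print(unspliced)
```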

#### Remove empty drops  

Below, use conda env with `bioconductor-dropletutils` and `dropletutils-scripts`.  In codon, the scripts `/nfs/production/irene/ma/fg_atlas_sc/nextflow_scxa_test/envs/dropletutils-a2f6101409ff7f11d8d66c59f1f32ead/bin/dropletutils-*.R` are copied to `/homes/irisyu/training/scrnaseq_python_2023/bin` (add to PATH).  The scripts should also be in the dropletutils-scripts env bin.  

```
(dropletutils)[fg_atlas@hl-codon-101-03 af_xmpl_run]$ dropletutils-read-10x-counts.R -s alevin_fry_parsed -c TRUE -o alevin_fry_parsed/matrix.rds
...
# Object summary
class: SingleCellExperiment
dim: 20 139
metadata(0):
assays(1): counts
rownames(20): ENSG00000131507 ENSG00000131508 ... ENSG00000198961                                        
  ENSG00000245526
rowData names(2): ID Symbol
colnames(139): ACTTTCAAGATCACCT GTGGAGACAATTAGGA ... TGCTCGTGTTCGAAGG                                    
  ACTGTGAAGAAATTGC
colData names(2): Sample Barcode
reducedDimNames(0):
spikeNames(0):

# Metadata sample
DataFrame with 6 rows and 2 columns
                            Sample          Barcode
                       <character>      <character>
ACTTTCAAGATCACCT alevin_fry_parsed ACTTTCAAGATCACCT
GTGGAGACAATTAGGA alevin_fry_parsed GTGGAGACAATTAGGA
GTGTGGCGTAGTGTGG alevin_fry_parsed GTGTGGCGTAGTGTGG
GTGGAGATCTTCCTAA alevin_fry_parsed GTGGAGATCTTCCTAA
TTACAGGAGCTCTGTA alevin_fry_parsed TTACAGGAGCTCTGTA
AGACACTTCGACGCGT alevin_fry_parsed AGACACTTCGACGCGT


(dropletutils)[fg_atlas@hl-codon-101-03 af_xmpl_run]$ dropletutils-empty-drops.R -i alevin_fry_parsed/matrix.rds --lower 5 --niters 1000 --filter-empty TRUE --filter-fdr 0.01 -o empty_drops/nonempty.rds -t empty_drops/nonempty.txt
...
At an FDR of 0.01, estimate that 34 barcodes have cells.                                                 
Will filter to 34 barcodes.

Parameter values:
             value
lower            5
niters        1000
test_ambient FALSE
filter_empty  TRUE
filter_fdr    0.01
```

#### Convert from rds to h5ad using sceasy

Below, use conda env with `r-sceasy` (e.g., `sc_py_training`).  

```
(sc_py_training)[fg_atlas@codon-slurm-login-02 af_xmpl_run]$ Rscript -e 'library(sceasy)' \
> -e 'sce <- readRDS("empty_drops/nonempty.rds")' \
> -e 'sceasy::convertFormat(sce, outFile="empty_drops/nonempty.h5ad", from="sce", to="anndata", main_layer="counts")' \
> -e 'print(sce)'
...
Loading required package: reticulate
Loading required namespace: SingleCellExperiment
/hps/nobackup/irene/ma/isl/envs_test/sc_py_training/lib/R/library/reticulate/python/rpytools/loader.py:117: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace.
  return _find_and_load(name, import_)
/hps/software/users/ma/service/isl/conda/config/envs/r-anndata/lib/python3.11/site-packages/anndata/core/anndata.py:1328: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead.
  and not is_categorical(df[key])
/hps/software/users/ma/service/isl/conda/config/envs/r-anndata/lib/python3.11/site-packages/anndata/core/anndata.py:1328: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead.
  and not is_categorical(df[key])
/hps/software/users/ma/service/isl/conda/config/envs/r-anndata/lib/python3.11/site-packages/anndata/core/anndata.py:101: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead.
  if is_string_dtype(df[k]) and not is_categorical(df[k]):
/hps/software/users/ma/service/isl/conda/config/envs/r-anndata/lib/python3.11/site-packages/anndata/core/anndata.py:108: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead.
  elif is_categorical(df[k]):
AnnData object with n_obs × n_vars = 34 × 20
    obs: 'Barcode', 'emptyTotal', 'emptyLogProb', 'emptyPValue', 'emptyLimited', 'emptyFDR'              
    var: 'ID', 'Symbol'
Warning message:
In .regularise_df(as.data.frame(SummarizedExperiment::colData(obj)),  :                                  
  Dropping single category variables:Sample
class: SingleCellExperiment
dim: 20 34
metadata(0):
assays(1): counts
rownames(20): ENSG00000131507 ENSG00000131508 ... ENSG00000198961                                        
  ENSG00000245526
rowData names(2): ID Symbol
colnames(34): TCACGCTAGTCCCAAT TGTCCTGCAAGTCGTT ... CAGCAGCGTACGGGAT                                     
  CACAGATCACTAAACC
colData names(7): Sample Barcode ... emptyLimited emptyFDR                                               
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
```

#### Questions  
How many cell barcodes are left after the emptyDrops treatment?  

### Exercises
Answer the Questions in each section above. 

### 2.  Quality Control  
Activate `sc_py_training` with the following before opening this notebook.  
```
conda activate sc_py_training
```

#### 2.1 Filtering low quality cells

In [None]:
import numpy as np
import scanpy as sc
import seaborn as sns
from scipy.stats import median_abs_deviation

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    frameon=False,
)

In [None]:
adata = sc.read_10x_h5(
    filename="../../training_data/filtered_feature_bc_matrix.h5",
    backup_url="https://figshare.com/ndownloader/files/39546196",
)
adata

In [None]:
adata.var_names_make_unique()
adata

The dataset has 16,934 barcodes and 36,601 transcripts.

Below, we add Boolean columns to `adata.var` that flag whether each variable comes from a mitochondrial, ribosomal, or hemoglobin gene.
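
As a self-contained illustration of these flags (separate from the notebook cell that follows), the sketch below applies the same prefix and regular-expression tests to a handful of gene symbols with plain Python. The gene list is a made-up example. Note that the `MT-` prefix follows human gene naming, and that the character class `[^(P)]` excludes names whose third character is `P`, `(` or `)`.

```python
import re

gene_names = ["MT-CO1", "RPS6", "RPL13", "HBB", "HBP1", "ACTB"]
hb_pattern = re.compile(r"^HB[^(P)]")  # same pattern as in the notebook cell

flags = {
    name: {
        "mt": name.startswith("MT-"),
        "ribo": name.startswith(("RPS", "RPL")),  # a tuple tests several prefixes
        "hb": bool(hb_pattern.search(name)),
    }
    for name in gene_names
}

for name, f in flags.items():
    print(name, f)
```

`HBB` is flagged as a hemoglobin gene, whereas `HBP1` is not, because its third character is `P`.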

In [None]:
# mitochondrial genes
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# hemoglobin genes.
adata.var["hb"] = adata.var_names.str.contains(("^HB[^(P)]"))
adata

Below, we calculate QC metrics with `sc.pp.calculate_qc_metrics`. The per-barcode metrics are added to the `adata.obs` dataframe and the per-gene metrics to `adata.var`.
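
To show what a few of the resulting `adata.obs` columns mean, here is a toy calculation on a two-cell, three-gene count table in plain Python. The counts and cell names are invented. The keys mirror scanpy's column names, but this is only a sketch of the arithmetic, not scanpy's implementation.

```python
import math

genes = ["MT-CO1", "ACTB", "GAPDH"]
counts = {"cell_a": [10, 30, 60], "cell_b": [0, 5, 0]}  # raw counts per gene
is_mt = [g.startswith("MT-") for g in genes]

qc = {}
for cell, row in counts.items():
    total = sum(row)
    mt_total = sum(v for v, mt in zip(row, is_mt) if mt)
    qc[cell] = {
        "total_counts": total,                          # library size
        "n_genes_by_counts": sum(v > 0 for v in row),   # genes with at least one count
        "pct_counts_mt": 100 * mt_total / total if total else 0.0,
        "log1p_total_counts": math.log1p(total),
    }

for cell, metrics in qc.items():
    print(cell, metrics)
```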

In [None]:
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo", "hb"], inplace=True, percent_top=[20], log1p=True
)
adata

In [None]:
adata.obs.columns

In [None]:
# Distribution of barcodes (cells) by total counts per barcode (cell)
p1 = sns.displot(adata.obs["total_counts"], bins=100, kde=False)
# The same distribution viewed as a violin plot:
# sc.pl.violin(adata, 'total_counts')

# Distribution of barcodes (cells) in terms of the %mt of counts per barcode (cell)
p2 = sc.pl.violin(adata, "pct_counts_mt")

# A scatter representing total_counts (x), n_genes_by_counts (y), and %mt of counts (color) per barcode (cell)
p3 = sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")

The cell below defines a function that checks whether an observation is an outlier, i.e. whether its metric lies more than `nmads` median absolute deviations (MADs) from the median.  The output is a series of Booleans.
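
As a worked toy example of this rule, independent of the notebook cell that follows, the sketch below computes the median and the unscaled MAD with the standard library and flags values outside `median ± nmads * MAD`. The helper names and the data are hypothetical.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation."""
    m = median(values)
    return median(abs(v - m) for v in values)


def is_outlier_toy(values, nmads):
    """Flag values more than `nmads` MADs below or above the median."""
    m, d = median(values), mad(values)
    return [(v < m - nmads * d) or (v > m + nmads * d) for v in values]


values = [10, 11, 12, 13, 14, 100]
flags = is_outlier_toy(values, nmads=5)
print(median(values), mad(values))  # 12.5 and 1.5, so the bounds are 5.0 and 20.0
print(flags)
```

Only the value 100 falls outside the interval from 5 to 20, so it is the only one flagged.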

In [None]:
def is_outlier(adata, metric: str, nmads: int):
    M = adata.obs[metric]
    outlier = (M < np.median(M) - nmads * median_abs_deviation(M)) | (
        np.median(M) + nmads * median_abs_deviation(M) < M
    )
    return outlier

Below, `|` is the element-wise (bitwise) OR operator, which combines three series of Booleans, one per metric.  If an observation (barcode) is an outlier in any of the three metrics, that barcode is tagged as an outlier.

In [None]:
adata.obs["outlier"] = (
    is_outlier(adata, "log1p_total_counts", 5)
    | is_outlier(adata, "log1p_n_genes_by_counts", 5)
    | is_outlier(adata, "pct_counts_in_top_20_genes", 5)
)
adata.obs.outlier.value_counts()

Below, observations (barcodes) are tagged as mitochondrial outliers if `pct_counts_mt` deviates from the median by more than 3 MADs or exceeds 8%.

In [None]:
adata.obs["mt_outlier"] = is_outlier(adata, "pct_counts_mt", 3) | (
    adata.obs["pct_counts_mt"] > 8
)
adata.obs.mt_outlier.value_counts()

Use outlier information to filter barcodes (cells).  Below, we use `~` which is a bitwise `not`, an inverse boolean mask (`True` to `False` and vice versa).

In [None]:
print(f"Total number of cells: {adata.n_obs}")
adata = adata[(~adata.obs.outlier) & (~adata.obs.mt_outlier)].copy()
print(f"Number of cells after filtering of low quality cells: {adata.n_obs}")

In [None]:
p4 = sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")

*Dev note: If needed, save the AnnData object to an `.h5ad` file and switch to a conda env that can run the remaining steps.*

#### 2.2 Correction of ambient RNA 

In [None]:
import anndata2ri
import logging

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
%%R
library(SoupX)

In [None]:
adata_pp = adata.copy()
sc.pp.normalize_per_cell(adata_pp)
sc.pp.log1p(adata_pp)

In [None]:
sc.pp.pca(adata_pp)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added="soupx_groups")

# Preprocess variables for SoupX
soupx_groups = adata_pp.obs["soupx_groups"]

In [None]:
del adata_pp

In [None]:
cells = adata.obs_names
genes = adata.var_names
data = adata.X.T

In [None]:
adata_raw = sc.read_10x_h5(
    filename="raw_feature_bc_matrix.h5",
    backup_url="https://figshare.com/ndownloader/files/39546217",
)
adata_raw.var_names_make_unique()
data_tod = adata_raw.X.T

In [None]:
del adata_raw

In [None]:
%%R -i data -i data_tod -i genes -i cells -i soupx_groups -o out 

# specify row and column names of data
rownames(data) = genes
colnames(data) = cells
# ensure correct sparse format for table of counts and table of droplets
data <- as(data, "sparseMatrix")
data_tod <- as(data_tod, "sparseMatrix")

# Generate SoupChannel Object for SoupX 
sc = SoupChannel(data_tod, data, calcSoupProfile = FALSE)

# Add extra meta data to the SoupChannel object
soupProf = data.frame(row.names = rownames(data), est = rowSums(data)/sum(data), counts = rowSums(data))
sc = setSoupProfile(sc, soupProf)
# Set cluster information in SoupChannel
sc = setClusters(sc, soupx_groups)

# Estimate contamination fraction
sc  = autoEstCont(sc, doPlot=FALSE)
# Infer corrected table of counts and round to integer
out = adjustCounts(sc, roundToInt = TRUE)

In [None]:
adata.layers["counts"] = adata.X
adata.layers["soupX_counts"] = out.T
adata.X = adata.layers["soupX_counts"]

In [None]:
print(f"Total number of genes: {adata.n_vars}")

# Min 20 cells - filters out 0 count genes
sc.pp.filter_genes(adata, min_cells=20)
print(f"Number of genes after cell filter: {adata.n_vars}")

Note: The number of genes reported above is not the same as in the book.

In [None]:
# Save the data so that the steps above need not be repeated if the kernel stops
adata.write("../../training_data/s4d8_corrected_ambient_RNA.h5ad")

#### 2.3 Doublet Detection  

In [None]:
%%R
library(Seurat)
library(scater)
library(scDblFinder)
library(BiocParallel)

In [None]:
data_mat = adata.X.T

In [None]:
%%R -i data_mat -o doublet_score -o doublet_class

set.seed(123)
sce = scDblFinder(
    SingleCellExperiment(
        list(counts=data_mat),
    ) 
)
doublet_score = sce$scDblFinder.score
doublet_class = sce$scDblFinder.class

In [None]:
adata.obs["scDblFinder_score"] = doublet_score
adata.obs["scDblFinder_class"] = doublet_class
adata.obs.scDblFinder_class.value_counts()

In [None]:
adata.write("../../training_data/s4d8_quality_control.h5ad")

### 3. Normalization

In [None]:
import scanpy as sc
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import anndata2ri
import logging
from scipy.sparse import issparse

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    # color_map="YlGnBu",
    frameon=False,
)

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
adata = sc.read(
    filename="../../training_data/s4d8_quality_control.h5ad",
    backup_url="https://figshare.com/ndownloader/files/40014331",
)

In [None]:
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False)

#### 3.1 Shifted logarithm
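
The shifted logarithm scales each cell's counts to a common depth and then applies `log(1 + x)`. With `target_sum=None`, `sc.pp.normalize_total` uses the median of the total counts per cell as the target depth. The sketch below reproduces that arithmetic on a toy table with plain Python. The counts are invented, and cells with zero total counts are assumed to have been filtered out already.

```python
import math
from statistics import median

counts = {"cell_a": [1, 3, 6], "cell_b": [10, 30, 60], "cell_c": [0, 50, 50]}

totals = {cell: sum(row) for cell, row in counts.items()}
target = median(totals.values())  # median count depth, as with target_sum=None

log1p_norm = {
    cell: [math.log1p(v * target / totals[cell]) for v in row]
    for cell, row in counts.items()
}

print(target)
print(log1p_norm["cell_a"])
print(log1p_norm["cell_b"])
```

`cell_a` and `cell_b` have the same composition at different depths, so they end up with identical normalized values.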

In [None]:
scales_counts = sc.pp.normalize_total(adata, target_sum=None, inplace=False)
# log1p transform
adata.layers["log1p_norm"] = sc.pp.log1p(scales_counts["X"], copy=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
p2 = sns.histplot(adata.layers["log1p_norm"].sum(1), bins=100, kde=False, ax=axes[1])
axes[1].set_title("Shifted logarithm")
plt.show()

In [None]:
from scipy.sparse import csr_matrix, issparse

In [None]:
%%R
library(scran)
library(BiocParallel)

In [None]:
# Preliminary clustering for differentiated normalisation
adata_pp = adata.copy()
sc.pp.normalize_total(adata_pp)
sc.pp.log1p(adata_pp)
sc.pp.pca(adata_pp, n_comps=15)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added="groups")

In [None]:
data_mat = adata_pp.X.T
# convert to CSC if possible. See https://github.com/MarioniLab/scran/issues/70
if issparse(data_mat):
    if data_mat.nnz > 2**31 - 1:
        data_mat = data_mat.tocoo()
    else:
        data_mat = data_mat.tocsc()
ro.globalenv["data_mat"] = data_mat
ro.globalenv["input_groups"] = adata_pp.obs["groups"]

In [None]:
del adata_pp

In [None]:
%%R -o size_factors

size_factors = sizeFactors(
    computeSumFactors(
        SingleCellExperiment(
            list(counts=data_mat)), 
            clusters = input_groups,
            min.mean = 0.1,
            BPPARAM = MulticoreParam()
    )
)

In [None]:
adata.obs["size_factors"] = size_factors
scran = adata.X / adata.obs["size_factors"].values[:, None]
adata.layers["scran_normalization"] = csr_matrix(sc.pp.log1p(scran))

In [None]:
adata.write("../../training_data/s4d8_log1p_normalization.h5ad")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
p2 = sns.histplot(
    adata.layers["scran_normalization"].sum(1), bins=100, kde=False, ax=axes[1]
)
axes[1].set_title("log1p with Scran estimated size factors")
plt.show()

#### 3.2 Analytic Pearson Residuals

In [1]:
# Run this if kernel stops below
# adata = sc.read("../../training_data/s4d8_log1p_normalization.h5ad")

The cell below is memory-intensive.
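
The notebook cell that follows computes analytic Pearson residuals with scanpy. As a sketch of the underlying formula, assuming the usual definition: the expected count is `mu = cell_sum * gene_sum / total`, the residual is `(x - mu) / sqrt(mu + mu**2 / theta)`, and residuals are clipped to `±sqrt(n_cells)`. The value `theta=100` matches what we understand to be scanpy's default. The function below is a toy, dense, pure-Python version for illustration only, and it assumes every gene and cell has at least one count.

```python
import math


def pearson_residuals(matrix, theta=100.0):
    """matrix: list of rows (cells) by columns (genes) of raw counts."""
    n_cells = len(matrix)
    cell_sums = [sum(row) for row in matrix]
    gene_sums = [sum(col) for col in zip(*matrix)]
    total = sum(cell_sums)
    clip = math.sqrt(n_cells)

    residuals = []
    for row, cell_sum in zip(matrix, cell_sums):
        out = []
        for x, gene_sum in zip(row, gene_sums):
            mu = cell_sum * gene_sum / total  # expected count under the null model
            z = (x - mu) / math.sqrt(mu + mu * mu / theta)
            out.append(max(-clip, min(clip, z)))  # clip to [-sqrt(n), sqrt(n)]
        residuals.append(out)
    return residuals


toy = [[0, 10], [10, 0], [5, 5]]
res = pearson_residuals(toy)
for row in res:
    print(row)
```

In this toy matrix every expected count is 5, so the third cell has residuals of exactly zero, while the first two cells hit the clipping bound `sqrt(3)`.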

In [None]:
analytic_pearson = sc.experimental.pp.normalize_pearson_residuals(adata, inplace=False)
adata.layers["analytic_pearson_residuals"] = csr_matrix(analytic_pearson["X"])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
p2 = sns.histplot(
    adata.layers["analytic_pearson_residuals"].sum(1), bins=100, kde=False, ax=axes[1]
)
axes[1].set_title("Analytic Pearson residuals")
plt.show()

In [None]:
adata.write("../../training_data/s4d8_normalization.h5ad")

### 4. Feature Selection

One of the commands below is memory-intensive (goes over the VM limit).  Export notebook with results to HTML first, then restart kernel before proceeding below. 

In [None]:
import scanpy as sc
import anndata2ri
import logging
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    frameon=False,
)

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
%%R
library(scry)

In [None]:
adata = sc.read(
    filename="../../training_data/s4d8_normalization.h5ad",
    backup_url="https://figshare.com/ndownloader/files/40015741",
)

In [None]:
ro.globalenv["adata"] = adata

In [None]:
%%R
sce = devianceFeatureSelection(adata, assay="X")

In [None]:
binomial_deviance = ro.r("rowData(sce)$binomial_deviance").T

In [None]:
idx = binomial_deviance.argsort()[-4000:]
mask = np.zeros(adata.var_names.shape, dtype=bool)
mask[idx] = True

adata.var["highly_deviant"] = mask
adata.var["binomial_deviance"] = binomial_deviance

In [None]:
sc.pp.highly_variable_genes(adata, layer="scran_normalization")

In [None]:
ax = sns.scatterplot(
    data=adata.var, x="means", y="dispersions", hue="highly_deviant", s=5
)
ax.set_xlim(None, 1.5)
ax.set_ylim(None, 3)
plt.show()

In [None]:
adata.write("../../training_data/s4d8_feature_selection.h5ad")