# Single-cell RNA-seq analysis using Python  
## Exercises 01: Raw reads to expression matrix

## 1. Raw data processing
__Experiment:__ [E-MTAB-6945](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6945)  
__Dataset:__ Raw scRNA-seq reads from T-cells of mouse neonate thymus, subset of one sample (400k reads)  
__Technology:__ Drop-seq

Reads can be obtained from the links below.  
https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k.fastq  
https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_2.fq-400k.fastq  

Both reads and references are pre-downloaded and saved in the following locations on the VMs:
```
data_dir="../../training_data"
fastq_dir="${data_dir}/ex_read_fastq"
ref_dir="${data_dir}/ex_mouse_ref"
```

__Instructions:__ Write and run a script that performs raw data processing on the given dataset.  You may run the full alevin-fry workflow or the simpler `simpleaf` method.  There is no unfiltered permit list file for Drop-seq experiments, so use one of the other barcode correction methods when generating the barcode permit list.  

### Questions  
1.  Which fastq file contains the cell barcodes + UMIs, and which contains the transcript sequences?  
2.  How long is the cell barcode and how long is the UMI?  
3.  What is the length of the transcript reads?  
4.  How many barcodes do you end up with after quantification?  
5.  How many genes are included in your quantification results?  
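
One way to approach questions 1 to 3 is to tabulate read lengths in each FASTQ file. The sketch below is a minimal illustration using only the standard library; the function name `read_length_counts` and the toy records (with made-up lengths) are hypothetical and do not reflect the exercise data.

```python
import io
from collections import Counter
from itertools import islice


def read_length_counts(handle, max_records=10_000):
    """Count sequence lengths in the first `max_records` FASTQ records."""
    lengths = Counter()
    records = 0
    while records < max_records:
        # A FASTQ record spans four lines: header, sequence, separator, qualities
        block = list(islice(handle, 4))
        if len(block) < 4:
            break
        header, sequence = block[0], block[1].strip()
        if not header.startswith("@"):
            raise ValueError(f"Unexpected FASTQ header: {header!r}")
        lengths[len(sequence)] += 1
        records += 1
    return lengths


# Toy input with two made-up 10-base reads (not the exercise data)
toy_fastq = io.StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nTTGCATTGCA\n+\nIIIIIIIIII\n"
)
toy_lengths = read_length_counts(toy_fastq)
print(toy_lengths)
```

To apply it to the exercise data, pass an open file handle for each FASTQ file in `fastq_dir` instead of the toy string.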

## 2.  Quality Control  
Perform QC using the conda environment `sc_py_training`.  Complete the commands below as instructed to finish the QC processing, then answer the questions at the end of each subsection.  

### 2.1 Filtering low quality barcodes

In [None]:
import numpy as np
import scanpy as sc
import seaborn as sns
from scipy.stats import median_abs_deviation

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    frameon=False,
)

Complete the command below to open the dataset as an AnnData object.  The data is saved in `../../training_data/E-MTAB-6945.h5ad`.

In [None]:
# For exercises, we are using the raw counts of E-MTAB-6945 (not the subset)
adata = sc.read()

In [None]:
# Mitochondrial transcripts are already annotated
adata.var.loc[adata.var.chromosome == "MT", ["gene_symbols", "chromosome", "mito"]] 

Below, calculate the QC metrics. Fill in `qc_vars` with the `.var` column that flags mitochondrial genes, so that the percentage of counts coming from those genes is computed for each barcode.

In [None]:
sc.pp.calculate_qc_metrics(
    adata, qc_vars=[], inplace=True, percent_top=[20], log1p=True
)
adata
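
To make the per-barcode metrics concrete, the sketch below recomputes three of them by hand with NumPy on a toy matrix. The matrix, the `toy_` variable names, and the boolean gene flag are hypothetical; it is an illustration of what the metrics mean, not a replacement for `sc.pp.calculate_qc_metrics`.

```python
import numpy as np

# Toy count matrix: 3 barcodes (rows) x 3 genes (columns), made-up values
toy_counts = np.array(
    [
        [5, 0, 5],
        [0, 2, 8],
        [1, 1, 0],
    ]
)
# Hypothetical boolean gene annotation: only the last gene is flagged
toy_flag = np.array([False, False, True])

# Total counts per barcode
toy_total_counts = toy_counts.sum(axis=1)
# Number of genes with at least one count per barcode
toy_n_genes = (toy_counts > 0).sum(axis=1)
# Percentage of each barcode's counts that come from flagged genes
toy_pct_flagged = 100 * toy_counts[:, toy_flag].sum(axis=1) / toy_total_counts

print(toy_total_counts, toy_n_genes, toy_pct_flagged)
```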

In the third plot below, show the log-transformed versions of the `total_counts` and `n_genes_by_counts` in their respective axes.

In [None]:
# Distribution of barcodes (cells) by total counts per barcode (cell)
p1 = sns.displot(adata.obs["total_counts"], bins=100, kde=False)
# The same distribution viewed as a violin plot:
# sc.pl.violin(adata, 'total_counts')

# Distribution of barcodes (cells) in terms of the %mt of counts per barcode (cell)
p2 = sc.pl.violin(adata, "pct_counts_mito")

# A scatter representing total_counts (x), n_genes_by_counts (y), and %mt of counts (color) per barcode (cell)
p3 = sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mito")

Below is a function that evaluates whether or not a value is an outlier.

In [None]:
def is_outlier(adata, metric: str, nmads: int):
    M = adata.obs[metric]
    outlier = (M < np.median(M) - nmads * median_abs_deviation(M)) | (
        np.median(M) + nmads * median_abs_deviation(M) < M
    )
    return outlier
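
The rule used by `is_outlier` flags values that lie more than `nmads` median absolute deviations (MADs) from the median. The sketch below applies the same rule to a toy array using NumPy only; `mad_outliers` and the toy values are hypothetical names and data. The MAD is computed here without rescaling, which matches the default `scale=1.0` of `scipy.stats.median_abs_deviation`.

```python
import numpy as np


def mad_outliers(values, nmads):
    """Flag values further than `nmads` MADs from the median."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    # Unscaled median absolute deviation
    mad = np.median(np.abs(values - med))
    return (values < med - nmads * mad) | (values > med + nmads * mad)


# Toy metric with one extreme value (made-up numbers)
toy_values = np.array([10, 11, 12, 13, 14, 100])
toy_flags = mad_outliers(toy_values, nmads=5)
print(toy_flags)
```

For the toy values the median is 12.5 and the MAD is 1.5, so with `nmads=5` anything outside the interval from 5 to 20 is flagged.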

Create a boolean series that flags outliers based on the metrics `log1p_total_counts`, `log1p_n_genes_by_counts`, and `pct_counts_in_top_20_genes`. Use `nmads = 5`.

In [None]:
adata.obs["outlier"] = ()
adata.obs.outlier.value_counts()

Create a boolean series indicating which observations (barcodes) are outliers based on `nmads = 6`.  Do not save this series in the AnnData object.  

In [None]:
outliers_6nmad = (
    is_outlier(adata, "log1p_total_counts", 6)
    | is_outlier(adata, "log1p_n_genes_by_counts", 6)
    | is_outlier(adata, "pct_counts_in_top_20_genes", 6)
)
outliers_6nmad.value_counts()

In [None]:
adata.obs["mt_outlier"] = is_outlier(adata, "pct_counts_mito", 3) | (
    adata.obs["pct_counts_mito"] > 8
)
adata.obs.mt_outlier.value_counts()

Remove low-quality barcodes from the matrix.

In [None]:
print(f"Total number of cells: {adata.n_obs}")
adata = adata[(~adata.obs.outlier) & (~adata.obs.mt_outlier)].copy()
print(f"Number of cells after filtering of low quality cells: {adata.n_obs}")

Edit the plotting command below, so that the plot shows the new adata, its log-transformed versions of the `total_counts` and `n_genes_by_counts` in their respective axes. Compare with the plot above.

In [None]:
p4 = sc.pl.scatter(adata, "log1p_total_counts", "log1p_n_genes_by_counts", color="pct_counts_mito")

#### Questions
1. How many barcodes and genes are present in the dataset?  
2. Which variable in the AnnData object indicates whether a gene (`.var`) is mitochondrial or not?
3. After running `scanpy.pp.calculate_qc_metrics`, where were the calculated metrics added?
4. If `nmads = 6`, how many outliers are identified based on the common QC metrics?
5. How many cells are left after filtering low quality barcodes?  

### 2.2 Correction of ambient RNA 

In [None]:
import anndata2ri
import logging

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
%%R
library(SoupX)

In [None]:
# Normalise and log1p-transform the data
adata_pp = adata.copy()
# normalize_total replaces the deprecated normalize_per_cell
sc.pp.normalize_total(adata_pp)
sc.pp.log1p(adata_pp)

In [None]:
# Compute principal components, build the neighbour graph, and cluster with Leiden
sc.pp.pca(adata_pp)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added="soupx_groups")

# Preprocess variables for SoupX
soupx_groups = adata_pp.obs["soupx_groups"]

In [None]:
del adata_pp

In [None]:
cells = adata.obs_names
genes = adata.var_names
data = adata.X.T

In [None]:
# Reload the raw counts to use as the table of droplets for SoupX
adata_raw = sc.read("../../training_data/E-MTAB-6945.h5ad")
data_tod = adata_raw.X.T

In [None]:
del adata_raw

In [None]:
%%R -i data -i data_tod -i genes -i cells -i soupx_groups -o out 

# specify row and column names of data
rownames(data) = genes
colnames(data) = cells
# ensure correct sparse format for table of counts and table of droplets
data <- as(data, "sparseMatrix")
data_tod <- as(data_tod, "sparseMatrix")

# Generate SoupChannel Object for SoupX 
sc = SoupChannel(data_tod, data, calcSoupProfile = FALSE)

# Add extra meta data to the SoupChannel object
soupProf = data.frame(row.names = rownames(data), est = rowSums(data)/sum(data), counts = rowSums(data))
sc = setSoupProfile(sc, soupProf)
# Set cluster information in SoupChannel
sc = setClusters(sc, soupx_groups)

# Estimate contamination fraction
sc  = autoEstCont(sc, doPlot=FALSE)
# Infer corrected table of counts and round to integer
out = adjustCounts(sc, roundToInt = TRUE)
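
The `soupProf` table built in the R cell above stores, for each gene, its total count and its fraction of all counts in the matrix. The NumPy sketch below reproduces that arithmetic on a toy genes-by-cells matrix; the matrix and the `toy_` names are hypothetical and only illustrate the `rowSums(data)/sum(data)` step, not the SoupX estimation itself.

```python
import numpy as np

# Toy matrix: 3 genes (rows) x 2 cells (columns), made-up values
toy_matrix = np.array(
    [
        [1, 3],
        [0, 0],
        [2, 4],
    ]
)

# Total counts per gene, equivalent to rowSums(data) in R
toy_gene_counts = toy_matrix.sum(axis=1)
# Fraction of all counts contributed by each gene, rowSums(data) / sum(data)
toy_gene_fraction = toy_gene_counts / toy_matrix.sum()

print(toy_gene_counts, toy_gene_fraction)
```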

In [None]:
type(out)

In [None]:
# Move counts to a layer
adata.layers["counts"] = adata.X

# Put the output of soupx into a layer
adata.layers["soupX_counts"] = out.T

# Use the soupx layer as the new main data layer, X
adata.X = adata.layers["soupX_counts"]

In [None]:
print(f"Total number of genes: {adata.n_vars}")

# Keep genes detected in at least 20 cells (this also removes zero-count genes)
sc.pp.filter_genes(adata, min_cells=20)
print(f"Number of genes after filtering: {adata.n_vars}")

In [None]:
# Save the data so the steps above need not be rerun if the kernel stops
adata.write("../../training_data/E-MTAB-6945_corrected_ambient_RNA.h5ad")

### Question
How many genes are removed after the correction of ambient RNA?

### 2.3 Doublet detection  

In [None]:
%%R
library(Seurat)
library(scater)
library(scDblFinder)
library(BiocParallel)

In [None]:
data_mat = adata.X.T

In [None]:
%%R -i data_mat -o doublet_score -o doublet_class

set.seed(123)
sce = scDblFinder(
    SingleCellExperiment(
        list(counts=data_mat),
    ) 
)
doublet_score = sce$scDblFinder.score
doublet_class = sce$scDblFinder.class

In [None]:
adata.obs["scDblFinder_score"] = doublet_score
adata.obs["scDblFinder_class"] = doublet_class
adata.obs.scDblFinder_class.value_counts()

In [None]:
adata.write("../../training_data/E-MTAB-6945_quality_control.h5ad")

### Question
How many barcodes are classified as doublets?  

## 3. Normalization

In [None]:
import scanpy as sc
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import anndata2ri
import logging
from scipy.sparse import issparse

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    # color_map="YlGnBu",
    frameon=False,
)

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
adata = sc.read("../../training_data/E-MTAB-6945_quality_control.h5ad")
adata

In [None]:
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False)

### 3.1 Shifted logarithm

In [None]:
scales_counts = sc.pp.normalize_total(adata, target_sum=None, inplace=False)
# log1p transform
adata.layers["log1p_norm"] = sc.pp.log1p(scales_counts["X"], copy=True)
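
As a worked illustration of the shifted logarithm, the sketch below applies the same arithmetic to a toy matrix with NumPy: each barcode is scaled so that its total equals the median total across barcodes (the behaviour of `normalize_total` when `target_sum=None`), and `log1p` is then applied. The toy matrix and the `toy_` names are hypothetical.

```python
import numpy as np

# Toy count matrix: 3 barcodes (rows) x 3 genes (columns), made-up values
toy_counts = np.array(
    [
        [1, 3, 0],
        [2, 6, 0],
        [10, 0, 10],
    ],
    dtype=float,
)

# Per-barcode totals and the median total used as the common target
toy_totals = toy_counts.sum(axis=1)
toy_target = np.median(toy_totals)

# Scale every barcode to the target total, then apply log(1 + x)
toy_scaled = toy_counts / toy_totals[:, None] * toy_target
toy_log1p_norm = np.log1p(toy_scaled)

print(toy_scaled.sum(axis=1))
print(toy_log1p_norm)
```

After scaling, every barcode has the same total, so the remaining differences between barcodes reflect composition rather than sequencing depth.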

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
p2 = sns.histplot(adata.layers["log1p_norm"].sum(1), bins=100, kde=False, ax=axes[1])
axes[1].set_title("Shifted logarithm")
plt.show()

In [None]:
from scipy.sparse import csr_matrix, issparse

In [None]:
%%R
library(scran)
library(BiocParallel)

In [None]:
# Preliminary clustering for differentiated normalisation
adata_pp = adata.copy()
sc.pp.normalize_total(adata_pp)
sc.pp.log1p(adata_pp)
sc.pp.pca(adata_pp, n_comps=15)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added="groups")

In [None]:
data_mat = adata_pp.X.T
# convert to CSC if possible. See https://github.com/MarioniLab/scran/issues/70
if issparse(data_mat):
    if data_mat.nnz > 2**31 - 1:
        data_mat = data_mat.tocoo()
    else:
        data_mat = data_mat.tocsc()
ro.globalenv["data_mat"] = data_mat
ro.globalenv["input_groups"] = adata_pp.obs["groups"]

In [None]:
del adata_pp

In [None]:
%%R -o size_factors

size_factors = sizeFactors(
    computeSumFactors(
        SingleCellExperiment(list(counts = data_mat)),
        clusters = input_groups,
        min.mean = 0.1,
        BPPARAM = MulticoreParam()
    )
)

In [None]:
adata.obs["size_factors"] = size_factors
scran = adata.X / adata.obs["size_factors"].values[:, None]
adata.layers["scran_normalization"] = csr_matrix(sc.pp.log1p(scran))

In [None]:
adata.write("../../training_data/E-MTAB-6945_log1p_normalization.h5ad")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
p2 = sns.histplot(
    adata.layers["scran_normalization"].sum(1), bins=100, kde=False, ax=axes[1]
)
axes[1].set_title("log1p with Scran estimated size factors")
plt.show()

## 4. Feature Selection

In [None]:
import scanpy as sc
import anndata2ri
import logging
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    frameon=False,
)

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
%%R
library(scry)

In [None]:
adata = sc.read("../../training_data/E-MTAB-6945_log1p_normalization.h5ad")

In [None]:
ro.globalenv["adata"] = adata

In [None]:
%%R
sce = devianceFeatureSelection(adata, assay="X")

In [None]:
binomial_deviance = ro.r("rowData(sce)$binomial_deviance").T
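
As a rough illustration of what a binomial deviance score measures, the sketch below computes it for a toy count matrix with NumPy, assuming a null model in which each gene takes a constant proportion of every cell's counts. The function name `binomial_deviance_sketch` and the toy matrix are hypothetical, and this is not scry's implementation, which may differ in details.

```python
import numpy as np


def binomial_deviance_sketch(counts):
    """Per-gene binomial deviance for a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)  # total counts per cell
    p = counts.sum(axis=0) / n.sum()       # overall proportion per gene
    mu = n * p                             # expected counts under the null
    # Terms with a zero count contribute 0, so mask them before summing
    with np.errstate(divide="ignore", invalid="ignore"):
        hit = np.where(counts > 0, counts * np.log(counts / mu), 0.0)
        rest = n - counts
        miss = np.where(rest > 0, rest * np.log(rest / (n - mu)), 0.0)
    return 2 * (hit + miss).sum(axis=0)


# Toy matrix: gene 0 is always 20% of a cell's counts, genes 1 and 2 vary
toy_counts = np.array(
    [
        [2, 8, 0],
        [2, 0, 8],
        [4, 8, 8],
    ]
)
toy_deviance = binomial_deviance_sketch(toy_counts)
print(toy_deviance)
```

In the toy data, the gene with a constant proportion gets a deviance of about zero, while the genes whose proportion changes between cells score higher, which is why high deviance is used as a sign of an informative gene.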

Try to select the 5000 most deviant genes.

In [None]:
idx = binomial_deviance.argsort()[]
mask = np.zeros(adata.var_names.shape, dtype=bool)
mask[idx] = True

adata.var["highly_deviant"] = mask
adata.var["binomial_deviance"] = binomial_deviance

In [None]:
sc.pp.highly_variable_genes(adata, layer="scran_normalization")

In [None]:
ax = sns.scatterplot(
    data=adata.var, x="means", y="dispersions", hue="highly_deviant", s=5
)
ax.set_xlim(None, 1.5)
ax.set_ylim(None, 3)
plt.show()

In [None]:
adata.write("../../training_data/E-MTAB-6945_feature_selection.h5ad")

### Question  
What can you say about the genes selected (highly deviant)?