<a href="https://colab.research.google.com/github/Ken-Lau-Lab/single-cell-lectures/blob/main/section04_homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## __Section 4:__ Clustering & Differential Expression Homework

March 1, 2022

---

In [None]:
!git clone https://github.com/Ken-Lau-Lab/single-cell-lectures  # for Colab users (GitHub no longer serves the git:// protocol)
!pip install scanpy  # for Colab users
!pip install leidenalg  # for Colab users
!pip install scikit-learn # for Colab users
!cp -r single-cell-lectures/data/ .  # for Colab users

In [None]:
%env PYTHONHASHSEED=0
import scanpy as sc; sc.settings.verbosity = 3  # Set scanpy verbosity to 3 for in-depth function run information
import numpy as np
import random; random.seed(22)
from sklearn.preprocessing import normalize
np.random.seed(22)

---
#### Import peripheral blood mononuclear cells (PBMC) dataset.  This has already been **filtered** and **feature-selected**.  Your assignment is to **cluster** and create a **UMAP embedding** of the cells, identifying the constituent cell types by their **differentially expressed genes**.

In [None]:
adata = sc.read("data/PBMC_3k_small.h5ad") ; adata

#### Below, I've included some code snippets that we used during the lecture for processing.

#### We don't need to worry about mitochondrial counts or cell cycle phase inference for this exercise.

In [None]:
# let's first define a custom function that operates on AnnData objects
def arcsinh_norm(adata, layer=None, norm="l1", scale=1000):
    """
    add arcsinh-normalized values for each element in anndata counts matrix to adata.layers["arcsinh_norm"]
    normalization is performed inside this function, so pass raw (unnormalized) counts
        adata = AnnData object
        layer = name of layer to perform arcsinh-normalization on. if None, use AnnData.X
        norm = normalization strategy prior to arcsinh transform.
            None: do not normalize data
            'l1': divide each count by sum of counts for each cell
            'l2': divide each count by sqrt of sum of squares of counts for cell
        scale = factor to scale normalized counts to; default 1000
    """
    if layer is None:
        mat = adata.X
    else:
        mat = adata.layers[layer]

    if norm is not None:
        mat = normalize(mat, axis=1, norm=norm)  # sklearn's normalize does not accept norm=None, so skip it in that case

    adata.layers["arcsinh_norm"] = np.arcsinh(mat * scale)
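
To see what this transformation does, here is a minimal NumPy-only sketch on a made-up 2 x 4 count matrix (the values are purely illustrative and are not taken from the PBMC dataset). It mirrors the `norm="l1"`, `scale=1000` settings by hand instead of calling `sklearn.preprocessing.normalize`. The two toy cells have the same gene proportions but a 10-fold difference in total counts, so they should come out identical after normalization.

```python
import numpy as np

# Made-up counts: 2 cells x 4 genes (illustrative only).
# Cell 2 has 10x the sequencing depth of cell 1 but the same gene proportions.
toy_counts = np.array(
    [
        [1.0, 3.0, 0.0, 6.0],
        [10.0, 30.0, 0.0, 60.0],
    ]
)

# l1 normalization: divide each count by the total counts of its cell (row)
row_totals = toy_counts.sum(axis=1, keepdims=True)
l1_normalized = toy_counts / row_totals

# scale to 1000 counts per cell, then apply the arcsinh transform
toy_arcsinh = np.arcsinh(l1_normalized * 1000)

print(toy_arcsinh)
print("Rows identical after normalization:", np.allclose(toy_arcsinh[0], toy_arcsinh[1]))
```

Unlike a plain log transform, arcsinh is defined at zero, so zero counts stay at zero without adding a pseudocount.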

In [None]:
# preprocess AnnData for downstream dimensionality reduction
adata.layers["raw_counts"] = adata.X.copy()  # save raw counts in layer
arcsinh_norm(adata, layer="raw_counts", norm="l1", scale=1000)  # arcsinh-transform normalized counts and add to .layers['arcsinh_norm']
adata.X = adata.layers["arcsinh_norm"].copy()  # set normalized counts as .X slot in scanpy object

In [None]:
sc.tl.pca(adata, n_comps=50, random_state=0, use_highly_variable=False)  # perform 50-component PCA on our feature-selected dataset
sc.pl.pca_overview(
    adata,
    components=["1,2","2,3"],
)  # view PC pairs (1,2) and (2,3), feature loadings, and variance

In [None]:
n_neighbs = int(np.sqrt(adata.n_obs))  # heuristic: set number of neighbors to sqrt(n_obs)
print("Number of nearest neighbors: {}".format(n_neighbs))

In [None]:
sc.pp.neighbors(adata, n_neighbors=n_neighbs, n_pcs=6, random_state=0)  # generate kNN graph with 6 PCs
sc.tl.leiden(adata, resolution=0.2, random_state=1)  # determine dataset clusters

In [None]:
sc.tl.paga(adata)  # PAGA uses the kNN graph and Leiden clusters to create a cluster-cluster similarity graph
sc.pl.paga(
    adata,
    color="leiden",
    node_size_scale=3,
    fontsize=12,
    fontoutline=2,
    frameon=False,
)  # plot PAGA graph. Edge thickness and distance describe cluster similarity

In [None]:
sc.tl.umap(adata, init_pos="paga", random_state=0)  # initialize UMAP with PAGA coordinates
sc.pl.umap(
    adata,
    color="leiden",
    legend_fontsize=12,
    legend_fontoutline=2,
    size=75,
    frameon=False,
)  # plot embedding with Leiden cluster overlay

In [None]:
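sc.tl.rank_genes_groups(adata, groupby="leiden")  # rank genes for each cluster vs. all other cells (scanpy's default method is a t-test)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5, cmap="viridis", standard_scale="var")  # dot plot of top 5 ranked genes per cluster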

In [None]:
sc.pl.rank_genes_groups(adata, ncols=4)

---
#### Now we should have Leiden clusters, a UMAP embedding, and differentially-expressed genes for each cluster.

#### The assignment is to identify the clusters present in the dataset (at coarse resolution - `resolution=0.2` in `sc.tl.leiden` above).  There should be **three major groups**, and your task is to say which group each cluster belongs to:
1. T lymphocytes
2. B lymphocytes
3. Myeloid cells
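
One way to approach this is to compare each cluster's top-ranked genes against a short list of known marker genes. The sketch below is only an illustration: the marker panel is an assumption (a few commonly cited markers, not part of the assignment data), `guess_group` is a hypothetical helper, and the top-gene lists are made up. In the notebook you could build real lists from the `rank_genes_groups` results, for example with `sc.get.rank_genes_groups_df`.

```python
# Assumed marker panel: a few commonly cited markers per group (not from the assignment data)
MARKERS = {
    "T lymphocytes": {"CD3D", "CD3E", "IL7R"},
    "B lymphocytes": {"MS4A1", "CD79A", "CD79B"},
    "Myeloid cells": {"LYZ", "CD14", "S100A9"},
}


def guess_group(top_genes, markers=MARKERS):
    """Hypothetical helper: return the group sharing the most markers with top_genes, or None if nothing overlaps."""
    overlaps = {group: len(genes & set(top_genes)) for group, genes in markers.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


# Made-up top-gene lists for three toy clusters (not results from this dataset)
toy_top_genes = {
    "a": ["CD3D", "IL7R", "LTB"],
    "b": ["CD79A", "MS4A1", "HLA-DRA"],
    "c": ["LYZ", "S100A8", "CD14"],
}

guesses = {cluster: guess_group(genes) for cluster, genes in toy_top_genes.items()}
print(guesses)
```

A simple overlap count like this is only a starting point. Check the dot plot and the ranked gene lists yourself before settling on a label.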

In [None]:
# there should be 4 clusters if you use resolution=0.2 in leiden clustering above
celltypedict = {
    "0":"",  # input the name of the cell type corresponding to each cluster ID here
    "1":"",
    "2":"",
    "3":"",
}

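# Map each Leiden cluster ID to its cell type name.
# .map handles several clusters sharing one name; astype("category") keeps the column categorical for plotting
adata.obs["cell_type"] = adata.obs["leiden"].map(celltypedict).astype("category")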
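
Because there are four clusters but only three major groups, at least two cluster IDs will end up sharing a name. The following sketch uses a toy pandas Series with placeholder names (`"type A"`, `"type B"` are stand-ins, not answers) to show how a many-to-one remapping behaves with `Series.map` followed by `astype("category")`.

```python
import pandas as pd

# Toy cluster labels standing in for adata.obs["leiden"] (illustrative only)
toy_clusters = pd.Series(["0", "1", "2", "1", "0"], dtype="category")

# Placeholder names: clusters "0" and "2" deliberately share one label
toy_names = {"0": "type A", "1": "type B", "2": "type A"}

# Map cluster IDs to names, then convert the result back to a categorical column
toy_cell_type = toy_clusters.map(toy_names).astype("category")

print(toy_cell_type.cat.categories.tolist())
print(toy_cell_type.value_counts())
```

Any cluster ID missing from the dictionary would come out as `NaN`, so counting missing values with `isna().sum()` is a quick check that every cluster received a name.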

In [None]:
sc.pl.umap(
    adata,
    color=["cell_type"],
    legend_fontsize=12,
    legend_fontoutline=2,
    size=75,
    frameon=False,
    legend_loc="on data",
)  # plot embedding with cell type overlay