<a href="https://colab.research.google.com/github/Ken-Lau-Lab/single-cell-lectures/blob/main/section04_homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## __Section 4:__ Dimension Reduction Homework

February 22, 2022

---

In [None]:
!git clone https://github.com/Ken-Lau-Lab/single-cell-lectures  # for Colab users (GitHub no longer serves the git:// protocol)
!pip install scanpy  # for Colab users
!pip install leidenalg  # for Colab users
!pip install scikit-learn # for Colab users
!cp -r single-cell-lectures/data/ .  # for Colab users

In [None]:
%env PYTHONHASHSEED=0
import scanpy as sc; sc.settings.verbosity = 3  # set scanpy verbosity to 3 for in-depth information on each function run
import numpy as np
import random; random.seed(22)
from sklearn.preprocessing import normalize
np.random.seed(22)

---
#### Import the peripheral blood mononuclear cell (PBMC) dataset. It has already been **filtered** and **feature-selected**.

#### Your assignment is to **perform principal component analysis** and determine **which gene features drive the major PCs in the dataset**.

In [None]:
adata = sc.read("data/PBMC_3k_small.h5ad") ; adata

#### Below, I've included some code snippets that we used during the lecture for processing.

#### We don't need to worry about mitochondrial counts or cell cycle phase inference for this exercise.

In [None]:
# let's first define a custom function that operates on AnnData objects
def arcsinh_norm(adata, layer=None, norm="l1", scale=1000):
    """
    arcsinh-transform normalized counts and store the result in adata.layers["arcsinh_norm"]
    normalization is applied inside this function (see norm), so raw counts can be passed directly
        adata = AnnData object
        layer = name of layer to perform arcsinh-normalization on. if None, use AnnData.X
        norm = normalization strategy prior to arcsinh transform.
            None: do not normalize data
            'l1': divide each count by sum of counts for each cell
            'l2': divide each count by sqrt of sum of squares of counts for cell
        scale = factor to scale normalized counts to; default 1000
    """
    if layer is None:
        mat = adata.X
    else:
        mat = adata.layers[layer]

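    if norm is not None:
        # sklearn's normalize does not accept norm=None, so only call it when a norm is requested
        mat = normalize(mat, axis=1, norm=norm)

    adata.layers["arcsinh_norm"] = np.arcsinh(mat * scale)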

In [None]:
# preprocess AnnData for downstream dimensionality reduction
adata.layers["raw_counts"] = adata.X.copy()  # save raw counts in layer
arcsinh_norm(adata, layer="raw_counts", norm="l1", scale=1000)  # arcsinh-transform normalized counts and add to .layers['arcsinh_norm']
adata.X = adata.layers["arcsinh_norm"].copy()  # set normalized counts as .X slot in scanpy object
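
To make the preprocessing above concrete, here is a minimal sketch of the same two steps (L1 normalization per cell, then arcsinh of the scaled values) on a toy matrix. The counts are made up for illustration, and the sketch uses plain NumPy rather than `sklearn.preprocessing.normalize`; for non-negative counts the two should agree.

```python
import numpy as np

# Toy count matrix: 2 cells (rows) x 3 genes (columns); values are made up
counts = np.array([[1.0, 3.0, 6.0],
                   [0.0, 5.0, 5.0]])

# L1 normalization: divide each count by its cell's total, so each row sums to 1
fractions = counts / counts.sum(axis=1, keepdims=True)

# Scale to "counts per 1000", then apply the inverse hyperbolic sine
scaled = fractions * 1000
arcsinh_vals = np.arcsinh(scaled)

# arcsinh(0) is exactly 0, so zero counts need no pseudocount;
# for large x, arcsinh(x) is close to ln(2x), so it behaves like a log transform
print(np.round(scaled, 1))
print(np.round(arcsinh_vals, 3))
```

Because each cell is rescaled to the same total before the transform, differences in sequencing depth between cells do not dominate the downstream PCA.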

#### In the code block below, perform a **50-component PCA**, then plot the overview containing the PC scatterplots, gene loadings, and variance ratio.
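
As a reminder of what a PCA computes, the sketch below runs a 50-component PCA with a plain NumPy SVD on synthetic data. The matrix `X` is a random stand-in, not the PBMC dataset, and this is not necessarily how scanpy implements it. In scanpy, the corresponding calls are `sc.tl.pca(adata, n_comps=50)` followed by `sc.pl.pca_overview(adata)`.

```python
import numpy as np

rng = np.random.default_rng(22)
X = rng.normal(size=(200, 60))  # synthetic stand-in: 200 cells x 60 genes
n_comps = 50

# Center each gene (column); PCA is defined on mean-centered data
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U @ diag(S) @ Vt, with singular values in descending order
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U[:, :n_comps] * S[:n_comps]  # cell coordinates in PC space (cells x PCs)
loadings = Vt[:n_comps].T              # gene weights for each PC (genes x PCs)

# Fraction of total variance explained by each retained PC
variance_ratio = (S**2 / np.sum(S**2))[:n_comps]

print(scores.shape, loadings.shape)
print(np.round(variance_ratio[:3], 3))
```

The three outputs map onto the three panels of the overview plot: `scores` gives the PC scatterplots, `loadings` gives the gene loadings, and `variance_ratio` gives the variance ratio curve.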

#### List the **top 3 genes in each of the first 3 principal components**. These are the genes with the largest "loadings" on the major PCs, which explain the most variance across all cells.
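
One way to rank genes by loading is sketched below. The helper `top_loading_genes`, the gene names, and the loading values are all hypothetical and only illustrate the ranking step. With scanpy, after `sc.tl.pca` the loadings matrix (genes x PCs) is stored in `adata.varm["PCs"]` and the gene names in `adata.var_names`, so those could be passed in instead.

```python
import numpy as np

def top_loading_genes(loadings, gene_names, n_pcs=3, n_top=3):
    """Return {PC label: [gene names]} ranked by absolute loading (hypothetical helper)."""
    top = {}
    for pc in range(n_pcs):
        # Sort genes by descending absolute loading on this PC
        order = np.argsort(-np.abs(loadings[:, pc]))[:n_top]
        top[f"PC{pc + 1}"] = [str(gene_names[i]) for i in order]
    return top

# Made-up loadings for 5 hypothetical genes on 3 PCs (rows = genes, columns = PCs)
genes = np.array(["geneA", "geneB", "geneC", "geneD", "geneE"])
loadings = np.array([
    [ 0.9,  0.10,  0.0],
    [-0.3,  0.80,  0.1],
    [ 0.2, -0.50,  0.7],
    [ 0.1,  0.30, -0.6],
    [ 0.0,  0.05,  0.3],
])

top = top_loading_genes(loadings, genes)
for pc, names in top.items():
    print(pc, names)
```

Ranking by absolute value treats strongly negative and strongly positive loadings as equally influential. If only genes that increase along a PC are of interest, sorting by the signed loading is a reasonable alternative.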