# Advanced Tutorial: concise EDA and preprocessing of the totalVI_10x gex dataset
In the last tutorial we preprocessed the gex dataset. But can we do it faster?

Sooner or later we will discover a new preprocessing nuance and want to preprocess the raw dataset again. That is why we keep the preprocessing steps in scripts: rerunning them is quick.

In [1]:
import anndata as ad
import scanpy as sc
import seaborn as sns

# Change current directory to repo root
# We use this in each notebook
from lab_scripts.utils import utils
utils.change_directory_to_repo()

from lab_scripts.data.preprocessing.common import gex_normalization, gex_qc

sc.settings.verbosity = 3  # show info messages
sc.set_figure_params(figsize=(5, 3))  # set figsize for plots


In [2]:
data = ad.read_h5ad("data/raw/gex_adt/totalVI_10x_gex.h5ad")

We already know good QC parameters from the EDA. Let's apply them.

In [3]:
qc_parameters = {
    # Minimum number of counts per cell
    'cell_min_counts': 2200,

    # Maximum number of counts per cell
    'cell_max_counts': 10000,
}

data = gex_qc.standard_qc(data, qc_parameters)

filtered out 167 cells that have less than 2200 counts
filtered out 199 cells that have more than 10000 counts
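
To make the two thresholds concrete, here is a minimal NumPy sketch of what a total-count filter does. It is not the code inside `gex_qc.standard_qc`: the helper name `filter_cells_by_counts` and the toy matrix are made up for illustration, and treating both bounds as inclusive is an assumption.

```python
import numpy as np


def filter_cells_by_counts(counts, min_counts, max_counts):
    """Keep rows (cells) whose total count lies in [min_counts, max_counts].

    Hypothetical helper for illustration; inclusive bounds are an assumption.
    """
    totals = counts.sum(axis=1)  # total counts per cell
    keep = (totals >= min_counts) & (totals <= max_counts)
    return counts[keep], keep


# Toy matrix: 3 cells (rows) x 3 genes (columns)
toy_counts = np.array([
    [1000, 500, 200],    # total 1700: below the lower bound
    [2000, 1500, 500],   # total 4000: kept
    [6000, 5000, 1000],  # total 12000: above the upper bound
])

filtered, keep_mask = filter_cells_by_counts(toy_counts, 2200, 10000)
print(f"kept {keep_mask.sum()} of {len(keep_mask)} cells")
```

On the real dataset the same idea removes the low-count and high-count cells reported in the log above.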


You can set more parameters. See `lab_scripts/data/preprocessing/common/gex_qc.py`.

In [4]:
qc_parameters = {
    # Remove cells whose mitochondrial fraction exceeds 20%
    'mito_max_fraction': 0.2,

    # Remove cells that express fewer than 100 different genes
    'cell_min_genes': 100,

    # Remove genes that are expressed in fewer than 3 cells
    'gene_min_cells': 3,
}

We won't apply them now.
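
Even though these filters are not applied here, it can help to see what each one would select. The sketch below is a plain NumPy illustration, not the implementation in `gex_qc.py`. The helper `extra_qc_masks`, the toy data, and the `is_mito` mask are hypothetical. It assumes that the mitochondrial fraction is the share of a cell's counts coming from mitochondrial genes, and that genes are filtered after cells.

```python
import numpy as np


def extra_qc_masks(counts, is_mito, mito_max_fraction, cell_min_genes, gene_min_cells):
    """Return boolean masks of cells and genes that pass the three filters.

    Hypothetical helper; the order of the filters is an assumption.
    """
    totals = counts.sum(axis=1)
    mito_counts = counts[:, is_mito].sum(axis=1)
    # Fraction of each cell's counts from mitochondrial genes (0 for empty cells)
    mito_fraction = np.divide(
        mito_counts, totals, out=np.zeros(len(totals)), where=totals > 0
    )
    genes_per_cell = (counts > 0).sum(axis=1)
    keep_cells = (mito_fraction <= mito_max_fraction) & (genes_per_cell >= cell_min_genes)

    # Count, among the surviving cells, how many express each gene
    cells_per_gene = (counts[keep_cells] > 0).sum(axis=0)
    keep_genes = cells_per_gene >= gene_min_cells
    return keep_cells, keep_genes


# Toy matrix: 4 cells x 3 genes, the last gene is treated as mitochondrial
toy_counts = np.array([
    [5, 3, 1],  # passes both cell filters
    [4, 0, 6],  # 60% mitochondrial counts: removed
    [2, 0, 0],  # expresses only one gene: removed
    [3, 1, 0],  # passes both cell filters
])
is_mito = np.array([False, False, True])

keep_cells, keep_genes = extra_qc_masks(
    toy_counts, is_mito, mito_max_fraction=0.2, cell_min_genes=2, gene_min_cells=2
)
```

The toy thresholds are far smaller than the ones in the cell above so that the example fits in a few rows.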

## Normalization

Remember the scary R code from the last tutorial? Here it is now. Feel old yet?

In [5]:
clusters = gex_normalization.get_clusters(data)
data = gex_normalization.calculate_size_factors(data, clusters)
data = gex_normalization.normalize(data)

normalizing by total count per cell
    finished (0:00:00): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
computing neighbors
         Falling back to preprocessing with `sc.pp.pca` and default params.
computing PCA
    with n_comps=50
    finished (0:00:21)
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:24)
running Louvain clustering
    using the "louvain" package of Traag (2017)
    finished: found 7 clusters and added
    'groups', the cluster labels (adata.obs, categorical) (0:00:00)
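
The log suggests that `get_clusters` runs a quick total-count normalization, builds a neighbor graph, and clusters the cells with Louvain. How `calculate_size_factors` derives its factors is not shown in this notebook, so the sketch below only illustrates the last step: dividing each cell by its size factor and taking a log transform. The helper `normalize_with_size_factors` is hypothetical, the library-size factors are a simple stand-in rather than the method used by the script, and the `log1p` transform is an assumption.

```python
import numpy as np


def normalize_with_size_factors(counts, size_factors):
    """Scale each cell (row) by its size factor, then apply log(1 + x).

    Hypothetical helper; the log1p step is an assumption.
    """
    scaled = counts / size_factors[:, None]
    return np.log1p(scaled)


# Toy matrix: the second cell was sequenced twice as deeply as the first
toy_counts = np.array([
    [10.0, 0.0, 30.0],
    [20.0, 0.0, 60.0],
])

# Stand-in size factors: library size relative to the mean library size
library_sizes = toy_counts.sum(axis=1)
size_factors = library_sizes / library_sizes.mean()

normalized = normalize_with_size_factors(toy_counts, size_factors)
```

After scaling, the two toy cells have identical profiles, which is the point of size factors: differences in sequencing depth are removed before cells are compared.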


Read more in `lab_scripts/data/preprocessing/common/gex_normalization.py`.