# Introduction

In this notebook, we analyse a sample of peripheral blood mononuclear cells (PBMCs). Here, we cover clustering and cell type annotation of the data.


Install all packages for the tutorial.

In [3]:
!pip install scanpy==1.6.1 umap-learn==0.4.6 anndata==0.7.5 numpy==1.19.5 scipy==1.4.1 pandas matplotlib scrublet seaborn python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3


Load all required packages.

In [4]:
import scanpy as sc
import anndata as ann
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import colors

import os 
#doublet detection
import scrublet as scr


#pretty plotting
import seaborn as sb



In [5]:
plt.rcParams['figure.figsize']=(8,8) #rescale figures
sc.settings.verbosity = 3
#sc.set_figure_params(dpi=200, dpi_save=300)
sc.logging.print_header()


scanpy==1.6.1 anndata==0.7.5 umap==0.5.1 numpy==1.19.5 scipy==1.4.1 pandas==1.1.5 scikit-learn==0.22.2.post1 statsmodels==0.10.2 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3


Note that this notebook was created as part of a workshop, so we use extra-large legend text in all seaborn plots. You can also set the context to 'talk' or 'paper'.

In [6]:
sb.set_context(context='poster')


# Set project file paths

Let us set up the connection with Google Drive.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We set up the file paths to the respective directories.

In [8]:
file_path = '/content/drive/My Drive/' #this is the file path to your google drive (main directory)

In [9]:
import os 

In [None]:
os.listdir(file_path)

Set the file path to the raw data, which is usually stored at a different location than the rest of the project.

In [11]:
file_path_raw = file_path + '3k_PBMC/'

The data directory contains all processed data and `anndata` files. 

In [12]:
data_dir = file_path + 'PBMC_Colabs/data/' 

The tables directory contains all tabular data output, e.g. in `.csv` or `.xls` file format. That applies to differential expression test results or overview tables such as the number of cells per cell type.

In [13]:
table_dir = file_path + 'PBMC_Colabs/tables/'

The default figure path is a POSIX path called 'figures'. If you don't change the default figure directory, scanpy creates this subdirectory in the directory where this notebook is located.

In [14]:
sc.settings.figdir = file_path + 'PBMC_Colabs/figures/'
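The project directories may not exist yet on a fresh drive. The following is a minimal sketch for creating them up front; `base_path` is a temporary stand-in for `file_path`, and the dictionary `project_dirs` is a name introduced here for illustration only.

```python
import os
import tempfile

# Hypothetical stand-in for file_path; in the notebook this would be the Drive folder
base_path = tempfile.mkdtemp()

project_dirs = {
    'data': os.path.join(base_path, 'PBMC_Colabs', 'data'),
    'tables': os.path.join(base_path, 'PBMC_Colabs', 'tables'),
    'figures': os.path.join(base_path, 'PBMC_Colabs', 'figures'),
}

for name, path in project_dirs.items():
    # exist_ok=True makes the call safe to repeat when the notebook is re-run
    os.makedirs(path, exist_ok=True)
    print(name, os.path.isdir(path))
```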

**Comment:** When you repeat certain analyses, it might be helpful to set a `date` variable and add it to every figure and table (see `datetime` Python package).

In [None]:
import datetime

today = datetime.date.today().strftime('%y%m%d') #creates a YYMMDD string of today's date
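As a minimal sketch of how such a date string might be used, the example below prefixes a table file name with it. The directory `example_table_dir` and the file stem `marker_genes` are hypothetical names chosen for illustration.

```python
import datetime
import os

today = datetime.date.today().strftime('%y%m%d')  # YYMMDD string of today's date

# Hypothetical output directory and file stem, used for illustration only
example_table_dir = 'tables'
table_file = os.path.join(example_table_dir, today + '_marker_genes.csv')
print(table_file)
```

Prefixing the date keeps results from repeated runs side by side instead of overwriting them.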

# Read data

We read the normalised dataset, which consists of 3k human PBMCs provided by 10x Genomics.

In [12]:
adata = sc.read(data_dir + 'data_processed.h5ad')

# Downstream analysis

## Clustering

Clustering is a central component of the scRNA-seq analysis pipeline. To understand the data, we must identify the cell types and states present, and the first step towards this is clustering. Modularity optimization by Louvain community detection on the k-nearest-neighbour (kNN) graph of cells has become an established practice in scRNA-seq analysis. Thus, this is the method of choice in this tutorial as well.

Here, we perform clustering at two resolutions. Investigating several resolutions allows us to select a clustering that appears to capture the main clusters in the visualization and can provide a good baseline for further subclustering of the data to identify more specific substructure.
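To see why the resolution changes the number of clusters, the following is a minimal sketch of a modularity score with a resolution parameter, computed on a toy graph rather than on the kNN graph of cells. The function `modularity` and the toy graph are illustrative assumptions, and the formula is the commonly used resolution-weighted modularity, which may differ in detail from what `sc.tl.louvain` optimizes internally.

```python
def modularity(edges, communities, resolution=1.0):
    """Resolution-weighted modularity of an undirected, unweighted graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for members in communities:
        members = set(members)
        # Number of edges with both ends inside the community
        internal = sum(1 for u, v in edges if u in members and v in members)
        total_degree = sum(degree[node] for node in members)
        q += internal / m - resolution * (total_degree / (2 * m)) ** 2
    return q


# Toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = [{0, 1, 2}, {3, 4, 5}]
merged = [{0, 1, 2, 3, 4, 5}]

for resolution in (0.2, 1.0):
    q_split = modularity(edges, split, resolution)
    q_merged = modularity(edges, merged, resolution)
    best = 'split' if q_split > q_merged else 'merged'
    print(resolution, round(q_split, 3), round(q_merged, 3), best)
```

In this toy case, the low resolution favours one merged community and the higher resolution favours the two triangles as separate communities, which mirrors the coarse-versus-fine behaviour of the two clusterings.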

Clustering is performed on the highly variable gene data, which is dimensionality-reduced by PCA and embedded into a kNN graph (see the `sc.pp.pca()` and `sc.pp.neighbors()` functions used in the visualization section).

Compute a `louvain` clustering with two different resolutions (`0.5` and `1.5`). Compare the clusterings in a table and visualize the clustering in an embedding. Optional: Compute a clustering with the `leiden` algorithm.

In [None]:
# Perform clustering - using highly variable genes
sc.tl.louvain(adata, resolution=1.5, key_added='louvain_r1.5')
sc.tl.louvain(adata, resolution=0.5, key_added='louvain_r0.5')

In [None]:
pd.crosstab(adata.obs['louvain_r0.5'], adata.obs['louvain_r1.5'])
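To illustrate how such a table is read, here is a minimal sketch with made-up cluster labels for six cells. The series `low_res` and `high_res` are toy stand-ins for the two `adata.obs` columns.

```python
import pandas as pd

# Toy cluster labels for six cells at a low and a high resolution (made up for illustration)
low_res = pd.Series(['0', '0', '0', '1', '1', '1'], name='low_resolution')
high_res = pd.Series(['0', '0', '2', '1', '1', '1'], name='high_resolution')

# Rows: low-resolution clusters, columns: high-resolution clusters, entries: cell counts
table = pd.crosstab(low_res, high_res)
print(table)
```

A row with counts in more than one column indicates a low-resolution cluster that is split at the higher resolution.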

In [None]:
#Visualize the clustering and how this is reflected by different technical covariates
sc.pl.umap(adata, color=['louvain_r1.5', 'louvain_r0.5'], wspace=0.6)
sc.pl.umap(adata, color=['log_counts', 'mt_frac'])

## Marker genes and cluster annotation 

To annotate the clusters we obtained, we find genes that are up-regulated in a cluster compared to all other clusters (marker genes). A conservative choice for this differential expression test is a *Welch t-test with overestimated variance* (`method='t-test_overestim_var'`). This was the default in earlier `scanpy` versions; more recent versions fall back to a plain t-test when no `method` is passed. If `adata.raw` is set, the test is automatically performed on the `.raw` data set, which is uncorrected and contains all genes. All genes are taken into account, as any gene may be an informative marker.
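The effect of overestimating the variance can be sketched for a single gene as follows. This is a simplified illustration and not the `scanpy` implementation: the function `welch_t` and the expression values are made up, and the sketch assumes that the overestimation treats the rest of the cells as if that group were as small as the cluster.

```python
import math
import statistics


def welch_t(group, rest, overestimate_variance=False):
    """Welch t-statistic for one gene: cluster versus all other cells."""
    mean_diff = statistics.mean(group) - statistics.mean(rest)
    n_group = len(group)
    # Assumption: overestimation pretends the rest is only as large as the cluster
    n_rest = n_group if overestimate_variance else len(rest)
    standard_error = math.sqrt(
        statistics.variance(group) / n_group + statistics.variance(rest) / n_rest
    )
    return mean_diff / standard_error


# Made-up expression values of one gene in a small cluster and in the remaining cells
cluster_expr = [2.1, 2.5, 1.9, 2.8]
rest_expr = [0.2, 0.0, 0.5, 0.1, 0.9, 0.3, 0.0, 0.4, 0.6, 0.2, 0.1, 0.7]

t_plain = welch_t(cluster_expr, rest_expr)
t_conservative = welch_t(cluster_expr, rest_expr, overestimate_variance=True)
print(round(t_plain, 2), round(t_conservative, 2))
```

Under this assumption the standard error grows whenever the rest is larger than the cluster, so the conservative score is the smaller of the two.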

In this case study, we annotate the clusters manually with marker genes from the literature rather than by automated annotation against a reference atlas. Thus, we do not use scmap or Garnett here.

Compute the differential expression profile for each cluster with `rank_genes_groups` and visualize the results.

In [None]:
#Calculate marker genes
sc.tl.rank_genes_groups(adata, groupby='louvain_r0.5', key_added='rank_genes_r0.5')

In [None]:
#Plot marker genes
sc.pl.rank_genes_groups(adata, key='rank_genes_r0.5', fontsize=12)

**Tasks:** Calculate and visualise marker genes for the louvain clustering with resolution `1.5`.

In [None]:
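#Calculate marker genes
sc.tl.rank_genes_groups(adata, groupby='louvain_r1.5', key_added='rank_genes_r1.5')

In [None]:
#Plot marker genes
sc.pl.rank_genes_groups(adata, key='rank_genes_r1.5', fontsize=12)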


Here, we observe potentially characteristic gene expression patterns, but we also observe a considerable number of ribosomal protein genes (*RPL* and *RPS*). Their products are part of the ribosomes and are therefore involved in mRNA translation. Usually, these genes are difficult to interpret.

Furthermore, the score itself is not interpretable in terms of specificity and significance in the case of clustering, because the clusters were previously defined as groups of cells that differ from the rest. We therefore compare a group that is a priori different to the rest, and the resulting scores (or p-values) are inflated. In addition, the smaller a cluster is, the smaller the observed score, unless a gene is very specific to the cluster. Typically, we may find marker genes in the gene lists of the `rank_genes_groups` test, but not all marker genes have a high expression level.
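The inflation can be made concrete with a minimal sketch on simulated noise. It assumes a single "gene" without any real subpopulations and uses a split at the median as a crude stand-in for clustering; all names and values are made up for illustration.

```python
import math
import random
import statistics

random.seed(0)
# One "gene" with pure noise: no real subpopulations exist
expression = [random.gauss(0.0, 1.0) for _ in range(1000)]

# "Cluster" the cells using the very values that are tested afterwards
cutoff = statistics.median(expression)
high = [x for x in expression if x > cutoff]
low = [x for x in expression if x <= cutoff]

# Welch t-statistic between the two artificial groups
t_stat = (statistics.mean(high) - statistics.mean(low)) / math.sqrt(
    statistics.variance(high) / len(high) + statistics.variance(low) / len(low)
)
print(round(t_stat, 1))
```

The statistic comes out very large although the data contain no structure, because the groups were defined from the same values that the test then compares.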

When it comes to cluster annotation, we usually have to tap into prior knowledge of the cell types. Depending on the data set, this may involve an extensive literature search. For brain cell types, for example, we may refer to several studies and web resources to extract marker gene sets. Alternative approaches either use annotated reference data to predict the cell type identity of new data (e.g. `scmap`) or train a classifier based on marker genes (e.g. `Garnett`).


In the case of PBMCs, we may refer to several studies and single-cell RNA-sequencing data analysis tutorials to extract marker gene sets. 
The following list is extracted from the Seurat tutorial on PBMCs.


|Marker Gene|Cell Type|
|---------|-------|
|IL7R|CD4 T cells|
|CD14, LYZ|CD14+ Monocytes|
|MS4A1|B cells|
|CD8A|CD8 T cells|
|FCGR3A, MS4A7|FCGR3A+ Monocytes|
|GNLY, NKG7|NK cells|
|FCER1A, CST3|Dendritic Cells|
|PPBP|Megakaryocytes|


Let us define a list of marker genes from literature.

In [None]:
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

**Tasks:** Annotate the clusters. 
Briefly check whether all marker genes are present in the dataset and visualise the marker genes in a UMAP (or another visualisation of your choice).
You can use auxiliary plots like `matrixplot`, `dotplot`, `heatmap` or `violin` plots, or colour an embedding (e.g. UMAP, t-SNE, FA) by the marker genes.

Let us check if the marker genes are expressed in our dataset.

In [None]:
np.in1d(marker_genes, adata.var_names)
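The boolean array above does not name the genes that are missing. A minimal sketch for listing them, where `var_names` is a made-up stand-in for `adata.var_names` and the marker list is shortened for illustration:

```python
# Shortened marker list for illustration
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'PPBP']
# Hypothetical stand-in for adata.var_names
var_names = ['IL7R', 'MS4A1', 'CD8A', 'LYZ', 'NKG7']

available = set(var_names)
present = [gene for gene in marker_genes if gene in available]
missing = [gene for gene in marker_genes if gene not in available]
print('present:', present)
print('missing:', missing)
```

Passing only the genes in `present` to the plotting functions avoids errors for genes that were filtered out.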

In [None]:
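#plots: colour the UMAP by each marker gene (assumes all markers are in adata.var_names)
sc.pl.umap(adata, color=marker_genes, use_raw=False)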

In [None]:
sc.pl.dotplot(adata, 
              var_names=marker_genes,
              groupby='louvain_r0.5', #or 'louvain_r1.5'
              use_raw=False)

In [None]:
sc.pl.heatmap(adata, var_names=marker_genes, 
              figsize=(5,10),
              groupby='louvain_r0.5', #or 'louvain_r1.5'
              use_raw=False, vmin=0)

In [None]:
sc.pl.matrixplot(adata, var_names=marker_genes,
                 groupby='louvain_r0.5', #or 'louvain_r1.5'
                 use_raw=False, vmin=0)

In [None]:
sc.pl.stacked_violin(adata, var_names=marker_genes, groupby='louvain_r0.5', #or 'louvain_r1.5'
                     use_raw=False)

Annotate clusters and create a new covariate.


|Marker Gene|Cell Type|
|-------|-------|
|IL7R|CD4 T cells|
|CD14, LYZ|CD14+ Monocytes|
|MS4A1|B cells|
|CD8A|CD8 T cells|
|FCGR3A, MS4A7|FCGR3A+ Monocytes|
|GNLY, NKG7|NK cells|
|FCER1A, CST3|Dendritic Cells|
|PPBP|Megakaryocytes|

Use the `pandas` data frame functionality to rename your clusters and visualize your annotation. Here, we use a dictionary to annotate the clusters. Please note that the number and order of clusters might change depending on the pre-processing decisions.

In [None]:
cluster2annotation = {
    '0': '', #replace '' by the name of the cell type, e.g. 'NK cells'
    '1': '',
    '2': '',
    '3': '',
    '4': '',
    '5': '',
    '6': '',
    '7': '',
    '8': '',
    '9': '',
    '10': '',
    '11': '',
    '12': '',
    '13': '',
    '14': '',
    '15': '',
    '16': '',
    '17': '',
    '18': '',
    '19': '', #adapt the number of clusters according to your results
}

In [None]:
#map new annotation; replace 'louvain_r1.5' by the clustering you annotated (e.g. 'louvain_r0.5')
adata.obs['annotated'] = adata.obs['louvain_r1.5'].map(cluster2annotation).astype('category')

In [None]:
adata.obs['annotated'].value_counts()
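A cluster that is missing from the dictionary is not renamed but becomes a missing value. The following minimal sketch shows this on a toy series; `clusters` and `toy_annotation` are made-up stand-ins for the cluster column and for `cluster2annotation`.

```python
import pandas as pd

# Toy cluster labels for five cells, stored as a categorical like scanpy cluster columns
clusters = pd.Series(['0', '1', '2', '1', '0'], dtype='category')
toy_annotation = {'0': 'B cells', '1': 'NK cells'}  # cluster '2' is left out on purpose

annotated = clusters.map(toy_annotation)
# Count the cells whose cluster had no entry in the dictionary
n_unlabelled = int(annotated.isna().sum())
print(n_unlabelled)

annotated = annotated.astype('category')
print(annotated.value_counts())
```

Checking for missing values after the mapping is a quick way to confirm that the dictionary covers every cluster.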

**Task:** Visualise your annotation on a UMAP as well as in a `matrixplot`, `dotplot`, `heatmap` or `violin` plot.

In [None]:
sc.pl.umap(adata, color='annotated', legend_loc='on data', title='', frameon=False)
sc.pl.umap(adata, color='annotated',  title='', frameon=True)

In [None]:
sc.pl.dotplot(adata, 
              var_names=marker_genes,
              groupby='annotated', 
              use_raw=False)

In [None]:
sc.pl.heatmap(adata, var_names=marker_genes, 
              figsize=(5,10),
              groupby='annotated', 
              use_raw=False, vmin=0)

In [None]:
sc.pl.matrixplot(adata, 
                 var_names=marker_genes, 
                 groupby='annotated', 
                 use_raw=False, vmin=0)

In [None]:
sc.pl.stacked_violin(adata, 
                     var_names=marker_genes, 
                     groupby='annotated', 
                     use_raw=False)

### Inspect subpopulations of B cells

Let us determine the differences in the B cell clusters by differential expression. Subcluster the B cells first.

In [None]:
sc.tl.leiden(adata, resolution=0.2, restrict_to = ['annotated',['B cells']], key_added='leiden_R')

In [None]:
rcParams['figure.figsize']= (5,5)
sc.pl.umap(adata, color='leiden_R')

In [None]:
sc.tl.rank_genes_groups(adata = adata, groupby='leiden_R', groups= ['B cells,1'], reference='B cells,0', rankby_abs=True)

In [None]:
rcParams['figure.figsize']=(10,5)
sc.pl.rank_genes_groups(adata, size=10, n_genes=40)

In [None]:
sc.pl.rank_genes_groups_violin(adata, groups='B cells,1', n_genes=10, use_raw=False)

**Questions:** 
Which differences do you see in the B cell subpopulations? 

Are there genes exclusively expressed in one of the populations? 

Which conclusions do you draw from the expression pattern?

**BONUS:** Investigate NK cells and try to distinguish NK cells and NKT cells.

## Save annotated data to file

At this point, we have finished the data annotation. This represents another milestone in the analysis of single-cell data. Once the annotation is finished, we won't have to touch this part of the analysis again.

In [None]:
adata.write(data_dir + 'data_annotated.h5ad')

**Comment:** End of fifth session. In the next session, we will use the annotated data in the cellxgene browser. Please close and shut down your Jupyter session for this.