# Practical workshop on scRNA-seq analysis
This workshop is conducted by ***Andreas Møller*** and ***Kedar Natarajan***. 
<br>You can contact us at<br>
Andreas Møller: andreasfm@bmb.sdu.dk<br>
Kedar Natarajan: knn@bmb.sdu.dk<br> 
***

## scRNA-seq exercises
This notebook contains the exercises which are required for the single cell genomics teaching module. 
The submission deadline is **May 18, 2020** <br>

For the exercises, we have selected a small 10x Genomics dataset containing scRNA-seq data from ~3k peripheral blood mononuclear cells (PBMCs).
You are required to:

- Write code to answer the analysis questions
- Produce visualisations and clustering, and optionally annotate cell types
- Comment on your analysis and motivation (in selected cells)
- Submit the complete notebook as Markdown for evaluation 
<br>

_Note: You can use the PBMC10K jupyter notebook as a reference for the exercises_

#### Please state your full name below

#### *** Assignment submitted by: " " ***

## Import packages

In [2]:
import numpy as np
import pandas as pd
import scanpy as sc

import matplotlib.pyplot as plt
import matplotlib.colors as mplcol
import matplotlib.font_manager
import matplotlib as mpl

import io
import anndata
from matplotlib import rcParams

The scanpy documentation is available <a href='https://scanpy.readthedocs.io/en/latest/api/index.html'>here</a>.

## Downloading the raw data

In [7]:
# !mkdir data
# !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
# !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz

mkdir: cannot create directory ‘data’: File exists
--2020-05-14 09:02:11--  http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 13.225.103.117, 13.225.103.118, 13.225.103.8, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|13.225.103.117|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7621991 (7.3M) [application/x-tar]
Saving to: ‘data/pbmc3k_filtered_gene_bc_matrices.tar.gz’


2020-05-14 09:02:11 (109 MB/s) - ‘data/pbmc3k_filtered_gene_bc_matrices.tar.gz’ saved [7621991/7621991]



#### ***Note: The dataset is already downloaded and available at "/KNN_data/Data_3k/filtered_gene_bc_matrices/hg19/"***

## Setup

In [8]:
sc.settings.verbosity = 3           
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80)
results_file = 'pbmc3k.h5ad'

scanpy==1.4.6 anndata==0.7.1 umap==0.4.2 numpy==1.17.0 scipy==1.4.1 pandas==0.25.1 scikit-learn==0.22.1 statsmodels==0.10.1 python-igraph==0.8.2 louvain==0.6.0


## Loading the data

In [20]:
#Load the pre-downloaded data by:
adata = sc.read_10x_mtx(
    '/KNN_data/Data_3k/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols')

--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.


## Filtering

### Exercise 1

Filter the dataset using the following criteria: 

- Remove cells with fewer than 200 genes detected
- Remove genes that are expressed in fewer than 3 cells
- Remove cells with more than 2500 genes detected
- Remove cells with more than 5% mitochondrial reads
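
The thresholds above can be sketched on a plain count matrix. This is only a minimal NumPy illustration of the logic, not the scanpy workflow itself: the function `filter_counts`, the toy matrix and the mitochondrial mask are hypothetical. In scanpy, `sc.pp.filter_cells` and `sc.pp.filter_genes` apply the gene and cell count thresholds to an AnnData object.

```python
import numpy as np

def filter_counts(counts, is_mito, min_genes=200, max_genes=2500,
                  min_cells=3, max_pct_mito=5.0):
    """Return boolean masks (keep_cells, keep_genes) for a cells x genes count matrix."""
    detected = counts > 0
    genes_per_cell = detected.sum(axis=1)
    total_counts = counts.sum(axis=1)
    pct_mito = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total_counts, 1)

    # Step 1: cells with too few detected genes (likely empty droplets).
    enough_genes = genes_per_cell >= min_genes
    # Step 2: genes detected in too few of the remaining cells.
    keep_genes = detected[enough_genes].sum(axis=0) >= min_cells
    # Step 3: cells with too many genes or a high mitochondrial share.
    keep_cells = enough_genes & (genes_per_cell <= max_genes) & (pct_mito <= max_pct_mito)
    return keep_cells, keep_genes

# Hypothetical toy data: 6 cells x 3000 genes, the first 10 genes are mitochondrial.
counts = np.zeros((6, 3000), dtype=int)
is_mito = np.zeros(3000, dtype=bool)
is_mito[:10] = True
for cell in (0, 4, 5):
    counts[cell, 10:310] = 1      # 300 genes detected, no mitochondrial reads
counts[1, 10:110] = 1             # only 100 genes detected
counts[2, 10:2610] = 1            # 2600 genes detected
counts[3, 10:300] = 1             # 300 genes detected in total ...
counts[3, :10] = 10               # ... but about 26% mitochondrial counts

keep_cells, keep_genes = filter_counts(counts, is_mito)
filtered = counts[keep_cells][:, keep_genes]
print(filtered.shape)  # (3, 300)
```

Only cells 0, 4 and 5 pass all three cell-level thresholds in this toy example; the other three each fail exactly one of them.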

Also describe the motivation for setting thresholds on the minimum and maximum number of genes and on the mitochondrial percentage. 

The motivation behind filtering the data is.....

In [None]:
sc.pl.highest_expr_genes(adata, n_top=20)

## Preprocessing steps

### Exercise 2

- Normalize the counts to 10,000 per cell
- Transform the data to log(x+1) scale
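
The two steps above can be illustrated with a small NumPy sketch. The function `normalize_log1p` and the toy counts are hypothetical; in scanpy the corresponding functions are `sc.pp.normalize_total` (with `target_sum=1e4`) and `sc.pp.log1p`.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for cells without any counts.
    scaled = counts / np.maximum(totals, 1.0) * target_sum
    return np.log1p(scaled)

# Hypothetical toy data: two cells with the same expression profile,
# but the second was sequenced ten times deeper.
counts = np.array([[1, 3, 0, 6],
                   [10, 30, 0, 60]])
norm = normalize_log1p(counts)
print(np.allclose(norm[0], norm[1]))
```

After normalization the two cells are indistinguishable, which is the point of the step: differences in sequencing depth no longer look like differences in expression.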

Also, describe the rationale of normalization and this log transformation.

In [None]:
adata.raw = adata

## Highly variable genes

In [None]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

In [None]:
adata = adata[:, adata.var['highly_variable']]

## Further processing

### Exercise 3

- Regress out variation caused by the library size (n_counts) and mitochondrial percentage (percent_mito)
- Scale all genes to mean 0 and unit variance
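
As a rough illustration of what these two steps do, the sketch below fits a linear model per gene on the covariates, keeps the residuals, and then z-scores each gene. The helper names `regress_out` and `scale` and the simulated data are hypothetical, and the use of the sample standard deviation (ddof=1) is a choice made for this sketch. In scanpy the corresponding functions are `sc.pp.regress_out` and `sc.pp.scale`.

```python
import numpy as np

def regress_out(X, covariates):
    """Residuals of each gene (column of X) after a least-squares fit on the covariates."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ coef

def scale(X):
    """Centre every gene to mean 0 and scale it to unit variance."""
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant genes at zero instead of dividing by zero
    return (X - X.mean(axis=0)) / std

# Hypothetical toy data: 200 cells, 2 genes.
rng = np.random.default_rng(0)
n_cells = 200
n_counts = rng.uniform(1000, 5000, size=n_cells)
percent_mito = rng.uniform(0.0, 0.05, size=n_cells)
covariates = np.column_stack([n_counts, percent_mito])
X = np.column_stack([
    0.001 * n_counts + rng.normal(0, 0.1, n_cells),  # gene driven by library size
    rng.normal(0, 1, n_cells),                       # gene unrelated to the covariates
])

residuals = regress_out(X, covariates)
Z = scale(residuals)
print(np.corrcoef(n_counts, X[:, 0])[0, 1], np.corrcoef(n_counts, residuals[:, 0])[0, 1])
```

The first printed correlation is close to 1 and the second is numerically zero: the library-size effect has been removed from the first gene.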

Also describe the rationale of covariate regression and feature scaling.

## Linear dimensionality reduction

### Exercise 4
- Perform PCA and visualize a few immune marker genes (e.g. 'CST3', 'NKG7' and 'PPBP')
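
To make the idea behind PCA concrete, here is a minimal sketch that computes principal components with a singular value decomposition in NumPy. The function `pca` and the simulated two-group data are hypothetical and only illustrate the principle; in scanpy, `sc.tl.pca` computes the components and `sc.pl.pca` plots them.

```python
import numpy as np

def pca(X, n_comps=2):
    """PCA via SVD: returns cell scores, gene loadings and explained variance ratios."""
    Xc = X - X.mean(axis=0)                      # centre every gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comps] * S[:n_comps]        # coordinates of the cells on the PCs
    var_ratio = S**2 / np.sum(S**2)              # share of the variance per component
    return scores, Vt[:n_comps], var_ratio[:n_comps]

# Hypothetical toy data: 100 cells x 20 genes, where the second group of cells
# has higher expression of the first 5 genes.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 50)
X = rng.normal(0, 1, size=(100, 20))
X[group == 1, :5] += 5.0

scores, loadings, var_ratio = pca(X, n_comps=2)
print(var_ratio)
```

In this toy example the first component captures the difference between the two groups, and its loadings are largest for the five genes that differ.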

Describe the concept of PCA and how it is used in single cell analysis

Discuss pros and cons of using PCA

## Non-linear dimensionality reduction

In [None]:
#First compute the neighbor graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
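
The neighbor graph connects each cell to its most similar cells in PCA space. The sketch below shows the core idea with an exact brute-force search in NumPy. The function `knn_indices` and the simulated coordinates are hypothetical, and the sketch only finds the neighbors; `sc.pp.neighbors` additionally stores distances and connectivity weights in the AnnData object.

```python
import numpy as np

def knn_indices(X, n_neighbors=10):
    """Indices of the n_neighbors nearest cells (Euclidean distance), excluding the cell itself."""
    sq_norms = (X**2).sum(axis=1)
    # Squared pairwise distances: |a|^2 + |b|^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor here
    return np.argsort(d2, axis=1)[:, :n_neighbors]

# Hypothetical toy data: 60 cells in 5 PCs, forming two well separated groups.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
pcs = rng.normal(0, 1, size=(60, 5))
pcs[labels == 1] += 100.0

nbrs = knn_indices(pcs, n_neighbors=10)
same_group = labels[nbrs] == labels[:, None]
print(nbrs.shape, same_group.all())
```

Because the two groups are far apart, every cell's neighbors come from its own group. Graph-based clustering and UMAP both build on this kind of neighborhood structure.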

### Exercise 5
- Run one non-linear dimensionality reduction (t-SNE or UMAP) and visualize a few immune marker genes (e.g. 'CST3', 'NKG7' and 'PPBP')

Describe the concept of non-linear dimensionality reduction and how it is used in single cell analysis

Discuss pros and cons of non-linear dimensionality reductions with respect to PCA

## Clustering

### Exercise 6

Run a graph-based clustering (sc.tl.louvain() or sc.tl.leiden()) on the expression matrix. Vary the resolution parameter until the clusters are stable.

- Superimpose clusters on the t-SNE or UMAP

## Differential expression

### Exercise 7
- Perform simple differential expression using a t-test (use sc.tl.rank_genes_groups())

- Show the top 5 marker DE genes per cluster as a dataframe
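
The principle of a cluster-versus-rest t-test can be sketched as follows. The function `top_markers`, the gene names and the simulated clusters are hypothetical, and the statistic is Welch's t computed by hand. With scanpy, the ranked gene names end up in `adata.uns['rank_genes_groups']['names']` after calling `sc.tl.rank_genes_groups`, and can be passed to `pd.DataFrame` in a similar way.

```python
import numpy as np
import pandas as pd

def top_markers(X, clusters, gene_names, n_top=5):
    """Rank genes per cluster by Welch's t statistic (cluster vs. all other cells)."""
    result = {}
    for c in np.unique(clusters):
        inside, outside = X[clusters == c], X[clusters != c]
        se = np.sqrt(inside.var(axis=0, ddof=1) / len(inside)
                     + outside.var(axis=0, ddof=1) / len(outside))
        t = (inside.mean(axis=0) - outside.mean(axis=0)) / se
        order = np.argsort(-t)[:n_top]           # largest t statistic first
        result[str(c)] = gene_names[order]
    return pd.DataFrame(result)                  # one column per cluster

# Hypothetical toy data: 3 clusters of 40 cells, 30 genes;
# cluster k over-expresses genes 5k ... 5k+4.
rng = np.random.default_rng(0)
clusters = np.repeat([0, 1, 2], 40)
gene_names = np.array([f"gene{i}" for i in range(30)])
X = rng.normal(0, 1, size=(120, 30))
for k in range(3):
    X[clusters == k, 5 * k:5 * k + 5] += 4.0

markers = top_markers(X, clusters, gene_names)
print(markers)
```

Each column of the resulting dataframe lists the five genes with the highest t statistic for that cluster, which in this toy example are the genes that were simulated as over-expressed.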

### Exercise 8 (Optional)
- Use your biological knowledge, or look up immune cell type marker genes in the scientific literature.
- Try annotating the clusters with cell type labels

## Exporting notebook as PDF for submission

The notebook can be exported as PDF (by pressing: File > Export Notebook As... > Export Notebook To PDF)