<a href="https://colab.research.google.com/github/QSBSC/QSBSC_Class_2020/blob/master/KleinFig.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install and Load Necessary Packages

In [0]:
!pip install scanpy[louvain]
!pip install scanpy[leiden]

In [0]:
import numpy as np
import pandas as pd

In [0]:
import scanpy as sc

**Set Scanpy's logging verbosity so it reports what each step is doing**

In [0]:
sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80)

# Mount Google Drive and Load Data Files

**The QSBSC folder must already be in My Drive before you run this step!**

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!ls "/content/drive/My Drive/Quantitative Systems Biology 2020/Data"

Read in the data using Scanpy, transposing the matrix so that rows are cells and columns are genes

In [0]:
adata = sc.read_csv(
    filename = "/content/drive/My Drive/Quantitative Systems Biology 2020/Data/GSM3067189_04hpf.csv").transpose() 

Read in cluster annotations and names using Pandas

In [0]:
anno = pd.read_csv("/content/drive/My Drive/Quantitative Systems Biology 2020/Data/GSM3067189_04hpf_clustID.txt", header = None)
adata.obs['Cluster'] = list(anno[0])
adata.obs['Cluster'] = adata.obs['Cluster'].astype('category')

In [0]:
names = pd.read_csv("/content/drive/My Drive/Quantitative Systems Biology 2020/Data/GSE112294_ClusterNames.csv")
new_cluster_names = list(names['ClusterName'][0:4])
adata.rename_categories('Cluster', new_cluster_names)
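
Renaming categories assigns the new names to the existing categories in their sorted order, so the list of names has to line up with the sorted cluster IDs. The sketch below illustrates this with pandas on made-up values; `cluster_ids` and `toy_names` are hypothetical stand-ins, not the contents of the actual annotation files.

```python
import pandas as pd

# Toy stand-in for the per-cell cluster IDs read from the clustID file.
cluster_ids = pd.Series([0, 2, 1, 0, 3, 2], name="Cluster").astype("category")

# Hypothetical names; the notebook takes the first four ClusterName rows.
toy_names = ["name_a", "name_b", "name_c", "name_d"]

# Names are matched to the sorted categories (0, 1, 2, 3), not to row order.
renamed = cluster_ids.cat.rename_categories(toy_names)
print(renamed.tolist())
```

If the rows in the names file were not ordered by cluster ID, the labels would be silently mismatched, so it is worth checking that ordering once by eye.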

What does your data look like?

In [0]:
adata

# Inspecting, Cleaning, and Normalizing Data

In [0]:
adata.var_names_make_unique()

Plot the 20 most highly expressed genes as a sanity check

In [0]:
sc.pl.highest_expr_genes(adata, n_top=20)

In [0]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
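
To make the two thresholds concrete, here is a rough NumPy sketch of the same idea on a tiny made-up count matrix. It is an illustration of the filtering logic under small toy thresholds, not Scanpy's implementation.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns).
counts = np.array([
    [3, 0, 1, 0, 2],
    [0, 0, 0, 0, 1],
    [5, 2, 0, 0, 4],
    [1, 0, 3, 0, 0],
])

min_genes = 2  # toy threshold; the notebook uses 200
min_cells = 2  # toy threshold; the notebook uses 3

# Keep cells in which at least min_genes genes are detected.
genes_per_cell = (counts > 0).sum(axis=1)
kept_cells = counts[genes_per_cell >= min_genes, :]

# Then keep genes detected in at least min_cells of the remaining cells.
cells_per_gene = (kept_cells > 0).sum(axis=0)
filtered = kept_cells[:, cells_per_gene >= min_cells]

print(filtered.shape)
```

Because the cell filter runs first, the gene filter only counts cells that survived it, which mirrors the order of the two calls above.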

Add the total counts per cell to adata as an observation annotation

In [0]:
adata.obs['n_counts'] = adata.X.sum(axis=1)

In [0]:
sc.pl.violin(adata, ['n_genes', 'n_counts'],
             jitter=0.4, multi_panel=True)

In [0]:
sc.pl.scatter(adata, x='n_counts', y='n_genes')

In [0]:
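# Keep only cells with fewer than 2500 detected genes; copy so later steps do not operate on a view
adata = adata[adata.obs.n_genes < 2500, :].copy()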

Normalize and transform data matrix

In [0]:
sc.pp.normalize_total(adata, target_sum=1e4)

In [0]:
sc.pp.log1p(adata)
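
As a minimal sketch of what these two steps do numerically, the NumPy example below rescales each cell of a made-up matrix to the same total and then applies log(1 + x). The matrix values are invented for illustration.

```python
import numpy as np

# Toy count matrix: 2 cells (rows) x 3 genes (columns).
counts = np.array([[1.0, 3.0, 0.0],
                   [10.0, 0.0, 10.0]])

target_sum = 1e4  # same target as in the notebook

# Divide each cell by its total counts, then rescale to target_sum.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum

# log(1 + x) compresses large values and keeps zeros at zero.
logged = np.log1p(normalized)

print(normalized.sum(axis=1))
```

After normalization every cell sums to the same total, so differences in sequencing depth no longer dominate comparisons between cells.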

Identify highly variable genes within the given mean-expression range

In [0]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

Subset the data to the highly variable genes, regress out total counts, and scale

In [0]:
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ['n_counts'])
sc.pp.scale(adata, max_value=10)
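
The scaling step can be pictured as a per-gene z-score followed by capping extreme values. The sketch below shows that idea in NumPy on random data with one planted outlier; it is a simplified illustration and may differ in detail from what `sc.pp.scale` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 200 cells x 3 genes, with one extreme value.
expr = rng.normal(size=(200, 3))
expr[0, 0] = 100.0

max_value = 10  # same cap as in the notebook

# Center each gene at zero and scale it to unit variance.
mean = expr.mean(axis=0)
std = expr.std(axis=0, ddof=1)
scaled = (expr - mean) / std

# Cap values above max_value so single outliers cannot dominate.
clipped = np.minimum(scaled, max_value)

print(scaled.max(), clipped.max())
```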

# Plotting Data using Dimensionality Reduction Techniques

In [0]:
sc.tl.pca(adata)
sc.pl.pca(adata, color = ['Cluster'])
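
For readers who want to see what a PCA projection involves, here is a small NumPy sketch using the singular value decomposition on random data. It illustrates the general technique only; the solver and defaults used by `sc.tl.pca` are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix: 20 cells x 5 genes.
X = rng.normal(size=(20, 5))

# Center each gene, then decompose with the SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the cells onto the first two principal components.
n_comps = 2
scores = Xc @ Vt[:n_comps].T

# Fraction of total variance captured by each component.
explained = S**2 / (S**2).sum()

print(scores.shape)
```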

In [0]:
sc.tl.tsne(adata)
sc.pl.tsne(adata, color = ['Cluster'])

In [0]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color = ['Cluster'])

# Identifying Genes that Contribute to Clusters

In [0]:
sc.tl.rank_genes_groups(adata, 'Cluster', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

# Unsupervised Clustering and Comparing Groups

Cluster the cells without supervision to see whether the resulting groups resemble the annotated clusters

In [0]:
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['leiden'])
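
Besides comparing the UMAP plots by eye, a contingency table gives a quick numerical view of how the two groupings overlap. The sketch below uses made-up labels in a plain DataFrame; in the notebook the same call could presumably be applied to `adata.obs['Cluster']` and `adata.obs['leiden']`.

```python
import pandas as pd

# Toy labels standing in for the annotated and Leiden cluster columns.
obs = pd.DataFrame({
    "Cluster": ["A", "A", "B", "B", "B", "C"],
    "leiden":  ["0", "0", "1", "1", "0", "2"],
})

# Rows are annotated clusters, columns are Leiden clusters, cells are counts.
table = pd.crosstab(obs["Cluster"], obs["leiden"])
print(table)
```

A table where each row has one dominant column suggests that the unsupervised clusters recover the annotated groups.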

In [0]:
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)