## Preprocessing

Check the Python version, since Scanpy is only compatible with Python 3

In [None]:
from platform import python_version

print(python_version())

Install Scanpy

In [None]:
!pip install scanpy

Import necessary packages

In [None]:
import numpy as np
import pandas as pd
import scanpy as sc

Set up some global settings in Scanpy

In [None]:
sc.settings.verbosity = 3
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Read the count matrix with Scanpy and transpose it so that rows are cells and columns are genes, giving an AnnData object

In [None]:
adata = sc.read('/content/drive/MyDrive/Python final project/Extracted/Copy of GSM3972018/Copy of GSM3972018_159_matrix.mtx', var_names='gene_symbols', cache=True).T

Set the variable names and observation names of the AnnData object to the gene names and cell barcodes respectively

In [None]:
adata.var_names = pd.read_csv('/content/drive/MyDrive/Python final project/Extracted/Copy of GSM3972018/Copy of GSM3972018_159_genes.tsv', header = None, sep ='\t')[1]
adata.obs_names = pd.read_csv('/content/drive/MyDrive/Python final project/Extracted/Copy of GSM3972018/Copy of GSM3972018_159_barcodes.tsv', header = None)[0]

Make the variable names of the AnnData object unique by appending a numeric suffix to duplicated names (no genes are removed)

In [None]:
adata.var_names_make_unique()
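
To illustrate what making names unique means here, the pure-Python sketch below imitates the suffixing behavior on a made-up list of gene symbols. `make_unique_sketch` is a hypothetical helper, not part of Scanpy or AnnData, and it ignores edge cases such as a suffixed name colliding with an existing name.

```python
from collections import Counter


def make_unique_sketch(names, join="-"):
    """Keep the first occurrence of each name; append '-1', '-2', ... to repeats."""
    seen = Counter()
    unique = []
    for name in names:
        if seen[name] == 0:
            unique.append(name)
        else:
            unique.append(f"{name}{join}{seen[name]}")
        seen[name] += 1
    return unique


# Made-up gene list in which one symbol appears three times.
genes = ["ACE2", "TBCE", "TBCE", "FABP6", "TBCE"]
unique_genes = make_unique_sketch(genes)
print(unique_genes)
```

Every entry of the input list is still present afterwards; only the repeated symbols are renamed.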

Display the AnnData object

In [None]:
adata

Plot the top 20 highly expressed genes

In [None]:
sc.pl.highest_expr_genes(adata, n_top=20)

Filter out cells that have fewer than 200 genes detected and genes that are detected in fewer than 3 cells

In [None]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
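
The two filters can be sketched in NumPy on a toy count matrix. This is an illustration under stated assumptions, not Scanpy's implementation: the matrix values are made up, a gene counts as "detected" when its count is above zero, and `min_genes` is lowered to 2 so the toy data has something to filter.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (values are made up).
counts = np.array([
    [5, 0, 1, 0],
    [0, 0, 0, 2],
    [3, 1, 4, 0],
    [1, 0, 2, 0],
])

min_genes = 2  # toy threshold; the notebook uses 200
min_cells = 3  # same threshold as in the notebook

# Keep cells with at least `min_genes` genes detected (count > 0).
genes_per_cell = (counts > 0).sum(axis=1)
cell_mask = genes_per_cell >= min_genes
counts = counts[cell_mask]

# Then keep genes detected in at least `min_cells` of the remaining cells.
cells_per_gene = (counts > 0).sum(axis=0)
gene_mask = cells_per_gene >= min_cells
counts = counts[:, gene_mask]

print(cell_mask, gene_mask, counts.shape)
```

Because the gene filter runs on the already filtered cells, the order of the two calls can change which genes survive.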

Annotate the group of mitochondrial genes as 'mt' and use it as a control variable when calculating quality control metrics

In [None]:
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
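
As a rough guide to what the three per-cell metrics used below mean, here is a NumPy sketch on a toy matrix. The gene symbols and counts are made up, and the variable names simply mirror the column names that appear in `adata.obs`; the real function computes more than this.

```python
import numpy as np

# Illustrative gene symbols; two of them carry the 'MT-' prefix.
gene_names = ["MT-CO1", "ACE2", "MT-ND1", "FABP6"]
# Toy count matrix: rows are cells, columns are genes.
counts = np.array([
    [10, 5, 0, 5],
    [1, 0, 1, 18],
    [0, 3, 0, 0],
])

# Boolean mask of mitochondrial genes, as in adata.var['mt'].
is_mt = np.array([name.startswith("MT-") for name in gene_names])

# Number of genes with at least one count in each cell.
n_genes_by_counts = (counts > 0).sum(axis=1)
# Total counts per cell.
total_counts = counts.sum(axis=1)
# Percentage of each cell's counts that fall in mitochondrial genes.
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

print(n_genes_by_counts, total_counts, pct_counts_mt)
```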

Violin plots showing the distribution of some of the quality control metrics
*   the number of genes with at least one count in each cell
*   the total counts per cell
*   the percentage of counts in mitochondrial genes

In [None]:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

Scatter plots of some of the quality control metrics

*   the percentage of counts in mitochondrial genes against the total counts per cell
*   the number of genes detected against the total counts per cell

In [None]:
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

Filter out cells that have 2500 or more detected genes or 5% or more of their counts in mitochondrial genes

In [None]:
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]

Normalize the count matrix to 10,000 counts per cell, so that counts become comparable among cells.

In [None]:
sc.pp.normalize_total(adata, target_sum=1e4)

Logarithmize the data

In [None]:
sc.pp.log1p(adata)
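
The normalization and log transform above amount to simple arithmetic, sketched here in NumPy on a made-up dense matrix. Scanpy also handles sparse matrices and extra options that this sketch leaves out.

```python
import numpy as np

# Toy count matrix: two cells with very different sequencing depth.
counts = np.array([
    [2.0, 6.0, 2.0],
    [50.0, 30.0, 20.0],
])
target_sum = 1e4

# Divide each cell by its total and rescale so every cell sums to target_sum.
total_per_cell = counts.sum(axis=1, keepdims=True)
normalized = counts / total_per_cell * target_sum

# log1p computes log(1 + x), which keeps zeros at zero.
logged = np.log1p(normalized)

print(normalized)
print(logged)
```

After normalization both cells sum to 10,000, so differences between them reflect relative expression rather than sequencing depth.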

Identify highly variable genes: those with mean expression between 0.0125 and 3 and normalized dispersion greater than 0.5

In [None]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

Scatter plot of highly-variable genes

In [None]:
sc.pl.highly_variable_genes(adata)

Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression

In [None]:
adata.raw = adata

Subset the AnnData object to the highly variable genes

In [None]:
adata = adata[:, adata.var.highly_variable]

Regress out the effects of total counts per cell and the percentage of counts in mitochondrial genes

In [None]:
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
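
Conceptually, regressing out means fitting a linear model of each gene on the covariates and keeping the residuals. The NumPy sketch below shows that idea with ordinary least squares on simulated data; the covariates and expression values are random stand-ins, and Scanpy's own fitting routine may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 50

# Made-up covariates standing in for total_counts and pct_counts_mt.
covariates = rng.normal(size=(n_cells, 2))
# Made-up expression of 3 genes that partly depends on the covariates.
effects = np.array([[1.0, 0.5, 0.0], [0.0, 2.0, -1.0]])
expression = covariates @ effects + rng.normal(size=(n_cells, 3))

# Design matrix with an intercept column plus the two covariates.
design = np.column_stack([np.ones(n_cells), covariates])
# Least-squares fit for all genes at once (one coefficient column per gene).
coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
# The residuals are what remains after removing the fitted covariate effects.
residuals = expression - design @ coef

print(residuals.shape)
```

The residuals are orthogonal to the intercept and to both covariates, which is the sense in which their linear effect has been removed.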

Scale each gene to unit variance and clip values exceeding 10 standard deviations.

In [None]:
sc.pp.scale(adata, max_value=10)
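
A minimal NumPy sketch of scaling and clipping, on made-up data with one extreme value. It assumes zero-centering, the sample standard deviation (`ddof=1`), and clipping only at the upper bound; Scanpy's exact behavior may differ between versions.

```python
import numpy as np

max_value = 10
n_cells = 200

# Two made-up genes: one with a single extreme value, one well behaved.
X = np.zeros((n_cells, 2))
X[0, 0] = 100.0
X[:, 1] = np.linspace(0.0, 1.0, n_cells)

# Center each gene and divide by its standard deviation (z-score).
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)
scaled = (X - mean) / std

# Cap scaled values at max_value so single outliers cannot dominate.
clipped = np.minimum(scaled, max_value)

print(scaled.max(), clipped.max())
```

Only the outlier in the first gene is affected by the clip; the second gene keeps unit variance.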

## Principal component analysis

Perform PCA on the highly variable genes using the default "arpack" singular value decomposition solver

In [None]:
sc.tl.pca(adata, svd_solver='arpack')

Plot the PCA result on the first two PCs, colored by CST3 expression

In [None]:
sc.pl.pca(adata, color='CST3')

Plot the contribution of each PC to the total variance in the data, which helps to decide how many PCs to use when computing the neighborhood relations of cells

In [None]:
sc.pl.pca_variance_ratio(adata, log=True)
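
The variance ratio shown in that plot can be reproduced in principle with a full SVD in NumPy. This is a sketch on simulated data, not the "arpack" solver itself, which computes only the leading components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 100 cells, 5 genes with decreasing spread.
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# PCA operates on centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Coordinates of each cell on each principal component.
scores = U * S
# Variance captured by each PC, and its share of the total.
explained_variance = S**2 / (X.shape[0] - 1)
variance_ratio = explained_variance / explained_variance.sum()

print(variance_ratio)
```

A sharp drop in the ratio after the first few components suggests that later PCs add little information.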

In [None]:
adata

## Computing the neighborhood graph

In [None]:
sc.pp.neighbors(adata)
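
The core idea of the neighborhood graph, connecting each cell to its nearest neighbors, can be sketched with a brute-force search in NumPy. The coordinates and `n_neighbors = 2` are made up for illustration; Scanpy works on the PCA representation and additionally computes connectivity weights, which this sketch does not attempt.

```python
import numpy as np

# Made-up 2-D coordinates for six cells forming two well-separated groups.
coords = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
n_neighbors = 2  # toy value for illustration

# Pairwise Euclidean distances between all cells.
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))
# Exclude each cell from its own neighbor list in this sketch.
np.fill_diagonal(dist, np.inf)

# Indices of the n_neighbors closest cells for every cell.
knn_indices = np.argsort(dist, axis=1)[:, :n_neighbors]

print(knn_indices)
```

Each cell's neighbors come from its own group, which is the structure that the embedding and clustering steps build on.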

## Embedding the neighborhood graph

In [None]:
sc.tl.umap(adata)

In [None]:
sc.pl.umap(adata, color='CST3')

## Clustering the neighborhood graph

Install leidenalg package

In [None]:
!pip install leidenalg

Perform Leiden clustering

In [None]:
sc.tl.leiden(adata)

Plot the clusters

In [None]:
sc.pl.umap(adata, color='leiden')

Save the result

In [None]:
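# `Save` must already be defined as the output file path (for example a .h5ad file on the mounted Drive)
adata.write(Save)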

## Investigating marker gene expression

Violin plot of expression of 3 marker genes across clusters

In [None]:
sc.pl.violin(adata, ['ACE2', 'FABP6', 'ANPEP'], groupby='leiden')

Dot plot of expression of 3 marker genes across clusters

In [None]:
sc.pl.dotplot(adata, ['ACE2', 'FABP6', 'ANPEP'], groupby='leiden')
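
A dot plot summarizes two numbers per gene and cluster. Assuming the default settings, dot size reflects the fraction of cells in the cluster with expression above zero and dot color reflects the mean expression in the cluster. The sketch below computes both for one gene on made-up values.

```python
import numpy as np

# Made-up log-normalized expression of one gene in eight cells.
expression = np.array([0.0, 1.2, 0.0, 2.0, 0.0, 0.0, 0.8, 0.0])
# Made-up cluster labels for the same cells.
clusters = np.array(["0", "0", "0", "0", "1", "1", "1", "1"])

summary = {}
for cluster in np.unique(clusters):
    values = expression[clusters == cluster]
    summary[str(cluster)] = {
        # Share of cells in the cluster with any expression (dot size).
        "fraction_expressing": float((values > 0).mean()),
        # Average expression over all cells in the cluster (dot color).
        "mean_expression": float(values.mean()),
    }

print(summary)
```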

Dot plot of ACE2 expression across clusters

In [None]:
sc.pl.dotplot(adata, 'ACE2', groupby='leiden')

Dot plot of FABP6 expression across clusters

In [None]:
sc.pl.dotplot(adata, 'FABP6', groupby='leiden')

Dot plot of ANPEP expression across clusters

In [None]:
sc.pl.dotplot(adata, 'ANPEP', groupby='leiden')

Stacked violin plot of expression of 3 marker genes across clusters

In [None]:
sc.pl.stacked_violin(adata, ['ACE2', 'FABP6', 'ANPEP'], groupby='leiden', rotation=90)

Stacked violin plot of ACE2 expression across clusters

In [None]:
sc.pl.stacked_violin(adata, 'ACE2', groupby='leiden', rotation=90)

Stacked violin plot of FABP6 expression across clusters

In [None]:
sc.pl.stacked_violin(adata, 'FABP6', groupby='leiden', rotation=90)

Stacked violin plot of ANPEP expression across clusters

In [None]:
sc.pl.stacked_violin(adata, 'ANPEP', groupby='leiden', rotation=90)

Color expression of marker genes on UMAP

In [None]:
sc.pl.umap(adata, color=['ACE2', 'FABP6', 'ANPEP'])