# Mid-gestation fetal cortex dataset: Cluster identification and characterization

__Upstream Steps__

* QC filter on cells
* Expression filter on genes
* Normalization and log1p transformation by Scanpy
* Highly variable gene (HVG) selection by Triku
* Integration by Harmony
* Dimensionality reduction after integration

__This notebook__

* Define resolution for cluster identification
* Cluster identification
* Cluster characterization


----

# 1. Environment Set Up

## 1.1 Library import

In [None]:
import numpy as np
import pandas as pd
import scanpy as sc
import scanpy.external as sce
import seaborn as sns
import igraph as ig
from scipy.sparse import csr_matrix, isspmatrix
from datetime import datetime

from gprofiler import GProfiler

In [None]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80)

## 1.2 Timestamp start of computations

In [None]:
print(datetime.now())

## 1.3 Result file

In [None]:
# Uncomment and adapt the path: results_file is needed by the save step in section 6.1.
#results_file = '/home/..../brainomics/Dati/4_AdataClusters.h5ad'

----

# 2. Read input files  

In [None]:
# Uncomment one of the following lines and adapt the path to load the input AnnData object.
#adata = sc.read('/home/..../brainomics/Data/3_AdataDimRed.h5ad')
#adata = sc.read('/group/brainomics/course_material/Day2/data/Ongoing/3_AdataDimRed.h5ad')

In [None]:
print('Loaded normalized AnnData object: number of cells', adata.n_obs)
print('Loaded normalized AnnData object: number of genes', adata.n_vars)

print('Available metadata for each cell: ', adata.obs.columns)

----

# 3. Clustering

We use a graph-based clustering algorithm that, starting from the neighbourhood graph, aims to identify “communities” of cells that are more connected to cells in the same community than to cells in other communities. Each community represents a cluster that is then subjected to downstream characterization.

Here we use the Leiden algorithm ([reference paper](https://www.nature.com/articles/s41598-019-41695-z)). Advantages: it is computationally efficient, and it solves the problem of badly connected communities that sometimes arises with the Louvain algorithm.

Different resolution values are tested: lower values result in a smaller number of larger clusters, while higher values yield a larger number of smaller clusters.
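
To make the role of the resolution concrete, here is a minimal sketch on a toy graph. It assumes a modularity-style quality function with a resolution parameter, which to our knowledge is the kind of function Leiden optimizes by default in Scanpy. The toy graph, the three hand-picked candidate partitions and the helper names `quality` and `best_candidate` are illustrative assumptions. The sketch does not run Leiden itself; it only scores the candidates.

```python
# Toy graph: two triangles (0-1-2 and 3-4-5) joined by a single edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]


def quality(partition, edges, resolution):
    """Modularity with a resolution parameter.

    partition: list of sets of nodes.
    Returns the sum over communities of e_c / m - resolution * (K_c / 2m)^2,
    where e_c is the number of edges inside the community, K_c the summed
    degree of its nodes and m the total number of edges.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    total = 0.0
    for community in partition:
        internal = sum(1 for u, v in edges if u in community and v in community)
        k_sum = sum(degree[n] for n in community)
        total += internal / m - resolution * (k_sum / (2 * m)) ** 2
    return total


# Three hand-picked candidate partitions, from coarse to fine.
candidates = {
    "one cluster": [{0, 1, 2, 3, 4, 5}],
    "two clusters": [{0, 1, 2}, {3, 4, 5}],
    "six singletons": [{n} for n in range(6)],
}


def best_candidate(resolution):
    """Name of the candidate partition with the highest quality."""
    return max(candidates, key=lambda name: quality(candidates[name], edges, resolution))


for gamma in (0.2, 1.0, 3.0):
    print(gamma, best_candidate(gamma))
```

On this toy graph the preferred candidate moves from a single cluster to the two triangles and then to singletons as the resolution increases, which mirrors the behaviour described above.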

## 3.1 Test resolutions with Leiden 

In [None]:
res = [0.2, 0.4, 0.6, 0.8, 1, 1.2]
leiden_labels = []

for x in res:
    label = "Leiden_" + str(x).replace('.', '')
    leiden_labels.append(label) 
    sc.tl.leiden(adata, resolution = x, key_added= label)
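
A quick numeric summary can complement the plots when comparing resolutions. The sketch below counts the distinct cluster labels per clustering key. The helper name `clusters_per_key` and the toy labels are assumptions made for illustration. The helper only needs an object that returns a sequence of labels for each key, so it should also accept `adata.obs`.

```python
def clusters_per_key(obs, keys):
    """Return {key: number of distinct cluster labels} for each clustering key."""
    return {key: len(set(obs[key])) for key in keys}


# Toy stand-in for adata.obs (hypothetical labels, for illustration only).
toy_obs = {
    "Leiden_02": ["0", "0", "0", "1", "1", "1"],
    "Leiden_06": ["0", "0", "1", "1", "2", "2"],
    "Leiden_12": ["0", "1", "2", "3", "4", "4"],
}

summary = clusters_per_key(toy_obs, list(toy_obs))
for key, n_clusters in summary.items():
    print(key, n_clusters)
```

On the real object the call would be `clusters_per_key(adata.obs, leiden_labels)`.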

In [None]:
sc.pl.umap(adata, color=leiden_labels)

In [None]:
sc.pl.draw_graph(adata, color=leiden_labels)

## 3.2 Choose granularity

In [None]:
chosen_leiden = 'Leiden_06'
key_leiden = 'rank_L' + chosen_leiden.split('_')[-1]  # e.g. 'rank_L06'

In [None]:
adata.obs['FinalLeiden'] = adata.obs[chosen_leiden]
adata.obs['FinalLeiden'].value_counts()

In [None]:
sc.pl.umap(adata, color=[chosen_leiden, 'Cluster'])

In [None]:
pd.crosstab(adata.obs['Cluster'], adata.obs['FinalLeiden'], margins=True)
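
The cross-tabulation can be condensed into one line per cluster: the most frequent annotation and the fraction of cells carrying it. This is a minimal sketch. It assumes that `Cluster` holds a pre-existing annotation carried over from upstream steps. The helper name `dominant_annotation` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_annotation(clusters, annotations):
    """For each cluster, return (most frequent annotation, fraction of its cells)."""
    counts = defaultdict(Counter)
    for cluster, annotation in zip(clusters, annotations):
        counts[cluster][annotation] += 1
    result = {}
    for cluster, counter in counts.items():
        label, n_cells = counter.most_common(1)[0]
        result[cluster] = (label, n_cells / sum(counter.values()))
    return result


# Toy labels (hypothetical, for illustration only).
toy_clusters = ["0", "0", "0", "0", "1", "1", "1", "2"]
toy_annotations = ["RG", "RG", "RG", "IPC", "ExN", "ExN", "ExN", "InN"]

toy_dominant = dominant_annotation(toy_clusters, toy_annotations)
for cluster, (label, fraction) in toy_dominant.items():
    print(cluster, label, round(fraction, 2))
```

On the real object the call would be `dominant_annotation(adata.obs['FinalLeiden'], adata.obs['Cluster'])`. A low fraction flags a cluster that mixes several annotated populations.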

----

# 4. Top markers

In [None]:
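# Workaround: rank_genes_groups reads adata.uns['log1p']['base'], which may be missing after reloading from .h5ad.
adata.uns['log1p']['base'] = None
sc.tl.rank_genes_groups(adata, groupby='FinalLeiden', method='wilcoxon', key_added=key_leiden,
                        use_raw=False)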

In [None]:
GroupMarkers = pd.DataFrame(adata.uns[key_leiden]['names']).head(101)
GroupMarkers.columns = 'Cl_' + GroupMarkers.columns

GroupMarkers.head(21)

In [None]:
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5, key=key_leiden)
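
Besides gene names, the ranking stored in `adata.uns` also carries adjusted p-values and log fold changes under the fields `pvals_adj` and `logfoldchanges`. A possible way to keep only the stronger markers of one cluster is sketched below. The helper name `filter_markers`, the default thresholds and the toy values are assumptions for illustration, not recommended cut-offs.

```python
def filter_markers(result, group, max_padj=0.05, min_lfc=1.0):
    """Names of the genes of one group passing both thresholds, in ranked order.

    result: mapping with fields 'names', 'pvals_adj' and 'logfoldchanges',
            each indexable by group label (as in adata.uns[<rank key>]).
    """
    names = result["names"][group]
    padj = result["pvals_adj"][group]
    lfc = result["logfoldchanges"][group]
    return [
        gene
        for gene, p, change in zip(names, padj, lfc)
        if p <= max_padj and change >= min_lfc
    ]


# Toy ranking for one cluster (hypothetical genes and values).
toy_ranking = {
    "names": {"0": ["GENE_A", "GENE_B", "GENE_C"]},
    "pvals_adj": {"0": [1e-10, 0.2, 1e-4]},
    "logfoldchanges": {"0": [2.5, 3.0, 0.3]},
}

kept = filter_markers(toy_ranking, "0")
print(kept)
```

On the real object the call would be `filter_markers(adata.uns[key_leiden], '0')`.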

----

# 5. Functional analysis on top markers

In [None]:
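def CustomGO(adata, cluster, rank, n_markers=40, show=10):
    """
    GO enrichment analysis with GProfiler on the top marker genes of one cluster.

    adata: AnnData object with marker ranking stored in adata.uns[rank]
    cluster: cluster label (a column of the marker table)
    rank: key in adata.uns under which the marker ranking is stored
    n_markers: number of top markers used as query
    show: number of enrichment results returned
    """
    # Query: top markers of the selected cluster; background: all genes in adata.
    GroupMarkers = pd.DataFrame(adata.uns[rank]['names']).head(n_markers)
    q = GroupMarkers[cluster].tolist()
    u = adata.var_names.tolist()
    # gp is the global GProfiler instance created in the next cell.
    return gp.profile(organism='hsapiens', sources=['GO:BP', 'GO:CC'], query=q,
                      background=u, no_iea=True).head(show)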

In [None]:
gp = GProfiler(return_dataframe=True)

In [None]:
# Cluster labels are read from the data rather than hard-coded.
Cl = adata.obs['FinalLeiden'].cat.categories.tolist()

for i in Cl: 
    print("\n\n {}".format(i)) 
    display(CustomGO(adata, cluster=i, rank=key_leiden, n_markers=40,  show=8))

----

# 6. Save

## 6.1 Save AnnData

In [None]:
adata.write(results_file)

## 6.2 Timestamp finished computations 

In [None]:
print(datetime.now())

## 6.3 Save Python and HTML versions

In [None]:
nb_fname = '4_Clusters'
nb_fname

In [None]:
%%bash -s "$nb_fname"
jupyter nbconvert "$1".ipynb --to="python"
jupyter nbconvert "$1".ipynb --to="html"