# Mid-gestation fetal cortex dataset: Batch correction and Dimensionality reduction


__Upstream Steps__

* QC filter on cells
* expression filter on genes
* Normalization and log10 transformation by Scanpy
* HVG by Triku

__This notebook__

* Integration by Harmony
* Dimensionality reduction by
    * UMAP
    * Diffusion Map
    * Force-Directed Graph

# 1. Environment Set Up

## 1.1 Library import

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import igraph as ig
import matplotlib.pyplot as plt
from datetime import datetime
from scipy.sparse import csr_matrix, isspmatrix

import scanpy as sc
import scanpy.external as sce


## 1.2 Settings

In [None]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80)

In [None]:
# Output path used in section 6.1 (placeholder path: adjust to your environment)
results_file = '/home/..../brainomics/Dati/3.1_AdataDimRed.h5ad'

## 1.3 Start Computation time

In [None]:
print(datetime.now())

----

# 2. Read input files  

In [None]:
#adata = sc.read('/home/..../brainomics/Data/2_AdataNorm.h5ad')
adata = sc.read('/group/brainomics/course_material/Day2/data/Ongoing/2_AdataNorm.h5ad')

In [None]:
print('Loaded normalized AnnData object: number of cells', adata.n_obs)
print('Loaded normalized AnnData object: number of genes', adata.n_vars)

print('Available metadata for each cell: ', adata.obs.columns)

----

In [None]:
adata_check = adata.copy()

In [None]:
sc.pp.pca(adata_check, n_comps=50, use_highly_variable=True, svd_solver='arpack')

In [None]:
sc.pp.neighbors(adata_check, n_neighbors=30, n_pcs=25)

In [None]:
sc.tl.umap(adata_check)

In [None]:
sc.pl.umap(adata_check, color=['Donor', 'Layer'], size=10)

In [None]:
plt.rcParams['figure.dpi'] = 120
sc.pl.umap(adata_check, color=['Donor'], size=10)

In [None]:
del adata_check

----

# 3. Integration with Harmony

## 3.1 Calculate PCA

In [None]:
sc.pp.pca(adata, n_comps=50, use_highly_variable=True, svd_solver='arpack')

## 3.2 Run Harmony

In [None]:
sc.external.pp.harmony_integrate(adata, 'Donor')

## 3.3 Repeat plot with batch-corrected PCA

The corrected PCs are saved in `.obsm["X_pca_harmony"]`

In [None]:
sc.pl.embedding(adata, basis="X_pca_harmony", color=['Donor'])

----

# 4. Calculate dimensionality reductions

## Compute neighborhood graph

We compute the neighborhood graph of cells using the Harmony-corrected PCA representation of the data. The graph captures how similar each cell is to every other cell, separating cells that are close to each other from those that are not.

This step is a prerequisite for UMAP plotting and for clustering.

__Key parameters:__ 

* n_pcs: number of PCs used to compute the kNN graph
* n_neighbors: number of neighbors. Larger values preserve more global structure at the cost of detailed local structure. Values typically fall in the range 5 to 50, with 10 to 15 being a sensible default.
* metric: distance metric used in the calculation



### n_pcs

The elbow plot helps determine how many PCs are needed to capture most of the variation in the data. It shows the fraction of variance explained by each PC; the point where the curve flattens (the elbow) is usually taken as the threshold. However, locating the elbow is somewhat subjective.
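
As a complement to reading the elbow by eye, one possible heuristic is to count how many leading PCs are needed to reach a chosen share of the variance captured by the computed PCs. The sketch below is not the method used in this notebook: the variance ratios are hypothetical toy values, the 90% threshold is an arbitrary assumption, and `n_pcs_for_fraction` is a helper written only for illustration. In Scanpy the real values are stored in `adata.uns['pca']['variance_ratio']` after `sc.pp.pca`.

```python
# Toy variance ratios (hypothetical values, ordered from PC1 downwards).
# With real data these would come from adata.uns['pca']['variance_ratio'].
variance_ratio = [0.20, 0.12, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01, 0.005, 0.005]


def n_pcs_for_fraction(ratios, fraction):
    """Return the smallest number of leading PCs whose summed variance
    ratio reaches `fraction` of the variance captured by all listed PCs."""
    total = sum(ratios)
    running = 0.0
    for i, r in enumerate(ratios, start=1):
        running += r
        # Small tolerance guards against floating-point rounding
        if running >= fraction * total - 1e-12:
            return i
    return len(ratios)


n_pcs_90 = n_pcs_for_fraction(variance_ratio, 0.90)
print("PCs needed for 90% of the captured variance:", n_pcs_90)
```

The result depends on how many PCs were computed in the first place, so it is best read together with the elbow plot rather than instead of it.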

In [None]:
sc.settings.set_figure_params(dpi=80)
sc.pl.pca_variance_ratio(adata)

In [None]:
sc.pl.pca_variance_ratio(adata, log=True)

## Choice of parameters


* n_pcs: we will use 20 PCs



### n_neighbors
From documentation:
> The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation. **Larger values result in more global views of the manifold, while smaller values result in more local data being preserved.**
> In general values should be in the range 2 to 100. 
> + If knn is True (Default), number of nearest neighbors to be searched. 
>>The k-nearest neighbor graph (k-NNG) is a graph in which two vertices p and q are connected by an edge, if the distance between p and q is among the k-th smallest distances from p to other objects from P.
> + If knn is False, a Gaussian kernel width is set to the distance of the n_neighbors neighbor.
>>This transition matrix is computed using a nearest neighbor graph whose edge weights have a Gaussian distribution with respect to Euclidian distance in gene expression space; transition probabilities correspond to edge weights


Overall:
- lower n_neighbors = more local structure of the data preserved
- higher n_neighbors = more global structure of the data preserved
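
To make the local versus global trade-off concrete, here is a minimal sketch of a brute-force k-nearest-neighbor search on hypothetical toy points. It is not how Scanpy builds its graph internally (Scanpy uses approximate search on the PCA representation); `knn_indices` and the coordinates are assumptions made for illustration only.

```python
import math

# Hypothetical toy data: two well-separated groups of 2D points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0),        # group A (indices 0-2)
          (10.0, 10.0), (12.0, 10.0), (10.0, 13.0)]  # group B (indices 3-5)


def knn_indices(points, query, k):
    """Indices of the k points closest to points[query] (Euclidean distance),
    excluding the query point itself, ordered from nearest to farthest."""
    dists = [(math.dist(points[query], p), j)
             for j, p in enumerate(points) if j != query]
    dists.sort()
    return [j for _, j in dists[:k]]


for k in (2, 4):
    print(f"k={k}: neighbors of point 0 ->", knn_indices(points, 0, k))
```

With k=2 the neighbors of point 0 stay inside group A, while with k=4 the neighborhood has to reach into group B: a larger `n_neighbors` links more distant regions of the data, which is what pushes the embedding towards global structure.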

In [None]:
neigh = [5, 20, 50, 80]  
#neigh = [5, 10, 15, 20, 25]  
#neigh = [3, 5, 7, 9, 11]  

dict_neigh = {}

for x in neigh:

    dict_key = 'Neighbours_' + str(x)
    dict_neigh[dict_key] = []

    print('# neighbors:', x)
    sc.pp.neighbors(adata, n_neighbors=x, n_pcs=20, use_rep="X_pca_harmony", key_added="harmony")
    sc.tl.umap(adata, neighbors_key="harmony", random_state=1)
    sc.pl.umap(adata, color=['Donor', 'Cluster'],
               palette=sc.pl.palettes.vega_20_scanpy, size=8)

    # Leiden clustering at resolutions 0.3, 0.4, ..., 1.2 (keys Leiden_03 ... Leiden_12).
    # neighbors_key makes the clustering use the Harmony-based graph computed above.
    # The number of clusters found at each resolution is stored in dict_neigh.
    for step in range(3, 13):
        key = f'Leiden_{step:02d}'
        sc.tl.leiden(adata, resolution=step / 10, key_added=key, neighbors_key="harmony")
        dict_neigh[dict_key].append(adata.obs[key].astype('int').max() + 1)

    #color=['Leiden_03', 'Leiden_04', 'Leiden_05', 'Leiden_06', 'Leiden_07', 'Leiden_08', 'Leiden_09']
    sc.pl.umap(adata, color=['Leiden_04', 'Leiden_08', 'Leiden_12'],
               palette=sc.pl.palettes.vega_20_scanpy, size=8, legend_loc='on data')

## Choice of parameters
 
* n_neighbors: we will use 80 as number of neighbors
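
One way to compare the settings is to tabulate the cluster counts collected in `dict_neigh`, with one row per `n_neighbors` value and one column per Leiden resolution. The sketch below uses hypothetical counts and only three resolutions; `format_cluster_table` and `dict_neigh_example` are illustrative names, not part of the notebook.

```python
# Hypothetical cluster counts, shaped like dict_neigh:
# one list per n_neighbors value, one entry per Leiden resolution.
resolutions = [0.3, 0.4, 0.5]
dict_neigh_example = {
    'Neighbours_5': [9, 11, 14],
    'Neighbours_20': [7, 8, 10],
    'Neighbours_80': [5, 6, 7],
}


def format_cluster_table(counts, resolutions):
    """Return text lines: a header with the resolutions, then one row
    per n_neighbors setting with the number of clusters found."""
    header = f"{'setting':<15}" + "".join(f"{r:>6.1f}" for r in resolutions)
    rows = [header]
    for key, values in counts.items():
        rows.append(f"{key:<15}" + "".join(f"{v:>6d}" for v in values))
    return rows


table = format_cluster_table(dict_neigh_example, resolutions)
print("\n".join(table))
```

A table like this makes it easier to see how strongly the number of clusters depends on the resolution for each neighborhood size.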



### Metrics

+ **minkowski**: Metric in a normed vector space that generalizes both the Euclidean and the Manhattan distance through its order p.
    1. Euclidean distance: p = 2 (l2 norm) 
    2. Manhattan distance: p = 1 (l1 norm) 
    3. Chebyshev (max) distance: p = ∞ (max/l∞ norm)


+ **cosine**: Based on cosine similarity, the cosine of the angle between two vectors, which equals the inner product of the two vectors after normalizing both to length 1. The cosine distance is 1 minus this similarity. It is thus a judgment of orientation and not magnitude: 
     - two vectors with the same orientation have a cosine similarity of 1, 
     - two vectors oriented at 90° relative to each other have a similarity of 0,  
     - two vectors diametrically opposed have a similarity of -1, independent of their magnitude.
     

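+ **jaccard**: Based on the Jaccard index, the ratio of intersection over union of two sets; the distance is 1 minus this ratio

+ **correlation**: 1 minus the correlation between two vectors. In SciPy-style distance functions, `correlation` refers to Pearson correlation; Kendall and Spearman are rank-based alternatives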

Benchmarking Paper: https://academic.oup.com/bib/article/20/6/2316/5077112
>Distance-based metrics such as Euclidean distance are sensitive to data scaling, whereas correlation-based metrics such as Pearson’s correlation are invariant to scaling. Such property makes correlation-based metrics robust to data noise and normalisation procedure
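
The scaling behavior described in the quote can be checked on a small example. The sketch below implements four of the distances by hand on hypothetical toy vectors; the function names are illustrative and the code is not taken from Scanpy, which delegates the metric computation to its neighbor-search backend.

```python
import math


def euclidean(u, v):
    # Minkowski distance with p = 2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    # Minkowski distance with p = 1
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    # 1 - cosine similarity
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def correlation_distance(u, v):
    # 1 - Pearson correlation, i.e. cosine distance after mean-centering
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])


# Hypothetical toy vectors: v is u scaled by a factor of 2
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0 * a for a in u]

for name, fn in [("euclidean", euclidean), ("manhattan", manhattan),
                 ("cosine", cosine_distance), ("correlation", correlation_distance)]:
    print(f"{name:<12}", round(fn(u, v), 6))
```

The Euclidean and Manhattan distances between `u` and `v` are clearly non-zero, while the cosine and correlation distances are zero up to floating-point error, because the two vectors differ only in scale.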

In [None]:
dist = ['euclidean', 'l2', 'manhattan', 'l1', 'minkowski', 'cosine', 'jaccard', 'correlation'] #'cityblock'

dict_metric = {}


for x in dist:
    
    dict_key = 'Metric_' + x
    dict_metric[dict_key] = []

    
    print('Metric:', x)
    sc.pp.neighbors(adata, n_neighbors=80, n_pcs=20, metric=x, use_rep="X_pca_harmony", key_added="harmony")
    sc.tl.umap(adata, neighbors_key="harmony", random_state=1)
    sc.pl.umap(adata, color=['Donor', 'Cluster'])
    
    
    

## Choice of parameters
 
* metric: we will use the default distance in scanpy: "euclidean"



In [None]:
sc.pp.neighbors(adata, n_neighbors=80, n_pcs=20, use_rep="X_pca_harmony", key_added="harmony")
#sc.pp.neighbors(adata, n_neighbors=80, n_pcs=15, use_rep="X_pca_harmony", key_added="harmony", metric='cosine')

## 5.1  UMAP

In [None]:
sc.tl.umap(adata, neighbors_key="harmony", random_state=1)

In [None]:
sc.pl.umap(adata, color=['Donor', 'Cluster'])

## 5.2  Diffusion Map

In [None]:
sc.tl.diffmap(adata, neighbors_key="harmony")

In [None]:
sc.pl.diffmap(adata, color=['Donor', 'Cluster'],components=['2,3'])

## 5.3 Force-directed graph

In [None]:
#Takes quite some time 
sc.tl.draw_graph(adata, neighbors_key="harmony")

In [None]:
sc.pl.draw_graph(adata, color=['Donor', 'Cluster'])

----


# 6. Save

## 6.1 Save AData

In [None]:
adata.write(results_file)

## 6.2 Timestamp finished computations 

In [None]:
print(datetime.now())

## 6.3 Save python and html versions

In [None]:
nb_fname = '3_1_UMAP_parameters'
nb_fname

In [None]:
%%bash -s "$nb_fname"
jupyter nbconvert "$1".ipynb --to="python"
jupyter nbconvert "$1".ipynb --to="html"