# Project: Explore Braun dataset


__Upstream Steps__

* Assemble adata
* Filtered to select cortex, cerebellum, thalamus
* Sub-sampled to 75000 cells


__This notebook__

* QC filter on cells
* Expression filter on genes
* Normalization and log1p transformation (natural log of 1 + x) by Scanpy functions
* Feature selection (HVG) by Scanpy functions
* Dimensionality reduction
* Batch correction by Harmony



# Dataset Information

## References

* [Paper Reference: Comprehensive cell atlas of the first-trimester developing human brain](https://www.science.org/doi/10.1126/science.adf1226)
* [Data and code repository: GitHub Repo](https://github.com/linnarsson-lab/developing-human-brain)


> Short description: 26 brain specimens spanning 5 to 14 postconceptional weeks (pcw) were dissected into 111 distinct biological samples. Each sample was subjected to single-cell RNA sequencing, resulting in 1,665,937 high-quality single-cell transcriptomes.


__Subsample of the dataset selecting cortex, cerebellum and thalamus__

<div class="alert alert-block alert-info"><b>Cell populations identified by the authors:</b> 

* __Erythrocyte__
* __Fibroblast__
* __Glioblast__
* __Immune__
* __Neural crest__
* __Neuroblast__
* __Neuron__
* __Neuronal IPC__
* __Oligo__
* __Placodes__
* __Radial glia__
* __Vascular__

</div>

-----

# 1. Environment

## 1.1 Modules

In [None]:
import os
import sys

import numpy as np
import pandas as pd
import igraph as ig
import scanpy as sc
import scanpy.external as sce
from scipy.sparse import csr_matrix, isspmatrix

#Plotting
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

#utils
from datetime import datetime

In [1]:
# Custom functions
import Helper as fn

## 1.2 Settings


In [None]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80)

## 1.3 Input file

Set the input directory and file name in the cell below; they are recalled when the data are loaded.

In [None]:
path = '../DataDir/InputData/'
Id = 'Project_AssembledAdata.h5ad'
input_file = path + Id

## 1.4 Other parameters

You can set parameters useful for your analysis in the cells below and then recall them during the analysis

----

# 2. Data Load


## 2.1 Read adata file

In [None]:
adata = sc.read_h5ad(input_file)
adata.var_names_make_unique()

## 2.2 Explore metadata

### ⚠️❓ *Which are the main / most interesting metadata associated with barcodes (.obs)? How can you plot/inspect them?* 

<details>

<summary>Hint</summary>

> adata.obs.columns # to check information stored in .obs
> 
> adata.obs['myCol'].value_counts().plot.bar() #Specifying the column of interest
>
> pd.crosstab(adata.obs['myCol1'], adata.obs['myCol2'])

</details>

In [None]:
adata.obs.columns

## 2.3 Calculate QCs

In [None]:
# Flag mitochondrial and ribosomal protein genes
mito_genes = adata.var_names.str.startswith('MT-')
ribo_genes = adata.var_names.str.contains('^RPS|^RPL')

# qc_vars expects columns of adata.var holding True/False (or 1/0) flags
# for the genes to be used in the sub-statistics
adata.var['mito'] = mito_genes
adata.var['ribo'] = ribo_genes

# Compute metrics (inplace=True appends them to adata.obs and adata.var)
sc.pp.calculate_qc_metrics(adata, log1p=True, qc_vars=['mito','ribo'], inplace=True, percent_top=None)

In [None]:
adata

----

# 3. Discard low quality barcodes and lowly expressed genes

### 3.1 ⚠️❓ *Which parameters would you evaluate to discard low-quality barcodes?*

In [None]:
# write your code 

<details>

<summary>Hint</summary>

__You can evaluate the following metrics related to quality as preliminary step for filtering:__ 

* __Mitochondrial gene counts:__ high proportions are indicative of poor-quality cells, related to loss of cytoplasmic RNA from perforated cells: mitochondrial transcripts are protected by mitochondrial membrane and therefore less likely to escape through tears in the cell membrane. 
* __Ribosomal protein gene counts:__ a high proportion is indicative of shallow sequencing, because very highly expressed genes occupy most of the reads
* __Number of genes:__ related to sequencing depth/quality
* __Number of UMI counts for each gene:__ gene-wise sum of UMI counts (in all the cells) 


</details>
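
The metrics listed in the hint can be computed by hand on a toy matrix to see what each one measures. The sketch below uses plain NumPy rather than `sc.pp.calculate_qc_metrics`; the count values and gene names are made up for illustration.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); made-up UMI counts
counts = np.array([
    [10, 0, 5, 5],
    [0, 0, 2, 8],
    [4, 4, 4, 8],
])
gene_names = np.array(["MT-CO1", "RPS6", "SOX2", "DCX"])

# Boolean mask flagging mitochondrial genes, analogous to adata.var['mito']
is_mito = np.char.startswith(gene_names, "MT-")

total_counts = counts.sum(axis=1)        # UMI counts per cell
n_genes = (counts > 0).sum(axis=1)       # number of detected genes per cell
pct_mito = 100 * counts[:, is_mito].sum(axis=1) / total_counts
gene_totals = counts.sum(axis=0)         # gene-wise sum of UMI counts over all cells

print("total_counts:", total_counts)
print("n_genes:", n_genes)
print("pct_mito:", pct_mito)
print("gene_totals:", gene_totals)
```

In the real object the corresponding columns are `total_counts`, `n_genes_by_counts` and `pct_counts_mito` in `adata.obs`, and `total_counts` in `adata.var`.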

### 3.2 ⚠️❓ *How would you describe overall the quality of this dataset?*

In [None]:
# write your code 

<details>

<summary>Hint</summary>

__You can use diagnostic plots, such as violin plots or density plots, to check the distribution of the QC metrics__ 

> sc.pl.violin(adata, keys=['total_counts', 'n_genes_by_counts', 'pct_counts_mito', 'pct_counts_ribo'], groupby='Meta_Col',
             jitter=False, multi_panel=True, rotation=45) # specifying the column of interest

</details>



### 3.3 ⚠️❓ *Which thresholds would you set for the filtering of low-quality barcodes?*
__Once you have set the thresholds, proceed with the barcode filtering steps__

In [None]:
# write your code 

<details>

<summary>Hint</summary>

__Apply sc.pp.filter_cells function for cell filtering and sc.pp.filter_genes for gene filtering__ 

> sc.pp.filter_cells(adata, min_genes=MIN_GENES) #specifying the chosen threshold


</details>

### What about feature filtering? 

__Set the threshold on the basis of your considerations, and proceed with the feature filtering step__
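
A minimal sketch of how barcode and feature filtering act as boolean masks, shown on a made-up count matrix. `MIN_GENES`, `MAX_PCT_MITO` and `MIN_CELLS` are hypothetical thresholds chosen for the toy data, not recommendations for this dataset, and the code uses plain NumPy rather than the Scanpy functions.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 4 genes (columns); made-up UMI counts
counts = np.array([
    [5, 0, 3, 0],   # cell 0
    [0, 0, 1, 0],   # cell 1: only one gene detected
    [2, 1, 4, 0],   # cell 2
    [9, 0, 1, 0],   # cell 3: dominated by the first (mitochondrial) gene
])
is_mito = np.array([True, False, False, False])

# Hypothetical thresholds, for illustration only
MIN_GENES = 2
MAX_PCT_MITO = 80.0
MIN_CELLS = 1

# Barcode filtering: keep cells with enough detected genes and a low mito fraction
n_genes = (counts > 0).sum(axis=1)
pct_mito = 100 * counts[:, is_mito].sum(axis=1) / counts.sum(axis=1)
keep_cells = (n_genes >= MIN_GENES) & (pct_mito <= MAX_PCT_MITO)
filtered = counts[keep_cells]

# Feature filtering: keep genes detected in at least MIN_CELLS of the remaining cells
n_cells_per_gene = (filtered > 0).sum(axis=0)
keep_genes = n_cells_per_gene >= MIN_CELLS
filtered = filtered[:, keep_genes]

print("cells kept:", keep_cells)
print("genes kept:", keep_genes)
print("shape after filtering:", filtered.shape)
```

On the real object, thresholds on counts or detected genes go through `sc.pp.filter_cells` and `sc.pp.filter_genes`, while a threshold on a column such as `pct_counts_mito` can be applied by subsetting, e.g. `adata = adata[adata.obs['pct_counts_mito'] < MAX_PCT_MITO].copy()`.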

### 3.4 ⚠️❓ *How many barcodes (obs) and features (var) are in your anndata at the end of the filtering steps?*

In [None]:
# write your code 

----

# 4. Normalize and Log Transform 

* Store raw counts in 'counts' layer
* Normalize and log-transform

Some useful parameters to keep in mind from the scanpy documentation for [sc.pp.normalize_total](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html)
>- `target_sum` : If None, after normalization, each observation (cell) has a total count equal to the **median of total counts for observations (cells) before normalization**.
>- `exclude_highly_expressed` : Exclude (very) highly expressed genes for the computation of the normalization factor (size factor) for each cell. **A gene is considered highly expressed, if it has more than max_fraction of the total counts in at least one cell**. The not-excluded genes will sum up to target_sum.
>- `max_fraction` : float (**default: 0.05**) If exclude_highly_expressed=True, consider genes as highly expressed if they have more counts than max_fraction of the original total counts in at least one cell.
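
The effect of `target_sum`, `exclude_highly_expressed` and `max_fraction` can be followed on a tiny example. This is a NumPy sketch of the arithmetic described above, not Scanpy's implementation; the counts are made up, and `max_fraction` is set to 0.5 (instead of the 0.05 default) only to keep the toy example readable.

```python
import numpy as np

# Toy count matrix: 2 cells (rows) x 3 genes (columns)
counts = np.array([
    [1.0, 3.0, 6.0],
    [2.0, 2.0, 16.0],
])
target_sum = 10.0
max_fraction = 0.5  # assumption for this toy example; the Scanpy default is 0.05

totals = counts.sum(axis=1, keepdims=True)

# Plain normalization: every cell is rescaled to sum to target_sum
normalized = counts / totals * target_sum

# exclude_highly_expressed: a gene is excluded from the size factor if it
# exceeds max_fraction of the total counts in at least one cell
highly_expressed = (counts > max_fraction * totals).any(axis=0)
totals_excl = counts[:, ~highly_expressed].sum(axis=1, keepdims=True)
normalized_excl = counts / totals_excl * target_sum

# log1p is the natural logarithm of (1 + x), not log10
logged = np.log1p(normalized)

print("normalized:\n", normalized)
print("highly expressed genes:", highly_expressed)
print("normalized (excluding highly expressed):\n", normalized_excl)
```

With exclusion enabled, only the non-excluded genes sum to `target_sum` in each cell; the excluded gene is still rescaled by the same factor.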



In [None]:
adata.layers['counts'] = adata.X.copy()

In [None]:
sc.pp.normalize_total(adata, target_sum=1e4, exclude_highly_expressed=True)
sc.pp.log1p(adata)

----------------------

# 5. Feature selection: Highly Variable Genes



In [None]:
# The specified values are the defaults
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
print('Number of Highly Variable Genes:', int(adata.var['highly_variable'].sum()))

<details>

<summary>Hint</summary>

If in downstream exploration you identify sources of batch effect, you can take them into account already at this step by specifying the variable as below:

> sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, batch_key=BATCH_KEY) #specifying the identified batch variable

</details>

--------------------

# 6. Dimensionality Reduction


## 6.1 PCA

* PCA applies an orthogonal transformation of the original dataset creating a new set of uncorrelated variables (principal components, PC) that are a linear combination of the original features. 
* In the context of scRNASeq, PCA is used to select the top PCs that are used for downstream analytical tasks.
* The number of PCs to retain for downstream analyses is generally chosen on the basis of the amount of variance explained by each of them. Using more PCs will retain more biological signal at the cost of including more noise.
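
The points above can be sketched with a small PCA computed through SVD in NumPy. This is only an illustration on simulated data, not what `sc.pp.pca` does internally; the 90% cumulative-variance cutoff is a hypothetical rule of thumb, and in practice the choice is usually made by inspecting the variance-ratio plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 200 cells x 5 features, with most variance in the first two features
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Center each feature, then decompose
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                               # cell coordinates on the PCs
variance_ratio = S**2 / np.sum(S**2)         # fraction of variance explained per PC
cumulative = np.cumsum(variance_ratio)

# Hypothetical rule: smallest number of PCs explaining at least 90% of the variance
n_pcs = int(np.searchsorted(cumulative, 0.90) + 1)

print("variance ratio:", np.round(variance_ratio, 3))
print("PCs needed for 90% variance:", n_pcs)
```

The rows of `Vt` hold the loadings, i.e. the weights of the linear combination of original features that defines each PC.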

In [None]:
sc.pp.pca(adata, n_comps=50, use_highly_variable=True, svd_solver='arpack')

###  ⚠️❓ *Which metadata are you curious to plot on the PCA?* 
###  ⚠️❓ *What information are you retrieving?* 

<details>

<summary>Hint</summary>

> sc.pl.pca(adata, color=['MetaCols'], ncols=2) #specifying the meta column of interest

</details>

### ⚠️❓ *How many PC would you select for the calculation of neighbors?*

Specify your choice in the N_PCs variable

<details>

<summary>Hint</summary>

Check the plot below:

> sc.pl.pca_variance_ratio(adata, log=True) 

> N_PCs = ChosenValue

</details>

In [None]:
# Complete below
N_PCs = 

## 6.2 Neighbours

[sc.pp.neighbors](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.neighbors.html) computes a neighborhood graph of observations. The cells are embedded in a graph structure with edges drawn between cells with similar feature expression patterns. A k-nearest neighbour graph connects each cell with its k nearest neighbours.

__Key parameters:__ 
> * `n_pcs`: number of PCs used to compute the kNN graph
> * `n_neighbors`: number of neighbors. Larger values preserve more global structure at the expense of detailed local structure.
> * `metric`: distance metric used in the calculation
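
To make the idea of a k-nearest neighbour graph concrete, the sketch below builds one by exhaustive search on five made-up 2D points with Euclidean distance. It is an illustration only: it is not how `sc.pp.neighbors` is implemented, and large datasets typically rely on approximate neighbour search instead of computing all pairwise distances.

```python
import numpy as np

# Five made-up points forming two well-separated groups
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [10.0, 10.0],
    [10.0, 11.0],
])
k = 2  # number of neighbours per point

# Pairwise Euclidean distances
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

# A point is not its own neighbour
np.fill_diagonal(dist, np.inf)

# Indices of the k closest points for each row
knn_idx = np.argsort(dist, axis=1)[:, :k]

for i, neighbours in enumerate(knn_idx):
    print(f"point {i} -> neighbours {neighbours.tolist()}")
```

Because the second group contains only two points, each of its members must reach into the first group for its second neighbour; with larger `k`, edges increasingly connect distant regions.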

In [None]:
N_NB = int(0.5 * len(adata) ** 0.5)
if N_NB > 80:
    N_NB = 80
print(N_NB) 

In [None]:
sc.pp.neighbors(adata, n_neighbors=N_NB, n_pcs=N_PCs, key_added="pca")

### ⚠️❓ *How does the choice of neighbours impact downstream steps? How does the UMAP change if you select a larger/smaller number of neighbours?*

In an alternative workflow, try to increase / decrease the n_neighbors parameter and observe the impact on the resulting UMAP
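
As a reference point when trying other values, the heuristic used above for `N_NB` (half the square root of the number of cells, capped at 80) can be evaluated for a few dataset sizes. The function name `heuristic_n_neighbors` is introduced here only for illustration.

```python
def heuristic_n_neighbors(n_cells, cap=80):
    """Half the square root of the number of cells, capped at `cap`."""
    return min(int(0.5 * n_cells ** 0.5), cap)


for n_cells in (1_000, 10_000, 75_000):
    print(n_cells, "cells ->", heuristic_n_neighbors(n_cells), "neighbours")
```

For any dataset above roughly 25,600 cells the cap applies, so the value stays at 80.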

## 6.3 UMAP



In [None]:
sc.tl.umap(adata, random_state=1, neighbors_key="pca")

### ⚠️❓ Which metadata are you curious to plot on the UMAP? 
### ⚠️❓ What information are you retrieving? 
### ⚠️❓ Are there indications of batch effects?

<details>

<summary>Hint</summary>

> adata.obsm["X_umap_nocorr"] = adata.obsm["X_umap"].copy() #to store UMAP coordinates in a new slot

> del adata.obsm["X_umap"]

> sc.pl.embedding(adata, basis="X_umap_nocorr", 
                color=['n_genes_by_counts',"total_counts", 'pct_counts_mito', 'pct_counts_ribo'])

> sc.pl.embedding(adata, basis="X_umap_nocorr", 
                color=['MetaCols'])

</details>

-------------

# 7. Batch correction by Harmony

### ⚠️❓ Do you think the dataset is affected by a batch effect?

__If you think there is a potential batch effect, set the BATCH_KEY variable to the corresponding .obs column and then run the integration with Harmony__

In [None]:
#Write below
BATCH_KEY = ''

In [None]:
sc.external.pp.harmony_integrate(adata, BATCH_KEY)

In [None]:
sc.pp.neighbors(adata, n_neighbors=N_NB, n_pcs=N_PCs, use_rep='X_pca_harmony', key_added='harmony')

-----------------------

# 8. Batch-corrected dimensionality reduction

You can now check the results of your strategy, plotting the integrated UMAP

In [None]:
sc.tl.umap(adata, random_state=1, neighbors_key="harmony")
adata.obsm["X_umap_harmony"] = adata.obsm["X_umap"].copy()
del adata.obsm["X_umap"]

In [None]:
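# Columns of .obs to colour by; BATCH_KEY is used as a starting point,
# add any other metadata of interest to the list
meta_dim = [BATCH_KEY]
sc.pl.embedding(adata, basis="X_umap_harmony", color=meta_dim, ncols=1)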

### ⚠️❓Which other dimensionality reduction approach could you apply? 

-------


# 9. OPTIONAL SECTION: Cell type annotation

### ⚠️❓How would you proceed to have a first idea of the cell populations that have been profiled in the dataset? 

<details>

<summary>Hint</summary>

Once you have formed your own idea of the annotation, you can compare it with the metadata available in the .obs slot.

</details>

-----

# 10. SAVE ANNDATA

Save your anndata so that it will be available for the second part of the project. 

In [None]:
# write your code