# Mid-gestation fetal cortex dataset: QC and Filtering



_**Single cell transcriptomics dataset from paper published by Polioudakis et al. (Geschwind lab, Neuron 2019) characterizing human fetal cortex at mid-gestation**._


<nav> <b> References: </b>

<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6831089/#SD1"> Paper </a> 
    
<a href="http://solo.bmap.ucla.edu/shiny/webapp/">CoDEx Viewer </a> 
 </nav>

<img src="https://www.cell.com/cms/attachment/1d932d66-fe94-4d36-abec-b4949f92ae61/fx1.jpg" width="500">

> Methodological approach: fetal cortical specimens at mid-gestation (gestation week (GW) 17 to 18), comprising the germinal zones and the developing cortical laminae containing migrating and newly born neurons. To optimize detection of distinct cell types, **prior to single-cell isolation** the cortex was separated into:
>- the germinal zones [ventricular zone (VZ) and subventricular zone (SVZ)]
>- developing cortex [subplate (SP) and cortical plate (CP)] 


- Sequencing method: **Drop-seq** (Macosko et al., 2015)
- Obtained number of cells: **~40,000**

-----

# 1. Environment

## 1.1 Libraries

In [None]:
import numpy as np
import pandas as pd
import igraph as ig
import scanpy as sc
import scanpy.external as sce
from scipy.sparse import csr_matrix, isspmatrix

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns

#utils
from datetime import datetime

In [None]:
sc.logging.print_header()

## 1.2 Settings

* Scanpy verbosity
* Figure size    
* Result file: the file that will store the analysis results

In [None]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80)

## 1.3 Custom Plotting Functions

In [None]:
def densityQCs(adataObj, hue=None):   
    #Plot them in line so they take up less space
    fig, ax = plt.subplots(1, 4, figsize=(20,5))
    fig.tight_layout(pad=2)   #space between plots
    
    if hue is not None:
        # Use the object passed to the function, not the global adata
        hue_s = adataObj.obs[hue].astype('string')
    else:
        hue_s = None

    ### Genes ---------------
    d1 = sns.kdeplot(np.log10(adataObj.obs['n_genes_by_counts']), fill=True, color='cornflowerblue', hue=hue_s, ax=ax[0])
    min_x, max_x = d1.get_xlim() 

    #Threshold lines and fill
    if MIN_GENES != None:
        d1.axvline(np.log10(MIN_GENES), 0, 1, c='red')  #set manually for chosen threshold
        d1.axvspan(min_x, np.log10(MIN_GENES), alpha=0.2, color='red')
    if MAX_GENES != None:
        d1.axvline(np.log10(MAX_GENES), c='red')
        d1.axvspan(np.log10(MAX_GENES), max_x, alpha=0.2, color='red')

    ### UMI ---------------
    d2 = sns.kdeplot(np.log10(adataObj.obs['total_counts']), fill=True, color='forestgreen', hue=hue_s, ax=ax[1])
    min_x, max_x = d2.get_xlim() 
        
    if MIN_COUNTS != None:
        d2.axvline(np.log10(MIN_COUNTS), 0, 1, c='red')  #set manually for chosen threshold
        d2.axvspan(min_x, np.log10(MIN_COUNTS), alpha=0.2, color='red')
    if MAX_COUNTS != None:
        d2.axvline(np.log10(MAX_COUNTS), c='red')
        d2.axvspan(np.log10(MAX_COUNTS), max_x, alpha=0.2, color='red')

    ### Mito % ---------------
    d3 = sns.kdeplot(adataObj.obs['pct_counts_mito'], fill=True, color='coral', hue=hue_s, ax=ax[2])
    min_x, max_x = d3.get_xlim() 

    #Threshold lines and fill
    if MT_PERCENTAGE != None:
        d3.axvline(MT_PERCENTAGE, 0, 1, c='red')  #set manually for chosen threshold
        d3.axvspan(MT_PERCENTAGE, max_x, alpha=0.2, color='red')


    ### Ribo % ---------------
    d4 = sns.kdeplot(adataObj.obs['pct_counts_ribo'], fill=True, color='orchid', hue=hue_s, ax=ax[3])
    min_x, max_x = d4.get_xlim() 
    #ax[3].legend(loc='center left', bbox_to_anchor=(1.0, 1.0)) #upper right
    
    #Threshold lines and fill
    if RB_PERCENTAGE is not None:
        d4.axvline(RB_PERCENTAGE, 0, 1, c='red')  #set manually for chosen threshold
        d4.axvspan(RB_PERCENTAGE, max_x, alpha=0.2, color='red')
    
    #Remove additional legends at need
    if hue != None:
        ax[0].get_legend().remove()
        ax[1].get_legend().remove()
        ax[2].get_legend().remove()
        
    # Remove all borders
    sns.despine(bottom = False, left = True)

## 1.4 Results File

In [None]:
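# Uncomment and set the output path before running section 6, which calls adata.write(results_file)
#results_file = '/home/..../brainomics/Data/1_AdataFilt.h5ad'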

----

# 2. Data Load

> Data are already structured as an AnnData object. 
> AnnData stores: 
> 1. a data matrix __(adata.X)__ 
> 2. dataframe-like annotation of observations __(adata.obs)__ and variables __(adata.var)__ 
> 3. unstructured dict-like annotation __(adata.uns)__. 


<nav> <b> References: </b>

<a href="https://anndata.readthedocs.io/en/latest/"> Anndata ReadTheDocs </a> 
 </nav>



<img src="https://anndata.readthedocs.io/en/latest/_images/anndata_schema.svg" width="650">

## 2.1 Read adata file

In [None]:
adata = sc.read('/group/brainomics/course_material/Day2/data/0_AdataStart.h5ad')

## 2.2 Explore Adata

In [None]:
adata

### A. __Count Matrix__
Stores the matrix of values: expression value of each gene in each cell.

In [None]:
adata.X

In [None]:
print(adata[:10, :2].X)

In [None]:
adata.X.toarray()[:10, :10] # dense view, to see it in a more matrix-like form

In [None]:
# To store the values in a new object
ACTB_counts = adata[:,['ACTB']].X
ACTB_counts

In [None]:
print(ACTB_counts[:10,])

### B. __Cell metadata__
Adata.obs stores the metadata about the observations: cells (rows of the expression matrix). 

__Check number and names of cells__

In [None]:
print('Initial number of cells:', adata.n_obs)

In [None]:
print('Cell names: ', adata.obs_names[:10].tolist())

__Check cell metadata__

In [None]:
print('Available metadata for each cell: ', adata.obs.columns)

In [None]:
adata.obs[:5]

### C. __Gene metadata__

Adata.var stores the metadata about features: genes (columns of the expression matrix). 

In [None]:
adata.var[:5]

__Check number and names of genes__

In [None]:
print('Initial number of genes:', adata.n_vars)

In [None]:
print('Gene names: ', adata.var_names[:10].tolist())

__Check gene metadata__

In [None]:
# To see the gene metadata (information available for each gene)  
print('Available metadata for each gene: ', adata.var.columns)

## 2.3 Explore metadata

### A. __Donors__

> Samples processed for Drop-seq were obtained from 4 donors:
>
> - 3 females (2 GW17, 1 GW18);
> - 1 male (GW18)

In [None]:
adata.obs['Donor'] = adata.obs['Donor'].astype('int').astype('category')

In [None]:
adata.obs['Donor'].value_counts().plot.bar(color=['#279e68', '#d62728', '#ff7f0e', '#1f77b4']) 

### B. __Gestational Week__

In [None]:
adata.obs['Gestation_week'].value_counts()

In [None]:
adata.obs['Gestation_week'].value_counts().plot.bar(color=['limegreen', 'orange'])

### C. __Layer__

>Coronal sections were prepared from fetal cortices. The coronal sections were then **further dissected** at the intermediate zone (IZ) to divide them into two regions: germinal zones (GZ) and developing cortex (CP). Following dissection, **GZ and CP sections were separately dissociated**.

In [None]:
adata.obs['Layer'].value_counts()

In [None]:
adata.obs['Layer'].value_counts().plot.bar(color=['magenta', 'turquoise'])

### D. __Clusters__

In [None]:
adata.obs['Cluster'].value_counts()

In [None]:
adata.obs['Cluster'].value_counts().plot.barh()

<div class="alert alert-block alert-info"><b>Cell populations identified by the authors:</b> 

* Cycling Progenitor S (PgS)
* Cycling Progenitor G2/M (PgG2M)
* Ventricular Radial Glia (vRG)
* Outer Radial Glia (oRG)
* Intermediate Progenitor (IP)
* Migrating Excitatory (ExN)
* Maturing Excitatory (ExM)
* Maturing Excitatory upper enriched (ExM-U)
* Excitatory Deep Layer 1 (ExDp1)
* Excitatory Deep Layer 2 (ExDp2)
* Interneuron MGE (InMGE)
* Interneuron CGE (InCGE)
* Oligodendrocyte Progenitor (OPC)
* Microglia (Mic)
* Pericytes (Per)
* Endothelia (End)

</div>
    

<img src="http://solo.bmap.ucla.edu/shiny/webapp/img/cell_types_layers.png" width="550">

----

# 3. Top-expressed genes
The plot shows the genes that account for the highest fraction of counts within each cell, ranked by their mean fraction across all cells.

In [None]:
sc.pl.highest_expr_genes(adata, n_top=20)
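
The ranking behind this plot can be reproduced by hand on a small matrix. The sketch below is only an illustration with NumPy: the count values and gene names are invented, and it mirrors the idea (per-cell fraction, then mean across cells) rather than scanpy's internal code.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); values are invented.
counts = np.array([
    [8, 1, 1, 0],
    [6, 2, 0, 2],
    [5, 0, 5, 0],
])
genes = np.array(["GENE_A", "GENE_B", "GENE_C", "GENE_D"])  # hypothetical names

# Fraction of each cell's counts assigned to each gene (every row sums to 1).
fractions = counts / counts.sum(axis=1, keepdims=True)

# Rank genes by their mean fraction across cells, highest first.
mean_fraction = fractions.mean(axis=0)
order = np.argsort(mean_fraction)[::-1]
top_genes = genes[order][:2]
print(top_genes)
```

Here `GENE_A` ranks first because it takes the largest share of counts in every cell, and `GENE_C` second because of its large share in the third cell.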

----

# 4. Quality Check

__Evaluate metrics related to quality as preliminary step for filtering.__ 

* __Mitochondrial gene counts:__ a high proportion is indicative of poor-quality cells and reflects loss of cytoplasmic RNA from perforated cells: mitochondrial transcripts are protected by the mitochondrial membrane and are therefore less likely to escape through tears in the cell membrane. 
* __Ribosomal protein gene counts:__ a high proportion is indicative of shallow sequencing, because very highly expressed genes take up most of the reads.
* __Number of genes:__ number of genes detected in each cell, related to sequencing depth/quality.
* __Number of UMI counts for each cell:__ cell-wise sum of UMI counts (across all genes). 
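
To make the per-cell metrics concrete, here is a minimal sketch that computes them on a toy matrix with NumPy. The counts are invented and the variable names simply mirror the column names scanpy writes to `adata.obs`; this is not scanpy's implementation.

```python
import numpy as np

# Toy count matrix: 3 cells x 5 genes; values are illustrative only.
gene_names = ["MT-CO1", "MT-ND1", "RPS6", "ACTB", "SOX2"]
counts = np.array([
    [1, 1, 4, 10, 4],   # healthy-looking cell
    [6, 4, 2, 8, 0],    # high mitochondrial fraction
    [0, 0, 0, 3, 0],    # nearly empty droplet
])

# Boolean mask over genes, built from the 'MT-' prefix.
is_mito = np.array([g.startswith("MT-") for g in gene_names])

n_genes_by_counts = (counts > 0).sum(axis=1)   # genes detected per cell
total_counts = counts.sum(axis=1)              # UMIs per cell
pct_counts_mito = 100 * counts[:, is_mito].sum(axis=1) / total_counts

print(n_genes_by_counts, total_counts, pct_counts_mito)
```

The second cell has the same total as the first but half of its counts are mitochondrial, and the third cell has a single detected gene: these are the patterns the thresholds are meant to catch.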




<div class="alert alert-block alert-info"><b> NOTE: </b> Thresholds defined below are first shown on diagnostic plots and then applied in the filtering step. 
    You can inspect and change them iteratively until you are satisfied by the results. 
</div>
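
One way to iterate on thresholds without touching the data is a dry run with boolean masks. The sketch below uses invented QC values for five cells and local lowercase threshold names; both are assumptions for illustration, not values from this dataset.

```python
import numpy as np

# Hypothetical per-cell QC values for five cells.
n_genes = np.array([300, 1200, 2500, 6000, 1800])
total_counts = np.array([500, 2000, 5000, 15000, 3000])
pct_mito = np.array([2.0, 1.5, 8.0, 3.0, 4.0])

# Candidate thresholds, set locally for this example.
min_genes, max_genes = 500, 5000
min_counts, max_counts = 750, 10000
mt_percentage = 5

# One boolean mask per criterion: True means the cell passes.
checks = {
    "genes": (n_genes >= min_genes) & (n_genes <= max_genes),
    "counts": (total_counts >= min_counts) & (total_counts <= max_counts),
    "mito": pct_mito < mt_percentage,
}
for name, passed in checks.items():
    print(f"{name}: {passed.sum()} of {passed.size} cells pass")

# A cell is kept only if it passes every criterion.
keep = np.logical_and.reduce(list(checks.values()))
print("cells passing every check:", keep.sum())
```

Reporting each criterion separately shows which threshold removes the most cells before anything is actually filtered.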

### Filtering thresholds
#### **Cell Filtering**

In [None]:
#Cell filtering
MIN_GENES = 500
MAX_GENES = 5000

MIN_COUNTS= 750
MAX_COUNTS= 10000

MT_PERCENTAGE = 5
RB_PERCENTAGE = 20

#### **Gene Filtering**
We want to discard genes that are too lowly expressed and thus not informative.  

A good approach is to keep only genes that are expressed in at least _min_cells_ cells. We set this minimum relative to the dataset size, at around 0.5% of the total number of cells.
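
As a worked example of this arithmetic, assume the approximate dataset size quoted in the introduction (~40,000 cells). The second value, 30,000 cells remaining after cell filtering, is purely hypothetical.

```python
import numpy as np

fraction = 0.5 / 100               # 0.5% expressed as a fraction

n_cells_before = 40_000            # approximate size from the introduction
n_cells_after = 30_000             # hypothetical size after cell filtering

min_cells_before = int(np.rint(n_cells_before * fraction))
min_cells_after = int(np.rint(n_cells_after * fraction))
print(min_cells_before, min_cells_after)  # 200 150
```

The threshold scales with the number of cells, so the value depends on whether it is computed before or after low-quality cells are removed.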

In [None]:
#Gene Filtering
MIN_CELLS = np.rint((adata.n_obs*0.5)/100) # Filtering genes on minimum cells: 0.5%
MIN_CELLS 

<div class="alert alert-block alert-warning"><b>FOOD for THOUGHT:</b>
to be more stringent on gene filtering, the minimum number of cells may be set <b> after </b> filtering low-quality cells. 

This is especially useful when starting from a very large number of cells, many of which are discarded: a threshold set before filtering may retain genes that have very few counts in the remaining cells.
</div>

### 4.1 Identify Mitochondrial and Ribosomal genes
**The string-based selection is not completely accurate** because it can also include genes that are not mitochondrial/ribosomal but share the string used for the selection. 

It is nonetheless a good enough approximation for our purposes.
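
The pitfall can be seen on a handful of human gene symbols. The sketch below uses only the standard library; the gene list is a small hand-picked sample, not taken from the dataset.

```python
import re

# Hand-picked human gene symbols chosen to illustrate the pitfall.
genes = ["MT-CO1", "MT-ND4", "MTOR", "RPS6", "RPL13", "RPS6KA1", "MRPS18A"]

# Same selection rules as in the notebook: 'MT-' prefix and '^RPS|^RPL'.
mito = [g for g in genes if g.startswith("MT-")]
ribo = [g for g in genes if re.match(r"RPS|RPL", g)]  # re.match anchors at the start

print(mito)  # ['MT-CO1', 'MT-ND4']
print(ribo)  # ['RPS6', 'RPL13', 'RPS6KA1']
```

`RPS6KA1` encodes a ribosomal protein S6 kinase, not a ribosomal protein, yet it matches the `RPS` prefix. Conversely, `MRPS18A` (a mitochondrial ribosomal protein) is picked up by neither rule, and `MTOR` is correctly excluded because the mitochondrial rule requires the hyphen.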


<div class="alert alert-block alert-info"><b>NOTE:</b>
    The metadata already contain the mito % and related metrics; we recompute them and check that they are consistent.
</div>

In [None]:
#Find mito and ribo genes
mito_genes = adata.var_names.str.startswith('MT-')    
ribo_genes = adata.var_names.str.contains('^RPS|^RPL')

adata.var_names[mito_genes] 

In [None]:
adata.var['mito'] = adata.var_names.str.startswith('MT-')    
adata.var['ribo']= adata.var_names.str.contains('^RPS|^RPL')

### 4.2 Automated QC metrics

We use the scanpy automated QC metrics function:
    calculate_qc_metrics in module scanpy.preprocessing._qc. See [calculate_qc_metrics docs](https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.calculate_qc_metrics.html)

In [None]:
sc.pp.calculate_qc_metrics(adata, qc_vars=['mito','ribo'], inplace=True,
                           log1p=False, percent_top=None)

In [None]:
adata.obs.head()

### 4.3 Inspect quality-related parameters


__4.3.1 Mitochondrial genes__

In [None]:
#Print quantiles to get an idea of the value distributions
print(np.quantile(adata.obs['pct_counts_mito'], np.arange(0, 1.1, 0.1))) 
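
To show how such a decile table reads, here is a small sketch on invented values. It uses `np.linspace(0, 1, 11)` as an alternative way to build the eleven probabilities; the percentages are made up for illustration.

```python
import numpy as np

# Invented mitochondrial percentages for eleven cells.
pct_mito = np.array([0.5, 1.0, 1.2, 1.8, 2.0, 2.4, 3.0, 3.5, 4.1, 6.0, 12.0])

# Eleven evenly spaced probabilities from 0 to 1: minimum, deciles, maximum.
probs = np.linspace(0, 1, 11)
deciles = np.quantile(pct_mito, probs)

for p, q in zip(probs, deciles):
    print(f"{p:.0%}: {q:.2f}")
```

A large jump between the last two entries, as between the 90% quantile and the maximum here, points to a small tail of outlier cells rather than a shift of the whole distribution.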

__4.3.2 Ribosomal Protein genes__

In [None]:
print(np.quantile(adata.obs['pct_counts_ribo'], np.arange(0, 1.1, 0.1)))

__4.3.3 Total number of counts for each cell__

In [None]:
print(np.quantile(adata.obs['total_counts'], np.arange(0, 1.1, 0.1)))

__4.3.4 Number of genes detected in each cell__

In [None]:
print(np.quantile(adata.obs['n_genes_by_counts'], np.arange(0, 1.1, 0.1)))

### 4.4 Visualization

#### Violin plots: linear values 

In [None]:
sc.pl.violin(adata, keys=['total_counts', 'n_genes_by_counts', 'pct_counts_mito', 'pct_counts_ribo'],
             jitter=False, multi_panel=True, groupby='Donor', rotation=45)

In [None]:
adata.uns['Donor_colors'] 

#### Violin plots: log values 

In [None]:
sc.pl.violin(adata, keys=['total_counts', 'n_genes_by_counts', 'pct_counts_mito', 'pct_counts_ribo'],
             jitter=False, multi_panel=True, log=True, groupby='Donor', rotation=45)

#### Density plots


##### **Single Plots**

In [None]:
densityQCs(adata)

##### **Double Plots**

In [None]:
sns.jointplot(x=np.log10(adata.obs['total_counts']), 
              y=np.log10(adata.obs['n_genes_by_counts']), 
              kind="kde", color="cornflowerblue", fill=True)

In [None]:
sns.jointplot(x=adata.obs['pct_counts_mito'], 
              y=adata.obs['pct_counts_ribo'], 
              kind="hist", color="coral")

# 5. Filtering
Modify the __threshold variables__ cell at the start of the QC to overwrite the default filtering values

In [None]:
print('\nThe selected filtering parameters are: \n Minimum cells: ' , MIN_CELLS, 
      '\n Minimum counts: ' , MIN_COUNTS, '\n Maximum counts:' , MAX_COUNTS,
      '\n Minimum genes: ' , MIN_GENES, '\n Maximum genes:' , MAX_GENES,
      '\n Mitochondria: ' , MT_PERCENTAGE, '%', '\n Ribosomal: ', RB_PERCENTAGE, '%')

## 5.1 Filtering Cells

### 5.1.1 Detected Genes

In [None]:
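sc.pp.filter_cells(adata, min_genes=MIN_GENES)
print('After filtering on min detected genes: number of cells:', adata.n_obs)
# Also apply the upper threshold shown on the diagnostic plots (one criterion per call)
sc.pp.filter_cells(adata, max_genes=MAX_GENES)
print('After filtering on max detected genes: number of cells:', adata.n_obs)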

### 5.1.2 UMI Counts

In [None]:
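sc.pp.filter_cells(adata, min_counts=MIN_COUNTS)
print('After filtering on min UMI counts: number of cells:', adata.n_obs)
# Also apply the upper threshold shown on the diagnostic plots (one criterion per call)
sc.pp.filter_cells(adata, max_counts=MAX_COUNTS)
print('After filtering on max UMI counts: number of cells:', adata.n_obs)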

### 5.1.3 Mitochondrial RNA

In [None]:
# Keep cells below the mitochondrial threshold; copy so adata is not left as a view
adata = adata[adata.obs['pct_counts_mito'] < MT_PERCENTAGE, :].copy()

print('After filtering on mitochondrial RNA: number of cells:', adata.n_obs)

### 5.1.4 Ribosomal RNA

In [None]:
# Keep cells below the ribosomal threshold; copy so adata is not left as a view
adata = adata[adata.obs['pct_counts_ribo'] < RB_PERCENTAGE, :].copy()

print('After filtering on ribosomal protein RNA: number of cells:', adata.n_obs)

## 5.2 Filtering genes
**Filter out genes expressed in fewer than 0.5% of cells**
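
What a minimum-cells filter does can be sketched on a toy matrix with NumPy. The counts and the threshold of 2 are invented for illustration; the rule shown (keep genes detected in at least `min_cells` cells) is the one applied in the cell below.

```python
import numpy as np

# Toy count matrix: 4 cells x 3 genes (invented values).
counts = np.array([
    [3, 0, 0],
    [1, 0, 2],
    [0, 0, 0],
    [5, 1, 0],
])
min_cells = 2  # illustrative threshold, not the notebook's value

# Number of cells in which each gene has at least one count.
n_cells_by_gene = (counts > 0).sum(axis=0)

# Keep genes detected in at least min_cells cells.
keep_genes = n_cells_by_gene >= min_cells
filtered = counts[:, keep_genes]
print(n_cells_by_gene, keep_genes, filtered.shape)
```

Only the first gene survives: the other two are each detected in a single cell, below the threshold.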

In [None]:
print('Before gene filtering: number of genes:', adata.n_vars)
print('Before gene filtering: number of cells:', adata.n_obs)

In [None]:
print(MIN_CELLS) # Filtering genes on minimum cells: 0.5%
sc.pp.filter_genes(adata, min_cells=MIN_CELLS)

In [None]:
print('After gene filtering: number of genes:', adata.n_vars)
print('After gene filtering: number of cells:', adata.n_obs)

## 5.3 Numbers after filtering

In [None]:
print('After applied filtering: number of cells:', adata.n_obs)
print('After applied filtering: number of genes:', adata.n_vars)

In [None]:
adata.obs['Donor'].value_counts().plot.bar(color=['#279e68', '#d62728', '#ff7f0e', '#1f77b4'])

In [None]:
sc.pl.violin(adata, keys=['total_counts', 'n_genes_by_counts', 'pct_counts_mito', 'pct_counts_ribo'], groupby='Donor',
             jitter=False, multi_panel=True, rotation=45)

----

# 6. Save file

## 6.1 Save Adata

In [None]:
type(adata.X)

In [None]:
adata.X

In [None]:
adata.write(results_file)

## 6.2 Finished computations: timestamp

In [None]:
print(datetime.now())

## 6.3 Save notebook

In [None]:
#nb_fname = ipynbname.name()
nb_fname = '1_QC_Filt'
nb_fname

In [None]:
%%bash -s "$nb_fname"
jupyter nbconvert "$1".ipynb --to="python"
jupyter nbconvert "$1".ipynb --to="html"