# Data Preprocessing

The data preparation before analysis follows this tutorial: <https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_01_qc.html>. It is included here to ensure the reproducibility of the analysis.

**Loading the necessary libraries**

In [None]:
import scanpy as sc
import pandas as pd
import numpy as np

In [None]:
# settings can be adapted individually
sc.settings.verbosity = 3            
sc.logging.print_header()             
sc.settings.set_figure_params(dpi = 100, format = 'png')

**Load the scRNA-seq data**

Data were taken from the following publication by Cano-Gamez et al.: <https://www.nature.com/articles/s41467-020-15543-y>

In [None]:
canogamez = sc.read_h5ad("/corgi/scdata/henriksson/canogamez/cleaned.h5ad") # change to your data path 
canogamez.var_names_make_unique()

In [None]:
# create a path to store the preprocessed file
results_file = 'write/canogamez_preprocessing.h5ad' # change to your data path 

First, the `iTreg` sample derived from donor 2 (`D2`) was removed due to an unusually high dropout rate, as described by Cano-Gamez et al.:

In [None]:
# cells of the iTreg sample from donor D2
is_d2_itreg = (canogamez.obs['donor.id'] == 'D2') & (canogamez.obs['cytokine.condition'] == 'iTreg')
# boolean mask of cells to keep; unlike a set difference, this preserves the original cell order
keep = (~is_d2_itreg).to_numpy()

In [None]:
canogamez = canogamez[keep, :].copy()  # copy, so that later writes to .var do not act on a view
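
The sample removal can be checked on a small table before applying it to the full object. The sketch below uses an invented stand-in for `canogamez.obs` (cell names and values are hypothetical) and shows that inverting a boolean mask drops exactly the `D2`/`iTreg` cells while keeping all other cells in their original order.

```python
import pandas as pd

# Toy stand-in for `canogamez.obs`; cell names and values are invented for illustration.
obs = pd.DataFrame(
    {
        "donor.id": ["D1", "D2", "D2", "D3"],
        "cytokine.condition": ["iTreg", "iTreg", "Th17", "iTreg"],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# True for every cell that belongs to the sample to be dropped.
is_d2_itreg = (obs["donor.id"] == "D2") & (obs["cytokine.condition"] == "iTreg")

# Inverting the mask keeps all other cells, in their original order.
kept = obs[~is_d2_itreg]
print(kept.index.tolist())
```

Only the cell that matches both conditions is removed; other `D2` cells and other `iTreg` cells stay in the data.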

## Quality Control (QC)

Identify genes belonging to a specific group

In [None]:
# mitochondrial genes
canogamez.var['mt'] = canogamez.var_names.str.startswith('MT-') 
# ribosomal genes
canogamez.var['ribo'] = canogamez.var_names.str.startswith(("RPS","RPL"))
# hemoglobin genes.
canogamez.var['hb'] = canogamez.var_names.str.contains(("^HB[^(P)]"))
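
To see what the three patterns actually match, they can be tried on a handful of gene symbols. This is a small sketch using only the standard library; the gene list is hand-picked for illustration and is not taken from the dataset. `re.search` is used here on the assumption that it mirrors how `str.contains` treats the regular expression.

```python
import re

# Hand-picked gene symbols to probe the patterns; not taken from the dataset.
genes = ["MT-CO1", "RPS6", "RPL13", "HBA1", "HBB", "HBP1", "HBEGF", "MALAT1"]

hb_pattern = re.compile(r"^HB[^(P)]")  # same pattern as used for the 'hb' flag

flags = {
    gene: {
        "mt": gene.startswith("MT-"),
        "ribo": gene.startswith(("RPS", "RPL")),
        "hb": bool(hb_pattern.search(gene)),
    }
    for gene in genes
}

for gene, groups in flags.items():
    print(gene, groups)
```

The pattern excludes `HBP1` as intended, but it is a prefix heuristic: in this sketch `HBEGF` (a growth factor gene, not a hemoglobin gene) is flagged as well, so it can be worth inspecting the flagged genes before removing them.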

Calculate QC metrics

The following metrics are calculated:
- per cell (`.obs`):
    - n_genes_by_counts
    - total_counts
    - mitochondrial genes: total_counts_mt, pct_counts_mt (analogous columns for ribo and hb)
- per gene (`.var`):
    - n_cells_by_counts
    - mean_counts, total_counts
    - pct_dropout_by_counts = percentage of cells with zero counts for each gene
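
The metrics listed above can be reproduced by hand on a tiny count matrix. The following is a minimal NumPy sketch with invented counts, where the last gene is assumed to be mitochondrial; it illustrates the definitions and is not the library's implementation.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); values are invented.
counts = np.array([
    [0, 2, 8, 0],
    [1, 0, 4, 5],
    [0, 0, 6, 4],
])
is_mt = np.array([False, False, False, True])  # assume the last gene is mitochondrial

# Per cell (.obs)
n_genes_by_counts = (counts > 0).sum(axis=1)      # genes with at least one count
total_counts = counts.sum(axis=1)                 # library size
total_counts_mt = counts[:, is_mt].sum(axis=1)
pct_counts_mt = 100 * total_counts_mt / total_counts

# Per gene (.var)
n_cells_by_counts = (counts > 0).sum(axis=0)      # cells in which the gene is detected
mean_counts = counts.mean(axis=0)
pct_dropout_by_counts = 100 * (1 - n_cells_by_counts / counts.shape[0])

print(n_genes_by_counts, pct_counts_mt)
print(n_cells_by_counts, pct_dropout_by_counts)
```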
    

In [None]:
sc.pp.calculate_qc_metrics(canogamez, qc_vars=['mt','ribo','hb'], percent_top=None, log1p=False, inplace=True) 

Plot QC 

In [None]:
sc.pl.violin(canogamez, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'pct_counts_hb'],
             jitter=False, groupby = 'cytokine.condition', rotation= 45, wspace=1)

## Filtering 

Set thresholds:

In [None]:
sc.pp.filter_cells(canogamez, min_genes=200)
sc.pp.filter_genes(canogamez, min_cells=3)

Highest expressed genes: 

In [None]:
sc.pl.highest_expr_genes(canogamez, n_top=20)

Mito/Ribo filtering:

In [None]:
canogamez = canogamez[canogamez.obs.pct_counts_mt < 7.5, :]
canogamez = canogamez[canogamez.obs.pct_counts_ribo > 5, :]

Filter by number of detected genes:

In [None]:
canogamez = canogamez[canogamez.obs.n_genes_by_counts > 500, :]
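
Before applying the cell-level cutoffs, it can help to count how many cells each one removes. The sketch below applies the same three thresholds (7.5, 5 and 500) to an invented stand-in for `canogamez.obs`; the cell names and values are hypothetical.

```python
import pandas as pd

# Toy stand-in for `canogamez.obs`; values are invented for illustration.
obs = pd.DataFrame(
    {
        "pct_counts_mt": [2.0, 9.1, 4.5, 6.0],
        "pct_counts_ribo": [20.0, 15.0, 3.0, 12.0],
        "n_genes_by_counts": [1500, 1200, 900, 450],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# One boolean column per threshold, using the cutoffs from this notebook.
checks = pd.DataFrame(
    {
        "mt_ok": obs["pct_counts_mt"] < 7.5,
        "ribo_ok": obs["pct_counts_ribo"] > 5,
        "genes_ok": obs["n_genes_by_counts"] > 500,
    }
)

n_failing = (~checks).sum()          # cells failing each individual threshold
passed = obs[checks.all(axis=1)]     # cells passing all thresholds

print(n_failing)
print(passed.index.tolist())
```

Keeping the checks as separate columns makes it easy to see which threshold is the most restrictive for a given dataset.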

Remove mt and hb genes as well as MALAT1

MALAT1 acts as a transcriptional regulator for numerous genes. It is highly expressed in almost every cell and is therefore removed from the subsequent analysis.

In [None]:
# recompute the masks, since lowly expressed genes were removed above
mito_genes = canogamez.var_names.str.startswith('MT-')
hb_genes = canogamez.var_names.str.contains('^HB[^(P)]')
malat1 = canogamez.var_names.str.startswith('MALAT1')

# a gene is removed if it falls into any of the three groups
remove = mito_genes | hb_genes | malat1
keep = ~remove

canogamez = canogamez[:, keep]

## Chromosomal Information and Sample Sex

*Note: This step is not necessary here, as only male donors were used. It is included only for consistency.*

Create annotation

In [None]:
annot = sc.queries.biomart_annotations(
        "hsapiens",
        ["ensembl_gene_id", "external_gene_name", "start_position", 
         "end_position", "chromosome_name"],).set_index("external_gene_name")

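Identify chromosome Y and X genes

In [None]:
chrY_genes = canogamez.var_names.intersection(annot.index[annot.chromosome_name == "Y"])
# `chromosome_name` holds chromosome labels such as "X", not gene symbols such as "XIST"
chrX_genes = canogamez.var_names.intersection(annot.index[annot.chromosome_name == "X"])
# check whether the XIST gene itself is present in the data
print("XIST" in canogamez.var_names)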

*Here: no XIST genes present*

## Cell Cycle Scores

- score: for each cell, the mean expression of the given gene list minus the mean expression of a set of reference genes
- reference genes: sampled at random by the function so that their expression distribution matches that of the given list
- adds three more variables to the data:
    - a score for S phase
    - a score for G2M phase
    - predicted cell cycle phase
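
The scoring idea described above can be sketched in a few lines of NumPy. This is a simplified illustration on random data, not the library's implementation: the reference genes are drawn uniformly at random here rather than matched by expression level, and the gene indices, the helper `score` and its `n_ref` parameter are hypothetical. The phase rule (higher score wins, G1 if both scores are negative) is stated as an assumption about how the phase is called.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log-normalized expression: 5 cells x 20 genes; random values, for illustration only.
expr = rng.random((5, 20))
s_idx = np.array([0, 1, 2])      # hypothetical S-phase gene columns
g2m_idx = np.array([3, 4, 5])    # hypothetical G2/M-phase gene columns


def score(expr, gene_idx, rng, n_ref=6):
    """Mean expression of the gene list minus mean expression of random reference genes."""
    candidates = np.setdiff1d(np.arange(expr.shape[1]), gene_idx)
    ref_idx = rng.choice(candidates, size=n_ref, replace=False)
    return expr[:, gene_idx].mean(axis=1) - expr[:, ref_idx].mean(axis=1)


s_score = score(expr, s_idx, rng)
g2m_score = score(expr, g2m_idx, rng)

# Phase call: the higher of the two scores wins; G1 if both scores are negative.
phase = np.where(g2m_score > s_score, "G2M", "S")
phase[(s_score < 0) & (g2m_score < 0)] = "G1"

print(np.round(s_score, 2), np.round(g2m_score, 2), phase)
```

A positive score means the gene list is expressed above the reference level in that cell; cells in which neither list stands out are treated as non-cycling.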

Normalize data before running

In [None]:
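# normalize each cell to a total of 10,000 counts
# Note: check the mean count depth beforehand with `canogamez.obs.describe()`
# (normalize_per_cell is deprecated in recent scanpy releases in favor of sc.pp.normalize_total)
sc.pp.normalize_per_cell(canogamez, counts_per_cell_after=1e4)

# save normalized counts as raw data
canogamez.raw = canogamez

# log-transform
sc.pp.log1p(canogamez)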
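
What depth normalization followed by `log1p` does to the counts can be illustrated on two invented cells that have the same composition but different sequencing depth. This is a minimal NumPy sketch of the arithmetic, not the library code.

```python
import numpy as np

# Two toy cells with identical composition but a 10-fold difference in depth.
counts = np.array([
    [10.0, 30.0, 60.0],      # 100 counts in total
    [100.0, 300.0, 600.0],   # 1,000 counts in total
])

target = 1e4
depth = counts.sum(axis=1, keepdims=True)   # total counts per cell
normalized = counts / depth * target        # every cell now sums to 10,000
logged = np.log1p(normalized)               # log(1 + x), keeps zeros at zero

print(normalized)
print(np.round(logged, 3))
```

After normalization both cells have identical profiles, which is the point of the step: differences in sequencing depth no longer show up as differences in expression.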

Load all cell cycle genes
<br>
downloaded here: <https://github.com/theislab/scanpy_usage/blob/master/180209_cell_cycle/data/regev_lab_cell_cycle_genes.txt>

In [None]:
with open('/home/plutowski/notebooks/Markers/regev_lab_cell_cycle_genes.txt') as f:  # change to your data path
    cell_cycle_genes = [x.strip() for x in f]

In [None]:
# Split into 2 lists: the first 43 genes are S-phase genes, the rest are G2/M-phase genes
s_genes = cell_cycle_genes[:43]
g2m_genes = cell_cycle_genes[43:]

In [None]:
# check how many genes are present in the data 
cell_cycle_genes = [x for x in cell_cycle_genes if x in canogamez.var_names]
print(len(cell_cycle_genes))

Calculate and plot score 

In [None]:
sc.tl.score_genes_cell_cycle(canogamez, s_genes=s_genes, g2m_genes=g2m_genes)

In [None]:
sc.pl.violin(canogamez, ['S_score', 'G2M_score'],
             jitter=False, groupby = 'cytokine.condition', rotation=45, save='cell_cycle_cytokines')

## Doublet Prediction

Import needed library

In [None]:
import scrublet as scr

Run prediction 

In [None]:
scrub = scr.Scrublet(canogamez.raw.X)
canogamez.obs['doublet_scores'], canogamez.obs['predicted_doublets'] = scrub.scrub_doublets()
scrub.plot_histogram()

sum(canogamez.obs['predicted_doublets'])

In [None]:
# add in column with doublet info
canogamez.obs['doublet_info'] = canogamez.obs["predicted_doublets"].astype(str)

In [None]:
sc.pl.violin(canogamez, 'n_genes_by_counts',
             jitter=False, groupby = 'doublet_info', rotation=45)

*Optional: Doublets can be removed* 

In [None]:
# canogamez = canogamez.raw.to_adata()
# canogamez = canogamez[canogamez.obs['doublet_info'] == 'False',:]

## Highly variable genes

In [None]:
sc.pp.highly_variable_genes(canogamez, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(canogamez)
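
The three cutoffs can be read as simple bounds on per-gene statistics. The sketch below assumes that they act on the per-gene mean and the normalized dispersion, as described for the default settings in the scanpy documentation; the per-gene values are invented, and the real function computes them from the data.

```python
import numpy as np

# Invented per-gene statistics for five genes.
mean = np.array([0.005, 0.5, 1.2, 4.0, 0.8])
disp_norm = np.array([2.0, 0.9, 0.1, 3.0, 0.6])   # normalized dispersion

# Cutoffs used in this notebook.
min_mean, max_mean, min_disp = 0.0125, 3, 0.5

highly_variable = (mean > min_mean) & (mean < max_mean) & (disp_norm > min_disp)
print(highly_variable)
```

In this toy example the first gene is excluded for being too lowly expressed, the fourth for being too highly expressed, and the third for not being variable enough.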

Keep only the highly variable genes for further analysis

In [None]:
canogamez = canogamez[:, canogamez.var.highly_variable]

### Further preprocessing

Regress out the effects of total counts, percentage of mitochondrial counts, 10X batch (`batch.10X`), and cell cycle scores

In [None]:
sc.pp.regress_out(canogamez, ['total_counts', 'pct_counts_mt','batch.10X', 'S_score', 'G2M_score'])
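
Regressing out a covariate means fitting a linear model per gene and keeping the residuals. The following is a minimal sketch for one simulated gene and one covariate, using ordinary least squares with an intercept; the data are simulated and the variable names are hypothetical. The library's own fitting routine may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 200

# Simulated covariate and a gene whose expression partly depends on it.
total_counts = rng.uniform(1_000, 5_000, n_cells)
gene = 0.001 * total_counts + rng.normal(0, 0.5, n_cells)

# Design matrix with an intercept column, fitted by ordinary least squares.
design = np.column_stack([np.ones(n_cells), total_counts])
coef, *_ = np.linalg.lstsq(design, gene, rcond=None)

# The residuals are what remains of the gene after removing the fitted effect.
residual = gene - design @ coef

corr_before = np.corrcoef(total_counts, gene)[0, 1]
corr_after = np.corrcoef(total_counts, residual)[0, 1]
print(round(corr_before, 3), round(corr_after, 3))
```

The residuals are uncorrelated with the covariate by construction, so any structure that remains in the data is not explained by a linear effect of that covariate.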

Scale each gene to unit variance and clip values at 10

In [None]:
sc.pp.scale(canogamez, max_value=10)
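
The effect of scaling with `max_value=10` can be sketched as a per-gene z-score followed by clipping. The example below uses simulated values for two hypothetical genes and assumes the sample standard deviation and clipping from above only; the exact estimator used by the library is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells = 200

# Two simulated genes: one well-behaved, one that is zero except in a single cell.
typical = rng.normal(5.0, 1.0, n_cells)
spiky = np.zeros(n_cells)
spiky[0] = 100.0
x = np.column_stack([typical, spiky])

# Zero mean and unit variance per gene (column).
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# Clip extreme values so that single outlier cells do not dominate downstream steps.
z_clipped = np.minimum(z, 10)

print(round(z[:, 1].max(), 2), z_clipped[:, 1].max())
```

Without clipping, the single outlier in the second gene would receive a z-score above 10; after clipping, its influence is capped.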

### Store the preprocessed data 

In [None]:
canogamez.write(results_file)