# Pre-processing Tutorial

This tutorial shows how to use the built-in pre-processing workflow.<br> 
<font color="red">**Note that using this workflow is not required before using the model. You can implement your own pre-processing pipeline and use that instead. What matters is that the input to the model is an AnnData object in which adata.X contains normalized counts and adata.obs contains a key for cell types and a key for batch effect.**</font>

**Before running the tutorial:**
- Make sure you've installed the CELLULAR package
- Download the data named *Tutorial.rar* from [here](https://doi.org/10.5281/zenodo.10959788) and place the data folder under the */Tutorial* folder in this repository

In [None]:
# Load required packages
import scanpy as sc
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score
import numpy as np

import CELLULAR_CL as CELLULAR



## Pre-process data

The pre_process_data function accepts either an existing AnnData object through the *adata_* argument, or a count matrix, a list of genes, and a list of barcodes, from which it creates an AnnData object. <br>
The function calculates several quality control metrics and uses them to filter the data. <br>
It filters cells on the following metrics:
- 'log_n_counts': Shifted log of sum of counts per cell.
- 'log_n_genes': Shifted log of number of unique genes expressed per cell.
- 'pct_counts_in_top_20_genes': Fraction of total counts among the top 20 genes with the highest counts.
- 'mt_frac': Fraction of mitochondrial counts.
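
The four metrics above can be sketched with plain NumPy on a dense cells x genes count matrix. This is only an illustration of the definitions in the list, not the code CELLULAR runs internally: the function name `qc_metrics`, the `is_mito` mask, and the toy matrix are assumptions made for this example, and the top-genes metric is returned as a fraction, following the description above.

```python
import numpy as np

def qc_metrics(counts, is_mito, top_n=20):
    """Per-cell QC metrics for a dense cells x genes count matrix (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=1)                    # total counts per cell
    n_genes = (counts > 0).sum(axis=1)            # number of genes detected per cell
    k = min(top_n, counts.shape[1])
    # Sum of the k largest counts in each cell
    top = np.sort(counts, axis=1)[:, -k:].sum(axis=1)
    safe_total = np.where(total > 0, total, 1.0)  # avoid division by zero for empty cells
    return {
        "log_n_counts": np.log1p(total),          # shifted log: log(1 + x)
        "log_n_genes": np.log1p(n_genes),
        f"pct_counts_in_top_{top_n}_genes": top / safe_total,
        "mt_frac": counts[:, is_mito].sum(axis=1) / safe_total,
    }

# Toy data: 3 cells, 4 genes; the last gene is treated as mitochondrial (assumption)
counts = np.array([[10, 0, 5, 5],
                   [0, 0, 0, 2],
                   [1, 1, 1, 1]])
is_mito = np.array([False, False, False, True])
metrics = qc_metrics(counts, is_mito, top_n=2)
print(metrics["mt_frac"])
```

With real data the mitochondrial mask would typically come from the gene names, which depends on the annotation used in the dataset.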

It also removes genes that are expressed in fewer than 20 cells.
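
The gene filter can be pictured as follows. This is a minimal sketch that assumes "expressed" means a count greater than zero; `filter_genes` and `min_cells` are names chosen for this example and are not part of the CELLULAR API.

```python
import numpy as np

def filter_genes(counts, min_cells=20):
    """Keep genes detected in at least `min_cells` cells (hypothetical helper)."""
    counts = np.asarray(counts)
    cells_per_gene = (counts > 0).sum(axis=0)  # number of cells expressing each gene
    keep = cells_per_gene >= min_cells
    return counts[:, keep], keep

# Toy data: 3 cells, 3 genes, with a lowered threshold so the effect is visible
counts = np.array([[1, 0, 3],
                   [2, 0, 0],
                   [0, 5, 1]])
filtered, keep = filter_genes(counts, min_cells=2)
print(keep)            # boolean mask over genes
print(filtered.shape)  # cells x remaining genes
```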

Note that suitable Median Absolute Deviation (MAD) thresholds for filtering on these metrics depend on the dataset, so it is always a good idea to visualize the data and confirm that outliers were removed during pre-processing.
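
To make the role of the threshold concrete, here is a minimal sketch of a MAD-based outlier rule. It assumes a two-sided rule in which a cell is flagged when its metric lies more than `n_mads` MADs from the median; the function name, the default of 5, and the two-sided choice are assumptions for illustration and may differ from what CELLULAR does.

```python
import numpy as np

def is_outlier(values, n_mads=5.0):
    """Flag values more than `n_mads` MADs away from the median (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    return np.abs(values - median) > n_mads * mad

# Toy metric: six similar cells and one extreme cell
values = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 5.0])
mask = is_outlier(values, n_mads=5)
print(mask)
```

Lowering `n_mads` removes more cells and raising it removes fewer, which is why the appropriate value is worth checking against a plot of each metric.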

In [2]:
adata = CELLULAR.pre_process_data(
    count_file_path="data/GSM3396161_A/GSM3396161_matrix_A.mtx.gz",
    gene_data_path="data/GSM3396161_A/GSM3396161_genes_A.tsv.gz",
    barcode_data_path="data/GSM3396161_A/GSM3396161_barcodes_A.tsv.gz",
)



Number of cells before QC filtering: 2994
Number of cells removed by log_n_genes filtering: 66
Number of cells removed by log_n_counts filtering: 39
Number of cells removed by pct_counts_in_top_20_genes filtering: 620
Number of cells removed by mt_frac filtering: 22
Number of cells post QC filtering: 2311
Number of genes before filtering: 33694
Number of genes after filtering so there is a minimum of 20 unique cells per gene: 12438


## Normalize

Calculates a scaling factor for each cell and applies it, then applies a log1p transformation. <br>
After this step, adata.X contains the normalized counts.

In [3]:
adata = CELLULAR.log1p_normalize(adata)
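
For intuition, the two steps can be sketched in NumPy. The sketch assumes the per-cell scaling factor is the cell's total count divided by the mean total count across cells, which is a common size-factor choice; the actual factor used by `CELLULAR.log1p_normalize` may be computed differently, and `log1p_normalize_sketch` is a name invented for this example.

```python
import numpy as np

def log1p_normalize_sketch(counts):
    """Scale each cell by a size factor, then apply log(1 + x) (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # total counts per cell
    size_factors = totals / totals.mean()       # assumed definition of the scaling factor
    size_factors[size_factors == 0] = 1.0       # leave empty cells unchanged
    return np.log1p(counts / size_factors)

# Toy data: two cells with the same composition but different sequencing depth
counts = np.array([[2, 2],
                   [6, 6]])
normalized = log1p_normalize_sketch(counts)
print(normalized)
```

After scaling, both toy cells have the same total, so the difference in depth no longer shows up in the normalized values.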