## SpaRED Library Processing DEMO

In this tutorial, we will explore the data processing functions available in the SpaRED library, focusing on four key areas:

* Gene Features
* Filtering
* Layer Operations
* Denoising

These processing functions are essential for preparing and refining spatial transcriptomics data, ensuring that it is ready for accurate and efficient analysis. This demonstration will showcase the preprocessing steps used in our paper, providing a detailed look at how to clean your data, extract meaningful features, and perform various operations on data layers.


In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as im
import os
import sys
from pathlib import Path

currentdir = os.getcwd()
parentdir = str(Path(currentdir).parents[2])
sys.path.insert(0, parentdir)
print(parentdir)

import spared

### Load Datasets

The `datasets` module provides `get_dataset`, a function that loads any supported dataset and returns an object exposing the filtered and processed adata along with the parameter dictionary. The function has a parameter called *visualize* that enables all visualizations when set to True. It also saves the raw_adata (not processed) in case it is required.

We will begin by loading a dataset with the *visualize* parameter set to False, since no images are required for the functions analyzed in this DEMO.

In [None]:
from spared.datasets import get_dataset
import anndata as ad

#get dataset
data = get_dataset("vicari_mouse_brain", visualize=False)

#adata
adata = data.adata

#parameters dictionary
param_dict = data.param_dict

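#loading raw adata (saved by get_dataset under the current working directory)
dataset_path = os.getcwd()
files_path = os.path.join(dataset_path, "processed_data/vicari_data/vicari_mouse_brain/")
#sort the listing so the chosen subfolder does not depend on filesystem order
files = sorted(os.listdir(files_path))
adata_path = os.path.join(files_path, files[0], "adata_raw.h5ad")
raw_adata = ad.read_h5ad(adata_path)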

### Gene features functions

In this section, we will explore the gene features functions available in the SpaRED library. These functions compute the per-slide and global expression fractions of every gene, as well as the Moran's I of each gene in an adata. These quantities describe gene expression patterns and their spatial distribution, and they are also used in the pre-processing steps.

Let's begin with `get_exp_frac`. This function receives as input:

* **adata (ad.AnnData):** adata collection where non-expressed genes have a value of `0` in the `adata.X` matrix

It returns an updated adata with the expression fraction information added to the `adata.var['exp_frac']` column. The expression fraction of a gene in a slide is defined as the proportion of spots in that slide where the gene is expressed.

In [None]:
from spared.gene_features import get_exp_frac

adata_exp = get_exp_frac(raw_adata)

`get_glob_exp_frac` receives as input:

* **adata (ad.AnnData):** adata collection where non-expressed genes have a value of `0` in the `adata.X` matrix

It returns an updated adata with the global expression fraction information added to the `adata.var['glob_exp_frac']` column. The global expression fraction of a gene in a dataset is defined as the proportion of spots where that gene is expressed. It differs from the expression fraction in that it is computed over the whole dataset rather than for each slide.
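
The difference between the two fractions can be illustrated with a minimal NumPy sketch on a toy count matrix. This is not SpaRED's implementation: the matrix, the slide labels, and in particular the choice to collapse per-slide values into one number per gene by taking the minimum across slides are assumptions made for illustration.

```python
import numpy as np

# Toy count matrix: 6 spots (rows) x 3 genes (columns); 0 means "not expressed".
counts = np.array([
    [5, 0, 1],
    [3, 0, 0],
    [0, 0, 2],
    [1, 4, 0],
    [2, 0, 0],
    [7, 1, 0],
])
# Hypothetical slide label of each spot.
slide_id = np.array(["s1", "s1", "s1", "s2", "s2", "s2"])

expressed = counts > 0

# Per-slide expression fraction: share of spots in each slide expressing the gene.
per_slide = {s: expressed[slide_id == s].mean(axis=0) for s in np.unique(slide_id)}

# Global expression fraction: share of spots in the whole collection.
glob_exp_frac = expressed.mean(axis=0)

# Assumption: summarize the per-slide values with the minimum across slides.
exp_frac = np.min(np.stack(list(per_slide.values())), axis=0)

print("per slide:", per_slide)
print("global:   ", glob_exp_frac)
print("min slide:", exp_frac)
```

In this toy example the second gene is expressed in a third of all spots but in none of the spots of slide `s1`, so a per-slide criterion is stricter than a global one.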

In [None]:
from spared.gene_features import get_glob_exp_frac

adata_exp = get_glob_exp_frac(raw_adata)

`compute_moran` receives as input:

* **adata (ad.AnnData):** adata object to update. Must have expression values in `adata.layers[from_layer]`.
* **from_layer (str):** key in adata.layers with the values used to compute Moran's I
* **hex_geometry (bool):** whether the geometry is hexagonal or not

And returns the updated adata with the average Moran's I for each gene in the `adata.var[f'{from_layer}_moran']` column.
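
For intuition about what Moran's I measures, here is a minimal NumPy sketch of the standard global statistic on a small square grid. It is not how `compute_moran` is implemented: the helper names `grid_weights` and `morans_i` are hypothetical, the neighbourhood is a simple rook adjacency on a square lattice (not a hexagonal one), and no per-slide averaging is performed.

```python
import numpy as np

def grid_weights(n_rows, n_cols):
    """Binary rook-adjacency matrix for a regular square grid (illustrative only)."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if r + 1 < n_rows:                 # neighbour below
                j = (r + 1) * n_cols + c
                w[i, j] = w[j, i] = 1
            if c + 1 < n_cols:                 # neighbour to the right
                j = r * n_cols + c + 1
                w[i, j] = w[j, i] = 1
    return w

def morans_i(x, w):
    """Global Moran's I: (N / sum(w)) * (z^T w z) / (z^T z), with z = x - mean(x)."""
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

w = grid_weights(4, 4)

# Smooth left-to-right gradient: neighbouring spots have similar values.
gradient = np.tile(np.arange(4.0), 4)
# Checkerboard: every neighbour has the opposite value.
checker = (np.indices((4, 4)).sum(axis=0) % 2).ravel().astype(float)

print(morans_i(gradient, w))  # about 0.667 (positive spatial autocorrelation)
print(morans_i(checker, w))   # -1.0 (perfect negative spatial autocorrelation)
```

Genes with spatially structured expression score close to 1, while spatially random genes score near 0, which is why the statistic is useful for ranking genes.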

In [None]:
from spared.gene_features import compute_moran

adata_moran = compute_moran(adata=adata, from_layer="c_d_log1p", hex_geometry=param_dict["hex_geometry"])

### Filtering functions

In this section, we will explore the filtering functions available in the SpaRED library. These functions refine your spatial transcriptomics data by filtering it according to specific criteria. We will demonstrate how to filter AnnData objects (adata) by the Moran's I of their genes, by the parameters defined in `param_dict`, and by slide.

Let's begin with `filter_by_moran`. This function receives as input:
* **adata (ad.AnnData):** adata object to update. The adata must contain an `adata.var[f'{from_layer}_moran']` column
* **n_keep (int):** The number of genes to keep in the filtering process
* **from_layer (str):** the layer for which the Moran's I was previously computed.

It returns an updated adata containing only the genes kept by the filter.
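
The selection step can be sketched as keeping the `n_keep` genes with the highest Moran's I. The snippet below is a toy NumPy illustration with made-up gene names and values; how the library orders the kept genes or breaks ties is not stated here, so preserving the original gene order is an assumption.

```python
import numpy as np

# Hypothetical genes and their previously computed Moran's I values.
gene_names = np.array(["g1", "g2", "g3", "g4", "g5"])
moran = np.array([0.10, 0.62, -0.05, 0.40, 0.33])
n_keep = 3

# Indices of the n_keep largest Moran's I values, highest first.
top_idx = np.argsort(moran)[::-1][:n_keep]

# A boolean mask keeps the genes in their original order when subsetting.
keep_mask = np.isin(np.arange(len(moran)), top_idx)
kept_genes = gene_names[keep_mask]

print(kept_genes)
```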

In [None]:
from spared.filtering import filter_by_moran

adata_moran = filter_by_moran(adata, n_keep=param_dict['top_moran_genes'], from_layer='d_log1p')

`filter_dataset` receives as input:
* **adata (ad.AnnData):** an unfiltered adata collection
* **param_dict (dict):** Dictionary that contains filtering and processing parameters. 

In the param_dict, the following keys must be present:
* `cell_min_counts` (*int*):      Minimum total counts for a spot to be valid.
* `cell_max_counts` (*int*):      Maximum total counts for a spot to be valid.
* `gene_min_counts` (*int*):      Minimum total counts for a gene to be valid.
* `gene_max_counts` (*int*):      Maximum total counts for a gene to be valid.
* `min_exp_frac` (*float*):       Minimum fraction of spots in any slide that must express a gene for it to be valid.
* `min_glob_exp_frac` (*float*):  Minimum fraction of spots in the whole collection that must express a gene for it to be valid.
* `wildcard_genes` (*str*):       Path to a `.txt` file with the genes to keep or `None` to filter genes based on the other keys.

The function returns a filtered adata collection. It first removes observations whose `total_counts` fall outside the range `[param_dict['cell_min_counts'], param_dict['cell_max_counts']]`, then computes the `exp_frac` and `glob_exp_frac` of each gene. Genes are then filtered according to `param_dict['wildcard_genes']`: if the value is a path, the genes listed in that file are kept; if it is `None`, the function removes the genes that:
* Are not expressed in at least `param_dict['min_exp_frac']` of spots in each slide.
* Are not expressed in at least `param_dict['min_glob_exp_frac']` of spots in the whole collection.
* Have counts outside the range `[param_dict['gene_min_counts'], param_dict['gene_max_counts']]`.

Finally, the genes with zero counts are removed. 
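
The filtering steps above, for the case where `wildcard_genes` is `None`, can be sketched on a toy matrix as follows. This is an illustration under stated assumptions rather than the library's code: the data and threshold values are made up, the range bounds are treated as inclusive, and "expressed in at least a fraction of spots in each slide" is implemented as the minimum per-slide fraction reaching the threshold.

```python
import numpy as np

# Toy counts: 6 spots (rows) x 4 genes (columns), with hypothetical slide labels.
counts = np.array([
    [4, 0, 1, 0],
    [2, 0, 0, 0],
    [0, 1, 0, 0],
    [3, 5, 0, 0],
    [6, 2, 0, 0],
    [1, 0, 1, 0],
])
slide_id = np.array(["s1", "s1", "s1", "s2", "s2", "s2"])

# Illustrative thresholds using the key names described above.
params = {
    "cell_min_counts": 2, "cell_max_counts": 100,
    "gene_min_counts": 3, "gene_max_counts": 1000,
    "min_exp_frac": 0.5, "min_glob_exp_frac": 0.5,
}

# 1. Drop spots whose total counts fall outside the allowed range.
spot_total = counts.sum(axis=1)
spot_ok = (spot_total >= params["cell_min_counts"]) & (spot_total <= params["cell_max_counts"])
counts, slide_id = counts[spot_ok], slide_id[spot_ok]

# 2. Compute per-slide (minimum across slides) and global expression fractions.
expressed = counts > 0
exp_frac = np.min([expressed[slide_id == s].mean(axis=0) for s in np.unique(slide_id)], axis=0)
glob_exp_frac = expressed.mean(axis=0)

# 3. Keep genes that pass every criterion.
gene_total = counts.sum(axis=0)
gene_ok = (
    (exp_frac >= params["min_exp_frac"])
    & (glob_exp_frac >= params["min_glob_exp_frac"])
    & (gene_total >= params["gene_min_counts"])
    & (gene_total <= params["gene_max_counts"])
)

# 4. Remove genes with zero counts.
gene_ok &= gene_total > 0

filtered = counts[:, gene_ok]
print(filtered.shape)
```

Only the first gene survives in this example: the second fails the per-slide fraction, the third fails the fraction and count thresholds, and the fourth has zero counts.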


In [None]:
from spared.filtering import filter_dataset

adata_filter = filter_dataset(adata, param_dict)

`get_slide_from_collection` receives as input:

* **collection (ad.AnnData):** adata object with all the slides concatenated
* **slide (str):** name of the slide to get from the collection.

And returns a filtered adata with only the specified slide.

In [None]:
from spared.filtering import get_slide_from_collection

slide_id = adata.obs.slide_id.unique()[0]
slide_adata = get_slide_from_collection(collection = adata,  slide=slide_id)

`get_slides_adata` receives as input:

* **collection (ad.AnnData):** adata object with several slides concatenated
* **slide_list (str):** a string with a list of slides separated by commas

It returns a list of adatas, one for every slide included in `slide_list`.
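
Conceptually, both slide functions select the spots whose slide label matches the requested name. The sketch below shows the idea with plain NumPy arrays standing in for the collection; the labels and values are made up, and stripping whitespace around the names is an assumption, not documented behaviour.

```python
import numpy as np

# Stand-in for a concatenated collection: one slide label and one data row per spot.
slide_id = np.array(["s1", "s1", "s2", "s3", "s2"])
values = np.arange(10).reshape(5, 2)

# Comma-separated string, in the same format as the slide_list parameter.
slide_list = "s1,s3"
requested = [name.strip() for name in slide_list.split(",")]

# One subset per requested slide, in the order given in the string.
per_slide = [values[slide_id == name] for name in requested]

for name, subset in zip(requested, per_slide):
    print(name, subset.shape)
```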

In [None]:
from spared.filtering import get_slides_adata

all_slides = ",".join(adata.obs.slide_id.unique().to_list())
slides_list = get_slides_adata(collection=adata, slide_list=all_slides)