This notebook is a *how-to* guide on normalizing gene expression matrices using GSForge.
It does not cover how to choose which normalization to perform.


The following papers were consulted in preparing this notebook.

+ Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias

Common normalization methods include:

+ ...


Regardless of which normalization method you choose, keep in mind that some transforms are not reversible. To maintain data integrity, keep the raw counts along with any other variables (e.g. gene length) needed to compute other normalizations.
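
The point about reversibility can be illustrated with a small sketch. It uses only numpy on a made-up samples-by-genes matrix (the names `raw_counts` and `gene_length` are hypothetical and not part of GSForge). A log transform can be undone exactly, whereas TPM discards each sample's sequencing depth, so two different raw matrices can map to the same TPM values.

```python
import numpy as np

# Hypothetical toy data: 3 samples x 4 genes of raw counts,
# plus one gene length (in bases) per gene.
raw_counts = np.array([
    [10, 0, 5, 85],
    [20, 4, 0, 76],
    [0, 8, 12, 30],
], dtype=float)
gene_length = np.array([1000.0, 500.0, 2000.0, 1500.0])

# A log transform is reversible: expm1 undoes log1p.
log_counts = np.log1p(raw_counts)
recovered = np.expm1(log_counts)


def tpm(counts, lengths):
    """Transcripts per million: length-normalize, then scale each sample to 1e6."""
    rate = counts / lengths
    return rate / rate.sum(axis=1, keepdims=True) * 1e6


# TPM is not reversible: doubling every count (twice the sequencing depth)
# yields exactly the same TPM values, so the raw counts cannot be recovered
# from TPM alone.
tpm_original = tpm(raw_counts, gene_length)
tpm_doubled = tpm(2 * raw_counts, gene_length)

print(np.allclose(recovered, raw_counts))
print(np.allclose(tpm_original, tpm_doubled))
```

Keeping the raw counts and the gene lengths alongside any derived variable avoids this loss of information.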

---

***Setting up the notebook***

In [None]:
import os
import GSForge as gsf
from pathlib import Path
import numpy as np
import holoviews as hv

hv.extension("bokeh")

***Declare paths used***

In [None]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path

In [None]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
assert AGEM_PATH.exists()

***Load an AnnotatedGEM***

In [None]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem

---

## Adding Normalizations to an AnnotatedGEM

If a normalization is expensive to compute, it can be worth saving to the `AnnotatedGEM` object.

In [None]:
%%time
uq_counts = gsf.operations.UpperQuartile(agem)

In [None]:
agem.data["uq_counts"] = uq_counts
agem.data
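
For intuition about what an upper-quartile normalization does, here is a minimal numpy sketch. It is not GSForge's implementation: it assumes one common variant, in which each sample is divided by the 75th percentile of its non-zero counts and then rescaled by the mean of those quartiles. The function name `upper_quartile_normalize` is hypothetical.

```python
import numpy as np


def upper_quartile_normalize(counts):
    """Scale each sample (row) by the 75th percentile of its non-zero counts.

    Assumes a samples x genes matrix in which every sample has at least
    one non-zero count; an all-zero row would produce NaN values.
    """
    counts = np.asarray(counts, dtype=float)
    # Mask zeros so that unexpressed genes do not pull the quartile down.
    nonzero = np.where(counts > 0, counts, np.nan)
    upper_quartiles = np.nanpercentile(nonzero, 75, axis=1, keepdims=True)
    # Rescale by the mean quartile so values stay on a count-like scale.
    return counts / upper_quartiles * upper_quartiles.mean()


# Two samples with the same expression profile, the second sequenced
# at twice the depth of the first.
toy_counts = np.array([
    [0, 10, 20, 30, 40],
    [0, 20, 40, 60, 80],
])
toy_normalized = upper_quartile_normalize(toy_counts)
print(toy_normalized)
```

After normalization the two rows are identical, which is the intended effect: the difference in sequencing depth has been removed.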

### Save the AnnotatedGEM as a netCDF (.nc) file

In [None]:
# agem.save(AGEM_PATH)

***Viewing the effect of transforms and normalizations***

In [None]:
gsf.plots.gem.GenewiseAggregateScatter(agem, count_variable="uq_counts", datashade=True)

In [None]:
gsf.plots.gem.GenewiseAggregateScatter(agem, count_variable="counts", datashade=True)
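
The comparison made by the two plots above can also be expressed numerically. The sketch below is an illustration only: it assumes the gene-wise aggregates of interest are the per-gene mean and variance (the plot's actual defaults may differ), uses simulated counts rather than the rice data, and uses a simple total-count scaling as a stand-in for any per-sample normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 6 samples x 200 genes, where each sample
# has a different sequencing depth.
depth = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0])[:, None]
base_expression = rng.gamma(shape=2.0, scale=50.0, size=(1, 200))
sim_counts = rng.poisson(base_expression * depth).astype(float)

# Total-count scaling, standing in for any per-sample normalization.
totals = sim_counts.sum(axis=1, keepdims=True)
sim_scaled = sim_counts / totals * totals.mean()


def genewise_mean_var(matrix):
    """Return the per-gene mean and variance across samples."""
    return matrix.mean(axis=0), matrix.var(axis=0)


raw_mean, raw_var = genewise_mean_var(sim_counts)
norm_mean, norm_var = genewise_mean_var(sim_scaled)

# Depth differences inflate the raw gene-wise variance; scaling removes
# most of that inflation.
print(np.median(raw_var), np.median(norm_var))
```

In this simulation the median gene-wise variance drops substantially after scaling, because most of the raw variance comes from depth differences between samples rather than from the genes themselves.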