# Saving Count Normalizations to `AnnotatedGEM`s

This notebook is a *how-to* guide on normalizing gene expression matrices using GSForge.
It does not cover considerations as to which normalization should be performed.

Regardless of which normalization method you choose, keep in mind that some transforms are not reversible. To maintain data integrity, keep the raw counts, along with any other variables (e.g. gene length) needed to compute other normalizations.
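
To illustrate why the raw counts and gene lengths are worth keeping, here is a minimal sketch of computing TPM from raw counts with NumPy. The function name `tpm_from_raw` and the toy arrays are hypothetical and are not part of GSForge; the formula is the usual length-normalize, then scale-to-one-million definition of TPM.

```python
import numpy as np


def tpm_from_raw(raw_counts, gene_lengths_bp):
    """Compute TPM from raw counts (hypothetical helper, not a GSForge function).

    raw_counts: array of shape (samples, genes)
    gene_lengths_bp: array of shape (genes,), lengths in base pairs
    """
    raw_counts = np.asarray(raw_counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1_000
    # Reads per kilobase for every gene in every sample.
    rate = raw_counts / lengths_kb
    # Scale each sample so its values sum to one million.
    # Assumes every sample has at least one non-zero count.
    return rate / rate.sum(axis=1, keepdims=True) * 1e6


# Toy data: two samples, three genes.
raw = np.array([[10, 20, 30], [0, 50, 50]])
lengths = np.array([1000, 2000, 3000])
tpm = tpm_from_raw(raw, lengths)
```

Each row of `tpm` sums to one million, so the original library sizes are no longer recoverable from it. Going back to raw counts, or on to a different length-aware normalization, requires the stored raw counts and gene lengths.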

***Set up the notebook***

In [None]:
import os
import GSForge as gsf
from pathlib import Path
import pandas as pd
import numpy as np
import xarray as xr
import holoviews as hv

hv.extension("bokeh")

***Declare used paths***

In [None]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path

In [None]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/osfstorage")).expanduser()
HYDRO_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hydro_raw.nc")
TPM_COUNT_PATH = OSF_PATH.joinpath("GEMmaker_GEMs", "Osativa_heat_drought_PRJNA301554.GEM.TPM.txt")
assert TPM_COUNT_PATH.exists()
assert HYDRO_GEM_PATH.exists()

Declare a path to which the created `.nc` file will be saved.

In [None]:
HYDRO_NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hydro_normed.nc")
HYDRO_NORMED_GEM_PATH

***Load an AnnotatedGEM***

In [None]:
agem = gsf.AnnotatedGEM(HYDRO_GEM_PATH)
agem

---

## Adding Normalizations to an AnnotatedGEM

Here we demonstrate adding externally generated (TPM) counts to an existing `AnnotatedGEM` object.

In [None]:
%%time
tpm_count_df = pd.read_csv(TPM_COUNT_PATH, sep="\t", index_col=0)

In [None]:
tpm_count_df.head()

There is a `pandas.DataFrame.to_xarray()` method, but the coordinates it produces are not quite what we want.
Instead we can build an `xarray.Dataset` directly through the standard constructor, naming the dimensions `Sample` and `Gene` ourselves.

In [None]:
tpm_counts = xr.Dataset(
    data_vars={"TPM_counts": (("Sample", "Gene"), tpm_count_df.values.transpose())},
    coords={
        "Sample": tpm_count_df.columns.values,
        "Gene": tpm_count_df.index.values
    }
)
tpm_counts
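
The difference between the two approaches is easier to see on a toy table. The sketch below uses a made-up two-gene, three-sample `DataFrame` (the names `toy_df`, `gene_id`, `s1`... are illustrative only) laid out like the TPM file, with genes as rows and samples as columns.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy gene-by-sample table, mimicking the layout of the TPM text file.
toy_df = pd.DataFrame(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    index=pd.Index(["gene_a", "gene_b"], name="gene_id"),
    columns=["s1", "s2", "s3"],
)

# to_xarray() makes every column (sample) its own data variable,
# each indexed only by the DataFrame index.
per_column = toy_df.to_xarray()

# Explicit construction yields one (Sample, Gene) count matrix instead.
# The transpose is needed because the table stores genes as rows.
toy_counts = xr.Dataset(
    data_vars={"TPM_counts": (("Sample", "Gene"), toy_df.values.T)},
    coords={"Sample": toy_df.columns.values, "Gene": toy_df.index.values},
)
```

Here `per_column` holds three separate variables (`s1`, `s2`, `s3`) along a `gene_id` dimension, whereas `toy_counts` holds a single `TPM_counts` variable with `Sample` and `Gene` dimensions.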

Adding these counts to the existing GEM `xarray.Dataset` can be done via a call to `update()`.

In [None]:
agem.data.update(tpm_counts)

In [None]:
agem.count_array_names

In [None]:
%%time
uq_raw_counts = gsf.operations.UpperQuartile(agem)

In [None]:
%%time
uq_tpm_counts = gsf.operations.UpperQuartile(agem, count_variable='TPM_counts')
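
For intuition about what an upper-quartile normalization does, here is a minimal NumPy sketch of one common formulation: divide each sample by the 75th percentile of its non-zero counts. This is an assumption for illustration only; `upper_quartile_scale` is a hypothetical function, and `gsf.operations.UpperQuartile` may differ in details such as zero handling or rescaling.

```python
import numpy as np


def upper_quartile_scale(counts):
    """Scale each sample by its upper quartile (illustrative, not GSForge's code).

    counts: array of shape (samples, genes)
    """
    counts = np.asarray(counts, dtype=float)
    # Ignore zero counts when locating the upper quartile of each sample.
    nonzero = np.where(counts > 0, counts, np.nan)
    # 75th percentile per sample; a sample of all zeros would give NaN here.
    upper_q = np.nanpercentile(nonzero, 75, axis=1, keepdims=True)
    return counts / upper_q


# Two toy samples; the second is the first sequenced at twice the depth.
toy_counts = np.array([
    [0, 1, 2, 3, 4, 5],
    [0, 2, 4, 6, 8, 10],
])
scaled = upper_quartile_scale(toy_counts)
```

After scaling, both rows of `scaled` are identical, which shows how a purely depth-related difference between samples is removed.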

We can also use dictionary-like assignment.

In [None]:
agem.data["uq_raw_counts"] = uq_raw_counts
agem.data["uq_tpm_counts"] = uq_tpm_counts
agem.count_array_names

### Save the AnnotatedGEM as a netCDF (`.nc`) file

Recall that `.nc` files cannot be overwritten, nor can variables be added to them. We therefore need to either delete the existing file and save under the same name, or save to a new file.

In [None]:
if not HYDRO_NORMED_GEM_PATH.exists():
    agem.save(HYDRO_NORMED_GEM_PATH)
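
If you would rather replace an existing file than skip the save, the delete-then-save option can be sketched with `pathlib` alone. The helper `prepare_output_path` is hypothetical and not part of GSForge; the demonstration runs against a throwaway temporary file rather than real data.

```python
import tempfile
from pathlib import Path


def prepare_output_path(path, overwrite=False):
    """Return `path` ready for a fresh save (hypothetical helper).

    Deletes an existing file only when `overwrite` is True;
    otherwise raises FileExistsError so nothing is lost silently.
    """
    path = Path(path)
    if path.exists():
        if not overwrite:
            raise FileExistsError(f"{path} exists; pass overwrite=True to replace it.")
        path.unlink()
    # Make sure the destination directory is present.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration on a temporary stand-in for a stale .nc file.
with tempfile.TemporaryDirectory() as tmp_dir:
    stale = Path(tmp_dir) / "demo_normed.nc"
    stale.write_bytes(b"stale contents")

    try:
        prepare_output_path(stale)
        refused = False
    except FileExistsError:
        refused = True

    cleared = prepare_output_path(stale, overwrite=True)
    exists_after_clear = cleared.exists()
```

In this notebook the call would then read `agem.save(prepare_output_path(HYDRO_NORMED_GEM_PATH, overwrite=True))`. Only do this when the target is a different file from the one `agem` was loaded from, as is the case here.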