## 02 Saving Normalizations to AnnotatedGEMs

This notebook is a *how-to* guide on normalizing gene expression matrices using GSForge.
It does not cover considerations as to which normalization should be performed.

Recall that `.netcdf` files cannot be modified once written, so the choice to add a normalized count matrix
should be considered carefully. Normalizations that are not 'reversible' are good candidates to save to an
`AnnotatedGEM` object, as they may require other data (e.g. gene lengths). Many other
normalization methods can be run 'as-needed'.


***Set up the notebook***

In [None]:
import pandas as pd
import xarray as xr
import umap
import umap.plot
import sklearn.preprocessing
import GSForge as gsf
import holoviews as hv
hv.extension("bokeh")

***Declare used paths***

In [None]:
# OS-independent path management.
from os import environ
from pathlib import Path

In [None]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/osfstorage/oryza_sativa")).expanduser()
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_raw.nc")
TPM_GEM_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.TPM.txt")

Declare a path to which the created `.nc` file will be saved.

In [None]:
NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hydro_hisat2_normed.nc")
NORMED_GEM_PATH

***Load an AnnotatedGEM***

In [None]:
agem = gsf.AnnotatedGEM(GEM_PATH)
agem

---

### Adding Normalizations to an AnnotatedGEM

Here we demonstrate adding externally generated (TPM) counts to an existing `AnnotatedGEM` object.

In [None]:
%%time
tpm_count_df = pd.read_csv(TPM_GEM_PATH, sep="\t", index_col=0)

In [None]:
tpm_count_df.head()

There is a `pandas.DataFrame.to_xarray()` function, but the coordinates it produces are not quite what we want.
Instead we can generate an `xarray.Dataset` quickly through the standard creation call.

In [None]:
tpm_counts = xr.Dataset(
    data_vars={"TPM_counts": (("Sample", "Gene"), tpm_count_df.values.transpose())},
    coords={
        "Sample": tpm_count_df.columns.values,
        "Gene": tpm_count_df.index.values
    }
)
tpm_counts

Adding to the existing gem `xarray.Dataset` can be done via a call to `update()`.

In [None]:
agem.data.update(tpm_counts)

In [None]:
agem.count_array_names
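
The same pattern can be tried on toy data without the demo files. The sketch below is illustrative only: `tpm_df` stands in for the TPM table and `existing` stands in for `agem.data`; both names and all values are made up for this example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the TPM table: genes as rows, samples as columns.
tpm_df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["gene_a", "gene_b"],
    columns=["s1", "s2", "s3"],
)

# Toy stand-in for `agem.data`: an existing Dataset holding raw counts.
existing = xr.Dataset(
    data_vars={"counts": (("Sample", "Gene"), np.ones((3, 2)))},
    coords={"Sample": ["s1", "s2", "s3"], "Gene": ["gene_a", "gene_b"]},
)

# Transpose so the array is (Sample, Gene), matching the declared dimensions.
tpm_ds = xr.Dataset(
    data_vars={"TPM_counts": (("Sample", "Gene"), tpm_df.values.transpose())},
    coords={"Sample": tpm_df.columns.values, "Gene": tpm_df.index.values},
)

# update() modifies `existing` in place, adding the new variable.
existing.update(tpm_ds)
variable_names = sorted(existing.data_vars)
```

After the update, `existing` holds both `counts` and `TPM_counts` on the shared `Sample` and `Gene` coordinates.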

In [None]:
%%time
uq_raw_counts = gsf.operations.UpperQuartile(agem)

In [None]:
%%time
uq_tpm_counts = gsf.operations.UpperQuartile(agem, count_variable='TPM_counts')

We can also use dictionary-like assignment.

In [None]:
agem.data["uq_raw_counts"] = uq_raw_counts
agem.data["uq_tpm_counts"] = uq_tpm_counts
agem.count_array_names
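
For readers unfamiliar with upper-quartile scaling, here is a minimal NumPy sketch of the general idea. It is not taken from the GSForge source, and `gsf.operations.UpperQuartile` may differ in its details; the function name `upper_quartile_scale` and the toy values are assumptions made for illustration.

```python
import numpy as np

def upper_quartile_scale(counts):
    """Scale each sample (row) by the 75th percentile of its non-zero counts.

    Generic illustration only; not the GSForge implementation.
    """
    counts = np.asarray(counts, dtype=float)
    # Mask zeros so they do not drag the quartile down.
    masked = np.where(counts > 0, counts, np.nan)
    per_sample_uq = np.nanpercentile(masked, 75, axis=1)
    # Rescale by the mean upper quartile to keep values on a count-like scale.
    return counts / per_sample_uq[:, None] * per_sample_uq.mean()

# Two samples (rows) by five genes; the second is sequenced twice as deeply.
toy_counts = np.array([
    [0, 10, 20, 30, 40],
    [0, 20, 40, 60, 80],
])
uq_scaled = upper_quartile_scale(toy_counts)
```

In this toy case the two samples differ only by a depth factor of two, so both rows come out identical after scaling.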

#### Normalizations from Scikit-Learn

***Select counts using `get_gem_data()`***

In [None]:
counts, _ = gsf.get_gem_data(agem)

In [None]:
%%time
quantile_counts = sklearn.preprocessing.quantile_transform(counts, axis=1, output_distribution='normal', copy=True)
quantile_counts = xr.DataArray(quantile_counts, coords=counts.coords)

In [None]:
agem.data["quantile_counts"] = quantile_counts
agem.count_array_names
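
The quantile transform can also be explored on a small synthetic array. In this sketch the array `toy`, its coordinate labels, and the choice of `n_quantiles=50` are assumptions for illustration; the dimension names are passed explicitly when rebuilding the `xarray.DataArray` rather than inferred from the coordinates.

```python
import numpy as np
import xarray as xr
from sklearn.preprocessing import quantile_transform

# Synthetic counts: 4 samples by 50 genes.
rng = np.random.default_rng(0)
toy = xr.DataArray(
    rng.poisson(lam=20, size=(4, 50)).astype(float),
    dims=("Sample", "Gene"),
    coords={
        "Sample": [f"s{i}" for i in range(4)],
        "Gene": [f"g{i}" for i in range(50)],
    },
)

# axis=1 transforms each row (sample) independently across its genes.
transformed = quantile_transform(
    toy.values, axis=1, n_quantiles=50, output_distribution="normal", copy=True
)

# Passing dims explicitly avoids relying on dimension inference from coords.
toy_quantile = xr.DataArray(transformed, dims=toy.dims, coords=toy.coords)
```

The result keeps the shape and labels of the input, and the ordering of genes within each sample is preserved.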

### Using R to Normalize GEMs

See TODO for more information and examples on interacting with R.

***R integration setup***

In [None]:
import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR) # Suppresses verbose R output.

In [None]:
%%R
library("edgeR")

***Select counts using `get_gem_data()`***

In [None]:
counts, _ = gsf.get_gem_data(agem)

***Prepare the counts for R***

Notice that after this step the counts are transposed to a form more common in R (features by samples).

In [None]:
ri_counts = gsf.utils.R_interface.Py_counts_to_R(counts)
ri_counts.shape

***Run the normalization within R***

In [None]:
%%R -i ri_counts -o tmm_counts

dge_list <- DGEList(counts=ri_counts)
dge_list <- calcNormFactors(dge_list, method="TMM")
tmm_counts <- cpm(dge_list, normalized.lib.sizes=TRUE, log=FALSE)

In [None]:
tmm_counts = xr.DataArray(tmm_counts.T, coords=counts.coords, name='tmm_counts')
tmm_counts
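
To make the final `cpm()` step less opaque, the sketch below reproduces the counts-per-million arithmetic in NumPy, using the effective library size (library size multiplied by the normalization factor). It does not compute TMM factors; those are assumed to be supplied, as `calcNormFactors` does in the R cell. The function name `cpm_with_norm_factors` and the toy values are hypothetical.

```python
import numpy as np

def cpm_with_norm_factors(counts, norm_factors):
    """Counts per million using effective library sizes.

    counts: genes x samples array (the orientation used in R).
    norm_factors: one scaling factor per sample, assumed precomputed.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)
    effective_lib_sizes = lib_sizes * np.asarray(norm_factors, dtype=float)
    return counts / effective_lib_sizes * 1e6

# Three genes (rows) by two samples (columns).
toy_counts = np.array([
    [10, 20],
    [30, 60],
    [60, 120],
])
toy_cpm = cpm_with_norm_factors(toy_counts, norm_factors=[1.0, 1.0])
toy_cpm_scaled = cpm_with_norm_factors(toy_counts, norm_factors=[2.0, 1.0])
```

With factors of one, each column sums to one million; a factor of two for the first sample halves its values.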

***Add the counts to the GEM .data attribute.***

In [None]:
agem.data['tmm_counts'] = tmm_counts

### Save the AnnotatedGEM as a .netcdf file

Recall that `.nc` files cannot be overwritten, nor have variables added to them.
So we must either delete the existing file and save under the same name, or save to a new file.
The cell below only saves when no file exists yet at `NORMED_GEM_PATH`.

In [None]:
if not NORMED_GEM_PATH.exists():
    agem.save(NORMED_GEM_PATH)
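
Both options can be wrapped in a small helper. The function below is a hypothetical sketch, not part of GSForge: it either deletes an existing file so the name can be reused, or returns an unused name with a numeric suffix. It is demonstrated on a placeholder file in a temporary directory.

```python
from pathlib import Path
import tempfile

def resolve_save_path(path, overwrite=False):
    """Return a path that is safe to write a new file to.

    Hypothetical helper: with overwrite=True an existing file is deleted first;
    otherwise a numeric suffix is appended until an unused name is found.
    """
    path = Path(path)
    if not path.exists():
        return path
    if overwrite:
        path.unlink()
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1

# Demonstration in a temporary directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = Path(tmp_dir) / "normed.nc"
    existing.write_bytes(b"placeholder")
    new_name = resolve_save_path(existing).name
    same_path = resolve_save_path(existing, overwrite=True)
    was_removed = not same_path.exists()
```

In this notebook, the returned path could then be passed to `agem.save()` in place of `NORMED_GEM_PATH`.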