# 01 Loading data into an AnnotatedGEM

This notebook describes how to create and save an `AnnotatedGEM` object from separate count and label text files.


***Download the demo data***

A demo gene expression matrix and accompanying annotation text files are stored in a public [OSF](https://osf.io) project.
You can download them by:
+ Navigating to the [data repository on OSF](https://osf.io/t3xpw/) and manually downloading the files.

or

+ Installing the [OSF CLI utility](https://osfclient.readthedocs.io/en/latest/index.html) and cloning the project to a directory:
    
    **Linux**
    ```bash
    # Install the osfclient.
    pip install osfclient
  
    # To clone the entire osf project:
    osf -p rbhfz clone ~/GSForge_demo_data
    
    # To pull the minimum number of files to complete the examples:
    osf 
    ```
  
    
The paths used in this example assume the second method was used.

***Set up the notebook***

In [None]:
# OS-independent path management.
from os import environ
from pathlib import Path
import pandas as pd
import GSForge as gsf

***Declare used paths***

Declare the OSF project directory path. This is the root directory of the data files used in this notebook.

In [None]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/osfstorage/oryza_sativa")).expanduser()
RAW_COUNT_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.raw.txt")
HYDRO_LABEL_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "raw_annotation_data", "PRJNA301554.hydroponic.annotations.txt")
MULTIQC_DIR = OSF_PATH.joinpath("GEMmakerGEMs", "multiqc_data")

Ensure these files exist.

In [None]:
assert RAW_COUNT_PATH.exists()
assert HYDRO_LABEL_PATH.exists()
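
A bare `assert` fails without saying which file is missing. As a minimal sketch, a small helper can collect the missing paths and report them together. The helper name `find_missing` and the file name used below are hypothetical and are not part of `GSForge`.

```python
from pathlib import Path


def find_missing(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


# Hypothetical input; in this notebook you would pass
# [RAW_COUNT_PATH, HYDRO_LABEL_PATH] instead.
missing = find_missing([Path("definitely_missing_file.txt")])
if missing:
    print("Missing files:", ", ".join(str(p) for p in missing))
```

Raising a `FileNotFoundError` with the same message would stop the notebook early, just as the assertions do.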

Finally, declare the path to which the created `.nc` file will be saved.

In [None]:
HYDRO_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_hydro_raw.nc")

## Loading data with `pandas`

***Loading the count matrix***

In [None]:
%%time
count_df = pd.read_csv(RAW_COUNT_PATH, sep="\t", index_col=0)

In [None]:
print(count_df.shape)
count_df.head()

***Loading the annotation table***

In [None]:
%%time
label_df = pd.read_csv(HYDRO_LABEL_PATH, index_col=1, sep="\t")
label_df['genotype'] = label_df['genotype'].str.split(" ", expand=True).iloc[:, 0]
label_df['time'] = label_df['time'].str.split(' ', expand=True).iloc[:, 0].astype(int)

In [None]:
label_df.head(2)
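
The two cleaning steps above keep only the first space-separated token of each value. The sketch below shows the same pattern on a toy table. The genotype and time strings are invented for illustration and may not match the format of the real annotation file.

```python
import pandas as pd

# Toy stand-in for the annotation table; all values are invented.
toy = pd.DataFrame({
    "genotype": ["Azucena (AZ)", "Tadukan (TD)"],
    "time": ["15 minutes", "30 minutes"],
})

# `expand=True` spreads the split tokens across columns;
# `.iloc[:, 0]` then keeps only the first token of each row.
toy["genotype"] = toy["genotype"].str.split(" ", expand=True).iloc[:, 0]
toy["time"] = toy["time"].str.split(" ", expand=True).iloc[:, 0].astype(int)

print(toy)
print(toy.dtypes)
```

After the cast, `time` is an integer column, so it can be sorted or used as a numeric axis rather than treated as text.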

## Combine the dataframes into an `AnnotatedGEM`

`AnnotatedGEM.from_pandas` does a bit of data wrangling, and loads the data into a single `xarray.Dataset`.
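
Before combining the two tables, it can help to confirm that they describe the same samples. The sketch below assumes the count matrix has genes as rows and samples as columns, and that the label table is indexed by sample name, as the `index_col` choices above suggest. It is a sanity check on toy data with invented names, not a description of what `from_pandas` does internally.

```python
import pandas as pd

# Toy data with invented names: genes as rows, samples as columns.
toy_counts = pd.DataFrame(
    {"S1": [5, 0], "S2": [3, 7], "S3": [1, 2]},
    index=["gene_a", "gene_b"],
)
# Toy label table indexed by sample name.
toy_labels = pd.DataFrame({"treatment": ["control", "heat"]}, index=["S1", "S2"])

count_samples = set(toy_counts.columns)
label_samples = set(toy_labels.index)

unlabeled = sorted(count_samples - label_samples)  # counted but not annotated
uncounted = sorted(label_samples - count_samples)  # annotated but not counted
shared = sorted(count_samples & label_samples)

print("Samples without labels:", unlabeled)
print("Labels without counts:", uncounted)
print("Shared samples:", shared)
```

For the real data, substitute `count_df` and `label_df` for the toy tables.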

In [None]:
agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Oryza sativa")
agem

***Examine the data***

In [None]:
agem.data

## Save the `AnnotatedGEM`

In [None]:
if not HYDRO_GEM_PATH.exists():
    agem.save(HYDRO_GEM_PATH)

## Creating an AnnotatedGEM from files

If your data is consistently formatted (as in the example above), you can load it directly
into an `AnnotatedGEM` with `AnnotatedGEM.from_files`.

If you do not provide a `sep` argument in the `count_kwargs` or `label_kwargs` dictionaries, `GSForge`
will attempt to infer it by reading the first line of each file.
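
To illustrate the general idea of inferring a separator from a file's first line, here is a minimal sketch that picks the candidate character occurring most often. The function names and the candidate list are assumptions made for this example. This is not `GSForge`'s actual implementation, which may use a different rule.

```python
import tempfile
from pathlib import Path


def infer_sep(first_line, candidates=("\t", ",", ";", " ")):
    """Guess the separator as the candidate that occurs most often in the line."""
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("No candidate separator found in the first line.")
    return best


def infer_sep_from_file(path):
    """Read only the first line of `path` and guess its separator."""
    with open(path) as handle:
        return infer_sep(handle.readline())


# Demonstrate on a small temporary tab-separated file with invented content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_counts.txt"
    demo_path.write_text("gene\tS1\tS2\ngene_a\t5\t3\n")
    demo_sep = infer_sep_from_file(demo_path)

print(repr(demo_sep))
```

A frequency count like this can guess wrong when header names themselves contain spaces or commas, which is one reason to pass `sep` explicitly when you know it.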

In [None]:
del agem

agem = gsf.AnnotatedGEM.from_files(
    count_path=RAW_COUNT_PATH,
    label_path=HYDRO_LABEL_PATH,
    # These are the default keyword arguments that `from_files` passes
    # on to its individual `pandas.read_csv` calls.
    count_kwargs=dict(index_col=0, sep="\t"),
    label_kwargs=dict(index_col=1, sep="\t"),
)
agem