# Creating an AnnotatedGEM from Text Data

This notebook describes how to create and save an `AnnotatedGEM` object from separate count and label text files.

A count matrix and an annotation table are often created as separate text files.
The count matrix is typically formatted with genes as rows and samples as columns, reflecting the way counts are calculated.
The annotation file must be indexed by the same sample identifiers that appear in the count file.
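
As a small illustration of that requirement, the sketch below builds two toy tables and checks that the count matrix columns and the annotation index describe the same samples. The gene names, sample names, and values are made up for the example; this is one possible sanity check with plain `pandas`, not part of GSForge.

```python
import pandas as pd

# Toy count matrix: genes as rows, samples as columns (hypothetical values).
count_df = pd.DataFrame(
    {"sample_1": [5.0, 0.0], "sample_2": [3.0, 7.0], "sample_3": [1.0, 2.0]},
    index=["gene_a", "gene_b"],
)

# Toy annotation table indexed by sample; note the different row order.
label_df = pd.DataFrame(
    {"treatment": ["HEAT", "CONTROL", "CONTROL"]},
    index=["sample_3", "sample_1", "sample_2"],
)

# Samples present in one table but missing from the other.
missing_labels = count_df.columns.difference(label_df.index)
missing_counts = label_df.index.difference(count_df.columns)

samples_match = missing_labels.empty and missing_counts.empty

# Reorder the annotations to follow the count matrix column order.
aligned_labels = label_df.loc[count_df.columns]
```

If either `missing_labels` or `missing_counts` is non-empty, the two files do not describe the same set of samples and should be reconciled before they are combined.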

***Download the demo data***

A demo gene expression matrix and accompanying annotation text files are stored in a public [OSF](https://osf.io/) project.
You can download them by:
+ Navigating to the [data repository on OSF](https://osf.io/t3xpw/) and downloading them manually.

or

+ Installing the [OSF CLI utility](https://osfclient.readthedocs.io/en/latest/index.html) and cloning the project to a directory:
    
    **Linux**
    ```bash
    osf -p rbhfz clone ~/GSForge_demo_data
    ```
  
    
The paths used in this example assume the second method was used.

***Set up the notebook***

In [2]:
import pandas as pd
import GSForge as gsf

***Declare used paths***

In [3]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path

Declare the OSF project directory path. This is the root directory of the data files used in this notebook.

I use all capitals to denote 'global' notebook variables.

In [4]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/osfstorage")).expanduser()
OSF_PATH

WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage')
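
To make the precedence in that lookup explicit, here is a minimal sketch of the same logic wrapped in a function. The helper name `resolve_data_root` and the `/data/gsforge` path are hypothetical and used only for illustration; the environment variable name and default come from the cell above.

```python
from pathlib import Path


def resolve_data_root(env, default="~/GSForge_demo_data/osfstorage"):
    """Return the data root from a mapping such as `os.environ`, or the default."""
    return Path(env.get("GSFORGE_DEMO_DATA", default)).expanduser()


# When the variable is set, its value takes precedence.
custom = resolve_data_root({"GSFORGE_DEMO_DATA": "/data/gsforge"})

# When it is not set, the default is used and "~" expands to the home directory.
fallback = resolve_data_root({})
```

Passing the mapping as an argument keeps the function easy to exercise without modifying the real environment; in the notebook you would pass `os.environ`.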

View the files within:

In [5]:
list(OSF_PATH.glob("*/*"))

[WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/AnnotatedGEMs/oryza_sativa_hydro_normed_raw.nc'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/AnnotatedGEMs/oryza_sativa_hydro_raw.nc'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/Collections/boruta'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/Collections/literature'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/GEMmaker_GEMs/Osativa_heat_drought_PRJNA297424.GEM.raw.txt'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/GEMmaker_GEMs/Osativa_heat_drought_PRJNA301554.GEM.raw.txt'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/GEMmaker_GEMs/Osativa_heat_drought_PRJNA301554.GEM.TPM.txt'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/raw_annotation_data/PRJNA297424.infield.annotations.txt'),
 WindowsPath('C:/Users/tyler/GSForge_demo_data/osfstorage/raw_annotation_data/PRJNA301554.hydroponic.annotations.txt'),
 WindowsPath('C:/Users/tyler/G

Declare the paths to the count and label files.

In [6]:
RAW_COUNT_PATH = OSF_PATH.joinpath("GEMmaker_GEMs", "Osativa_heat_drought_PRJNA301554.GEM.raw.txt")
HYDRO_LABEL_PATH = OSF_PATH.joinpath("raw_annotation_data", "PRJNA301554.hydroponic.annotations.txt")

Ensure these files exist.

In [7]:
assert RAW_COUNT_PATH.exists()
assert HYDRO_LABEL_PATH.exists()

Finally, declare a path to which the created `.nc` file will be saved.

In [8]:
HYDRO_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hydro_raw.nc")

---

### Loading data with `pandas`

***Loading the count matrix***

In [8]:
%%time
count_df = pd.read_csv(RAW_COUNT_PATH, sep="\t", index_col=0)

Wall time: 3.7 s


In [9]:
count_df.head()

| | SRX1423934 | SRX1423935 | SRX1423936 | SRX1423937 | SRX1423938 | SRX1423939 | SRX1423940 | SRX1423941 | SRX1423942 | SRX1423943 | ... | SRX1424399 | SRX1424400 | SRX1424401 | SRX1424402 | SRX1424403 | SRX1424404 | SRX1424405 | SRX1424406 | SRX1424407 | SRX1424408 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LOC_Os01g01010.1 | 693.55 | 205.0 | 717.702 | 190.556 | 588.875 | 966.481 | 446.77 | 1135.4 | 618.415 | 602.502 | ... | 630.823 | 573.156 | 486.0 | 511.739 | 1017.13 | 639.689 | 688.608 | 564.539 | 527.463 | 532.266 |
| LOC_Os01g01010.2 | 38.45 | 0.0 | 24.2982 | 16.4442 | 40.1252 | 51.5195 | 28.2304 | 36.6024 | 32.5853 | 53.4985 | ... | 28.1768 | 54.8441 | 0.0 | 66.2612 | 34.8668 | 28.3106 | 22.3919 | 25.4606 | 42.5369 | 22.7338 |
| LOC_Os01g01019.1 | 7.0 | 2.0 | 8.0 | 0.0 | 11.0 | 5.0 | 3.0 | 11.0 | 4.0 | 8.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| LOC_Os01g01030.1 | 43.0 | 29.0 | 43.0 | 13.0 | 80.0 | 77.0 | 47.0 | 82.0 | 61.0 | 64.0 | ... | 47.0 | 44.0 | 45.0 | 49.0 | 61.0 | 59.0 | 52.0 | 60.0 | 37.0 | 50.0 |
| LOC_Os01g01040.4 | 19.9315 | 0.0 | 11.7999 | 0.0 | 9.04003 | 10.414 | 17.951 | 16.8159 | 10.7513 | 7.62065 | ... | 13.6727 | 15.8265 | 12.6427 | 9.11728 | 22.1365 | 19.0306 | 21.1138 | 9.41076 | 16.4023 | 0.0 |


***Loading the annotation table***

In [10]:
%%time
label_df = pd.read_csv(HYDRO_LABEL_PATH, index_col=1, sep="\t")

Wall time: 10 ms


In [11]:
label_df.head(2)

| Experiment | BioSample | LoadDate | MBases | MBytes | Run | SRA_Sample | Sample_Name | genotype | time | treatment | ... | Instrument | LibraryLayout | LibrarySelection | LibrarySource | Organism | Platform | ReleaseDate | SRA_Study | source_name | tissue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRX1423937 | SAMN04251851 | 2015-11-20 | 1166 | 764 | SRR2931043 | SRS1156717 | GSM1933349 | Azuenca (AZ; IRGC#328, Japonica) | 30 min | CONTROL | ... | Illumina HiSeq 2000 | PAIRED | cDNA | TRANSCRIPTOMIC | Oryza sativa | ILLUMINA | 2016-01-04 | SRP065945 | Rice leaf | leaf |
| SRX1423938 | SAMN04251852 | 2015-11-20 | 4005 | 2500 | SRR2931044 | SRS1156720 | GSM1933350 | Azuenca (AZ; IRGC#328, Japonica) | 45 min | CONTROL | ... | Illumina HiSeq 2000 | PAIRED | cDNA | TRANSCRIPTOMIC | Oryza sativa | ILLUMINA | 2016-01-04 | SRP065945 | Rice leaf | leaf |


### Combine the dataframes into an `AnnotatedGEM`

`AnnotatedGEM.from_pandas` does a bit of data wrangling and loads the count and label data into a single `xarray.Dataset`.

In [12]:
agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Oryza Sativa")
agem

<GSForge.AnnotatedGEM>
Name: Oryza Sativa
Selected GEM Variable: 'counts'
    Gene   66338
    Sample 475

***Examine the data***

In [13]:
agem.data

### Save the `AnnotatedGEM`

In [14]:
if not HYDRO_GEM_PATH.exists():
    agem.save(HYDRO_GEM_PATH)

### Creating an AnnotatedGEM from files

If you are fortunate enough to have consistently formatted data (like the above example), you can load your data directly into an `AnnotatedGEM`.

If you do not provide a `sep` argument in the `count_kwargs` or `label_kwargs` dictionaries, GSForge will attempt to infer it by reading the first line of each file.
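
The inference logic itself is not shown in this notebook. As a rough sketch of the general idea only, and not GSForge's actual implementation, a delimiter can be guessed by counting candidate characters in the first line of a file. The helper name `infer_sep`, the candidate list, and the toy file contents are all assumptions made for illustration.

```python
import io


def infer_sep(handle, candidates=("\t", ",", ";")):
    """Guess a delimiter by counting candidate characters in the first line."""
    first_line = handle.readline()
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("No candidate delimiter found in the first line.")
    return best


# A tab-separated header, similar in layout to the count file above.
tsv = io.StringIO("\tSRX1423934\tSRX1423935\nLOC_Os01g01010.1\t693.55\t205.0\n")

# A hypothetical comma-separated annotation file.
csv_text = io.StringIO("Experiment,BioSample,LoadDate\nSRX1423937,SAMN04251851,2015-11-20\n")

tsv_sep = infer_sep(tsv)
csv_sep = infer_sep(csv_text)
```

A simple count like this ignores quoting, so a header with commas inside quoted fields could mislead it. Passing `sep` explicitly, as in the cell below, avoids relying on any guess.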

In [15]:
agem = gsf.AnnotatedGEM.from_files(
    count_path=RAW_COUNT_PATH,
    label_path=HYDRO_LABEL_PATH,
    # These are the default arguments that `from_files` passes
    # to its individual calls to `pandas.read_csv`.
    count_kwargs=dict(index_col=0, sep="\t"),
    label_kwargs=dict(index_col=1, sep="\t"),
)
agem

<GSForge.AnnotatedGEM>
Name: AnnotatedGEM00098
Selected GEM Variable: 'counts'
    Gene   66338
    Sample 475