# Creating a GSForge.AnnotatedGEM from Text Data

This notebook describes how to create and save an `AnnotatedGEM` object from separate count and label text files.

A count matrix and an annotation table are often created as separate text files.
The count matrix is often formatted with samples as columns and genes as rows due to the way counts are calculated.
The annotation file must be indexed by the same sample identifiers that appear as columns in the count file.
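
To make the expected layout concrete, here is a minimal sketch using made-up values. The gene and sample names (`gene_a`, `S1`, and so on) and the `sample` column name are hypothetical and are not taken from the demo data. It parses a tab-delimited count matrix (genes as rows, samples as columns) and a comma-separated annotation table from in-memory text, then checks that both name the same samples.

```python
import io

import pandas as pd

# Hypothetical tab-delimited count matrix: genes as rows, samples as columns.
count_text = "\tS1\tS2\tS3\ngene_a\t1.0\t0.0\t2.5\ngene_b\t4.2\t3.1\t0.0\n"

# Hypothetical comma-separated annotation table: one row per sample.
label_text = "sample,treatment\nS1,control\nS2,salt-stress\nS3,control\n"

toy_counts = pd.read_csv(io.StringIO(count_text), sep="\t", index_col=0)
toy_labels = pd.read_csv(io.StringIO(label_text), index_col="sample")

# The count matrix columns and the annotation index should name the same samples.
samples_match = set(toy_counts.columns) == set(toy_labels.index)
print(toy_counts.shape, toy_labels.shape, samples_match)
```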



***Downloading the demo data***

A demo gene expression matrix and accompanying annotation text files are stored in a public [OSF](https://osf.io/) project.
You can download them by:
+ Navigating to the [data repository on OSF](https://osf.io/t3xpw/) and downloading them manually.

or

+ Installing the [OSF CLI utility](https://osfclient.readthedocs.io/en/latest/index.html) and cloning the project to a directory:
    ```bash
    osf -p t3xpw clone ~/GSForge_demo_data
    ```
    
The paths used in this example assume the second method was used.

***Declaring used paths***

In [1]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path

Declare the OSF project directory path.

In [3]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
OSF_PATH

PosixPath('/home/tyler/GSForge_demo_data')
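
The lookup above can be wrapped in a small helper, sketched below. `resolve_data_dir` is a hypothetical name and not part of GSForge. It prefers an environment variable and otherwise falls back to the default clone location. The `MY_DEMO_DATA` and `UNSET_DEMO_VARIABLE` names are made up for illustration.

```python
from os import environ
from pathlib import Path


def resolve_data_dir(env_var="GSFORGE_DEMO_DATA", default="~/GSForge_demo_data"):
    """Return the data directory from the environment variable, else the default."""
    # expanduser() replaces a leading "~" with the current user's home directory.
    return Path(environ.get(env_var, default)).expanduser()


# Hypothetical variable, set here only to show the override taking effect.
environ["MY_DEMO_DATA"] = "/tmp/my_demo_data"
custom_dir = resolve_data_dir(env_var="MY_DEMO_DATA")

# With the variable unset, the default location is used instead.
default_dir = resolve_data_dir(env_var="UNSET_DEMO_VARIABLE")
print(custom_dir, default_dir)
```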

View the files within:

In [4]:
list(OSF_PATH.glob("**/*"))

[PosixPath('/home/tyler/GSForge_demo_data/osfstorage'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/rice_PRJNA385135_salt.GEM.FPKM.txt'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/rice_annotations.csv')]

Declare the paths to the count and label files.

In [32]:
COUNT_PATH = OSF_PATH.joinpath("osfstorage", "rice_PRJNA385135_salt.GEM.FPKM.txt")
LABEL_PATH = OSF_PATH.joinpath("osfstorage", "rice_annotations.csv")
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")

Ensure these files exist.

In [6]:
assert COUNT_PATH.exists()
assert LABEL_PATH.exists()
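
A bare `assert` does not say which file is missing. As one possible alternative, the sketch below collects every missing path and reports them together. `find_missing` and `require_files` are hypothetical helper names, not part of GSForge.

```python
from pathlib import Path


def find_missing(*paths):
    """Return the given paths that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


def require_files(*paths):
    """Raise FileNotFoundError naming every missing path."""
    missing = find_missing(*paths)
    if missing:
        names = ", ".join(str(p) for p in missing)
        raise FileNotFoundError(f"Missing input file(s): {names}")
```

In this notebook it could be called as `require_files(COUNT_PATH, LABEL_PATH)`.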

---

***Preparing the notebook***

In [7]:
import pandas as pd
import GSForge as gsf

### Loading data with `pandas`

***Loading the count matrix***

In [9]:
%%time
count_df = pd.read_csv(COUNT_PATH, sep="\t", index_col=0)

CPU times: user 3.17 s, sys: 284 ms, total: 3.45 s
Wall time: 3.45 s


In [10]:
count_df.head()

|  | SRX2776263 | SRX2776295 | SRX2776363 | SRX2776365 | SRX2776356 | SRX2776371 | SRX2776358 | SRX2776360 | SRX2776359 | SRX2776370 | ... | SRX2776313 | SRX2776318 | SRX2776307 | SRX2776320 | SRX2776308 | SRX2776298 | SRX2776300 | SRX2776305 | SRX2776312 | SRX2776319 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LOC_Os01g01010 | 11.319679 | 8.365713 | 7.707072 | 7.815566 | 10.529035 | 14.150313 | 15.708982 | 18.614389 | 16.841938 | 13.563243 | ... | 8.81454 | 9.001955 | 10.926337 | 8.710381 | 13.80474 | 10.987262 | 14.087475 | 8.792207 | 12.102121 | 11.336815 |
| LOC_Os01g01019 | 1.23988 | 1.169013 | 1.904225 | 1.291243 | 1.43194 | 1.605496 | 0.291904 | 0.0 | 0.245943 | 1.35028 | ... | 0.315189 | 1.345944 | 0.0 | 0.674577 | 0.0 | 1.676842 | 1.420893 | 2.285043 | 0.411098 | 1.411578 |
| LOC_Os01g01030 | 3.038393 | 3.126105 | 3.713148 | 4.377853 | 4.378384 | 2.242388 | 2.242876 | 1.953802 | 0.972326 | 1.136529 | ... | 4.389473 | 4.198297 | 4.894191 | 4.802093 | 1.974933 | 5.032101 | 5.18051 | 3.479917 | 1.231503 | 4.758714 |
| LOC_Os01g01040 | 12.806965 | 11.518446 | 11.542926 | 12.7205 | 17.63044 | 11.57632 | 14.012661 | 13.156128 | 10.993877 | 8.153295 | ... | 6.60434 | 11.942759 | 11.879583 | 10.957371 | 14.05265 | 15.094474 | 15.403028 | 10.148459 | 13.70868 | 13.684376 |
| LOC_Os01g01050 | 8.39368 | 4.518756 | 5.226342 | 5.784232 | 7.859654 | 7.358202 | 8.218661 | 6.687577 | 7.577455 | 8.084127 | ... | 5.292066 | 6.102956 | 7.374552 | 6.512951 | 11.378443 | 9.839329 | 11.605671 | 5.666173 | 9.372606 | 8.342855 |


***Loading the annotation table***

In [22]:
%%time
label_df = pd.read_csv(LABEL_PATH, index_col=0)

CPU times: user 7.69 ms, sys: 0 ns, total: 7.69 ms
Wall time: 6.53 ms


In [23]:
label_df.head()

|  | sra_id | experiment | genotype | treatment | rep | NSFTV_ID |
|---|---|---|---|---|---|---|
| 0 | SRX2776453 | GSM2596760 | 9 | salt-stress | rep2 | NSFTV_9 |
| 1 | SRX2776452 | GSM2596759 | 9 | salt-stress | rep1 | NSFTV_9 |
| 2 | SRX2776451 | GSM2596758 | 9 | control | rep2 | NSFTV_9 |
| 3 | SRX2776450 | GSM2596757 | 9 | control | rep1 | NSFTV_9 |
| 4 | SRX2776449 | GSM2596756 | 91 | salt-stress | rep2 | NSFTV_91 |


***Ensure sample indexes overlap***

In [24]:
label_df = label_df.set_index("sra_id", drop=True)
label_df.head(2)

| sra_id | experiment | genotype | treatment | rep | NSFTV_ID |
|---|---|---|---|---|---|
| SRX2776453 | GSM2596760 | 9 | salt-stress | rep2 | NSFTV_9 |
| SRX2776452 | GSM2596759 | 9 | salt-stress | rep1 | NSFTV_9 |


Check that both tables contain the same number of samples, and that the intersection of their sample identifiers has the same length.

In [27]:
assert len(count_df.columns) == len(label_df.index) == len(label_df.index.intersection(count_df.columns))
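
If the assertion above fails, it helps to see which sample identifiers are unmatched. The sketch below uses small made-up indexes (the `S1` to `S4` names are hypothetical stand-ins for `count_df.columns` and `label_df.index`) and the pandas `Index.difference` and `Index.intersection` methods to list the samples found in only one table and those found in both.

```python
import pandas as pd

# Hypothetical sample identifiers standing in for count_df.columns and label_df.index.
count_samples = pd.Index(["S1", "S2", "S3"])
label_samples = pd.Index(["S2", "S3", "S4"])

# Samples with counts but no annotation, and annotations with no counts.
only_in_counts = count_samples.difference(label_samples)
only_in_labels = label_samples.difference(count_samples)
shared = count_samples.intersection(label_samples)

print("Only in counts:", list(only_in_counts))
print("Only in labels:", list(only_in_labels))
print("Shared:", list(shared))
```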

### Combine the dataframes into an `AnnotatedGEM`

`AnnotatedGEM.from_pandas` wrangles the two dataframes and loads them into a single `xarray.Dataset`, in which the counts and every annotation column share the `Sample` dimension.

In [28]:
agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Rice")
agem

<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 368

***Examine the data***

In [29]:
agem.data

<xarray.Dataset>
Dimensions:     (Gene: 55986, Sample: 368)
Coordinates:
  * Sample      (Sample) object 'SRX2776086' 'SRX2776087' ... 'SRX2776453'
  * Gene        (Gene) object 'LOC_Os01g01010' ... 'ChrUn.fgenesh.gene.66'
Data variables:
    experiment  (Sample) object 'GSM2596381' 'GSM2596382' ... 'GSM2596760'
    genotype    (Sample) int64 101 101 101 101 105 105 105 ... 91 91 91 9 9 9 9
    treatment   (Sample) object 'control' 'control' ... 'salt-stress'
    rep         (Sample) object 'rep1' 'rep2' 'rep1' ... 'rep2' 'rep1' 'rep2'
    NSFTV_ID    (Sample) object 'NSFTV_101' 'NSFTV_101' ... 'NSFTV_9' 'NSFTV_9'
    counts      (Sample, Gene) float64 8.477 1.464 5.351 12.91 ... 0.0 0.0 0.0
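
The `counts` variable above has dimensions `(Sample, Gene)`, whereas the input text file had genes as rows. A rough sketch of comparable reshaping in plain pandas follows. It is an assumption about the kind of wrangling involved, not GSForge's actual implementation, and the toy values and names are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: genes as rows in the counts, one row per sample in the labels.
toy_counts = pd.DataFrame(
    {"S2": [0.0, 3.1], "S1": [1.0, 4.2]},
    index=["gene_a", "gene_b"],
)
toy_labels = pd.DataFrame(
    {"treatment": ["control", "salt-stress"]},
    index=["S1", "S2"],
)

# Transpose so that samples are rows and genes are columns, then sort by sample.
counts_by_sample = toy_counts.T.sort_index()
counts_by_sample.index.name = "Sample"
counts_by_sample.columns.name = "Gene"

# Put the labels in the same sample order so that rows line up one-to-one.
labels_by_sample = toy_labels.loc[counts_by_sample.index]

print(counts_by_sample)
print(labels_by_sample)
```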

### Save the `AnnotatedGEM`

In [33]:
agem.save(AGEM_PATH)

PosixPath('/home/tyler/GSForge_demo_data/osfstorage/rice.nc')

---