Creating an `AnnotatedGEM` involves wrangling data from (at least) two sources:

+ The gene expression (or count) matrix.
+ Annotation or label files.

You will need to provide a gene expression matrix (GEM) that has the same sample index as the annotation data.

This notebook also covers adding different normalizations and adding gene lengths.

---

**On Sourcing a Gene Expression Matrix and Annotations**

How exactly to get from an RNA-Sequencing experiment to a count matrix is beyond the scope of this tutorial.
See the [GEMMaker package](https://gemmaker.readthedocs.io/en/latest/) for a workflow made by the [Ficklin Laboratory](http://ficklinlab.cahnrs.wsu.edu/).

---

***Setting up the notebook:***

In [1]:
import pandas as pd
import GEMprospector as gp

---

## The `GEMprospector.AnnotatedGEM` Object

An `AnnotatedGEM` contains the gene expression matrix, which is indexed by the 'Gene'
and 'Sample' coordinates. Its `xarray.Dataset` object also contains
(but is not limited to) phenotype information.

As with any Python object, we can view its docstring with `help(object)`, or with `object?` in IPython and Jupyter:

```python
object?
# Or:
help(object)
```

In [2]:
gp.AnnotatedGEM?

## Creating an `AnnotatedGEM`

For basic creation you can pass one of the following to create an `AnnotatedGEM`:
+ a filepath to a netcdf file -- usually a previously created (and saved) `AnnotatedGEM`.
+ an xarray dataset -- usually user created.
+ a pair of `pandas.DataFrame` objects (counts and labels).

Essentially we need to create an `xarray.Dataset` object that contains all the data we care about.
For most users the simplest approach will be to prepare two `pandas.DataFrame` objects that have the samples
either aligned or appropriately indexed.
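
For readers curious about the target structure, here is a minimal sketch of building such a dataset by hand from two toy dataframes. The variable and dimension names mirror the ones used in this notebook, but the toy data and the construction itself are assumptions for illustration; this is not necessarily how `GEMprospector` assembles its dataset internally.

```python
import pandas as pd
import xarray as xr

# Toy stand-ins: counts are genes x samples, labels are samples x variables.
count_df = pd.DataFrame(
    {"S1": [20, 0], "S2": [2, 5]},
    index=["geneA", "geneB"],
)
label_df = pd.DataFrame(
    {"Treatment": ["CONTROL", "HEAT"]},
    index=["S1", "S2"],
)

# Put the labels in the same sample order as the count matrix columns.
labels_in_order = label_df.loc[count_df.columns]

ds = xr.Dataset(
    data_vars={
        # Transpose so that the first dimension is 'Sample'.
        "counts": (("Sample", "Gene"), count_df.T.values),
        "Treatment": (("Sample",), labels_in_order["Treatment"].values),
    },
    coords={
        "Sample": count_df.columns.values,
        "Gene": count_df.index.values,
    },
)

print(ds)
```

Both variables share the 'Sample' coordinate, which is what lets expression values and labels be selected together.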

`GEMprospector` comes with some helper functions to assist in this process. 

***Declaring paths used:***

In [3]:
COUNTS = "~/GEMprospector_demo_data/rice_heat_drought.GEM.raw.txt"
LABELS = "~/GEMprospector_demo_data/srx_sample_annots.txt"
GFF3 = "~/GEMprospector_demo_data/all.gff3"

## Creating an `AnnotatedGEM` from `pandas.DataFrame`

Presented here is a typical walkthrough of creating an `AnnotatedGEM` from several raw data files.
If the indexes used in your files differ, you should decide how they will be aligned
before starting.
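
One possible plan, sketched here with plain `pandas` on toy data, is to keep only the samples found in both tables and to report what gets dropped. The sample names and values below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: counts are genes x samples, labels are samples x variables.
count_df = pd.DataFrame(
    {"S1": [20, 0], "S2": [2, 0], "S3": [22, 1]},
    index=["geneA", "geneB"],
)
label_df = pd.DataFrame(
    {"Treatment": ["HEAT", "CONTROL", "DROUGHT"]},
    index=["S2", "S1", "S4"],
)

# Samples present in both tables, kept in the count matrix's column order.
shared = count_df.columns[count_df.columns.isin(label_df.index)]

# Samples that appear in only one of the two tables.
only_counts = count_df.columns.difference(label_df.index)
only_labels = label_df.index.difference(count_df.columns)
print("Dropped from counts:", list(only_counts))
print("Dropped from labels:", list(only_labels))

# Subset and reorder both tables so the samples line up one-to-one.
aligned_counts = count_df[shared]
aligned_labels = label_df.loc[shared]
```

Whether dropping unmatched samples is acceptable depends on the experiment; the point is to make the choice explicit rather than discover a mismatch later.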

***Load the GEM data:***

First we will need to read in the count matrix, and any annotation files we wish to work with.
In the case of GEM files from the GEMMaker software, we can expect them to be tab-delimited.

In [4]:
# Test loading the data.
pd.read_csv(COUNTS, sep="\t", index_col=0).head()

Unnamed: 0,SRX1423934,SRX1423935,SRX1423936,SRX1423937,SRX1423938,SRX1423939,SRX1423940,SRX1423941,SRX1423942,SRX1423943,...,SRX1424399,SRX1424400,SRX1424401,SRX1424402,SRX1424403,SRX1424404,SRX1424405,SRX1424406,SRX1424407,SRX1424408
LOC_Os06g05820,20,2,22,11,23,39,24,34,33,20,...,5,20,20,38,35,43,25,8,8,21
LOC_Os10g27460,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LOC_Os02g35980,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LOC_Os09g23260,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LOC_Os01g41670,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0


In [5]:
%%time
count_df = pd.read_csv(COUNTS, sep="\t", index_col=0)


CPU times: user 2.2 s, sys: 169 ms, total: 2.37 s
Wall time: 2.38 s


***Load the label data:***

*In this case we have previously wrangled this annotation file, which was sourced from
[https://doi.org/10.1105/tpc.16.00158](https://doi.org/10.1105/tpc.16.00158).*

In [6]:
pd.read_csv(LABELS, index_col=0).head()

Sample,SampleSRR,Treatment,Time,Tissue,Genotype,Subspecies
SRX1423934,SRR2931040,CONTROL,15,leaf,AZ,Japonica
SRX1423935,SRR2931041,CONTROL,15,leaf,AZ,Japonica
SRX1423936,SRR2931042,CONTROL,30,leaf,AZ,Japonica
SRX1423937,SRR2931043,CONTROL,30,leaf,AZ,Japonica
SRX1423938,SRR2931044,CONTROL,45,leaf,AZ,Japonica


In [7]:
%%time
label_df = pd.read_csv(LABELS, index_col=0)

CPU times: user 5.23 ms, sys: 215 µs, total: 5.44 ms
Wall time: 5.11 ms


***Combine the dataframes into an AnnotatedGEM:***

`AnnotatedGEM.from_pandas` does a bit of data wrangling and loads the data into a single `xarray.Dataset`.
As with `pandas.read_csv()`, it can be comforting to do one last check that our data has made it in correctly.


In [8]:
# View the xarray data attribute without assigning the AnnotatedGEM object.
gp.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Osativa").data

<xarray.Dataset>
Dimensions:     (Gene: 55986, Sample: 475)
Coordinates:
  * Sample      (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
  * Gene        (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
Data variables:
    SampleSRR   (Sample) object 'SRR2931040' 'SRR2931041' ... 'SRR2931514'
    Treatment   (Sample) object 'CONTROL' 'CONTROL' ... 'RECOV_DROUGHT'
    Time        (Sample) int64 15 15 30 30 45 45 60 ... 240 240 270 270 300 300
    Tissue      (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf' 'leaf'
    Genotype    (Sample) object 'AZ' 'AZ' 'AZ' 'AZ' 'AZ' ... 'TD' 'TD' 'TD' 'TD'
    Subspecies  (Sample) object 'Japonica' 'Japonica' ... 'Indica' 'Indica'
    counts      (Sample, Gene) int64 20 0 0 0 0 0 200 ... 19 0 52 335 0 666 0
Attributes:
    all_labels:  ['SampleSRR', 'Treatment', 'Time', 'Tissue', 'Genotype', 'Su...
    discrete:    ['SampleSRR', 'Treatment', 'Tissue', 'Genotype', 'Subspecies']
    continuous:  ['SampleSRR', 'Time']
    quantile:   

In [9]:
# Now view GEMprospector's summary and check that the sample and gene counts are correct.
gp.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Osativa")

<GEMprospector.AnnotatedGEM>
Name: Osativa
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

## Creating an `AnnotatedGEM` from files

If you are fortunate enough to have consistently formatted data (like the above example),
you can load your files directly into an `AnnotatedGEM`.

If you do not provide a `sep` argument in the `count_kwargs` or `label_kwargs` dictionaries, 
*GEMprospector* will attempt to infer it by reading the first line of each file.
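
To give a feel for what inferring a separator from a first line can look like, here is a small heuristic sketch. The function name `infer_separator` and its candidate list are hypothetical, and this is not GEMprospector's actual implementation; it simply picks the candidate character that occurs most often in the header line.

```python
import io

import pandas as pd


def infer_separator(first_line, candidates=("\t", ",", ";")):
    """Guess a delimiter by counting candidate characters in one line."""
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("No candidate separator found in the first line.")
    return best


# Toy tab-delimited text standing in for a count matrix file.
toy_text = "\tS1\tS2\ngeneA\t20\t2\ngeneB\t0\t5\n"

handle = io.StringIO(toy_text)
sep = infer_separator(handle.readline())
handle.seek(0)  # Rewind so that pandas reads the file from the start.

toy_counts = pd.read_csv(handle, sep=sep, index_col=0)
print(repr(sep), toy_counts.shape)
```

A heuristic like this can be fooled, for example by commas inside a tab-delimited header, so passing `sep` explicitly remains the safer choice when the format is known.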

In [10]:
%%time
agem = gp.AnnotatedGEM.from_files(
    count_path=COUNTS,
    label_path=LABELS,
    # These are the defaults: `from_files` forwards each dictionary
    # to its own call to `pandas.read_csv`.
    count_kwargs=dict(index_col=0),
    label_kwargs=dict(index_col=0),
)

CPU times: user 6.9 s, sys: 1.42 s, total: 8.32 s
Wall time: 8.35 s


In [11]:
agem

<GEMprospector.AnnotatedGEM>
Name: AnnotatedGEM00002
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

---

## Adding Gene Annotations

Storing our GEM this way also lets us store any gene-indexed properties.
An example of this is gene length:

In [12]:
pd.read_csv(GFF3, sep="\t", comment="#",
             names=['seqname', 'source', 'feature', 'start', 'end',
                    'score', 'strand', 'frame', 'attribute']).head(2)

Unnamed: 0,seqname,source,feature,start,end,score,strand,frame,attribute
0,Chr1,MSU_osa1r7,gene,2903,10817,.,+,.,ID=LOC_Os01g01010;Name=LOC_Os01g01010;Note=TBC...
1,Chr1,MSU_osa1r7,mRNA,2903,10817,.,+,.,ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1;Pare...


In [13]:
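def extract_gff3_gene_lengths(gff3_file):
    """A custom function to extract gene lengths, indexed by 'Gene'."""
    df = pd.read_csv(gff3_file, sep="\t", comment="#",
                     names=['seqname', 'source', 'feature', 'start', 'end',
                            'score', 'strand', 'frame', 'attribute'])
    # Pull the leading identifier from the attribute column, e.g. "LOC_Os01g01010".
    gene_ids = df["attribute"].str.extract(r"ID=(\w+)", expand=False)
    # Keep rows with an ID; copy so that adding a column does not raise a warning.
    df = df[gene_ids.notna()].copy()
    df["Gene"] = gene_ids
    # Keep the first row per ID (the gene row, when it is listed before its sub-features).
    df = df.drop_duplicates("Gene").set_index("Gene")
    # GFF3 coordinates are 1-based and inclusive, hence the + 1.
    return df["end"] - df["start"] + 1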

In [14]:
gene_lengths = extract_gff3_gene_lengths(GFF3)
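
As a small worked example on an in-memory snippet, the sketch below restricts the table to rows whose feature is `gene` before computing lengths. The two records are abbreviated from the preview above (their attribute strings are shortened), and filtering on the feature column is an alternative shown for illustration. GFF3 coordinates are 1-based and inclusive, so a feature spans `end - start + 1` bases.

```python
import io

import pandas as pd

GFF3_COLUMNS = ["seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute"]

# Two abbreviated records: a gene and its mRNA sub-feature.
toy_gff3 = (
    "##gff-version 3\n"
    "Chr1\tMSU_osa1r7\tgene\t2903\t10817\t.\t+\t.\t"
    "ID=LOC_Os01g01010;Name=LOC_Os01g01010\n"
    "Chr1\tMSU_osa1r7\tmRNA\t2903\t10817\t.\t+\t.\t"
    "ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1\n"
)

df = pd.read_csv(io.StringIO(toy_gff3), sep="\t", comment="#",
                 names=GFF3_COLUMNS)

# Keep gene rows only, then take everything between "ID=" and the next ";".
genes = df[df["feature"] == "gene"].copy()
genes["Gene"] = genes["attribute"].str.extract(r"ID=([^;]+)", expand=False)
genes = genes.set_index("Gene")

# Inclusive coordinates: 10817 - 2903 + 1 bases.
toy_lengths = genes["end"] - genes["start"] + 1
print(toy_lengths)
```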

Because `gene_lengths` is already (hopefully) indexed correctly, it is trivial to incorporate into our `AnnotatedGEM`.

In [15]:
agem.data["lengths"] = gene_lengths
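
Since the word "hopefully" is doing some work above, a quick coverage check can help. The sketch below uses toy gene names and plain `pandas` to list which GEM genes lack a length and which lengths have no matching gene; the names `gem_genes`, `missing` and `extra` are made up for this example.

```python
import pandas as pd

# Toy stand-ins for the GEM's gene index and the extracted lengths.
gem_genes = pd.Index(["geneA", "geneB", "geneC"], name="Gene")
gene_lengths = pd.Series({"geneA": 7915, "geneC": 1200, "geneZ": 300})

# Reindex onto the GEM's genes; genes without a length become NaN.
aligned = gene_lengths.reindex(gem_genes)

missing = aligned[aligned.isna()].index.tolist()
extra = gene_lengths.index.difference(gem_genes).tolist()

print("GEM genes without a length:", missing)
print("Lengths without a GEM gene:", extra)
```

With the real data, the GEM's gene index can be taken from the count dataframe (`count_df.index`).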

---

## Saving an `AnnotatedGEM`

All the data held by an `AnnotatedGEM` lives in its `xarray.Dataset` object, so saving the `AnnotatedGEM` writes that dataset to a file that can be loaded again later.

In [17]:
SAVE = False
if SAVE:
    agem.save("~/GEMprospector_demo_data/osativa.nc")