# Creating DESeq2 GeneSets

*This notebook covers how to run and load a basic `DESeq2` DEG result as a `GSForge.GeneSet`.*

---

#### Notebook Preparation

***Declare used paths***

In [None]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path

In [None]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
DEG_COLL_PATH = OSF_PATH.joinpath("osfstorage", "DEG_gene_sets")
assert AGEM_PATH.exists()

***Import Python packages***

In [None]:
import GSForge as gsf

***R integration setup***

In [None]:
import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

***Import R Packages***

In [None]:
%%R
library("DESeq2")

***Loading an AnnotatedGEM***

In [None]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem

### Prepare input data for DESeq2

Preparing the input for `DESeq2` here means dropping genes that have counts of zero, which `count_mask="dropped"` selects.

In [None]:
dropped_counts, labels = gsf.get_data(agem, 
                                      count_mask="dropped",
                                      annotation_variables=["Treatment"])
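To make the masking step concrete, here is a minimal sketch of dropping zero-count genes from a toy pandas count matrix. It does not call `GSForge`; the gene and sample names are hypothetical, and the rule used (drop a gene if it has a zero count in any sample) is an assumption about what `count_mask="dropped"` does, not a statement of its implementation.

```python
import pandas as pd

# Toy count matrix: rows are samples, columns are genes (hypothetical names).
counts = pd.DataFrame(
    {
        "gene_a": [5, 3, 8],
        "gene_b": [0, 2, 4],
        "gene_c": [0, 0, 0],
        "gene_d": [1, 7, 2],
    },
    index=["sample_1", "sample_2", "sample_3"],
)

# Assumption: a gene is dropped if any of its counts is zero.
keep = (counts != 0).all(axis=0)
dropped = counts.loc[:, keep]

kept_genes = list(dropped.columns)
print(kept_genes)
```

A looser rule, such as dropping only genes that are zero in every sample, would replace `.all(axis=0)` with `.any(axis=0)`.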

These counts were estimated with Kallisto and are not integers, so we must round them before use in `DESeq2`.

In [None]:
dropped_counts

***Round counts to integers***

In [None]:
ri_dropped_counts = gsf.utils.R_interface.Py_counts_to_R(dropped_counts)
ri_dropped_counts = ri_dropped_counts.round()

ri_labels = labels.to_dataframe()

In [None]:
ri_dropped_counts.head(2)

In [None]:
ri_labels.head(2)
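The rounding step can be illustrated on its own with a small pandas example. This is only a sketch with made-up values and hypothetical gene and sample names; the genes-by-samples orientation is an assumption for illustration. Note that pandas rounds halves to the nearest even integer, so `2.5` becomes `2` while `3.5` becomes `4`.

```python
import pandas as pd

# Made-up fractional estimated counts (genes as rows, samples as columns).
est_counts = pd.DataFrame(
    {"sample_1": [10.4, 0.6, 3.5], "sample_2": [7.9, 2.5, 1.2]},
    index=["gene_a", "gene_b", "gene_c"],
)

# Round to the nearest whole number, then cast to an integer dtype.
rounded = est_counts.round().astype(int)
print(rounded)
```

Casting to an integer dtype after rounding makes it explicit that the matrix now holds whole counts.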

### `DESeq2` Runs

In [None]:
%%R -i ri_dropped_counts -i ri_labels -o deseq_df

dds <- DESeqDataSetFromMatrix(countData = ri_dropped_counts,
                              colData = ri_labels,
                              design = ~ Treatment)
dds <- DESeq(dds)
deseq_results <- results(dds)
deseq_df <- data.frame(deseq_results)

In [None]:
deseq_df.head()

In [None]:
deseq2_treatment = gsf.GeneSet(deseq_df, 
                               name="deseq2_treatment", 
                               attrs={"DESeq2_formula": "~ Treatment"})
deseq2_treatment

In [None]:
deseq2_treatment.data

### Define Helper Functions

Some functions to help assign support to this `GeneSet`.

In [None]:
def pvalue_filter(deseq_result_df, cutoff=0.05):
    """Returns an index of genes whose adjusted p-values ("padj") are below the specified cutoff."""
    return deseq_result_df[deseq_result_df["padj"] < cutoff].index

def top_n_abs(dataframe, n=10, col="log2FoldChange", padj_cutoff=0.05):
    """Returns the top n most (absolutely) differentially expressed genes from a DESeq2 result.
    Genes are first filtered by adjusted p-value, then ranked by the absolute value of `col`."""
    filtered_df = dataframe[dataframe["padj"] < padj_cutoff]
    # Sort ascending by absolute value so that the largest changes are last.
    filtered_df = filtered_df.reindex(filtered_df[col].abs().sort_values().index)
    return filtered_df.tail(n).index
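The filtering and ranking logic can be checked on a hand-made table before applying it to real results. The sketch below uses a toy DataFrame with the same `padj` and `log2FoldChange` column names as the `DESeq2` output; the values and gene names are invented. It also includes one missing `padj`, since `DESeq2` can report `NA` for some genes, and a comparison against `NaN` in pandas evaluates to `False`, so such genes are excluded.

```python
import numpy as np
import pandas as pd

# Invented values that mimic the relevant columns of a DESeq2 result table.
toy = pd.DataFrame(
    {
        "log2FoldChange": [2.0, -3.5, 0.4, 5.0, -1.2],
        "padj": [0.01, 0.001, 0.20, np.nan, 0.04],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"],
)

# Keep genes whose adjusted p-value is below the cutoff; NaN never passes.
significant = toy[toy["padj"] < 0.05]
significant_genes = list(significant.index)

# Rank the remaining genes by absolute fold change, largest first.
ranked = significant["log2FoldChange"].abs().sort_values(ascending=False)
top_two = list(ranked.index[:2])

print(significant_genes, top_two)
```

Here `gene_d` has the largest fold change but is dropped because its `padj` is missing. This sketch sorts in descending order for readability, whereas the helper above sorts ascending and takes the tail.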

In [None]:
cutoff = 0.05
gene_count = len(pvalue_filter(deseq_df, cutoff=cutoff))

print(f"{gene_count} genes below adjusted p-value threshold of: {cutoff}")

In [None]:
deseq2_treatment.set_support_by_genes(pvalue_filter(deseq_df, cutoff=cutoff))
deseq2_treatment

In [None]:
deseq2_treatment.save_as_netcdf(DEG_COLL_PATH)