This notebook performs gene set enrichment analysis of an EcoTyper results dataset, using `eco_helper enrich`.

## Input Section

This section defines the basic setup of the analysis. The cells that follow it can be run in bulk and usually do not need to be changed.

### Variables to set
- config (optional)
  - The path to the EcoTyper config YAML file used for the experiments to summarize (if the same config was used for all of them). The project folder can be specified as `{parent}` and the scripts folder within it as `{scripts}`. If the config is in neither of these, an absolute path is required.
- directory
  - The name of the EcoTyper results directory. This needs to be either an absolute path or a subdirectory of the results directory within the project directory (parent directory). The results directory can be specified using `{results}`.
- outdir 
  - The directory to save the outputs in.
- directory_is_absolute
  - Set this to *True* to mark the provided directory not as a subdirectory of the results directory but rather as a raw source directory containing eco_helper enrich results files.
- perform_enrichment (optional)
  - If no pathway enrichment has been performed yet, `eco_helper enrich` is called automatically irrespective of this variable. However, if enrichment has already been performed and data is found, this will force re-computation of the enrichment.

In [1]:
config = None
directory = "{results}/some_data_folder_here..."
outdir = "{parent}/gsea_results"

directory_is_absolute = False
perform_enrichment = False
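
As an illustration of how the `{parent}` and `{results}` placeholders described above are filled in, here is a minimal sketch using `str.format`, which is also what the notebook uses further down. The project root `/path/to/project` and the folder name `my_run` are made up for this example.

```python
# Hypothetical project root, used only for illustration
parent = "/path/to/project"
results = f"{parent}/results"

# Placeholders are expanded with str.format; keyword arguments that do not
# appear in the string are simply ignored
directory = "{results}/my_run".format(parent=parent, results=results)
outdir = "{parent}/gsea_results".format(parent=parent)

print(directory)  # /path/to/project/results/my_run
print(outdir)     # /path/to/project/gsea_results
```

A path without placeholders passes through unchanged, so absolute paths also work. Paths that contain literal curly braces would need those braces doubled.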

The variables below are only required if `eco_helper enrich` is called.

### Variables to set

- gene_sets
  - The gene sets to query.
- perform_enrichr
  - Set to False if `gseapy enrichr` should not be performed. This will also enable or disable loading of existing results.
- perform_prerank
  - Set to False if `prerank` should not be performed. This will also enable or disable loading of existing results.
- only_ecotype_contributing
  - Set to True to only perform gene set enrichment on celltypes and states that contribute to assigned Ecotypes.

In [2]:
gene_sets = [   "Reactome_2016", 
                "WikiPathway_2021_Human", 
                "Panther_2016", 
                "KEGG_2021_Human",
                "GO_Biological_Process_2021", 
                "GO_Molecular_Function_2021", 
                "GO_Cellular_Component_2021"  ]

perform_enrichr = True
perform_prerank = False
only_ecotype_contributing = True
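
Gene set names are easy to mistype, so it can help to compare them against a list of known names before starting a long run. The helper below is a hypothetical sketch and not part of `eco_helper`. It assumes the names are Enrichr library names, in which case the list of available names could be obtained from `gseapy.get_library_name()` (this needs network access and is not called here).

```python
def check_gene_sets(requested, available):
    """Return the requested gene set names that are not in `available`."""
    known = set(available)
    return [name for name in requested if name not in known]

# Stand-in for the list of known library names (assumption: with network
# access this could come from gseapy.get_library_name() instead)
available = ["Reactome_2016", "KEGG_2021_Human", "Panther_2016"]
requested = ["Reactome_2016", "KEGG_2021_human"]  # note the lower-case typo

missing = check_gene_sets(requested, available)
print(missing)  # ['KEGG_2021_human']
```

The comparison is case-sensitive on purpose, so that a name differing only in capitalisation is reported rather than silently accepted.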

---
<br>

> No need to alter anything in this section...

Now import necessary packages

In [3]:
import eco_helper as eh
import eco_helper.enrich as enrich
import eco_helper.enrich.visualise as visualise

from qpcr._auxiliary.graphical import make_layout_from_list

import os, glob
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Other project directories (need not necessarily be edited)

In [4]:
if not directory:
    raise ValueError("Please specify a directory.")

# the parent directory of the EcoTyper project
parent = "/data/users/noahkleinschmidt/EcoTyper"

# script and results directories within the parent
scripts = f"{parent}/scripts"
results = f"{parent}/results"
enrichments_dir = f"{parent}/gsea_enrichment"

outdir = outdir.format( parent = parent )
directory = directory.format( parent = parent, results = results )
data_dir = f"{enrichments_dir}/{ os.path.basename( directory ) }" if not directory_is_absolute else directory

if config:
    config = config.format( parent = parent, scripts = scripts )
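
The enrichment data directory is derived from the last path component of `directory` via `os.path.basename`. One detail worth knowing: if the path ends with a separator, `os.path.basename` returns an empty string. The sketch below shows this with a made-up path; the `run_name` helper is hypothetical and only illustrates one way to guard against it.

```python
import os

def run_name(directory):
    """Name of the run folder, tolerant of a trailing path separator."""
    return os.path.basename(os.path.normpath(directory))

print(os.path.basename("/project/results/my_run"))         # my_run
print(repr(os.path.basename("/project/results/my_run/")))  # ''
print(run_name("/project/results/my_run/"))                # my_run
```

Leaving the trailing slash off `directory` in the input section avoids the issue without any code change.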

Make sure we have a valid output directory and (re)-perform enrichment if necessary.

In [5]:
# create the output directories if they do not exist yet
os.makedirs( outdir, exist_ok = True )
os.makedirs( enrichments_dir, exist_ok = True )

if perform_enrichment or os.path.basename( directory ) not in os.listdir( enrichments_dir ):
    
    print( "Calling eco_helper enrich")
    def run_enrich( with_enrichr, with_prerank, ecotypes_only ):
        """
        A small helper function to run eco_helper enrich
        """
        enrichr = "--enrichr" if with_enrichr else ""
        prerank = "--prerank" if with_prerank else ""
        ecotypes = "--ecotypes" if ecotypes_only else ""

        cmd = f"eco_helper enrich {enrichr} {prerank} \
                {ecotypes} \
                --assemble \
                --gene_sets { ' '.join( gene_sets ) } \
                --output {data_dir} \
                {directory}"

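        # stop here if the command failed, rather than failing later when loading results
        if os.system( cmd ) != 0:
            raise RuntimeError( "eco_helper enrich exited with a non-zero status." )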
    
    run_enrich( with_enrichr = perform_enrichr, with_prerank = perform_prerank, ecotypes_only = only_ecotype_contributing )
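
The cell above passes the command to the shell as a single string, which can break if a path contains spaces. As an alternative, the same call could be assembled as a list of arguments. The sketch below only builds and prints the command; the function name `build_enrich_command` and the example paths are made up, while the flags are the ones used in the cell above.

```python
import shlex

def build_enrich_command(directory, output, gene_sets,
                         with_enrichr=True, with_prerank=False, ecotypes_only=True):
    """Assemble the eco_helper enrich call as a list of arguments."""
    cmd = ["eco_helper", "enrich"]
    if with_enrichr:
        cmd.append("--enrichr")
    if with_prerank:
        cmd.append("--prerank")
    if ecotypes_only:
        cmd.append("--ecotypes")
    cmd += ["--assemble", "--gene_sets", *gene_sets, "--output", output, directory]
    return cmd

cmd = build_enrich_command(
    "/project/results/my run",            # hypothetical path containing a space
    "/project/gsea_enrichment/my run",    # hypothetical output path
    ["Reactome_2016", "KEGG_2021_Human"],
)

# shlex.join quotes each argument so the command can be inspected safely
print(shlex.join(cmd))
```

Such a list could then be handed to `subprocess.run(cmd, check=True)`, which skips the shell entirely and raises an error if the command fails.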

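Now load the enrichment results for all celltypes or ecotypes into up to two `EnrichmentCollection` objects, one for the `prerank` results and one for the `enrichr` results. Each is only loaded if the corresponding `perform_prerank` or `perform_enrichr` variable is set; otherwise it stays `None`.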

In [6]:
resolution = "ecotype" if only_ecotype_contributing else "celltype"

prerank = None
if perform_prerank:
    prerank = eh.enrich.EnrichmentCollection( data_dir, resolution = resolution, which = "prerank" )

enrichr = None
if perform_enrichr:
    enrichr = eh.enrich.EnrichmentCollection( data_dir, resolution = resolution, which = "enrichr" )
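
Since either collection may be `None`, later analysis code has to check which results are actually present. One possible pattern, shown as a sketch with a hypothetical helper `available_collections` and plain stand-in objects in place of real collections, is to gather the loaded ones in a dictionary:

```python
def available_collections(**collections):
    """Keep only the collections that were actually loaded (not None)."""
    return {name: coll for name, coll in collections.items() if coll is not None}

# Stand-ins for illustration; in the notebook one would pass
# prerank = prerank, enrichr = enrichr instead
loaded = available_collections(prerank=None, enrichr=object())

for name in loaded:
    print(f"{name} results are available")
```

Looping over the dictionary then applies the same analysis to whichever results exist, without separate `if` checks for each method.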

---

## Analysis Section

From here on the user may perform their own analyses on the data.