# Class : Networks and enrichment

---
## Before Class
1. Review slides on GSEA

---
## Learning Objectives
1. Gene Ontology and Pathways
2. Gene set enrichment analysis


---
## Pathway enrichment analysis

Often we identify large sets of features (for example, proteins, metabolites, transcribed genes, or open chromatin regions) and would like to know whether they are enriched for known pathways. A common way of testing this is enrichment analysis. Today we will use a common enrichment analysis method called GSEA (Gene Set Enrichment Analysis). The goal of this method is to determine whether the genes in a given gene set cluster toward one end of a list of genes ranked by our measurements.

```
GSEA:
    Rank genes by expression (highest first)
    Compute cumulative sum over ranked genes, starting at 0:
        +1/(gene set size) when gene is in set
        -1/(number of genes not in set) otherwise
    Enrichment = maximum deviation from zero
```
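The running sum above can be sketched on toy data as follows. This is a minimal, unweighted sketch and not necessarily the intended solution to the exercise: the name `running_enrichment`, the toy gene names, and the toy expression values are all made up for illustration, and ranking from highest to lowest expression is an assumption.

```python
def running_enrichment(gene_set, expression):
    """Return (enrichment, scores) for one gene set using an unweighted running sum."""
    # Only genes that were actually measured can contribute a hit.
    members = set(gene_set) & set(expression)
    n_hit = len(members)
    n_miss = len(expression) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the measured genes without covering all of them")

    # Rank genes from highest to lowest expression (values may arrive as strings).
    ranked = sorted(expression, key=lambda gene: float(expression[gene]), reverse=True)

    scores = [0.0]
    for gene in ranked:
        step = 1 / n_hit if gene in members else -1 / n_miss
        scores.append(scores[-1] + step)

    # Signed value with the largest distance from zero.
    enrichment = max(scores, key=abs)
    return enrichment, scores


# Toy data (hypothetical): six genes, three of them in the gene set.
toy_expression = {"A": 9.0, "B": 7.5, "C": 6.0, "D": 4.0, "E": 2.5, "F": 1.0}
toy_enrichment, toy_scores = running_enrichment(["A", "B", "E"], toy_expression)
print(toy_enrichment)
```

Because the hit and miss steps are normalized by their own counts, the running sum always returns to zero (up to floating-point error) after the last gene.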



In [None]:
# These functions read the provided gene expression file and gene set file with pandas and format them for the functions we are writing

def read_gene_sets(filename):
    """ Function to read in gene set file
    
    Args:
        filename (str): file to be read
        
    Returns:
        gene_sets (dict of lists): dictionary with keys of gene set names and values containing list of gene names
    
    Example:
    >>> gene_sets = read_gene_sets("data/temp.gmt")
    >>> gene_sets['demo1']  #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    ['ATXN1', 'UBQLN4', 'CALM1', 'DLG4', 'MRE11A', 'CTNNB1', 'YWHAG', ...]
    """    


def read_expression(filename):
    """ Function to read in expression file
    
    Args:
        filename (str): file to be read
        
    Returns:
        expression (dict): dictionary with keys of gene names and values of expression

    Example:
    >>> expression = read_expression("data/temp.txt")
    >>> expression  #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    {'ATXN1': '16.4567529278529', 'UBQLN4': '13.9894927152905', 'CALM1': '13.7455333730743', ...}
    """  
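As a rough illustration of the two input formats, here is a sketch that parses them from in-memory text. It assumes the gene set file follows the tab-delimited GMT layout (set name, description, then one gene per column) and that the expression file has two whitespace-separated columns (gene, value) with no header; the layout of `data/temp.txt` is not shown here, so the second part is a guess. The names `parse_gmt` and `parse_expression` and the toy contents are hypothetical, and the sketch uses plain Python rather than pandas.

```python
import io


def parse_gmt(handle):
    """Parse GMT-style lines into {set_name: [gene, ...]}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


def parse_expression(handle):
    """Parse 'gene value' lines into {gene: value}, keeping values as strings."""
    expression = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gene, value = fields
        expression[gene] = value
    return expression


# Toy file contents (made up for illustration).
toy_gmt = io.StringIO("toy_set\ttoy description\tGENE_A\tGENE_B\tGENE_C\n")
toy_txt = io.StringIO("GENE_A\t9.0\nGENE_B\t7.5\nGENE_D\t4.0\n")

toy_sets = parse_gmt(toy_gmt)
toy_expression = parse_expression(toy_txt)
print(toy_sets)
print(toy_expression)
```

Values stay as strings here to mirror the doctest output above; convert with `float()` before ranking.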

    

In [None]:
#Read in our data
gene_sets = read_gene_sets("data/temp.gmt")
expression = read_expression("data/temp.txt")

In [None]:
def GSEA(gene_set, expression):
    """ Function to perform GSEA testing on a gene set and expression data set
    
    Args:
        gene_set (list): list of gene names in a gene set
        expression (dict): dictionary with keys of gene names and values of expression
        
    Returns:
        enrichment (float): enrichment score for a given gene set
        score (list): list of scores as sorted genes are compared to gene set
        
    Example:
    >>> gene_sets = read_gene_sets("data/temp.gmt")
    >>> expression = read_expression("data/temp.txt")
    >>> GSEA(gene_sets['demo1'], expression)  #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    (0.8776937535514909, [0, 0.022222222222222223, 0.044444444444444446, 0.06666666666666667, ...)
    """  



## Plotting GSEA

Next, we will plot the results of the GSEA analyses as below:

<img src="figures/gsea_plot.png">
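One possible way to draw the running score with matplotlib is sketched below. It does not try to reproduce the exact layout of the figure above; the function name `plot_running_sum`, the colors, the axis labels, and the dashed line marking the largest deviation are illustrative choices.

```python
import matplotlib.pyplot as plt


def plot_running_sum(scores, title=None):
    """Plot a running enrichment score and mark its largest deviation from zero."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(range(len(scores)), scores, color="tab:green")
    ax.axhline(0, color="gray", linewidth=0.8)

    # Position in the ranked list where the score is furthest from zero.
    peak = max(range(len(scores)), key=lambda i: abs(scores[i]))
    ax.axvline(peak, color="tab:red", linestyle="--", linewidth=0.8)

    ax.set_xlabel("Rank in ordered gene list")
    ax.set_ylabel("Running enrichment score")
    if title:
        ax.set_title(title)
    fig.tight_layout()
    return fig, ax, peak


# Toy running sum (made up for illustration).
toy_scores = [0, 1 / 3, 2 / 3, 1 / 3, 0, 1 / 3, 0]
fig, ax, peak = plot_running_sum(toy_scores, title="toy_set")
```

In a notebook the figure is displayed when the cell finishes; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.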

In [None]:
def plot_GSEA(scores):
    """ Function to make GSEA plots
    
    Args:
        scores (list): list of scores as sorted genes are compared to gene set
    """  


In [None]:
# Here we plot everything
import matplotlib.pyplot as plt

for gene_set in gene_sets:
    enrichment, scores = GSEA(gene_sets[gene_set], expression)
    print(gene_set, enrichment)
    plot_GSEA(scores)

## Assessing significance

For GSEA and other similar models, it is often difficult to accurately model a background expectation. In these cases we can use simulations to estimate an empirical p-value:

```
for 1000 permutations:
    select random gene set of the same size as existing gene set
    calculate enrichment score for this gene set
    
p_val = (number of permutations with a higher score) / (total number of permutations)
```
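The permutation scheme above can be sketched on toy data as follows. This is a minimal sketch under stated assumptions, not the intended solution: `max_deviation` and `empirical_p_value` are hypothetical names, the gene list is assumed to be already ranked from high to low, every gene in the set is assumed to be in that list, and "higher" is taken as a strict comparison of signed scores.

```python
import random


def max_deviation(gene_set, ranked_genes):
    """Signed running-sum value furthest from zero for one gene set."""
    members = set(gene_set)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1 / n_hit if gene in members else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def empirical_p_value(observed, set_size, ranked_genes, permutations=1000, seed=42):
    """Fraction of random gene sets of the same size that score higher than observed."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    higher = 0
    for _ in range(permutations):
        random_set = rng.sample(ranked_genes, set_size)
        if max_deviation(random_set, ranked_genes) > observed:
            higher += 1
    return higher / permutations


# Toy data (hypothetical): 20 genes already ranked from high to low expression.
ranked_genes = [f"g{i}" for i in range(20)]
toy_set = ["g0", "g1", "g3", "g6", "g12"]

observed = max_deviation(toy_set, ranked_genes)
p_value = empirical_p_value(observed, len(toy_set), ranked_genes, permutations=200, seed=42)
print(observed, p_value)
```

With a finite number of permutations the estimate can be exactly 0.0, which only means that no sampled set scored higher; more permutations give a finer resolution.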

In [None]:
import random

def permute_GSEA(enrichment, gene_set, expression, permutations=1000, seed=42):
    """ Function to estimate an empirical p-value for a GSEA enrichment score by permutation
    
    Args:
        enrichment (float): observed enrichment score of the gene set
        gene_set (list): list of gene names in a gene set
        expression (dict): dictionary with keys of gene names and values of expression
        permutations (int): number of permutations
        seed (int): seed for random.sample
        
    Returns:
        p_val (float): empirical p-value for enrichment
        
    Example:
    >>> gene_sets = read_gene_sets("data/temp.gmt")
    >>> expression = read_expression("data/temp.txt")
    >>> enrichment, scores = GSEA(gene_sets['demo1'], expression)
    >>> permute_GSEA(enrichment, gene_sets['demo1'], expression, permutations=10, seed=42) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    0.0
    """


In [None]:
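# gene_set and enrichment hold the values from the last iteration of the plotting loop; gene_set is a set name, so look up its gene list
permute_GSEA(enrichment, gene_sets[gene_set], expression, seed=42)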

In [None]:
import doctest
doctest.testmod()