## Task 1: KEGG and gene id mapping

Familiarize yourself with the KEGG Rest interface and how to access it with Biopyhton:

http://www.genome.jp/kegg/rest/keggapi.html

http://nbviewer.jupyter.org/github/widdowquinn/notebooks/blob/master/Biopython_KGML_intro.ipynb

In [342]:
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import math
import seaborn as sns
import urllib2
import random as rd
import Bio
import re
from Bio.KEGG.REST import *
from Bio.KEGG.KGML import KGML_parser
from scipy.stats import hypergeom

In [2]:
% matplotlib inline

In [3]:
!mkdir -p DayY_InOutput


### Subtask 1.1 Extract gene lists for all (mouse) KEGG pathways and store them in a suitable Python data structure

In [4]:
# get all mus musculus pathways from kegg
pw = kegg_list("pathway","mmu").read()

In [5]:
mouse_pathway_kegg = pd.DataFrame([x.replace(":","\t",1).split("\t") for x in pw.split("\n")],
                                columns=["Type","ID","Description"])

In [6]:
# set index
mouse_pathway_kegg.set_index("ID",inplace=True,drop=False)

In [7]:
# extract genes from pathways
all_genes = {}
for pathway in mouse_pathway_kegg.ID[:-1]:
    #print pathway
    pw = kegg_get(pathway).read()
    GENES = []
    Gene = False
    for line in pw.split("\n"):
        if line.startswith("GENE"):
            Gene = True
        if line.startswith("COMPOUND"):
            Gene = False
        if Gene:
            GENES.append(line)
    all_genes[pathway] = ([re.split(r'\s{1,}', g)[2].replace(";","") 
                      for g in GENES if len(re.split(r'\s{1,}', g))>2])

In [83]:
# merge genes into one data frame with pathway as ID
all_genes_joined = {}
for key in all_genes.keys():
    all_genes_joined[key] = ",\t ".join(all_genes[key])
all_genes_df = pd.DataFrame.from_dict(all_genes_joined,orient="index")
all_genes_df.columns = ["Genes"]
# sort data frame
all_genes_df.sort_index(inplace=True)

In [84]:
# append genes to kegg pathway data frame
mouse_pathway_kegg_genes = pd.concat([mouse_pathway_kegg,all_genes_df],axis=1,join="outer")
mouse_pathway_kegg_genes.drop("Type",axis=1,inplace=True)
mouse_pathway_kegg_genes.drop(mouse_pathway_kegg_genes.ix[0],inplace=True)

In [85]:
mouse_pathway_kegg_genes.head()

Unnamed: 0,ID,Description,Genes
mmu00010,mmu00010,Glycolysis / Gluconeogenesis - Mus musculus (m...,"Hk2,\t Hk3,\t Hk1,\t Hkdc1,\t Gck,\t Gpi1,\t P..."
mmu00020,mmu00020,Citrate cycle (TCA cycle) - Mus musculus (mouse),"Cs,\t Csl,\t Acly,\t Aco2,\t Aco1,\t Idh1,\t I..."
mmu00030,mmu00030,Pentose phosphate pathway - Mus musculus (mouse),"Gpi1,\t G6pd2,\t G6pdx,\t Pgls,\t H6pd,\t Pgd,..."
mmu00040,mmu00040,Pentose and glucuronate interconversions - Mus...,"Gusb,\t Kl,\t Ugt2b5,\t Ugt1a2,\t Ugt1a6a,\t U..."
mmu00051,mmu00051,Fructose and mannose metabolism - Mus musculus...,"Mpi,\t Pmm2,\t Pmm1,\t Gmppb,\t Gmppa,\t Gmds,..."


### Subtask 1.2: Save the KEGG gene sets as a gmt file after you made sure they have the proper gene ids with respect to your DE analysis

hints: 

http://biopython.org/wiki/Annotate_Entrez_Gene_IDs

http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

In [86]:
mouse_pathway_kegg_genes.to_csv("DayY_InOutput/mouse_pathway.gmt")

## Task 2: Gene Set Enrichment

### Subtask 2.1: Read in the csv file you produced during the Differential Expression module, extract a gene list (as a python list of gene symbols) from your favorite multiple correction column (and store it in a variable)

In [87]:
DE = pd.read_csv("Day4_InOutput/multiple_comparison_fc.csv",sep="\t",index_col=0)

In [88]:
diff_reg_genes = DE["simes-hochberg"]

### Subtask 2.2: Perform gene set enrichment (Fisher's exact test or an hypergeometric test will do for our purposes) with the KEGG gene sets you extracted in Task 1 (you may want to store the results in a pandas dataframe and write them to csv)

hint:

https://genetrail2.bioinf.uni-sb.de/help?topic=set_level_statistics

In [384]:
def hypergeom_test(ct):
    """
        Hypergeometric test for given crosstable ct.
        Returns p-value
    """
    k = ct.Pathway.DE
    l = ct.Pathway.All
    m = ct.All.All
    n = ct.All.DE
    kp = (n*l)/m
    if kp >= k:
        p = hypergeom.cdf(k,m,l,n)
    else:
        p = hypergeom.sf(k-1,m,l,n)
    return p

In [390]:
def apply_hypergeom_test():
    """
        Applies the hypergeometric test scipy.stats.hypergeom 
        to all pathways in the dataframe "mouse_pathway_kegg_genes" 
        and the diff_reg_genes as defined before. 
        Return: Data Frame with p values (uncorrected) for each pathway.
    """
    # extract set of genes from pathway gene list
    pathway_gene_set = set()
    for x in mouse_pathway_kegg_genes.Genes:
        for y in x.split(",\t"):
            pathway_gene_set.add(y.strip());
    # set of genes that are represented in pathway_genes and diff_reg_genes
    gene_set = list(set(diff_reg_genes.index).intersection(pathway_gene_set))

    #prepare data frame
    crossdf = pd.DataFrame([gene_set,[False]*len(gene_set),[False]*len(gene_set)]).T
    crossdf.columns=["ID","DE","Pathway"]
    crossdf.set_index("ID",inplace=True)
    
    # apply hypergeom to all pathways
    pvals = {}
    for pathway in mouse_pathway_kegg_genes.index:
        crossdf.DE.loc[set([str(c) for c in diff_reg_genes.loc[diff_reg_genes.values<0.05].
                            index.values]).intersection(gene_set)] = True
        crossdf.Pathway.loc[
            set([str(x.strip()) for x in mouse_pathway_kegg_genes.Genes[pathway].split(",\t")]).intersection(gene_set)
                ] = True
        # calculate crosstable
        crosstable = pd.crosstab(crossdf.DE.replace([False,True],["Not DE","DE"]),
                    crossdf.Pathway.replace([False,True],["Not Pathway","Pathway"]),margins=True)
        pvals[pathway] = hypergeom_test(crosstable)
    # convert to pd.DataFrame
    pvals = pd.DataFrame.from_dict(pvals,orient="index")
    pvals.columns=(["P-Value_hypergeom"])
    return pvals

### Subtask 2.3: Extract a list of significantly (at 0.05 significance) enriched KEGG pathways

In [388]:
pvals_hg = apply_hypergeom_test()

In [389]:
enriched_kegg_pathways = pvals_hg[pvals_hg["P-Value_hypergeom"]<0.05/len(pvals_hg)]
enriched_kegg_pathways.head()

Unnamed: 0,P-Value_hypergeom
mmu02010,0.0001529799
mmu04622,1.609165e-10
mmu04022,4.699378e-07
mmu04350,9.623347e-11
mmu03022,4.107536e-07


## Task 3: KEGG map visualization

#### hint:

http://nbviewer.jupyter.org/github/widdowquinn/notebooks/blob/master/Biopython_KGML_intro.ipynb

#### remark:

In real life you may want to use the R-based tool pathview: https://bioconductor.org/packages/release/bioc/html/pathview.html (if you insist you can also try to use r2py for using pathview from Python during the practical)

For Python (in addition to the Biopyhton module) https://github.com/idekerlab/py2cytoscape in combination with https://github.com/idekerlab/KEGGscape may be another alternative (in the future)

Generally speaking, it is always a good idea to pay attention also to other pathway databases like Reactome or WikiPathways ...

### Subtask 3.1: Pick some significantly enriched KEGG pathways of your choice from 2.3 and visualize them

### Subtask 3.2: Define a a suitable binary color scheme respresenting the fact whether a gene is significantly expressed or not

hint: 

http://www.rapidtables.com/web/color/RGB_Color.htm

### Subtask 3.3: Visualize the pathway(s) from 3.1 in such a way that the included genes have the corresponding color from 3.2 ( you may need to define a suitable mapping from single genes to what is actually shown in the pathway map...)

### Subtask 3.4: Define a suitable continuous color range representing the log2 fold changes of the all the genes in your data

hint:

http://bsou.io/posts/color-gradients-with-python

### Subtask 3.5: Visualize the pathway(s) from 3.1 in such a way that the included genes have the corresponding color from 3.4