In [None]:
# imports
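# import only the KEGG REST helpers used in this notebook (avoids a wildcard import)
from Bio.KEGG.REST import kegg_get, kegg_link, kegg_list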
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
from Bio import Entrez

from IPython.display import Image

## Task 1: KEGG and gene id mapping

Familiarize yourself with the KEGG REST interface and how to access it with Biopython:

http://www.genome.jp/kegg/rest/keggapi.html

http://nbviewer.jupyter.org/github/widdowquinn/notebooks/blob/master/Biopython_KGML_intro.ipynb

### Subtask 1.1 Extract gene lists for all (mouse) KEGG pathways and store them in a suitable Python data structure
Below is some example code showing how to get data out of the KEGG REST service in general.
It lists all the mouse pathways and extracts the pathway IDs from the REST response, then pulls the gene list for one pathway.
Extend the code to get lists of gene IDs for each pathway.

In [None]:
# get all pathways from organism 'mmu' (Mus musculus)
pathwaysResponse = kegg_list('pathway', 'mmu').read()
# Format = path:ID Description
print(pathwaysResponse)

In [None]:
# extract pathway IDs from the raw REST response
# (first tab-separated field of each non-empty line, without any 'path:' prefix)
pathways = [line.split('\t')[0].split(':')[-1] for line in pathwaysResponse.splitlines() if line.strip()]
# prints the number of pathways and the IDs of the first 10
print(len(pathways))
pathways[:10]

In [None]:
# for each pathway, get all contained genes from KEGG (in a similar format as the list of pathways above)
# Format = path:ID organism:geneID
# and transform that raw response into a list of gene IDs for each pathway.
# these lists then should be stored in a dictionary with the pathway IDs as keys
# or a table containing pairs of pathwayID and geneID

# HINT: use kegg_link(organism, pathway).read()
# example for one pathway:
exampleGeneList = kegg_link('mmu', pathways[0]).read()
print(exampleGeneList)
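
One possible way to turn a raw `kegg_link` response into the dictionary described above is sketched here. It assumes the format given in the comment (one `path:ID<TAB>organism:geneID` pair per line). The function name `parse_kegg_link` and the sample string are illustrative and not part of Biopython; the gene IDs in the sample are made up.

```python
from collections import defaultdict


def parse_kegg_link(response_text):
    """Parse a raw KEGG 'link' response into {pathway_id: [gene_id, ...]}.

    Assumes one 'path:<pathwayID><TAB><organism>:<geneID>' pair per line.
    """
    genes_by_pathway = defaultdict(list)
    for line in response_text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing empty line
        pathway_field, gene_field = line.split('\t')[:2]
        pathway_id = pathway_field.split(':')[-1]  # 'path:mmu00010' -> 'mmu00010'
        gene_id = gene_field.split(':')[-1]        # 'mmu:111' -> '111'
        genes_by_pathway[pathway_id].append(gene_id)
    return dict(genes_by_pathway)


# made-up sample in the assumed format (not real KEGG output)
sample_response = "path:mmu00010\tmmu:111\npath:mmu00010\tmmu:222\n"
print(parse_kegg_link(sample_response))
```

In the notebook, the same function could be applied to `kegg_link('mmu', pathway_id).read()` for each entry of `pathways`, merging the results into one dictionary.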

### Subtask 1.2: Transform the gene IDs to a format that can be used with your DE analysis

http://www.informatics.jax.org/downloads/reports/MGI_Gene_Model_Coord.rpt <br>
This file contains a mapping from Entrez IDs to gene symbols.
Download and read in this file, then use the respective columns to map the Entrez ID numbers used in KEGG to the gene symbols we have for the differential expression data.
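
A minimal sketch of this mapping step, using only the standard library, might look as follows. The column names (`6. Entrez gene id`, `3. marker symbol`) and the `null` marker for missing values are assumptions about the report layout, so check them against the header of the file you downloaded. The function name and all sample rows are made up for illustration.

```python
import csv
import io


def read_entrez_to_symbol(handle, entrez_col, symbol_col):
    """Build {entrez_id: gene_symbol} from a tab-separated report with a header row."""
    reader = csv.DictReader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    mapping = {}
    for row in reader:
        entrez_id = (row.get(entrez_col) or '').strip()
        symbol = (row.get(symbol_col) or '').strip()
        # skip rows without a usable Entrez ID ('null' is an assumed missing-value marker)
        if entrez_id and entrez_id != 'null' and symbol:
            mapping[entrez_id] = symbol
    return mapping


# made-up rows; the header names are assumptions about the real report
sample_report = (
    "1. MGI accession id\t3. marker symbol\t6. Entrez gene id\n"
    "MGI:0000001\tGeneA\t111\n"
    "MGI:0000002\tGeneB\t222\n"
    "MGI:0000003\tGeneC\tnull\n"
)
entrez_to_symbol = read_entrez_to_symbol(
    io.StringIO(sample_report),
    entrez_col='6. Entrez gene id',
    symbol_col='3. marker symbol',
)

# translate KEGG gene IDs (Entrez numbers) per pathway, dropping unmapped IDs
example_genes_by_pathway = {'mmu00010': ['111', '222', '999']}  # made-up IDs
symbols_by_pathway = {
    pathway_id: [entrez_to_symbol[g] for g in gene_ids if g in entrez_to_symbol]
    for pathway_id, gene_ids in example_genes_by_pathway.items()
}
print(symbols_by_pathway)
```

For the real file, pass `open('MGI_Gene_Model_Coord.rpt', newline='')` instead of the `io.StringIO` sample.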

### Optional Subtask 1.3: If you have time left, save the KEGG gene sets as a GMT file <br>
We will not need it for further analysis, but GMT is a commonly used format for gene sets. It is also useful if you want to rerun this notebook later without repeating the calls to the KEGG REST service from 1.1.

hint:

http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
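
As a rough sketch, a GMT file holds one gene set per line: the set name, a description, and then the genes, all tab-separated. The helper names `write_gmt` and `read_gmt` and the `'NA'` placeholder description are choices made for this example only; the gene sets are made up.

```python
import io


def write_gmt(gene_sets, handle, descriptions=None):
    """Write {set_name: [gene, ...]} as GMT lines: name, description, genes (tab-separated)."""
    descriptions = descriptions or {}
    for name in sorted(gene_sets):
        description = descriptions.get(name, 'NA')  # placeholder if no description is given
        fields = [name, description] + list(gene_sets[name])
        handle.write('\t'.join(fields) + '\n')


def read_gmt(handle):
    """Read GMT lines back into {set_name: [gene, ...]}, ignoring the description field."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            gene_sets[fields[0]] = fields[2:]
    return gene_sets


# made-up gene sets, written to an in-memory buffer instead of a file
example_sets = {'mmu00010': ['GeneA', 'GeneB'], 'mmu00020': ['GeneC']}
buffer = io.StringIO()
write_gmt(example_sets, buffer)
buffer.seek(0)
print(read_gmt(buffer))
```

To write to disk, replace the buffer with `open('kegg_mmu.gmt', 'w')` (the file name is arbitrary).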

## Task 2: Gene Set Enrichment

### Subtask 2.1: Read in the csv file you produced during the Differential Expression module, extract a gene list (as a Python list of gene symbols) from your preferred multiple-testing correction column (and store it in a variable)

### Subtask 2.2: Perform gene set enrichment (Fisher's exact test or a hypergeometric test will do for our purposes) with the KEGG gene sets you extracted in Task 1 (you may want to store the results in a pandas dataframe and write them to csv)

hint: see the section 'Over-representation analysis' here for the hypergeometric test:

https://genetrail2.bioinf.uni-sb.de/help?topic=set_level_statistics
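
The over-representation test can be sketched with the standard library alone. The p-value is the upper tail of the hypergeometric distribution, i.e. the probability of seeing at least the observed overlap by chance, which is the same quantity a one-sided Fisher's exact test reports. The function names and the toy gene identifiers below are made up; the choice of universe (here: all genes passed in as `universe`) is an assumption you need to make for your own data.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """P(X >= n_overlap) when drawing n_selected genes from n_universe,
    of which n_in_set belong to the gene set."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, i) * comb(n_universe - n_in_set, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


def pathway_enrichment(de_genes, pathway_genes, universe):
    """Enrichment p-value of one pathway, restricted to genes in the universe."""
    universe = set(universe)
    de = set(de_genes) & universe
    in_pathway = set(pathway_genes) & universe
    return hypergeom_enrichment_pvalue(
        len(universe), len(in_pathway), len(de), len(de & in_pathway)
    )


# toy example with made-up gene names
toy_universe = ['g%d' % i for i in range(1, 11)]
toy_pathway = ['g1', 'g2', 'g3', 'g4', 'g5']
toy_de = ['g1', 'g2', 'g3', 'g4', 'g6']
print(pathway_enrichment(toy_de, toy_pathway, toy_universe))
```

If SciPy is available, `scipy.stats.hypergeom.sf(n_overlap - 1, n_universe, n_in_set, n_selected)` should give the same tail probability.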

### Subtask 2.3: Extract a list of significantly (at 0.05 significance) enriched KEGG pathways
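
The task does not say whether the 0.05 threshold applies to raw or corrected p-values. Since one test is run per pathway, the sketch below assumes a Benjamini-Hochberg adjustment before filtering; this is one common option, not something the exercise prescribes. The function name and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# made-up enrichment p-values per pathway
example_pvalues = {'pathA': 0.01, 'pathB': 0.04, 'pathC': 0.03, 'pathD': 0.5}
names = list(example_pvalues)
adjusted = benjamini_hochberg([example_pvalues[name] for name in names])
significant = [name for name, q in zip(names, adjusted) if q < 0.05]
print(significant)
```

To filter on raw p-values instead, skip the adjustment and compare `example_pvalues[name] < 0.05` directly.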

## Task 3: KEGG map visualization

#### hint:

http://nbviewer.jupyter.org/github/widdowquinn/notebooks/blob/master/Biopython_KGML_intro.ipynb

#### remark:

In real life you may want to use the R-based tool pathview: https://bioconductor.org/packages/release/bioc/html/pathview.html (if you insist, you can also try calling pathview from Python via rpy2 during the practical)

For Python, in addition to the Biopython module, https://github.com/idekerlab/py2cytoscape in combination with https://github.com/idekerlab/KEGGscape may be another alternative (in the future).

Generally speaking, it is always a good idea to pay attention also to other pathway databases like Reactome or WikiPathways ...

### Subtask 3.1: Pick some significantly enriched KEGG pathways of your choice from 2.3 and visualize them

### Subtask 3.2: Define a suitable color scheme representing whether a gene is significantly differentially expressed or not
Use three colors:
1. not significant
2. significantly overexpressed in CD
3. significantly overexpressed in HFD

Alternatively, if you have enough time, use a continuous color gradient from color 2 to 3.

hint: 

http://www.rapidtables.com/web/color/RGB_Color.htm

hint 2:
Start with the next task, 3.3; it will then be clearer how you need to define the colors.
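
One way to express both the three-color scheme and the optional gradient is sketched below with hex color strings. The concrete colors, the function names, the clipping range `max_abs`, and the sign convention (positive log2 fold change means higher expression in HFD) are all assumptions for this example; adapt them to how your DE table is defined.

```python
NOT_SIGNIFICANT = '#D3D3D3'  # light grey
HIGHER_IN_CD = '#0000FF'     # blue
HIGHER_IN_HFD = '#FF0000'    # red


def hex_to_rgb(hex_color):
    """'#FF0000' -> (255, 0, 0)"""
    hex_color = hex_color.lstrip('#')
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """(255, 0, 0) -> '#FF0000'"""
    return '#{:02X}{:02X}{:02X}'.format(*rgb)


def blend(color_a, color_b, fraction):
    """Linear interpolation from color_a (fraction 0) to color_b (fraction 1)."""
    fraction = max(0.0, min(1.0, fraction))
    a, b = hex_to_rgb(color_a), hex_to_rgb(color_b)
    return rgb_to_hex(tuple(round(x + (y - x) * fraction) for x, y in zip(a, b)))


def discrete_color(log2_fold_change, significant):
    """Three-color scheme: grey, blue (CD) or red (HFD)."""
    if not significant:
        return NOT_SIGNIFICANT
    return HIGHER_IN_HFD if log2_fold_change > 0 else HIGHER_IN_CD


def gradient_color(log2_fold_change, significant, max_abs=2.0):
    """Continuous scheme: blend from the CD color to the HFD color,
    clipping the fold change to [-max_abs, max_abs]."""
    if not significant:
        return NOT_SIGNIFICANT
    clipped = max(-max_abs, min(max_abs, log2_fold_change))
    return blend(HIGHER_IN_CD, HIGHER_IN_HFD, (clipped + max_abs) / (2 * max_abs))


print(discrete_color(1.5, True), gradient_color(0.5, True), gradient_color(0.5, False))
```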


### Subtask 3.3: Visualize the pathway(s) from 3.1 in such a way that the included genes have the corresponding color from 3.2 (you may need to define a suitable mapping from single genes to what is actually shown in the pathway map...)

In [None]:

# some hints on what the data structures of an example pathway look like:
# run this and check the output below to see what is stored in the fields (and what type of data you can put into them)
pathway = KGML_parser.read(kegg_get('mmu00010', "kgml"))
print('### Pathway description ###')
print(pathway)

print('### Genes: ###')
for gene in pathway.genes[0:3]:
    print('### New gene: ###')
    print(gene.name) # each gene may have multiple IDs, you should map and compare them all to your DE gene list
    for graphic in gene.graphics:
        print('### Current gene color: ###')
        print(graphic.bgcolor)
        
# use KGMLCanvas(...) to draw PDFs
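
Because one box in a pathway map can stand for several genes, the coloring step needs a rule for picking one color per box. The sketch below assumes `gene.name` is a space-separated list such as `'mmu:111 mmu:222'` and uses the first non-default color among the genes in a box. The function names, the rule itself, and the stand-in objects built with `SimpleNamespace` are illustrative only; with real data the same function would receive the parsed `pathway` from the cell above.

```python
from types import SimpleNamespace


def gene_ids_of_entry(entry_name):
    """'mmu:111 mmu:222' -> ['111', '222']"""
    return [token.split(':')[-1] for token in entry_name.split()]


def color_pathway_genes(pathway, color_by_gene_id, default_color):
    """Set the background color of every gene box in the pathway.

    A box takes the first non-default color found among its genes,
    otherwise the default color.
    """
    for entry in pathway.genes:
        colors = [
            color_by_gene_id[gene_id]
            for gene_id in gene_ids_of_entry(entry.name)
            if gene_id in color_by_gene_id
        ]
        highlighted = [color for color in colors if color != default_color]
        chosen = highlighted[0] if highlighted else default_color
        for graphic in entry.graphics:
            graphic.bgcolor = chosen


# stand-in objects mimicking the fields printed above (not real KGML data)
demo_pathway = SimpleNamespace(genes=[
    SimpleNamespace(name='mmu:111 mmu:222', graphics=[SimpleNamespace(bgcolor='#BFFFBF')]),
    SimpleNamespace(name='mmu:333', graphics=[SimpleNamespace(bgcolor='#BFFFBF')]),
])
demo_colors = {'111': '#D3D3D3', '222': '#FF0000'}  # keyed by KEGG gene ID (made up)
color_pathway_genes(demo_pathway, demo_colors, default_color='#D3D3D3')
print([entry.graphics[0].bgcolor for entry in demo_pathway.genes])
```

After coloring a real pathway, something like `KGMLCanvas(pathway, import_imagemap=True).draw('mmu00010.pdf')` should write the map to a PDF, as in the Biopython KGML notebook linked in the hint.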