In [None]:
# imports
import pandas as pd
from Bio.KEGG.REST import *
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
#from Bio import Entrez

from IPython.display import Image

from io import StringIO

## KEGG REST service and visualizing KEGG pathways
You can find documentation about the REST service itself and how to use it directly using URLs in a browser here: http://www.genome.jp/kegg/rest/keggapi.html

The following Jupyter notebook contains code examples that closely mirror the tasks in this Pathways notebook, in almost the same order. Consult it if you need help with anything from accessing the KEGG REST service via Biopython to coloring nodes in a pathway visualization:
http://nbviewer.jupyter.org/github/widdowquinn/notebooks/blob/master/Biopython_KGML_intro.ipynb

## Task 1: KEGG and gene id mapping

Familiarize yourself with the KEGG REST interface and how to access it with Biopython.<br>
Before you start solving task 1.1, read its task description, then look at and run all code examples given there. Try to understand what happens, for example by recreating some REST calls as URL calls in your browser.<br>
The Biopython functions `kegg_list` and `kegg_link` simply wrap such a URL call in a function and return the response as plain text.<br>
E.g. accessing the URL http://rest.kegg.jp/list/pathway/hsa lists all pathways associated with humans and
is equivalent to calling<br>
`kegg_list('pathway', 'hsa').read()` in Python.<br>
(Organism IDs: hsa = Homo sapiens, mmu = Mus musculus (mouse))
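
To make the correspondence between function calls and URLs concrete, here is a minimal sketch that assembles REST URLs by hand. The helper `kegg_rest_url` is hypothetical (it is not part of Biopython), and the base URL is simply the one from the example above. The second URL assumes that the `link` operation takes its arguments in the same order as `kegg_link('mmu', 'path:mmu00010')`.

```python
# Base URL taken from the example above.
KEGG_BASE = "http://rest.kegg.jp"


def kegg_rest_url(operation, *args):
    """Hypothetical helper: join a REST operation and its arguments into a URL."""
    return "/".join([KEGG_BASE, operation, *args])


# URL equivalent of kegg_list('pathway', 'hsa')
print(kegg_rest_url("list", "pathway", "hsa"))

# Assumed URL equivalent of kegg_link('mmu', 'path:mmu00010')
print(kegg_rest_url("link", "mmu", "path:mmu00010"))
```

Pasting the printed URLs into a browser is a quick way to inspect the raw response before parsing it in Python.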

### Subtask 1.1 Extract gene lists for all (mouse) KEGG pathways and store them
Below is some example code showing how to get data out of the KEGG REST service in general.
It lists all the mouse pathways and extracts the pathway IDs from the REST response, then pulls the gene list for one pathway.

* The raw list of pathways looks like this:<br>
path:mmu00010	Glycolysis / Gluconeogenesis - Mus musculus (mouse)<br>
path:mmu00020	Citrate cycle (TCA cycle) - Mus musculus (mouse)<br>
path:mmu00030	Pentose phosphate pathway - Mus musculus (mouse)<br>
path:mmu00040	Pentose and glucuronate interconversions - Mus musculus (mouse)<br>
path:mmu00051	Fructose and mannose metabolism - Mus musculus (mouse)<br>
path:mmu00052	Galactose metabolism - Mus musculus (mouse)<br>
path:mmu00053	Ascorbate and aldarate metabolism - Mus musculus (mouse)<br>
...<br>
(path:ID)(TAB)(Description with spaces)


* Each raw list of genes looks like this:<br>
path:mmu00010	mmu:100042025<br>
path:mmu00010	mmu:103988<br>
path:mmu00010	mmu:106557<br>
path:mmu00010	mmu:110695<br>
path:mmu00010	mmu:11522<br>
path:mmu00010	mmu:11529<br>
path:mmu00010	mmu:11532<br>
...<br>
(path:ID)(TAB)(organismID:geneID)
<br>

The gene identifiers are so-called Entrez IDs: KEGG chose to use the Entrez system.

* Extend the code to get lists of gene IDs for each pathway<br>
* Store the lists in a way that allows convenient lookup of all genes in each pathway:
a DataFrame with two columns, `Pathway` and `Entrez`, where the pathway ID is the same for all genes from one pathway. From this you can extract all genes of a pathway by filtering on the first column. Be aware that neither column on its own is a unique identifier for a row: each pathway contains many genes and each gene may be present in multiple pathways. It is a typical case of N-to-N ("many-to-many") mapping.
 
BEWARE! You will need to make many calls to the REST service to retrieve pathway information, and this might take a lot of time, partly because Biopython limits the number of calls per second.<br>
Make sure your code works as expected on e.g. 3 pathways before you start to process all (> 300) of them.
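
One possible way to structure the loop so that it can be tried out without touching the network is sketched below. The function `collect_pathway_genes` and its `fetch` argument are illustrative names, not part of Biopython. The canned responses reuse a few sample lines from above for `path:mmu00010`; the rows for `path:mmu00020` are invented placeholders, not real KEGG content.

```python
from io import StringIO

import pandas as pd


def collect_pathway_genes(pathway_ids, fetch):
    """Fetch the gene list of each pathway and stack them into one table.

    `fetch` is any callable that maps a pathway ID to the raw
    tab-separated response text.
    """
    tables = []
    for pathway_id in pathway_ids:
        raw = fetch(pathway_id)
        if not raw.strip():
            continue  # empty response: nothing to parse for this pathway
        tables.append(
            pd.read_csv(StringIO(raw), sep="\t", header=None,
                        names=["Pathway", "Entrez"], dtype=str)
        )
    if not tables:
        return pd.DataFrame(columns=["Pathway", "Entrez"])
    return pd.concat(tables, ignore_index=True)


# Canned responses standing in for the REST service.
# The mmu00020 rows are made up for illustration only.
fake_responses = {
    "path:mmu00010": ("path:mmu00010\tmmu:100042025\n"
                      "path:mmu00010\tmmu:103988\n"
                      "path:mmu00010\tmmu:106557\n"),
    "path:mmu00020": ("path:mmu00020\tmmu:103988\n"
                      "path:mmu00020\tmmu:11529\n"),
}

pathway_entrez = collect_pathway_genes(list(fake_responses), fake_responses.get)
print(pathway_entrez)
```

In the notebook, the stub can be swapped for something like `lambda pid: kegg_link('mmu', pid).read()`, applied first to a slice of three pathway IDs and only then to the full list.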

In [None]:
# get all pathways from organism 'mmu' (Mus musculus)
pathwaysResponse = kegg_list('pathway', 'mmu').read()
# Format = path:ID Description
print(pathwaysResponse)

In [None]:
# split the response on "\n" newline characters, split the lines on "\t" characters
# store them in a suitable data structure (e.g. a Series indexed by pathway IDs)

# pw_ids_names = ...

In [None]:
# use kegg_link(organism, pathway).read() to get the list of genes for each pathway

# example for one pathway:
exampleGeneList = kegg_link('mmu', 'path:mmu00010').read()
# print(exampleGeneList)

# A neat trick to turn this (and the list of pathways before) directly into a pandas DataFrame
# is to use StringIO to simulate a CSV input file:
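# naming the columns here matches the Pathway / Entrez layout described above
tmp_df = pd.read_csv(StringIO(exampleGeneList), sep='\t', header=None, names=['Pathway', 'Entrez'])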
tmp_df

# Extend this code so that you obtain all pathway tables, and concatenate them into a single
# pathway_entrez DataFrame.

### Subtask 1.2: Create the pathway-to-gene DataFrame
How many rows does the `pathway_entrez` DataFrame have? How many unique Entrez identifiers does it contain?
Store the DataFrame as a CSV file so that you can load it back easily if necessary.
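
The counting and the save/load round trip could look roughly like the sketch below. The small table is a stand-in for the real `pathway_entrez` (its second pathway is an invented placeholder), and an in-memory buffer is used instead of a file so that the example runs anywhere; in the notebook you would pass a file name of your choice to `to_csv` and `read_csv`.

```python
from io import StringIO

import pandas as pd

# Stand-in for the real table; the mmu00020 rows are placeholders.
pathway_entrez = pd.DataFrame({
    "Pathway": ["path:mmu00010", "path:mmu00010", "path:mmu00020"],
    "Entrez": ["mmu:103988", "mmu:106557", "mmu:103988"],
})

n_rows = len(pathway_entrez)                   # one row per (pathway, gene) pair
n_unique = pathway_entrez["Entrez"].nunique()  # each gene counted once
print(n_rows, n_unique)

# Round trip through CSV; index=False avoids writing a spurious index column.
buffer = StringIO()
pathway_entrez.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, dtype=str)
```

Because genes can belong to several pathways, the number of unique identifiers is normally smaller than the number of rows.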


### Subtask 1.3: Create a conversion table between gene identifier formats

http://www.informatics.jax.org/downloads/reports/MGI_Gene_Model_Coord.rpt <br>
The file above contains mappings between different identifier types, including the Entrez IDs that KEGG uses and the gene symbols we have in the DE data.
Download and read in this file, then use the respective columns to map the Entrez ID numbers to gene symbols.

Create a DataFrame with two columns:
* `Gene.1` (to match the index name we had used in the DE notebook) and
* `Entrez` to match our KEGG `pathway_entrez` table.

Watch out! Pandas may convert the Entrez IDs to floating point numbers when loading the file, typically because some rows have no Entrez ID and the missing values force a float column. Also, the "mmu:" prefix that KEGG uses is missing from them. Turn those `12345.0`-like floating point values into `mmu:12345` strings, otherwise you will have a hard time matching them with KEGG pathways.
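
A minimal sketch of the conversion, assuming the two relevant columns have already been selected and renamed to `Gene.1` and `Entrez`. The gene symbols below are placeholders, and the missing value imitates a gene without an Entrez ID.

```python
import pandas as pd

# Placeholder symbols; NaN imitates a gene without an Entrez ID.
mapping = pd.DataFrame({
    "Gene.1": ["GeneA", "GeneB", "GeneC"],
    "Entrez": [11522.0, float("nan"), 103988.0],
})

# Drop rows without an Entrez ID, then go float -> int -> str and add the prefix.
mapping = mapping.dropna(subset=["Entrez"]).copy()
mapping["Entrez"] = "mmu:" + mapping["Entrez"].astype(int).astype(str)
print(mapping)
```

Converting through `int` first is what removes the trailing `.0`; casting the float column straight to `str` would keep it.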

## Task 2: Gene Set Enrichment

### Subtask 2.1: Prepare your differential expression data

* Read in the csv file you produced in the Differential Expression notebook
* Introduce a new boolean column `significant`, with criteria that you can adjust later. You can start with -log10p > 2 and |log2fold| > 0.5, but these are not necessarily your final values.
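
As a sketch of the second step, the thresholds can be kept in named constants so that they are easy to change later. The column names `log2fold` and `neg_log10_p` and the values in the table are assumptions for illustration; use whatever your own CSV file contains.

```python
import pandas as pd

# Made-up differential expression values with assumed column names.
de = pd.DataFrame(
    {
        "log2fold": [1.2, -0.8, 0.1, -0.3],
        "neg_log10_p": [3.5, 2.4, 4.0, 0.5],
    },
    index=pd.Index(["GeneA", "GeneB", "GeneC", "GeneD"], name="Gene.1"),
)

# Starting criteria from the task description; adjust as needed.
MIN_NEG_LOG10_P = 2.0
MIN_ABS_LOG2FOLD = 0.5

# A gene is significant only if it passes both thresholds.
de["significant"] = (de["neg_log10_p"] > MIN_NEG_LOG10_P) & (
    de["log2fold"].abs() > MIN_ABS_LOG2FOLD
)
print(de)
```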

### Subtask 2.2: Perform gene set enrichment with the KEGG gene sets you extracted in Task 1
Several tests are suitable for our purpose: the $\chi^2$ test we had used before, Fisher's exact test, or the simplest binomial test.<br>

Pick one of them. If you choose a contingency-table based test ($\chi^2$ or Fisher's exact), find a way to calculate the entries of the contingency table for each pathway (significantly DE and contained in PW / not significant and contained / significant and not contained / not significant and not contained).

If you choose a binomial test, determine its parameters: the number of successes, the number of trials, and the probability of success. Remember that the probability of success is independent of the pathway! This is why the binomial test is a simplification.
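
The sketch below shows one way to obtain the four contingency table entries from plain Python sets, together with hand-written one-sided versions of Fisher's exact test (as a hypergeometric tail) and the binomial test. All function names are illustrative and the gene names are invented. In practice a library routine such as `scipy.stats.fisher_exact` would be the more usual choice; the explicit formulas are only meant to show what is being computed.

```python
from math import comb


def contingency_table(pathway_genes, significant_genes, all_genes):
    """Return (sig & in PW, not sig & in PW, sig & not in PW, not sig & not in PW)."""
    universe = set(all_genes)
    pathway = set(pathway_genes) & universe
    significant = set(significant_genes) & universe
    sig_in = len(pathway & significant)
    nonsig_in = len(pathway - significant)
    sig_out = len(significant - pathway)
    nonsig_out = len(universe) - sig_in - nonsig_in - sig_out
    return sig_in, nonsig_in, sig_out, nonsig_out


def fisher_enrichment_p(sig_in, nonsig_in, sig_out, nonsig_out):
    """One-sided Fisher's exact test: P(at least sig_in significant genes in the PW)."""
    n_total = sig_in + nonsig_in + sig_out + nonsig_out
    n_pathway = sig_in + nonsig_in
    n_sig = sig_in + sig_out
    upper = min(n_pathway, n_sig)
    tail = sum(
        comb(n_pathway, k) * comb(n_total - n_pathway, n_sig - k)
        for k in range(sig_in, upper + 1)
    )
    return tail / comb(n_total, n_sig)


def binomial_enrichment_p(successes, trials, p_success):
    """One-sided binomial test: P(at least `successes` hits in `trials` draws)."""
    return sum(
        comb(trials, k) * p_success**k * (1 - p_success) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Invented toy data: 10 measured genes, 4 in the pathway, 3 significant.
all_genes = [f"g{i}" for i in range(1, 11)]
pathway_genes = ["g1", "g2", "g3", "g4"]
significant_genes = ["g1", "g2", "g3"]

table = contingency_table(pathway_genes, significant_genes, all_genes)
p_fisher = fisher_enrichment_p(*table)

# Binomial parameters: successes = significant genes in the pathway,
# trials = genes in the pathway, p = overall fraction of significant genes.
p_binom = binomial_enrichment_p(
    table[0], table[0] + table[1], len(significant_genes) / len(all_genes)
)
print(table, p_fisher, p_binom)
```

Note that the overall fraction of significant genes passed to the binomial test is computed once from the whole data set and reused for every pathway.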

### Subtask 2.3: Extract a list of significantly enriched KEGG pathways
How many pathways are enriched? How many survive the Benjamini-Hochberg correction? Do you have to update your criteria for gene significance?

Is the proportion of enriched pathways similar to the proportion of differentially expressed genes?
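
For reference, the Benjamini-Hochberg adjustment can be written in a few lines of plain Python. This is a sketch with an illustrative function name and made-up p-values; statistics libraries offer equivalent routines.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up enrichment p-values for four pathways.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)

ALPHA = 0.03  # example threshold, chosen only for illustration
n_surviving = sum(p <= ALPHA for p in adj_p)
print(adj_p, n_surviving)
```

Comparing the number of pathways below the threshold before and after the adjustment answers the second question above.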

### Subtask 2.4: Create a pathway DataFrame combining all of the above
* indexed by pathway ID
* number of genes in pathway
* number of significant genes in pathway
* p-values of enrichment
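
One way to assemble such a table is sketched below, assuming a `pathway_entrez` table, a per-gene table with a `significant` flag keyed by Entrez ID, and a Series of enrichment p-values indexed by pathway ID. All names and values here are invented placeholders.

```python
import pandas as pd

# Placeholder inputs standing in for the results of the previous subtasks.
pathway_entrez = pd.DataFrame({
    "Pathway": ["pwA", "pwA", "pwA", "pwB", "pwB"],
    "Entrez": ["mmu:1", "mmu:2", "mmu:3", "mmu:2", "mmu:4"],
})
gene_flags = pd.DataFrame({
    "Entrez": ["mmu:1", "mmu:2", "mmu:3", "mmu:4"],
    "significant": [True, True, False, False],
})
enrichment_p = pd.Series({"pwA": 0.04, "pwB": 0.60}, name="p_value")

# Keep only genes that were actually measured, then summarize per pathway.
merged = pathway_entrez.merge(gene_flags, on="Entrez", how="inner")
summary = merged.groupby("Pathway")["significant"].agg(
    n_genes="count",       # genes of the pathway present in the DE data
    n_significant="sum",   # True counts as 1
)
summary["p_value"] = enrichment_p  # aligned on the pathway ID index
print(summary)
```

The inner merge means that `n_genes` counts only pathway genes that appear in the expression data, which is usually the number the enrichment test was based on.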