Ciao-Sin Chen
cscchen@umich.edu

### Previous steps 

In [1]:
import pandas as pd

In [2]:
pathway_ID_list = pd.read_csv('http://rest.kegg.jp/list/pathway/hsa', delimiter = "\t", header = None)
pathway_ID_list.columns = ["pathway_ID", "pathway_name"]
# pathway_ID_list

In [3]:
geneID_geneName = pd.read_csv("http://rest.kegg.jp/list/hsa", sep = "\t", header = None)
geneID_geneName.columns = ["gene_ID", "gene_name"]
# geneID_geneName

In [4]:
pathway_to_gene = pd.read_csv("http://rest.kegg.jp/link/pathway/hsa", sep = "\t", header = None)
pathway_to_gene.columns = ["gene_ID", "pathway_ID"]
# pathway_to_gene

### Step 4. Compute the number of pathways of each gene

In [5]:
def get_map_gene_pathways(links_gene_pathway): 
    """
    Build a dictionary that maps each gene to the list of pathways the gene is in.

    Parameters:
        links_gene_pathway (pandas.DataFrame): The dataframe of links between genes and pathways, with columns "gene_ID" and "pathway_ID".

    Returns:
        map_gene_pathways (dict): A dictionary with gene IDs as keys and lists of pathway IDs as values.

    """
    map_gene_pathways_info = links_gene_pathway.set_index("gene_ID").groupby(level = 0).apply(lambda pathways: list(pathways.pathway_ID))
    map_gene_pathways = dict(map_gene_pathways_info)
    return map_gene_pathways

def count_pathway(gene_ID, map_gene_pathways):
    """
    Count the number of pathways the given gene is in.

    Parameters:
        gene_ID (str): The KEGG ID of the gene.
        map_gene_pathways (dict): A dictionary with gene IDs as keys and lists of pathway IDs as values, as generated by the get_map_gene_pathways function.

    Returns:
        pathway_count (int): The number of pathways the given gene is in.

    """
    pathways = map_gene_pathways[gene_ID]
    pathway_count = len(pathways)
    return pathway_count

def get_gene_name(gene_ID, gene_name_info):
    """
    Get the name of the given gene.

    Parameters:
        gene_ID (str): The KEGG ID of the gene.
        gene_name_info (pandas.DataFrame): The dataframe of KEGG ID and names of all genes.

    Returns:
        gene_name (str): The name of the given gene.

    """
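    # Select the matching row and take its first value so that a plain string is returned
    # instead of a Series (raises IndexError if the gene ID is not found).
    gene_name = gene_name_info.loc[gene_name_info.gene_ID == gene_ID, "gene_name"].iloc[0]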
    return gene_name

def get_pathway_name(pathway_ID, pathway_name_info): 
    """
    Get the name of the given pathway.

    Parameters:
        pathway_ID (str): The KEGG ID of the pathway.
        pathway_name_info (pandas.DataFrame): The dataframe of KEGG ID and names of all pathways.

    Returns:
        pathway_name (str): The name of the given pathway.

    """
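    # Select the matching row and take its first value so that a plain string is returned
    # instead of a Series (raises IndexError if the pathway ID is not found).
    pathway_name = pathway_name_info.loc[pathway_name_info.pathway_ID == pathway_ID, "pathway_name"].iloc[0]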
    return pathway_name


In [6]:
# create a dictionary with genes as keys and pathways as values using information of links between genes and pathways
map_gene_pathways = get_map_gene_pathways(pathway_to_gene)

# count the number of pathways of each gene
pathway_counts = {"gene_ID": list(map_gene_pathways.keys()),
                  "pathway_count": [count_pathway(gene_ID, map_gene_pathways) for gene_ID in map_gene_pathways],
                  "pathway_IDs": list(map_gene_pathways.values())}
pathway_counts = pd.DataFrame(pathway_counts).sort_values("pathway_count", ascending = False)
# show the ten genes that are in the most pathways
pathway_counts.head(10)

| | gene_ID | pathway_count | pathway_IDs |
|---|---|---|---|
| 5208 | hsa:5595 | 117 | [path:hsa01521, path:hsa01522, path:hsa01524, ... |
| 5207 | hsa:5594 | 117 | [path:hsa01521, path:hsa01522, path:hsa01524, ... |
| 4738 | hsa:5290 | 106 | [path:hsa00562, path:hsa01100, path:hsa01521, ... |
| 4739 | hsa:5291 | 106 | [path:hsa00562, path:hsa01100, path:hsa01521, ... |
| 4741 | hsa:5293 | 106 | [path:hsa00562, path:hsa01100, path:hsa01521, ... |
| 4744 | hsa:5296 | 103 | [path:hsa01521, path:hsa01522, path:hsa01524, ... |
| 7295 | hsa:8503 | 103 | [path:hsa01521, path:hsa01522, path:hsa01524, ... |
| 4743 | hsa:5295 | 103 | [path:hsa01521, path:hsa01522, path:hsa01524, ... |
| 1646 | hsa:208 | 100 | [path:hsa01521, path:hsa01522, path:hsa01524, ... |
| 1640 | hsa:207 | 100 | [path:hsa01521, path:hsa01522, path:hsa01524, ... |


### Reflections
#### What was your biggest challenge in this project? (regarding writing code and not only)
My biggest challenge in this project was working out how to turn the dataframe of links between genes and pathways into a dictionary. When computing the pathway count of each gene, I chose not to compute the counts directly but to build a dictionary with genes as keys and pathways as values, so that the pathway IDs were retained. Since one gene can belong to multiple pathways, a dictionary seemed an appropriate format, but none of the output formats of the to_dict method can handle multiple values for one key. It took me a while to figure out how to use the set_index, groupby, and apply methods together.
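
The grouping idea can be tried out on a small hand-made table. This is only a minimal sketch: the `toy_links` dataframe and its IDs are invented for illustration (only the column names follow the notebook), and it groups the pathway column directly, which is a slightly shorter variant of the set_index, groupby, and apply chain used in Step 4.

```python
import pandas as pd

# Hypothetical toy links table; the IDs are invented for illustration only.
toy_links = pd.DataFrame({
    "gene_ID": ["gene:A", "gene:A", "gene:B", "gene:C", "gene:C", "gene:C"],
    "pathway_ID": ["path:p1", "path:p2", "path:p1", "path:p1", "path:p2", "path:p3"],
})

# Group the pathway column by gene, collect each group into a list,
# then convert the resulting Series (gene -> list of pathways) to a dict.
toy_map = toy_links.groupby("gene_ID")["pathway_ID"].apply(list).to_dict()

print(toy_map)
```

Each key of `toy_map` is a gene ID and each value is the list of pathways linked to that gene, so the pathway count is simply the length of the list.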

#### What did you learn while working on this project? (regarding writing code and not only)
While working on this project, I learned how to collaborate with others on GitHub using clones, forks, and pull requests. This was the first time I realized how the comment and tag features can reduce the burden of communicating back and forth and make sharing information much more convenient. I also learned how to use dataframe methods or comprehensions to make the code more efficient, avoiding iteration over the rows or columns of a dataframe, which can be much more time-consuming on larger data.
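
As a small illustration of the last point, the sketch below counts links per gene twice, once with an explicit loop over rows and once with a single dataframe method. The `small_links` table is made up for this example and is not the KEGG data; the timing difference is not measured here.

```python
import pandas as pd

# Hypothetical toy links table; the IDs are invented for illustration only.
small_links = pd.DataFrame({
    "gene_ID": ["gene:A", "gene:A", "gene:B", "gene:C", "gene:C", "gene:C"],
    "pathway_ID": ["path:p1", "path:p2", "path:p1", "path:p1", "path:p2", "path:p3"],
})

# Row-by-row counting with an explicit Python loop.
loop_counts = {}
for _, row in small_links.iterrows():
    loop_counts[row["gene_ID"]] = loop_counts.get(row["gene_ID"], 0) + 1

# The same counts from a single dataframe method call.
method_counts = small_links["gene_ID"].value_counts().to_dict()

# Both approaches give the same number of links per gene.
assert loop_counts == method_counts
```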

#### If you had more time on the project what other question(s) would you like to answer? (at least one question is required)
If I had more time on the project, I would look for the pathways that contain the highest numbers of genes, which are likely to be overviews of multiple pathways, such as biosynthesis of amino acids (hsa01230). These could help with visualizing differences in metabolic trends when a study includes different groups or time points.
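
One possible starting point for this question is to count the distinct genes linked to each pathway and sort the result. The sketch below uses a made-up `example_links` table rather than the KEGG data; on the project data the same calls would presumably be applied to `pathway_to_gene`, which has not been run here.

```python
import pandas as pd

# Hypothetical toy links table; the IDs are invented for illustration only.
example_links = pd.DataFrame({
    "gene_ID": ["gene:A", "gene:A", "gene:B", "gene:C", "gene:C", "gene:C"],
    "pathway_ID": ["path:p1", "path:p2", "path:p1", "path:p1", "path:p2", "path:p3"],
})

# Count the distinct genes in each pathway and put the largest pathways first.
genes_per_pathway = (
    example_links.groupby("pathway_ID")["gene_ID"]
    .nunique()
    .sort_values(ascending=False)
)

print(genes_per_pathway.head(10))
```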