# Extraction of gene set collections from Human-GEM

This notebook demonstrates the process of extracting gene-metabolite and gene-subsystem associations from Human-GEM, and exporting them to a `.gmt` file (a common file format for gene set collections).

You do not need to run this notebook, because the resulting files `HumanGEM_metabolite_GSC.gmt` and `HumanGEM_subsystem_GSC.gmt` already exist in the `data/gene_set_collections/` subdirectory of this session (they are needed for the next exercise in R). The code is provided here to show how such files can be generated from a GEM if necessary.
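For orientation, each line of a `.gmt` file describes one gene set: the set name, a description field, and then the member genes, all separated by tabs. The following is a minimal sketch of building one such line; the helper `format_gmt_line` and the gene IDs are hypothetical and only serve as an illustration.

```python
def format_gmt_line(name, genes, description='na'):
    """Return one tab-separated .gmt line: set name, description, then genes."""
    return '\t'.join([name, description] + list(genes)) + '\n'

# hypothetical gene set with made-up gene IDs
line = format_gmt_line('Example subsystem', ['geneA', 'geneB', 'geneC'])
print(repr(line))
```

The description defaults to `'na'` here because this notebook does not use that field.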


In [None]:
import cobra
import os
import numpy as np

In [None]:
# load Human-GEM model (note: may take 2-3 minutes to load, since it is a pretty large model)
model = cobra.io.load_yaml_model(os.path.join('data', 'models', 'Human-GEM.yml'))

## Subsystem-gene associations

First we extract the subsystem-gene associations. In Human-GEM each reaction is assigned to only one subsystem (other models may assign several), so we first find all the reactions within a subsystem, and then collect the genes associated with each of those reactions.
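The grouping idea can be seen on a toy example that does not require loading the model. This is only a sketch under stated assumptions: the reactions, subsystem names, and gene IDs below are made up, and the single-pass `defaultdict` approach is an alternative to the per-subsystem loop used in this notebook, not the same code.

```python
from collections import defaultdict

# hypothetical toy reactions: (reaction ID, subsystem, associated gene IDs)
toy_reactions = [
    ('R1', 'Glycolysis', {'geneA', 'geneB'}),
    ('R2', 'Glycolysis', {'geneB', 'geneC'}),
    ('R3', 'Transport', set()),  # a reaction with no gene association
]

# collect the union of genes over all reactions in each subsystem
subsystem_genes = defaultdict(set)
for rxn_id, subsystem, genes in toy_reactions:
    subsystem_genes[subsystem] |= genes

for name in sorted(subsystem_genes):
    print(name, sorted(subsystem_genes[name]))
```

Subsystems such as `Transport` in this toy data end up with an empty gene set, which is why empty sets are filtered out before writing the file.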

In [None]:
# retrieve gene associations for all subsystems
subsystems = np.unique([x.subsystem for x in model.reactions])
gene_assoc = []
for s in subsystems:
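    # note: `in` is list membership if x.subsystem is a list, but a substring
    # match if it is a plain string (use `s == x.subsystem` in that case)
    genes = [set(x.genes) for x in model.reactions if s in x.subsystem]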
    genes = list(set.union(*genes))
    gene_assoc.append(sorted([x.id for x in genes]))  # sort for consistency

In [None]:
# remove subsystems with no gene associations
subsystems = [x for i,x in enumerate(subsystems) if len(gene_assoc[i]) > 0]
gene_assoc = [x for x in gene_assoc if len(x) > 0]

In [None]:
# write subsystem-gene associations to .gmt file
# Note that the second column in a .gmt file is a description field that we are not using (filled with 'na').
merged_list = ['\t'.join([subsystems[i]] + ['na'] + gene_assoc[i]) + '\n' for i in range(len(subsystems))]
with open(os.path.join('data', 'gene_set_collections', 'HumanGEM_subsystem_GSC.gmt'), 'w') as f:
    f.writelines(merged_list)

## Metabolite-gene associations

Next we extract the metabolite-gene associations. There are two ways to do this, because GEMs treat metabolites as different if they are in different compartments, even if they are chemically identical. For example, `ATP_c` and `ATP_m` are both adenosine triphosphate, but one represents ATP in the cytoplasm and the other represents ATP in the mitochondria.

One can therefore include the compartment information in the metabolite-gene associations, such that `ATP_c` and `ATP_m` will be separate gene sets. However, since many of these metabolites participate in similar reactions (associated with the same genes) and/or because metabolite gene sets can already be quite small, it is often recommended that the compartment be ignored and all compartment forms of a metabolite be merged.
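The difference between the two approaches can be illustrated with a small toy example. This is a sketch only: the metabolite entries and gene IDs are invented for illustration and are not taken from Human-GEM.

```python
# hypothetical toy data: metabolite ID -> (name, compartment, gene IDs)
toy_mets = {
    'ATP_c': ('ATP', 'c', {'geneA', 'geneB'}),
    'ATP_m': ('ATP', 'm', {'geneB', 'geneC'}),
}

# keeping compartments: one gene set per compartment-specific metabolite
with_comp = {
    f'{name}[{comp}]': sorted(genes)
    for name, comp, genes in toy_mets.values()
}

# ignoring compartments: merge the gene sets of metabolites sharing a name
merged = {}
for name, comp, genes in toy_mets.values():
    merged.setdefault(name, set()).update(genes)
merged = {name: sorted(genes) for name, genes in merged.items()}

print(with_comp)
print(merged)
```

In this toy case the merged `ATP` set has three genes, whereas each compartment-specific set has only two.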

### Option 1: Excluding compartment (recommended)
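Metabolites that have the same name but a different cellular location (compartment) will be merged. Run either this option or Option 2, not both: each defines `metabolites` and `gene_assoc`, and whichever runs last is what gets written to file.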

In [None]:
# ignore compartments
metabolites = np.unique([x.name for x in model.metabolites])

In [None]:
# retrieve gene associations for all metabolites
gene_assoc = []
for met_name in metabolites:
    reactions = [set(m.reactions) for m in model.metabolites if m.name == met_name]
    reactions = list(set.union(*reactions))
    genes = [set(r.genes) for r in reactions]
    genes = set().union(*genes)  # also safe if no reactions were found
    gene_assoc.append(sorted([x.id for x in genes]))

### Option 2: Including compartment
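Metabolites with identical names but different cellular locations (compartments) will be treated as different metabolites. Running this after Option 1 overwrites the `metabolites` and `gene_assoc` variables from that option.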

In [None]:
# combine metabolite names with their compartment abbreviation
metabolites = [x.name + '[' + x.compartment + ']' for x in model.metabolites]

In [None]:
# retrieve gene associations for all metabolites
gene_assoc = []
for m in model.metabolites:
    genes = [set(r.genes) for r in m.reactions]
    genes = set().union(*genes)  # also safe if the metabolite has no reactions
    gene_assoc.append(sorted([x.id for x in genes]))

### Process and write to file

In [None]:
# remove metabolites with no gene associations
metabolites = [x for i,x in enumerate(metabolites) if len(gene_assoc[i]) > 0]
gene_assoc = [x for x in gene_assoc if len(x) > 0]

In [None]:
# some metabolites contain an apostrophe, which can disrupt parsing by some packages
metabolites = [x.replace("'", "") for x in metabolites]

In [None]:
# write metabolite-gene associations to .gmt file
# Note that the second column in a .gmt file is a description field that we are not using (filled with 'na').
merged_list = ['\t'.join([metabolites[i]] + ['na'] + gene_assoc[i]) + '\n' for i in range(len(metabolites))]
with open(os.path.join('data', 'gene_set_collections', 'HumanGEM_metabolite_GSC.gmt'), 'w') as f:
    f.writelines(merged_list)
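As an optional sanity check, a written `.gmt` file can be read back into a dictionary. The sketch below assumes the tab-separated layout used above; `read_gmt` is a hypothetical helper written for this illustration, and it is demonstrated on a small temporary file with made-up content rather than on the real output.

```python
import os
import tempfile

def read_gmt(path):
    """Parse a .gmt file into a dict of set name -> list of gene IDs."""
    gene_sets = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 3:
                continue  # skip blank lines or sets without any genes
            gene_sets[fields[0]] = fields[2:]  # fields[1] is the description
    return gene_sets

# demo on a small temporary file with made-up content
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.gmt')
    with open(demo_path, 'w') as f:
        f.write('ATP\tna\tgeneA\tgeneB\n')
    demo_sets = read_gmt(demo_path)

print(demo_sets)
```

Pointing `read_gmt` at `HumanGEM_metabolite_GSC.gmt` or `HumanGEM_subsystem_GSC.gmt` and checking the number of sets against `len(metabolites)` or `len(subsystems)` is one way to confirm the export.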