# Integrative analysis of pathway deregulation in obesity #

## Python implementation

### Steps (according to the paper)

1. Probes containing missing values are excluded from the analysis. 

2. Probes are mapped to Entrez ID labels if they are available in the associated platform. Otherwise, the DAVID portal is used to convert the available labels to Entrez ID labels. 

3. Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary). 

4. Probes mapping to the same Entrez ID label are averaged out. 

5. Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all. 

6. We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.

After these steps, each data set or batch is represented by a single expression matrix $X$. Each entry $X_{ij}$ represents the log2 of the expression intensity of gene $i$ in sample $j$.
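
Step 6 can be illustrated on a toy matrix. The sketch below is not the paper's code: the function name `l1_normalize_log2` and the input values are illustrative assumptions. It assumes the input is already log2-transformed, with genes as rows and samples as columns, and rescales each sample so that its linear-space expression sums to 1.

```python
import numpy as np

def l1_normalize_log2(log2_matrix):
    """L1-normalize each column (sample) in linear space, then return log2 values."""
    linear = np.exp2(log2_matrix)                        # back to linear space
    linear = linear / linear.sum(axis=0, keepdims=True)  # each sample now sums to 1
    return np.log2(linear)

# Toy log2 expression matrix: 3 genes (rows) x 2 samples (columns)
X = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [2.0, 2.0]])

X_norm = l1_normalize_log2(X)
column_sums = np.exp2(X_norm).sum(axis=0)  # linear-space sum of every sample
```

Because each sample is divided by a single constant, differences between genes within a sample are unchanged in log2 space; only a per-sample offset is subtracted.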

### Imports

In [1]:
# Import std libraries
import os
from operator import itemgetter 
import re

# Import third party
import numpy as np
import pandas as pd
import GEOparse

# Set logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logging.getLogger("GEOparse").setLevel(logging.WARNING)

### Load Dataset

Download the dataset (if needed) and load it.

Some GEOparse names:
- DataSet (GDS)
- Series (GSE)
- Platform (GPL)
- Samples (GSM)

In [2]:
def load_dataset(dataset_id):
    """
    Load the dataset from disk (or download it if it does not exist)
    Arguments:
    - dataset_id: the ID of the dataset to load
    
    Output:
    - GSE object (GEOparse Series)
    """
    path = "./" + dataset_id + "_family.soft.gz"
    if os.path.exists(path):
        # Load from an existing file
        print("- Loading from", path)
        gse = GEOparse.get_GEO(filepath=path)
    else:
        # Download GSE and load it
        print("- Downloading", dataset_id)
        gse = GEOparse.get_GEO(geo=dataset_id, destdir="./")
    return gse

dataset_id = "GSE26637"
gse = load_dataset(dataset_id)
print("- Dataset loaded")

- Loading from ./GSE26637_family.soft.gz


- Dataset loaded


### Get Info

Get some useful info and statistics from our data.

We're going to extract:
- number of platforms
- number of samples
- dimension of each sample

In [3]:
# data_frames contains the data frame of each sample
# samples_information contains a (GEO accession, identifier) tuple for each data frame
data_frames = []
samples_information = []

# Not used below, but helps to find samples that share an identifier in the original experiment
association_dictio = {}

for gsm_name, gsm in gse.gsms.items():
    # The identifier is built from tokens 1 to 3 of the sample title, e.g. "Insulin Sensitive 7"
    list_identifier = gsm.metadata['title'][0].split()
    identifier = " ".join(list_identifier[1:4])
    accession = gsm.metadata['geo_accession'][0]
    association_dictio.setdefault(identifier, []).append((gsm.table, accession))

    data_frames.append(gsm.table)
    samples_information.append((accession, identifier))

samples_information_reduced = []
for el in samples_information:
    if "Sensitive" in el[1]:
        samples_information_reduced.append(el[0] + "_" + "LF")
    elif "Resistant" in el[1]:
        samples_information_reduced.append(el[0] + "_" + "OF")

print("- Number of platforms", len(gse.gpls.items()))
print("- Number of samples", len(data_frames), "Remember, all Female in this dataset")
print("- Dimension of each sample (assuming all are the same)", data_frames[0].shape)
print('\nExample of a sample dataframe:')
data_frames[0].head()
print(data_frames[0].shape)
print(samples_information_reduced)
print(samples_information)

- Number of platforms 1
- Number of samples 20 Remember, all Female in this dataset
- Dimension of each sample (assuming all are the same) (54675, 2)

Example of a sample dataframe:
(54675, 2)
['GSM655612_LF', 'GSM655610_LF', 'GSM655614_OF', 'GSM655621_LF', 'GSM655617_OF', 'GSM655606_OF', 'GSM655603_OF', 'GSM655615_OF', 'GSM655618_LF', 'GSM655619_LF', 'GSM655607_OF', 'GSM655622_LF', 'GSM655620_LF', 'GSM655613_OF', 'GSM655608_LF', 'GSM655609_LF', 'GSM655616_OF', 'GSM655605_OF', 'GSM655604_OF', 'GSM655611_LF']
[('GSM655612', 'Insulin Sensitive 7'), ('GSM655610', 'Insulin Sensitive 4'), ('GSM655614', 'Insulin Resistant 14'), ('GSM655621', 'Insulin Sensitive 5'), ('GSM655617', 'Insulin Resistant 9'), ('GSM655606', 'Insulin Resistant 21'), ('GSM655603', 'Insulin Resistant 1'), ('GSM655615', 'Insulin Resistant 20'), ('GSM655618', 'Insulin Sensitive 13'), ('GSM655619', 'Insulin Sensitive 3'), ('GSM655607', 'Insulin Resistant 9'), ('GSM655622', 'Insulin Sensitive 7'), ('GSM655620', 'Insulin Sen
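
The labelling rule used above (identifiers containing "Sensitive" get the `_LF` suffix, those containing "Resistant" get `_OF`) can be sketched in isolation. The helper name `group_label`, the accessions, and the titles below are hypothetical and only serve the illustration; the real titles come from the GEO metadata.

```python
def group_label(accession, identifier):
    """Return the accession with an '_LF' or '_OF' suffix, or None if neither keyword matches."""
    if "Sensitive" in identifier:
        return accession + "_LF"
    if "Resistant" in identifier:
        return accession + "_OF"
    return None

# Hypothetical accessions and titles, invented for this example
toy_titles = {
    "GSM_A": "Sample Insulin Sensitive 1",
    "GSM_B": "Sample Insulin Resistant 2",
}

toy_labels = []
for accession, title in toy_titles.items():
    identifier = " ".join(title.split()[1:4])  # tokens 1 to 3 of the title
    toy_labels.append(group_label(accession, identifier))
```

Returning `None` for an unmatched identifier makes such samples visible. The notebook loop skips them instead, which would leave the label list shorter than `data_frames` if an unexpected title ever appeared.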

### Filter data #1

We're going to:
- Remove from the mapper the probes that map to multiple Entrez IDs
- Construct a Python dictionary containing the valid probes and their Entrez ID
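
The same filtering idea can be sketched with vectorized pandas operations on a toy platform table. This is an assumption-laden illustration, not the notebook's implementation: `build_mapper` and `toy_platform` are hypothetical names, and the sketch does not handle a probe ID that appears on several rows with conflicting Entrez IDs.

```python
import pandas as pd

# Toy platform annotation table (values invented for illustration)
toy_platform = pd.DataFrame({
    "ID": ["p1", "p2", "p3", "p4"],
    "ENTREZ_GENE_ID": ["100", "200 /// 300", None, "100"],
})

def build_mapper(platform_table):
    """Map probe ID -> Entrez ID, keeping only probes with exactly one Entrez ID."""
    # Drop probes without any Entrez ID
    annotated = platform_table.dropna(subset=["ENTREZ_GENE_ID"])
    # Drop probes listing several Entrez IDs separated by '///'
    entrez = annotated["ENTREZ_GENE_ID"].astype(str)
    unique = annotated[~entrez.str.contains("/", regex=False)]
    return dict(zip(unique["ID"], unique["ENTREZ_GENE_ID"]))

toy_mapper = build_mapper(toy_platform)
```

Here `p2` is dropped because it lists two Entrez IDs and `p3` because it has none, while `p1` and `p4` both map to the same gene and are kept.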

In [4]:
def create_mapper(meta_data_tables):
    """
    Return a Python dictionary mapping each probe ID to its Entrez ID.
    Important: not all probe IDs are mapped to an ENTREZ_GENE_ID;
    probe IDs without an Entrez ID are not added to the dictionary.
    """
    mapper = {}
    for df in meta_data_tables:
        for index, row in df.iterrows():
            probs_id = row['ID']
            
            if probs_id in mapper and mapper[probs_id] != row['ENTREZ_GENE_ID']:
                # Conflicting Entrez IDs for the same probe:
                # set the value to None to invalidate the mapping
                # (None entries are removed below)
                mapper[probs_id] = None
                
            if probs_id not in mapper and not pd.isnull(row['ENTREZ_GENE_ID']):
                mapper[probs_id] = row['ENTREZ_GENE_ID']
            
    # Remove invalid mappings (value is None) and probes linked to
    # several Entrez IDs (listed with /// as separator)
    filtered_mapper = {k: v for k, v in mapper.items() if v is not None and '/' not in str(v)}
    
    return filtered_mapper

meta_data_tables = []
for gpl_name, gpl in gse.gpls.items():
    meta_data_tables.append(gpl.table)

mapper = create_mapper(meta_data_tables)
print("- Mapper loaded", len(mapper))

- Mapper loaded 41834


### Filter data #2: 
#### Remove rows without a matching Entrez ID
Intuition: 
1. Convert the dictionary **mapper** into a pandas DataFrame (**mapper_df**).  
2. Use a SQL-like inner join to merge **mapper_df** with each sample's existing pandas DataFrame.  
An inner join creates a new DataFrame with *only* the matching rows.
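
A minimal sketch of such an inner join on toy data (all names and values here are invented for illustration): the sample table carries probe IDs in a column, the mapper carries them in its index, and only probes present in both survive.

```python
import pandas as pd

# Toy sample table: one row per probe
sample = pd.DataFrame({
    "ID_REF": ["p1", "p2", "p3"],
    "VALUE": [7.5, 3.2, 9.1],
})

# Toy mapper: probe ID in the index, Entrez ID as the only column
mapper_toy = pd.DataFrame({"ENTREZ_GENE_ID": ["100", "200"]}, index=["p1", "p3"])

# Inner join: left column 'ID_REF' against the right index
joined = pd.merge(sample, mapper_toy, how="inner",
                  left_on="ID_REF", right_index=True)
```

Probe `p2` has no entry in the mapper, so it is absent from `joined`.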

References:
- https://www.w3schools.com/sql/sql_join_inner.asp
- https://pandas.pydata.org/pandas-docs/stable/merging.html

In [5]:
# Convert mapper to a pandas DataFrame indexed by probe ID, with a single column (ENTREZ_GENE_ID)
mapper_df = pd.DataFrame.from_dict(mapper, orient='index')
mapper_df.index.name = 'ID_REF'
mapper_df.columns = ['ENTREZ_GENE_ID']
print("\nExample of the obtained Pandas DataFrame:")
print(mapper_df.head())

# Create a mapper (sample_id, person_id)
#mapper_sample_person = pd.DataFrame(samples_label)
#mapper_sample_person = mapper_sample_person.set_index([samples_name])
#mapper_sample_person = mapper_sample_person.transpose()
#mapper_sample_person.head()


Example of the obtained Pandas DataFrame:
            ENTREZ_GENE_ID
ID_REF                    
228401_at            29028
234342_at            56975
202971_s_at           8445
1553668_at           84859
236655_at             7163


### Creation of a Dataframe Entrez ID - Value: genes common to all arrays
For each sample:
1. Rows with the same **entrez_id** are merged together using the average (probes mapping to the same Entrez ID are averaged out).
2. The averaged values, which the code treats as already being **log2(values)**, are converted to linear space, L1-normalized so that they sum to 1, and converted back to log2.
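
Averaging the probes that share an Entrez ID can be sketched with a `groupby` on a toy table (values invented for illustration). Selecting the `VALUE` column explicitly keeps non-numeric columns, such as probe IDs, out of the mean.

```python
import pandas as pd

# Toy table after the inner join: two probes map to gene "100", one to gene "200"
toy = pd.DataFrame({
    "ID_REF": ["p1", "p4", "p3"],
    "ENTREZ_GENE_ID": ["100", "100", "200"],
    "VALUE": [2.0, 4.0, 5.0],
})

# One row per Entrez ID, holding the mean of its probes
averaged = toy.groupby("ENTREZ_GENE_ID")[["VALUE"]].mean()
```

The result is indexed by `ENTREZ_GENE_ID` and has a single `VALUE` column, which is the shape the later column-wise concatenation relies on.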

In [6]:
# Build the list 'data_frames_entrez' from the sample data frames in 'data_frames'.
# For each sample's data frame:
# - keep only probes with a matching Entrez ID (inner join)
# - average probes mapping to the same Entrez ID
# - L1 normalization in linear space (VALUE is assumed to be log2 already)
# - convert back to log2
data_frames_entrez = []
for df in data_frames:
    df = pd.merge(df, mapper_df, how='inner', left_on=['ID_REF'], right_index=True, sort=False)
    # Select VALUE explicitly so that the non-numeric ID_REF column is not averaged
    df = df.groupby('ENTREZ_GENE_ID')[['VALUE']].mean()
    df['VALUE'] = 2**df['VALUE']
    df.VALUE = df.VALUE / df.VALUE.sum()
    df['VALUE'] = np.log2(df['VALUE'])
    data_frames_entrez.append(df)

""" part that has to deal with multiple sample for the same person
# ok, now we have to concat all the dataframes of each person
data_frames_per_person = []

# remove repeated elements from samples_label
unique_person_label = list(set(samples_label))

for person_label in unique_person_label:
    
    # get index of repeated elements in samples_label that has the value = person_label
    index_to_merge = [i for i, x in enumerate(samples_label) if x == person_label]
    
    # get elements from data_frames_entrez given the index in index_to_merge
    arrays = list(itemgetter(*index_to_merge)(data_frames_entrez))
    
    # if arrays contains the same subset of entrez_id, do the mean
    all_gene_person = pd.concat(arrays).groupby('ENTREZ_GENE_ID').mean()
    
    # save the dataset with all the 5 arrays inside data_frames_per_person
    data_frames_per_person.append(all_gene_person)
"""

# Concatenate the data frames column-wise, using the sample labels (GSM accession + group suffix) as keys
merged_entrez_value_df = pd.concat(data_frames_entrez, axis=1, keys=samples_information_reduced)
print("size before any removal", merged_entrez_value_df.shape)

# Remove genes with at least one missing value
merged_entrez_value_df = merged_entrez_value_df.dropna(axis=0, how='any')
print("size after removing genes with missing value", merged_entrez_value_df.shape)

# Drop the redundant "VALUE" column level
merged_entrez_value_df.columns = merged_entrez_value_df.columns.droplevel([1])

print("\nExample of the obtained merged data frame:")
merged_entrez_value_df.head()

size before any removal (20486, 20)
size after removing genes with missing value (20486, 20)

Example of the obtained merged data frame:


| ENTREZ_GENE_ID | GSM655612_LF | GSM655610_LF | GSM655614_OF | GSM655621_LF | GSM655617_OF | GSM655606_OF | GSM655603_OF | GSM655615_OF | GSM655618_LF | GSM655619_LF | GSM655607_OF | GSM655622_LF | GSM655620_LF | GSM655613_OF | GSM655608_LF | GSM655609_LF | GSM655616_OF | GSM655605_OF | GSM655604_OF | GSM655611_LF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -19.817234 | -19.432509 | -19.639501 | -19.493835 | -19.760796 | -19.680379 | -19.755866 | -19.854615 | -19.534963 | -19.683329 | -19.55326 | -19.782791 | -19.362538 | -19.531939 | -19.546581 | -19.669911 | -19.773928 | -19.82005 | -19.59024 | -19.752417 |
| 10 | -18.617234 | -19.362509 | -19.559501 | -19.443835 | -19.760796 | -19.630379 | -19.725866 | -19.794615 | -19.454963 | -19.623329 | -19.60326 | -19.752791 | -19.362538 | -18.341939 | -19.496581 | -19.659911 | -19.723928 | -19.75005 | -19.53024 | -19.592417 |
| 100 | -15.002234 | -15.542509 | -15.539501 | -15.853835 | -14.160796 | -15.810379 | -15.790866 | -14.339615 | -16.679963 | -14.273329 | -15.77826 | -14.352791 | -16.357538 | -15.331939 | -15.861581 | -15.474911 | -14.798928 | -15.12005 | -15.57524 | -16.367417 |
| 1000 | -18.932234 | -18.552509 | -18.719501 | -18.583835 | -18.875796 | -18.855379 | -18.905866 | -19.039615 | -18.429963 | -18.793329 | -18.73826 | -18.897791 | -18.347538 | -18.516939 | -18.646581 | -18.729911 | -18.923928 | -18.93505 | -18.71524 | -18.867417 |
| 10000 | -17.400091 | -16.268223 | -16.922358 | -16.613835 | -16.913653 | -16.864665 | -16.800152 | -16.870329 | -17.072106 | -16.994757 | -16.986117 | -17.405648 | -17.176823 | -16.70051 | -16.775153 | -17.109911 | -16.949642 | -17.22005 | -16.565955 | -16.938131 |
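
How the per-sample frames end up in one gene-by-sample table can be sketched on two toy samples (labels and values invented for illustration): `pd.concat` with `keys` builds a two-level column index, `dropna` keeps only genes measured in every sample, and `droplevel` removes the inner `VALUE` level.

```python
import pandas as pd

# Two toy per-sample frames indexed by Entrez ID; only gene "200" is shared
sample_a = pd.DataFrame({"VALUE": [1.0, 2.0]},
                        index=pd.Index(["100", "200"], name="ENTREZ_GENE_ID"))
sample_b = pd.DataFrame({"VALUE": [3.0, 4.0]},
                        index=pd.Index(["200", "300"], name="ENTREZ_GENE_ID"))

# Column-wise concatenation; columns become (sample label, "VALUE") pairs
merged = pd.concat([sample_a, sample_b], axis=1, keys=["S1_LF", "S2_OF"])
rows_before = merged.shape[0]

# Keep only genes present in all samples
merged = merged.dropna(axis=0, how="any")

# Flatten the columns to the sample labels
merged.columns = merged.columns.droplevel(1)
```

In the notebook output above, the row count is the same before and after `dropna`, which suggests every sample already covers the same set of genes; the toy data deliberately differs to show the filter at work.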


In [7]:
# Make sure the output directory exists before saving
os.makedirs("data", exist_ok=True)
merged_entrez_value_df.to_pickle("data/GSE26637_table.pkl")