# Integrative analysis of pathway deregulation in obesity #

## Python implementation

### Steps (according to the paper)

1. Probes containing missing values are excluded from the analysis. 

2. Probes are mapped to Entrez ID labels if these are available in the associated platform. Otherwise, the DAVID portal is used to convert the available labels to Entrez ID labels.

3. Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary). 

4. Probes mapping to the same Entrez ID label are averaged out. 

5. Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all. 

6. We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples. After these steps, each data set or batch is represented by a single expression matrix X. Each entry X_ij represents the log2 of the expression intensity of gene i in sample j.
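
As a minimal sketch of step 6, the snippet below normalizes a small matrix of made-up intensities so that every sample (column) sums to the same constant, then takes the log2. The numbers and the choice of 1 as the constant are assumptions for illustration only.

```python
import numpy as np

# Hypothetical linear-scale intensities: rows are genes, columns are samples
intensities = np.array([
    [100.0, 400.0],
    [300.0, 400.0],
    [600.0, 1200.0],
])

# L1 normalization in linear space: every column (sample) sums to 1
normalized = intensities / intensities.sum(axis=0, keepdims=True)

# log2 transform of the normalized values, giving a matrix like X above
X = np.log2(normalized)

print(normalized.sum(axis=0))
print(X)
```

After this step the samples are comparable in total expression, whatever the overall intensity of the original arrays was.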

### Imports

In [1]:
# Import std libraries
import os
from operator import itemgetter 

# Import third party
import numpy as np
import pandas as pd
import GEOparse

# Set logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logging.getLogger("GEOparse").setLevel(logging.WARNING)

### Load Dataset

Download the dataset (if needed) and load it.

Some GEOparse names:
- DataSet (GDS)
- Series (GSE)
- Platform (GPL)
- Samples (GSM)
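
As a small illustration of these names, the sketch below maps the three-letter prefix of an accession to its entity type. The helper `geo_entity_type` is hypothetical and is not part of GEOparse.

```python
# Accession prefixes and the entity they denote (from the list above)
GEO_PREFIXES = {
    "GDS": "DataSet",
    "GSE": "Series",
    "GPL": "Platform",
    "GSM": "Sample",
}

def geo_entity_type(accession):
    """Return the entity type of an accession such as 'GSE2508' (hypothetical helper)."""
    prefix = accession[:3].upper()
    if prefix not in GEO_PREFIXES:
        raise ValueError("Unknown accession prefix: " + prefix)
    return GEO_PREFIXES[prefix]

print(geo_entity_type("GSE2508"))
```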

In [2]:
def load_dataset(dataset_id):
    """
    Load the dataset from disk (or download it if it does not exist)
    Arguments:
    - dataset_id: the ID of the dataset to load
    
    Output:
    - GSE object (GEOparse Series)
    """
    path = "./" + dataset_id + "_family.soft.gz"
    if os.path.exists(path):
        # Load from an existing file
        print("- Loading from", path)
        gse = GEOparse.get_GEO(filepath=path)
    else:
        # Download GSE and load it
        print("- Downloading", dataset_id)
        gse = GEOparse.get_GEO(geo=dataset_id, destdir="./")
    return gse

dataset_id = "GSE2508"
gse = load_dataset(dataset_id)
print("- Dataset loaded")

- Downloading GSE2508
D: 100% - 25.5MiB  / 25.5MiB  eta 0:00:01
- Dataset loaded


### Get Info

Get some useful info and statistics from our data.

We're going to extract:
- number of platforms
- number of samples
- dimension of each sample

In [3]:
# data_frames holds one DataFrame per sample (GSM)
# samples_information holds the GSM name and the subject label of each sample
data_frames = []
samples_information = []
for gsm_name, gsm in gse.gsms.items():
    subject_info = gsm.metadata['title'][0].split(' ')
    subject_label = subject_info[0][0] + subject_info[1][0] + subject_info[2]
    data_frames.append(gsm.table)
    samples_information.append({'name': gsm_name, 'label': subject_label})

samples_name = [d['name'] for d in samples_information]
samples_label = [d['label'] for d in samples_information]

print("- Number of platforms", len(gse.gpls.items()))
print("- Number of samples", len(data_frames))
print("- Dimension of each sample (assuming are all the same)", data_frames[0].shape)
print('\nExample of a sample dataframe:')
data_frames[0].head()

- Number of platforms 6
- Number of samples 195
- Dimension of each sample (assuming are all the same) (12626, 3)

Example of a sample dataframe:


|   | ID_REF | VALUE | ABS_CALL |
|---|---|---|---|
| 0 | 100_g_at | 3993.9 | A |
| 1 | 1000_at | 7829.6 | P |
| 2 | 1001_at | 1081.4 | P |
| 3 | 1002_f_at | 82.5 | A |
| 4 | 1003_s_at | 864.1 | A |
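
The subject label built in the loop above takes the first letter of the first two words of the sample title plus the third word. A minimal sketch, assuming a title of the form shown (the example title is hypothetical, not copied from the dataset):

```python
def subject_label_from_title(title):
    """Build a short subject label from a sample title (hypothetical helper)."""
    parts = title.split(' ')
    # first letter of word 1 + first letter of word 2 + word 3
    return parts[0][0] + parts[1][0] + parts[2]

# Hypothetical title: group, sex, subject number, then any further text
print(subject_label_from_title("Lean Male 09 U95A"))  # prints LM09
```

Samples that share the same first three words therefore get the same label, which is what later allows the arrays of one person to be grouped.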


### Filter data #1

We're going to:
- Remove from the mapper the probes that map to multiple Entrez IDs
- Construct a Python dictionary containing the valid probes and their Entrez ID

In [4]:
def create_mapper(meta_data_tables):
    """
    Returns a Python dictionary that represents our mapper object (probe ID -> Entrez ID)
    Important: not all probe IDs are mapped to an ENTREZ_GENE_ID
    Probe IDs without an Entrez ID are not added to the dictionary
    """
    mapper = {}
    for df in meta_data_tables:
        for index, row in df.iterrows():
            probs_id = row['ID']
            
            if probs_id in mapper and mapper[probs_id] != row['ENTREZ_GENE_ID']:
                # Multiple Entrez IDs for the same probe
                # Set the value to None to invalidate the probe
                # Elements set to None are removed below
                mapper[probs_id] = None
                
            if probs_id not in mapper and not pd.isnull(row['ENTREZ_GENE_ID']):
                mapper[probs_id] = row['ENTREZ_GENE_ID']
            
    # Remove invalid mappings (value = None)
    # (Some probes are linked to multiple Entrez IDs, using /// as separator)
    filtered_mapper = {k: v for k, v in mapper.items() if v is not None and '/' not in v}
    
    return filtered_mapper

meta_data_tables = []
for gpl_name, gpl in gse.gpls.items():
    meta_data_tables.append(gpl.table)

mapper = create_mapper(meta_data_tables)
print("- Mapper loaded", len(mapper))

- Mapper loaded 39211
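
The same filtering idea can be sketched without a row-by-row loop. The snippet below is a vectorized alternative on a made-up annotation table, not the notebook's own method. The probe IDs and Entrez IDs are hypothetical, and edge cases (for example a probe that appears once with and once without an Entrez ID) may be handled differently from `create_mapper`.

```python
import pandas as pd

# Hypothetical platform annotation table
platform = pd.DataFrame({
    "ID": ["p1", "p2", "p3", "p4", "p1", "p5", "p5"],
    "ENTREZ_GENE_ID": ["10", "20 /// 30", None, "40", "10", "50", "60"],
})

# Drop probes without any Entrez ID (p3)
annotated = platform.dropna(subset=["ENTREZ_GENE_ID"])

# Keep probes that are always annotated with the same Entrez ID (drops p5)
n_ids = annotated.groupby("ID")["ENTREZ_GENE_ID"].nunique()
unique_probes = n_ids[n_ids == 1].index
series = (
    annotated[annotated["ID"].isin(unique_probes)]
    .drop_duplicates("ID")
    .set_index("ID")["ENTREZ_GENE_ID"]
)

# Drop probes whose annotation lists several IDs separated by /// (p2)
toy_mapper = series[~series.str.contains("/")].to_dict()
print(toy_mapper)
```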


### Filter data #2: 
#### Remove rows without a matching Entrez ID
Intuition: 
1. Convert the dictionary **mapper** into a pandas' DataFrame (**mapper_df**).  
2. Use a SQL-like inner join to merge **mapper_df** with the existing pandas' DataFrame.  
Inner join creates a new Dataframe with *only* the matching rows.

References:
- https://www.w3schools.com/sql/sql_join_inner.asp
- https://pandas.pydata.org/pandas-docs/stable/merging.html
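
A minimal sketch of the inner join on toy data: the sample table and the mapper below are made up, but the `pd.merge` call has the same shape as the one used later in this notebook.

```python
import pandas as pd

# Hypothetical sample table with three probes
sample = pd.DataFrame({
    "ID_REF": ["p1", "p2", "p3"],
    "VALUE": [100.0, 200.0, 300.0],
})

# Hypothetical mapper: only p1 and p3 have an Entrez ID
toy_mapper_df = pd.DataFrame(
    {"ENTREZ_GENE_ID": ["10", "40"]},
    index=pd.Index(["p1", "p3"], name="ID_REF"),
)

# Inner join: rows of `sample` whose ID_REF is missing from the mapper are dropped
joined = pd.merge(sample, toy_mapper_df, how="inner",
                  left_on="ID_REF", right_index=True)
print(joined)
```

Probe `p2` has no entry in the mapper, so it does not appear in `joined`.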

In [5]:
# Convert mapper to a pandas DataFrame indexed by probe ID (ID_REF) with one column (ENTREZ_GENE_ID)
mapper_df = pd.DataFrame.from_dict(mapper, orient='index')
mapper_df.index.name = 'ID_REF'
mapper_df.columns = ['ENTREZ_GENE_ID']
print("\nExample of the obtained Pandas DataFrame:")

# Create a mapper (sample_id, person_id)
mapper_sample_person = pd.DataFrame(samples_label)
mapper_sample_person = mapper_sample_person.set_index([samples_name])
mapper_sample_person = mapper_sample_person.transpose()
mapper_sample_person.head()

mapper_person_sample = {key: value for (key, value) in zip(samples_label, samples_name)}


Example of the obtained Pandas DataFrame:


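### Creation of a Dataframe Entrez ID - Value: genes common to all arrays
For each sample:
1. Rows with the same **entrez_id** are merged together using the average (probes mapping to the same Entrez ID are averaged out).
2. The averaged values are L1-normalized so that they sum to 1 within the sample.
3. The normalized values are converted to **log2(values)**.

The samples belonging to the same person are then averaged, and genes with a missing value for any person are removed.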
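
The per-sample processing done in the next cell can be tried on a toy table first. This is a sketch with made-up values: two probes map to gene `10` and one probe maps to gene `40`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample after the inner join with the mapper
toy = pd.DataFrame({
    "ENTREZ_GENE_ID": ["10", "10", "40"],
    "VALUE": [100.0, 300.0, 600.0],
})

# Average the probes that map to the same Entrez ID: gene 10 -> 200, gene 40 -> 600
per_gene = toy.groupby("ENTREZ_GENE_ID")[["VALUE"]].mean()

# L1 normalization in linear space: the values of the sample sum to 1
per_gene["VALUE"] = per_gene["VALUE"] / per_gene["VALUE"].sum()

# log2 of the normalized values: gene 10 -> log2(0.25) = -2
per_gene["VALUE"] = np.log2(per_gene["VALUE"])
print(per_gene)
```

Because the normalized values are fractions of the sample total, the log2 values are negative, which matches the range seen in the merged tables of this notebook.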

In [6]:
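# Build one per-gene DataFrame for each sample. For each sample:
# - keep only the probes with a matching Entrez ID (inner join with mapper_df)
# - average the probes mapping to the same Entrez ID
# - L1-normalize in linear space (values sum to 1 within the sample)
# - log2-transform the normalized values
data_frames_entrez = []
for df in data_frames:
    df = pd.merge(df, mapper_df, how='inner', left_on=['ID_REF'], right_index=True, sort=False)
    # Select VALUE explicitly so that non-numeric columns (ID_REF, ABS_CALL) are not averaged
    df = df.groupby('ENTREZ_GENE_ID')[['VALUE']].mean()
    df['VALUE'] = df['VALUE'] / df['VALUE'].sum()
    df['VALUE'] = np.log2(df['VALUE'])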
    data_frames_entrez.append(df)

# ok, now we have to concat all the dataframes of each person
data_frames_per_person = []

# remove repeated elements from samples_label
unique_person_label = list(set(samples_label))

for person_label in unique_person_label:
    
    # indices of the samples in samples_label that belong to person_label
    index_to_merge = [i for i, x in enumerate(samples_label) if x == person_label]
    
    # per-gene dataframes of those samples (also works when a person has a single sample)
    arrays = [data_frames_entrez[i] for i in index_to_merge]
    
    # if arrays contains the same subset of entrez_id, do the mean
    all_gene_person = pd.concat(arrays).groupby('ENTREZ_GENE_ID').mean()
    
    # save the dataset with all the 5 arrays inside data_frames_per_person
    data_frames_per_person.append(all_gene_person)

# concat data_frames by columns and use the people's label as index
merged_entrez_value_df = pd.concat(data_frames_per_person, axis=1, keys=unique_person_label)
print("size before any removal", merged_entrez_value_df.shape)

# remove genes with at least 1 missing value
merged_entrez_value_df = merged_entrez_value_df.dropna(axis=0, how='any')
print("size after removing genes with missing value", merged_entrez_value_df.shape)

# drop the redundant "VALUE" column level
merged_entrez_value_df.columns = merged_entrez_value_df.columns.droplevel([1])

print("\nExample of the obtained merged data frame:")
merged_entrez_value_df.tail()

size before any removal (18281, 39)
size after removing genes with missing value (18280, 39)

Example of the obtained merged data frame:


|   | LM09 | OM04 | LF08 | LM08 | LM06 | OF01 | OF06 | LF10 | LF01 | OF08 | ... | LM03 | OF10 | LM07 | LM01 | OF09 | LM10 | LF02 | OM08 | LF06 | OM02 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9991 | -15.015017 | -14.029182 | -14.730684 | -14.436887 | -14.690659 | -14.005998 | -14.297691 | -15.104663 | -15.138182 | -13.993873 | ... | -14.022922 | -14.461315 | -14.117349 | -13.890242 | -14.678685 | -14.365473 | -14.219301 | -14.633491 | -15.221065 | -14.112547 |
| 9992 | -11.967246 | -13.369152 | -11.645316 | -13.391123 | -12.29095 | -12.596421 | -11.387416 | -12.773683 | -11.899348 | -11.462925 | ... | -12.082758 | -10.817922 | -13.561328 | -12.627455 | -12.640967 | -11.437825 | -12.848358 | -13.063657 | -11.182727 | -13.212526 |
| 9993 | -12.579342 | -12.596001 | -12.617639 | -12.455286 | -12.62593 | -12.872774 | -12.412381 | -12.565632 | -12.621854 | -12.697331 | ... | -12.342435 | -12.848545 | -12.72783 | -12.683825 | -12.515396 | -12.435501 | -12.468785 | -12.556916 | -12.342415 | -13.214788 |
| 9994 | -15.99981 | -16.399624 | -17.434541 | -15.180187 | -16.60509 | -16.427043 | -17.080937 | -16.794241 | -16.996244 | -16.210221 | ... | -16.576245 | -16.514249 | -16.606163 | -16.639169 | -17.513064 | -17.703505 | -17.460904 | -16.2129 | -16.43949 | -16.291632 |
| 9997 | -16.212005 | -16.484908 | -16.281233 | -16.092018 | -16.429553 | -14.738354 | -14.249296 | -16.022438 | -16.19862 | -15.047503 | ... | -17.19411 | -14.627341 | -16.252034 | -16.155454 | -14.792721 | -15.872621 | -16.524631 | -16.525529 | -16.487183 | -16.068297 |
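
The merge above relies on `pd.concat` with `keys`, which produces a two-level column index (person label, then `VALUE`). A minimal sketch on two made-up per-person tables shows how genes missing for any person are dropped and how the second level is removed. The labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-person tables; gene 40 and gene 99 are each present for one person only
person_a = pd.DataFrame({"VALUE": [-1.0, -2.0]},
                        index=pd.Index(["10", "40"], name="ENTREZ_GENE_ID"))
person_b = pd.DataFrame({"VALUE": [-3.0, -4.0]},
                        index=pd.Index(["10", "99"], name="ENTREZ_GENE_ID"))

# Side-by-side concatenation: columns become (label, "VALUE") pairs
merged = pd.concat([person_a, person_b], axis=1, keys=["LM01", "OF02"])

# Keep only the genes measured for every person
merged = merged.dropna(axis=0, how="any")

# Drop the inner "VALUE" level so that columns are just the person labels
merged.columns = merged.columns.droplevel(1)
print(merged)
```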


To stay consistent with the tables of the other batches, the column index (person_id) is replaced with one (arbitrary) sample_id belonging to that person, followed by the two-letter group prefix of the person's label (for example `GSM47301_LM`).

In [7]:
# get columns index
indexes = list(merged_entrez_value_df.columns)

# using the mapper_person_sample
new_indexes = [mapper_person_sample[x] + '_' + x[0:2] for x in indexes]

# copy the matrix (an explicit copy leaves merged_entrez_value_df unchanged)
final_matrix = merged_entrez_value_df.copy()

# replace the columns index
final_matrix.columns = new_indexes

# give a name to the row index
final_matrix.index.name = 'ENTREZ_GENE_ID'

# preview
final_matrix.head()

| ENTREZ_GENE_ID | GSM47301_LM | GSM47856_OM | GSM47568_LF | GSM47840_LM | GSM47362_LM | GSM47581_OF | GSM47372_OF | GSM47395_LF | GSM47561_LF | GSM47326_OF | ... | GSM47398_LM | GSM47415_OF | GSM47839_LM | GSM47357_LM | GSM47414_OF | GSM47405_LM | GSM47225_LF | GSM47598_OM | GSM47352_LF | GSM47378_OM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -14.333303 | -13.13712 | -13.390124 | -15.177814 | -13.535999 | -13.198388 | -13.376763 | -15.204678 | -13.313768 | -15.188506 | ... | -15.204893 | -15.007754 | -15.17063 | -14.719135 | -14.708128 | -15.160781 | -13.928214 | -13.647916 | -14.346581 | -14.279093 |
| 10 | -17.913243 | -18.478254 | -18.356992 | -19.541172 | -19.251524 | -16.656797 | -19.417744 | -18.514169 | -16.296373 | -18.957014 | ... | -18.111421 | -19.300273 | -18.405934 | -18.509624 | -18.985489 | -16.839802 | -18.004887 | -17.921703 | -18.93927 | -16.764758 |
| 100 | -14.352948 | -14.386996 | -14.852295 | -14.737911 | -14.208276 | -14.859601 | -13.958755 | -14.962105 | -14.469576 | -14.178614 | ... | -15.814656 | -14.359548 | -14.066272 | -14.553393 | -14.499653 | -13.862449 | -14.567345 | -14.416044 | -14.523867 | -14.617676 |
| 1000 | -15.370801 | -15.667994 | -16.422555 | -15.798042 | -15.806323 | -16.521828 | -15.535281 | -15.639368 | -15.690821 | -16.757663 | ... | -17.032607 | -16.548264 | -15.083416 | -15.14529 | -16.394201 | -15.619832 | -15.927902 | -15.911905 | -16.46339 | -16.146455 |
| 10000 | -15.274287 | -15.023897 | -14.882516 | -14.632519 | -15.354033 | -14.912501 | -14.708296 | -15.804482 | -15.755244 | -15.212492 | ... | -15.164343 | -15.436966 | -16.011478 | -15.667627 | -15.099094 | -16.055491 | -14.967439 | -15.120512 | -14.906458 | -15.017584 |
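
The relabelling above depends on how `mapper_person_sample` is built: when several samples share a label, later entries overwrite earlier ones, so each person ends up represented by the last of their samples in iteration order. A minimal sketch with hypothetical sample names and labels:

```python
# Hypothetical labels and sample names: two samples belong to person LM01
toy_labels = ["LM01", "LM01", "OF02"]
toy_names = ["GSM1", "GSM2", "GSM3"]

# Later duplicates overwrite earlier ones, so LM01 maps to GSM2
toy_person_sample = dict(zip(toy_labels, toy_names))

# New column names: sample name, underscore, two-letter group prefix of the label
new_columns = [toy_person_sample[label] + "_" + label[0:2]
               for label in ["LM01", "OF02"]]
print(new_columns)  # prints ['GSM2_LM', 'GSM3_OF']
```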


In [8]:
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
final_matrix.to_pickle("data/GSE2508_table.pkl")