# Batch 2 Preprocessing
Group 4: Damiano Chini, Riccardo Gilmozzi, Gianmarco Piccinno & Alessandro Rizzuto

This notebook carries out the preprocessing steps described in the paper:

1) Probes containing missing values are excluded from the analysis.

2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).

4) Probes mapping to the same Entrez ID label are averaged out.

5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

6) We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.

After these steps, each data set or batch is represented by a single expression matrix $X$. Each entry $X_{ij}$ represents the $\log_2$ of the expression intensity of gene $i$ in sample $j$.
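
Step 6 has no corresponding cell in the part of the notebook shown here. The following is a minimal sketch of one way to read it, not the authors' implementation: it assumes a genes x samples table of log2 values, and the helper name `l1_normalize_log2`, the constant `target_sum` and the toy values are all hypothetical. The values are converted back to linear space, each sample is rescaled so that its total equals the same constant, and the result is returned to log2.

```python
import numpy as np
import pandas as pd


def l1_normalize_log2(log2_exprs, target_sum=1e6):
    """Rescale each sample (column) so its linear-space total equals target_sum.

    Hypothetical helper; target_sum is an arbitrary illustrative constant.
    """
    linear = np.exp2(log2_exprs)              # log2 -> linear intensities
    scale = target_sum / linear.sum(axis=0)   # one scaling factor per sample
    return np.log2(linear * scale)            # factors align on column labels


# Toy genes x samples table of log2 values (made-up numbers)
toy = pd.DataFrame(
    {"GSM_A": [1.0, 2.0, 3.0], "GSM_B": [2.0, 3.0, 4.0]},
    index=["1", "10", "100"],
)
normalized = l1_normalize_log2(toy)
column_sums = np.exp2(normalized).sum(axis=0)
```

In the toy table the second sample is the first one shifted by 1 in log2 (a factor of 2 in linear space), so after normalization the two columns coincide.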


In [58]:
import GEOparse
import pandas as pd

In [59]:
gse2 = GEOparse.get_GEO(filepath="./data/GSE26637_family.soft.gz")

12-Oct-2017 12:27:26 INFO GEOparse - Parsing ./data/GSE26637_family.soft.gz: 
12-Oct-2017 12:27:26 DEBUG GEOparse - DATABASE: GeoMiame
12-Oct-2017 12:27:26 DEBUG GEOparse - SERIES: GSE26637
12-Oct-2017 12:27:26 DEBUG GEOparse - PLATFORM: GPL570
  gpls[entry_name] = parse_GPL(data_group, entry_name)
12-Oct-2017 12:27:30 DEBUG GEOparse - SAMPLE: GSM655603
12-Oct-2017 12:27:30 DEBUG GEOparse - SAMPLE: GSM655604
12-Oct-2017 12:27:31 DEBUG GEOparse - SAMPLE: GSM655605
12-Oct-2017 12:27:31 DEBUG GEOparse - SAMPLE: GSM655606
12-Oct-2017 12:27:31 DEBUG GEOparse - SAMPLE: GSM655607
12-Oct-2017 12:27:32 DEBUG GEOparse - SAMPLE: GSM655608
12-Oct-2017 12:27:32 DEBUG GEOparse - SAMPLE: GSM655609
12-Oct-2017 12:27:32 DEBUG GEOparse - SAMPLE: GSM655610
12-Oct-2017 12:27:33 DEBUG GEOparse - SAMPLE: GSM655611
12-Oct-2017 12:27:33 DEBUG GEOparse - SAMPLE: GSM655612
12-Oct-2017 12:27:33 DEBUG GEOparse - SAMPLE: GSM655613
12-Oct-2017 12:27:33 DEBUG GEOparse - SAMPLE: GSM655614
12-Oct-2017 12:27:34 DEBUG G

In [60]:
plats_2 = list(gse2.gpls.keys())[0]
print(plats_2)

GPL570


Annotation table

In [61]:
samples2 = gse2.phenotype_data[["characteristics_ch1.2.stimulation", "characteristics_ch1.3.resistance status"]]

samples2 = samples2.rename(columns={'characteristics_ch1.2.stimulation':'fasting_status',
                         'characteristics_ch1.3.resistance status':'insulin_status'})

# Derive the 'cbmi' label from insulin status: 'resistant' -> 'obese', anything else -> 'lean'
samples2['cbmi'] = samples2['insulin_status'].apply(lambda x: 'obese' if x == 'resistant' else 'lean')

print(samples2)
print(samples2.shape)

             fasting_status insulin_status   cbmi
GSM655603           fasting      resistant  obese
GSM655604           fasting      resistant  obese
GSM655605           fasting      resistant  obese
GSM655606           fasting      resistant  obese
GSM655607           fasting      resistant  obese
GSM655608           fasting      sensitive   lean
GSM655609           fasting      sensitive   lean
GSM655610           fasting      sensitive   lean
GSM655611           fasting      sensitive   lean
GSM655612           fasting      sensitive   lean
GSM655613  hyperinsulinemia      resistant  obese
GSM655614  hyperinsulinemia      resistant  obese
GSM655615  hyperinsulinemia      resistant  obese
GSM655616  hyperinsulinemia      resistant  obese
GSM655617  hyperinsulinemia      resistant  obese
GSM655618  hyperinsulinemia      sensitive   lean
GSM655619  hyperinsulinemia      sensitive   lean
GSM655620  hyperinsulinemia      sensitive   lean
GSM655621  hyperinsulinemia      sensitive   lean


In [62]:
with open('./output/batch2_ann.txt', 'w') as handle:
    samples2.to_csv(handle, sep='\t')

# Preprocessing of Expression Data (Batch 2)

In [63]:
samples2_exprs = gse2.pivot_samples('VALUE')[list(samples2.index)]
samples2_exprs.index.name = 'ID'
print('Expression Table', '\n', samples2_exprs.head())

Expression Table 
 name       GSM655603  GSM655604  GSM655605  GSM655606  GSM655607  GSM655608  \
ID                                                                            
1007_s_at       5.14       5.15       5.15       5.22       5.16       5.11   
1053_at         5.04       5.04       5.26       5.41       4.85       4.87   
117_at          4.52       4.44       4.73       3.77       3.49       3.39   
121_at          2.50       2.50       2.49       2.50       2.51       2.53   
1255_g_at       2.20       2.23       2.22       2.23       2.23       2.27   

name       GSM655609  GSM655610  GSM655611  GSM655612  GSM655613  GSM655614  \
ID                                                                            
1007_s_at       5.16       5.85       5.18       4.89       5.41       5.06   
1053_at         4.92       5.07       4.95       5.27       5.04       4.30   
117_at          3.15       3.52       3.51       3.09       3.67       3.99   
121_at          2.50       2.51 

Get the annotation table with GEOparse

In [64]:
#ann_table2 = gse2.gpls[plats_2].table[["ID", "ENTREZ_GENE_ID", "Gene Symbol"]]

ann_table2 = gse2.gpls[plats_2].table[["ID", "ENTREZ_GENE_ID"]]
print('Annotation Table: ', '\n', ann_table2.head())

Annotation Table:  
           ID     ENTREZ_GENE_ID
0  1007_s_at  780 /// 100616237
1    1053_at               5982
2     117_at               3310
3     121_at               7849
4  1255_g_at               2978


Remove probes without ENTREZ ID

# 1) Probes containing missing values are excluded from the analysis.

In [65]:
print(samples2_exprs.shape)
samples2_exprs = samples2_exprs.dropna()
print(samples2_exprs.shape)

(54675, 20)
(54675, 20)
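
The shape is the same before and after `dropna()`, so no probe in this batch has a missing value. For reference, here is a toy sketch (made-up probe names and values) of what the call does when a probe is missing a value in any sample:

```python
import numpy as np
import pandas as pd

# Toy probes x samples table; probe_b has a missing value in sample S1
toy_exprs = pd.DataFrame(
    {"S1": [5.1, np.nan, 2.5], "S2": [5.0, 4.4, 2.6]},
    index=["probe_a", "probe_b", "probe_c"],
)

# Count probes with at least one missing value, then drop them
n_missing = int(toy_exprs.isna().any(axis=1).sum())
complete = toy_exprs.dropna()
```

With the default arguments, `dropna()` removes a row as soon as it has one missing value, which matches the wording of step 1.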


# 5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

In [66]:
ann_table = ann_table2.dropna()
print(ann_table.head())

          ID     ENTREZ_GENE_ID
0  1007_s_at  780 /// 100616237
1    1053_at               5982
2     117_at               3310
3     121_at               7849
4  1255_g_at               2978


Remove probes that map to more than one Entrez ID (several IDs separated by `///`),
for example:
         ID                              Gene ID
2   7896740                81099///79501///26682


In [67]:
print('Shape with double entrez ', ann_table.shape)
ann_table = ann_table[~ann_table['ENTREZ_GENE_ID'].str.contains("///")]
print('Shape without double entrez ', ann_table.shape)
#ann_table = ann_table[~ann_table['Gene Symbol'].str.contains("///")]
#print('Shape without double gene symbols ', ann_table.shape)
print(ann_table.head())

Shape with double entrez  (44134, 2)
Shape without double entrez  (41834, 2)
          ID ENTREZ_GENE_ID
1    1053_at           5982
2     117_at           3310
3     121_at           7849
4  1255_g_at           2978
6    1316_at           7067
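
The two filters above (no Entrez ID, more than one Entrez ID) can also be written as explicit boolean masks. This is a sketch on a toy table: the first two rows reuse values printed earlier, and the third probe is hypothetical. Passing `regex=False` treats `///` as a literal string, and `na=False` makes missing IDs count as "no match" rather than propagating missing values into the mask.

```python
import pandas as pd

# Toy annotation table; "hypothetical_at" is an invented probe with no Entrez ID
toy_ann = pd.DataFrame(
    {
        "ID": ["1007_s_at", "1053_at", "hypothetical_at"],
        "ENTREZ_GENE_ID": ["780 /// 100616237", "5982", None],
    }
)

has_id = toy_ann["ENTREZ_GENE_ID"].notna()
is_multi = toy_ann["ENTREZ_GENE_ID"].str.contains("///", regex=False, na=False)

# Keep only probes with exactly one Entrez ID
unique_ann = toy_ann[has_id & ~is_multi]
```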


# 2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

In [68]:
exprs_2 = ann_table.merge(samples2_exprs, left_on='ID', right_index=True, how='inner')

#exprs_2.index = exprs_2['Gene ID']; del exprs_2['Gene ID']
print(exprs_2.head())
print('Shape of the complete DataFrame:', exprs_2.shape)

          ID ENTREZ_GENE_ID  GSM655603  GSM655604  GSM655605  GSM655606  \
1    1053_at           5982       5.04       5.04       5.26       5.41   
2     117_at           3310       4.52       4.44       4.73       3.77   
3     121_at           7849       2.50       2.50       2.49       2.50   
4  1255_g_at           2978       2.20       2.23       2.22       2.23   
6    1316_at           7067       4.40       5.02       5.47       3.99   

   GSM655607  GSM655608  GSM655609  GSM655610    ...      GSM655613  \
1       4.85       4.87       4.92       5.07    ...           5.04   
2       3.49       3.39       3.15       3.52    ...           3.67   
3       2.51       2.53       2.50       2.51    ...           2.51   
4       2.23       2.27       2.22       2.27    ...           2.25   
6       4.32       5.37       6.64       4.17    ...           4.39   

   GSM655614  GSM655615  GSM655616  GSM655617  GSM655618  GSM655619  \
1       4.30       5.06       5.10       5.04      


# 3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).
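
No transformation is applied at this step in the notebook. The values printed above lie roughly between 2 and 7, which would be consistent with data that are already log-scaled, although the notebook does not state this. The "if necessary" condition could be checked with a rough heuristic like the one sketched below; the helper names and the threshold `max_log_value` are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def looks_log_transformed(exprs, max_log_value=30.0):
    """Heuristic only: treat the table as log-scaled if its maximum is small.

    max_log_value is an arbitrary, hypothetical threshold.
    """
    return float(exprs.max().max()) <= max_log_value


def log2_if_needed(exprs):
    """Return exprs unchanged if it already looks log-scaled, else log2(exprs)."""
    if looks_log_transformed(exprs):
        return exprs
    return np.log2(exprs)


# Toy raw intensities (made-up numbers)
raw = pd.DataFrame({"S1": [32.0, 1024.0], "S2": [64.0, 8.0]})
logged = log2_if_needed(raw)        # transformed, because the maximum is 1024
unchanged = log2_if_needed(logged)  # left as is, because the maximum is now 10
```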

# 4) Probes mapping to the same Entrez ID label are averaged out.

In [69]:
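# Drop the probe ID column first: it is non-numeric, and recent pandas versions
# raise a TypeError when mean() is applied to string columns
exprs_2 = exprs_2.drop(columns='ID').groupby('ENTREZ_GENE_ID').mean()
print(exprs_2.head())
print(exprs_2.shape)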

                GSM655603  GSM655604  GSM655605  GSM655606  GSM655607  \
ENTREZ_GENE_ID                                                          
1                2.110000   2.130000      2.120   2.140000   2.290000   
10               2.140000   2.190000      2.190   2.190000   2.240000   
100              6.075000   6.145000      6.820   6.010000   6.065000   
1000             2.960000   3.005000      3.005   2.965000   3.105000   
10000            5.065714   5.154286      4.720   4.955714   4.857143   

                GSM655608  GSM655609  GSM655610  GSM655611  GSM655612  \
ENTREZ_GENE_ID                                                          
1                2.150000      2.170   2.160000   2.130000   2.130000   
10               2.200000      2.180   2.230000   2.290000   3.330000   
100              5.835000      6.365   6.050000   5.515000   6.945000   
1000             3.050000      3.110   3.040000   3.015000   3.015000   
10000            4.921429      4.730   5.324286   
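
A toy sketch of the averaging step, with made-up probe names and values: probes sharing an Entrez ID collapse into one row holding their per-sample mean. Since the values are on a log2 scale, this arithmetic mean corresponds to a geometric mean of the linear intensities.

```python
import pandas as pd

# Toy table: probes p1 and p2 map to the same Entrez ID
toy = pd.DataFrame(
    {
        "ID": ["p1", "p2", "p3"],
        "ENTREZ_GENE_ID": ["100", "100", "1000"],
        "S1": [6.0, 6.2, 3.0],
        "S2": [6.1, 6.3, 3.1],
    }
)

# Drop the non-numeric probe ID column, then average per Entrez ID
per_gene = toy.drop(columns="ID").groupby("ENTREZ_GENE_ID").mean()
```

The Entrez IDs are stored as strings here, so the grouped index is sorted lexicographically (1, 10, 100, 1000, 10000), as in the output above.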

# Write the exprs_2 DataFrame to a text file

In [70]:
with open('./output/batch2_exprs.txt', 'w') as handle:
    to_write = exprs_2.T
    to_write.to_csv(handle, sep='\t')

# Comparison with the data provided by the authors
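
One way to quantify such a comparison is sketched below on toy data. It assumes both tables are laid out as samples x genes (the layout written to `batch2_exprs.txt`) and that gene labels may be stored as integers in one table and strings in the other. The helper name `compare_tables` and all values are hypothetical.

```python
import pandas as pd


def compare_tables(ours, reference):
    """Largest absolute difference per gene over the shared samples and genes."""
    ours = ours.copy()
    reference = reference.copy()
    # Make gene labels comparable regardless of how they were stored
    ours.columns = ours.columns.astype(str)
    reference.columns = reference.columns.astype(str)
    samples = ours.index.intersection(reference.index)
    genes = ours.columns.intersection(reference.columns)
    diff = (ours.loc[samples, genes] - reference.loc[samples, genes]).abs()
    return diff.max(axis=0)


# Toy samples x genes tables (made-up values); row order and label types differ
ours = pd.DataFrame({"1": [2.11, 2.15], "10": [2.14, 2.20]}, index=["GSM_A", "GSM_B"])
reference = pd.DataFrame({1: [2.15, 2.11], 10: [2.20, 2.5]}, index=["GSM_B", "GSM_A"])
max_diff = compare_tables(ours, reference)
```

Genes with a difference of zero were reproduced exactly on the shared samples. A non-zero entry points to a gene whose probes were selected or averaged differently.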

In [74]:
or_data2 = pd.read_pickle('./data/GSE26637_geno.p')
print(or_data2)
print(or_data2.shape)

              1    10    100   1000     10000  100009676   10001  10002  \
GSM655608  2.15  2.20  5.835  3.050  5.108333      2.615  5.5675  2.685   
GSM655609  2.17  2.18  6.365  3.110  4.960000      2.555  5.5900  2.515   
GSM655610  2.16  2.23  6.050  3.040  5.420000      2.585  6.4375  2.575   
GSM655611  2.13  2.29  5.515  3.015  5.131667      2.560  5.7900  2.540   
GSM655612  2.13  3.33  6.945  3.015  4.701667      2.565  6.0175  2.535   
GSM655618  2.18  2.26  5.035  3.285  4.760000      2.695  4.8550  2.805   
GSM655619  2.14  2.20  7.550  3.030  4.813333      2.570  5.7650  2.555   
GSM655620  2.21  2.21  5.215  3.225  4.520000      2.740  4.6975  2.835   
GSM655621  2.16  2.21  5.800  3.070  5.190000      2.985  4.8750  2.845   
GSM655622  2.14  2.17  7.570  3.025  4.675000      2.570  5.8050  2.820   
GSM655603  2.11  2.14  6.075  2.960  5.038333      2.545  6.2275  2.475   
GSM655604  2.13  2.19  6.145  3.005  5.378333      2.520  6.0700  2.530   
GSM655605  2.12  2.19  6.