# Batch 3 Preprocessing
Group 4: Damiano Chini, Riccardo Gilmozzi, Gianmarco Piccinno & Alessandro Rizzuto

This notebook carries out the preprocessing steps described in the paper:

1) Probes containing missing values are excluded from the analysis.

2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).

4) Probes mapping to the same Entrez ID label are averaged out.

5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

6) We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.

After these steps, each data set or batch is represented by a single expression matrix $X$. Each entry $X_{ij}$ represents the $\log_2$ of the expression intensity of gene $i$ in sample $j$.


In [1]:
import GEOparse
import pandas as pd

In [2]:
gse3 = GEOparse.get_GEO(filepath="./data/GSE27949_family.soft.gz")

12-Oct-2017 19:54:39 INFO GEOparse - Parsing ./data/GSE27949_family.soft.gz: 
12-Oct-2017 19:54:39 DEBUG GEOparse - DATABASE: GeoMiame
12-Oct-2017 19:54:39 DEBUG GEOparse - SERIES: GSE27949
12-Oct-2017 19:54:39 DEBUG GEOparse - PLATFORM: GPL570
  gpls[entry_name] = parse_GPL(data_group, entry_name)
12-Oct-2017 19:54:44 DEBUG GEOparse - SAMPLE: GSM691122
12-Oct-2017 19:54:44 DEBUG GEOparse - SAMPLE: GSM691123
12-Oct-2017 19:54:44 DEBUG GEOparse - SAMPLE: GSM691124
12-Oct-2017 19:54:45 DEBUG GEOparse - SAMPLE: GSM691125
12-Oct-2017 19:54:45 DEBUG GEOparse - SAMPLE: GSM691126
12-Oct-2017 19:54:45 DEBUG GEOparse - SAMPLE: GSM691127
12-Oct-2017 19:54:46 DEBUG GEOparse - SAMPLE: GSM691128
12-Oct-2017 19:54:46 DEBUG GEOparse - SAMPLE: GSM691129
12-Oct-2017 19:54:46 DEBUG GEOparse - SAMPLE: GSM691130
12-Oct-2017 19:54:47 DEBUG GEOparse - SAMPLE: GSM691131
12-Oct-2017 19:54:47 DEBUG GEOparse - SAMPLE: GSM691132
12-Oct-2017 19:54:47 DEBUG GEOparse - SAMPLE: GSM691133
12-Oct-2017 19:54:48 DEBUG G

In [3]:
plats_3 = list(gse3.gpls.keys())[0]
print (plats_3)

GPL570


Build the sample annotation table (BMI and BMI category). Overweight samples are kept at this stage.

In [4]:
# BMI is stored as text in the GEO phenotype table; convert it to numbers
# and keep it as a one-column DataFrame named "bmi".
samples3 = pd.to_numeric(gse3.phenotype_data["characteristics_ch1.1.bmi"])
samples3 = samples3.to_frame(name="bmi")

def get_cbmi(x):
    if (x < 25):
        return 'lean'
    elif (x >= 30):
        return 'obese'
    else:
        return 'overweight'
    
samples3["cbmi"] = samples3["bmi"].apply(get_cbmi)
print(samples3)
print(samples3.shape)

            bmi        cbmi
GSM691122  38.1       obese
GSM691123  35.3       obese
GSM691124  32.9       obese
GSM691125  42.2       obese
GSM691126  39.7       obese
GSM691127  36.7       obese
GSM691128  41.5       obese
GSM691129  50.2       obese
GSM691130  37.4       obese
GSM691131  36.7       obese
GSM691132  39.5       obese
GSM691133  27.2  overweight
GSM691134  33.2       obese
GSM691135  25.1  overweight
GSM691136  23.2        lean
GSM691137  27.8  overweight
GSM691138  31.8       obese
GSM691139  26.9  overweight
GSM691140  27.4  overweight
GSM691141  28.8  overweight
GSM691142  32.2       obese
GSM691143  23.6        lean
GSM691144  39.1       obese
GSM691145  34.2       obese
GSM691146  33.1       obese
GSM691147  26.4  overweight
GSM691148  25.1  overweight
GSM691149  27.3  overweight
GSM691150  26.2  overweight
GSM691151  16.7        lean
GSM691152  23.0        lean
GSM691153  23.1        lean
GSM691154  23.2        lean
(33, 2)
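
The thresholds used in `get_cbmi` can also be written as bins. The following is a minimal sketch on toy values (a few BMI values from the table above plus two invented boundary cases, `toy_25` and `toy_30`); it is an alternative formulation, not the code that produced the output above. One difference worth noting: `get_cbmi` would label a missing BMI as `'overweight'`, because both comparisons are false for NaN, whereas `pd.cut` leaves it as NaN.

```python
import numpy as np
import pandas as pd

# Toy BMI values: four from the table above, two invented boundary cases.
bmi = pd.Series(
    [38.1, 27.2, 23.2, 16.7, 25.0, 30.0],
    index=["GSM691122", "GSM691133", "GSM691136", "GSM691151", "toy_25", "toy_30"],
    name="bmi",
)

# right=False gives left-closed bins [0, 25), [25, 30), [30, inf),
# matching get_cbmi: < 25 lean, >= 30 obese, otherwise overweight.
cbmi = pd.cut(
    bmi,
    bins=[0, 25, 30, np.inf],
    right=False,
    labels=["lean", "overweight", "obese"],
)
print(cbmi)
```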


In [5]:
with open('./output/batch3_ann.txt', 'w') as handle:
    samples3.to_csv(handle, sep='\t')

# Preprocessing of Expression Data (Batch 3)

In [6]:
samples3_exprs = gse3.pivot_samples('VALUE')[list(samples3.index)]
samples3_exprs.index.name = 'ID'
print('Expression Table', '\n', samples3_exprs.head())

Expression Table 
 name       GSM691122  GSM691123  GSM691124  GSM691125  GSM691126  GSM691127  \
ID                                                                            
1007_s_at   7.004583   7.659500   7.430517   7.771238   7.683040   7.879629   
1053_at     5.285030   5.253187   5.492071   5.738101   5.643463   5.356738   
117_at      5.607604   5.476518   5.753545   5.453142   6.431755   4.901203   
121_at      6.942938   7.450646   7.004481   7.251976   6.931031   7.162552   
1255_g_at   2.840260   2.948136   2.827350   2.663054   2.610417   2.978128   

name       GSM691128  GSM691129  GSM691130  GSM691131    ...      GSM691145  \
ID                                                       ...                  
1007_s_at   7.184948   7.124926   7.498073   7.687856    ...       7.767611   
1053_at     5.775912   5.640174   5.262772   5.704770    ...       5.200273   
117_at      6.298175   6.155661   5.073436   5.634631    ...       5.401797   
121_at      6.570954   7.042676 

Get annotation table with geoparse

In [7]:
#ann_table3 = gse3.gpls[plats_3].table[["ID", "ENTREZ_GENE_ID", "Gene Symbol"]]

ann_table3 = gse3.gpls[plats_3].table[["ID", "ENTREZ_GENE_ID"]]
print (len(ann_table3))

print('Annotation Table: ', '\n', ann_table3.head())

54675
Annotation Table:  
           ID     ENTREZ_GENE_ID
0  1007_s_at  780 /// 100616237
1    1053_at               5982
2     117_at               3310
3     121_at               7849
4  1255_g_at               2978


Probes without an Entrez ID are removed in step 5 below. First, probes with missing expression values are dropped.

# 1) Probes containing missing values are excluded from the analysis.

In [8]:
print(samples3_exprs.shape)
samples3_exprs = samples3_exprs.dropna()
print(samples3_exprs.shape)

(54675, 33)
(54675, 33)
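
The unchanged shape above indicates that no probe in this batch has a missing value. As a small illustration of what `dropna()` does when there are gaps, here is a sketch on an invented toy table (probe names `p1` to `p3` and sample names `S1`, `S2` are hypothetical):

```python
import numpy as np
import pandas as pd

# Invented toy expression table: probes in rows, samples in columns.
toy = pd.DataFrame(
    {"S1": [7.0, np.nan, 5.6], "S2": [7.6, 5.2, 5.4]},
    index=["p1", "p2", "p3"],
)

# Count the probes that have at least one missing value.
n_missing = int(toy.isna().any(axis=1).sum())

# dropna() with default arguments removes every row containing a NaN.
clean = toy.dropna()
print(n_missing, clean.shape)
```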


# 5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

In [9]:
ann_table = ann_table3.dropna()
print(ann_table.head())

          ID     ENTREZ_GENE_ID
0  1007_s_at  780 /// 100616237
1    1053_at               5982
2     117_at               3310
3     121_at               7849
4  1255_g_at               2978


Remove probes that map to more than one Entrez ID (multiple IDs are separated by `///`), for example:
         ID                              Gene ID
2   7896740                81099///79501///26682


In [10]:
print('Shape with double entrez ', ann_table.shape)
ann_table = ann_table[~ann_table['ENTREZ_GENE_ID'].str.contains("///")]
print('Shape without double entrez ', ann_table.shape)
#ann_table = ann_table[~ann_table['Gene Symbol'].str.contains("///")]
#print('Shape without double gene symbols ', ann_table.shape)
print(ann_table.head())

Shape with double entrez  (44134, 2)
Shape without double entrez  (41834, 2)
          ID ENTREZ_GENE_ID
1    1053_at           5982
2     117_at           3310
3     121_at           7849
4  1255_g_at           2978
6    1316_at           7067
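
The two filtering steps of point 5 can be sketched together on a toy annotation table. Three rows are copied from the annotation table printed above; the row `toy_unmapped_at` is invented to stand for a probe without any Entrez ID.

```python
import numpy as np
import pandas as pd

toy_ann = pd.DataFrame({
    "ID": ["1007_s_at", "1053_at", "toy_unmapped_at", "117_at"],
    "ENTREZ_GENE_ID": ["780 /// 100616237", "5982", np.nan, "3310"],
})

# Step 1: drop probes with no Entrez ID at all.
mapped = toy_ann.dropna(subset=["ENTREZ_GENE_ID"])

# Step 2: drop probes mapped to several Entrez IDs.
# regex=False treats "///" as a literal string rather than a pattern.
is_multi = mapped["ENTREZ_GENE_ID"].str.contains("///", regex=False)
unique_ann = mapped[~is_multi]
print(unique_ann)
```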


# 2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

In [11]:
exprs_3 = ann_table.merge(samples3_exprs, left_on='ID', right_index=True, how='inner')

#exprs_3.index = exprs_3['Gene ID']; del exprs_3['Gene ID']
print(exprs_3.head())
print ('Shape of the complete DataFrame:', exprs_3.shape)

          ID ENTREZ_GENE_ID  GSM691122  GSM691123  GSM691124  GSM691125  \
1    1053_at           5982   5.285030   5.253187   5.492071   5.738101   
2     117_at           3310   5.607604   5.476518   5.753545   5.453142   
3     121_at           7849   6.942938   7.450646   7.004481   7.251976   
4  1255_g_at           2978   2.840260   2.948136   2.827350   2.663054   
6    1316_at           7067   5.466243   5.547826   5.659389   5.456705   

   GSM691126  GSM691127  GSM691128  GSM691129    ...      GSM691145  \
1   5.643463   5.356738   5.775912   5.640174    ...       5.200273   
2   6.431755   4.901203   6.298175   6.155661    ...       5.401797   
3   6.931031   7.162552   6.570954   7.042676    ...       6.758132   
4   2.610417   2.978128   2.493905   2.745614    ...       2.837239   
6   4.989941   5.460332   4.818883   4.960852    ...       4.950623   

   GSM691146  GSM691147  GSM691148  GSM691149  GSM691150  GSM691151  \
1   5.121898   5.655970   5.334950   5.479950   5.3


# 3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).
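
No transformation is applied in this notebook. The values printed above (roughly 2.5 to 8 in the rows shown) look consistent with data that is already on a log2 scale, although the notebook does not state this explicitly. A minimal sketch of how such a check could be automated is given below; the function names and the threshold of 30 are assumptions made for illustration, not part of the paper or of GEOparse.

```python
import numpy as np
import pandas as pd

def looks_log2(exprs, max_threshold=30.0):
    """Hypothetical heuristic: data whose largest value is below
    max_threshold is assumed to be log2-scale already."""
    return float(np.nanmax(exprs.to_numpy())) < max_threshold

def ensure_log2(exprs, max_threshold=30.0):
    """Return exprs unchanged if it looks log2-scale, else log2-transform it.
    Assumes strictly positive intensities when a transform is needed."""
    if looks_log2(exprs, max_threshold):
        return exprs
    return np.log2(exprs)

# Invented toy tables: one already in log2 space, one in linear space.
log_scale = pd.DataFrame({"S1": [7.0, 5.2], "S2": [7.6, 5.3]})
linear_scale = pd.DataFrame({"S1": [128.0, 1024.0], "S2": [256.0, 512.0]})

print(ensure_log2(log_scale))
print(ensure_log2(linear_scale))
```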

# 4) Probes mapping to the same Entrez ID label are averaged out.

In [12]:
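# Drop the probe ID column so that only the numeric sample columns are averaged
# (recent pandas versions raise an error on non-numeric columns in mean()).
exprs_3 = exprs_3.drop(columns='ID').groupby('ENTREZ_GENE_ID').mean()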
print(exprs_3.head())
print (exprs_3.shape)

                GSM691122  GSM691123  GSM691124  GSM691125  GSM691126  \
ENTREZ_GENE_ID                                                          
1                4.202563   3.853392   4.005799   3.775287   3.988817   
10               3.611358   3.900851   3.670345   3.989030   3.699571   
100              5.526316   5.553305   5.527924   5.899559   5.844783   
1000             3.781192   3.881830   3.518877   3.740666   3.540397   
10000            4.752245   4.729065   4.935592   4.848327   4.814203   

                GSM691127  GSM691128  GSM691129  GSM691130  GSM691131  \
ENTREZ_GENE_ID                                                          
1                4.074538   3.789106   3.985469   4.202136   3.929663   
10               4.050322   3.687475   3.788186   4.230811   4.020388   
100              5.752965   6.008601   5.560185   6.149886   6.078632   
1000             3.609309   3.587624   3.610829   3.501181   3.365197   
10000            4.945414   4.901753   4.854134   
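
A toy example of the averaging step, with invented probe names and values. Two points it illustrates: the Entrez IDs are strings, so the result is sorted lexicographically (as in the output above, where `1000` precedes `10000`), and averaging log2 values corresponds to a geometric mean of the linear-scale intensities.

```python
import pandas as pd

# Invented toy table: two probes map to Entrez ID "10".
toy = pd.DataFrame({
    "ID": ["a_at", "b_at", "c_at", "d_at"],
    "ENTREZ_GENE_ID": ["10", "10", "1000", "9"],
    "S1": [4.0, 6.0, 3.0, 1.0],
    "S2": [5.0, 7.0, 2.0, 8.0],
})

# Average all probes that share an Entrez ID, sample by sample.
per_gene = toy.drop(columns="ID").groupby("ENTREZ_GENE_ID").mean()
print(per_gene)
```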

# Write the exprs_3 dataframe to a text file

In [13]:
with open('./output/batch3_exprs.txt', 'w') as handle:
    to_write = exprs_3.T
    to_write.to_csv(handle, sep='\t')

# Comparison with the data provided by the authors

In [14]:
or_data3 = pd.read_pickle('./data/GSE27949_geno.p')
print (or_data3.head())
print (or_data3.shape)

                  1        10       100      1000     10000  100009676  \
GSM691136  3.789106  3.421848  5.473183  3.560196  5.150060   3.516442   
GSM691143  4.213376  3.858635  5.712123  3.740538  4.954951   3.487883   
GSM691151  3.853851  3.631455  5.229485  3.578582  4.636259   3.486197   
GSM691152  3.905869  3.812028  5.441216  3.718983  4.950701   3.461761   
GSM691153  3.933165  4.106960  6.344840  3.683929  4.920402   3.500371   

              10001     10002     10003     10004    ...         9987  \
GSM691136  4.541283  3.987601  2.948349  5.626504    ...     7.097902   
GSM691143  4.751201  3.793340  3.310487  5.679048    ...     7.134892   
GSM691151  4.617861  3.929365  3.915912  5.443038    ...     7.290359   
GSM691152  4.612422  3.694696  3.125776  5.613361    ...     6.813517   
GSM691153  4.694406  3.631009  3.208704  5.594756    ...     7.053419   

               9988      9989       999      9990      9991      9992  \
GSM691136  8.110432  5.480923  2.986128  4.

In [15]:
# Entrez IDs in the authors' matrix (columns) and in ours (index).
probes_or = pd.Series(or_data3.columns)
probes_my = pd.Series(exprs_3.index)
print ('the number of Entrez IDs in our result is:', len(probes_my))
print ('the number of Entrez IDs in the original result is:', len(probes_or))

intersection = set.intersection(set(probes_or), set(probes_my))
print ('the number of Entrez IDs common to the original and our results is:', len(intersection))

# Entrez IDs present in the authors' data but missing from ours.
sub = set(probes_or) - set(probes_my)
print ('the number of Entrez IDs found in the original but not in our results is:', len(sub))

annotations = gse3.gpls[plats_3].table[["ID", "ENTREZ_GENE_ID", "Gene Symbol"]]
#annotations = gse3.gpls[plats_3].table
# Collect the missing IDs that have no exactly matching entry in the platform annotation.
l = []
for x in sub:
    if annotations.loc[annotations['ENTREZ_GENE_ID'] == x].empty:
        #print (annotations.loc[annotations['ENTREZ_GENE_ID'] == x])
        l.append(x)
print ('this is the list of all entrez not found in gpl annotations:\n',l)
print ('size of this list is:', len(l))
print ('in annotations i found:\n', annotations.loc[annotations['ENTREZ_GENE_ID'] == '71'])
print ('in authors results i found:\n',probes_or[probes_or.isin(['71'])])


#probes_expr = pd.DataFrame(samples3_exprs.index)
#probes_expr.loc[probes_expr['ID'] == '49411']
#print (probes_expr)

the number of Entrez IDs in our result is: 20486
the number of Entrez IDs in the original result is: 20283
the number of Entrez IDs common to the original and our results is: 18978
the number of Entrez IDs found in the original but not in our results is: 1305
this is the list of all entrez not found in gpl annotations:
 ['112970', '654434', '100129103', '387647', '728882', '83869', '6950', '100507547', '7044', '440862', '100506948', '267004', '6144', '100287421', '100287166', '388946', '23266', '7442', '27111', '100133142', '100131534', '100132832', '100507235', '29920', '730495', '255812', '100505598', '6224', '100129852', '28227', '55525', '80149', '100287005', '602', '64072', '100508534', '284804', '51217', '751071', '1589', '100509635', '100093630', '27251', '401166', '347734', '100288271', '100128002', '153514', '100505522', '768222', '5687', '283079', '283131', '28797', '339047', '114659', '6613', '100293553', '5820', '100506853', '100505523', '100287529', '100287349', '5625', '28562', '85028', '100131112', '60626', '6457
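
Note that the test `annotations['ENTREZ_GENE_ID'] == x` in the cell above only matches probes whose annotation is exactly `x`. An ID that appears only inside a multi-mapped entry such as `780 /// 100616237` is therefore reported as "not found", which may account for part of the list. The sketch below shows one possible way to check membership across all entries; it uses two rows from the annotation table above as toy data and has not been run on the full platform table.

```python
import pandas as pd

toy_ann = pd.DataFrame({
    "ID": ["1007_s_at", "1053_at"],
    "ENTREZ_GENE_ID": ["780 /// 100616237", "5982"],
})

# Split multi-mapped entries and collect every individual Entrez ID.
all_ids = set(
    toy_ann["ENTREZ_GENE_ID"].dropna().str.split("///").explode().str.strip()
)

queries = ["780", "5982", "71"]

# Exact string equality, as in the cell above: misses "780".
exact = [x for x in queries if (toy_ann["ENTREZ_GENE_ID"] == x).any()]

# Membership in the exploded set: finds "780" as well.
any_mapping = [x for x in queries if x in all_ids]

print(exact)
print(any_mapping)
```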

# 6) We apply a simple L 1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.
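
This step is not implemented in the notebook. A minimal sketch of what it could look like, assuming a genes-by-samples table of log2 values such as `exprs_3`: go back to linear space, rescale each sample so that its expression sums to a constant, and return to log2. The function name and the constant `target_sum=1.0` are assumptions of this sketch; the quoted text does not give the value of the constant.

```python
import numpy as np
import pandas as pd

def l1_normalize_log2(log_exprs, target_sum=1.0):
    """Rescale each sample (column) so that its linear-space expression
    sums to target_sum, then return the result in log2 space.

    log_exprs: genes x samples DataFrame of log2 values.
    target_sum: hypothetical constant chosen for this sketch.
    """
    linear = np.exp2(log_exprs)                         # back to linear space
    scaled = linear / linear.sum(axis=0) * target_sum   # per-sample L1 scaling
    return np.log2(scaled)

# Invented toy data: sample S2 is S1 shifted by +2 in log2 space
# (a factor of 4 in linear space).
toy = pd.DataFrame(
    {"S1": [4.0, 5.0, 3.0], "S2": [6.0, 7.0, 5.0]},
    index=["1", "10", "100"],
)
normed = l1_normalize_log2(toy)

# Each column of the linear-space result now sums to target_sum.
print(np.exp2(normed).sum(axis=0))
```

In log2 space this amounts to subtracting a per-sample constant, so differences between genes within a sample are unchanged; in the toy data the two samples become identical after normalization.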