# Batch 3 Preprocessing
Group 4: Damiano Chini, Riccardo Gilmozzi, Gianmarco Piccinno & Alessandro Rizzuto
Useful links:
https://github.com/ComplexityBiosystems/obesity-score
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48964

This notebook performs the preprocessing steps described in the paper:

1) Probes containing missing values are excluded from the analysis.

2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).

4) Probes mapping to the same Entrez ID label are averaged out.

5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

6) We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.

After these steps, each data set or batch is represented by a single expression matrix X. Each entry X_ij represents the log2 of the expression intensity of gene i in sample j.


In [17]:
import GEOparse
import pandas as pd

In [18]:
gse3 = GEOparse.get_GEO(filepath="./data/GSE27949_family.soft.gz")

11-Oct-2017 19:46:55 INFO GEOparse - Parsing ./data/GSE27949_family.soft.gz: 
11-Oct-2017 19:46:55 DEBUG GEOparse - DATABASE: GeoMiame
11-Oct-2017 19:46:55 DEBUG GEOparse - SERIES: GSE27949
11-Oct-2017 19:46:55 DEBUG GEOparse - PLATFORM: GPL570
  gpls[entry_name] = parse_GPL(data_group, entry_name)
11-Oct-2017 19:46:57 DEBUG GEOparse - SAMPLE: GSM691122
11-Oct-2017 19:46:57 DEBUG GEOparse - SAMPLE: GSM691123
11-Oct-2017 19:46:57 DEBUG GEOparse - SAMPLE: GSM691124
11-Oct-2017 19:46:57 DEBUG GEOparse - SAMPLE: GSM691125
11-Oct-2017 19:46:57 DEBUG GEOparse - SAMPLE: GSM691126
11-Oct-2017 19:46:57 DEBUG GEOparse - SAMPLE: GSM691127
11-Oct-2017 19:46:58 DEBUG GEOparse - SAMPLE: GSM691128
11-Oct-2017 19:46:58 DEBUG GEOparse - SAMPLE: GSM691129
11-Oct-2017 19:46:58 DEBUG GEOparse - SAMPLE: GSM691130
11-Oct-2017 19:46:58 DEBUG GEOparse - SAMPLE: GSM691131
11-Oct-2017 19:46:58 DEBUG GEOparse - SAMPLE: GSM691132
11-Oct-2017 19:46:58 DEBUG GEOparse - SAMPLE: GSM691133
11-Oct-2017 19:46:58 DEBUG G

In [19]:
plats_3 = list(gse3.gpls.keys())[0]
print(plats_3)

GPL570


Annotation table

In [20]:
# BMI of each sample, taken from the GEO phenotype table
samples3 = pd.DataFrame(gse3.phenotype_data["characteristics_ch1.1.bmi"])
samples3.rename(columns={"characteristics_ch1.1.bmi": "bmi"}, inplace=True)
# BMI class: lean (<= 25), overweight (25, 30], obese (> 30); anything else (e.g. NaN) is flagged as STRANGE
samples3["cbmi"] = samples3["bmi"].apply(lambda x: "obese" if float(x) > 30 else ("lean" if float(x) <= 25 else ("overweight" if 25 < float(x) <= 30 else "STRANGE")))
samples3 = samples3[['cbmi', 'bmi']]

print(samples3)

                 cbmi   bmi
GSM691122       obese  38.1
GSM691123       obese  35.3
GSM691124       obese  32.9
GSM691125       obese  42.2
GSM691126       obese  39.7
GSM691127       obese  36.7
GSM691128       obese  41.5
GSM691129       obese  50.2
GSM691130       obese  37.4
GSM691131       obese  36.7
GSM691132       obese  39.5
GSM691133  overweight  27.2
GSM691134       obese  33.2
GSM691135  overweight  25.1
GSM691136        lean  23.2
GSM691137  overweight  27.8
GSM691138       obese  31.8
GSM691139  overweight  26.9
GSM691140  overweight  27.4
GSM691141  overweight  28.8
GSM691142       obese  32.2
GSM691143        lean  23.6
GSM691144       obese  39.1
GSM691145       obese  34.2
GSM691146       obese  33.1
GSM691147  overweight  26.4
GSM691148  overweight  25.1
GSM691149  overweight  27.3
GSM691150  overweight  26.2
GSM691151        lean  16.7
GSM691152        lean  23.0
GSM691153        lean  23.1
GSM691154        lean  23.2


In [21]:
or_samples3 = pd.read_pickle('./data/GSE27949_pheno.p')
# Compare element types before and after casting the authors' bmi column to str
# (.iloc[0] is used because the index holds sample names, not integer positions)
print(type(or_samples3['bmi'].iloc[0]), type(or_samples3['cbmi'].iloc[0]), type(or_samples3.index[0]))
print(type(samples3['bmi'].iloc[0]), type(samples3['cbmi'].iloc[0]), type(samples3.index[0]))
or_samples3['bmi'] = or_samples3['bmi'].astype(str)
print(type(or_samples3['bmi'].iloc[0]), type(or_samples3['cbmi'].iloc[0]), type(or_samples3.index[0]))
print(type(samples3['bmi'].iloc[0]), type(samples3['cbmi'].iloc[0]), type(samples3.index[0]))

or_samples3_d = or_samples3.to_dict()
samples3_d = samples3.to_dict()
or_samples3_d == samples3_d

<class 'numpy.float64'> <class 'str'> <class 'str'>
<class 'str'> <class 'str'> <class 'str'>
<class 'str'> <class 'str'> <class 'str'>
<class 'str'> <class 'str'> <class 'str'>


True

In [35]:
samples3.to_pickle('./output/batch3_pheno.p')
with open('./output/batch3_pheno.txt', 'w') as handle:
    samples3.to_csv(handle, sep='\t')

# Preprocessing of Expression Data (Batch 3)

In [23]:
samples3_ = gse3.pivot_samples('VALUE')[list(samples3.index)]
print(samples3_.head())


name       GSM691122  GSM691123  GSM691124  GSM691125  GSM691126  GSM691127  \
ID_REF                                                                        
1007_s_at   7.004583   7.659500   7.430517   7.771238   7.683040   7.879629   
1053_at     5.285030   5.253187   5.492071   5.738101   5.643463   5.356738   
117_at      5.607604   5.476518   5.753545   5.453142   6.431755   4.901203   
121_at      6.942938   7.450646   7.004481   7.251976   6.931031   7.162552   
1255_g_at   2.840260   2.948136   2.827350   2.663054   2.610417   2.978128   

name       GSM691128  GSM691129  GSM691130  GSM691131    ...      GSM691145  \
ID_REF                                                   ...                  
1007_s_at   7.184948   7.124926   7.498073   7.687856    ...       7.767611   
1053_at     5.775912   5.640174   5.262772   5.704770    ...       5.200273   
117_at      6.298175   6.155661   5.073436   5.634631    ...       5.401797   
121_at      6.570954   7.042676   6.968916   7.0779

In [24]:
samples3_ann = samples3_.reset_index().merge(gse3.gpls['GPL570'].table[["ID", "ENTREZ_GENE_ID"]],
                                left_on='ID_REF', right_on="ID").set_index('ID_REF')
del samples3_ann["ID"]
samples3_ann['ENTREZ_GENE_ID'] = samples3_ann['ENTREZ_GENE_ID'].astype(str)
print(samples3_ann.head(), '\n\n', samples3_ann.shape[0])

           GSM691122  GSM691123  GSM691124  GSM691125  GSM691126  GSM691127  \
ID_REF                                                                        
1007_s_at   7.004583   7.659500   7.430517   7.771238   7.683040   7.879629   
1053_at     5.285030   5.253187   5.492071   5.738101   5.643463   5.356738   
117_at      5.607604   5.476518   5.753545   5.453142   6.431755   4.901203   
121_at      6.942938   7.450646   7.004481   7.251976   6.931031   7.162552   
1255_g_at   2.840260   2.948136   2.827350   2.663054   2.610417   2.978128   

           GSM691128  GSM691129  GSM691130  GSM691131        ...          \
ID_REF                                                       ...           
1007_s_at   7.184948   7.124926   7.498073   7.687856        ...           
1053_at     5.775912   5.640174   5.262772   5.704770        ...           
117_at      6.298175   6.155661   5.073436   5.634631        ...           
121_at      6.570954   7.042676   6.968916   7.077937        ...  

# 5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

In [25]:
print(samples3_ann.shape[0])
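# Drop probes annotated with several Entrez IDs (separated by "///"); copy to avoid chained-assignment warnings later
samples3_ann = samples3_ann[~samples3_ann.ENTREZ_GENE_ID.str.contains("///")].copy()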

54675


In [26]:
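# The string 'nan' (probes without an Entrez ID) becomes a real NaN when cast to float
samples3_ann['ENTREZ_GENE_ID'] = samples3_ann['ENTREZ_GENE_ID'].astype(float)
print(samples3_ann.shape[0])
# dropna removes probes without an Entrez ID (and any probe with a missing expression value)
samples3_ann = samples3_ann.dropna()
print(samples3_ann.shape[0])
# Cast back to str via int so that labels read "1000" rather than "1000.0"
samples3_ann['ENTREZ_GENE_ID'] = samples3_ann['ENTREZ_GENE_ID'].astype(int)
samples3_ann['ENTREZ_GENE_ID'] = samples3_ann['ENTREZ_GENE_ID'].astype(str)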


52375
41834


# 1) Probes containing missing values are excluded from the analysis.

In [27]:
print(samples3_ann.shape)
samples3_ann = samples3_ann.dropna()
print(samples3_ann.shape)

(41834, 34)
(41834, 34)


# 2) Probes are mapped to Entrez ID labels if they are available in the associated platform.


# 3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).
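The expression values printed earlier lie roughly between 2 and 8, which suggests that this series is already on a log2 scale, so no transform is applied in this notebook. A minimal sketch of how that could be checked follows. The helper `needs_log2` and its threshold of 100 are assumptions made for illustration, not part of the paper, and the data frames are toy values.

```python
import numpy as np
import pandas as pd

def needs_log2(df, max_threshold=100.0):
    """Heuristic (an assumption, not from the paper): linear intensities
    usually reach the thousands, while log2 values stay well below 100."""
    return float(np.nanmax(df.to_numpy(dtype=float))) > max_threshold

# Toy matrices standing in for a probes x samples expression table
log_like = pd.DataFrame({"s1": [7.0, 5.3], "s2": [7.7, 5.2]})
linear_like = pd.DataFrame({"s1": [128.0, 39.0], "s2": [207.0, 38.0]})

for name, df in [("log_like", log_like), ("linear_like", linear_like)]:
    if needs_log2(df):
        df = np.log2(df)  # transform only when the values look linear
    print(name, float(df.max().max()))
```

A threshold on the maximum is only a rough guide; checking the series description on GEO remains the safer way to decide.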

# 4) Probes mapping to the same Entrez ID label are averaged out.

In [28]:
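# Collapse probes that share an Entrez ID into one row per gene.
# Note: the paper describes averaging; median() is used here. Swap in .mean() for an arithmetic mean.
exprs_3 = samples3_ann.groupby('ENTREZ_GENE_ID').median()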
print(exprs_3.head())
print('\n', exprs_3.shape)

                GSM691122  GSM691123  GSM691124  GSM691125  GSM691126  \
ENTREZ_GENE_ID                                                          
1                4.202563   3.853392   4.005799   3.775287   3.988817   
10               3.611358   3.900851   3.670345   3.989030   3.699571   
100              5.526316   5.553305   5.527924   5.899559   5.844783   
1000             3.781192   3.881830   3.518877   3.740666   3.540397   
10000            4.521154   4.355984   4.752987   4.277416   4.410266   

                GSM691127  GSM691128  GSM691129  GSM691130  GSM691131  \
ENTREZ_GENE_ID                                                          
1                4.074538   3.789106   3.985469   4.202136   3.929663   
10               4.050322   3.687475   3.788186   4.230811   4.020388   
100              5.752965   6.008601   5.560185   6.149886   6.078632   
1000             3.609309   3.587624   3.610829   3.501181   3.365197   
10000            4.467307   4.573366   4.229275   

# Write the exprs3 DataFrame to a text file

In [36]:
exprs3 = exprs_3.T
exprs3.to_pickle('./output/batch3_exprs.p')
with open('./output/batch3_geno.txt', 'w') as handle:
    exprs3.to_csv(handle, sep='\t')

# Comparison with the data provided by the authors

In [32]:
or_data3 = pd.read_pickle('./data/GSE27949_geno.p')
print(or_data3.shape, exprs3.shape)
print(len(set.intersection(set(or_data3.columns), set(exprs3.columns))))
print(type(or_data3.columns[0]))
print(type(exprs3.columns[0]))

(33, 20283) (33, 20486)
18978
<class 'str'>
<class 'str'>
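The shapes and the number of shared Entrez IDs show how far the two gene sets overlap, but not whether the values agree. A minimal sketch of a value-level check on the shared genes is given below. The helper `compare_on_shared` is hypothetical and the matrices are toy data, with samples as rows and Entrez IDs as columns like `exprs3` and `or_data3`.

```python
import numpy as np
import pandas as pd

def compare_on_shared(ours, theirs):
    """Return the shared gene labels and the largest absolute difference
    between the two matrices, restricted to shared samples and genes."""
    genes = ours.columns.intersection(theirs.columns)
    samples = ours.index.intersection(theirs.index)
    diff = (ours.loc[samples, genes] - theirs.loc[samples, genes]).abs()
    return genes, float(diff.to_numpy().max())

# Toy data: two genes ("1" and "100") are present in both matrices
ours = pd.DataFrame([[4.0, 3.0, 5.0]], index=["GSM_a"], columns=["1", "10", "100"])
theirs = pd.DataFrame([[4.0, 5.5, 2.0]], index=["GSM_a"], columns=["1", "100", "9999"])

genes, max_diff = compare_on_shared(ours, theirs)
print(list(genes), max_diff)
```

Differences are to be expected if the two pipelines diverge at any step, for example in how probes are collapsed per gene or in whether the final normalization has been applied.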


# 6) We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.
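This step is not implemented in the cells above. A minimal sketch of one way to do it is shown here, assuming log2 values with samples as rows and genes as columns (the layout of `exprs3`). The function name `l1_normalize_linear` and the constant `total=1.0` are assumptions; the paper only requires that the sum be the same for every sample.

```python
import numpy as np
import pandas as pd

def l1_normalize_linear(log2_exprs, total=1.0):
    """Rescale each sample (row) so that its linear-space expression sums
    to `total`, then return the result on the log2 scale."""
    linear = np.exp2(log2_exprs)                             # back to linear space
    scaled = linear.div(linear.sum(axis=1), axis=0) * total  # per-sample L1 scaling
    return np.log2(scaled)                                   # back to log2 space

# Toy data: the second sample is the first shifted by 0.5 on the log2 scale
toy = pd.DataFrame([[4.0, 3.0, 5.0], [4.5, 3.5, 5.5]],
                   index=["GSM_a", "GSM_b"], columns=["1", "10", "100"])

normed = l1_normalize_linear(toy)
print(np.exp2(normed).sum(axis=1))  # per-sample sums in linear space
```

Because each sample is divided by a single factor, the normalization shifts all log2 values of that sample by the same amount and leaves the differences between genes within the sample unchanged.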