# Batch 2 Preprocessing
Group 4: Damiano Chini, Riccardo Gilmozzi, Gianmarco Piccinno & Alessandro Rizzuto
Useful links:
https://github.com/ComplexityBiosystems/obesity-score
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48964

This notebook carries out the preprocessing steps described in the paper:

1) Probes containing missing values are excluded from the analysis.

2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).

4) Probes mapping to the same Entrez ID label are averaged out.

5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

6) We apply a simple $L_1$ normalization in linear space, imposing that the sum of expression of all genes is constant among samples.

After these steps, each data set or batch is represented by a single expression matrix $X$. Each entry $X_{ij}$ represents the $\log_2$ of the expression intensity of gene $i$ in sample $j$.
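Step 3 applies the log2 transform only "if necessary", but the paper excerpt above does not say how that is decided. The sketch below shows one possible heuristic on toy data: values are treated as linear when the largest one exceeds a cutoff. The function name `log2_if_needed`, the cutoff `max_log_value`, and the clipping at 1 are all assumptions made for illustration, not the authors' rule.

```python
import numpy as np
import pandas as pd


def log2_if_needed(exprs: pd.DataFrame, max_log_value: float = 30.0) -> pd.DataFrame:
    """Return log2-scale values, transforming only if the data look linear.

    max_log_value is a hypothetical cutoff: log2 intensities are assumed
    to stay below it, while raw intensities usually go far above it.
    """
    if exprs.max().max() > max_log_value:
        # Assumption: clip at 1 so zero or negative values do not give -inf/NaN.
        return np.log2(exprs.clip(lower=1.0))
    return exprs


# Toy probes x samples table on a linear scale.
linear = pd.DataFrame({"s1": [4.0, 1024.0], "s2": [8.0, 256.0]}, index=["p1", "p2"])
logged = log2_if_needed(linear)       # transformed: the maximum is 1024
already_log = log2_if_needed(logged)  # left unchanged: the maximum is 10
```

The expression values printed further down in this notebook lie roughly between 2 and 7, which suggests they are already on a log scale, so such a check would leave them untouched.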


In [1]:
import GEOparse
import pandas as pd

In [2]:
gse2 = GEOparse.get_GEO(filepath="./data/GSE26637_family.soft.gz")

11-Oct-2017 19:46:22 INFO GEOparse - Parsing ./data/GSE26637_family.soft.gz: 
11-Oct-2017 19:46:22 DEBUG GEOparse - DATABASE: GeoMiame
11-Oct-2017 19:46:22 DEBUG GEOparse - SERIES: GSE26637
11-Oct-2017 19:46:22 DEBUG GEOparse - PLATFORM: GPL570
  gpls[entry_name] = parse_GPL(data_group, entry_name)
11-Oct-2017 19:46:24 DEBUG GEOparse - SAMPLE: GSM655603
11-Oct-2017 19:46:24 DEBUG GEOparse - SAMPLE: GSM655604
11-Oct-2017 19:46:24 DEBUG GEOparse - SAMPLE: GSM655605
11-Oct-2017 19:46:24 DEBUG GEOparse - SAMPLE: GSM655606
11-Oct-2017 19:46:24 DEBUG GEOparse - SAMPLE: GSM655607
11-Oct-2017 19:46:24 DEBUG GEOparse - SAMPLE: GSM655608
11-Oct-2017 19:46:24 DEBUG GEOparse - SAMPLE: GSM655609
11-Oct-2017 19:46:25 DEBUG GEOparse - SAMPLE: GSM655610
11-Oct-2017 19:46:25 DEBUG GEOparse - SAMPLE: GSM655611
11-Oct-2017 19:46:25 DEBUG GEOparse - SAMPLE: GSM655612
11-Oct-2017 19:46:25 DEBUG GEOparse - SAMPLE: GSM655613
11-Oct-2017 19:46:25 DEBUG GEOparse - SAMPLE: GSM655614
11-Oct-2017 19:46:25 DEBUG G

In [3]:
plats_2 = list(gse2.gpls.keys())[0]
print(plats_2)

GPL570


Annotation table

In [4]:
samples2 = gse2.phenotype_data[["characteristics_ch1.0.gender", "characteristics_ch1.2.stimulation", "characteristics_ch1.3.resistance status"]]
samples2 = samples2.rename(columns={'characteristics_ch1.0.gender':'gender', 'characteristics_ch1.2.stimulation':'fasting_status',
                         'characteristics_ch1.3.resistance status':'insulin_status'})
samples2['cbmi'] = samples2['insulin_status'].map(lambda x: 'lean' if x == 'sensitive' else 'obese')
print(samples2)

           gender    fasting_status insulin_status   cbmi
GSM655603  female           fasting      resistant  obese
GSM655604  female           fasting      resistant  obese
GSM655605  female           fasting      resistant  obese
GSM655606  female           fasting      resistant  obese
GSM655607  female           fasting      resistant  obese
GSM655608  female           fasting      sensitive   lean
GSM655609  female           fasting      sensitive   lean
GSM655610  female           fasting      sensitive   lean
GSM655611  female           fasting      sensitive   lean
GSM655612  female           fasting      sensitive   lean
GSM655613  female  hyperinsulinemia      resistant  obese
GSM655614  female  hyperinsulinemia      resistant  obese
GSM655615  female  hyperinsulinemia      resistant  obese
GSM655616  female  hyperinsulinemia      resistant  obese
GSM655617  female  hyperinsulinemia      resistant  obese
GSM655618  female  hyperinsulinemia      sensitive   lean
GSM655619  fem

In [15]:
samples2.to_pickle('./output/batch2_pheno.p')
with open('./output/batch2_pheno.txt', 'w') as handle:
    samples2.to_csv(handle, sep='\t')

# Preprocessing of Expression Data (Batch 2)

In [6]:
samples2_exprs = gse2.pivot_samples('VALUE')[list(samples2.index)]
#print('Expression Table', '\n', samples2_exprs.head())

In [7]:
samples2_ann = samples2_exprs.reset_index().merge(gse2.gpls['GPL570'].table[["ID", "ENTREZ_GENE_ID"]],
                                left_on='ID_REF', right_on="ID").set_index('ID_REF')
samples2_ann.drop('ID', inplace=True, axis=1)
samples2_ann['ENTREZ_GENE_ID'] = samples2_ann['ENTREZ_GENE_ID'].astype(str)
#print(samples2_ann.head())

# 5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

In [8]:
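print(samples2_ann.shape[0])  # probes before filtering
# Drop probes mapped to several Entrez IDs (listed with "///") and rows with missing values.
samples2_ann = samples2_ann[~samples2_ann.ENTREZ_GENE_ID.str.contains("///")].dropna()
# Caveat: ENTREZ_GENE_ID was cast to str above, so a missing ID would be the string 'nan', which dropna() does not remove.
print(samples2_ann.shape[0])
samples2_ann = samples2_ann.dropna()
print(samples2_ann.shape[0])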

54675
52375
52375
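The two filters from steps 1 and 5 can be illustrated on a small invented table. This sketch keeps missing Entrez IDs as real NaN values (no cast to `str`), so unmapped probes can be removed explicitly; the probe names, sample names, and values are made up, and only the `ENTREZ_GENE_ID` column name and the `///` separator come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy annotated table: two samples plus the Entrez ID column.
ann = pd.DataFrame(
    {
        "GSM_A": [2.1, 3.0, 5.5, np.nan],
        "GSM_B": [2.2, 3.1, 5.0, 4.0],
        "ENTREZ_GENE_ID": ["1", "780 /// 100616237", np.nan, "10"],
    },
    index=pd.Index(["probe1", "probe2", "probe3", "probe4"], name="ID_REF"),
)

ids = ann["ENTREZ_GENE_ID"]
unmapped = ids.isna()                                       # no Entrez ID at all
ambiguous = ids.str.contains("///", na=False, regex=False)  # several Entrez IDs

# Keep uniquely mapped probes, then drop probes with missing expression values.
filtered = ann[~unmapped & ~ambiguous].dropna()
print(filtered)
```

Only `probe1` survives here: `probe2` maps to two IDs, `probe3` has no ID, and `probe4` has a missing expression value.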


# 1) Probes containing missing values are excluded from the analysis.

# 2) Probes are mapped to Entrez ID labels if they are available in the associated platform.


# 3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).

# 4) Probes mapping to the same Entrez ID label are averaged out.

In [9]:
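# Collapse probes that share an Entrez ID. Note that the median is used here, whereas step 4 describes averaging.
exprs_2 = samples2_ann.groupby('ENTREZ_GENE_ID').median()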
print(exprs_2.head())
print('\n', exprs_2.shape)

                GSM655603  GSM655604  GSM655605  GSM655606  GSM655607  \
ENTREZ_GENE_ID                                                          
1                   2.110      2.130      2.120      2.140      2.290   
10                  2.140      2.190      2.190      2.190      2.240   
100                 6.075      6.145      6.820      6.010      6.065   
1000                2.960      3.005      3.005      2.965      3.105   
10000               5.230      3.810      3.710      4.310      4.020   

                GSM655608  GSM655609  GSM655610  GSM655611  GSM655612  \
ENTREZ_GENE_ID                                                          
1                   2.150      2.170       2.16      2.130      2.130   
10                  2.200      2.180       2.23      2.290      3.330   
100                 5.835      6.365       6.05      5.515      6.945   
1000                3.050      3.110       3.04      3.015      3.015   
10000               4.060      3.470       4.75   
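Step 4 speaks of averaging probes, while the cell above calls `.median()`. The two agree when a gene has one or two probes and can differ once there are three or more. The toy example below, with invented values, shows both aggregations side by side; which one the authors intended is not something this sketch can settle.

```python
import pandas as pd

# Toy table: three probes for gene "100", one probe for gene "10".
probes = pd.DataFrame(
    {
        "GSM_A": [2.0, 4.0, 9.0, 5.0],
        "GSM_B": [1.0, 3.0, 2.0, 7.0],
        "ENTREZ_GENE_ID": ["100", "100", "100", "10"],
    },
    index=["p1", "p2", "p3", "p4"],
)

by_mean = probes.groupby("ENTREZ_GENE_ID").mean()      # arithmetic mean per gene
by_median = probes.groupby("ENTREZ_GENE_ID").median()  # median per gene

print(by_mean)
print(by_median)
```

For gene `100` in `GSM_A`, the mean of 2, 4 and 9 is 5 while the median is 4, so the choice affects the resulting matrix whenever a gene has three or more probes.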

# Write the exprs_2 dataframe (transposed, samples as rows) to pickle and text files

In [16]:
to_write = exprs_2.T
to_write.to_pickle('./output/batch2_exprs.p')
with open('./output/batch2_geno.txt', 'w') as handle:
    to_write.to_csv(handle, sep='\t')

# Comparison with the data provided by the authors

In [12]:
or_data2 = pd.read_pickle('./data/GSE26637_geno.p')
print(or_data2.shape, to_write.shape)
len(set.intersection(set(or_data2.columns), set(to_write.columns)))

(20, 20283) (20, 20487)


18978
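The comparison above reports only the number of shared genes. A slightly fuller summary, sketched below on toy data, also counts the genes unique to each table. The helper `gene_overlap` is hypothetical, and casting the labels to `str` is an assumption meant to guard against one table storing Entrez IDs as integers and the other as strings.

```python
import pandas as pd


def gene_overlap(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Count shared and table-specific gene labels (taken from the columns)."""
    genes_a = set(map(str, a.columns))
    genes_b = set(map(str, b.columns))
    return {
        "shared": len(genes_a & genes_b),
        "only_a": len(genes_a - genes_b),
        "only_b": len(genes_b - genes_a),
    }


# Toy samples x genes tables; one uses string labels, the other integers.
ours = pd.DataFrame([[1.0, 2.0, 3.0]], columns=["1", "10", "100"])
theirs = pd.DataFrame([[1.0, 2.0]], columns=[1, 10])
summary = gene_overlap(ours, theirs)
print(summary)
```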

# 6) We apply a simple $L_1$ normalization in linear space, imposing that the sum of expression of all genes is constant among samples.
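No code for this step appears here, so the following is only a minimal sketch of what an L1 normalization in linear space could look like for a genes x samples matrix of log2 values. The function name `l1_normalize_log2` and the target sum `total` are assumptions; the constant actually used by the authors is not stated in this notebook.

```python
import numpy as np
import pandas as pd


def l1_normalize_log2(log_exprs: pd.DataFrame, total: float = 1.0) -> pd.DataFrame:
    """Rescale each sample (column) so its linear-scale values sum to `total`.

    Input and output are log2 values; `total` is a hypothetical constant.
    """
    linear = np.exp2(log_exprs)                  # back to linear space
    scaled = linear / linear.sum(axis=0) * total  # per-sample L1 scaling
    return np.log2(scaled)                       # return to log2 space


# Toy genes x samples matrix of log2 intensities.
toy = pd.DataFrame({"s1": [1.0, 3.0], "s2": [2.0, 2.0]}, index=["g1", "g2"])
norm = l1_normalize_log2(toy)

# After normalization every sample has the same linear-scale sum.
column_sums = np.exp2(norm).sum(axis=0)
print(column_sums)
```

In the notebook, `exprs_2` has genes as rows and samples as columns, so under these assumptions it could be passed to such a function before the transpose.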