# Batch 4 Preprocessing
Group 4: Damiano Chini, Riccardo Gilmozzi, Gianmarco Piccinno & Alessandro Rizzuto
Useful links:
https://github.com/ComplexityBiosystems/obesity-score
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48964

This notebook performs the preprocessing steps described in the paper:

1) Probes containing missing values are excluded from the analysis.

2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).

4) Probes mapping to the same Entrez ID label are averaged out.

5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

6) We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.

After these steps, each data set or batch is represented by a single expression matrix $X$. Each entry $X_{ij}$ represents the log2 of the expression intensity of gene $i$ in sample $j$.


In [1]:
import GEOparse
import pandas as pd

In [2]:
gse4 = GEOparse.get_GEO(filepath="./data/GSE48964_family.soft.gz")

11-Oct-2017 19:37:40 INFO GEOparse - Parsing ./data/GSE48964_family.soft.gz: 
11-Oct-2017 19:37:40 DEBUG GEOparse - DATABASE: GeoMiame
11-Oct-2017 19:37:40 DEBUG GEOparse - SERIES: GSE48964
11-Oct-2017 19:37:40 DEBUG GEOparse - PLATFORM: GPL6244
11-Oct-2017 19:37:42 DEBUG GEOparse - SAMPLE: GSM1187673
11-Oct-2017 19:37:42 DEBUG GEOparse - SAMPLE: GSM1187674
11-Oct-2017 19:37:42 DEBUG GEOparse - SAMPLE: GSM1187675
11-Oct-2017 19:37:42 DEBUG GEOparse - SAMPLE: GSM1187676
11-Oct-2017 19:37:42 DEBUG GEOparse - SAMPLE: GSM1187677
11-Oct-2017 19:37:42 DEBUG GEOparse - SAMPLE: GSM1187678


In [3]:
plats_4 = list(gse4.gpls.keys())[0]
print(plats_4)

GPL6244


Sample annotation (phenotype) table

In [20]:
# Phenotype labels come from the `source_name_ch1` field of each sample
samples4 = pd.DataFrame(gse4.phenotype_data["source_name_ch1"])
samples4 = samples4.rename(columns={"source_name_ch1": "cbmi"})
# The second space-separated word of the source name gives the class;
# anything other than 'obese' is labelled 'lean'
samples4["cbmi"] = samples4["cbmi"].apply(lambda x: 'obese' if x.split(' ')[1].lower() == 'obese' else 'lean')
print(samples4.head()); print(len(samples4))
print(samples4.shape)

             cbmi
GSM1187673  obese
GSM1187674  obese
GSM1187675  obese
GSM1187676   lean
GSM1187677   lean
6
(6, 1)


In [25]:
samples4.to_pickle('./output/batch4_pheno.p')
with open('./output/batch4_pheno.txt', 'w') as handle:
    samples4.to_csv(handle, sep='\t')

# Preprocessing of Expression Data (Batch 4)

In [6]:
samples4_exprs = gse4.pivot_samples('VALUE')[list(samples4.index)]
samples4_exprs.index.name = 'ID'
print('Expression Table', '\n', samples4_exprs.head())

Expression Table 
 name     GSM1187673  GSM1187674  GSM1187675  GSM1187676  GSM1187677  \
ID                                                                    
7892501     4.14123     3.69950     4.97979     5.18763     6.36756   
7892502     5.43273     5.83503     4.95171     5.01420     5.61799   
7892503     4.19429     4.65989     6.82531     4.47086     4.17932   
7892504     9.02873     8.80280     8.53222     8.67004     9.11426   
7892505     4.47112     4.15105     5.47718     4.37622     5.17555   

name     GSM1187678  
ID                   
7892501     5.24355  
7892502     5.71236  
7892503     6.35435  
7892504     8.83972  
7892505     4.36861  


As the platform annotation table we use the one contained in the GPL6244.annot file (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6244). Two of its columns are relevant here:

ID = ID from Platform data table

Gene ID = Entrez Gene identifier

Note that the column containing the Entrez ID is named `Gene ID`.

In [7]:
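# Skip the 27 lines preceding the table and keep only column 0 (probe ID, used as index) and column 3 (Gene ID)
ann_table4 = pd.read_csv('./data/GPL6244.annot', sep='\t', skiprows=27, dtype=str, na_values='NaN', usecols=[0, 3], index_col=0)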
print('Annotation Table: ', '\n', ann_table4.head())

Annotation Table:  
                                      Gene ID
ID                                          
7896736                                  NaN
7896738                                  NaN
7896740                81099///79501///26682
7896742  100134822///346288///140849///55251
7896744      729759///441308///81399///26683


Remove probes without ENTREZ ID

# 1) Probes containing missing values are excluded from the analysis.

In [8]:
print(samples4_exprs.shape)
samples4_exprs = samples4_exprs.dropna()
print(samples4_exprs.shape)

(33297, 6)
(33297, 6)


# 5) Probes that cannot be mapped to a unique Entrez ID label are excluded from the analysis, as well as those that cannot be mapped to any Entrez ID label at all.

In [9]:
ann_table = ann_table4.dropna()
print(ann_table.head())

                                     Gene ID
ID                                          
7896740                81099///79501///26682
7896742  100134822///346288///140849///55251
7896744      729759///441308///81399///26683
7896754                100287934///100287497
7896756                      400728///157693


Remove probes that map to more than one Entrez ID (the IDs are separated by `///`), for example:
         ID                              Gene ID
2   7896740                81099///79501///26682


In [10]:
ann_table = ann_table[~ann_table['Gene ID'].str.contains("///")]
print(ann_table.head())
print(ann_table.shape)

        Gene ID
ID             
7896759  643837
7896761  148398
7896779  339451
7896798   84069
7896817    9636
(20052, 1)


# 2) Probes are mapped to Entrez ID labels if they are available in the associated platform.

In [11]:
print(type(ann_table.index[0]))
print(type(samples4_exprs.index[0]))
samples4_exprs.index = samples4_exprs.index.astype(str)
print(type(ann_table.index[0]))
print(type(samples4_exprs.index[0]))

<class 'str'>
<class 'numpy.int64'>
<class 'str'>
<class 'str'>

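The cast above matters because the annotation index holds strings while the expression index holds integers, so the probe labels never compare equal until they share a type. A minimal sketch with made-up toy frames (the `GSM_a` column and the values are hypothetical, only the probe and Entrez IDs are taken from the tables above):

```python
import pandas as pd

# Toy annotation table: string probe IDs, as read with dtype=str
ann = pd.DataFrame({"Gene ID": ["643837", "148398"]},
                   index=pd.Index(["7896759", "7896761"], name="ID"))
# Toy expression table: integer probe IDs, as returned by pivot_samples
exprs = pd.DataFrame({"GSM_a": [8.4, 8.1]},
                     index=pd.Index([7896759, 7896761], name="ID"))

# Before the cast, no label is shared between the two indexes
shared_before = set(ann.index) & set(exprs.index)

# After casting the expression index to str, every probe matches
exprs.index = exprs.index.astype(str)
shared_after = set(ann.index) & set(exprs.index)

merged = ann.merge(exprs, left_index=True, right_index=True, how="inner")
print(len(shared_before), len(shared_after), merged.shape)
```

Checking the number of shared labels before merging is a cheap way to catch this kind of silent mismatch.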

In [12]:
exprs_4 = ann_table.merge(samples4_exprs, left_index=True, right_index=True, how='inner')
#exprs_4.index = exprs_4['Gene ID']; del exprs_4['Gene ID']
print(exprs_4.head())
print('\nShape of the complete DataFrame: ', exprs_4.shape)

        Gene ID  GSM1187673  GSM1187674  GSM1187675  GSM1187676  GSM1187677  \
ID                                                                            
7896759  643837     8.39817     8.50222     8.16782     8.52986     8.80022   
7896761  148398     8.05877     8.15524     7.81207     7.79663     7.98804   
7896779  339451     8.52992     8.55970     8.48987     8.37219     8.52200   
7896798   84069     8.58032     8.49331     8.50711     8.46933     8.48800   
7896817    9636     8.04444     8.12649     8.08356     7.92055     7.71697   

         GSM1187678  
ID                   
7896759     8.32601  
7896761     8.13137  
7896779     8.64135  
7896798     8.49557  
7896817     7.85109  

Shape of the complete DataFrame:  (20052, 7)



# 3) Values corresponding to raw expression counts or gene expression intensity are log2 transformed (if necessary).
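
The notebook applies no transformation at this step: the values printed above lie roughly between 3 and 13, which is consistent with intensities that are already on a log2 scale. A minimal sketch of how the "if necessary" check could be automated is given here. The helpers `needs_log2` and `log2_if_needed` and the threshold of 50 are assumptions made for illustration, not something taken from the paper.

```python
import numpy as np
import pandas as pd

def needs_log2(exprs: pd.DataFrame, max_log_value: float = 50.0) -> bool:
    """Heuristic: a maximum above max_log_value suggests linear-scale data."""
    return bool(exprs.to_numpy().max() > max_log_value)

def log2_if_needed(exprs: pd.DataFrame, max_log_value: float = 50.0) -> pd.DataFrame:
    """Return log2(exprs) for linear-scale data, otherwise exprs unchanged."""
    if needs_log2(exprs, max_log_value):
        return np.log2(exprs)
    return exprs

# Toy data: one frame already in log2, one converted back to linear scale
toy_log = pd.DataFrame({"s1": [8.35, 4.81], "s2": [8.43, 5.21]}, index=["1", "10"])
toy_linear = 2 ** toy_log

print(needs_log2(toy_log), needs_log2(toy_linear))
```

A threshold on the maximum is only a rule of thumb; inspecting the value distribution by eye remains the safer check.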

# 4) Probes mapping to the same Entrez ID label are averaged out.

In [13]:
exprs_4 = exprs_4.groupby('Gene ID').mean()
print(exprs_4.head())
print('\n', exprs_4.shape)

         GSM1187673  GSM1187674  GSM1187675  GSM1187676  GSM1187677  \
Gene ID                                                               
1           8.35427     8.42801     8.21779     8.31298     8.28053   
10          4.80774     5.20881     4.83953     4.92198     5.06599   
100        10.14680     9.79308    10.29420    10.41020    10.68150   
1000       12.85520    12.69220    12.49910    10.47370    12.68170   
10000      10.62110    10.43620    10.36570    10.29150    10.42170   

         GSM1187678  
Gene ID              
1           8.26724  
10          5.09882  
100         9.95644  
1000       12.75540  
10000      10.35980  

 (19036, 6)


# Write the exprs4 dataframe to a text file

In [24]:
exprs4 = exprs_4.T
exprs4.to_pickle('./output/batch4_exprs.p')
with open('./output/batch4_geno.txt', 'w') as handle:
    exprs4.to_csv(handle, sep='\t')

# Comparison with the data provided by the authors

In [18]:
or_data4 = pd.read_pickle('./data/GSE48964_geno.p')
#print(or_data4.head())
print('\n', or_data4.shape, exprs4.shape)
print('Intersection: ', len(set.intersection(set(exprs4.columns), set(or_data4.columns))))


 (6, 16193) (6, 19036)
Intersection:  15697
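
The cell above only counts the gene labels shared by the two tables. A possible next step, sketched here on made-up toy frames, is to compare the values on the shared genes sample by sample. The helper `compare_on_shared_genes` is hypothetical and assumes both frames are samples-by-genes with identical label types. Pearson correlation is used because a constant per-sample offset, such as the one an L1 normalization introduces in log space, does not affect it.

```python
import pandas as pd

def compare_on_shared_genes(ours: pd.DataFrame, theirs: pd.DataFrame) -> pd.Series:
    """Per-sample Pearson correlation restricted to shared genes and samples."""
    shared_genes = ours.columns.intersection(theirs.columns)
    shared_samples = ours.index.intersection(theirs.index)
    a = ours.loc[shared_samples, shared_genes]
    b = theirs.loc[shared_samples, shared_genes]
    # Transpose so that samples become columns, then correlate column by column
    return a.T.corrwith(b.T)

# Toy frames (samples x genes); 'theirs' is shifted by a constant and has an extra gene
ours = pd.DataFrame([[8.0, 5.0, 10.0], [8.5, 5.5, 9.0]],
                    index=["GSM_a", "GSM_b"], columns=["1", "10", "100"])
theirs = ours + 1.0
theirs["999"] = [1.0, 2.0]

per_sample_corr = compare_on_shared_genes(ours, theirs)
print(per_sample_corr)
```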


# 6) We apply a simple L1 normalization in linear space, imposing that the sum of expression of all genes is constant among samples.
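
This step is not implemented in the notebook. A minimal sketch of one way to do it, assuming a genes-by-samples frame of log2 values such as `exprs_4`: return to linear space, rescale each sample so that its total equals a constant, then take log2 again. The function `l1_normalize_log2` and the constant `target_sum` are hypothetical; the constant used by the authors is not stated here.

```python
import numpy as np
import pandas as pd

def l1_normalize_log2(log_exprs: pd.DataFrame, target_sum: float = 1e6) -> pd.DataFrame:
    """L1-normalize each sample (column) in linear space and return log2 values."""
    linear = 2.0 ** log_exprs                      # undo the log2 transform
    scale = target_sum / linear.sum(axis=0)        # one scaling factor per sample
    return np.log2(linear.mul(scale, axis="columns"))

# Toy genes x samples frame; GSM_b is GSM_a shifted by +1 in log2 space
toy = pd.DataFrame({"GSM_a": [8.0, 5.0, 10.0], "GSM_b": [9.0, 6.0, 11.0]},
                   index=["1", "10", "100"])
normalized = l1_normalize_log2(toy)

# In linear space every sample now sums to target_sum
print((2.0 ** normalized).sum(axis=0))
```

In log2 space this amounts to subtracting a per-sample constant, so two samples that differ only by a global intensity factor become identical after normalization.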