# Data preprocessing


## Tomato dataset

This real-world dataset was published as part of a [study](https://www.pnas.org/doi/full/10.1073/pnas.1309606110) that identifies changes in DNA sequence and gene expression that differentiate cultivated tomato from its wild relatives.

... TBA ...

[Link to download the data](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE45774&format=file&file=GSE45774_rpkm_all.txt.gz)
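
The link above serves a gzip-compressed file, while the loading cell in this notebook reads an uncompressed `./data/GSE45774_rpkm_all.txt`. Decompressing by hand is one option; as a minimal sketch of the alternative, pandas can read a `.gz` file directly. The file name `toy_rpkm.txt.gz`, its contents, and the temporary directory are made up for illustration; the real download is not fetched here.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny tab-separated, gzip-compressed file with made-up values
# (a stand-in for the real download, which is not fetched here).
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "toy_rpkm.txt.gz")
with gzip.open(gz_path, "wt") as fh:
    fh.write("gene\tsampleA\tsampleB\n")
    fh.write("gene1\t0.0\t3.0\n")
    fh.write("gene2\t1.0\t7.0\n")

# By default pandas infers gzip compression from the ".gz" suffix
toy = pd.read_table(gz_path, index_col=0)
print(toy.shape)
```

Assuming the real file follows the same tab-separated layout, the same call should work on the downloaded `.txt.gz` without unpacking it first.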

In [1]:
import pandas as pd
import numpy as np

We will load the data, select columns 29 to 52, and apply the log2(x+1) transform.

<span style="color:blue">#TODO - explain more why? @Elisabeth</span>


In [3]:
data = pd.read_table("./data/GSE45774_rpkm_all.txt", index_col=0)
# Select columns 29 to 52 (zero-based positions 28 onward)
data = data.iloc[:,28:]
# Apply the log2(x+1) transformation to every value
data_log = np.log2(data + 1)
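
To make the effect of the transform concrete, here is a small sketch on made-up expression values (the frame `toy` and its numbers are hypothetical, not taken from the dataset). It uses the vectorized `np.log2(toy + 1)`, which computes the same values as applying `np.log2(x+1)` to each element. Two properties are visible: zeros stay at zero instead of producing `-inf`, and large values are strongly compressed.

```python
import numpy as np
import pandas as pd

# Made-up RPKM-like values, chosen so the results are whole numbers
toy = pd.DataFrame(
    {"sampleA": [0.0, 1.0, 3.0], "sampleB": [7.0, 1023.0, 0.0]},
    index=["gene1", "gene2", "gene3"],
)

# log2(x + 1): 0 -> 0, 1 -> 1, 3 -> 2, 7 -> 3, 1023 -> 10
toy_log = np.log2(toy + 1)
print(toy_log)
```

Whether these properties are the reason the transform was chosen here is not stated in this notebook; the sketch only shows what the transform does.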


We have to transpose the data so that genes become columns, and then derive potential targets from the row names:

In [6]:
# Transpose the data
data_transposed = data_log.transpose()
# Create extra features that might serve as target
data_transposed['species'] = [l.split('.')[0] for l in data_transposed.index]
data_transposed['position'] = [l.split('.')[1] for l in data_transposed.index]
data_transposed['tissue'] = [l.split('.')[2] for l in data_transposed.index]
data_transposed['root'] = [1 if l.split('.')[2] == 'root' else 0 for l in data_transposed.index]

# print the info on the data
print(f"There are {data_transposed.shape[0]} rows and {data_transposed.shape[1]} columns")
# Save the data - will be loaded in the notebooks
data_transposed.to_csv('./data/tomatos_with_targets.txt')

There are 24 rows and 28302 columns
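
The target columns above are built by splitting each row name on `.` into species, position, and tissue. As an alternative sketch of the same idea, the split can be done once with the pandas string accessor. The frame `toy` and its sample names are made up and only assume the `species.position.tissue` naming pattern.

```python
import pandas as pd

# Toy frame with hypothetical sample names in the species.position.tissue pattern
toy = pd.DataFrame(
    {"geneA": [1.0, 2.0, 3.0]},
    index=["penn.Sh.root", "penn.Sun.leaf", "other.Sh.root"],
)

# Split every row name once; expand=True returns one column per part
parts = toy.index.to_series().str.split(".", expand=True)
parts.columns = ["species", "position", "tissue"]

# Attach the parts and derive the binary root indicator
toy = toy.join(parts)
toy["root"] = (toy["tissue"] == "root").astype(int)
print(toy)
```

This variant relies on every row name having exactly three dot-separated parts and on the row names being unique; the list comprehensions in the cell above do not need the uniqueness assumption.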


In [5]:
data_transposed.head()

| | Solyc02g081130.1.1 | Solyc12g038200.1.1 | Solyc00g097760.1.1 | Solyc08g069180.2.1 | Solyc01g012570.1.1 | Solyc08g076670.2.1 | Solyc04g024840.2.1 | Solyc09g074310.2.1 | Solyc09g005370.1.1 | Solyc12g098180.1.1 | ... | Solyc01g088670.1.1 | Solyc06g063380.1.1 | Solyc10g050450.1.1 | Solyc07g051990.1.1 | Solyc02g093490.2.1 | Solyc10g007270.2.1 | species | position | tissue | root |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| penn.Sh.floral | 1.169216 | 0.0 | 0.0 | 1.305753 | 0.0 | 4.634343 | 2.860965 | 3.242561 | 0.0 | 0.0 | ... | 2.120579 | 0.0 | 0.0 | 0.0 | 5.027894 | 0.0 | penn | Sh | floral | 0 |
| penn.Sh.leaf | 0.0 | 0.858644 | 0.0 | 2.948383 | 0.0 | 4.653912 | 2.241161 | 2.922573 | 0.0 | 0.0 | ... | 1.37014 | 0.0 | 2.006256 | 0.0 | 5.06839 | 0.0 | penn | Sh | leaf | 0 |
| penn.Sh.root | 0.0 | 0.0 | 0.0 | 0.469958 | 0.0 | 5.798938 | 3.217704 | 2.229663 | 0.0 | 0.0 | ... | 1.970763 | 0.0 | 1.688588 | 0.0 | 4.889667 | 0.0 | penn | Sh | root | 1 |
| penn.Sh.sdling | 1.813992 | 0.0 | 0.0 | 2.586147 | 0.0 | 4.771861 | 3.099842 | 3.452604 | 0.0 | 0.0 | ... | 1.982446 | 0.0 | 1.651062 | 0.0 | 3.10562 | 0.0 | penn | Sh | sdling | 0 |
| penn.Sh.stem | 0.0 | 0.0 | 0.0 | 1.540887 | 0.0 | 4.90777 | 4.800873 | 1.97389 | 0.0 | 0.0 | ... | 1.452937 | 0.0 | 0.0 | 0.0 | 4.99372 | 0.0 | penn | Sh | stem | 0 |


In [8]:
# These are the potential targets:
potential_targets = ['species','position','tissue','root']
data_transposed[potential_targets]

| | species | position | tissue | root |
|---|---|---|---|---|
| penn.Sh.floral | penn | Sh | floral | 0 |
| penn.Sh.leaf | penn | Sh | leaf | 0 |
| penn.Sh.root | penn | Sh | root | 1 |
| penn.Sh.sdling | penn | Sh | sdling | 0 |
| penn.Sh.stem | penn | Sh | stem | 0 |
| penn.Sh.veg | penn | Sh | veg | 0 |
| penn.Sun.floral | penn | Sun | floral | 0 |
| penn.Sun.leaf | penn | Sun | leaf | 0 |
| penn.Sun.root | penn | Sun | root | 1 |
| penn.Sun.sdling | penn | Sun | sdling | 0 |
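
Because the table is saved with its row index, a notebook that loads it again needs `index_col=0` and then has to separate the gene columns from the targets. The following is a minimal sketch of that round trip on a made-up two-row frame; the file name `toy_with_targets.txt`, the gene columns, and the choice of `root` as the target are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

potential_targets = ["species", "position", "tissue", "root"]

# Made-up stand-in for the saved table: two gene columns plus the four targets
toy = pd.DataFrame(
    {
        "geneA": [0.0, 1.5],
        "geneB": [2.0, 0.0],
        "species": ["penn", "penn"],
        "position": ["Sh", "Sun"],
        "tissue": ["root", "leaf"],
        "root": [1, 0],
    },
    index=["penn.Sh.root", "penn.Sun.leaf"],
)
path = os.path.join(tempfile.mkdtemp(), "toy_with_targets.txt")
toy.to_csv(path)  # the row index is written as the first column

# Reload with the first column as index, then split features from one target
loaded = pd.read_csv(path, index_col=0)
X = loaded.drop(columns=potential_targets)
y = loaded["root"]
print(X.shape, y.tolist())
```

Dropping all four target columns from `X`, not only the one being predicted, keeps the other labels from leaking into the features.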


## California housing

In [6]:
from sklearn.datasets import fetch_california_housing

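# Load the features and the target as pandas objects
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
# Append the target (median house value) as the last column
data = X
data['MedHouseVal'] = y
# Save without the row index
data.to_csv('./data/calif_housing.txt', index=False)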