# Data transformation pipelines

The pipeline consists of three steps:
* handle missing values
* encode categorical features
* rescale numerical features

### Imports

In [1]:
import os
import numpy as np
import sklearn as skl
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [2]:
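# better visualization for long outputs
# (HTML and display are both imported from IPython.display; IPython.core.display is deprecated)
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))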

### Data

In [3]:
data_path = '/home/lorenzo/skl-repo/0_data/california_housing.csv'
df = pd.read_csv(data_path)

In [4]:
df.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [5]:
train, test = train_test_split(df, test_size = 0.2, random_state=3542)
print(f'Train set length: {len(train)}')
print(f'Test set length: {len(test)}')

Train set length: 16512
Test set length: 4128


### Managing null values

In [6]:
print(f'df original shape: {df.shape}')
df.isna().sum()

df original shape: (20640, 10)


longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Option 1: remove rows containing null values.

In [7]:
df.dropna().shape

(20433, 10)

Option 2: remove columns containing null values.

In [8]:
df.drop('total_bedrooms', axis=1).shape

(20640, 9)

Option 3: fill null values with the mean, the median, or another value.

In [9]:
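median = df['total_bedrooms'].median()
# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about
df['total_bedrooms'] = df['total_bedrooms'].fillna(median)
df.shape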

(20640, 10)

Note, however, that it is preferable to compute the median (or mean, or any other statistic) on the training set only, and then use that value to fill the missing values in both the training and test sets. Otherwise, information from the test set leaks into the preprocessing.
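As a minimal sketch of this idea with plain pandas, the snippet below uses a small made-up frame (the values and the names `toy_train` / `toy_test` are illustrative assumptions, not taken from the housing data). The median is computed from the training rows only and then reused for both sets.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy_train = pd.DataFrame({'total_bedrooms': [100.0, np.nan, 300.0, 200.0]})
toy_test = pd.DataFrame({'total_bedrooms': [np.nan, 1000.0]})

# The statistic is learned from the training set only (NaN is skipped by default)
train_median = toy_train['total_bedrooms'].median()

# The same training-set value is used to fill both sets
toy_train_filled = toy_train.fillna({'total_bedrooms': train_median})
toy_test_filled = toy_test.fillna({'total_bedrooms': train_median})

print(train_median)
print(toy_test_filled)
```

The test set never contributes to `train_median`, even though its own median would be different.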

In [10]:
imputer = SimpleImputer(strategy = 'median')

The median strategy only works on numerical features, so we need to temporarily separate them from categorical/string features, which can instead be filled with a constant or with the most frequent value. Here the only non-numerical column is 'ocean_proximity', so we simply create copies of the train and test dataframes without it.

In [17]:
train_num = train.drop('ocean_proximity', axis=1)
test_num = test.drop('ocean_proximity', axis=1)
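When there are several non-numerical columns, dropping them one by one gets tedious. One possible alternative, sketched here on a made-up frame (`toy` and its values are illustrative assumptions), is to select columns by dtype with `DataFrame.select_dtypes`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0],
    'total_bedrooms': [129.0, np.nan, 190.0],
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY'],
})

# Split columns by dtype instead of naming each column explicitly
toy_num = toy.select_dtypes(include='number')
toy_cat = toy.select_dtypes(exclude='number')

print(list(toy_num.columns))
print(list(toy_cat.columns))
```

Both results keep the original index, so they can be re-joined later without losing row alignment.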

Now we can fit and transform the training set.

In [18]:
train_num_cl = imputer.fit_transform(train_num)

Then we apply the same transformation to the test set, without refitting the imputer, so that the test set is filled with the medians learned from the training set.

In [19]:
test_num_cl = imputer.transform(test_num)
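To check what "without refitting" means in practice, the following sketch uses small made-up arrays (`X_train`, `X_test` and `toy_imputer` are illustrative names). After fitting, the learned medians are stored in the imputer's `statistics_` attribute, and `transform` reuses them as they are.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays, made up for illustration only
X_train = np.array([[1.0, 10.0], [np.nan, 30.0], [3.0, 20.0]])
X_test = np.array([[np.nan, 1000.0], [100.0, np.nan]])

toy_imputer = SimpleImputer(strategy='median')
X_train_filled = toy_imputer.fit_transform(X_train)  # learns the medians, then fills
X_test_filled = toy_imputer.transform(X_test)        # only fills, using the learned medians

print(toy_imputer.statistics_)  # per-column medians of X_train
print(X_test_filled)
```

The missing test values are replaced by the training medians, regardless of how the test values themselves are distributed.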

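The imputer returns NumPy arrays, so we rebuild the dataframes with the original column names and index, and then re-attach the 'ocean_proximity' column.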

In [20]:
train_clean = pd.DataFrame(train_num_cl, columns = train_num.columns, index = train_num.index)
train_clean['ocean_proximity'] = train['ocean_proximity']
test_clean = pd.DataFrame(test_num_cl, columns = test_num.columns, index = test_num.index)
test_clean['ocean_proximity'] = test['ocean_proximity']

In [22]:
# check that the cleaned sets kept all their rows
print(f'Train set length: {len(train_clean)}')
print(f'Test set length: {len(test_clean)}')

Train set length: 16512
Test set length: 4128


In [24]:
train_clean.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

### Encoding categorical features

### Scaling numerical features