## 2. Preparing the Data for Machine Learning Algorithms

We will be writing functions to prepare the data for ML algorithms. There are a few good reasons for doing that:
1. This will allow us to reproduce these transformations easily on any dataset (e.g. the next time you get a fresh dataset).
2. We will gradually build a library of transformation functions that you can reuse in future projects.
3. We can use these functions in our live system to transform new data before feeding it to the algorithms.
4. This will make it easy to try various transformations and see which combination works best.

In [None]:
import pandas as pd
ca_housing = "https://raw.githubusercontent.com/csbfx/advpy122-data/master/California_housing.csv"

housing = pd.read_csv(ca_housing)

In [None]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit


import numpy as np

housing["income_cat"] = pd.cut(housing["median_income"],
                              bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                              labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Now we should remove the income_cat attribute so that
# the data is back to its original state
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Fun note: [Why use 42 for the random state?](https://medium.com/geekculture/the-story-behind-random-seed-42-in-machine-learning-b838c4ac290a)

Let's separate the predictors from the labels, since we don't necessarily want to apply the same transformations to the predictors and the target values. Note that `drop()` returns a new DataFrame, so `strat_train_set` itself is not modified:

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

### Data Cleaning
Most ML algorithms cannot work with missing features, so let's create a few functions to take care of them. We noticed that the `total_bedrooms` attribute has some missing values.

In [None]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16512 entries, 12655 to 19773
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  float64
 3   total_rooms         16512 non-null  float64
 4   total_bedrooms      16354 non-null  float64
 5   population          16512 non-null  float64
 6   households          16512 non-null  float64
 7   median_income       16512 non-null  float64
 8   ocean_proximity     16512 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.3+ MB


We have three options for fixing the missing data:
1. Get rid of the rows that have a missing value, using `dropna()`.
2. Get rid of the whole attribute, `total_bedrooms`, using `drop()`.
3. Set the values to some value (zero, the mean, the median, etc.), using `fillna()`.

In [None]:
# Let's sample a few of the rows with missing data and see the effect of options 1, 2, and 3:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,,1145.0,480.0,6.358,NEAR OCEAN


In [None]:
sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1 - drop the rows with missing values

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity


In [None]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2 - drop the column total_bedrooms

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,1145.0,480.0,6.358,NEAR OCEAN


In [None]:
median = housing["total_bedrooms"].median()
sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median}) # option 3 - replace na with the median
sample_incomplete_rows

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,433.0,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,433.0,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,433.0,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,433.0,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,433.0,1145.0,480.0,6.358,NEAR OCEAN


Scikit-Learn provides a handy class to take care of missing values: `SimpleImputer`.
First, we need to create a `SimpleImputer` instance, specifying that we want to replace each attribute's missing values with the median of that attribute:

In [None]:
from sklearn.impute import SimpleImputer
# strategy options include: "mean", "median", "most_frequent", or "constant"
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

imputer = SimpleImputer(strategy="median")

We need to remove the text attribute because the median can only be calculated on numerical attributes:

In [None]:
housing_num = housing.drop('ocean_proximity', axis=1)

Now we can fit the `imputer` instance to the training data using the `fit()` method:

In [None]:
imputer.fit(housing_num)

The `imputer` has simply computed the median of each attribute and stored the result in its `statistics_` instance variable.

In [None]:
imputer.statistics_

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

Although only the `total_bedrooms` attribute had missing values, we cannot be sure that there won't be any missing values in new data after the system goes live, so it's safer to apply the `imputer` to all the numerical attributes.

In [None]:
# Check that this is the same as manually computing the median of each attribute:

housing_num.median().values

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])
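To illustrate why fitting the imputer on every numerical column is the safer choice, here is a minimal sketch on a tiny made-up DataFrame (the column names `a` and `b` and all values are hypothetical, not taken from the housing data). The training data has a gap in only one column, the new data has a gap in the other, and the fitted imputer can still fill it:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy training data: only column "b" has a missing value
train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]})

# Hypothetical new data arriving later: now column "a" has the gap
new_data = pd.DataFrame({"a": [np.nan], "b": [15.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(train)                     # learns one median per column

# The NaN in "a" is replaced by the median of train["a"]
filled = toy_imputer.transform(new_data)
print(toy_imputer.statistics_)
print(filled)
```

Had the imputer been fitted on column `b` alone, the missing value in `a` would have reached the model untouched.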

Now we can use the "trained" imputer to transform the training set, replacing missing values with the learned medians:

In [None]:
#Transform the training set:
X = imputer.transform(housing_num)

In [None]:
X

array([[-1.2146e+02,  3.8520e+01,  2.9000e+01, ...,  2.2370e+03,
         7.0600e+02,  2.1736e+00],
       [-1.1723e+02,  3.3090e+01,  7.0000e+00, ...,  2.0150e+03,
         7.6800e+02,  6.3373e+00],
       [-1.1904e+02,  3.5370e+01,  4.4000e+01, ...,  6.6700e+02,
         3.0000e+02,  2.8750e+00],
       ...,
       [-1.2272e+02,  3.8440e+01,  4.8000e+01, ...,  4.5800e+02,
         1.7200e+02,  3.1797e+00],
       [-1.2270e+02,  3.8310e+01,  1.4000e+01, ...,  1.2080e+03,
         5.0100e+02,  4.1964e+00],
       [-1.2214e+02,  3.9970e+01,  2.7000e+01, ...,  6.2500e+02,
         1.9700e+02,  3.1319e+00]])

The result is a plain NumPy array containing the transformed features. We can put it back into a pandas DataFrame:

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
housing_tr.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964


In [None]:
housing_tr.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16512 entries, 12655 to 19773
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  float64
 3   total_rooms         16512 non-null  float64
 4   total_bedrooms      16512 non-null  float64
 5   population          16512 non-null  float64
 6   households          16512 non-null  float64
 7   median_income       16512 non-null  float64
dtypes: float64(8)
memory usage: 1.1 MB


In [None]:
# Check that the previously missing values are now filled with the median:
housing_tr.loc[sample_incomplete_rows.index]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
1606,-122.08,37.88,26.0,2947.0,433.0,825.0,626.0,2.933
10915,-117.87,33.73,45.0,2264.0,433.0,1970.0,499.0,3.4193
19150,-122.7,38.35,14.0,2313.0,433.0,954.0,397.0,3.7813
4186,-118.23,34.13,48.0,1308.0,433.0,835.0,294.0,4.2891
16885,-122.4,37.58,26.0,3281.0,433.0,1145.0,480.0,6.358


### Handling Text and Categorical Attributes
Earlier we left out the categorical attribute `ocean_proximity` because it is a text attribute, so we cannot compute its median. Most ML algorithms prefer to work with numbers anyway, so let's convert these categories from text to numbers. For this, we can use Scikit-Learn's `OrdinalEncoder` class.

In [None]:
housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)

Unnamed: 0,ocean_proximity
12655,INLAND
15502,NEAR OCEAN
2908,INLAND
14053,NEAR OCEAN
20496,<1H OCEAN
1481,NEAR BAY
18125,<1H OCEAN
5830,<1H OCEAN
17989,<1H OCEAN
4861,<1H OCEAN


In [None]:
housing['ocean_proximity'].unique()

array(['INLAND', 'NEAR OCEAN', '<1H OCEAN', 'NEAR BAY', 'ISLAND'],
      dtype=object)

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[0:10]

array([[1.],
       [4.],
       [1.],
       [4.],
       [0.],
       [3.],
       [0.],
       [0.],
       [0.],
       [0.]])

We can get the list of categories using the `categories_` instance variable. It is a list containing a 1D array of categories for each categorical attribute.

In [None]:
ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g. for ordered categories such as "bad", "average", "good", "excellent"), but it is obviously not the case for the `ocean_proximity` column. To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is "INLAND" (and 0 otherwise), and so on. This is called _**one-hot encoding**_, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called _dummy_ attributes.
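The idea can be sketched by hand with NumPy before reaching for a library class. This is only an illustration of the encoding itself: the three example rows are made up, and the category order is copied from `ordinal_encoder.categories_` above.

```python
import numpy as np

# Category order copied from ordinal_encoder.categories_
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
values = ["INLAND", "NEAR OCEAN", "<1H OCEAN"]   # a few made-up example rows

# Map each value to its integer code, then pick the matching row
# of an identity matrix: exactly one 1 per row, in the category's column
codes = np.array([categories.index(v) for v in values])
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```

Each row contains a single 1, so no category ends up numerically "closer" to another.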

Scikit-Learn provides a `OneHotEncoder` class to convert categorical values into one-hot vectors:

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

By default, the output is a SciPy _sparse matrix_ instead of a NumPy array. This is very useful when we have categorical attributes with thousands of categories: after one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for a single 1 per row. Storing all those zeros would waste memory, so a sparse matrix only stores the locations of the non-zero elements. You can use it mostly like a normal 2D array, and you can convert it to a (dense) NumPy array using the `toarray()` method:

In [None]:
housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
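To get a feel for how much a sparse representation saves, the sketch below builds a one-hot matrix directly with SciPy and compares the number of stored elements with the size of the equivalent dense array. The sizes (1,000 rows, 500 categories) and the random category codes are made up for illustration.

```python
import numpy as np
from scipy import sparse

n_rows, n_categories = 1000, 500          # hypothetical sizes
rng = np.random.default_rng(0)
codes = rng.integers(0, n_categories, size=n_rows)   # one category code per row

# Build a one-hot matrix in CSR format: a single 1 per row, at column codes[i]
one_hot_sparse = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, n_categories),
)
one_hot_dense = one_hot_sparse.toarray()

print(one_hot_sparse.nnz)    # number of stored (non-zero) elements
print(one_hot_dense.size)    # number of elements held by the dense array
```

The sparse matrix stores one element per row, while the dense array holds a value for every row and every category.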

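Alternatively, you can set `sparse_output=False` when creating the `OneHotEncoder` (this parameter was named `sparse` before Scikit-Learn 1.2):

In [None]:
cat_encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on Scikit-Learn < 1.2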
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot



array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

Once again, you can get the list of categories using the encoder's `categories_` instance variable:

In [None]:
cat_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

### Custom Transformers
Although Scikit-Learn provides many useful transformers, we will need to write our own transformers for tasks such as custom cleanup operations or combining specific attributes. We will need to create a class and implement three methods:
1. `fit()` - returning self
2. `transform()`
3. `fit_transform()`

If we add [`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) as a base class, we don't need to implement `fit_transform()` ourselves, since it is inherited from the `TransformerMixin` class.
If we add [`BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) as a base class, we will get two extra methods, `get_params()` and `set_params()`, that will be useful for automatic hyperparameter tuning.
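As a warm-up, here is a minimal sketch of what the two base classes provide. The `LogTransformer` class and its `add_one` hyperparameter are hypothetical names invented for this illustration; only `fit()` and `transform()` are written by hand, and the other methods come from the base classes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: applies log(1 + x) or log(x) to every value."""

    def __init__(self, add_one=True):     # hyperparameters are plain keyword arguments
        self.add_one = add_one

    def fit(self, X, y=None):
        return self                       # nothing to learn

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.log1p(X) if self.add_one else np.log(X)

log_tr = LogTransformer()
print(log_tr.get_params())                # provided by BaseEstimator
log_tr.set_params(add_one=False)          # provided by BaseEstimator
result = log_tr.fit_transform(np.array([[1.0, np.e]]))   # provided by TransformerMixin
print(result)
```

For `get_params()` to work, each constructor argument must be stored under an attribute of the same name, which is why `__init__` does nothing but assign.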

#### Let's create a custom transformer to add extra attributes:

In [None]:
housing.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity'],
      dtype='object')

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# rooms_idx, bedrooms_idx, population_idx, household_idx = 3, 4, 5, 6

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_idx, bedrooms_idx, population_idx, household_idx = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

# This is our custom transformer class
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        # add attributes that are computed from existing attributes
        rooms_per_household = X[:, rooms_idx] / X[:, household_idx]
        population_per_household = X[:, population_idx] / X[:, household_idx]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_idx] / X[:, rooms_idx]
            # np.c_ appends the new attributes to X as extra columns
            # https://numpy.org/doc/stable/reference/generated/numpy.c_.html
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

# initializing the transformer
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
# using the transformer
housing_extra_attribs = attr_adder.transform(housing.values)

In [None]:
housing_extra_attribs

array([[-121.46, 38.52, 29.0, ..., 'INLAND', 5.485835694050992,
        3.168555240793201],
       [-117.23, 33.09, 7.0, ..., 'NEAR OCEAN', 6.927083333333333,
        2.6236979166666665],
       [-119.04, 35.37, 44.0, ..., 'INLAND', 5.3933333333333335,
        2.223333333333333],
       ...,
       [-122.72, 38.44, 48.0, ..., '<1H OCEAN', 4.1104651162790695,
        2.6627906976744184],
       [-122.7, 38.31, 14.0, ..., '<1H OCEAN', 6.297405189620759,
        2.411177644710579],
       [-122.14, 39.97, 27.0, ..., 'INLAND', 5.477157360406092,
        3.1725888324873095]], dtype=object)

In the example above, the transformer has one hyperparameter, `add_bedrooms_per_room`, set to `True` by default. This hyperparameter will allow you to easily find out whether adding this attribute helps the ML algorithms or not.

- You can add a hyperparameter to gate any data preparation step that you are not 100% sure about. The more you automate these data preparation steps, the more combinations you can automatically try out, making it much more likely that you will find a great combination (and saving you a lot of time).

### Notations

$X$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the $i^{th}$ row is equal to the transpose of $x^{(i)}$, noted $(x^{(i)})^{T}$.

$x^{(i)}$ is a vector of all the feature values (excluding labels) of the $i^{th}$ instance in the dataset, and $y^{(i)}$ is its label (the desired output value for that instance).

For example, if the first district in the dataset is located at _longitude_ -118.29°, _latitude_ 33.91°, and it has 1,416 _inhabitants_ with a _median_ _income_ of \\$38,372, and the _median house value_ is \\$156,400 (ignoring the other features for now), then:

$x^{(1)} = \begin{pmatrix}-118.29\\
33.91\\
1,416\\
38,372
\end{pmatrix}$

and:

$y^{(1)} = 156,400$


For example, if the first district is as just described, then the matrix $X$ will look like this:

$X=\begin{pmatrix}
(x^{(1)})^T\\
(x^{(2)})^T\\
\vdots\\
(x^{(1999)})^T\\
(x^{(2000)})^T\\
\end{pmatrix} = \begin{pmatrix}-118.29 & 33.91 & 1,416 & 38,372\\
\vdots & \vdots & \vdots & \vdots
\end{pmatrix}$
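In NumPy terms, this notation maps onto array rows. The sketch below uses the district described above as the first row; the second row and its label are made up purely so that the matrix has more than one instance.

```python
import numpy as np

# One row per instance; the first row is the district from the text,
# the second row is hypothetical
X = np.array([
    [-118.29, 33.91, 1416.0, 38372.0],
    [-122.23, 37.88,  322.0, 83252.0],   # hypothetical values
])
y = np.array([156400.0, 452600.0])        # the second label is also hypothetical

x_1 = X[0].reshape(-1, 1)   # x^(1) as a column vector, shape (4, 1)
print(x_1.T)                # (x^(1))^T is the first row of X
print(y[0])                 # y^(1)
```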



In [None]:
# X - the original housing data (exclude labels)
housing.values

array([[-121.46, 38.52, 29.0, ..., 706.0, 2.1736, 'INLAND'],
       [-117.23, 33.09, 7.0, ..., 768.0, 6.3373, 'NEAR OCEAN'],
       [-119.04, 35.37, 44.0, ..., 300.0, 2.875, 'INLAND'],
       ...,
       [-122.72, 38.44, 48.0, ..., 172.0, 3.1797, '<1H OCEAN'],
       [-122.7, 38.31, 14.0, ..., 501.0, 4.1964, '<1H OCEAN'],
       [-122.14, 39.97, 27.0, ..., 197.0, 3.1319, 'INLAND']], dtype=object)

In [None]:
housing.values[0]

array([-121.46, 38.52, 29.0, 3873.0, 797.0, 2237.0, 706.0, 2.1736,
       'INLAND'], dtype=object)

In [None]:
# After transforming X by running attr_adder.transform(housing.values)
housing_extra_attribs[0]

array([-121.46, 38.52, 29.0, 3873.0, 797.0, 2237.0, 706.0, 2.1736,
       'INLAND', 5.485835694050992, 3.168555240793201], dtype=object)

In [None]:
# Convert the 2D array into a pandas DataFrame
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"], # the columns added by our custom transformer
    index=housing.index)
housing_extra_attribs.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,rooms_per_household,population_per_household
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND,5.485836,3.168555
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN,6.927083,2.623698
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,INLAND,5.393333,2.223333
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN,3.886128,1.859213
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN,6.096552,3.167241


### Feature scaling
One of the most important transformations we need to apply to our data is _feature scaling_. With few exceptions, ML algorithms don't perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median income ranges only from 0 to 15.

* Note: scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale:
1. Min-max scaling
2. Standardization

#### 1. Min-max scaling
It's also referred to as _normalization_. This scaling is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min.

`X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))`<br>
`X_scaled = X_std * (max - min) + min`      
where min, max = feature_range.

Scikit-learn provides a transformer called [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html?highlight=minmaxscaler#sklearn.preprocessing.MinMaxScaler) for this. It has a `feature_range` hyperparameter that lets you change the range if you don't want 0-1 for some reason.
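The formula can be checked against `MinMaxScaler` on a small array. This is a sketch on made-up values (the array `X_toy` is hypothetical), comparing the manual computation with the default `feature_range=(0, 1)` and then trying a different range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_toy = np.array([[6.0, 0.5], [100.0, 5.0], [39320.0, 15.0]])  # made-up values

# Manual min-max scaling, following the formula with feature_range = (0, 1)
X_manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

X_sklearn = MinMaxScaler().fit_transform(X_toy)
X_wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))          # the two computations agree
print(X_wide.min(axis=0), X_wide.max(axis=0))    # each column now spans -1 to 1
```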

#### 2. Standardization
First it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance.

It differs from min-max scaling in two ways:
1. It doesn't bound values to a specific range, which may be a problem for some algorithms (e.g. neural networks often expect an input value ranging from 0 to 1).
2. It's much less affected by outliers than min-max scaling. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0-15 down to 0-0.15, whereas standardization would not be much affected.

Scikit-Learn provides a transformer called [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler) for standardization.
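The sketch below checks two of the statements above on a handful of made-up income values: standardized values have zero mean and unit variance, and a single erroneous value of 100 squeezes the legitimate values into 0-0.15 under min-max scaling. With only six values the outlier also has a visible effect on the mean and standard deviation, so this toy example is not meant to demonstrate the robustness of standardization on a realistically sized dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up median incomes between 0 and 15, plus one erroneous value of 100
incomes = np.array([[0.0], [3.0], [5.0], [8.0], [15.0], [100.0]])

standardized = StandardScaler().fit_transform(incomes)
print(standardized.mean(), standardized.std())   # approximately 0 and 1

min_maxed = MinMaxScaler().fit_transform(incomes)
print(min_maxed[:-1].max())   # the largest legitimate value becomes 15 / 100
```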

## Transformation Pipeline
As you can see, we have just gone through a number of transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the `Pipeline` class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), # fill in missing values with median values
        ('attribs_adder', CombinedAttributesAdder()),  # add combined attributes to the data
        ('std_scaler', StandardScaler()),              # feature scaling
        # ('std_scaler', MinMaxScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)


In [None]:
housing_num_tr

array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.01739526,
         0.00622264, -0.12112176],
       [ 1.17178212, -1.19243966, -1.72201763, ...,  0.56925554,
        -0.04081077, -0.81086696],
       [ 0.26758118, -0.1259716 ,  1.22045984, ..., -0.01802432,
        -0.07537122, -0.33827252],
       ...,
       [-1.5707942 ,  1.31001828,  1.53856552, ..., -0.5092404 ,
        -0.03743619,  0.32286937],
       [-1.56080303,  1.2492109 , -1.1653327 , ...,  0.32814891,
        -0.05915604, -0.45702273],
       [-1.28105026,  2.02567448, -0.13148926, ...,  0.01407228,
         0.00657083, -0.12169672]])

The `Pipeline` constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a `fit_transform()` method). The names can be anything you like (as long as they are unique and don't contain double underscores, `__`); they will come in handy later for hyperparameter tuning.
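The reason double underscores are reserved is that Scikit-Learn uses them to address the hyperparameters of individual steps, in the form `<step name>__<parameter name>`. Here is a minimal sketch on a toy pipeline (the step names `imputer` and `scaler` and the data are chosen just for this illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Hyperparameters of a step are addressed as <step name>__<parameter name>
toy_pipeline.set_params(imputer__strategy="mean", scaler__with_std=False)

X_toy = np.array([[1.0], [np.nan], [3.0]])      # made-up data
X_toy_tr = toy_pipeline.fit_transform(X_toy)    # impute with the mean, then center
print(X_toy_tr)
print(toy_pipeline.named_steps["imputer"].strategy)
```

Hyperparameter search tools use the same naming scheme for the keys of their parameter grids.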

So far we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single transformer that can handle all columns, applying the appropriate transforms to each column.

Scikit-Learn introduced the [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer) for this purpose, and the good news is that it works great with Pandas DataFrames.

Let's use it to apply all the transformations to the housing data:

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num) # list of numerical column names
cat_attribs = ["ocean_proximity"] # list of categorical column names

print(num_attribs)

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']


In [None]:
housing_prepared

array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.17178212, -1.19243966, -1.72201763, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.26758118, -0.1259716 ,  1.22045984, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.5707942 ,  1.31001828,  1.53856552, ...,  0.        ,
         0.        ,  0.        ],
       [-1.56080303,  1.2492109 , -1.1653327 , ...,  0.        ,
         0.        ,  0.        ],
       [-1.28105026,  2.02567448, -0.13148926, ...,  0.        ,
         0.        ,  0.        ]])
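Once a `ColumnTransformer` has been fitted on the training data, the same object can prepare new data with `transform()`, reusing the statistics and categories it learned. The sketch below shows this on a made-up two-column DataFrame; the column names, the values, and the `sparse_threshold=0` setting (used here only to force a dense result) are assumptions for the illustration and not part of the housing pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training and "live" data with one numerical and one categorical column
train_df = pd.DataFrame({"rooms": [2.0, 4.0, 6.0],
                         "proximity": ["INLAND", "NEAR BAY", "INLAND"]})
new_df = pd.DataFrame({"rooms": [4.0], "proximity": ["NEAR BAY"]})

toy_transformer = ColumnTransformer([
    ("num", StandardScaler(), ["rooms"]),
    ("cat", OneHotEncoder(), ["proximity"]),
], sparse_threshold=0)   # always return a dense array

train_prepared = toy_transformer.fit_transform(train_df)  # learn from the training data
new_prepared = toy_transformer.transform(new_df)          # reuse what was learned, no refit
print(new_prepared)
```

Calling `fit_transform()` on the new data instead would recompute the mean and the category list from that data alone, so the resulting columns would no longer match what the model was trained on.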