# Building a Data Preparation Pipeline
The aim of this part of our example ML project workflow is to build a data preparation pipeline that transforms an initial Pandas DataFrame into a Numpy array suitable for training a Scikit-Learn machine learning model.

## Split Data into Predictors and Labels
We start by loading our training data and separating the input variables `X` from the labels `y`.

In [1]:
import pandas as pd

train_data = pd.read_csv('data/data_train.csv')
X = train_data.drop(['median_house_value'], axis=1)
y = train_data['median_house_value']

X.info()
X.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 9 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16351 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
ocean_proximity       16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.1+ MB


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.29 | 37.81 | 49.0 | 844.0 | 204.0 | 560.0 | 152.0 | 1.75 | NEAR BAY |
| 1 | -123.52 | 41.01 | 17.0 | 1564.0 | 345.0 | 517.0 | 222.0 | 2.1542 | INLAND |
| 2 | -122.89 | 40.76 | 14.0 | 712.0 | 131.0 | 270.0 | 90.0 | 2.3958 | INLAND |
| 3 | -118.36 | 34.26 | 34.0 | 3677.0 | 573.0 | 1598.0 | 568.0 | 6.838 | <1H OCEAN |
| 4 | -121.48 | 38.57 | 38.0 | 2809.0 | 805.0 | 1243.0 | 785.0 | 1.8512 | INLAND |
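
The same split can be tried on a small in-memory frame. This is only a minimal sketch: the `toy` DataFrame and its values are made up for illustration and stand in for the contents of `data/data_train.csv`.

```python
import pandas as pd

# hypothetical toy data standing in for data/data_train.csv
toy = pd.DataFrame({
    'median_income': [1.75, 2.15, 6.84],
    'ocean_proximity': ['NEAR BAY', 'INLAND', '<1H OCEAN'],
    'median_house_value': [75000.0, 60000.0, 340000.0],
})

# drop returns a new DataFrame, so `toy` itself is left untouched
X_toy = toy.drop(['median_house_value'], axis=1)
y_toy = toy['median_house_value']

print(list(X_toy.columns))  # ['median_income', 'ocean_proximity']
print(y_toy.shape)          # (3,)
```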


## DataFrame Mapping and Variable Selection
We define a custom transformer for adapting an arbitrary DataFrame, so that we can automate the process of:
- selecting subsets of columns (i.e. input variables); and,
- returning Numpy arrays, as required by Scikit-Learn algorithms.

We present the class definition below but, so that the final pipeline(s) can be serialised, the actual code is imported from a simple Python module called `custom_transformers.py` that lives in the same directory as this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class DataFrameAdapter(BaseEstimator, TransformerMixin):
    """DataFrameAdapter
    
    Class for mapping column-subsets of Pandas DataFrames
    to raw Numpy arrays.
    """
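    def __init__(self, col_names):
        # store the argument unmodified so that sklearn.base.clone
        # (used by e.g. GridSearchCV) can rebuild the transformer
        self.col_names = col_names

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # select the requested columns and drop down to a raw Numpy array
        return X[list(self.col_names)].values
    
    def get_feature_names(self):
        return list(self.col_names)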
```

We test this transformer on our training data. 

In [3]:
from custom_transformers import DataFrameAdapter

# test our custom Transformer class
cat_data_adapter = DataFrameAdapter(['ocean_proximity'])
cat_data = cat_data_adapter.fit_transform(X)

cat_data

array([['NEAR BAY'],
       ['INLAND'],
       ['INLAND'],
       ..., 
       ['<1H OCEAN'],
       ['INLAND'],
       ['NEAR OCEAN']], dtype=object)
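
The adapter works the same way for numeric columns. The sketch below is self-contained, so it defines `ColumnSelector`, a hypothetical stand-in that mirrors `DataFrameAdapter` (it is not the class from `custom_transformers.py`), and applies it to a made-up two-row DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for DataFrameAdapter, defined here so
    that the example runs without custom_transformers.py."""

    def __init__(self, col_names):
        self.col_names = col_names

    def fit(self, X, y=None):
        # nothing to learn: the columns to keep are given up front
        return self

    def transform(self, X, y=None):
        # select the columns, then return the underlying Numpy array
        return X[list(self.col_names)].values


# made-up rows, not the housing training data
toy = pd.DataFrame({
    'total_rooms': [844.0, 1564.0],
    'households': [152.0, 222.0],
    'ocean_proximity': ['NEAR BAY', 'INLAND'],
})

selector = ColumnSelector(['total_rooms', 'households'])
num_array = selector.fit_transform(toy)

print(type(num_array).__name__, num_array.shape)  # ndarray (2, 2)
```

Because only numeric columns are selected here, the resulting array has a float dtype rather than the `object` dtype seen for `ocean_proximity` above.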

## Encoding Categorical Variables
Unlike R algorithms, which (usually) embed the necessary scaffolding code to deal with categorical data automatically, in Scikit-Learn we have to write another custom transformer to perform this task. This class expects an array of categorical (string) variables (i.e. columns) as input. Each variable is first turned into a factor (integer-encoded data) and then passed to a one-hot-encoding algorithm, which encodes the factor as the 0/1 indicator variables required by most ML algorithms (as well as most 'traditional' statistical methods like OLS regression).

Once again we present the class definition below but, so that we are able to serialise the final pipeline(s), the actual code is loaded from the `custom_transformers.py` module that exists in the same directory as this notebook.

```python
class CategoricalFeatureEncoder(BaseEstimator, TransformerMixin):
    """CategoricalFeatureEncoder
    
    Class for automating the process of applying 
    one-hot-encoding to all categorical variables in a Numpy
    array of only categorical variables.
    """
    def __init__(self):
        # no parameters to configure
        pass

    def fit(self, X, y=None):
        return self

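    def transform(self, X, y=None):
        # a single variable may arrive as a 1-d array, so make it a column
        if X.ndim == 1:
            X = X.reshape((-1, 1))
        # encode each column (variable) separately; note that the categories
        # are derived from the data passed to transform, not learnt in fit
        encoded_vars = [self.__transform_single_cat_var__(var) for var in X.T]
        feature_names, features = list(zip(*encoded_vars))
        self.feature_names = list(np.concatenate(feature_names, axis=0))
        # stack the indicator columns for all variables side by side
        return np.concatenate(features, axis=1)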

    def get_feature_names(self):
        return self.feature_names
    
    def __transform_single_cat_var__(self, cat_feature_col):
        feature_names_, int_factors = np.unique(cat_feature_col, return_inverse=True)
        one_hot_encoder = OneHotEncoder()
        encoded_factors = one_hot_encoder.fit_transform(int_factors.reshape((-1, 1)))
        return (feature_names_, encoded_factors.toarray())
```

We test this transformer on our training data. 

In [4]:
from custom_transformers import CategoricalFeatureEncoder

# test our custom Transformer class
cat_encoder = CategoricalFeatureEncoder()
encoded_cat_data = cat_encoder.fit_transform(cat_data)

print('column names: {}'.format(cat_encoder.get_feature_names()))
encoded_cat_data

column names: ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']


array([[ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       ..., 
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])
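
The two steps performed for each variable (factor encoding, then one-hot encoding) can be reproduced with plain Numpy. This is a minimal sketch on a made-up column; it swaps `OneHotEncoder` for an identity-matrix lookup, which is an illustrative substitute rather than what the class above does internally.

```python
import numpy as np

# made-up categorical column
col = np.array(['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND'])

# step 1: factor encoding - sorted unique labels plus an integer code per row
categories, codes = np.unique(col, return_inverse=True)
print(categories)  # ['INLAND' 'ISLAND' 'NEAR BAY']
print(codes)       # [0 2 0 1]

# step 2: one-hot encoding - row i of the identity matrix is the
# indicator vector for integer code i
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```

Since the categories are taken from whatever array is passed in, a batch of data that happens to lack one of the categories would produce fewer indicator columns. This is worth keeping in mind when scoring small batches of new data.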

## Impute Missing Data
We now test a missing-data imputer, configured to replace missing values (such as those we found in the `total_bedrooms` variable) with the column median, before adding it to our final data preparation pipeline.

In [5]:
from sklearn.preprocessing import Imputer

num_data = X.drop(['ocean_proximity'], axis=1)
num_imputer = Imputer(strategy='median')
imputed_num_data = num_imputer.fit_transform(num_data)

# convert back to DataFrame so we can compare with original using info
imputed_num_df = pd.DataFrame(imputed_num_data, columns=num_data.columns)
imputed_num_df.info()
imputed_num_df.isnull().any()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 8 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16512 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
dtypes: float64(8)
memory usage: 1.0 MB


longitude             False
latitude              False
housing_median_age    False
total_rooms           False
total_bedrooms        False
population            False
households            False
median_income         False
dtype: bool
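
`Imputer` was deprecated in Scikit-Learn 0.20 and removed in 0.22; on newer versions the equivalent class is `sklearn.impute.SimpleImputer`. The sketch below assumes Scikit-Learn 0.20 or later and uses a made-up array rather than the housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# made-up data with one missing value in each column
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
    [5.0, np.nan],
])

imputer = SimpleImputer(strategy='median')
filled = imputer.fit_transform(toy)

# medians of the observed values in each column: 3.0 and 20.0
print(imputer.statistics_)
print(np.isnan(filled).any())  # False
```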

## Scaling Numeric Data
Machine learning algorithms, whether they use matrix inversion, gradient descent or some other optimisation routine to minimise a cost function, benefit when all variables are on the same scale. For example, matrix inversion becomes more numerically stable and gradient descent methods don't suffer from the creation of artificial 'valleys'.

We test a simple scaling transformer that we will be able to use in our data preparation pipeline.

In [6]:
from sklearn.preprocessing import MinMaxScaler

data_scaler = MinMaxScaler()
scaled_data = data_scaler.fit_transform(imputed_num_data)

# convert back to DataFrame so we can get a statistical description for comparison
scaled_df = pd.DataFrame(scaled_data, columns=num_data.columns)
scaled_df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | 0.474923 | 0.329694 | 0.541771 | 0.067011 | 0.086265 | 0.087201 | 0.093008 | 0.231966 |
| std | 0.199489 | 0.226864 | 0.247019 | 0.055069 | 0.066894 | 0.067101 | 0.070631 | 0.130148 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.252988 | 0.148778 | 0.333333 | 0.036802 | 0.047519 | 0.047975 | 0.051904 | 0.142095 |
| 50% | 0.581673 | 0.182784 | 0.54902 | 0.054046 | 0.069749 | 0.071227 | 0.076176 | 0.209301 |
| 75% | 0.631474 | 0.550478 | 0.705882 | 0.080269 | 0.103898 | 0.105767 | 0.112957 | 0.29232 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
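
With its default `feature_range` of (0, 1), `MinMaxScaler` maps each column using `(x - min) / (max - min)`, which is why every column above has a minimum of 0 and a maximum of 1. A minimal sketch on made-up numbers, checking the formula by hand against the scaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up data: two columns on very different scales
toy = np.array([
    [2.0, 100.0],
    [4.0, 300.0],
    [10.0, 500.0],
])

# apply the min-max formula column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(toy)
print(np.allclose(manual, scaled))  # True
```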


## Assembling the Final Pipeline
Using the various pieces of pipe (the transformations explored above), we assemble a final pipeline that can be used to prepare data before training ML algorithms or scoring (predicting) new feature instances (or observations).

Note that, in this particular instance, our final pipeline is actually a union of two separate pipelines. This is because we have to treat numerical and categorical variables separately.

In [8]:
from sklearn.pipeline import Pipeline, FeatureUnion

numeric_cols = X.select_dtypes(exclude=['object']).columns
numeric_pipeline = Pipeline([
    ('var_selector', DataFrameAdapter(numeric_cols)),
    ('imputer', Imputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

categorical_cols = X.select_dtypes(include=['object']).columns
categorical_pipeline = Pipeline([
    ('var_selector', DataFrameAdapter(categorical_cols)),
    ('categorical_encoder', CategoricalFeatureEncoder())
])

data_prep_pipeline = FeatureUnion([
    ('numeric_pipeline', numeric_pipeline),
    ('categorical_pipeline', categorical_pipeline)
])
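
A `FeatureUnion` runs each of its sub-pipelines on the same input and concatenates their outputs column-wise. The sketch below illustrates this on a made-up array; the helper functions `first_two_columns` and `last_column` are hypothetical and simply play the role that `DataFrameAdapter` plays above.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def first_two_columns(X):
    # hypothetical selector for the columns that need scaling
    return X[:, :2]


def last_column(X):
    # hypothetical selector for a column that is passed through as-is
    return X[:, 2:]


# made-up data: two unscaled columns and one 0/1 indicator column
toy = np.array([
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 1.0],
    [3.0, 40.0, 0.0],
])

left = Pipeline([
    ('select', FunctionTransformer(first_two_columns)),
    ('scale', MinMaxScaler())
])
right = Pipeline([
    ('select', FunctionTransformer(last_column))
])

union = FeatureUnion([('left', left), ('right', right)])
combined = union.fit_transform(toy)

# two scaled columns from `left` followed by one column from `right`
print(combined.shape)  # (3, 3)
```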

### Testing the Final Pipeline
To make sure that our data preparation pipeline works as we expect it to, we apply it to the initial data.

In [9]:
prepared_data = data_prep_pipeline.fit_transform(X)

print('prepared data has {} observations of {} features'.format(*prepared_data.shape))
prepared_data

prepared data has 16512 observations of 13 features


array([[ 0.20517928,  0.56004251,  0.94117647, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.08266932,  0.90010627,  0.31372549, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.14541833,  0.87353879,  0.25490196, ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.25498008,  0.47715197,  0.33333333, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.21115538,  0.77789586,  0.39215686, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.72410359,  0.02019129,  0.68627451, ...,  0.        ,
         0.        ,  1.        ]])

### Persisting the Pipeline to Disk
We serialise the pipeline using `joblib`, the object serialiser that Scikit-Learn itself relies on (it is a standalone package, not part of SciPy), because it is designed to handle objects containing large Numpy arrays more efficiently than Python's core `pickle` module. It is also easy to use.

In [10]:
# sklearn.externals.joblib was removed in Scikit-Learn 0.23,
# so joblib is imported directly
import joblib

joblib.dump(data_prep_pipeline, 'models/data_prep_pipeline.pkl') 

['models/data_prep_pipeline.pkl']

In [11]:
# check we can reload the model and use it as above
deser_data_prep_pl = joblib.load('models/data_prep_pipeline.pkl') 

# verify that it gives the same results
(deser_data_prep_pl.transform(X) == data_prep_pipeline.transform(X)).all()

True
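
The same round trip can be rehearsed without touching the `models` directory. This is a minimal sketch that assumes the standalone `joblib` package is installed; the toy pipeline, the temporary directory and the file name `toy_pipeline.pkl` are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# made-up data and a one-step pipeline fitted to it
toy = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
pipeline = Pipeline([('scaler', MinMaxScaler())]).fit(toy)

# write to a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# allclose tolerates floating-point noise, unlike an exact == comparison
same = np.allclose(restored.transform(toy), pipeline.transform(toy))
print(same)  # True
```

Serialisation stores a reference to each class by its module path rather than the class code itself, so any custom transformers must be importable wherever the file is loaded. This is the reason the custom classes live in `custom_transformers.py` rather than in the notebook.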