# 6. The ColumnTransformer

* Apply different transformations to different columns of data
* Use a pandas dataframe as input data
* Create a pipeline connecting all of our steps

## The previous workflow
Before the `ColumnTransformer` existed, there was no direct way to apply different transformations to different columns of data. 


### Common workarounds

* pandas `get_dummies` function 
* scikit-learn's `MultiLabelBinarizer`
* The third-party [sklearn_pandas][1] library
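The weakness of the `get_dummies` workaround can be sketched on toy data. The two small DataFrames below are hypothetical stand-ins for a training set and a test set, and are not taken from the housing file.

```python
import pandas as pd

# Hypothetical toy data: the test set has a category unseen in training
train = pd.DataFrame({'Neighborhood': ['NAmes', 'OldTown', 'NAmes']})
test = pd.DataFrame({'Neighborhood': ['NAmes', 'Edwards']})

# get_dummies encodes whatever categories it finds in each DataFrame
train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# The two encodings end up with different columns, so a model fit on
# one cannot be applied directly to the other
print(train_encoded.columns.tolist())
print(test_encoded.columns.tolist())
```

Because `get_dummies` does not remember the categories it saw, the encoding has to be realigned by hand for every new dataset.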

The `ColumnTransformer`, paired with the upgraded `OneHotEncoder` (which can encode string columns directly), makes the transition from pandas much easier and more robust, and gives us a single obvious path forward.

[1]: https://github.com/scikit-learn-contrib/sklearn-pandas

## Processing string and numeric columns separately

In [None]:
import pandas as pd
hs = pd.read_csv('data/housing_sample.csv')
hs.head()

In [None]:
hs.isna().sum()

In [None]:
y = hs.pop('SalePrice').values

## Using the `ColumnTransformer`
* Pass a list of three-item tuples: (name, transformer, list of columns)

### Columns dropped - numpy array returned

### Keep the remaining columns

### Imputing the numeric columns with the mean
* Transform numeric and string columns independently

## Build separate pipelines

## Adding machine learning
* One final pipeline

## Grid searching a pipeline of transformers

## Summary of Commands

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from mymetrics import root_mean_squared_log_error

# string pipeline
string_si = SimpleImputer(strategy='constant', fill_value='MISSING')
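# sparse_output replaced the sparse parameter in scikit-learn 1.2;
# handle_unknown='ignore' avoids errors on categories unseen in a training fold
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')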
steps = [('impute', string_si), ('encode', ohe)]
string_pipe = Pipeline(steps)

# numeric pipeline
numeric_si = SimpleImputer(strategy='mean')
ss = StandardScaler()
steps = [('si', numeric_si), ('standardize', ss)]
numeric_pipe = Pipeline(steps)

# columns
string_cols = ['Neighborhood', 'Exterior1st']
numeric_cols = ['YearBuilt', 'LotFrontage', 'GrLivArea', 'GarageArea']

transformers = [('string', string_pipe, string_cols), 
                ('numeric', numeric_pipe, numeric_cols)]

ct = ColumnTransformer(transformers)
rfr = RandomForestRegressor()
steps = [('transformers', ct), ('rfr', rfr)]
final_pipe = Pipeline(steps)

kf = KFold(n_splits=5, shuffle=True)
grid = {'transformers__numeric__si__strategy': ['mean', 'median'],
       'rfr__n_estimators': [50, 100], 'rfr__max_depth': range(2, 6)}
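# A callable passed as scoring needs the signature (estimator, X, y),
# and GridSearchCV treats higher values as better
gs = GridSearchCV(final_pipe, grid, cv=kf, scoring=root_mean_squared_log_error)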
gs.fit(hs, y)
gs.best_params_

## Exercise
Use the `ColumnTransformer` to build separate pipelines for string and numeric columns. Build a final pipeline that adds machine learning as the last step.