# Utilizing SKLearn's `ColumnTransformer`

Up to this point, we've handled steps like scaling, imputing, and encoding separately, each as its own piece. But SKLearn has a handy class designed to apply different preprocessing steps to different groups of columns, streamlining your preprocessing!

Enter: [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

## Set Up

Data: [Insurance Costs data](https://www.kaggle.com/mirichoi0218/insurance) from Kaggle (the uploader got the idea for cleaning up the original open source data from [Machine Learning with R](https://www.packtpub.com/product/machine-learning-with-r-third-edition/9781788295864))

Goal: predict insurance charges

In [1]:
# Initial imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [2]:
df = pd.read_csv('data/insurance.csv')

In [3]:
# explore the data
df.head()

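| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |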


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [5]:
df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [6]:
df.describe(include='O')

| | sex | smoker | region |
|---|---|---|---|
| count | 1338 | 1338 | 1338 |
| unique | 2 | 2 | 4 |
| top | male | no | southeast |
| freq | 676 | 1064 | 364 |


In [7]:
# set our X and y
X = df.drop(columns='charges')
y = df['charges']

In [8]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)

## Define Our Steps

Observations from our initial data exploration: 
- No null values
- 6 input columns and 1 target column (`charges`)
- `age`, `bmi`, and `children` are numeric
- `sex`, `smoker`, and `region` are objects

What preprocessing steps will we need to take?
- Scale our three numeric inputs
- Encode our three object inputs
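
Rather than typing the column names by hand, the two groups can be derived from the dtypes. The sketch below is a minimal illustration on a tiny stand-in frame (the first three rows shown by `df.head()`); the names `X_demo`, `num_cols_demo`, and `obj_cols_demo` are made up for this example, and it assumes every non-numeric input column should be treated as categorical.

```python
import pandas as pd

# Tiny stand-in for the insurance inputs (first three rows from df.head())
X_demo = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# Numeric columns are the ones to scale
num_cols_demo = X_demo.select_dtypes(include="number").columns.tolist()

# Assumption: everything that is not numeric is a categorical column to encode
obj_cols_demo = X_demo.select_dtypes(exclude="number").columns.tolist()

print(f"Numeric Columns: {num_cols_demo}")
print(f"Object Columns: {obj_cols_demo}")
```

Selecting by dtype works here because no numeric column is secretly a category code; on other datasets the resulting lists may need a manual check.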

## Old Way

We have a train-test split and want to avoid data leakage: if we preprocessed our data before splitting, we'd potentially leak some information about the test set into our training data!
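
To make the leakage concern concrete, here is a small sketch with made-up numbers (not the insurance data). It compares a `MinMaxScaler` fit on train plus test with one fit on train only; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature: the test row is larger than anything in train
train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[60.0]])

# Leaky: the scaler sees the test row, so the maximum it learns comes from test
leaky_scaler = MinMaxScaler().fit(np.vstack([train, test]))

# Safe: the scaler learns its minimum and maximum from the training rows only
safe_scaler = MinMaxScaler().fit(train)

print(leaky_scaler.data_max_)  # maximum learned with the test row included
print(safe_scaler.data_max_)   # maximum learned from train only

# With the safe scaler, unseen test values can land outside the 0-1 range
test_scaled = safe_scaler.transform(test)
print(test_scaled)
```

Test values falling outside the 0-1 range is expected behavior rather than a bug: it reflects that the scaler knew nothing about the test set when it was fit.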

Let's look at the steps we'd take to preprocess our data without data leakage and without using `ColumnTransformer`, scaling using SKLearn's `MinMaxScaler` and encoding using SKLearn's `OneHotEncoder`:

#### Tackle Our Object Columns First

In [11]:
# Define a list of our object column names
obj_cols = ['sex', 'smoker', 'region']

# Separate our object columns for both train and test
X_train_obj = X_train[obj_cols]
X_test_obj = X_test[obj_cols]

X_train_obj.head()

| | sex | smoker | region |
|---|---|---|---|
| 693 | male | no | northwest |
| 1297 | female | no | southeast |
| 634 | male | no | southwest |
| 1022 | male | yes | southeast |
| 178 | female | no | southwest |


In [12]:
# Instantiate and fit our encoder
# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
ohe = OneHotEncoder(drop='first', sparse=False)
ohe.fit(X_train_obj)

OneHotEncoder(drop='first', sparse=False)

In [13]:
# Encode our train and test sets
X_train_ohe = ohe.transform(X_train_obj)
X_test_ohe = ohe.transform(X_test_obj)

In [14]:
# Saving these as dataframes, with appropriate column names and index
# get_feature_names_out (scikit-learn 1.0+) replaces the older get_feature_names
X_train_ohe = pd.DataFrame(X_train_ohe, 
                           columns = ohe.get_feature_names_out(input_features = obj_cols),
                           index=X_train.index)
X_test_ohe = pd.DataFrame(X_test_ohe, 
                          columns = ohe.get_feature_names_out(input_features = obj_cols),
                          index=X_test.index)

X_train_ohe.head()

| | sex_male | smoker_yes | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|
| 693 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1297 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 634 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1022 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 178 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


#### Now Tackle Our Numeric Columns

In [15]:
# Define a list of our numeric column names
num_cols = ['age', 'bmi', 'children']

# Separate our numeric columns for both train and test
X_train_num = X_train[num_cols]
X_test_num = X_test[num_cols]

In [16]:
# Instantiate and fit our scaler
# Using MinMaxScaler so the numeric columns share the 0-1 range of the now-binary encoded columns
scaler = MinMaxScaler() 
scaler.fit(X_train_num)

MinMaxScaler()

In [17]:
# Scale our train and test sets
X_train_scaled = scaler.transform(X_train_num)
X_test_scaled = scaler.transform(X_test_num)

In [18]:
# Saving these as dataframes, with appropriate column names and index
X_train_scaled = pd.DataFrame(X_train_scaled, 
                              columns = num_cols,
                              index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, 
                             columns = num_cols,
                             index=X_test.index)

X_train_scaled.head()

| | age | bmi | children |
|---|---|---|---|
| 693 | 0.130435 | 0.207022 | 0.0 |
| 1297 | 0.217391 | 0.283831 | 0.4 |
| 634 | 0.717391 | 0.638687 | 0.2 |
| 1022 | 0.630435 | 0.541297 | 0.2 |
| 178 | 0.608696 | 0.34813 | 0.4 |


#### Put Them Back Together

In [19]:
# Use concat
X_train_processed = pd.concat([X_train_ohe, X_train_scaled], axis=1)
X_test_processed = pd.concat([X_test_ohe, X_test_scaled], axis=1)

X_train_processed.head()

| | sex_male | smoker_yes | region_northwest | region_southeast | region_southwest | age | bmi | children |
|---|---|---|---|---|---|---|---|---|
| 693 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.130435 | 0.207022 | 0.0 |
| 1297 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.217391 | 0.283831 | 0.4 |
| 634 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.717391 | 0.638687 | 0.2 |
| 1022 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.630435 | 0.541297 | 0.2 |
| 178 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.608696 | 0.34813 | 0.4 |


Whew! To do just these two preprocessing steps, we had to separate out our columns, process train and test the same way (fitting on train, then transforming both train and test, of course), then put them back together. And to keep the column names, we had to turn everything into dataframes before concatenating. That's a lot!

## New Way!

Let's see how `ColumnTransformer` simplifies the process...

In [20]:
# Need to import ColumnTransformer!
from sklearn.compose import ColumnTransformer

In [21]:
# We've luckily already defined our list of columns
# We'll need these for our ColumnTransformer
print(f"Object Columns: {obj_cols}")
print(f"Numeric Columns: {num_cols}")

Object Columns: ['sex', 'smoker', 'region']
Numeric Columns: ['age', 'bmi', 'children']


`ColumnTransformer` takes a list of transformers, where each item in the list is a tuple with three parts:
1. Nickname for the step (useful for getting out column names)
2. Preprocessor object
3. List of columns to transform with that preprocessor object

> NOTE! The lists of columns used in `ColumnTransformer` should generally be mutually exclusive. If you put the same column name in multiple steps in a `ColumnTransformer`, you'll get multiple copies of that column, each transformed in a different way. Also, any column not listed in a step is dropped by default (`remainder='drop'`).
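
To see what the note means in practice, here is a minimal sketch on a made-up two-column frame (not the insurance data). The names `toy`, `overlap_ct`, and `keep_ct` are illustrative. It lists `age` in two steps and leaves `bmi` out, then shows `remainder='passthrough'` as one way to keep unlisted columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up frame: 'age' is listed in two transformers, 'bmi' in none
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0], "bmi": [25.0, 30.0, 35.0]})

overlap_ct = ColumnTransformer(transformers=[
    ("minmax", MinMaxScaler(), ["age"]),
    ("standard", StandardScaler(), ["age"]),
])

# Two output columns, both derived from 'age'; 'bmi' is dropped by default
overlap_out = overlap_ct.fit_transform(toy)
print(overlap_out)

# remainder='passthrough' keeps unlisted columns, appended after the listed ones
keep_ct = ColumnTransformer(
    transformers=[("minmax", MinMaxScaler(), ["age"])],
    remainder="passthrough",
)
keep_out = keep_ct.fit_transform(toy)
print(keep_out)
```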

Let's show it in action!

In [22]:
# Create a ColumnTransformer object
ct = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(drop='first'), obj_cols),
    ('scaler', MinMaxScaler(), num_cols)
])

In [23]:
# Fit and transform
ct.fit(X_train)
X_train_ct = ct.transform(X_train)
X_test_ct = ct.transform(X_test)

In [24]:
# Showcase the result - initial output is a numpy array
X_train_ct

array([[1.        , 0.        , 1.        , ..., 0.13043478, 0.20702179,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.2173913 , 0.28383105,
        0.4       ],
       [1.        , 0.        , 0.        , ..., 0.7173913 , 0.63868711,
        0.2       ],
       ...,
       [1.        , 0.        , 0.        , ..., 0.86956522, 0.24791499,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.41304348, 0.85122411,
        0.4       ],
       [1.        , 0.        , 0.        , ..., 0.80434783, 0.37503363,
        0.        ]])

If we ever want to turn that resulting numpy array into a dataframe, we'll want to get out the column names after transformation, especially for preprocessing steps like `OneHotEncoder` that change the number of columns.

We can do that using the handy nickname we set when creating the `ColumnTransformer`! After the `ColumnTransformer` has been fit, its `named_transformers_` attribute lets us access each step individually and treat it just like a normal preprocessor object on its own.

In [25]:
# Accessing the resulting column names from 'ohe'
# get_feature_names_out (scikit-learn 1.0+) replaces the older get_feature_names
ohe_col_names = ct.named_transformers_['ohe'].get_feature_names_out(input_features = obj_cols)
ohe_col_names

array(['sex_male', 'smoker_yes', 'region_northwest', 'region_southeast',
       'region_southwest'], dtype=object)

In [26]:
# Now let's turn our X_train_ct into a dataframe to see
# Note that the output columns follow the order in which the transformers were passed
pd.DataFrame(X_train_ct,
             columns = [*ohe_col_names, *num_cols], # Using * to unpack lists
             index = X_train.index).head()

| | sex_male | smoker_yes | region_northwest | region_southeast | region_southwest | age | bmi | children |
|---|---|---|---|---|---|---|---|---|
| 693 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.130435 | 0.207022 | 0.0 |
| 1297 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.217391 | 0.283831 | 0.4 |
| 634 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.717391 | 0.638687 | 0.2 |
| 1022 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.630435 | 0.541297 | 0.2 |
| 178 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.608696 | 0.34813 | 0.4 |


What do you think? Is `ColumnTransformer` easier, or more work? It's up to you, so long as you always avoid data leakage in your preprocessing steps!
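
To connect the preprocessing back to the stated goal, here is a hedged sketch of how transformed arrays could feed the `LinearRegression` and `r2_score` imported at the top. The data are synthetic, generated only so the block runs on its own: the made-up relationship between `age`, `smoker`, and the target, and the score it produces, say nothing about the real insurance dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (NOT the insurance dataset)
rng = np.random.default_rng(42)
n = 200
X_syn = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
# Made-up target: rises with age, jumps for smokers, plus noise
y_syn = (250 * X_syn["age"]
         + 20000 * (X_syn["smoker"] == "yes")
         + rng.normal(0, 1000, size=n))

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=0.25,
                                          random_state=42)

syn_ct = ColumnTransformer(transformers=[
    ("ohe", OneHotEncoder(drop="first"), ["smoker"]),
    ("scaler", MinMaxScaler(), ["age"]),
])

# Fit the preprocessing on train only, then apply it to both sets
X_tr_ct = syn_ct.fit_transform(X_tr)
X_te_ct = syn_ct.transform(X_te)

# Fit the model on the processed training data and score on processed test data
model = LinearRegression().fit(X_tr_ct, y_tr)
r2 = r2_score(y_te, model.predict(X_te_ct))
print(f"Test R2 on synthetic data: {r2:.3f}")
```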