## Problem


Now we would like to apply different transformations to different columns:

- Numeric columns:

  - imputation and
  - scaling

- Nominal categorical columns:

  - imputation and
  - one-hot encoding

- Ordinal categorical columns:

  - imputation and
  - ordinal encoding
  
  
How can we apply these transformations to the data before fitting the regressor?

## Solution: ColumnTransformer

In [237]:
#imports for loading and splitting the data set
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


In [238]:
video_df = pd.read_table("video.csv", sep=";", na_values="99", index_col=0)
video_df.head()

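|   | time | freq | sex | age | home | math | work | own | grade |
|---|------|------|-----|-----|------|------|------|-----|-------|
| 0 | 2.0 | weekly | female | 19 | yes | no | 10.0 | yes | A |
| 1 | 0.0 | monthly | female | 18 | yes | yes | 0.0 | yes | C |
| 2 | 0.0 | monthly | male | 19 | yes | no | 0.0 | yes | B |
| 3 | 0.5 | monthly | female | 19 | yes | no | 0.0 | yes | B |
| 4 | 0.0 | semesterly | female | 19 | yes | yes | 0.0 | no | B |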


In [254]:
video_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 91 entries, 0 to 90
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    91 non-null     float64
 1   freq    78 non-null     object 
 2   sex     91 non-null     object 
 3   age     91 non-null     int64  
 4   home    91 non-null     object 
 5   math    91 non-null     object 
 6   work    88 non-null     float64
 7   own     91 non-null     object 
 8   grade   91 non-null     object 
dtypes: float64(2), int64(1), object(6)
memory usage: 7.1+ KB


In [239]:
video_df.describe()

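|       | time | age | work |
|-------|------|-----|------|
| count | 91.0 | 91.0 | 88.0 |
| mean | 1.242857 | 19.516484 | 7.352273 |
| std | 3.77704 | 1.846093 | 10.313522 |
| min | 0.0 | 18.0 | 0.0 |
| 25% | 0.0 | 19.0 | 0.0 |
| 50% | 0.0 | 19.0 | 1.0 |
| 75% | 1.25 | 20.0 | 13.25 |
| max | 30.0 | 33.0 | 55.0 |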


In [240]:
video_y = video_df[["time"]]
video_X = video_df.drop(["time"], axis=1)

#Split 90:10
video_X_train, video_X_test, video_y_train, video_y_test = train_test_split(video_X, video_y, test_size=0.1, random_state=1300)
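
With 91 rows, a 10% test fraction does not divide evenly, so the split is only approximately 90:10. The sketch below uses a synthetic array of 91 row labels (a stand-in, not the real data) to show the resulting sizes; scikit-learn rounds the test size up, which should give 10 test rows and 81 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 91 row labels of video_df (synthetic, not the real data).
toy_rows = np.arange(91)

# Same split settings as in the cell above.
toy_train, toy_test = train_test_split(toy_rows, test_size=0.1, random_state=1300)

print(len(toy_train), len(toy_test))
```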

In [241]:
from sklearn import set_config
set_config(transform_output="pandas")  #available since scikit-learn 1.2; otherwise transforms return NumPy arrays and we lose the column names

In [242]:
#different transformations on different columns
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LinearRegression

In [243]:
#numeric columns
numeric_features = ["time", "age", "work"]
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])


In [244]:
numeric_transformer
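
As a quick check of what the numeric pipeline does, the sketch below runs the same two steps on a made-up `work` column with one missing value. The names `toy_num` and `toy_numeric_transformer` are hypothetical. The missing entry is replaced by the column mean (10), and the scaler then centres the column at 0 with unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up column with one missing value.
toy_num = pd.DataFrame({"work": [0.0, 10.0, np.nan, 20.0]})

# Same steps as numeric_transformer: mean imputation, then standardisation.
toy_numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# np.asarray keeps the example independent of the pandas/NumPy output setting.
scaled = np.asarray(toy_numeric_transformer.fit_transform(toy_num)).ravel()
print(scaled)
```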

In [245]:
#nominal categorical columns
categorical_features = ["freq", "grade"]
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])  #dense output: sparse output is not supported with transform_output="pandas"

In [246]:
categorical_transformer

In [247]:
#ordinal categorical columns
ordinal_features = ["sex", "home", "math", "own"]
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder())])

In [248]:
ordinal_transformer
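
For comparison, here is the ordinal pipeline on a made-up `home` column (hypothetical names again). By default `OrdinalEncoder` sorts the categories it finds, so `no` becomes 0 and `yes` becomes 1; if a specific order matters, it can be passed through the encoder's `categories` parameter.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Made-up yes/no column with one missing value.
toy_ord = pd.DataFrame({"home": ["yes", "no", np.nan, "yes"]})

toy_ordinal_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ordinal", OrdinalEncoder()),
])

# The missing value is imputed as "yes" (most frequent) before encoding.
codes = np.asarray(toy_ordinal_transformer.fit_transform(toy_ord)).ravel()

# Categories learned by the encoder, in the order of their integer codes.
learned = toy_ordinal_transformer.named_steps["ordinal"].categories_[0]

print(codes)
print(list(learned))
```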

In [249]:
#apply different transformations on different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('ord', ordinal_transformer, ordinal_features)])


In [250]:
preprocessor

In [251]:
#apply the preprocessor and the regressor
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', LinearRegression())])

In [252]:
pipe

In [253]:
#fit the pipeline
pipe.fit(video_X_train, video_y_train)

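ValueError: A given column is not a column of the dataframe

The fit fails because `numeric_features` lists `time`, which is the target and was dropped from `video_X`. Setting `numeric_features = ["age", "work"]` in cell In [243] and re-running the cells that follow resolves this error.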

In [None]:
#predict
pipe.predict(video_X_test)

In [None]:
#score
print('R2 on test data: %.2f' % pipe.score(video_X_test, video_y_test))