## Problem


Now we would like to apply a different set of transformations to each type of column:

- Numeric columns:

  - imputation and
  - scaling

- Nominal categorical columns:

  - imputation and
  - one-hot encoding

- Ordinal categorical columns:

  - imputation and
  - ordinal encoding
  
  
How can we apply these transformations to the data before fitting the regressor?

## Solution: ColumnTransformer

In [323]:
# imports for loading and splitting the data set
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


In [324]:
# semicolon-separated file; entries equal to 99 are read as missing values
video_df = pd.read_table("video.csv", sep=";", na_values="99", index_col=0)
video_df.head()

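| | time | freq | sex | age | home | math | work | own | grade |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | weekly | female | 19 | yes | no | 10.0 | yes | A |
| 1 | 0.0 | monthly | female | 18 | yes | yes | 0.0 | yes | C |
| 2 | 0.0 | monthly | male | 19 | yes | no | 0.0 | yes | B |
| 3 | 0.5 | monthly | female | 19 | yes | no | 0.0 | yes | B |
| 4 | 0.0 | semesterly | female | 19 | yes | yes | 0.0 | no | B |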


In [325]:
video_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 91 entries, 0 to 90
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    91 non-null     float64
 1   freq    78 non-null     object 
 2   sex     91 non-null     object 
 3   age     91 non-null     int64  
 4   home    91 non-null     object 
 5   math    91 non-null     object 
 6   work    88 non-null     float64
 7   own     91 non-null     object 
 8   grade   91 non-null     object 
dtypes: float64(2), int64(1), object(6)
memory usage: 7.1+ KB


In [326]:
video_df["freq"].value_counts()

freq
weekly        28
semesterly    23
monthly       18
daily          9
Name: count, dtype: int64

In [327]:
video_df.describe()

| | time | age | work |
|---|---|---|---|
| count | 91.0 | 91.0 | 88.0 |
| mean | 1.242857 | 19.516484 | 7.352273 |
| std | 3.77704 | 1.846093 | 10.313522 |
| min | 0.0 | 18.0 | 0.0 |
| 25% | 0.0 | 19.0 | 0.0 |
| 50% | 0.0 | 19.0 | 1.0 |
| 75% | 1.25 | 20.0 | 13.25 |
| max | 30.0 | 33.0 | 55.0 |


In [328]:
video_y = video_df[["time"]]
video_X = video_df.drop(["time"], axis=1)

# 90:10 train/test split; random_state makes the split reproducible
video_X_train, video_X_test, video_y_train, video_y_test = train_test_split(video_X, video_y, test_size=0.1, random_state=1300)

In [329]:
from sklearn import set_config
# available since scikit-learn 1.2; otherwise transformers return NumPy arrays and the column names are lost
set_config(transform_output="pandas")

In [330]:
#different transformations on different columns
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LinearRegression

In [331]:
# numeric columns: fill missing values with the column mean, then standardize
numeric_features = ["age", "work"]
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])


In [332]:
numeric_transformer
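
As a rough illustration of what the two steps do, the sketch below runs the same kind of pipeline on a single toy column with one missing value. The data and the names `X_toy` and `toy_numeric` are made up for this example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (invented data)
X_toy = np.array([[0.0], [10.0], [np.nan], [20.0]])

toy_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
X_out = np.asarray(toy_numeric.fit_transform(X_toy)).ravel()

# The learned fill value is the mean of the observed entries: (0 + 10 + 20) / 3
fill_value = toy_numeric.named_steps["imputer"].statistics_[0]
print(fill_value)
print(X_out)
```

Because the missing entry is filled with the mean, it ends up at zero after standardization, which is the center of the scaled column.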

In [333]:
# nominal categorical columns: fill missing values with the most frequent category, then one-hot encode
categorical_features = ["sex", "home", "math", "own"]
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'))])

In [334]:
categorical_transformer

In [335]:
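# ordinal categorical columns: fill missing values with the most frequent category, then encode as integers
# note: without an explicit `categories` argument, OrdinalEncoder orders the categories alphabetically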
ordinal_features = ["freq", "grade"]
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder())])

In [336]:
ordinal_transformer
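
With default settings, `OrdinalEncoder` sorts the categories alphabetically, so `freq` would be coded daily < monthly < semesterly < weekly rather than by how often someone plays. The sketch below shows one way to pass an explicit order through the `categories` parameter. The ordering in `freq_order` is an assumption about the intended meaning, and the toy frame is invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"freq": ["daily", "weekly", "monthly", "semesterly"]})

# Default: categories are sorted alphabetically
default_codes = np.asarray(OrdinalEncoder().fit_transform(toy)).ravel()

# Explicit order from least to most frequent play (an assumed ordering)
freq_order = ["semesterly", "monthly", "weekly", "daily"]
ordered_encoder = OrdinalEncoder(categories=[freq_order])
ordered_codes = np.asarray(ordered_encoder.fit_transform(toy)).ravel()

print(default_codes)   # alphabetical codes
print(ordered_codes)   # codes that follow freq_order
```

For `grade`, alphabetical order happens to coincide with the order of the letter grades, so the default is less of a concern there.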

In [337]:
# combine the three transformers; columns not listed in any of them are dropped by default (remainder='drop')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('ord', ordinal_transformer, ordinal_features)])


In [338]:
preprocessor

In [344]:
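# fit the preprocessor on the training data only, then apply the fitted transformations to the test data
prepared_data = Pipeline(steps=[('preprocessor', preprocessor)])
train = prepared_data.fit_transform(video_X_train, video_y_train)
test = prepared_data.transform(video_X_test)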
train

| | num__age | num__work | cat__sex_male | cat__home_yes | cat__math_yes | cat__own_yes | ord__freq | ord__grade |
|---|---|---|---|---|---|---|---|---|
| 49 | -0.282214 | -0.740219 | 1.0 | 1.0 | 0.0 | 1.0 | 3.0 | 0.0 |
| 77 | -0.282214 | 0.794321 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 1.0 |
| 36 | -0.282214 | -0.740219 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 84 | -0.801743 | 0.698412 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 |
| 79 | -0.282214 | 0.218869 | 1.0 | 1.0 | 0.0 | 1.0 | 3.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 51 | 0.237316 | -0.740219 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 |
| 66 | -0.801743 | -0.740219 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 |
| 89 | -0.282214 | -0.260675 | 1.0 | 1.0 | 0.0 | 1.0 | 3.0 | 1.0 |
| 1 | -0.801743 | -0.740219 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |


In [340]:
# chain the preprocessor and the regressor in a single pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', LinearRegression())])

In [341]:
#fit the pipeline
pipe.fit(video_X_train, video_y_train)

In [342]:
#predict
pipe.predict(video_X_test)

array([[ 1.35583723],
       [ 0.70757786],
       [-0.54640513],
       [ 3.40887956],
       [ 1.57449   ],
       [ 1.10371634],
       [ 2.49622747],
       [ 0.78041387],
       [ 0.12959952],
       [ 0.17321906]])

In [343]:
# score: R2 on the held-out test data
print('R2 on test data: %.2f' % pipe.score(video_X_test, video_y_test))

R2 on test data: -0.37
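
A negative R2 means the model predicts the test targets worse than a constant prediction equal to the mean of the test targets. With only 10 test rows, the estimate is also quite noisy. The sketch below works through the definition R2 = 1 - SS_res / SS_tot on invented numbers and compares it with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and deliberately poor predictions
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([3.0, 2.0, 1.0, 0.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predict the mean of y_true, which gives R2 = 0
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred), r2_baseline)
```

Here the residual sum of squares (20) is four times the total sum of squares (5), so R2 is 1 - 4 = -3.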
