
# Column Transformer with Mixed Types


This example illustrates how to apply different preprocessing and feature
extraction pipelines to different subsets of features, using
`ColumnTransformer`. This is particularly handy for the
case of datasets that contain heterogeneous data types, since we may want to
scale the numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after mean-imputation,
while the categorical data is one-hot encoded after imputing missing values
with the most frequent category.

In addition, we show two different ways to dispatch the columns to the
particular pre-processor: by column names and by column data types.
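As a rough illustration of the two dispatch styles, the sketch below builds the same preprocessing twice on a small made-up frame: once with explicit column names and once with `make_column_selector`, which picks columns by dtype. The `toy` frame and the variable names are assumptions for illustration only; they are not part of the Titanic notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for a few Titanic-like features.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.30, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "Q"],
})

# Numeric columns: fill gaps with the column mean, then standardize.
numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Dispatch by column name.
by_name = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Dispatch by column dtype: numeric columns versus everything else.
by_dtype = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, make_column_selector(dtype_include=np.number)),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude=np.number)),
    ],
    sparse_threshold=0,
)

out_by_name = by_name.fit_transform(toy)
out_by_dtype = by_dtype.fit_transform(toy)
print(out_by_name.shape, out_by_dtype.shape)
```

Both transformers produce two scaled numeric columns followed by the one-hot columns, so the two outputs should match on this toy data.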

Finally, the preprocessing pipeline is integrated in a full prediction pipeline
using `Pipeline`, together with a simple classification
model.


In [2]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV


#### Step 1: Read the data set `Titanic.csv`
Print the column names, then drop the columns `boat`, `body`, `cabin`, `home.dest`, `name`, `ticket`, `sibsp`, and `parch`.

In [3]:
df=pd.read_csv('Titanic.csv')
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [6]:
mod_df=df.drop(['boat','body','cabin','home.dest','name','ticket','sibsp','parch'],axis=1)

In [7]:
mod_df.head()

Unnamed: 0,pclass,survived,sex,age,fare,embarked
0,1,1,female,29.0,211.3375,S
1,1,1,male,0.9167,151.55,S
2,1,0,female,2.0,151.55,S
3,1,0,male,30.0,151.55,S
4,1,0,female,25.0,151.55,S


#### Step 2: Calculate duplicates
- Report the size of the dataset
- Report whether there are any duplicates using the `duplicated()` method of the pandas DataFrame
- List all duplicate rows
- Drop all duplicate rows using the `drop_duplicates()` method of the pandas DataFrame
- Report the new shape of the dataset
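The steps above can be sketched on a tiny made-up frame before running them on the real data. The `toy` frame and its values are assumptions for illustration, not rows from `Titanic.csv`.

```python
import pandas as pd

# Hypothetical frame with two repeated rows (index 1 repeats 0, index 4 repeats 3).
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "sex": ["male", "male", "female", "male", "male"],
    "fare": ["26.55", "26.55", "7.25", "?", "?"],
})

print(toy.shape)                 # size before dropping duplicates

dup_mask = toy.duplicated()      # True for the second and later copies of a row
n_dups = int(dup_mask.sum())     # number of duplicate rows
dup_rows = toy[dup_mask]         # the duplicate rows themselves

deduped = toy.drop_duplicates()  # keeps the first copy of each row by default
print(n_dups, deduped.shape)     # size after dropping duplicates
```

Note that `duplicated()` marks only the repeats, not the first occurrence; passing `keep=False` would mark every member of a duplicated group.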

In [10]:
df.shape

(1309, 14)

In [9]:
mod_df[mod_df.duplicated()]

Unnamed: 0,pclass,survived,sex,age,fare,embarked
118,1,0,male,?,26.55,S
125,1,0,male,?,0,S
179,1,0,male,?,26.55,S
185,1,0,male,42,26.55,S
223,1,0,male,?,0,S
...,...,...,...,...,...,...
1296,3,0,male,27,8.6625,S
1297,3,0,male,?,7.25,S
1302,3,0,male,?,7.225,C
1303,3,0,male,?,14.4583,C


In [17]:
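# drop_duplicates() keeps the first copy of each duplicated row by default (keep='first')
drop_dpc_df=mod_df.drop_duplicates()

# To keep the last copy instead, pass keep='last'; subset restricts the comparison to the listed columns
#mod_df.drop_duplicates(subset=['pclass'],keep='last')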

In [18]:
drop_dpc_df

Unnamed: 0,pclass,survived,sex,age,fare,embarked
0,1,1,female,29,211.3375,S
1,1,1,male,0.9167,151.55,S
2,1,0,female,2,151.55,S
3,1,0,male,30,151.55,S
4,1,0,female,25,151.55,S
...,...,...,...,...,...,...
1301,3,0,male,45.5,7.225,C
1304,3,0,female,14.5,14.4542,C
1306,3,0,male,26.5,7.225,C
1307,3,0,male,27,7.225,C


#### Step 3: Explore the dataset for missing values
- Assign `survived` as the target variable `y` and the remaining columns as `X`
- Print info about `X` to see if there are any null values and the type of the features
- Check whether all columns contain only alphanumeric characters. To do that, use the `isalnum()` string method. Before using it, convert the features to strings with `astype('str')`
- Replace all `?` characters with `np.nan`
- Now, turn `age` and `fare` into numeric columns again using `apply(pd.to_numeric)`
- Print the first five samples of `X`
- Summarize the number of unique values in each column using `nunique()`
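A minimal sketch of these steps on a small made-up frame is shown below, assuming that `?` is the only placeholder for missing values. The `toy` frame and the `cleaned` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which missing values are written as '?'.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "age": ["29", "?", "0.9167", "25"],
    "fare": ["211.3375", "151.55", "?", "30"],
    "embarked": ["S", "?", "C", "S"],
})

# isalnum() is False for '?', but also for any value containing a decimal point.
is_alnum = toy["fare"].astype("str").str.isalnum()

# Replace the placeholder with a real missing value, then restore numeric dtypes.
cleaned = toy.replace("?", np.nan)
cleaned[["age", "fare"]] = cleaned[["age", "fare"]].apply(pd.to_numeric)

print(cleaned.head())
print(cleaned.isnull().sum())  # missing values per column
print(cleaned.nunique())       # unique non-missing values per column
```

Because `isalnum()` also rejects decimal points, it flags most fares; it is a coarse screen rather than a direct test for `?`.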

In [22]:
y=drop_dpc_df['survived']
X=drop_dpc_df.drop('survived',axis=1)

In [23]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1094 entries, 0 to 1308
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   pclass    1094 non-null   int64 
 1   sex       1094 non-null   object
 2   age       1094 non-null   object
 3   fare      1094 non-null   object
 4   embarked  1094 non-null   object
dtypes: int64(1), object(4)
memory usage: 51.3+ KB


In [25]:
drop_dpc_df.describe()

Unnamed: 0,pclass,survived
count,1094.0,1094.0
mean,2.211152,0.423218
std,0.860979,0.494295
min,1.0,0.0
25%,1.0,0.0
50%,2.0,0.0
75%,3.0,1.0
max,3.0,1.0


In [26]:
X.dtypes

pclass       int64
sex         object
age         object
fare        object
embarked    object
dtype: object

In [29]:
# Changing datatypes of multiple columns at the same time
a={'pclass':'str'}
X=X.astype(a)

#cols=X.columns
# X[cols] = X[cols].astype('str')

In [40]:
X.isnull().sum(axis=0)

pclass      0
sex         0
age         0
fare        0
embarked    0
dtype: int64

In [42]:
X.columns

Index(['pclass', 'sex', 'age', 'fare', 'embarked'], dtype='object')

In [47]:
X[X.columns[3]]

0       211.3375
1         151.55
2         151.55
3         151.55
4         151.55
          ...   
1301       7.225
1304     14.4542
1306       7.225
1307       7.225
1308       7.875
Name: fare, Length: 1094, dtype: object

In [51]:
X.iloc[:,3]

0       211.3375
1         151.55
2         151.55
3         151.55
4         151.55
          ...   
1301       7.225
1304     14.4542
1306       7.225
1307       7.225
1308       7.875
Name: fare, Length: 1094, dtype: object

In [62]:
# Rows whose fare contains a non-alphanumeric character (for example a decimal point or '?')
X[~X.iloc[:,3].str.isalnum()]

Unnamed: 0,pclass,sex,age,fare,embarked
0,1,female,29,211.3375,S
1,1,male,0.9167,151.55,S
2,1,female,2,151.55,S
3,1,male,30,151.55,S
4,1,female,25,151.55,S
...,...,...,...,...,...
1301,3,male,45.5,7.225,C
1304,3,female,14.5,14.4542,C
1306,3,male,26.5,7.225,C
1307,3,male,27,7.225,C


In [60]:
X[X.iloc[:,3].str.isalnum()]

Unnamed: 0,pclass,sex,age,fare,embarked
7,1,male,39,0,S
14,1,male,80,30,S
22,1,male,26,30,C
25,1,male,25,26,C
31,1,male,40,31,C
...,...,...,...,...,...
1274,3,male,31,18,S
1275,3,male,16,18,S
1276,3,female,31,18,S
1281,3,male,22,9,S


In [68]:
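chlist=['0','1','2','3','4','5','6','7','8','9','.']
# For each character of a fare string, record whether it is a digit or a decimal point
checkFloat=lambda x: [i in chlist for i in x]
# Apply the check to the fares that are not purely alphanumeric
X[~(X.iloc[:,3].str.isalnum())]['fare'].apply(checkFloat)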

In [72]:
chlist=['0','1','2','3','4','5','6','7','8','9','.']
s1='23.2?'
mylist= [i in chlist for i in s1]
# The product of the flags is 1 only when every character is allowed; the '?' makes it 0
np.prod(np.array(mylist))

0
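The character-by-character check can also be approximated with `pd.to_numeric(..., errors='coerce')`, which turns anything that does not parse as a number into `NaN`. This is an alternative sketch, not the approach used in the cells here, and the `fare` series below is made up.

```python
import pandas as pd

# Hypothetical fare strings, two of which are not valid numbers.
fare = pd.Series(["211.3375", "0", "?", "23.2?", "7.225"])

# Values that cannot be parsed become NaN instead of raising an error.
as_number = pd.to_numeric(fare, errors="coerce")

# Keep the original strings that failed to parse.
not_numeric = fare[as_number.isna()]
print(not_numeric.tolist())  # ['?', '23.2?']
```

One caveat: values that were already missing also come back as `NaN`, so this filter does not distinguish them from unparseable strings.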

In [80]:
X=X.replace('?',np.nan)

In [81]:
X.dtypes

pclass      object
sex         object
age         object
fare        object
embarked    object
dtype: object

In [82]:
X.isnull().sum(axis=0)

pclass        0
sex           0
age         128
fare          1
embarked      2
dtype: int64

In [83]:
X['embarked'].unique()

array(['S', 'C', nan, 'Q'], dtype=object)

In [88]:
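# astype('str') turns the missing values into the literal string 'nan',
# which is why the categorical imputer in Step 4 uses missing_values='nan'
X['embarked']=X['embarked'].astype('str')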

In [89]:
X['embarked'].unique()

array(['S', 'C', 'nan', 'Q'], dtype=object)
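Converting to strings is one way to give the imputer a marker to look for. As an alternative sketch (an assumption, not what this notebook does), `SimpleImputer` can also work directly on an object column that still holds `np.nan`, either filling with the most frequent value or with a constant such as `'missing'`. The `embarked` frame below is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical object column that still contains a real missing value.
embarked = pd.DataFrame(
    {"embarked": pd.Series(["S", "C", np.nan, "S", "Q"], dtype=object)}
)

# The default missing_values is np.nan, so no string conversion is needed.
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(embarked)

# Alternatively, treat missing values as their own category.
as_category = SimpleImputer(
    strategy="constant", fill_value="missing"
).fit_transform(embarked)

print(most_frequent.ravel().tolist())  # ['S', 'C', 'S', 'S', 'Q']
print(as_category.ravel().tolist())    # ['S', 'C', 'missing', 'S', 'Q']
```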

#### Step 4: Use `ColumnTransformer` by selecting column by names

We will train our classifier with the following features:

Numeric features:

* `age`: float;
* `fare`: float.

Categorical features:

* `embarked`: categories encoded as strings `{'C', 'S', 'Q'}`;
* `sex`: categories encoded as strings `{'female', 'male'}`;
* `pclass`: ordinal integers `{1, 2, 3}`, converted to strings in Step 3.

Create the preprocessing pipelines for both numeric and categorical data:

- For numeric features, create a pipeline with an imputer followed by a standard scaler. The cells below start from the mean strategy and compare it with the median in the grid search.
- For categorical features, create a pipeline with a most-frequent imputer followed by a one-hot encoder.

Use `ColumnTransformer` to apply each pipeline to its own columns.

Chain the preprocessor with a `KNeighborsClassifier` in a single `Pipeline`.


In [90]:
numeric_features=['age','fare']
numeric_transformer=Pipeline(steps=[('n_imputer',SimpleImputer(strategy='mean')),
                                   ('scaler',StandardScaler())])

In [91]:
numeric_transformer

Pipeline(steps=[('n_imputer', SimpleImputer()), ('scaler', StandardScaler())])

In [111]:
categorical_features=['embarked','sex','pclass']

In [112]:
categorical_transformer=Pipeline(steps=[('c_imputer',SimpleImputer(strategy='most_frequent',missing_values='nan')),
                                      ('ohe',OneHotEncoder(handle_unknown='ignore'))])

In [113]:
categorical_transformer

Pipeline(steps=[('c_imputer',
                 SimpleImputer(missing_values='nan', strategy='most_frequent')),
                ('ohe', OneHotEncoder(handle_unknown='ignore'))])

In [114]:
Preprocessor=ColumnTransformer(transformers=[('num',numeric_transformer,numeric_features),
                 ('cat',categorical_transformer,categorical_features)])

In [115]:
clf=Pipeline(steps=[('prep',Preprocessor),
               ('knn',KNeighborsClassifier())])

In [116]:
clf

Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('n_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fare']),
                                                 ('cat',
                                                  Pipeline(steps=[('c_imputer',
                                                                   SimpleImputer(missing_values='nan',
                                                                                 strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='

In [117]:
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,stratify=y)

In [118]:
clf.fit(X_train,y_train)

Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('n_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fare']),
                                                 ('cat',
                                                  Pipeline(steps=[('c_imputer',
                                                                   SimpleImputer(missing_values='nan',
                                                                                 strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='

In [119]:
clf.score(X_test,y_test)

0.7773722627737226

#### Using the prediction pipeline in a grid search

Grid search can also be performed on the different preprocessing steps
defined in the `ColumnTransformer` object, together with the classifier's
hyperparameters, as part of the `Pipeline`.
We will search over both the imputer strategy of the numeric preprocessing
and the number of neighbors (`n_neighbors`) of the `KNeighborsClassifier`, using
`GridSearchCV`.
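Grid-search keys for nested estimators are built by joining step names with double underscores. The sketch below rebuilds a pipeline with the same step names used in this notebook (`prep`, `num`, `n_imputer`, `knn`) and lists the matching parameter names via `get_params()`; the variable names `model`, `prep`, `numeric`, and `categorical` are assumptions for illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = Pipeline(steps=[
    ("n_imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
])
categorical = Pipeline(steps=[
    ("c_imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer(transformers=[
    ("num", numeric, ["age", "fare"]),
    ("cat", categorical, ["embarked", "sex", "pclass"]),
])
model = Pipeline(steps=[("prep", prep), ("knn", KNeighborsClassifier())])

# Every tunable parameter, addressed as <step>__<sub-step>__<parameter>.
tunable = sorted(model.get_params().keys())

# Keep only the names relevant to the search described above.
wanted = [name for name in tunable
          if name.endswith(("__strategy", "__n_neighbors"))]
print(wanted)
```

Listing the keys this way before writing `param_grid` is a quick guard against misspelled parameter paths, which `GridSearchCV` rejects at fit time.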



In [126]:
parameters={'knn__n_neighbors':np.arange(1,30,2),
           'prep__num__n_imputer__strategy':['mean','median']}
mygrid=GridSearchCV(clf,param_grid=parameters,cv=5)
mygrid.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('prep',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('n_imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         ['age',
                                                                          'fare']),
                                                                        ('cat',
                                                                         Pipeline(steps=[('c_imputer',
                                                                                   

In [127]:
clf.steps

[('prep',
  ColumnTransformer(transformers=[('num',
                                   Pipeline(steps=[('n_imputer', SimpleImputer()),
                                                   ('scaler', StandardScaler())]),
                                   ['age', 'fare']),
                                  ('cat',
                                   Pipeline(steps=[('c_imputer',
                                                    SimpleImputer(missing_values='nan',
                                                                  strategy='most_frequent')),
                                                   ('ohe',
                                                    OneHotEncoder(handle_unknown='ignore'))]),
                                   ['embarked', 'sex', 'pclass'])])),
 ('knn', KNeighborsClassifier())]

In [124]:
Preprocessor.transformers

[('num',
  Pipeline(steps=[('n_imputer', SimpleImputer()), ('scaler', StandardScaler())]),
  ['age', 'fare']),
 ('cat',
  Pipeline(steps=[('c_imputer',
                   SimpleImputer(missing_values='nan', strategy='most_frequent')),
                  ('ohe', OneHotEncoder(handle_unknown='ignore'))]),
  ['embarked', 'sex', 'pclass'])]

In [129]:
mygrid.best_params_

{'knn__n_neighbors': 29, 'prep__num__n_imputer__strategy': 'mean'}

In [130]:
mygrid.score(X_train,y_train)

0.7670731707317073

In [131]:
mygrid.score(X_test,y_test)

0.7664233576642335
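When judging a tuned model, it can help to look at the mean cross-validated score of the best configuration (`best_score_`) next to the training and held-out scores. The sketch below does this on synthetic data from `make_classification`, so its numbers are unrelated to the Titanic results above; all names in it are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("knn", KNeighborsClassifier())])
grid = GridSearchCV(
    pipe, param_grid={"knn__n_neighbors": np.arange(1, 30, 2)}, cv=5
)
grid.fit(X_tr, y_tr)

print(grid.best_params_)        # chosen hyperparameters
print(grid.best_score_)         # mean cross-validated accuracy of the best setting
print(grid.score(X_tr, y_tr))   # accuracy on the training split (refit model)
print(grid.score(X_te, y_te))   # accuracy on the held-out split
```

The full per-candidate results are available in `grid.cv_results_`, which can be wrapped in a `pandas.DataFrame` for inspection.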