## Column Transformer 

ColumnTransformer allows you to combine several feature extraction or transformation methods into a single transformer.

When working on a machine learning problem, you often have a dataset containing a mixture of categorical and numerical columns. Rather than handling each type separately, and perhaps writing a function to apply the same steps to new data, you can combine the steps into a single transformer that is easy to reapply and extend.

We will use a loan dataset to predict whether or not a loan application will be successful, based on a number of customer features. The dataset contains both categorical and numerical variables, and is a nice, simple dataset for practising with the ColumnTransformer.

### Data Preparation 

In [83]:
#import packages
import numpy as np 
import pandas as pd 
import sklearn 
from sklearn.model_selection import train_test_split 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report 
import warnings
warnings.filterwarnings("ignore")

In [84]:
# load the train and test datasets
train = pd.read_csv("data/loan/train.txt", sep=",")
test = pd.read_csv('data/loan/test.txt', sep=",")


In [85]:
#show shape of the dataset 
train.shape 

(614, 13)

In [86]:
print(train.shape, test.shape)
print(train.dtypes)

(614, 13) (367, 12)
Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object


The columns have a mix of data types: object (string), integer and float. The test set has one column fewer because it does not include the `Loan_Status` target.

In [87]:
# show sample of the data 
train.head() 

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [88]:
# drop the Loan_ID column, which is an identifier rather than a feature
train = train.drop(['Loan_ID'], axis=1)
test =  test.drop(['Loan_ID'], axis=1)


In [89]:
# fill any null values with the most commonly occurring value in each column
train = train.apply(lambda x:x.fillna(x.value_counts().index[0]))
test = test.apply(lambda x:x.fillna(x.value_counts().index[0]))
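
The fill step above can be easier to follow on a tiny example. The sketch below uses a made-up two-column frame (the values are hypothetical, not taken from the loan data) to show what `value_counts().index[0]` picks for each column.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and one gap per column.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", np.nan],
    "LoanAmount": [120.0, np.nan, 120.0, 66.0],
})

# value_counts() sorts by frequency (descending) and skips NaN,
# so index[0] is the most frequent non-null value in the column.
filled = df.apply(lambda col: col.fillna(col.value_counts().index[0]))

print(filled)
```

Note that the same rule is applied to numerical columns too, so a missing `LoanAmount` receives the most frequent amount rather than a mean or median.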

In [90]:
# split features and target column
# X holds the feature column names and y the target column name (not the data itself)
feature_set = train.drop(['Loan_Status'], axis=1)
X = feature_set.columns
y = 'Loan_Status'

In [91]:
# hold out 10% of the labelled data as a test set
X_train, X_test, y_train, y_test = train_test_split(train[X], train[y], test_size=0.1,  random_state=0)
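
As a quick sanity check on the split sizes, the sketch below runs the same `train_test_split` call on a stand-in array of 614 row indices (the array is a placeholder for the real frame). It suggests where the 552 training rows and 62 held-out rows reported later come from.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 614 labelled rows: only the row count matters here.
rows = np.arange(614)

train_rows, held_out_rows = train_test_split(rows, test_size=0.1, random_state=0)

# The held-out size is rounded up (ceil(61.4) = 62); the rest go to training.
print(len(train_rows), len(held_out_rows))
```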

### ColumnTransformer

Apply transformations to the columns to prepare them for the classification model. I will transform the categorical columns using the sklearn OneHotEncoder. I will also rescale the numerical columns using the Normalizer, which scales each row (sample) to unit norm rather than scaling each column independently.

The ColumnTransformer takes a list of tuples specifying the transformers, and the corresponding columns on which the transformation needs to be applied. 
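
To make the list-of-tuples structure concrete, here is a minimal sketch on a toy frame. The frame and its values are hypothetical, and columns are selected by name instead of by position, which `ColumnTransformer` also accepts when the input is a DataFrame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

# Hypothetical three-row frame: one categorical and two numerical columns.
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "ApplicantIncome": [3000.0, 1000.0, 500.0],
    "LoanAmount": [1000.0, 1000.0, 1500.0],
})

# Each tuple is (name, transformer, columns).
toy_trans = ColumnTransformer([
    ("dummy_col", OneHotEncoder(categories=[["Yes", "No"]]), ["Married"]),
    ("norm", Normalizer(norm="l1"), ["ApplicantIncome", "LoanAmount"]),
])

out = toy_trans.fit_transform(toy)

# Two one-hot columns followed by two normalized columns.
print(out.shape)
```

The outputs of the individual transformers are concatenated side by side in the order the tuples are listed, and columns not mentioned in any tuple are dropped by default.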

In [92]:
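# Columns are given as integers, which are interpreted as column positions
# in X_train (after Loan_ID was dropped).
colTrans = ColumnTransformer([
    # one-hot encode the categorical columns, listing each column's categories explicitly
    ("dummy_col",
     OneHotEncoder(categories=[['Male', 'Female'],              # 0: Gender
                               ['Yes', 'No'],                   # 1: Married
                               ['0', '1', '2', '3+'],           # 2: Dependents
                               ['Graduate', 'Not Graduate'],    # 3: Education
                               ['No', 'Yes'],                   # 4: Self_Employed
                               ['Semiurban', 'Urban', 'Rural']  # 10: Property_Area
                               ]),
     [0, 1, 2, 3, 4, 10]),
    # L1-normalize each row across the five numerical columns
    ("norm", Normalizer(norm='l1'), [5, 6, 7, 8, 9]),
])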

In [93]:
X_train_trans = colTrans.fit_transform(X_train)

# show the shape of the transformed X_train
X_train_trans.shape 

(552, 20)

In [94]:
# show the first row of X_train_trans
X_train_trans[:1]

array([[1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.29704541e-01,
        4.75907198e-01, 2.28038866e-02, 7.13860797e-02, 1.98294666e-04]])
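
The last five values in the row above sum to roughly 1. That is consistent with `Normalizer(norm='l1')` rescaling each row so its absolute values sum to 1, rather than rescaling each column. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical rows of (income, loan amount).
rows = np.array([[3000.0, 1000.0],
                 [50.0, 150.0]])

scaled = Normalizer(norm="l1").fit_transform(rows)

print(scaled)              # each row divided by the sum of its absolute values
print(scaled.sum(axis=1))  # row sums
```

Because the scaling is per row, a large value in one column (such as income) shrinks the other values in the same row. If per-column scaling is what you want, a feature-wise scaler such as `StandardScaler` may be a better fit; that would be a different design from the one used in this notebook.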

In [95]:
# To transform the X_test data, reuse the already-fitted column transformer.
# Call transform (not fit_transform) so nothing is re-fitted on the held-out data.
X_test_trans = colTrans.transform(X_test)
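
The difference between `transform` and `fit_transform` on new data is easiest to see with an encoder that infers its categories during `fit`. The sketch below uses hypothetical one-column frames; it is an illustration of the general behaviour, not a reproduction of the notebook's transformer, which lists its categories explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column with three categories, and new data with only one.
train_col = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})
new_col = pd.DataFrame({"Property_Area": ["Urban", "Urban"]})

enc = OneHotEncoder()  # categories are learned from the data passed to fit
enc.fit(train_col)

# Reusing the fitted encoder keeps the three-column layout learned from training data.
reused = enc.transform(new_col).toarray()

# Re-fitting on the new data learns a single category, so the layout changes.
refit = OneHotEncoder().fit_transform(new_col).toarray()

print(reused.shape, refit.shape)
```

A model trained on the three-column layout would not accept the one-column result, which is why fitted preprocessing is normally reused on new data.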


### Train the Model

In [96]:
# use a RandomForestClassifier model with the default parameters
random_forest = RandomForestClassifier()
random_forest.fit(X_train_trans, y_train)
y_pred = random_forest.predict(X_test_trans)
# classes are reported in sorted label order, so 'N' comes before 'Y'
print(classification_report(y_test, y_pred, target_names=['N', 'Y']))

              precision    recall  f1-score   support

           N       0.59      0.67      0.62        15
           Y       0.89      0.85      0.87        47

   micro avg       0.81      0.81      0.81        62
   macro avg       0.74      0.76      0.75        62
weighted avg       0.82      0.81      0.81        62
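
When reading the report, it helps to know that scikit-learn sorts string labels, so `N` is reported before `Y`, and any `target_names` list has to follow that order. A small sketch with made-up labels (`output_dict=True` returns the report as a dictionary):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels.
y_true = ["Y", "Y", "N", "Y"]
y_hat = ["Y", "N", "N", "Y"]

report = classification_report(y_true, y_hat, output_dict=True)

# Per-class entries are keyed by the label itself, in sorted order.
print(list(report)[:2])
print(report["N"]["support"], report["Y"]["support"])
print(accuracy_score(y_true, y_hat))
```

Leaving `target_names` out altogether is a simple way to avoid mislabelled rows, since the report then prints the actual label values.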



In [97]:
# try the model on a sample of the unlabelled test dataset
test_samp = test[:10]
test_samp = colTrans.transform(test_samp)  # reuse the fitted transformer

In [98]:
# perform prediction
random_forest.predict(test_samp) 

array(['Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y'], dtype=object)
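
One possible extension, not part of the original notebook, is to wrap the column transformer and the classifier in a scikit-learn `Pipeline`, so that preprocessing and prediction happen in a single call. The sketch below is a minimal illustration under stated assumptions: the toy frame, its labels, the step names `prep` and `model`, and the `n_estimators` and `handle_unknown` settings are all hypothetical choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, OneHotEncoder

# Hypothetical miniature version of the loan features and target.
toy_X = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
    "ApplicantIncome": [5849.0, 2583.0, 3000.0, 6000.0],
    "LoanAmount": [120.0, 120.0, 66.0, 141.0],
})
toy_y = ["Y", "N", "Y", "Y"]

pipe = Pipeline([
    # Step 1: per-column preprocessing.
    ("prep", ColumnTransformer([
        ("dummy_col", OneHotEncoder(handle_unknown="ignore"), ["Education"]),
        ("norm", Normalizer(norm="l1"), ["ApplicantIncome", "LoanAmount"]),
    ])),
    # Step 2: the classifier, seeded so repeated runs give the same model.
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# fit() fits the transformer and the model; predict() only transforms, then predicts.
pipe.fit(toy_X, toy_y)
preds = pipe.predict(toy_X.iloc[:2])
print(preds)
```

With this arrangement there is no separate `transform` call to remember for new data, because the pipeline applies the fitted preprocessing itself.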