#### Categorical encoding - "Automated" approach (Using Pipelines)
In the manual approach, to encode the categorical columns numerically, we went through the following steps:

- Selected the categorical columns.
- Fitted a OneHotEncoder to them.
- Transformed the categorical columns with the encoder.
- Converted the sparse matrix into a dataframe.
- Recovered the names of the columns.
- Concatenated the one-hot columns with the numerical columns.

However, in the automated approach, we will combine all of these steps using scikit-learn pipelines and the ColumnTransformer.

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
X = pd.read_csv('housing-classification-iter3.csv')
y = X.pop('Expensive')
X.head(3)

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,MSZoning,Condition1,Heating,Street,CentralAir,Foundation
0,8450,65.0,856,3,0,0,2,0,0,RL,Norm,GasA,Pave,Y,PConc
1,9600,80.0,1262,3,1,0,2,298,0,RL,Feedr,GasA,Pave,Y,CBlock
2,11250,68.0,920,3,1,0,2,0,0,RL,Norm,GasA,Pave,Y,PConc


In [16]:
# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1245)

##### Creating the numerical and categorical pipeline

In [17]:
X_cat_columns = X.select_dtypes(exclude='number').columns
X_num_columns = X.select_dtypes(include='number').columns

In [28]:
# create numerical pipeline, only with the SimpleImputer(strategy="mean")
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
numeric_pipe = make_pipeline(
    SimpleImputer(strategy='mean'))


# create categorical pipeline, with the SimpleImputer(fill_value="N_A") and the OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
categoric_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='N_A'),
    OneHotEncoder(handle_unknown='ignore')  # ignore unseen categories, encoding them as all zeros
)
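
To see what `handle_unknown='ignore'` does in practice, here is a small sketch on made-up category values (the arrays below are hypothetical and only serve the illustration). A category that was not present during `fit` is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories for a single column
train_heating = np.array([["GasA"], ["GasW"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_heating)

# A category seen during fit gets a regular one-hot row
known_row = enc.transform(np.array([["GasA"]], dtype=object)).toarray()

# A category never seen during fit is encoded as all zeros
unseen_row = enc.transform(np.array([["Wall"]], dtype=object)).toarray()

print(known_row)
print(unseen_row)
```

Without this setting, the default `handle_unknown='error'` would raise an exception at transform time, which can happen during cross-validation when a rare category appears only in the validation fold.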

##### Using ColumnTransformer to build a pipeline with 2 branches (the preprocessor)

In [29]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num_pipe', numeric_pipe, X_num_columns),
        ('cat_pipe', categoric_pipe, X_cat_columns)
    ])

##### Creating the full_pipeline (preprocessor + Decision Tree)

In [30]:
from sklearn.tree import DecisionTreeClassifier
full_pipeline = make_pipeline(preprocessor, DecisionTreeClassifier())

In [31]:
full_pipeline.fit(X_train, y_train)

##### Using the new pipeline with branches to train a decision tree with grid search cross-validation

In [32]:
param_grid = {
    'decisiontreeclassifier__max_depth': range(2, 12),
    'decisiontreeclassifier__min_samples_leaf': range(3, 10, 2),
    'decisiontreeclassifier__min_samples_split': range(3, 40, 5),
    'decisiontreeclassifier__criterion': ['gini', 'entropy']
}
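
The `decisiontreeclassifier__` prefix comes from the step name that `make_pipeline` generates automatically (the lowercased class name), followed by a double underscore and the parameter name. The sketch below checks this on a small stand-in pipeline (`demo_pipe` is a hypothetical name, and it uses a plain imputer instead of the full preprocessor) and counts the combinations in the grid with `ParameterGrid`.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline: make_pipeline names each step after its lowercased class
demo_pipe = make_pipeline(SimpleImputer(strategy="mean"), DecisionTreeClassifier())
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

grid = {
    "decisiontreeclassifier__max_depth": range(2, 12),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10, 2),
    "decisiontreeclassifier__min_samples_split": range(3, 40, 5),
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
}

# Every key in the grid must be a valid parameter name of the pipeline
valid_keys = all(key in demo_pipe.get_params() for key in grid)

# Number of parameter combinations the grid search will try
n_candidates = len(ParameterGrid(grid))
print(valid_keys, n_candidates)
```

With 5-fold cross-validation, each candidate is fitted 5 times, so the number of candidates multiplied by 5 gives the total number of fits.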


In [33]:
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(full_pipeline,
                      param_grid,
                      cv=5,
                      verbose=1)

search.fit(X_train, y_train)

Fitting 5 folds for each of 640 candidates, totalling 3200 fits


In [34]:
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 5,
 'decisiontreeclassifier__min_samples_leaf': 7,
 'decisiontreeclassifier__min_samples_split': 3}

In [35]:
search.best_score_

0.9246762774659769
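
Note that `best_score_` is the mean cross-validated accuracy on the training data, so the held-out test set is still available for a final check. A minimal sketch of that check is shown below on synthetic data from `make_classification` (the dataset, the `demo_` names, and the tiny grid are assumptions for illustration, not the housing data or the grid used above).

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_search = GridSearchCV(
    make_pipeline(SimpleImputer(strategy="mean"), DecisionTreeClassifier(random_state=0)),
    {"decisiontreeclassifier__max_depth": [2, 3]},
    cv=5,
)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, predict() uses the best pipeline refitted on all training data
test_accuracy = accuracy_score(y_te, demo_search.predict(X_te))
print(demo_search.best_params_, test_accuracy)
```

In the notebook above, the equivalent call would be `accuracy_score(y_test, search.predict(X_test))`.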