**Column Info:**
1. LotFrontage: Linear feet of street connected to property
2. LotArea: Lot size in square feet
3. TotalBsmtSF: Total square feet of basement area
4. BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
5. Fireplaces: Number of fireplaces
6. PoolArea: Pool area in square feet
7. GarageCars: Size of garage in car capacity 
8. WoodDeckSF: Wood deck area in square feet
9. ScreenPorch: Screen porch area in square feet
10. MSZoning: Identifies the general zoning classification of the sale.
    * A		Agriculture
    * C		Commercial
    * FV	Floating Village Residential
    * I		Industrial
    * RH	Residential High Density
    * RL	Residential Low Density
    * RP	Residential Low Density Park 
    * RM	Residential Medium Density
11. Condition1: Proximity to various conditions
    * Artery	Adjacent to arterial street
    * Feedr	Adjacent to feeder street	
    * Norm	Normal	
    * RRNn	Within 200' of North-South Railroad
    * RRAn	Adjacent to North-South Railroad
    * PosN	Near positive off-site feature--park, greenbelt, etc.
    * PosA	Adjacent to positive off-site feature
    * RRNe	Within 200' of East-West Railroad
    * RRAe	Adjacent to East-West Railroad
12. Heating: Type of heating
    * Floor	Floor Furnace
    * GasA	Gas forced warm air furnace
    * GasW	Gas hot water or steam heat
    * Grav	Gravity furnace	
    * OthW	Hot water or steam heat other than gas
    * Wall	Wall furnace
13. Street: Type of road access to property
    * Grvl	Gravel	
    * Pave	Paved
14. CentralAir: Central air conditioning
    * N	No
    * Y	Yes
15. Foundation: Type of foundation
    * BrkTil	Brick & Tile
    * CBlock	Cinder Block
    * PConc	Poured Concrete
    * Slab	Slab
    * Stone	Stone
    * Wood	Wood

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# read the data: turn the Google Drive share link into a direct-download URL
url = "https://drive.google.com/file/d/17q13ZwCKugmtOfnMBbv3ZwX-1y-tbkUU/view?usp=sharing"
path = "https://drive.google.com/uc?export=download&id=" + url.split("/")[-2]
data = pd.read_csv(path)
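
The link conversion in the cell above relies on the file ID being the second-to-last piece when the share URL is split on "/". A minimal sketch of just that string manipulation, with no download involved (file_id is a name introduced here for illustration):

```python
# The Google Drive share link used in the cell above.
url = "https://drive.google.com/file/d/17q13ZwCKugmtOfnMBbv3ZwX-1y-tbkUU/view?usp=sharing"

# Splitting on "/" leaves the file ID right before the trailing "view?usp=sharing" part.
file_id = url.split("/")[-2]

# Append the ID to the direct-download prefix.
path = "https://drive.google.com/uc?export=download&id=" + file_id

print(file_id)
print(path)
```

If the share link had a different shape (for example, no trailing "view" segment), the index -2 would pick the wrong piece, so this approach assumes links of exactly this form.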

In [None]:
data

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive,MSZoning,Condition1,Heating,Street,CentralAir,Foundation
0,8450,65.0,856,3,0,0,2,0,0,0,RL,Norm,GasA,Pave,Y,PConc
1,9600,80.0,1262,3,1,0,2,298,0,0,RL,Feedr,GasA,Pave,Y,CBlock
2,11250,68.0,920,3,1,0,2,0,0,0,RL,Norm,GasA,Pave,Y,PConc
3,9550,60.0,756,3,1,0,3,0,0,0,RL,Norm,GasA,Pave,Y,BrkTil
4,14260,84.0,1145,4,1,0,3,192,0,0,RL,Norm,GasA,Pave,Y,PConc
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,7917,62.0,953,3,1,0,2,0,0,0,RL,Norm,GasA,Pave,Y,PConc
1456,13175,85.0,1542,3,2,0,2,349,0,0,RL,Norm,GasA,Pave,Y,CBlock
1457,9042,66.0,1152,4,2,0,1,0,0,1,RL,Norm,GasA,Pave,Y,Stone
1458,9717,68.0,1078,2,0,0,1,366,0,0,RL,Norm,GasA,Pave,Y,CBlock


# X and y creation

In [None]:
# work on a copy so that `data` keeps the target column and the cell can be re-run
X = data.copy()
y = X.pop("Expensive")
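
DataFrame.pop removes the column from the frame in place and returns it as a Series, which is why X no longer contains the target afterwards. A small sketch on a hypothetical two-row frame (toy and target are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature version of the dataset.
toy = pd.DataFrame({"LotArea": [8450, 9600], "Expensive": [0, 1]})

# pop returns the column as a Series and drops it from toy in place.
target = toy.pop("Expensive")

print(list(toy.columns))
print(target.tolist())
```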

In [None]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotArea       1460 non-null   int64  
 1   LotFrontage   1201 non-null   float64
 2   TotalBsmtSF   1460 non-null   int64  
 3   BedroomAbvGr  1460 non-null   int64  
 4   Fireplaces    1460 non-null   int64  
 5   PoolArea      1460 non-null   int64  
 6   GarageCars    1460 non-null   int64  
 7   WoodDeckSF    1460 non-null   int64  
 8   ScreenPorch   1460 non-null   int64  
 9   MSZoning      1460 non-null   object 
 10  Condition1    1460 non-null   object 
 11  Heating       1460 non-null   object 
 12  Street        1460 non-null   object 
 13  CentralAir    1460 non-null   object 
 14  Foundation    1460 non-null   object 
dtypes: float64(1), int64(8), object(6)
memory usage: 171.2+ KB
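
The summary above shows LotFrontage as the only column with fewer than 1460 non-null entries, i.e. 1460 - 1201 = 259 missing values. A quick way to get such counts directly, sketched on a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in one column only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
print(missing[missing > 0])
```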


# Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
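
With test_size=0.2 and 1460 rows, the split should leave 1168 rows for training and 292 for testing, and fixing random_state makes it reproducible. A sketch that checks both points on placeholder data of the same length (toy_X and toy_y are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with as many rows as the real dataset.
toy_X = np.arange(1460).reshape(-1, 1)
toy_y = np.arange(1460) % 2

# Two splits with the same random_state.
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=123)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=123)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))  # same seed, same rows in the test set
```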

# Categorical encoding - "Automated" approach (Using Pipelines)
In the manual approach, to encode the categorical columns numerically, we:

1. Selected the categorical columns.
2. Fitted a OneHotEncoder to them.
3. Transformed the categorical columns with the encoder.
4. Converted the sparse matrix into a dataframe.
5. Recovered the names of the columns.
6. Concatenated the one-hot columns with the numerical columns.

All these steps can be condensed by using Scikit-Learn Pipelines and, specifically, a ColumnTransformer, which allows us to apply different transformations to two or more groups of columns: in our case, categorical and numerical columns.

This process is also called creating "branches" in the pipeline. One branch for the categorical columns and another for the numerical columns. Each branch will contain as many transformers as we want. Then, the branches will meet again, and the transformed columns will be automatically concatenated. Let's see the process in action:

## 3.1. Creating the "numeric pipe" and the "categoric pipe"

In [None]:
# import
from sklearn.preprocessing import OneHotEncoder

In [None]:
#OneHotEncoder(handle_unknown="ignore")

In [None]:
# select categorical and numerical column names
X_cat_columns = X.select_dtypes(exclude="number").copy().columns
X_num_columns = X.select_dtypes(include="number").copy().columns

# create numerical pipeline, only with the SimpleImputer(strategy="mean")
numeric_pipe = make_pipeline(SimpleImputer(strategy="mean"))

# create categorical pipeline, with the SimpleImputer(strategy="constant", fill_value="N_A") and the OneHotEncoder
categoric_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OneHotEncoder()
)
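
To see what each pipe does on its own, here is a small sketch on hypothetical data with one missing value per column. The frames toy_num and toy_cat and the names num_demo and cat_demo are made up for this example; the pipes mirror the two defined above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical columns, each with one missing entry.
toy_num = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0]})
toy_cat = pd.DataFrame({"Heating": ["GasA", np.nan, "Wall"]}, dtype=object)

num_demo = make_pipeline(SimpleImputer(strategy="mean"))
cat_demo = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OneHotEncoder(),
)

num_out = num_demo.fit_transform(toy_num)
cat_out = cat_demo.fit_transform(toy_cat)

# The gap in the numeric column is filled with the mean of 60 and 80.
print(num_out.ravel())

# The gap in the categorical column becomes its own category, "N_A".
print(cat_demo[-1].categories_[0])
print(cat_out.shape)
```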

## 3.2. Using ColumnTransformer to build a pipeline with 2 branches (the preprocessor)
We simply tell the pipeline the following:

* One branch, called "num_pipe", will apply the steps in the numeric_pipe to the columns named in X_num_columns.
* The second branch, called "cat_pipe", will apply the steps in the categoric_pipe to the columns named in X_cat_columns.

In [None]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, X_num_columns),
        ("cat_pipe", categoric_pipe, X_cat_columns),
    ]
)
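
A sketch of how the two branches meet again, using a hypothetical four-row frame: two numerical columns pass through the numeric branch unchanged in number, two categorical columns with two categories each become four one-hot columns, and the result has 2 + 4 = 6 columns. The names toy and demo_pre are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: two numerical and two categorical columns.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "CentralAir": ["Y", "Y", "N", "Y"],
})
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

demo_pre = ColumnTransformer(
    transformers=[
        ("num_pipe", make_pipeline(SimpleImputer(strategy="mean")), num_cols),
        ("cat_pipe",
         make_pipeline(SimpleImputer(strategy="constant", fill_value="N_A"),
                       OneHotEncoder()),
         cat_cols),
    ]
)

# Each branch transforms its own columns; the outputs are concatenated side by side.
out = demo_pre.fit_transform(toy)
print(out.shape)
```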

## 3.3. Creating the full_pipeline (preprocessor + Decision Tree)
Pipelines are modular. The preprocessor we created above with the ColumnTransformer can now become a step in a new pipeline, which we'll call full_pipeline and which will include, as a last step, a Decision Tree model:

In [None]:
full_pipeline = make_pipeline(preprocessor, 
                              DecisionTreeClassifier())

In [None]:
full_pipeline.fit(X_train, y_train)

In [None]:
full_pipeline.predict(X_train)

array([1, 0, 1, ..., 1, 0, 0])

# Use the new Pipeline with branches to train a DecisionTree with GridSearchCV cross-validation
The task is to combine what you have learned in this notebook (categorical encoding & branches) with what you learned in the previous one (using GridSearchCV for a whole Pipeline).
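
When tuning a whole pipeline, each parameter is addressed by chaining step names with double underscores. make_pipeline names every step after its lowercased class name, while the branch names ("num_pipe", "cat_pipe") are the ones given to the ColumnTransformer. A sketch that lists these names on a stand-in pipeline (demo_pre and demo_pipeline are hypothetical and use only two columns):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline with the same structure as full_pipeline.
demo_pre = ColumnTransformer(
    transformers=[
        ("num_pipe", make_pipeline(SimpleImputer(strategy="mean")), ["LotArea"]),
        ("cat_pipe",
         make_pipeline(SimpleImputer(strategy="constant", fill_value="N_A"),
                       OneHotEncoder()),
         ["Street"]),
    ]
)
demo_pipeline = make_pipeline(demo_pre, DecisionTreeClassifier())

# Top-level step names generated by make_pipeline.
print(list(demo_pipeline.named_steps))

# get_params() exposes every tunable parameter under its full, chained name.
param_names = demo_pipeline.get_params().keys()
print("columntransformer__cat_pipe__onehotencoder__handle_unknown" in param_names)
print("decisiontreeclassifier__max_depth" in param_names)
```

Checking a key against get_params() before launching a long search is a cheap way to catch typos in a parameter grid.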

In [None]:
# SimpleImputer, DecisionTreeClassifier and make_pipeline were already imported above
from sklearn.model_selection import GridSearchCV

In [None]:
# display full_pipeline as a diagram to check that its structure is as intended
from sklearn import set_config

set_config(display="diagram")
full_pipeline # click on the diagram below to see the details of each step

In [None]:
# the transformers and the model are already part of full_pipeline,
# so only the parameters to search over are listed here

# create parameter grid (keys follow the pattern step__substep__parameter)
param_grid = {
    # further options that could be enabled:
    # "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    # "columntransformer__cat_pipe__simpleimputer__strategy": ["constant"],
    # "columntransformer__cat_pipe__simpleimputer__fill_value": ["N_A"],
    # the two below would first need a StandardScaler step in numeric_pipe:
    # "columntransformer__num_pipe__standardscaler__with_mean": [True, False],
    # "columntransformer__num_pipe__standardscaler__with_std": [True, False],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(2, 12, 2),
    "decisiontreeclassifier__min_samples_split": range(3, 40, 5),
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
    "columntransformer__cat_pipe__onehotencoder__handle_unknown": ["ignore"],
}


# define the grid search with 7-fold cross validation
search = GridSearchCV(full_pipeline,
                      param_grid,
                      cv=7,
                      verbose=1,
                      scoring="accuracy"
                      )

# fit the grid search on the training data
search.fit(X_train, y_train)

Fitting 7 folds for each of 960 candidates, totalling 6720 fits
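
The numbers in this message follow directly from the grid: 12 depths x 5 leaf sizes x 8 split sizes x 2 criteria x 1 encoder setting = 960 candidates, each fitted once per fold. A sketch that reproduces the count with ParameterGrid, using the same values as the grid above written out as lists (grid, n_candidates and n_folds are illustrative names):

```python
from sklearn.model_selection import ParameterGrid

# Same values as in param_grid above.
grid = {
    "decisiontreeclassifier__max_depth": list(range(2, 14)),             # 12 values
    "decisiontreeclassifier__min_samples_leaf": list(range(2, 12, 2)),   # 5 values
    "decisiontreeclassifier__min_samples_split": list(range(3, 40, 5)),  # 8 values
    "decisiontreeclassifier__criterion": ["gini", "entropy"],            # 2 values
    "columntransformer__cat_pipe__onehotencoder__handle_unknown": ["ignore"],
}

n_candidates = len(ParameterGrid(grid))
n_folds = 7

print(n_candidates, n_candidates * n_folds)  # 960 6720
```

Because the count is a product, every extra option multiplies the number of fits, which is worth estimating before starting a search.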


In [None]:
# cross validation average accuracy
search.best_score_

0.9212384181722613

In [None]:
# best parameters
search.best_params_

{'columntransformer__cat_pipe__onehotencoder__handle_unknown': 'ignore',
 'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 7,
 'decisiontreeclassifier__min_samples_leaf': 8,
 'decisiontreeclassifier__min_samples_split': 28}
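
With the default refit=True, GridSearchCV retrains the best parameter combination on the whole training set and stores the result in best_estimator_. Calling predict on the search object then delegates to that fitted pipeline. A sketch on synthetic data with a deliberately tiny grid (all demo_ names and the data are assumptions for illustration, unrelated to the housing dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(random_state=0),
)
demo_search = GridSearchCV(
    demo_pipeline,
    {"decisiontreeclassifier__max_depth": [2, 4]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# The refitted pipeline with the best parameters.
best_model = demo_search.best_estimator_

# Predictions from the search object and from best_estimator_ are identical.
same = np.array_equal(demo_search.predict(X_demo), best_model.predict(X_demo))
print(demo_search.best_params_)
print(same)
```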

# Accuracy

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# training accuracy on the entire training set (search.predict uses the refitted best estimator)
y_train_pred = search.predict(X_train)

accuracy_score(y_train, y_train_pred)

0.9366438356164384

In [None]:
# testing accuracy
y_test_pred = search.predict(X_test)

accuracy_score(y_test, y_test_pred)

0.910958904109589
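
accuracy_score is simply the share of predictions that match the true labels. The sketch below shows this on four made-up labels. Its last line checks that the test accuracy reported above is consistent with 266 correct predictions out of 292 test rows; that count is an inference from the 80/20 split of 1460 rows, not a figure stated in the notebook.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: three of the four predictions are correct.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.75

# Consistency check of the reported test accuracy against 266 / 292.
print(abs(266 / 292 - 0.910958904109589) < 1e-12)
```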