# &nbsp; Supervised ML: Predicting housing prices (Phase 1: Classification)
- Iteration 4: ordinal encoding

In this notebook, we apply ordinal encoding to some columns, e.g. ``ExterQual``. For this column, the order of the categories is defined manually: "Ex" (excellent quality of the material on the exterior), then "Gd", "TA", "Fa" and finally "Po" (poor quality of the material on the exterior). In the code, the list is written in the reverse direction, from "Po" to "Ex", so that a higher code means better quality.
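
As a minimal sketch of what ordinal encoding does to such a column, the snippet below encodes a toy ``ExterQual`` column (made-up rows, not taken from the dataset) with the categories listed from worst to best. The names `toy` and `quality_order` are assumptions introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column for illustration only (not rows from the real dataset)
toy = pd.DataFrame({"ExterQual": ["TA", "Ex", "Po", "Gd", "Fa"]}, dtype=object)

# Categories listed from worst to best, so a higher code means better quality
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
encoder = OrdinalEncoder(categories=[quality_order])

# np.asarray handles both numpy and pandas output configurations
codes = np.asarray(encoder.fit_transform(toy[["ExterQual"]])).ravel()
toy["ExterQual_code"] = codes
print(toy)
```

Each category is replaced by its position in `quality_order`, so "Po" becomes 0 and "Ex" becomes 4.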

With this choice of preprocessing, we obtain a more compact dataset (fewer columns are created by one-hot encoding), although our pipeline becomes more complex: the categorical branch will now contain the SimpleImputer, followed by a ColumnTransformer that branches into one-hot and ordinal encoding.

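A common error is trying to select columns by name in a ColumnTransformer after the column names have been lost in a previous pipeline step (in our case, the SimpleImputer, which returns a NumPy array by default). This can be solved either by using a pandas method instead of a sklearn transformer (e.g. `df.fillna`), wrapped in a FunctionTransformer, or by passing the indices instead of the column names to the ColumnTransformer, as the second approach in section 2 does with `.get_indexer()`. In this notebook, however, we call `set_config(transform_output="pandas")`, so the SimpleImputer returns a DataFrame and the column names are kept. The names seen during fitting are also stored on the imputer (you can check it in the documentation under ``feature_names_in_``).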

## 0.&nbsp; Understanding the datasets

**Dataset variables:**

*   LotArea --> Lot size in square feet
*   LotFrontage  -->  Linear feet of street connected to property
*   TotalBsmtSF  -->  Total square feet of basement area
*   BedroomAbvGr -->  Bedrooms above grade (does NOT include basement bedrooms)
*   Fireplaces   -->  Number of fireplaces
*   PoolArea     -->  Pool area in square feet
*   GarageCars   -->  Size of garage in car capacity
*   WoodDeckSF   -->  Wood deck area in square feet
*   ScreenPorch  -->  Screen porch area in square feet

* MSZoning  -->  Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

* Condition1  -->  Proximity to various conditions
	
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

* Heating  -->  Type of heating
		
       Floor	Floor Furnace
       GasA	Gas forced warm air furnace
       GasW	Gas hot water or steam heat
       Grav	Gravity furnace	
       OthW	Hot water or steam heat other than gas
       Wall	Wall furnace

* Street  -->  Type of road access to property

       Grvl	Gravel	
       Pave	Paved

* CentralAir  -->  Central air conditioning

       N	No
       Y	Yes

* Foundation  -->  Type of foundation
		
       BrkTil	Brick & Tile
       CBlock	Cinder Block
       PConc	Poured Concrete	
       Slab	Slab
       Stone	Stone
       Wood	Wood

* ExterQual  -->  Evaluates the quality of the material on the exterior
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
		
* ExterCond  -->  Evaluates the present condition of the material on the exterior
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

* BsmtQual  -->  Evaluates the height of the basement

       Ex	Excellent (100+ inches)	
       Gd	Good (90-99 inches)
       TA	Typical (80-89 inches)
       Fa	Fair (70-79 inches)
       Po	Poor (<70 inches)
       NA	No Basement
		
* BsmtCond  -->  Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement
	
* BsmtExposure  -->  Refers to walkout or garden level walls

       Gd	Good Exposure
       Av	Average Exposure (split levels or foyers typically score average or above)	
       Mn	Minimum Exposure
       No	No Exposure
       NA	No Basement
	
* BsmtFinType1  -->  Rating of basement finished area

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement

* KitchenQual  -->  Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

* FireplaceQu  -->  Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace

## 1.&nbsp; Data reading & splitting

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn import set_config

In [2]:
# reading: housing_iteration_4_classification
url = "https://drive.google.com/file/d/1lK-T9d2UAgOQ_QeHThPZbuPRa82evmJ2/view?usp=sharing"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]

housing_data = pd.read_csv(path)
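
The URL manipulation in the cell above can be sketched as a small function. `drive_download_url` is a hypothetical helper introduced here for illustration; it assumes the share link has the form `.../file/d/<FILE_ID>/view?...` and makes no network request.

```python
def drive_download_url(share_url: str) -> str:
    """Build a direct-download link from a Google Drive share link (hypothetical helper)."""
    # Share links look like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file id is the second-to-last piece after splitting on "/"
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share = "https://drive.google.com/file/d/1lK-T9d2UAgOQ_QeHThPZbuPRa82evmJ2/view?usp=sharing"
print(drive_download_url(share))
```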

In [3]:
housing_data.sample(3)

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive,...,CentralAir,Foundation,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,KitchenQual,FireplaceQu
407,15576,63.0,840,4,0,0,1,0,0,0,...,Y,BrkTil,TA,TA,Gd,TA,No,Unf,TA,
564,13346,,1095,4,1,0,2,0,0,1,...,Y,PConc,Gd,TA,Gd,TA,No,GLQ,Gd,TA
394,10134,60.0,735,2,0,0,1,0,0,0,...,Y,CBlock,TA,TA,TA,TA,No,Unf,TA,


In [4]:
housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotArea       1460 non-null   int64  
 1   LotFrontage   1201 non-null   float64
 2   TotalBsmtSF   1460 non-null   int64  
 3   BedroomAbvGr  1460 non-null   int64  
 4   Fireplaces    1460 non-null   int64  
 5   PoolArea      1460 non-null   int64  
 6   GarageCars    1460 non-null   int64  
 7   WoodDeckSF    1460 non-null   int64  
 8   ScreenPorch   1460 non-null   int64  
 9   Expensive     1460 non-null   int64  
 10  MSZoning      1460 non-null   object 
 11  Condition1    1460 non-null   object 
 12  Heating       1460 non-null   object 
 13  Street        1460 non-null   object 
 14  CentralAir    1460 non-null   object 
 15  Foundation    1460 non-null   object 
 16  ExterQual     1460 non-null   object 
 17  ExterCond     1460 non-null   object 
 18  BsmtQual      1423 non-null 

- To build `X`, we copy the full dataset; in this iteration no feature columns are dropped.
- We then "pop" the target column, `Expensive`, out of `X` and store it as `y`.

In [5]:
# X and y creation
X = housing_data.copy()
y = X.pop("Expensive")

# data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

## 2.&nbsp; Building the `preprocessor`

We create the preprocessing pipeline here.

In [6]:
# 0. Set the config so that we can view our preprocessor, and to transform output from numpy arrays to pandas dataframes
set_config(display="diagram")
set_config(transform_output="pandas")

In [7]:
# 1. defining categorical & numerical columns
X_cat = X.select_dtypes(exclude="number").copy()
X_num = X.select_dtypes(include="number").copy()

In [8]:
# 2. numerical pipeline
numeric_pipe = make_pipeline(
    SimpleImputer(strategy="mean"))

In [9]:
# 3. categorical pipeline

# # 3.1 defining ordinal & onehot columns
# column names can be used directly here: with transform_output="pandas" (cell 6) the SimpleImputer keeps them
ordinal_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "KitchenQual", "FireplaceQu"]
onehot_cols = ["MSZoning", "Condition1", "Heating", "Street", "CentralAir", "Foundation"]

In [10]:
# # 3.2. defining the categorical encoder

# # # 3.2.1. we manually establish the order of the categories for our ordinal features, from the lowest to the highest rating, including "N_A" where values can be missing
ExterQual_cats = ["Po", "Fa", "TA", "Gd", "Ex"]
ExterCond_cats = ["Po", "Fa", "TA", "Gd", "Ex"]
BsmtQual_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
BsmtCond_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
BsmtExposure_cats = ["N_A", "No", "Mn", "Av", "Gd"]
BsmtFinType1_cats = ["N_A", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]
KitchenQual_cats = ["Po", "Fa", "TA", "Gd", "Ex"]
FireplaceQu_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
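
Because OrdinalEncoder raises an error by default when it meets a value that is not in its category list, it can help to compare the lists with the data before fitting. The sketch below shows one possible check on a toy column; `unexpected_categories`, `toy` and `toy_cats` are hypothetical names used only for this illustration.

```python
import pandas as pd


def unexpected_categories(df, column, categories, fill_value="N_A"):
    """Return the values of `column` that `categories` does not cover (hypothetical helper)."""
    # Fill missing values the same way the pipeline's SimpleImputer would
    observed = set(df[column].fillna(fill_value).unique())
    return sorted(observed - set(categories))


# Toy column for illustration only; "Excellent" is a deliberate mismatch
toy = pd.DataFrame({"BsmtQual": ["Gd", "TA", None, "Ex", "Excellent"]}, dtype=object)
toy_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]

print(unexpected_categories(toy, "BsmtQual", toy_cats))
```

An empty list means every observed value, including the imputed "N_A", has a place in the manual ordering.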

In [11]:
# # # 3.2.2. defining the categorical encoder: a ColumnTransformer with 2 branches: ordinal & onehot
categorical_encoder = ColumnTransformer(
    transformers=[
        ("cat_ordinal",
         OrdinalEncoder(categories=[ExterQual_cats, ExterCond_cats, BsmtQual_cats, BsmtCond_cats,
                                    BsmtExposure_cats, BsmtFinType1_cats, KitchenQual_cats, FireplaceQu_cats]),
         ordinal_cols),
        ("cat_onehot",
         OneHotEncoder(handle_unknown="ignore", sparse_output=False),
         onehot_cols),
    ]
)

In [12]:
# # 3.3. categorical pipeline = "N_A" imputer + categorical encoder
categorical_pipe = make_pipeline(SimpleImputer(strategy="constant", fill_value="N_A"),
                                 categorical_encoder
                                )

In [13]:
# 4. full preprocessing: a ColumnTransformer with 2 branches: numeric & categorical
full_preprocessing = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, X_num.columns),
        ("cat_pipe", categorical_pipe, X_cat.columns),
    ]
)

full_preprocessing

A second approach uses `make_column_transformer` with three branches (numeric, ordinal and one-hot), each with its own imputer:

In [None]:
from sklearn.compose import make_column_transformer

# 1. defining categorical & numerical columns
X_cat = X.select_dtypes(exclude="number").copy()
X_num = X.select_dtypes(include="number").copy()

# 2. numerical pipeline
numeric_pipe = make_pipeline(
    SimpleImputer())

# 3. categorical pipelines

# # 3.1 defining ordinal & onehot columns
# .get_indexer() returns the positions of the columns, which solves the problem described above about losing column names
ordinal_idx = X.columns.get_indexer(["ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "KitchenQual", "FireplaceQu"])
onehot_idx = X.columns.get_indexer(["MSZoning", "Condition1", "Heating", "Street", "CentralAir", "Foundation"])

# # 3.2. ordinal pipeline = "N_A" imputer + ordinal encoder

# # # 3.2.1. category order for each ordinal feature, reusing the lists defined in cell 10
ordinal_cats = [ExterQual_cats, ExterCond_cats, BsmtQual_cats, BsmtCond_cats,
                BsmtExposure_cats, BsmtFinType1_cats, KitchenQual_cats, FireplaceQu_cats]

# # # 3.2.2. defining the ordinal pipeline
ordinal_pipe = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="N_A"),
        OrdinalEncoder(categories=ordinal_cats),
        )

# # 3.3. onehot pipeline = "N_A" imputer + onehot encoder
onehot_pipe = make_pipeline(SimpleImputer(strategy="constant", fill_value="N_A"),
                            OneHotEncoder(handle_unknown="ignore", sparse_output=False)
                            )

# 4. full preprocessing: a ColumnTransformer with 3 branches: numeric, ordinal & onehot
preprocessor_three_pipes = make_column_transformer(
    (numeric_pipe, X_num.columns),
    (ordinal_pipe, ordinal_idx),
    (onehot_pipe, onehot_idx))
preprocessor_three_pipes

## 3.&nbsp; Decision Tree

In [15]:
from sklearn.model_selection import GridSearchCV

# full pipeline: preprocessor + model
full_pipeline = make_pipeline(full_preprocessing,
                              DecisionTreeClassifier())

In [16]:
# define parameter grid
param_grid = {
    "columntransformer__num_pipe__simpleimputer__strategy":["mean", "median", "constant"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    "decisiontreeclassifier__max_depth": range(2, 16, 2),
    "decisiontreeclassifier__min_samples_leaf": range(3, 16, 2)
}
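
The keys of the parameter grid follow the step names that `make_pipeline` generates (the lowercased class names), joined by double underscores. As a rough sketch on a simplified two-step pipeline (an assumption; the notebook's pipeline has the extra `columntransformer__num_pipe__` prefix), the available keys can be listed with `get_params()` and the number of candidates counted with `ParameterGrid`:

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Simplified stand-in for the full pipeline
toy_pipeline = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

# Nested parameters are addressed as <step name>__<parameter name>
tunable = [name for name in toy_pipeline.get_params() if "__" in name]
print(tunable[:5])

toy_grid = {
    "simpleimputer__strategy": ["mean", "median", "constant"],
    "simpleimputer__fill_value": [10],
    "decisiontreeclassifier__max_depth": range(2, 16, 2),
    "decisiontreeclassifier__min_samples_leaf": range(3, 16, 2),
}

# 3 strategies x 1 fill value x 7 depths x 7 leaf sizes
print(len(ParameterGrid(toy_grid)))
```

Note that `fill_value` only has an effect when the strategy is "constant"; for "mean" and "median" it is ignored.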

In [17]:
# define GridSearchCV
search = GridSearchCV(full_pipeline,
                      param_grid,
                      cv=5,
                      verbose=1)

search.fit(X_train, y_train)

search.best_score_

Fitting 5 folds for each of 147 candidates, totalling 735 fits


np.float64(0.9169692967976231)

In [18]:
search.best_estimator_
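
The cross-validated `best_score_` is computed on the training folds only. A possible next step, sketched here on synthetic data so that it runs on its own, is to score the refitted search on the held-out test set with the already imported `accuracy_score`. The synthetic dataset and the `toy_`-prefixed names are assumptions; with the notebook's objects the analogous call would be `accuracy_score(y_test, search.predict(X_test))`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the housing data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)

toy_search = GridSearchCV(DecisionTreeClassifier(random_state=123),
                          {"max_depth": [2, 4, 6]},
                          cv=5)
toy_search.fit(X_tr, y_tr)

# With refit=True (the default), predict() uses the best estimator
# retrained on the whole training set
test_accuracy = accuracy_score(y_te, toy_search.predict(X_te))
print(toy_search.best_params_, round(test_accuracy, 3))
```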