### WBS Coding School
___
### --PROJECT--
# Supervised Machine Learning
### :: Regression: Housing prices

- *01: Data Preparation and Exploration*
- **02: Baseline Model**
- *03: Feature Selection*
- *04: Model Training and Selection*

___

# Baseline Model
___

## Table of contents:
- [1 Preprocessing](#preprocessing)
    - [1.1 Subdivide Data Classes](#subdivide)
    - [1.2 Preprocessing Pipeline](#preprocessing_pipeline)
- [2 Baseline Model](#baseline)

#### Import Libraries & Data

In [1]:
import pandas as pd
import numpy as np
import joblib
from pathlib import Path

from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

In [2]:
# Import the cleaned dataframe
housing_data = pd.read_csv("data/train_clean.csv")

#### Train/Test Split

In [3]:
# Work on a copy so that housing_data itself keeps the SalePrice column
X = housing_data.copy()
y = X.pop("SalePrice")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

___
<a id="preprocessing"></a>
# 1&nbsp; Preprocessing

Here we separate *numerical* from *categorical* data, and then further split the categorical data into *ordinal* and *nominal* features.

Finally, we create a preprocessing pipeline, which we'll export using `joblib` to be used in the following scripts.

<a id="subdivide"></a>
## 1.1&nbsp; Subdivide Data Classes

##### Split Numerical and Categorical Data

In [4]:
# Separating numeric and categorical data
X_num = X_train.select_dtypes(include="number").copy()
X_cat = X_train.select_dtypes(include="object").copy()

In [5]:
# Check if the number of columns adds up
len(X_train.columns) == len(X_num.columns) + len(X_cat.columns)

True

##### Split Ordinal and Nominal Data

To split ordinal and nominal data, I checked the description of every categorical feature and decided whether its values have a natural order. If they do, I coded the feature as a list whose elements range from worst to best (for example, from poor to excellent).

The description of each column was provided with the `data_description.txt` file, a copy of which can be found in section 3 of the first notebook (01_Data_Preparation_and_Exploration).
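As a minimal sketch of what "coding an ordered feature as a list" means for the encoder, the toy example below passes one such list to `OrdinalEncoder`. The DataFrame `toy` and the name `quality_order` are illustrative assumptions and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical category order, listed from worst to best
quality_order = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]

# Hypothetical toy column using the same labels as the quality features
toy = pd.DataFrame({"KitchenQual": ["TA", "Ex", "NA", "Gd"]})

# One list of categories per column, in the order the columns appear
encoder = OrdinalEncoder(categories=[quality_order])
encoded = encoder.fit_transform(toy)

# Each label is replaced by its position in quality_order
print(encoded.ravel())  # [3. 5. 0. 4.]
```

The position of a label in the list becomes its encoded value, so the order of each list in the next cell determines the numeric ranking the model sees.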

In [6]:
# Ordinal categories
Alley =         ["NA", "Grvl", "Pave"]
BsmtCond =      ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
BsmtExposure =  ["NA", "No", "Mn", "Av", "Gd"]
BsmtFinType1 =  ["NA", 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']
BsmtFinType2 =  ["NA", 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']
BsmtQual =      ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
CentralAir =    ["NA", "N", "Y"]
ExterCond =     ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
ExterQual =     ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
Electrical =    ["NA", "Mix", "FuseP", "FuseF", "FuseA", "SBrkr"]
Fence =         ["NA", "MnWw", "GdWo", "MnPrv", "GdPrv"]
FireplaceQu =   ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
Functional =    ["NA", "Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"]
GarageCond =    ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
GarageFinish =  ["NA", "Unf", "RFn", "Fin"]
GarageQual =    ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
HeatingQC =     ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
HouseStyle =    ["NA", "1Story", "1.5Unf", "1.5Fin", "SFoyer", "2Story", "2.5Unf", "2.5Fin", "SLvl"]
KitchenQual =   ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
LandContour =   ["NA", "Low", "HLS", "Bnk", "Lvl"]
LandSlope =     ["NA", "Sev", "Mod", "Gtl"]
LotShape =      ["NA", "IR3", "IR2", "IR1", "Reg"]
MiscFeature =   ["NA", "Shed", "Othr", "Gar2", "TenC", "Elev"]
PavedDrive =    ["NA", "N", "P", "Y"]
PoolQC =        ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
RoofMatl =      ["NA", "WdShngl", "WdShake", "Tar&Grv", "Roll", "Metal", "Membran", "CompShg", "ClyTile"]
SaleCondition = ["NA", "Partial", "Family", "Alloca", "AdjLand", "Abnorml", "Normal"]
SaleType =      ["NA", "WD", "CWD", "VWD", "New", "COD", "Con", "ConLw", "ConLI", "ConLD", "Oth"]
Utilities =     ["NA", "ELO", "NoSeWa", "NoSewr", "AllPub"]

ordinal_categories = [
    Alley,
    BsmtCond,
    BsmtExposure,
    BsmtFinType1,
    BsmtFinType2,
    BsmtQual,
    CentralAir,
    ExterCond,
    ExterQual,
    Electrical,
    Fence,
    FireplaceQu,
    Functional,
    GarageCond,
    GarageFinish,
    GarageQual,
    HeatingQC,
    HouseStyle,
    KitchenQual,
    LandContour,
    LandSlope,
    LotShape,
    MiscFeature,
    PavedDrive,
    PoolQC,
    RoofMatl,
    SaleCondition,
    SaleType,
    Utilities
]


# Ordinal column names
ordinal_cols = [
    "Alley",
    "BsmtCond",
    "BsmtExposure",
    "BsmtFinType1",
    "BsmtFinType2",
    "BsmtQual",
    "CentralAir",
    "ExterCond",
    "ExterQual",
    "Electrical",
    "Fence",
    "FireplaceQu",
    "Functional",
    "GarageCond",
    "GarageFinish",
    "GarageQual",
    "HeatingQC",
    "HouseStyle",
    "KitchenQual",
    "LandContour",
    "LandSlope",
    "LotShape",
    "MiscFeature",
    "PavedDrive",
    "PoolQC",
    "RoofMatl",
    "SaleCondition",
    "SaleType",
    "Utilities"
]

# Nominal column names
nominal_cols = [
    "BldgType",
    "Condition1",
    "Condition2",
    "Exterior1st",
    "Exterior2nd",
    "Foundation",
    "GarageType",
    "Heating",
    "LotConfig",
    "MasVnrType",
    "MSZoning",
    "Neighborhood",
    "RoofStyle",
    "Street"
]

In [7]:
# Any overlapping columns?
set(ordinal_cols) & set(nominal_cols)

set()

In [8]:
# Do the number of columns add up?
len(ordinal_cols) + len(nominal_cols) == len(X_cat.columns)

True

##### Test OneHot and Ordinal Encoding

In [12]:
# Split the categorical training data into nominal and ordinal columns
X_cat_nom = X_cat[nominal_cols].copy()
X_cat_ord = X_cat[ordinal_cols].copy()

In [13]:
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_cat_nom)
onehot_encoder.get_feature_names_out()

array(['BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex',
       'BldgType_Twnhs', 'BldgType_TwnhsE', 'Condition1_Artery',
       'Condition1_Feedr', 'Condition1_Norm', 'Condition1_PosA',
       'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn',
       'Condition1_RRNe', 'Condition1_RRNn', 'Condition2_Artery',
       'Condition2_Feedr', 'Condition2_Norm', 'Condition2_PosA',
       'Condition2_PosN', 'Condition2_RRAe', 'Condition2_RRAn',
       'Condition2_RRNn', 'Exterior1st_AsbShng', 'Exterior1st_AsphShn',
       'Exterior1st_BrkComm', 'Exterior1st_BrkFace', 'Exterior1st_CBlock',
       'Exterior1st_CemntBd', 'Exterior1st_HdBoard',
       'Exterior1st_ImStucc', 'Exterior1st_MetalSd',
       'Exterior1st_Plywood', 'Exterior1st_Stone', 'Exterior1st_Stucco',
       'Exterior1st_VinylSd', 'Exterior1st_Wd Sdng',
       'Exterior1st_WdShing', 'Exterior2nd_AsbShng',
       'Exterior2nd_AsphShn', 'Exterior2nd_Brk Cmn',
       'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd

In [11]:
ordinal_encoder = OrdinalEncoder(categories=ordinal_categories, handle_unknown='use_encoded_value', unknown_value=np.nan)
ordinal_encoder.fit(X_cat_ord)
ordinal_encoder.get_feature_names_out()

array(['Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtQual', 'CentralAir', 'ExterCond', 'ExterQual',
       'Electrical', 'Fence', 'FireplaceQu', 'Functional', 'GarageCond',
       'GarageFinish', 'GarageQual', 'HeatingQC', 'HouseStyle',
       'KitchenQual', 'LandContour', 'LandSlope', 'LotShape',
       'MiscFeature', 'PavedDrive', 'PoolQC', 'RoofMatl', 'SaleCondition',
       'SaleType', 'Utilities'], dtype=object)
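Because the encoder is configured with `handle_unknown='use_encoded_value'` and `unknown_value=np.nan`, a label that is missing from its category list does not raise an error during fitting; it is encoded as `NaN` instead. A small check like the following sketch can surface such labels. The helper `find_unlisted_values` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def find_unlisted_values(frame, columns, categories):
    """Return {column: labels that are absent from its category list}."""
    unlisted = {}
    for col, cats in zip(columns, categories):
        # Missing values are skipped here; the pipeline's imputer handles them
        extra = set(frame[col].dropna().unique()) - set(cats)
        if extra:
            unlisted[col] = extra
    return unlisted


# Hypothetical toy data: "Po" occurs in the column but not in the declared list
toy = pd.DataFrame({"Quality": ["Gd", "Po", None, "TA"]})
toy_categories = [["NA", "Fa", "TA", "Gd", "Ex"]]

result = find_unlisted_values(toy, ["Quality"], toy_categories)
print(result)  # {'Quality': {'Po'}}
```

In this notebook the call would presumably look like `find_unlisted_values(X_cat_ord, ordinal_cols, ordinal_categories)`; an empty dictionary means every observed label is covered by its list.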

<a id="preprocessing_pipeline"></a>
## 1.2&nbsp; Preprocessing Pipeline

Here we create the preprocessing pipeline that we'll use in the following scripts. Categorical features are imputed with the constant `"NA"` and then encoded with ordinal or one-hot encoding. Numerical features are imputed with the median and then passed through a standard scaler, which puts them on the same scale. The two imputers deal with missing values.

The `ColumnTransformer` in the categorical pipeline splits the data up into nominal and ordinal columns (using `nominal_cols_indices` and `ordinal_cols_indices`), transforms them separately using one-hot and ordinal encoding, respectively, and fuses them back together afterwards. The `ColumnTransformer` in the final preprocessing pipeline does the same thing, only it separates the numerical from the categorical data and finally fuses them back together.
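The positional indices are presumably needed because `SimpleImputer` returns a plain NumPy array by default, so the nested `ColumnTransformer` can no longer select columns by name. The sketch below illustrates this on a hypothetical two-column frame (`toy` is an assumption, not data from the notebook).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one nominal and one ordinal column
toy = pd.DataFrame({
    "Street": ["Pave", np.nan, "Grvl"],
    "Alley": [np.nan, "Grvl", "Pave"],
})

# With default settings the imputer returns an ndarray, so column names are lost
imputed = SimpleImputer(strategy="constant", fill_value="NA").fit_transform(toy)
print(type(imputed).__name__)  # ndarray

# get_indexer translates column names into integer positions
positions = toy.columns.get_indexer(["Alley"])
print(positions)  # [1]

# The positions still select the right column from the array
print(imputed[:, positions].ravel())
```

This is why `nominal_cols_indices` and `ordinal_cols_indices` are computed from `X_cat.columns` before the categorical pipeline is assembled.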

In [18]:
# Separating numerical and categorical data
X_num = X_train.select_dtypes(include="number").copy()
X_cat = X_train.select_dtypes(include="object").copy()


# Numerical pipeline
numeric_pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)


# Categorical pipeline
nominal_cols_indices = X_cat.columns.get_indexer(nominal_cols)
ordinal_cols_indices = X_cat.columns.get_indexer(ordinal_cols)

categoric_encoder = ColumnTransformer(
    transformers=[
        ("cat_nominal", OneHotEncoder(handle_unknown="ignore"), nominal_cols_indices),
        ("cat_ordinal", OrdinalEncoder(categories=ordinal_categories, handle_unknown='use_encoded_value', unknown_value=np.nan), ordinal_cols_indices)
    ]
)

categoric_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    categoric_encoder
)


# Preprocessing pipeline
preprocessing_pipe = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, X_num.columns),
        ("cat_pipe", categoric_pipe, X_cat.columns)
    ]
)

preprocessing_pipe

In [21]:
# Test if the preprocessing pipeline works
trial = preprocessing_pipe.fit_transform(X_train)
trial.shape

(1168, 185)

##### Export the `preprocessing_pipe`

In [22]:
# Create the directory if it doesn't already exist
Path("./pipelines").mkdir(exist_ok=True)

joblib.dump(preprocessing_pipe, 'pipelines/preprocessing_pipe.joblib')

['pipelines/preprocessing_pipe.joblib']
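For completeness, here is a minimal sketch of the dump and load round trip with `joblib`. It uses a small fitted `StandardScaler` as a stand-in for the real pipeline and a temporary directory instead of `pipelines/`; both are assumptions made to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessing pipeline (assumption for illustration)
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "scaler.joblib"
    joblib.dump(scaler, str(target))
    # Loading returns an object with the same fitted state
    restored = joblib.load(str(target))

print(np.allclose(restored.mean_, scaler.mean_))  # True
```

In a later notebook, the exported pipeline could be restored the same way with `joblib.load("pipelines/preprocessing_pipe.joblib")`, ideally using the same scikit-learn version that created the file.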

___
<a id="baseline"></a>
# 2&nbsp; Baseline Model

Here we create a simple random forest baseline model by attaching the preprocessing pipeline to a `RandomForestRegressor`.

In [23]:
full_pipe_forest = make_pipeline(
    preprocessing_pipe, 
    RandomForestRegressor()
)

full_pipe_forest.fit(X_train, y_train)

In [24]:
def get_model_metrics(y_true, y_pred):
    metrics = pd.DataFrame({
        "R2": r2_score(y_true=y_true, y_pred=y_pred),
        "MAE": mean_absolute_error(y_true=y_true, y_pred=y_pred),
        "MAPE": mean_absolute_percentage_error(y_true=y_true, y_pred=y_pred),
        "MSE": mean_squared_error(y_true=y_true, y_pred=y_pred)
    }, index=[0]).round(3)
    return metrics


y_pred = full_pipe_forest.predict(X_test)
get_model_metrics(y_true=y_test, y_pred=y_pred)

|   | R2 | MAE | MAPE | MSE |
|---|---|---|---|---|
| 0 | 0.896 | 17298.207 | 0.105 | 796179900.0 |


Our baseline model has an **R² value** of almost **0.9**, which is not bad. The mean absolute percentage error (**MAPE**) tells us that our **predictions are** on average **10.5 % off**.
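To make the MAPE reading concrete, the following sketch computes it by hand on three made-up prices and compares the result with scikit-learn's implementation. The arrays `y_true_toy` and `y_pred_toy` are hypothetical values, not results from the model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical sale prices and predictions
y_true_toy = np.array([100_000.0, 200_000.0, 400_000.0])
y_pred_toy = np.array([110_000.0, 190_000.0, 400_000.0])

# MAPE is the mean of |error| / |true value|: (0.10 + 0.05 + 0.00) / 3
manual_mape = np.mean(np.abs(y_true_toy - y_pred_toy) / np.abs(y_true_toy))
sklearn_mape = mean_absolute_percentage_error(y_true_toy, y_pred_toy)

print(round(manual_mape, 3), round(sklearn_mape, 3))  # 0.05 0.05
```

The metric is returned as a fraction, so the reported 0.105 corresponds to 10.5 %. Similarly, taking the square root of the reported MSE gives an RMSE of roughly 28,200, expressed in the same units as `SalePrice`.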

Let's try to improve these metrics using different types of models, hyperparameter tuning, and feature selection.