**DS 301: Applied Data Modeling and Predictive Analysis**

# Lab 8 – Random Forests, AdaBoost

Nok Wongpiromsarn, 10 August 2022

**Instructions:**
Perform regression with 'SalePrice' as the output.
1. Select at least 2 features of your choice. Explain why you select these features.
2. Prepare the data using Pipeline and ColumnTransformer. Explain the reasoning behind having each transformation in the Pipeline. Hint: Consider, e.g., StandardScaler, OneHotEncoder, etc.
3. Train the following models
   - RandomForestRegressor
   - AdaBoostRegressor
   - XGBRegressor
4. Evaluate each of the above models based on RMSE.

**Get the data and allocate some for testing**

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("datasets/house-price.csv")
data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)

### 1. Select at least 2 features of your choice. Explain why you select these features.

In [2]:
# Use only the training set for feature selection so that the test set stays untouched

data_encoded = pd.get_dummies(data_train)
abs_corr = abs(data_encoded.corr()["SalePrice"])
abs_corr.sort_values(ascending=False)

SalePrice           1.000000
OverallQual         0.785555
GrLivArea           0.695652
GarageCars          0.640991
GarageArea          0.624139
                      ...   
Foundation_Stone    0.002416
Fence_GdPrv         0.002171
Condition1_RRNe     0.002107
GarageCond_Gd       0.001725
RoofMatl_Metal      0.000546
Name: SalePrice, Length: 288, dtype: float64

In [3]:
# Pick the features whose absolute correlation with SalePrice exceeds 0.55,
# i.e., the features with the strongest linear relationship to the target.

attribs_encoded = data_encoded.columns[abs_corr > 0.55]
attribs_encoded = attribs_encoded[(attribs_encoded != "SalePrice")]
attribs_encoded

Index(['OverallQual', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'FullBath',
       'GarageCars', 'GarageArea', 'ExterQual_TA'],
      dtype='object')
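
To make the selection rule concrete, here is a minimal sketch on a tiny made-up table. The column names (`Area`, `Noise`, `Quality`, `Price`) and all values are hypothetical and are not taken from the house-price dataset. It shows how `pd.get_dummies` expands a categorical column and how the absolute-correlation threshold filters the encoded columns.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration only.
toy = pd.DataFrame({
    "Area": [50, 60, 80, 100, 120, 150],
    "Noise": [3, 1, 3, 1, 3, 1],
    "Quality": ["Lo", "Lo", "Lo", "Hi", "Hi", "Hi"],
    "Price": [100, 120, 150, 210, 240, 300],
})

# One-hot encode the categorical column so that every column is numeric.
toy_encoded = pd.get_dummies(toy, dtype=float)

# Absolute Pearson correlation of every encoded column with the target.
toy_corr = toy_encoded.corr()["Price"].abs()

# Keep columns above the threshold, excluding the target itself.
mask = (toy_corr > 0.55).to_numpy() & (toy_encoded.columns != "Price")
selected = list(toy_encoded.columns[mask])
print(selected)
```

In this toy case both dummy columns of `Quality` pass the threshold, because for a two-level category the two dummies are perfectly anti-correlated with each other. Both map back to the same original feature.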

In [4]:
# Convert attribs_encoded to the original attributes before one-hot encoding.
# Note that the following code assumes that the encoded attribute name is obtained from the original attribute name
# by appending "_" and that the original attribute names do not include "_"

attribs = []

for a in attribs_encoded:
    index = a.find('_')
    if index > 0:
        a = a[:index]
    if a not in attribs:
        attribs.append(a)
        
# Print selected attributes and their corresponding types
print("Selected {} attributes".format(len(attribs)))
print("  {:15} {:10} {:^10}".format("Column", "Dtype", "Null Count"))
print("  {:15} {:10} {:^10}".format("------", "-----", "----------"))
for attr in attribs:
    print("  {:15} {:10} {:^10}".format(attr, str(data_train[attr].dtype), data_train[attr].isnull().sum()))

Selected 8 attributes
  Column          Dtype      Null Count
  ------          -----      ----------
  OverallQual     int64          0     
  TotalBsmtSF     int64          0     
  1stFlrSF        int64          0     
  GrLivArea       int64          0     
  FullBath        int64          0     
  GarageCars      int64          0     
  GarageArea      int64          0     
  ExterQual       object         0     
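
The conversion above relies on the original attribute names not containing "_". A possible alternative, sketched below, matches each encoded name against the list of original column names instead of splitting on the first underscore. The helper name `original_columns` and the example column `Sale_Type` are hypothetical. The longest-prefix rule is a heuristic and can still be ambiguous when one original name is a prefix of another.

```python
def original_columns(encoded_names, original_names):
    """Map one-hot encoded column names back to original column names.

    Assumes encoded names have the form "<original>_<category>", which is
    the default naming used by pandas.get_dummies.
    """
    originals = []
    for name in encoded_names:
        if name in original_names:
            # Numeric columns keep their name unchanged.
            match = name
        else:
            # Candidate originals are those that prefix the encoded name.
            candidates = [c for c in original_names if name.startswith(c + "_")]
            if not candidates:
                raise ValueError("No original column found for {!r}".format(name))
            # Heuristic: prefer the longest matching original name.
            match = max(candidates, key=len)
        if match not in originals:
            originals.append(match)
    return originals


# "Sale_Type" is a made-up column name that contains an underscore.
recovered = original_columns(
    ["OverallQual", "ExterQual_TA", "Sale_Type_WD", "ExterQual_Gd"],
    ["OverallQual", "ExterQual", "Sale_Type"],
)
print(recovered)
```

With `data_train`, the call would look like `original_columns(attribs_encoded, list(data_train.columns))`.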


### 2. Prepare the data using Pipeline and ColumnTransformer

In [5]:
# Separate the selected features based on their types

num_attribs = [a for a in attribs if data_train[a].dtype == 'int64']
cat_attribs = [a for a in attribs if data_train[a].dtype == 'object']

# Ensure that we've covered all the selected features
assert len(num_attribs) + len(cat_attribs) == len(attribs)

print(num_attribs)
print(cat_attribs)

['OverallQual', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'FullBath', 'GarageCars', 'GarageArea']
['ExterQual']


In [6]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Scale the numerical features and convert the categorical features to numerical ones.
# (Tree-based models are largely insensitive to feature scaling, so StandardScaler matters
# mostly if the same preprocessor is reused with other model types.)
# There are no missing values in this case, so SimpleImputer is not strictly needed.
# It is included to illustrate how Pipeline and ColumnTransformer combine into a complete transformer.
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())
])

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_attribs), # Apply StandardScaler to numerical features
    ('cat', cat_transformer, cat_attribs),  # Apply cat_transformer to categorical features
])

X_train = preprocessor.fit_transform(data_train[attribs])
y_train = data_train['SalePrice']
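
To see what each transformation contributes, the sketch below applies a similar `ColumnTransformer` to a tiny made-up table (the columns `Area` and `Quality` and their values are hypothetical). `StandardScaler` rescales the numerical column to zero mean and unit variance, and `OneHotEncoder` turns each category into its own 0/1 column. Depending on the data, `ColumnTransformer` may return a sparse matrix, so the sketch converts the result to a dense array before inspecting it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; names and values are made up for illustration only.
toy = pd.DataFrame({
    "Area": [50.0, 100.0, 150.0, 200.0],
    "Quality": ["Lo", "Hi", "Lo", "Hi"],
})

toy_pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Area"]),    # zero mean, unit variance
    ("cat", OneHotEncoder(), ["Quality"]),  # one 0/1 column per category
])

X_toy = toy_pre.fit_transform(toy)
if hasattr(X_toy, "toarray"):
    # The output can be sparse; convert it so it is easy to inspect.
    X_toy = X_toy.toarray()
X_toy = np.asarray(X_toy, dtype=float)

print(X_toy.shape)                            # one scaled column plus one column per category
print(X_toy[:, 0].mean(), X_toy[:, 0].std())  # approximately 0 and 1
print(X_toy[:, 1:])                           # one-hot block: each row sums to 1
```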

### 3. Train the following models

- RandomForestRegressor
- AdaBoostRegressor
- XGBRegressor

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

reg_rnd = RandomForestRegressor(n_estimators=100, criterion='squared_error', n_jobs=-1)
reg_rnd.fit(X_train, y_train)

reg_adb = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=100, loss="square", random_state=4)
reg_adb.fit(X_train, y_train)

reg_xgb = XGBRegressor(n_estimators=100, n_jobs=-1)
reg_xgb.fit(X_train, y_train)

### 4. Evaluate each of the above models based on RMSE

In [8]:
# Transform the test data with the preprocessor fitted on the training data
# (transform, not fit_transform), so both sets go through identical transformations
X_test = preprocessor.transform(data_test[attribs])
y_test = data_test['SalePrice']

# Make predictions on the test set
y_pred_rnd = reg_rnd.predict(X_test)
y_pred_adb = reg_adb.predict(X_test)
y_pred_xgb = reg_xgb.predict(X_test)
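
One way to avoid transforming the test set by hand is to chain the preprocessor and the regressor in a single `Pipeline`. This is an alternative arrangement, not the one used in this lab. The sketch below uses a tiny made-up dataset (hypothetical columns `Area`, `Quality`, `Price`) and a small random forest; `fit` learns the preprocessing from the training data only, and `predict` applies the same fitted transformations before calling the model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; names and values are made up for illustration only.
toy_train = pd.DataFrame({
    "Area": [50, 60, 80, 100, 120, 150],
    "Quality": ["Lo", "Lo", "Lo", "Hi", "Hi", "Hi"],
    "Price": [100, 120, 150, 210, 240, 300],
})
toy_test = pd.DataFrame({"Area": [70, 130], "Quality": ["Lo", "Hi"]})
features = ["Area", "Quality"]

# Preprocessing and model in one estimator.
toy_model = Pipeline(steps=[
    ("prep", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Area"]),
        ("cat", OneHotEncoder(), ["Quality"]),
    ])),
    ("reg", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# fit() fits the transformers and the regressor on the training data only.
toy_model.fit(toy_train[features], toy_train["Price"])

# predict() reuses the fitted transformers on the raw test features.
toy_pred = toy_model.predict(toy_test[features])
print(toy_pred)
```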

In [9]:
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse_rnd = sqrt(mean_squared_error(y_test, y_pred_rnd))
print("RMSE RandomForestRegressor: {}".format(rmse_rnd))

rmse_adb = sqrt(mean_squared_error(y_test, y_pred_adb))
print("RMSE AdaBoostRegressor: {}".format(rmse_adb))

rmse_xgb = sqrt(mean_squared_error(y_test, y_pred_xgb))
print("RMSE XGBRegressor: {}".format(rmse_xgb))

RMSE RandomForestRegressor: 30889.08408687132
RMSE AdaBoostRegressor: 37735.3069269716
RMSE XGBRegressor: 30840.253489670064
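
For reference, RMSE is the square root of the mean squared difference between the true and predicted values, so it is expressed in the same units as `SalePrice`. The sketch below computes it directly with NumPy on made-up numbers; the helper name `rmse` and the toy values are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values: the errors are 10, -10, and 30.
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 330.0]

# Squared errors are 100, 100, 900; their mean is 1100 / 3.
toy_rmse = rmse(toy_true, toy_pred)
print(toy_rmse)
```

Because the errors are squared before averaging, the single error of 30 contributes more to the result than the two errors of 10 combined.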
