## Developing a better Machine Learning Model

I was very unsatisfied with the supplied model, and so decided to make my own. I know, this was outside the given task, but I finished early, and wanted to see if I could do better.

In [1]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression, ElasticNetCV
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_percentage_error,
    mean_absolute_error
)
import numpy as np
import pandas as pd
from model.data import load_data

def print_metrics(predictions, target):
    print("RMSE: ", np.sqrt(mean_squared_error(predictions, target)))
    print("MAPE: ", mean_absolute_percentage_error(predictions, target))
    print("MAE : ", mean_absolute_error(predictions, target))

Load and separate data into train, validation, and test. Test data will only be used at the end, once I've decided not to tweak the models any further.

In [2]:
train, test = load_data()
X_train_full, Y_train_full, X_test, Y_test = train.drop("price", axis=1), train["price"], test.drop("price", axis=1), test["price"]
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train_full, Y_train_full, test_size=0.2, random_state=42)
X_train.shape

(12969, 8)

Just a quick run-down of the available features and some examples.

In [3]:
X_train.head(5).transpose()

Unnamed: 0,3950,2286,10662,714,10474
type,departamento,casa,casa,departamento,departamento
sector,providencia,lo barnechea,las condes,nunoa,providencia
net_usable_area,91.0,577.0,365.0,40.0,60.0
net_area,97.0,1276.0,980.0,43.0,60.0
n_rooms,3.0,7.0,7.0,1.0,2.0
n_bathroom,2.0,6.0,4.0,1.0,1.0
latitude,-33.4376,-33.33118,-33.42117,-33.4531,-33.42707
longitude,-70.627,-70.52862,-70.50105,-70.602,-70.6114


Having seen the features, I have decided to apply some feature engineering, aside from the OneHotEncoding that was already used in the original model.

1. Tweak the latitude and longitude, since all properties are somewhat close-by. My suggestion is standard normalization using mean and standard deviation.

2. A couple of new features: 
    - pct_usable_area: Dividing the net_usable_area by the total net_area, gives a percentage of usable area.
    - avg_area_per_room: Dividing the net_usable_area by the n_rooms, to give a sense for how big the rooms are.

3. Drop the transformed columns so there is no data co-dependecy and repetition.

In [4]:
@FunctionTransformer
def feature_expansion(input_df: pd.DataFrame) -> pd.DataFrame:
    input_df["avg_area_per_room"] = input_df["net_usable_area"].div(input_df["n_rooms"], fill_value=0.0)
    input_df["pct_usable_area"] = input_df["net_usable_area"].div(input_df["net_area"], fill_value=0.0)
    return input_df.fillna(0.0).replace(np.inf, 0.0)

class columnDropperTransformer(TransformerMixin):
    def __init__(self,columns):
        self.columns=columns

    def transform(self, X, y=None):
        all_cols = np.ones(X.shape[-1]).astype(bool)
        all_cols[self.columns] = False
        return X[:, all_cols]

    def fit(self, X, y=None):
        return self 

column_name_index = {k: v for v,k in enumerate(train.columns.to_list())}

preprocessing_pipeline = Pipeline(
    steps=[
        ("Feature Transformations and Scaling", 
         ColumnTransformer(
            transformers=[
                ('Standard Scaler for Lat/Long', StandardScaler(), [column_name_index["latitude"], column_name_index["longitude"]]),
                ('One Hot Encoder for Type/Sector', OneHotEncoder(handle_unknown='ignore'), [column_name_index['type'], column_name_index['sector']]),
                ('Additional Feature Engineering', feature_expansion, make_column_selector('.*')),
            ],
            remainder='passthrough'
        )),
        ("Drop transformed features", 
         columnDropperTransformer([11, 10, 16, 17])) # Drop type, sector, lat, long
    ]
)
preprocessing_pipeline.fit_transform(X_train)

array([[-0.9435156932680938, -1.8282648093401392, 0.0, ..., 2.0,
        30.333333333333332, 0.9381443298969072],
       [2.007267618074679, 0.8919459513170356, 1.0, ..., 6.0,
        82.42857142857143, 0.45219435736677116],
       [-0.48794935418370217, 1.654257505202155, 1.0, ..., 4.0,
        52.142857142857146, 0.37244897959183676],
       ...,
       [-0.5966419926019136, 0.3441986940620855, 1.0, ..., 1.0,
        36.666666666666664, 0.43824701195219123],
       [-0.8755827942566378, 1.0290901409170414, 1.0, ..., 1.0, 20.0,
        0.18461538461538463],
       [-0.33073321647162157, -0.031012324661606826, 0.0, ..., 2.0, 33.0,
        0.9428571428571428]], dtype=object)

Once the preprocessing is done, I have to decide which model to use. For this, I picked some examples from the sklearn documentation, and my personal favorite, RandomForestRegressor. I skipped XGBoost, or other gradient-boosted trees, because they often perform very similarly to RFR, and it seemed like it would be a waste of effort.

In [6]:
lr_regressor = LinearRegression()
lr_pipeline = Pipeline([
    ('Preprocessing', preprocessing_pipeline),
    ('Regression', lr_regressor)
])

encv_regressor = ElasticNetCV()
encv_params = {
    "max_iter": [3000, 5000, 10000],
    "l1_ratio": list(np.linspace(0.05, 0.95, 10)),
    "selection": ['random', 'cyclic']
}

rfr_regressor = RandomForestRegressor()
rfr_pipeline = Pipeline([
    ('Preprocessing', preprocessing_pipeline),
    ('Regression', rfr_regressor)
])
rfr_params = {
    "n_estimators": [100, 300, 500, 700],
    "criterion": ["squared_error", "friedman_mse"],
    "max_depth": [50, 100]
}

I then trained each model using GridSearchCV to check for the best parameters. Linear Regression did not need GridSearch or CV, since it has no parameters and is completely deterministic.

In [7]:
X_train = preprocessing_pipeline.fit_transform(X_train)
X_valid = preprocessing_pipeline.fit_transform(X_valid)

In [8]:
best_lr = lr_regressor.fit(X_train, Y_train)
preds = best_lr.predict(X_valid)
print_metrics(preds, Y_valid)

RMSE:  9637.299292253598
MAPE:  0.4827414601975945
MAE :  5524.312616903214


In [9]:
encv_grid = GridSearchCV(encv_regressor, encv_params)
encv_grid.fit(X_train, Y_train)
best_encv = encv_grid.best_estimator_
preds = best_encv.predict(X_valid)
print_metrics(preds, Y_valid)

# encv_pipeline = Pipeline([
#     ('Preprocessing', preprocessing_pipeline),
#     ('Regression', encv_regressor)
# ])

RMSE:  13301.431569229171
MAPE:  0.5509217631167143
MAE :  8895.797455962736


In [10]:
rfr_grid = GridSearchCV(rfr_regressor, rfr_params, n_jobs=-1, cv=3)
rfr_grid.fit(X_train, Y_train)
best_rfr = rfr_grid.best_estimator_
preds = best_rfr.predict(X_valid)
print_metrics(preds, Y_valid)

RMSE:  4635.834611122243
MAPE:  0.12092692450043732
MAE :  2171.1305229281716


Having trained all three, RFR did the best (including better than the original model) by a very large margin. Having settled on this model, I then assembled the final pipeline and applied it to the as-of-yet untouched test data to get a sense for real-world performance.

In [13]:
print("Best parameters for Random Forest Regressor", best_rfr.get_params())
best_model = Pipeline(
    [   
        ("Preprocessing", preprocessing_pipeline),
        ("Regressor", best_rfr),
    ]
)
best_model

Best parameters for Random Forest Regressor {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'max_depth': 100, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [14]:
final_prediction = best_model.predict(X_test)
print_metrics(final_prediction, Y_test)

RMSE:  4419.856688043476
MAPE:  0.12061920163764589
MAE :  2136.564544270231


And once again, very good performance. I decided not to replace this model with the original in the API, since it is outside the scope of the project, but it would be very easy. Just a few tweaks to the model.py class, and maybe change the BaseModel for the API in case the inputs don't match.