# Regression Models for each Property Type

In the last notebook, our analysis concluded that the regression errors behave differently for each property type. This led us to handle the regression problem separately for each property type.

In this notebook we'll address this problem by testing different regression models and selecting the best ones for each property type.

In [1]:
import os
import os.path as P
import pickle
import typing as T

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet, Lasso, Ridge
from sklearn.metrics import PredictionErrorDisplay, make_scorer, mean_squared_log_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from xgboost import XGBRegressor



In [2]:
sklearn.set_config(transform_output="pandas")

In [3]:
random_state = 42

# Model Evaluation

## Evaluation Function

We'll reuse the evaluation function from the last notebook.

In [4]:
class Predictor(T.Protocol):
    def fit(self, X: T.Any, y: T.Any, *args, **kwargs) -> None:
        raise NotImplementedError

    def predict(self, X: T.Any) -> T.Any:
        raise NotImplementedError


def evaluate_models(
    X: pd.DataFrame, y: pd.Series, test_models: T.Iterable[T.Tuple[str, Predictor]]
) -> pd.DataFrame:
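    """Cross-validate each model and return its average MSLE, best model first."""
    # Raw MSLE, where lower is better. It is used here only for reporting.
    scorer = make_scorer(mean_squared_log_error)

    results = []
    for model_name, model in test_models:
        # One MSLE value per fold of the 5-fold cross-validation.
        fold_scores = cross_val_score(model, X, y, cv=5, scoring=scorer)
        results.append({"Model": model_name, "Avg MSLE": fold_scores.mean()})

    # Ascending sort puts the lowest-error model in the first row.
    results_df = pd.DataFrame(results).sort_values(by="Avg MSLE").reset_index(drop=True)

    return results_df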
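
To make the metric behind `Avg MSLE` more concrete, here is a minimal sketch of the mean squared logarithmic error in plain Python. It mirrors the definition used by `mean_squared_log_error` (the squared difference of `log(1 + y)` values, averaged over samples). The helper name `msle` and the prices are hypothetical and chosen only for illustration.

```python
import math


def msle(y_true, y_pred):
    """Mean of (log(1 + y_true) - log(1 + y_pred)) ** 2 over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    squared_log_diffs = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return sum(squared_log_diffs) / len(squared_log_diffs)


# Hypothetical prices: both predictions overshoot the true price by 10%.
cheap_error = msle([100_000], [110_000])
expensive_error = msle([1_000_000], [1_100_000])

# The two errors are nearly identical even though the absolute
# differences (10,000 vs 100,000) are an order of magnitude apart.
print(cheap_error, expensive_error)
```

Because the error is computed on the log scale, it reflects relative rather than absolute deviations. That is convenient when the property types being compared sit in very different price ranges.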

As a reminder, we'll test the following regression models:

- [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
- [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- [Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)
- [Bayesian Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html)
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [Extra Trees Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
- [XGBoost](https://xgboost.readthedocs.io/en/stable/)
- [LightGBM](https://lightgbm.readthedocs.io/en/stable/)
- [CatBoost Regressor](https://catboost.ai/)
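
As a rough sketch of how such a list could be handed to `evaluate_models`, the candidates can be written as `(name, estimator)` tuples. The variable name `candidate_models` and the default hyperparameters are assumptions, not necessarily the configuration used in this notebook. Only the scikit-learn estimators are shown; the XGBoost, LightGBM and CatBoost regressors imported above would be appended in the same way.

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet, Lasso, Ridge

random_state = 42  # same seed as defined earlier in the notebook

# Hypothetical candidate list with default hyperparameters (an assumption).
# BayesianRidge takes no random_state argument, so none is passed.
candidate_models = [
    ("Ridge", Ridge(random_state=random_state)),
    ("Lasso", Lasso(random_state=random_state)),
    ("Elastic Net", ElasticNet(random_state=random_state)),
    ("Bayesian Ridge", BayesianRidge()),
    ("Random Forest", RandomForestRegressor(random_state=random_state)),
    ("Extra Trees", ExtraTreesRegressor(random_state=random_state)),
]
```

Fixing `random_state` wherever an estimator accepts it keeps the cross-validated scores reproducible between runs.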

The evaluations will be run for the property types in the following order:
- **Residential Building**
- **Penthouse**
- **House**
- **Two-story House**
- **Apartment**
- **Studio Apartment**
- **Condominium**
- **Flat**

## Data Gathering

Before evaluating the models, we first need to collect the features and house prices for every property type.