# Modeling

## Description

This notebook prepares the final dataset from the processed files and trains a regression model.

## Installation

Make sure that you have executed the command below in the project's root path.

```
pip install -r requirements.txt
```

## Index

- [Imports](#Imports)
- [Parameters](#Parameters)
- [Load train dataset](#Load-train-dataset)
- [Prepare dataset](#Prepare-dataset)
- [Training XGBRegressor](#Training-XGBRegressor)
  - [Baseline](#Baseline)
  - [Feature importances](#Feature-importances)
  - [Hyperparameter optimization](#Hyperparameter-optimization)
  - [Cross validation](#Cross-validation)
- [Evaluate trained model](#Evaluate-trained-model)
  - [Check some predictions](#Check-some-predictions)
  - [Price range score](#Price-range-score)
  - [Score by price_range](#Score-by-price_ranges)
- [Save trained model](#Save-trained-model)
- [Predict test dataset](#Predict-test-dataset)
  - [Save predictions.csv](#Save-predictions.csv)

# Imports

All custom functions are stored in [modeling.py](modeling.py).

In [3]:
from modeling import *
import pandas as pd
import numpy as np
import plotly.express as px

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBRegressor
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# Parameters

In [25]:
target_col = "price"

invalid_cols = ["id", "has_elevator", "rural_urbano"]
invalid_cols.append(target_col)

seed = 1993

np.random.seed(seed)

knn_n = 500

geohash_delimiters = [5]

# Load train dataset

In [26]:
df_train = pd.read_feather("../data/processed/train.feather")
print(df_train.shape)

(72241, 49)


In [27]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72241 entries, 0 to 72240
Data columns (total 49 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   id                             72241 non-null  object 
 1   usableAreas                    72231 non-null  float64
 2   parkingSpaces                  71019 non-null  float64
 3   suites                         66274 non-null  float64
 4   bathrooms                      72240 non-null  float64
 5   totalAreas                     43864 non-null  float64
 6   bedrooms                       72241 non-null  int64  
 7   publicationType                72241 non-null  object 
 8   geohash                        72231 non-null  object 
 9   price                          72241 non-null  int64  
 10  businessType                   72241 non-null  object 
 11  yearlyIptu                     62236 non-null  float64
 12  monthlyCondoFee                68461 non-null 

# Prepare dataset

To prepare the dataset, we first split it into `X` and `y`. After the train/test split, we apply `prep_modeling()` to each split separately to avoid leaking data from the validation set into the training set.

In [28]:
y = df_train[target_col].values
X = df_train.drop(columns=[target_col])

To stratify the train/test split on a continuous target, I decided to bin `y` into 20 quantiles with `pd.qcut`, a quantile-based discretization function.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed,
                                                    stratify=pd.qcut(y, q=20))

print(X_train.shape)
print(X_test.shape)

(48401, 48)
(23840, 48)
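The effect of stratifying on quantile bins can be checked on a small example. The sketch below uses synthetic, right-skewed prices (an assumption for illustration, not the notebook's data) and the same `test_size` and number of bins as the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed "prices" (illustrative only, not the real dataset).
rng = np.random.default_rng(1993)
prices = rng.lognormal(mean=13.0, sigma=0.8, size=2000)

# Cut the continuous target into 20 equal-frequency bins to use as strata.
price_bins = pd.qcut(prices, q=20, labels=False)

idx_train, idx_test = train_test_split(
    np.arange(len(prices)),
    test_size=0.33,
    random_state=1993,
    stratify=price_bins,
)

# Count how many rows of each price bin ended up in the test split.
test_counts = np.bincount(price_bins[idx_test], minlength=20)
print(test_counts)
```

Each bin holds 100 rows here, so every bin should contribute about a third of its rows to the test split, which keeps the cheap and the expensive ends of the price distribution represented in both splits.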


In [30]:
X_train = prep_modeling(X_train, invalid_cols, geohash=geohash_delimiters,
                        generate_encoder=True, knn_neighbors=knn_n)

X_test = prep_modeling(X_test, invalid_cols, geohash=geohash_delimiters,
                       generate_encoder=False, knn_neighbors=knn_n)
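The internals of `prep_modeling()` live in `modeling.py` and are not shown here, but the `generate_encoder=True` / `generate_encoder=False` pair suggests the usual pattern: fit the transformation on the training split only, then reuse it on the other split. A minimal sketch of that pattern, assuming a median imputer from scikit-learn as a stand-in for whatever the function actually does:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative frames; the column name is borrowed from the dataset,
# the values are made up.
train = pd.DataFrame({"usableAreas": [50.0, 70.0, None, 90.0]})
valid = pd.DataFrame({"usableAreas": [None, 400.0]})

imputer = SimpleImputer(strategy="median")

# Statistics are learned from the training split only...
train_filled = imputer.fit_transform(train)

# ...and reused as-is on the validation split (no re-fitting).
valid_filled = imputer.transform(valid)

print(imputer.statistics_)  # median learned from train
print(valid_filled)
```

The missing validation value is filled with the training median, so the 400.0 outlier in the validation split never influences what the model sees during training.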

In [31]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48401 entries, 0 to 48400
Data columns (total 45 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   usableAreas                    48401 non-null  float64
 1   parkingSpaces                  48401 non-null  float64
 2   suites                         48401 non-null  float64
 3   bathrooms                      48401 non-null  float64
 4   totalAreas                     48401 non-null  float64
 5   bedrooms                       48401 non-null  float64
 6   publicationType                48401 non-null  float64
 7   businessType                   48401 non-null  float64
 8   yearlyIptu                     48401 non-null  float64
 9   monthlyCondoFee                48401 non-null  float64
 10  has_gym                        48401 non-null  float64
 11  has_garden                     48401 non-null  float64
 12  has_pool                       48401 non-null 

# Training XGBRegressor

The algorithm used here is [XGBRegressor](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor).

## Baseline

The baseline model helps us understand how much hyperparameter optimization improves on the default settings.

In [32]:
xgb_model = XGBRegressor(random_state=seed)
xgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)],
              eval_metric=mse_score)

y_pred = xgb_model.predict(X_test)

xgb_score = {}
xgb_score["mse"] = mean_squared_error(y_test, y_pred)
xgb_score["r2_score"] = xgb_model.score(X_test, y_test)

xgb_score

[0]	validation_0-rmse:897328.62500	validation_0-mse:805198823424.00000
[1]	validation_0-rmse:707191.06250	validation_0-mse:500119830528.00000
[2]	validation_0-rmse:586893.62500	validation_0-mse:344444993536.00000
[3]	validation_0-rmse:512727.62500	validation_0-mse:262890373120.00000
[4]	validation_0-rmse:465803.28125	validation_0-mse:216973131776.00000
[5]	validation_0-rmse:433877.53125	validation_0-mse:188249505792.00000
[6]	validation_0-rmse:414571.96875	validation_0-mse:171868405760.00000
[7]	validation_0-rmse:399747.12500	validation_0-mse:159787581440.00000
[8]	validation_0-rmse:393797.59375	validation_0-mse:155050819584.00000
[9]	validation_0-rmse:386819.71875	validation_0-mse:149602172928.00000
[10]	validation_0-rmse:381191.87500	validation_0-mse:145259855872.00000
[11]	validation_0-rmse:379115.50000	validation_0-mse:143654240256.00000
[12]	validation_0-rmse:376775.75000	validation_0-mse:141891993600.00000
[13]	validation_0-rmse:377151.90625	validation_0-mse:142171619328.00000
[1

{'mse': 133454961847.80215, 'r2_score': 0.8595974870762524}
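An MSE in the order of 10^11 is hard to read, so it helps to convert it back to the unit of `price`. The short calculation below takes the two numbers printed above and uses the identity R² = 1 − MSE / Var(y) to recover the spread of `y_test`; it is only arithmetic on the reported values, not a new evaluation.

```python
import math

mse = 133454961847.80215   # 'mse' from the baseline output above
r2 = 0.8595974870762524    # 'r2_score' from the baseline output above

# RMSE is expressed in the same unit as the target (price).
rmse = math.sqrt(mse)

# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2)
target_variance = mse / (1.0 - r2)
target_std = math.sqrt(target_variance)

print(f"RMSE: {rmse:,.0f}")
print(f"Implied standard deviation of y_test: {target_std:,.0f}")
```

Comparing the RMSE with the standard deviation of the target gives a quick sense of how much of the price variation the baseline already explains.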

In [33]:
evals_result = xgb_model.evals_result()
px.line(y=evals_result["validation_0"]["rmse"], title="Validation loss (RMSE)")

In [36]:
xgb_model.get_params()

{'objective': 'reg:squarederror',
 'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'gpu_id': -1,
 'importance_type': 'gain',
 'interaction_constraints': '',
 'learning_rate': 0.300000012,
 'max_delta_step': 0,
 'max_depth': 6,
 'min_child_weight': 1,
 'missing': nan,
 'monotone_constraints': '()',
 'n_estimators': 100,
 'n_jobs': 0,
 'num_parallel_tree': 1,
 'random_state': 1993,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': None}

## Feature importances

In [35]:
df_feats_importances = pd.DataFrame({"features": X_train.columns,
                                     "importances": xgb_model.feature_importances_})

df_feats_importances.sort_values(by="importances", ascending=True, inplace=True)

px.bar(df_feats_importances, x="importances", y="features", height=1200)

## Hyperparameter optimization

In the next cell we'll run a Bayesian search (`BayesSearchCV` from scikit-optimize) to find the best parameters. Each candidate is scored with 5-fold cross-validation.


- [XGBRegressor Class - Docs](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)
- [XGBoost Parameters - Docs](https://xgboost.readthedocs.io/en/latest/parameter.html)
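Several dimensions of the search space below use `prior="log-uniform"`. To see what that prior implies, the sketch below draws values for the `learning_rate` range with plain NumPy; it illustrates the distribution only and is not how scikit-optimize samples internally (after its initial random points, the optimizer proposes candidates from a surrogate model).

```python
import numpy as np

rng = np.random.default_rng(1993)
low, high = 3e-4, 3e-1  # bounds used for learning_rate in the search space

# Uniform prior: values are spread evenly on the linear scale.
uniform_draws = rng.uniform(low, high, size=10_000)

# Log-uniform prior: values are spread evenly on the log scale.
log_uniform_draws = np.exp(rng.uniform(np.log(low), np.log(high), size=10_000))

share_uniform = (uniform_draws < 0.01).mean()
share_log_uniform = (log_uniform_draws < 0.01).mean()

print(f"share of draws below 0.01 (uniform):     {share_uniform:.2%}")
print(f"share of draws below 0.01 (log-uniform): {share_log_uniform:.2%}")
```

With a uniform prior only a few percent of the draws fall below 0.01, while the log-uniform prior puts roughly half of them there, so every order of magnitude of the learning rate gets a similar amount of attention.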

In [13]:
opt_params = {
    "max_depth": Integer(5, 15),
    "learning_rate": Real(3e-4, 3e-1, prior="log-uniform"),
    "objective": Categorical(["reg:squarederror", "reg:gamma", "reg:tweedie"]),
    "booster": Categorical(["gbtree", "dart"]),
    "reg_alpha": Real(0.01, 1.0, prior="log-uniform"),
    "reg_lambda": Real(0.5, 2.0, prior="log-uniform"),
    "colsample_bytree": Real(0.01, 1.0, prior="log-uniform"),
    "colsample_bylevel": Real(0.01, 1.0, prior="log-uniform"),
    "colsample_bynode": Real(0.01, 1.0, prior="log-uniform"),
    "subsample": Real(0.01, 1.0, prior="log-uniform")
}

opt = BayesSearchCV(estimator=XGBRegressor(),
                    search_spaces=opt_params, n_iter=200,
                    n_jobs=4, n_points=4, verbose=3, cv=5,
                    random_state=seed)

opt.fit(X_train, y_train)

print("val. score: %s" % opt.best_score_)
print("test score: %s" % opt.score(X_test, y_test))

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:   37.5s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:   37.5s finished

[... the same progress output repeats for the remaining search rounds ...]

val. score: 0.8380687938392506
test score: 0.8715813046359541
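The repeated "Fitting 5 folds for each of 4 candidates, totalling 20 fits" message follows directly from the arguments passed to `BayesSearchCV`. A quick check of the arithmetic:

```python
n_iter, n_points, cv = 200, 4, 5  # values passed to BayesSearchCV above

rounds = n_iter // n_points        # candidates are evaluated n_points at a time
fits_per_round = n_points * cv     # each candidate is fitted once per fold
total_fits = n_iter * cv           # model fits over the whole search

print(rounds, fits_per_round, total_fits)  # 50 20 1000
```

So the search runs 50 rounds of 20 fits each, 1000 fits in total, plus one final refit of the best configuration on the whole training set if the default `refit=True` is in effect.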


In [18]:
best_params = dict(opt.best_estimator_.get_params())
best_params

{'objective': 'reg:tweedie',
 'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1.0,
 'colsample_bynode': 1.0,
 'colsample_bytree': 1.0,
 'gamma': 0,
 'gpu_id': -1,
 'importance_type': 'gain',
 'interaction_constraints': '',
 'learning_rate': 0.3,
 'max_delta_step': 0,
 'max_depth': 5,
 'min_child_weight': 1,
 'missing': nan,
 'monotone_constraints': '()',
 'n_estimators': 100,
 'n_jobs': 0,
 'num_parallel_tree': 1,
 'random_state': 0,
 'reg_alpha': 1.0,
 'reg_lambda': 0.5,
 'scale_pos_weight': None,
 'subsample': 1.0,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': None}

## Cross validation

In [59]:
n_folds = 5

best_params["n_estimators"] = 1000

xgb_model = XGBRegressor(**best_params)

xgb_scores = cross_val_score(xgb_model, X_train, y_train, cv=n_folds)

for n in range(n_folds):
    print("Fold {} - r2_score: {:.4f}".format(n+1, xgb_scores[n]))

Fold 1 - r2_score: 0.9119
Fold 2 - r2_score: 0.9116
Fold 3 - r2_score: 0.9190
Fold 4 - r2_score: 0.8995
Fold 5 - r2_score: 0.5486
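Four folds agree closely and one is far lower, so a single average hides most of the story. The snippet below summarizes the fold scores printed above (values copied by hand from that output) using only the standard library.

```python
import statistics

# R^2 per fold, copied from the cross-validation output above.
fold_scores = [0.9119, 0.9116, 0.9190, 0.8995, 0.5486]

mean_r2 = statistics.mean(fold_scores)
median_r2 = statistics.median(fold_scores)
std_r2 = statistics.stdev(fold_scores)  # sample standard deviation
worst_fold = fold_scores.index(min(fold_scores)) + 1

print(f"mean R2:   {mean_r2:.4f}")
print(f"median R2: {median_r2:.4f}")
print(f"std R2:    {std_r2:.4f}")
print(f"worst fold: {worst_fold}")
```

The mean is pulled well below the median by the fifth fold, which is a hint that this fold contains rows the model handles poorly (for example a few extreme prices) and is worth inspecting before trusting the averaged score.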


## Evaluate trained model

In [79]:
xgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)],
              eval_metric=mse_score)

y_pred = xgb_model.predict(X_test)

xgb_score = {}

xgb_score["mse"] = mean_squared_error(y_test, y_pred)
xgb_score["r2_score"] = xgb_model.score(X_test, y_test)

xgb_score

[0]	validation_0-tweedie-nloglik@1.5:1418811.75000	validation_0-mse:1409011417088.00000
[1]	validation_0-tweedie-nloglik@1.5:1051079.62500	validation_0-mse:1409010630656.00000
[2]	validation_0-tweedie-nloglik@1.5:778664.75000	validation_0-mse:1409009975296.00000
[3]	validation_0-tweedie-nloglik@1.5:576851.18750	validation_0-mse:1409009057792.00000
[4]	validation_0-tweedie-nloglik@1.5:427347.18750	validation_0-mse:1409008140288.00000
[5]	validation_0-tweedie-nloglik@1.5:316593.18750	validation_0-mse:1409007353856.00000
[6]	validation_0-tweedie-nloglik@1.5:234547.25000	validation_0-mse:1409006567424.00000
[7]	validation_0-tweedie-nloglik@1.5:173769.32812	validation_0-mse:1409005780992.00000
[8]	validation_0-tweedie-nloglik@1.5:128747.67188	validation_0-mse:1409004994560.00000
[9]	validation_0-tweedie-nloglik@1.5:95400.85938	validation_0-mse:1409004208128.00000
[10]	validation_0-tweedie-nloglik@1.5:70704.38281	validation_0-mse:1409003421696.00000
[11]	validation_0-tweedie-nloglik@1.5:5241

[96]	validation_0-tweedie-nloglik@1.5:2839.10327	validation_0-mse:1408992542720.00000
[97]	validation_0-tweedie-nloglik@1.5:2839.02490	validation_0-mse:1408992542720.00000
[98]	validation_0-tweedie-nloglik@1.5:2838.99194	validation_0-mse:1408992542720.00000
[99]	validation_0-tweedie-nloglik@1.5:2838.95239	validation_0-mse:1408992542720.00000
[100]	validation_0-tweedie-nloglik@1.5:2838.91138	validation_0-mse:1408992542720.00000
[101]	validation_0-tweedie-nloglik@1.5:2838.88623	validation_0-mse:1408992542720.00000
[102]	validation_0-tweedie-nloglik@1.5:2838.84326	validation_0-mse:1408992542720.00000
[103]	validation_0-tweedie-nloglik@1.5:2838.81543	validation_0-mse:1408992542720.00000
[104]	validation_0-tweedie-nloglik@1.5:2838.80884	validation_0-mse:1408992542720.00000
[105]	validation_0-tweedie-nloglik@1.5:2838.79126	validation_0-mse:1408992542720.00000
[106]	validation_0-tweedie-nloglik@1.5:2838.74878	validation_0-mse:1408992542720.00000
[107]	validation_0-tweedie-nloglik@1.5:2838.712

[191]	validation_0-tweedie-nloglik@1.5:2837.68042	validation_0-mse:1408992542720.00000
[192]	validation_0-tweedie-nloglik@1.5:2837.65503	validation_0-mse:1408992542720.00000
[193]	validation_0-tweedie-nloglik@1.5:2837.62280	validation_0-mse:1408992542720.00000
[194]	validation_0-tweedie-nloglik@1.5:2837.62427	validation_0-mse:1408992542720.00000
[195]	validation_0-tweedie-nloglik@1.5:2837.61841	validation_0-mse:1408992542720.00000
[196]	validation_0-tweedie-nloglik@1.5:2837.60230	validation_0-mse:1408992542720.00000
[197]	validation_0-tweedie-nloglik@1.5:2837.58594	validation_0-mse:1408992542720.00000
[198]	validation_0-tweedie-nloglik@1.5:2837.54468	validation_0-mse:1408992542720.00000
[199]	validation_0-tweedie-nloglik@1.5:2837.53076	validation_0-mse:1408992542720.00000
[200]	validation_0-tweedie-nloglik@1.5:2837.50903	validation_0-mse:1408992542720.00000
[201]	validation_0-tweedie-nloglik@1.5:2837.49194	validation_0-mse:1408992542720.00000
[202]	validation_0-tweedie-nloglik@1.5:2837

[286]	validation_0-tweedie-nloglik@1.5:2836.92188	validation_0-mse:1408992542720.00000
[287]	validation_0-tweedie-nloglik@1.5:2836.92578	validation_0-mse:1408992542720.00000
[288]	validation_0-tweedie-nloglik@1.5:2836.92505	validation_0-mse:1408992542720.00000
[289]	validation_0-tweedie-nloglik@1.5:2836.91284	validation_0-mse:1408992542720.00000
[290]	validation_0-tweedie-nloglik@1.5:2836.90747	validation_0-mse:1408992542720.00000
[291]	validation_0-tweedie-nloglik@1.5:2836.90796	validation_0-mse:1408992542720.00000
[292]	validation_0-tweedie-nloglik@1.5:2836.90479	validation_0-mse:1408992542720.00000
[293]	validation_0-tweedie-nloglik@1.5:2836.91138	validation_0-mse:1408992542720.00000
[294]	validation_0-tweedie-nloglik@1.5:2836.92017	validation_0-mse:1408992542720.00000
[295]	validation_0-tweedie-nloglik@1.5:2836.91138	validation_0-mse:1408992542720.00000
[296]	validation_0-tweedie-nloglik@1.5:2836.91455	validation_0-mse:1408992542720.00000
[297]	validation_0-tweedie-nloglik@1.5:2836

[381]	validation_0-tweedie-nloglik@1.5:2836.61890	validation_0-mse:1408992542720.00000
[382]	validation_0-tweedie-nloglik@1.5:2836.61670	validation_0-mse:1408992542720.00000
[383]	validation_0-tweedie-nloglik@1.5:2836.62085	validation_0-mse:1408992542720.00000
[384]	validation_0-tweedie-nloglik@1.5:2836.62109	validation_0-mse:1408992542720.00000
[385]	validation_0-tweedie-nloglik@1.5:2836.62085	validation_0-mse:1408992542720.00000
[386]	validation_0-tweedie-nloglik@1.5:2836.62183	validation_0-mse:1408992542720.00000
[387]	validation_0-tweedie-nloglik@1.5:2836.62158	validation_0-mse:1408992542720.00000
[388]	validation_0-tweedie-nloglik@1.5:2836.61353	validation_0-mse:1408992542720.00000
[389]	validation_0-tweedie-nloglik@1.5:2836.61499	validation_0-mse:1408992542720.00000
[390]	validation_0-tweedie-nloglik@1.5:2836.60742	validation_0-mse:1408992542720.00000
[391]	validation_0-tweedie-nloglik@1.5:2836.59082	validation_0-mse:1408992542720.00000
[392]	validation_0-tweedie-nloglik@1.5:2836

[476]	validation_0-tweedie-nloglik@1.5:2836.43896	validation_0-mse:1408992542720.00000
[477]	validation_0-tweedie-nloglik@1.5:2836.43970	validation_0-mse:1408992542720.00000
[478]	validation_0-tweedie-nloglik@1.5:2836.43823	validation_0-mse:1408992542720.00000
[479]	validation_0-tweedie-nloglik@1.5:2836.43262	validation_0-mse:1408992542720.00000
[480]	validation_0-tweedie-nloglik@1.5:2836.41870	validation_0-mse:1408992542720.00000
[481]	validation_0-tweedie-nloglik@1.5:2836.41919	validation_0-mse:1408992542720.00000
[482]	validation_0-tweedie-nloglik@1.5:2836.41943	validation_0-mse:1408992542720.00000
[483]	validation_0-tweedie-nloglik@1.5:2836.42139	validation_0-mse:1408992542720.00000
[484]	validation_0-tweedie-nloglik@1.5:2836.42188	validation_0-mse:1408992542720.00000
[485]	validation_0-tweedie-nloglik@1.5:2836.42749	validation_0-mse:1408992542720.00000
[486]	validation_0-tweedie-nloglik@1.5:2836.42627	validation_0-mse:1408992542720.00000
[487]	validation_0-tweedie-nloglik@1.5:2836

[571]	validation_0-tweedie-nloglik@1.5:2836.28955	validation_0-mse:1408992542720.00000
[572]	validation_0-tweedie-nloglik@1.5:2836.28125	validation_0-mse:1408992542720.00000
[573]	validation_0-tweedie-nloglik@1.5:2836.28223	validation_0-mse:1408992542720.00000
[574]	validation_0-tweedie-nloglik@1.5:2836.28516	validation_0-mse:1408992542720.00000
[575]	validation_0-tweedie-nloglik@1.5:2836.28857	validation_0-mse:1408992542720.00000
[576]	validation_0-tweedie-nloglik@1.5:2836.29004	validation_0-mse:1408992542720.00000
[577]	validation_0-tweedie-nloglik@1.5:2836.29028	validation_0-mse:1408992542720.00000
[578]	validation_0-tweedie-nloglik@1.5:2836.29053	validation_0-mse:1408992542720.00000
[579]	validation_0-tweedie-nloglik@1.5:2836.29272	validation_0-mse:1408992542720.00000
[580]	validation_0-tweedie-nloglik@1.5:2836.29199	validation_0-mse:1408992542720.00000
[581]	validation_0-tweedie-nloglik@1.5:2836.29126	validation_0-mse:1408992542720.00000
[582]	validation_0-tweedie-nloglik@1.5:2836

[666]	validation_0-tweedie-nloglik@1.5:2836.22754	validation_0-mse:1408992542720.00000
[667]	validation_0-tweedie-nloglik@1.5:2836.23413	validation_0-mse:1408992542720.00000
[668]	validation_0-tweedie-nloglik@1.5:2836.23389	validation_0-mse:1408992542720.00000
[669]	validation_0-tweedie-nloglik@1.5:2836.23267	validation_0-mse:1408992542720.00000
[670]	validation_0-tweedie-nloglik@1.5:2836.23022	validation_0-mse:1408992542720.00000
[671]	validation_0-tweedie-nloglik@1.5:2836.23022	validation_0-mse:1408992542720.00000
[672]	validation_0-tweedie-nloglik@1.5:2836.23364	validation_0-mse:1408992542720.00000
[673]	validation_0-tweedie-nloglik@1.5:2836.23828	validation_0-mse:1408992542720.00000
[674]	validation_0-tweedie-nloglik@1.5:2836.23682	validation_0-mse:1408992542720.00000
[675]	validation_0-tweedie-nloglik@1.5:2836.23804	validation_0-mse:1408992542720.00000
[676]	validation_0-tweedie-nloglik@1.5:2836.23804	validation_0-mse:1408992542720.00000
[677]	validation_0-tweedie-nloglik@1.5:2836

[761]	validation_0-tweedie-nloglik@1.5:2836.34082	validation_0-mse:1408992542720.00000
[762]	validation_0-tweedie-nloglik@1.5:2836.33691	validation_0-mse:1408992542720.00000
[763]	validation_0-tweedie-nloglik@1.5:2836.33594	validation_0-mse:1408992542720.00000
[764]	validation_0-tweedie-nloglik@1.5:2836.34741	validation_0-mse:1408992542720.00000
[765]	validation_0-tweedie-nloglik@1.5:2836.34570	validation_0-mse:1408992542720.00000
[766]	validation_0-tweedie-nloglik@1.5:2836.34424	validation_0-mse:1408992542720.00000
[767]	validation_0-tweedie-nloglik@1.5:2836.33862	validation_0-mse:1408992542720.00000
[768]	validation_0-tweedie-nloglik@1.5:2836.33813	validation_0-mse:1408992542720.00000
[769]	validation_0-tweedie-nloglik@1.5:2836.34229	validation_0-mse:1408992542720.00000
[770]	validation_0-tweedie-nloglik@1.5:2836.34399	validation_0-mse:1408992542720.00000
[771]	validation_0-tweedie-nloglik@1.5:2836.34229	validation_0-mse:1408992542720.00000
[772]	validation_0-tweedie-nloglik@1.5:2836

[856]	validation_0-tweedie-nloglik@1.5:2836.33081	validation_0-mse:1408992542720.00000
[857]	validation_0-tweedie-nloglik@1.5:2836.32959	validation_0-mse:1408992542720.00000
[858]	validation_0-tweedie-nloglik@1.5:2836.32764	validation_0-mse:1408992542720.00000
[859]	validation_0-tweedie-nloglik@1.5:2836.32544	validation_0-mse:1408992542720.00000
[860]	validation_0-tweedie-nloglik@1.5:2836.32227	validation_0-mse:1408992542720.00000
[861]	validation_0-tweedie-nloglik@1.5:2836.32153	validation_0-mse:1408992542720.00000
[862]	validation_0-tweedie-nloglik@1.5:2836.31958	validation_0-mse:1408992542720.00000
[863]	validation_0-tweedie-nloglik@1.5:2836.31470	validation_0-mse:1408992542720.00000
[864]	validation_0-tweedie-nloglik@1.5:2836.31152	validation_0-mse:1408992542720.00000
[865]	validation_0-tweedie-nloglik@1.5:2836.31421	validation_0-mse:1408992542720.00000
[866]	validation_0-tweedie-nloglik@1.5:2836.31104	validation_0-mse:1408992542720.00000
[867]	validation_0-tweedie-nloglik@1.5:2836

[951]	validation_0-tweedie-nloglik@1.5:2836.31812	validation_0-mse:1408992542720.00000
[952]	validation_0-tweedie-nloglik@1.5:2836.32007	validation_0-mse:1408992542720.00000
[953]	validation_0-tweedie-nloglik@1.5:2836.32275	validation_0-mse:1408992542720.00000
[954]	validation_0-tweedie-nloglik@1.5:2836.32373	validation_0-mse:1408992542720.00000
[955]	validation_0-tweedie-nloglik@1.5:2836.32495	validation_0-mse:1408992542720.00000
[956]	validation_0-tweedie-nloglik@1.5:2836.32617	validation_0-mse:1408992542720.00000
[957]	validation_0-tweedie-nloglik@1.5:2836.32690	validation_0-mse:1408992542720.00000
[958]	validation_0-tweedie-nloglik@1.5:2836.32544	validation_0-mse:1408992542720.00000
[959]	validation_0-tweedie-nloglik@1.5:2836.32593	validation_0-mse:1408992542720.00000
[960]	validation_0-tweedie-nloglik@1.5:2836.32495	validation_0-mse:1408992542720.00000
[961]	validation_0-tweedie-nloglik@1.5:2836.32544	validation_0-mse:1408992542720.00000
[962]	validation_0-tweedie-nloglik@1.5:2836

{'mse': 128522463228.4819, 'r2_score': 0.8647867673514608}

## Check some predictions

In [85]:
for i in range(50):
    print(check_prediction(xgb_model, X_test.iloc[i], y_test[i]))

{'Prediction': 'R$ 299,617.50', 'Real': 'R$ 315,000.00'}
{'Prediction': 'R$ 1,175.22', 'Real': 'R$ 2,800.00'}
{'Prediction': 'R$ 220,586.39', 'Real': 'R$ 301,000.00'}
{'Prediction': 'R$ 461,148.47', 'Real': 'R$ 392,000.00'}
{'Prediction': 'R$ 3,997.23', 'Real': 'R$ 3,849.00'}
{'Prediction': 'R$ 478,642.22', 'Real': 'R$ 486,499.00'}
{'Prediction': 'R$ 345,251.19', 'Real': 'R$ 343,000.00'}
{'Prediction': 'R$ 319,405.66', 'Real': 'R$ 371,000.00'}
{'Prediction': 'R$ 344,907.62', 'Real': 'R$ 364,000.00'}
{'Prediction': 'R$ 1,868,469.00', 'Real': 'R$ 2,050,999.00'}
{'Prediction': 'R$ 265,740.72', 'Real': 'R$ 203,000.00'}
{'Prediction': 'R$ 426,435.66', 'Real': 'R$ 460,039.00'}
{'Prediction': 'R$ 289,741.94', 'Real': 'R$ 350,000.00'}
{'Prediction': 'R$ 269,690.34', 'Real': 'R$ 259,699.00'}
{'Prediction': 'R$ 195,506.27', 'Real': 'R$ 224,000.00'}
{'Prediction': 'R$ 701,194.62', 'Real': 'R$ 910,000.00'}
{'Prediction': 'R$ 380,444.78', 'Real': 'R$ 409,500.00'}
{'Prediction': 'R$ 2,231,330.75', '

## Price range score

This metric allows us to check, in a binary view, whether the model performs well or not.

In [86]:
scores = []

# Score each test row individually with price_range_score (defined in modeling.py).
for i in range(len(X_test)):
    y_pred = xgb_model.predict(X_test.iloc[[i]])[0]
    score_range = price_range_score(y_pred, y_test[i])
    scores.append({"real": y_test[i], "prediction": y_pred, "pr_score": score_range})

df_score = pd.DataFrame(scores)

for col in ["real", "prediction"]:
    df_score[col] = df_score[col].astype(int)

# Split the real prices into quartiles; the bin labels are reused in the next section.
df_score["bins"] = pd.qcut(df_score["real"], q=4)
df_score["bins"] = df_score["bins"].astype(str)

# Average price range score over the test set.
df_score["pr_score"].sum() / len(df_score)

0.8283557046979866
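
The value above is the average of a per-row score over the test set. As a minimal sketch of how such a binary score can be computed for all rows at once, the block below uses `within_tolerance`, a hypothetical stand-in that marks a prediction as good when it falls within a relative tolerance of the real price. The actual `price_range_score` lives in `modeling.py` and may be defined differently; the 30% tolerance and the toy values are assumptions for illustration only.

```python
import numpy as np


def within_tolerance(y_pred, y_true, tol=0.3):
    """Hypothetical stand-in for price_range_score.

    Returns 1 where the prediction is within +/- tol (relative) of the
    real price, and 0 otherwise.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return (np.abs(y_pred - y_true) <= tol * np.abs(y_true)).astype(int)


# Toy values, not taken from the notebook's data.
real = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0])
pred = np.array([110_000.0, 120_000.0, 310_000.0, 600_000.0])

hits = within_tolerance(pred, real)  # one 0/1 flag per row
share_of_hits = hits.mean()          # same idea as sum() / len()
print(hits, share_of_hits)
```

Because the score is 0 or 1 per row, its mean reads directly as the share of rows judged acceptable under the chosen rule.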

## Score by price_ranges

Here we can see the model's R2 score within each price bin (the quartiles of the real price created with `pd.qcut` in the previous cell).

In [87]:
from sklearn.metrics import r2_score

r2_bins = []
for _bin in df_score["bins"].unique():
    y_true = df_score[df_score["bins"] == _bin]["real"].values
    y_pred = df_score[df_score["bins"] == _bin]["prediction"].values
    r2 = r2_score(y_true, y_pred)
    r2_bins.append({"bin": _bin, "r2_score": r2})
                    
df_r2 = pd.DataFrame(r2_bins)
df_r2

Unnamed: 0,bin,r2_score
0,"(224000.0, 409045.0]",-0.098336
1,"(199.999, 224000.0]",0.918003
2,"(409045.0, 770000.0]",-0.264923
3,"(770000.0, 25434920.0]",0.771646
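
A negative R2 inside a bin does not by itself mean the predictions there are far off in absolute terms. R2 compares the squared errors against the variance of the real values in the same subset, and a narrow price bin has little variance. The sketch below illustrates this with toy numbers (not the notebook's data) and a small `r2` helper written with NumPy; it is one possible explanation for the middle bins, not a confirmed diagnosis.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy prices forming two narrow groups; every prediction is off by 20,000.
real = np.array([300_000.0, 310_000.0, 320_000.0,
                 700_000.0, 710_000.0, 720_000.0])
errors = np.array([20_000.0, -20_000.0, 20_000.0,
                   -20_000.0, 20_000.0, -20_000.0])
pred = real + errors

r2_overall = r2(real, pred)          # high: prices vary a lot overall
r2_low_bin = r2(real[:3], pred[:3])  # negative: prices barely vary in the bin
print(r2_overall, r2_low_bin)
```

The same absolute error gives a high score on the full range and a negative one inside the narrow group, so a per-bin error in currency units (for example the mean absolute error) may be easier to interpret alongside R2.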


In [97]:
px.bar(df_r2, range_y=[-1, 1], x="bin", y="r2_score",
       title="R2 Score for range of prices", text="r2_score")

# Save trained model

In [89]:
xgb_model.save_model("../models/model.xgb")

# Predict test dataset

In [90]:
df_test = pd.read_feather("../data/processed/test.feather")
print(df_test.shape)

(16036, 49)


In [91]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16036 entries, 0 to 16035
Data columns (total 49 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   id                             16036 non-null  object 
 1   usableAreas                    16029 non-null  float64
 2   parkingSpaces                  15772 non-null  float64
 3   suites                         14641 non-null  float64
 4   bathrooms                      16035 non-null  float64
 5   totalAreas                     9942 non-null   float64
 6   bedrooms                       16036 non-null  int64  
 7   publicationType                16036 non-null  object 
 8   geohash                        16031 non-null  object 
 9   price                          0 non-null      object 
 10  businessType                   16036 non-null  object 
 11  yearlyIptu                     13639 non-null  float64
 12  monthlyCondoFee                15100 non-null 

In [92]:
ids = df_test["id"].values

df_test.drop(columns=["price"], inplace=True)

X2_test = prep_modeling(df_test, invalid_cols, geohash=geohash_delimiters, generate_encoder=False)

In [93]:
X2_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16036 entries, 0 to 16035
Data columns (total 45 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   usableAreas                    16036 non-null  float64
 1   parkingSpaces                  16036 non-null  float64
 2   suites                         16036 non-null  float64
 3   bathrooms                      16036 non-null  float64
 4   totalAreas                     16036 non-null  float64
 5   bedrooms                       16036 non-null  float64
 6   publicationType                16036 non-null  float64
 7   businessType                   16036 non-null  float64
 8   yearlyIptu                     16036 non-null  float64
 9   monthlyCondoFee                16036 non-null  float64
 10  has_gym                        16036 non-null  float64
 11  has_garden                     16036 non-null  float64
 12  has_pool                       16036 non-null 

In [94]:
def test_prediction(test, ids, model):
    """
    Predict the price for each row of the test dataset.
    
    Arguments:
    - test (pd.DataFrame): Test dataset.
    - ids (list): ID for each row in test dataset.
    - model (XGBRegressor model): Trained model.
    
    Output:
    List of dicts like {"id": "X", "price": 100.0}
    """
    result = []
    for i in range(len(test)):
        pred = model.predict(test.iloc[[i]])[0]
        result.append({"id": ids[i], "price": pred})
        
    return result
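
The function above calls `predict` once per row. Since `predict` also accepts the whole feature table, the same result can be sketched as a single batched call. `batch_prediction` and `DummyModel` below are hypothetical names used only for illustration; `DummyModel` stands in for the trained model so that the sketch runs without XGBoost.

```python
import pandas as pd


def batch_prediction(test, ids, model):
    """Predict all rows in one call and pair each prediction with its ID."""
    preds = model.predict(test)
    return pd.DataFrame({"id": ids, "price": preds})


class DummyModel:
    """Hypothetical stand-in for the trained model: predicts the row sum."""

    def predict(self, X):
        return X.sum(axis=1).to_numpy()


# Toy feature table, not taken from the notebook's data.
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0]})
out = batch_prediction(toy, ["x", "y"], DummyModel())
print(out)
```

This variant returns a DataFrame directly instead of a list of dicts, and it is usually much faster on tens of thousands of rows because the per-call overhead is paid only once.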

In [95]:
test_preds = test_prediction(X2_test, ids, xgb_model)
test_preds = pd.DataFrame(test_preds)

print(test_preds.shape)
test_preds.head()

(16036, 2)


Unnamed: 0,id,price
0,89224365f8,424430.6
1,363731333f,225276.7
2,6e6283378a,435035.5
3,4c29a27f44,1532090.0
4,7b16cf224b,523594.2


## Save predictions.csv

In [96]:
test_preds.to_csv("../predictions.csv", index=False, encoding="utf-8")
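
Before sharing the file, a quick sanity check can catch common problems. The sketch below assumes the expected format is exactly two columns, `id` and `price`, with one row per test ID and no missing prices; `check_predictions` is a hypothetical helper, and the toy table is written to an in-memory buffer rather than to disk.

```python
import io

import pandas as pd


def check_predictions(df, expected_rows):
    """Return a list of problems found in a predictions table (empty if none)."""
    if list(df.columns) != ["id", "price"]:
        return ["unexpected columns"]
    problems = []
    if len(df) != expected_rows:
        problems.append("unexpected row count")
    if df["id"].duplicated().any():
        problems.append("duplicated ids")
    if df["price"].isna().any():
        problems.append("missing prices")
    return problems


# Toy table written to and read back from an in-memory CSV.
toy = pd.DataFrame({"id": ["a", "b"], "price": [1.5, 2.5]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

problems = check_predictions(reloaded, expected_rows=2)
print(problems)
```

In the notebook, the same check could be applied to the file written above by reading `../predictions.csv` back and passing `len(df_test)` as `expected_rows`.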

---

# That's the end!