# Introduction
In this notebook, we will:
1. Encode categorical features using the feature descriptions provided with the original dataset.
2. Ensemble gradient boosted tree models, specifically XGBoost, LightGBM and CatBoost.
3. Combine the original dataset with the competition's dataset.


# Purpose:
The purpose of this notebook is to serve as a simple but strong baseline as you go on to engineer features and tune your models.

In [1]:
!pip install --upgrade tensorflow_decision_forests

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Collecting tensorflow_decision_forests
  Downloading tensorflow_decision_forests-1.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.5 MB)
Collecting tensorflow~=2.11.0
  Downloading tensorflow-2.11.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (588.3 MB)
Collecting tensorboard<2.12,>=2.11
  Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)
Collecting flatbuffers>=2.0
  Downloading flatbuffers-23.3.3-py2.py3-none-any.whl (26 kB)
Collecting keras

# Imports

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from pathlib import Path
import xgboost as xgb
import lightgbm as lgbm
import catboost
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
from IPython.display import display
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from category_encoders import LeaveOneOutEncoder
import optuna
import tensorflow_decision_forests as tfdf

In [3]:
from warnings import filterwarnings
filterwarnings("ignore")

#### NOTE:
If you are interested in insights into the dataset and EDA, check out this excellent [notebook](https://www.kaggle.com/code/craigmthomas/play-s3e8-eda-models) by Craig Thomas. (His notebooks are always awesome!)

# Loading Data

In [4]:
BASE_PATH = Path("/kaggle/input/playground-series-s3e8")
train = pd.read_csv(BASE_PATH / "train.csv").drop(columns="id")
test = pd.read_csv(BASE_PATH  / "test.csv")
test_idx = test.id
test = test.drop(columns="id")

# Craig Thomas has shown in his excellent notebook (linked above) that the original
# dataset is very similar to the competition's dataset, so combining the two
# should hopefully improve our score.

original = pd.read_csv("/kaggle/input/gemstone-price-prediction/cubic_zirconia.csv").drop(columns="Unnamed: 0")

print(f"Loaded train with {len(train)} rows.")
print(f"Loaded test with {len(test)} rows.")
print(f"Loaded original with {len(original)} rows.")

Loaded train with 193573 rows.
Loaded test with 129050 rows.
Loaded original with 26967 rows.


In [5]:
all_datasets = {"train": train,
               "test": test,
               "original": original}

# Checking for Null values

In [6]:
pd.concat(
    [dataset.isnull().sum().rename(f"Missing in {dataset_name}")
     for dataset_name, dataset in all_datasets.items()],
    axis=1,
)

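| Feature | Missing in train | Missing in test | Missing in original |
|---|---|---|---|
| carat | 0 | 0.0 | 0 |
| cut | 0 | 0.0 | 0 |
| color | 0 | 0.0 | 0 |
| clarity | 0 | 0.0 | 0 |
| depth | 0 | 0.0 | 697 |
| table | 0 | 0.0 | 0 |
| x | 0 | 0.0 | 0 |
| y | 0 | 0.0 | 0 |
| z | 0 | 0.0 | 0 |
| price | 0 | NaN | 0 |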


## INSIGHTS:
Only the original dataset contains missing values: 697, all in `depth`. We'll simply drop those rows, since no other dataset has any missing values. Devising and applying an imputation technique would take extra effort, and the imputed values may be noisier than the rest of the data, which could hurt the model's performance.

In [7]:
original.dropna(axis=0, how="any", inplace=True)
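Before dropping rows, it can be worth checking how much data is lost. In this notebook the loss is 697 of 26967 rows (about 2.6%). The sketch below shows the same check on a small, made-up frame; `toy_original` and its values are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original dataset: two rows have a missing `depth`.
toy_original = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42, 0.31],
    "depth": [62.1, np.nan, 62.2, 61.6, np.nan],
    "price": [499, 984, 6289, 1082, 779],
})

n_before = len(toy_original)

# Drop every row that has at least one missing value.
toy_clean = toy_original.dropna(axis=0, how="any")
n_dropped = n_before - len(toy_clean)

print(f"Dropped {n_dropped} of {n_before} rows ({n_dropped / n_before:.1%}).")
```

If the dropped share were large, imputation might be worth reconsidering; at a few percent, dropping is a reasonable shortcut.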

# Identifying categorical features

In [8]:
pd.concat([train.dtypes.rename("Data Type")] + \
          [dataset.nunique().rename(f"{dataset_name} UniqueValues") for dataset_name, dataset in all_datasets.items()],
          axis=1).sort_values(by="train UniqueValues")

| Feature | Data Type | train UniqueValues | test UniqueValues | original UniqueValues |
|---|---|---|---|---|
| cut | object | 5 | 5.0 | 5 |
| color | object | 7 | 7.0 | 7 |
| clarity | object | 8 | 8.0 | 8 |
| table | float64 | 108 | 101.0 | 112 |
| depth | float64 | 153 | 143.0 | 169 |
| carat | float64 | 248 | 252.0 | 256 |
| z | float64 | 349 | 342.0 | 354 |
| y | float64 | 521 | 516.0 | 525 |
| x | float64 | 522 | 521.0 | 530 |
| price | int64 | 8738 | NaN | 8629 |


In [9]:
cat_features = ["cut", "color", "clarity"]

# Encoding Categorical Features
Using the feature descriptions from this [discussion](https://www.kaggle.com/competitions/playground-series-s3e8/discussion/389213), we will encode the categorical features identified above.
Check out that discussion: it describes every feature in the dataset, which should help you understand the features better and engineer new ones based on them.

### Encoding Cut
Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.

In [10]:
cut_labels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_labels_map = {label: rank for rank, label in enumerate(cut_labels)}
cut_labels_map

{'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}

### Encoding Color
Colour of the cubic zirconia, with D being the best and J the worst.

In [11]:
color_labels = ['D', 'E', 'F', 'G', 'H', 'I', 'J']
color_labels_map = {label: rank for rank, label in enumerate(reversed(color_labels))}
color_labels_map

{'J': 0, 'I': 1, 'H': 2, 'G': 3, 'F': 4, 'E': 5, 'D': 6}

### Encoding Clarity feature
Cubic zirconia clarity refers to the absence of inclusions and blemishes. In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.

In [12]:
clarity_labels = ['FL', 'IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1', 'I2', 'I3']
clarity_labels_map = {label: rank for rank, label in enumerate(reversed(clarity_labels))}
clarity_labels_map

{'I3': 0,
 'I2': 1,
 'I1': 2,
 'SI2': 3,
 'SI1': 4,
 'VS2': 5,
 'VS1': 6,
 'VVS2': 7,
 'VVS1': 8,
 'IF': 9,
 'FL': 10}

In [13]:
for dataset in all_datasets.values():
    dataset["cut"] = dataset["cut"].map(cut_labels_map)
    dataset["color"] = dataset["color"].map(color_labels_map)    
    dataset["clarity"] = dataset["clarity"].map(clarity_labels_map)    
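One thing to watch with dictionary-based encoding: `Series.map` returns `NaN` for any value that is not a key in the dictionary. The sketch below shows a quick check for unmapped labels on a toy column; the label `"Excellent"` is a hypothetical value added only to trigger the problem.

```python
import pandas as pd

cut_labels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_labels_map = {label: rank for rank, label in enumerate(cut_labels)}

# Toy column; "Excellent" is a hypothetical label that is not in the mapping.
toy_cut = pd.Series(["Ideal", "Fair", "Excellent", "Premium"])

encoded = toy_cut.map(cut_labels_map)

# Values missing from the dict become NaN, so list the labels that failed to map.
unmapped = toy_cut[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

The same effect appears if the encoding cell is run twice: the already-encoded integers are not keys of the mapping, so the second pass turns the whole column into `NaN`.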

# Preprocessing

In [14]:
X = train.drop(columns="price")
y = train.price

# Setting Up Cross Validation
I'll cross-validate only the TFDF gradient boosted trees model here, but you can do the same for the other models.

## Setting up TFDF

In [15]:
tf_cat_features = []
for feature in cat_features:
    tf_cat_features.append(tfdf.keras.FeatureUsage(name=str(feature), semantic=tfdf.keras.FeatureSemantic.CATEGORICAL))

In [16]:
def cross_validate_tfdf(X, y, X_org=None, y_org=None):
    # we'll use 5 fold cross validation
    N_FOLDS = 5
    cv_scores = np.zeros(N_FOLDS)
    kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)
    
    for fold_id, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
        
        if X_org is not None and y_org is not None:
            X_train = pd.concat([X_train, X_org], axis=0)
            y_train = pd.concat([y_train, y_org], axis=0)
        
        X_train = pd.concat([X_train, y_train], axis=1)
        
        X_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X_train, label="price", task=tfdf.keras.Task.REGRESSION)
        X_val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X_val, task=tfdf.keras.Task.REGRESSION)
        
        model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION, 
                                                     verbose=0, features=tf_cat_features, 
                                                     exclude_non_specified_features=False,
                                                    hyperparameter_template="benchmark_rank1@v1")
        model.fit(X_train_ds)
        
        y_preds = model.predict(X_val_ds)     
        rmse = mean_squared_error(y_val, y_preds, squared=False)
        cv_scores[fold_id] = rmse
        
        print(f"Fold {fold_id} | rmse: {rmse}")
    
    avg_rmse = np.mean(cv_scores)
    print(f"Avg RMSE across folds: {avg_rmse}")
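The key design choice in the function above is that the original data is appended to the training folds only, so every validation fold contains competition rows alone. Here is a minimal sketch of that pattern for a generic scikit-learn regressor. The synthetic data, the `cv_rmse` helper name and the use of `LinearRegression` are all assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
coef = np.array([1.0, -2.0, 0.5])

# Synthetic stand-ins for the competition data and the smaller original data,
# drawn from the same linear relationship (an assumption for this demo).
X_comp = rng.normal(size=(200, 3))
y_comp = X_comp @ coef + rng.normal(scale=0.1, size=200)
X_orig = rng.normal(size=(50, 3))
y_orig = X_orig @ coef + rng.normal(scale=0.1, size=50)


def cv_rmse(X, y, X_extra=None, y_extra=None, n_folds=5):
    """Return per-fold RMSE; extra data is appended to training folds only."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1337)
    scores = []
    for train_idx, val_idx in kf.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if X_extra is not None and y_extra is not None:
            # Extra rows never enter the validation fold.
            X_tr = np.concatenate([X_tr, X_extra])
            y_tr = np.concatenate([y_tr, y_extra])
        model = LinearRegression().fit(X_tr, y_tr)
        preds = model.predict(X[val_idx])
        scores.append(np.sqrt(mean_squared_error(y[val_idx], preds)))
    return np.array(scores)


scores_comp = cv_rmse(X_comp, y_comp)
scores_both = cv_rmse(X_comp, y_comp, X_orig, y_orig)
print(f"Competition only: {scores_comp.mean():.4f}")
print(f"With extra data:  {scores_both.mean():.4f}")
```

Because both runs use the same `random_state`, the validation folds are identical, which makes the two sets of scores directly comparable fold by fold.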

### using competition data only

In [17]:
cross_validate_tfdf(X, y)

Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:00:15.153619686+00:00 kernel.cc:1214] Loading model from path /tmp/tmpr8fpd81u/model/ with prefix 5b1b70a6c6ca4856
[INFO 2023-03-05T13:00:15.182845615+00:00 decision_forest.cc:661] Model loaded with 87 root(s), 5299 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:00:15.182912948+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:00:15.182951733+00:00 kernel.cc:1046] Use fast generic engine


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code
Fold 0 | rmse: 565.0860760833093
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:01:23.078375208+00:00 kernel.cc:1214] Loading model from path /tmp/tmpwodvan39/model/ with prefix 198b83f61efc471e
[INFO 2023-03-05T13:01:23.107598766+00:00 decision_forest.cc:661] Model loaded with 111 root(s), 6761 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:01:23.107857338+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:01:23.107948004+00:00 kernel.cc:1046] Use fast generic engine


Fold 1 | rmse: 567.7647314542417
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:02:08.334756584+00:00 kernel.cc:1214] Loading model from path /tmp/tmp5_fwlt_c/model/ with prefix 8dddddff845e42e9
[INFO 2023-03-05T13:02:08.353193736+00:00 decision_forest.cc:661] Model loaded with 68 root(s), 4148 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:02:08.353588893+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:02:08.353633773+00:00 kernel.cc:1046] Use fast generic engine


Fold 2 | rmse: 596.276784359588
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:04:15.212696682+00:00 kernel.cc:1214] Loading model from path /tmp/tmps4yz57wg/model/ with prefix 30c042e55f4548ce
[INFO 2023-03-05T13:04:15.255952719+00:00 decision_forest.cc:661] Model loaded with 175 root(s), 10573 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:04:15.256037904+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:04:15.256077343+00:00 kernel.cc:1046] Use fast generic engine


Fold 3 | rmse: 594.927366480423
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:05:28.153486292+00:00 kernel.cc:1214] Loading model from path /tmp/tmptrdnr6g1/model/ with prefix c08e9cfd90ef4b8d
[INFO 2023-03-05T13:05:28.188995939+00:00 decision_forest.cc:661] Model loaded with 132 root(s), 8028 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:05:28.189070593+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:05:28.189111224+00:00 kernel.cc:1046] Use fast generic engine


Fold 4 | rmse: 573.6090119887746
Avg RMSE across folds: 579.5327940732673


### using original + comp data

In [18]:
X_original = original.drop(columns="price")
y_original = original.price

cross_validate_tfdf(X, y, X_original, y_original)

Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:08:37.66410601+00:00 kernel.cc:1214] Loading model from path /tmp/tmpha2ndwev/model/ with prefix 119e9a47fc444fb7
[INFO 2023-03-05T13:08:37.692878711+00:00 decision_forest.cc:661] Model loaded with 115 root(s), 7001 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:08:37.692952617+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:08:37.692990968+00:00 kernel.cc:1046] Use fast generic engine


Fold 0 | rmse: 561.2152794601946
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:10:12.522467904+00:00 kernel.cc:1214] Loading model from path /tmp/tmp1iy7005f/model/ with prefix f6085494cd8d4ae5
[INFO 2023-03-05T13:10:12.561371605+00:00 decision_forest.cc:661] Model loaded with 156 root(s), 9440 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:10:12.561441048+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:10:12.561467528+00:00 kernel.cc:1046] Use fast generic engine


Fold 1 | rmse: 563.2153328528723
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:11:38.975611957+00:00 kernel.cc:1214] Loading model from path /tmp/tmp646ueirm/model/ with prefix 51e0b26ab7c042f1
[INFO 2023-03-05T13:11:39.00873158+00:00 decision_forest.cc:661] Model loaded with 136 root(s), 8262 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:11:39.008801724+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:11:39.008841603+00:00 kernel.cc:1046] Use fast generic engine


Fold 2 | rmse: 592.9907448649969
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:13:03.545747604+00:00 kernel.cc:1214] Loading model from path /tmp/tmpmt51l4m9/model/ with prefix 14d4e19904d049b6
[INFO 2023-03-05T13:13:03.5782229+00:00 decision_forest.cc:661] Model loaded with 130 root(s), 7894 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:13:03.578276707+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:13:03.578302143+00:00 kernel.cc:1046] Use fast generic engine


Fold 3 | rmse: 589.3314906288741
Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:14:41.962561708+00:00 kernel.cc:1214] Loading model from path /tmp/tmp5cvxixvq/model/ with prefix 0844311bc58f4341
[INFO 2023-03-05T13:14:42.00161577+00:00 decision_forest.cc:661] Model loaded with 158 root(s), 9518 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:14:42.001687087+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:14:42.001725294+00:00 kernel.cc:1046] Use fast generic engine


Fold 4 | rmse: 573.0557346066032
Avg RMSE across folds: 575.9617164827083


In [None]:
# def cross_validate(X, y, X_org=None, y_org=None):
#     # we'll use 5 fold cross validation
#     N_FOLDS = 5
#     cv_scores = np.zeros(N_FOLDS)
#     kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)
    
#     for fold_id, (train_idx, val_idx) in enumerate(kf.split(X)):
#         X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
#         X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
        
#         if X_org is not None and y_org is not None:
#             X_train = pd.concat([X_train, X_org], axis=0)
#             y_train = pd.concat([y_train, y_org], axis=0)
        
#         model = lgbm.LGBMRegressor()
#         model.fit(X_train, y_train,
#                      eval_set=[(X_val, y_val)],
#                      eval_metric="rmse",
#                      early_stopping_rounds=50,
#                      verbose=-1)
        
#         y_preds = model.predict(X_val)        
#         rmse = mean_squared_error(y_val, y_preds, squared=False)
#         cv_scores[fold_id] = rmse
        
#         print(f"Fold {fold_id} | rmse: {rmse}")
    
#     avg_rmse = np.mean(cv_scores)
#     print(f"Avg RMSE across folds: {avg_rmse}")

### using competition data only

In [None]:
# cross_validate(X, y)

### using original + competition data

In [33]:
# X_original = original.drop(columns="price")
# y_original = original.price

In [34]:
# cross_validate_tfdf(X, y, X_original, y_original)

## INSIGHTS: Including the original dataset does help! The average RMSE across folds drops from about 579.5 to about 576.0.
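Since both runs use the same folds, the scores can also be compared fold by fold rather than only on average. The sketch below does that with the RMSE values printed by the two TFDF cross-validation runs above; the variable names are just for illustration.

```python
import numpy as np

# Per-fold RMSE copied from the two cross-validation runs above.
rmse_comp_only = np.array([
    565.0860760833093, 567.7647314542417, 596.276784359588,
    594.927366480423, 573.6090119887746,
])
rmse_with_orig = np.array([
    561.2152794601946, 563.2153328528723, 592.9907448649969,
    589.3314906288741, 573.0557346066032,
])

# Positive values mean adding the original data lowered the RMSE on that fold.
improvement = rmse_comp_only - rmse_with_orig

for fold_id, gain in enumerate(improvement):
    print(f"Fold {fold_id} | RMSE reduction: {gain:.2f}")

print(f"Mean reduction: {improvement.mean():.2f}")
print(f"Folds improved: {(improvement > 0).sum()} of {len(improvement)}")
```

The reduction is small relative to the RMSE itself, but it shows up on every fold, which makes it less likely to be a fluke of one particular split.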

# Training Models

In [19]:
# creating a validation set
# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, shuffle=True, random_state=1337)

In [21]:
# let's add original data to the mix
X_train = pd.concat([X, X_original], axis=0)
y_train = pd.concat([y, y_original], axis=0)
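A side note on stacking frames with `pd.concat(..., axis=0)`: each frame keeps its own index labels, so the result has duplicate labels. That is harmless here because features and target are stacked in the same order, but label-based alignment later on can misbehave. The toy sketch below (the frames are made up) shows the effect and the `ignore_index=True` alternative.

```python
import pandas as pd

# Toy stand-ins for the competition and original feature frames.
X_toy = pd.DataFrame({"carat": [0.3, 0.5]})
X_original_toy = pd.DataFrame({"carat": [0.7, 0.9]})

# Default behaviour: index labels from both frames are kept, so they repeat.
stacked = pd.concat([X_toy, X_original_toy], axis=0)
print(stacked.index.tolist(), stacked.index.is_unique)

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index.
stacked_clean = pd.concat([X_toy, X_original_toy], axis=0, ignore_index=True)
print(stacked_clean.index.tolist(), stacked_clean.index.is_unique)
```

If you switch to `ignore_index=True`, apply it to both the feature and target concatenations so the two stay aligned.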

In [None]:
# xgb_model = xgb.XGBRegressor(eval_metric="rmse", early_stopping_rounds=50)
# xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

In [None]:
# lgbm_model = lgbm.LGBMRegressor()
# lgbm_model.fit(X_train, y_train, 
#                eval_set=[(X_val, y_val)],
#                eval_metric="rmse",
#                early_stopping_rounds=50,
#                verbose=-1)

In [None]:
# cat_model = catboost.CatBoostRegressor(eval_metric="RMSE", early_stopping_rounds=50)
# cat_model.fit(X_train, y_train,
#               eval_set=[(X_val, y_val)],
#               verbose=False)

### TFDF

In [22]:
X_train = pd.concat([X_train, y_train], axis=1)

X_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X_train, label="price", task=tfdf.keras.Task.REGRESSION)

model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION, 
                                             verbose=0, features=tf_cat_features, 
                                             exclude_non_specified_features=False,
                                            hyperparameter_template="benchmark_rank1@v1")
model.fit(X_train_ds)

Resolve hyper-parameter template "benchmark_rank1@v1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.


[INFO 2023-03-05T13:24:12.739290248+00:00 kernel.cc:1214] Loading model from path /tmp/tmp7kkprpt6/model/ with prefix 2aaa762f761545ba
[INFO 2023-03-05T13:24:12.764789007+00:00 decision_forest.cc:661] Model loaded with 100 root(s), 6044 node(s), and 9 input feature(s).
[INFO 2023-03-05T13:24:12.765044196+00:00 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 2023-03-05T13:24:12.765092761+00:00 kernel.cc:1046] Use fast generic engine


<keras.callbacks.History at 0x7fc573dc1590>

# Making Predictions

In [23]:
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test, task=tfdf.keras.Task.REGRESSION)

In [24]:
# y_preds_xgb = xgb_model.predict(test)
# y_preds_lgbm = lgbm_model.predict(test)
# y_preds_cat = cat_model.predict(test)
y_preds_tfdf = model.predict(test_ds)



# Ensembling
We'll use a simple average for ensembling, but feel free to use more advanced ensembling techniques. Since only the TFDF model is trained in this run, its predictions are used directly.

In [33]:
# y_preds_final = np.array([y_preds_xgb, y_preds_lgbm, y_preds_cat]).mean(axis=0)
y_preds_final = y_preds_tfdf.squeeze()
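For reference, here is a minimal sketch of simple and weighted averaging once several models are available. The three prediction arrays and the weights are hypothetical placeholders, not outputs from this notebook.

```python
import numpy as np

# Hypothetical per-model predictions for the same four test rows.
y_preds_a = np.array([880.0, 2550.0, 2360.0, 830.0])
y_preds_b = np.array([870.0, 2570.0, 2340.0, 820.0])
y_preds_c = np.array([890.0, 2530.0, 2380.0, 840.0])

# Stack into shape (n_models, n_rows) so axis 0 runs over models.
stacked = np.vstack([y_preds_a, y_preds_b, y_preds_c])

# Simple average: every model gets the same weight.
simple_avg = stacked.mean(axis=0)

# Weighted average: illustrative weights, e.g. chosen from validation scores.
weights = np.array([0.5, 0.25, 0.25])
weighted_avg = np.average(stacked, axis=0, weights=weights)

print(simple_avg)
print(weighted_avg)
```

All arrays must be one-dimensional and in the same row order before stacking; predictions returned with shape `(n, 1)` need to be flattened first, for example with `.squeeze()` as in the cell above.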

# Submission

In [34]:
submission = pd.DataFrame({"id": test_idx, "price": y_preds_final})
submission.head()

| | id | price |
|---|---|---|
| 0 | 193573 | 878.719727 |
| 1 | 193574 | 2555.009766 |
| 2 | 193575 | 2359.241699 |
| 3 | 193576 | 826.966553 |
| 4 | 193577 | 5756.602539 |


In [35]:
submission.to_csv("submission.csv", index=False)
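A few cheap sanity checks before uploading can catch a malformed file. The sketch below runs them on a toy submission built the same way as above; the three rows are placeholders, and the "prices are positive" check is an assumption about what a sensible prediction looks like, not a competition rule.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the test ids and the final predictions.
test_idx = pd.Series([193573, 193574, 193575], name="id")
y_preds_final = np.array([878.72, 2555.01, 2359.24])

submission = pd.DataFrame({"id": test_idx, "price": y_preds_final})

# One row per test id, with no duplicates.
assert len(submission) == len(test_idx)
assert submission["id"].is_unique

# No missing predictions, and (by assumption) no non-positive prices.
assert submission["price"].notna().all()
assert (submission["price"] > 0).all()

print("Submission looks well-formed:", submission.shape)
```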