# Introduction
In this notebook, we will:
1. Encode categorical features using the feature descriptions provided with the original dataset.
2. Ensemble gradient-boosted tree models, specifically XGBoost, LightGBM and CatBoost.
3. Combine the original dataset with the competition's dataset.


# Purpose:
The purpose of this notebook is to serve as a simple but strong baseline for you as you go on to engineer features and tune your models.

# Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from pathlib import Path
import xgboost as xgb
import lightgbm as lgbm
import catboost
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
from IPython.display import display
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from category_encoders import LeaveOneOutEncoder
import optuna

In [2]:
from warnings import filterwarnings
filterwarnings("ignore")

#### NOTE:
If you are interested in dataset insights and EDA, check out this excellent [notebook](https://www.kaggle.com/code/craigmthomas/play-s3e8-eda-models) by Craig Thomas. (His notebooks are always awesome!)

# Loading Data

In [6]:
BASE_PATH = Path("/kaggle/input/playground-series-s3e8")
train = pd.read_csv(BASE_PATH / "train.csv").drop(columns="id")
test = pd.read_csv(BASE_PATH  / "test.csv")
test_idx = test.id
test = test.drop(columns="id")

# Craig Thomas has shown in his excellent notebook (linked above) that the original dataset
# is very similar to the competition's, so combining the two should hopefully improve our score.

original = pd.read_csv("/kaggle/input/gemstone-price-prediction/cubic_zirconia.csv").drop(columns="Unnamed: 0")

print(f"Loaded train with {len(train)} rows.")
print(f"Loaded test with {len(test)} rows.")
print(f"Loaded original with {len(original)} rows.")

Loaded train with 193573 rows.
Loaded test with 129050 rows.
Loaded original with 26967 rows.


In [7]:
all_datasets = {"train": train,
               "test": test,
               "original": original}

# Checking for Null values

In [8]:
pd.concat([dataset.isnull().sum().rename(f"Missing in {dataset_name}") 
               for dataset_name, dataset in all_datasets.items()],
                 axis=1)

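| | Missing in train | Missing in test | Missing in original |
|---|---|---|---|
| carat | 0 | 0.0 | 0 |
| cut | 0 | 0.0 | 0 |
| color | 0 | 0.0 | 0 |
| clarity | 0 | 0.0 | 0 |
| depth | 0 | 0.0 | 697 |
| table | 0 | 0.0 | 0 |
| x | 0 | 0.0 | 0 |
| y | 0 | 0.0 | 0 |
| z | 0 | 0.0 | 0 |
| price | 0 | NaN | 0 |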


## INSIGHTS: 
Only the original dataset contains missing values (697, all in `depth`). We'll simply drop those rows, since no other dataset contains any missing values.

In [9]:
original.dropna(axis=0, how="any", inplace=True)

# Identifying categorical features

In [11]:
pd.concat([train.dtypes.rename("Data Type")] + \
          [dataset.nunique().rename(f"{dataset_name} UniqueValues") for dataset_name, dataset in all_datasets.items()],
          axis=1).sort_values(by="train UniqueValues")

| | Data Type | train UniqueValues | test UniqueValues | original UniqueValues |
|---|---|---|---|---|
| cut | object | 5 | 5.0 | 5 |
| color | object | 7 | 7.0 | 7 |
| clarity | object | 8 | 8.0 | 8 |
| table | float64 | 108 | 101.0 | 112 |
| depth | float64 | 153 | 143.0 | 169 |
| carat | float64 | 248 | 252.0 | 256 |
| z | float64 | 349 | 342.0 | 354 |
| y | float64 | 521 | 516.0 | 525 |
| x | float64 | 522 | 521.0 | 530 |
| price | int64 | 8738 | NaN | 8629 |


In [12]:
cat_features = ["cut", "color", "clarity"]

# Encoding Categorical Features
Using the feature descriptions from this [discussion](https://www.kaggle.com/competitions/playground-series-s3e8/discussion/389213), we will encode the categorical features above as ordered integers.
Check out that discussion: it describes every feature in the dataset, which should help you understand the features better and engineer new ones based on them.

### Encoding Cut
Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.

In [13]:
cut_labels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_labels_map = {label: rank for rank, label in enumerate(cut_labels)}
cut_labels_map

{'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}

### Encoding Color
Colour of the cubic zirconia, with D being the best and J the worst.

In [14]:
color_labels = ['D', 'E', 'F', 'G', 'H', 'I', 'J']
color_labels_map = {label: rank for rank, label in enumerate(reversed(color_labels))}
color_labels_map

{'J': 0, 'I': 1, 'H': 2, 'G': 3, 'F': 4, 'E': 5, 'D': 6}

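### Encoding Clarity
Clarity of the cubic zirconia. The labels below are listed from best (FL) to worst (I3) and reversed so that a higher rank means better clarity.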

In [15]:
clarity_labels = ['FL', 'IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1', 'I2', 'I3']
clarity_labels_map = {label: rank for rank, label in enumerate(reversed(clarity_labels))}
clarity_labels_map

{'I3': 0,
 'I2': 1,
 'I1': 2,
 'SI2': 3,
 'SI1': 4,
 'VS2': 5,
 'VS1': 6,
 'VVS2': 7,
 'VVS1': 8,
 'IF': 9,
 'FL': 10}

In [16]:
for dataset in all_datasets.values():
    dataset["cut"] = dataset["cut"].map(cut_labels_map)
    dataset["color"] = dataset["color"].map(color_labels_map)    
    dataset["clarity"] = dataset["clarity"].map(clarity_labels_map)    
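
`Series.map` returns NaN for any label that is missing from the mapping dictionary, so a quick check after encoding can catch typos or unexpected categories. The sketch below illustrates this on a small made-up frame; `toy` and `check_encoded` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Same cut ordering as in the notebook (worst to best).
cut_labels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_labels_map = {label: rank for rank, label in enumerate(cut_labels)}

def check_encoded(df, columns):
    """Return the columns that contain NaN after mapping (i.e. unmapped labels)."""
    return [col for col in columns if df[col].isna().any()]

# Made-up example data; "Excellent" is deliberately not in the mapping.
toy = pd.DataFrame({"cut": ["Fair", "Ideal", "Excellent"]})
toy["cut"] = toy["cut"].map(cut_labels_map)

bad_columns = check_encoded(toy, ["cut"])
print(bad_columns)  # ['cut'], because "Excellent" was mapped to NaN
```

The same check also catches an accidental second run of the encoding cell, since already-encoded integers are not keys of the mapping and would all become NaN.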

# Preprocessing

In [17]:
X = train.drop(columns="price")
y = train.price

# Setting Up Cross Validation
I'll just cross-validate LightGBM here, but you can do the same for all models.

In [25]:
def cross_validate(X, y, X_org=None, y_org=None):
    # we'll use 5 fold cross validation
    N_FOLDS = 5
    cv_scores = np.zeros(N_FOLDS)
    kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)
    
    for fold_id, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
        
        if X_org is not None and y_org is not None:
            X_train = pd.concat([X_train, X_org], axis=0)
            y_train = pd.concat([y_train, y_org], axis=0)
        
        model = lgbm.LGBMRegressor()
        model.fit(X_train, y_train,
                     eval_set=[(X_val, y_val)],
                     eval_metric="rmse",
                     early_stopping_rounds=50,
                     verbose=-1)
        
        y_preds = model.predict(X_val)        
        rmse = mean_squared_error(y_val, y_preds, squared=False)
        cv_scores[fold_id] = rmse
        
        print(f"Fold {fold_id} | rmse: {rmse}")
    
    avg_rmse = np.mean(cv_scores)
    print(f"Avg RMSE across folds: {avg_rmse}")
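
The `squared=False` argument of `mean_squared_error` has been deprecated in newer scikit-learn releases, so depending on the installed version the call above may warn or fail. A version-independent alternative, sketched here with NumPy only, takes the square root of the mean squared error directly. The helper name `rmse` is an assumption for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, computed without scikit-learn."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values: the errors are 0, 0 and 2, so the MSE is 4/3.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example_rmse)
```

Inside `cross_validate`, such a helper could stand in for the `mean_squared_error(..., squared=False)` line without changing the reported scores.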

### using competition data only

In [22]:
cross_validate(X, y)

Fold 0 | rmse: 562.9377075715311
Fold 1 | rmse: 562.6182229269959
Fold 2 | rmse: 592.0190753235149
Fold 3 | rmse: 590.8050806000434
Fold 4 | rmse: 570.6389904127958
Avg RMSE across folds: 575.8038153669762


### using original + competition data

In [23]:
X_original = original.drop(columns="price")
y_original = original.price

In [26]:
cross_validate(X, y, X_original, y_original)

Fold 0 | rmse: 561.5306866850726
Fold 1 | rmse: 563.5696465527328
Fold 2 | rmse: 589.8391954563184
Fold 3 | rmse: 589.8930357428613
Fold 4 | rmse: 569.768468805967
Avg RMSE across folds: 574.9202066485905


## INSIGHTS: 
Including the original dataset helps slightly: the average RMSE drops from about 575.80 to 574.92, although fold 1 gets marginally worse.
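
The per-fold scores from the two runs above can be compared directly to see how consistent the gain is. This is only a sketch of the arithmetic, with the fold scores copied from the printed outputs; the variable names are made up for illustration.

```python
import numpy as np

# Fold scores copied from the two cross-validation runs above.
rmse_competition_only = np.array([
    562.9377075715311, 562.6182229269959, 592.0190753235149,
    590.8050806000434, 570.6389904127958,
])
rmse_with_original = np.array([
    561.5306866850726, 563.5696465527328, 589.8391954563184,
    589.8930357428613, 569.768468805967,
])

# Negative values mean the run with original data scored better on that fold.
diff = rmse_with_original - rmse_competition_only
n_improved = int((diff < 0).sum())

print(np.round(diff, 3))
print(f"Folds improved: {n_improved} of {len(diff)}")
print(f"Mean change in RMSE: {diff.mean():.3f}")
```

Four of the five folds improve, but the mean change is under one RMSE point while fold scores differ from each other by nearly 30 points, so the gain should be read as modest.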

# Training Models

In [27]:
# creating a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, shuffle=True, random_state=1337)

In [28]:
# let's add original data to the mix
X_train = pd.concat([X_train, X_original], axis=0)
y_train = pd.concat([y_train, y_original], axis=0)
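
`pd.concat` aligns frames by column name, so a mismatch between the competition and original columns would introduce NaN-filled columns without raising an error. A minimal sketch of a guard, using made-up frames and a hypothetical helper `concat_checked`:

```python
import pandas as pd

def concat_checked(frames):
    """Concatenate row-wise after verifying all frames share the same columns, in the same order."""
    expected = list(frames[0].columns)
    for frame in frames[1:]:
        if list(frame.columns) != expected:
            raise ValueError(f"Column mismatch: {list(frame.columns)} != {expected}")
    # ignore_index avoids duplicate index labels in the combined frame
    return pd.concat(frames, axis=0, ignore_index=True)

# Made-up frames standing in for X_train and X_original.
a = pd.DataFrame({"carat": [0.3, 0.5], "depth": [61.0, 62.5]})
b = pd.DataFrame({"carat": [1.1], "depth": [60.2]})
combined = concat_checked([a, b])
print(combined.shape)  # (3, 2)
```

If the features are concatenated with `ignore_index=True`, the targets should be concatenated the same way so that both keep matching index labels.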

In [29]:
xgb_model = xgb.XGBRegressor(eval_metric="rmse", early_stopping_rounds=50)
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=50, enable_categorical=False,
             eval_metric='rmse', gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

In [31]:
lgbm_model = lgbm.LGBMRegressor()
lgbm_model.fit(X_train, y_train, 
               eval_set=[(X_val, y_val)],
               eval_metric="rmse",
               early_stopping_rounds=50,
               verbose=-1)

LGBMRegressor()

In [32]:
cat_model = catboost.CatBoostRegressor(eval_metric="RMSE", early_stopping_rounds=50)
cat_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)

<catboost.core.CatBoostRegressor at 0x7f4d74f29650>

# Making Predictions

In [33]:
y_preds_xgb = xgb_model.predict(test)
y_preds_lgbm = lgbm_model.predict(test)
y_preds_cat = cat_model.predict(test)

# Ensembling
We'll use simple average for ensembling but feel free to use more advanced ensembling techniques.

In [34]:
y_preds_final = np.array([y_preds_xgb, y_preds_lgbm, y_preds_cat]).mean(axis=0)
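
One small step beyond a simple average is a weighted average, where models with lower validation error get more weight. The sketch below uses inverse-RMSE weights; the predictions and RMSE values are made up, and `weighted_ensemble` is a hypothetical helper rather than part of this notebook.

```python
import numpy as np

def weighted_ensemble(predictions, val_rmses):
    """Average predictions with weights proportional to 1 / validation RMSE."""
    predictions = np.asarray(predictions, dtype=float)  # shape: (n_models, n_samples)
    weights = 1.0 / np.asarray(val_rmses, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return weights @ predictions

# Made-up predictions from three models on two samples.
preds = [[100.0, 200.0], [110.0, 190.0], [90.0, 210.0]]

# With equal RMSEs this reduces to the plain mean.
equal = weighted_ensemble(preds, [570.0, 570.0, 570.0])

# With a much lower RMSE for the second model, the result moves towards its predictions.
skewed = weighted_ensemble(preds, [570.0, 57.0, 570.0])
print(equal, skewed)
```

In practice the weights would come from each model's RMSE on `(X_val, y_val)`. Because that validation set is also used for early stopping, weights tuned on it may be slightly optimistic.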

# Submission

In [35]:
submission = pd.DataFrame({"id": test_idx, "price": y_preds_final})
submission.head()

| | id | price |
|---|---|---|
| 0 | 193573 | 873.746079 |
| 1 | 193574 | 2578.418743 |
| 2 | 193575 | 2312.299368 |
| 3 | 193576 | 841.74094 |
| 4 | 193577 | 5813.526453 |


In [36]:
submission.to_csv("submission.csv", index=False)