<a href="https://www.kaggle.com/code/khawajaabaidullah/ps3e8-starting-strong-ensembling-gdbts?scriptVersionId=119927826" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction
In this notebook, we will:
1. Encode categorical features using the feature descriptions provided with the original dataset.
2. Ensemble gradient-boosted tree models, specifically XGBoost, LightGBM, and CatBoost.
3. Combine the original dataset with the competition's dataset.


# Purpose
This notebook is meant to serve as a simple but strong baseline that you can build on as you engineer features and tune your models.

# Imports

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from pathlib import Path
import xgboost as xgb
import lightgbm as lgbm
import catboost
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
from IPython.display import display
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from category_encoders import LeaveOneOutEncoder
import optuna

In [None]:
from warnings import filterwarnings
filterwarnings("ignore")

#### NOTE:
If you are interested in dataset insights and EDA, check out this excellent [notebook](https://www.kaggle.com/code/craigmthomas/play-s3e8-eda-models) by Craig Thomas (his notebooks are always awesome!).

# Loading Data

In [None]:
BASE_PATH = Path("/kaggle/input/playground-series-s3e8")
train = pd.read_csv(BASE_PATH / "train.csv").drop(columns="id")
test = pd.read_csv(BASE_PATH  / "test.csv")
test_idx = test.id
test = test.drop(columns="id")

# Craig Thomas has shown in his excellent notebook (linked above) that the original
# dataset is pretty similar to the competition's one, so combining the original and
# competition datasets should hopefully boost our score.

original = pd.read_csv("/kaggle/input/gemstone-price-prediction/cubic_zirconia.csv").drop(columns="Unnamed: 0")

print(f"Loaded train with {len(train)} rows.")
print(f"Loaded test with {len(test)} rows.")
print(f"Loaded original with {len(original)} rows.")

In [None]:
all_datasets = {"train": train,
               "test": test,
               "original": original}

# Checking for Null values

In [None]:
pd.concat([dataset.isnull().sum().rename(f"Missing in {dataset_name}") 
               for dataset_name, dataset in all_datasets.items()],
                 axis=1)

## INSIGHTS: 
Only the original dataset contains missing values (697 in total), and we'll simply drop the affected rows. No other dataset contains any missing values, so devising and applying an imputation technique is not worth the time. Imputed rows may also be noisier than the rest of the data, which could hurt the model's performance.

In [None]:
original.dropna(axis=0, how="any", inplace=True)
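
Before dropping, it can be worth checking how many rows are actually affected, since a count of missing values is not necessarily a count of affected rows. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original dataset; values are illustrative only.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, np.nan, 1.1],
    "depth": [61.0, np.nan, np.nan, 62.5],
    "price": [500, 900, 1200, 4000],
})

missing_values = int(toy.isnull().sum().sum())        # total missing cells
affected_rows = int(toy.isnull().any(axis=1).sum())   # rows with at least one missing cell

cleaned = toy.dropna(axis=0, how="any")
print(f"{missing_values} missing values across {affected_rows} rows; "
      f"{len(cleaned)} rows remain after dropping.")
```

If the affected rows are a small share of the data, as is assumed here, dropping them is usually a reasonable shortcut.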

# Identifying categorical features

In [None]:
pd.concat([train.dtypes.rename("Data Type")] + \
          [dataset.nunique().rename(f"{dataset_name} UniqueValues") for dataset_name, dataset in all_datasets.items()],
          axis=1).sort_values(by="train UniqueValues")

In [None]:
cat_features = ["cut", "color", "clarity"]

# Encoding Categorical Features
Using the feature descriptions from this [discussion](https://www.kaggle.com/competitions/playground-series-s3e8/discussion/389213), we will encode the categorical features above as ordinal ranks.
Check out that discussion: it describes every feature in the dataset, which should help you understand the features better and engineer new ones based on them.

### Encoding Cut
Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.

In [None]:
cut_labels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_labels_map = {label: rank for rank, label in enumerate(cut_labels)}
cut_labels_map

### Encoding Color
Colour of the cubic zirconia, with D being the best and J the worst. The labels are reversed before ranking so that better colours get a higher rank.

In [None]:
color_labels = ['D', 'E', 'F', 'G', 'H', 'I', 'J']
color_labels_map = {label: rank for rank, label in enumerate(reversed(color_labels))}
color_labels_map

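### Encoding Clarity
Clarity grades are listed from best (FL) to worst (I3). As with color, the list is reversed before ranking so that better grades get a higher rank.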

In [None]:
clarity_labels = ['FL', 'IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1', 'I2', 'I3']
clarity_labels_map = {label: rank for rank, label in enumerate(reversed(clarity_labels))}
clarity_labels_map

In [None]:
for dataset in all_datasets.values():
    dataset["cut"] = dataset["cut"].map(cut_labels_map)
    dataset["color"] = dataset["color"].map(color_labels_map)    
    dataset["clarity"] = dataset["clarity"].map(clarity_labels_map)    
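
`Series.map` silently turns any label that is missing from the dictionary into `NaN`, so a quick coverage check after encoding is cheap insurance. A minimal sketch on toy data follows; the helper name `rank_map` is hypothetical and not part of this notebook.

```python
import pandas as pd

def rank_map(labels_worst_to_best):
    """Map each label to its rank, with 0 for the worst grade."""
    return {label: rank for rank, label in enumerate(labels_worst_to_best)}

cut_map = rank_map(["Fair", "Good", "Very Good", "Premium", "Ideal"])
# Color is documented best-to-worst (D..J), so reverse it first.
color_map = rank_map(list(reversed(["D", "E", "F", "G", "H", "I", "J"])))

# Toy rows, purely for illustration.
toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium"],
                    "color": ["D", "J", "G"]})
encoded = toy.assign(cut=toy["cut"].map(cut_map),
                     color=toy["color"].map(color_map))

# Any NaN here would mean a label was missing from its mapping.
unmapped = encoded.isnull().sum()
assert unmapped.sum() == 0, f"Unmapped labels found:\n{unmapped}"
print(encoded)
```

The same check also catches an encoding cell that is accidentally run twice, because the already-encoded integers are not keys of the mapping and would all become `NaN`.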

# Preprocessing

In [None]:
X = train.drop(columns="price")
y = train.price

# Setting Up Cross Validation
I'll only cross-validate LightGBM here, but you can do the same for all models.

In [None]:
def cross_validate(X, y, X_org=None, y_org=None):
    # we'll use 5 fold cross validation
    N_FOLDS = 5
    cv_scores = np.zeros(N_FOLDS)
    kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)
    
    for fold_id, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
        
        if X_org is not None and y_org is not None:
            X_train = pd.concat([X_train, X_org], axis=0)
            y_train = pd.concat([y_train, y_org], axis=0)
        
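        model = lgbm.LGBMRegressor(verbose=-1)
        # early stopping is passed as a callback; the early_stopping_rounds
        # and verbose arguments of fit() were removed in LightGBM 4.0
        model.fit(X_train, y_train,
                     eval_set=[(X_val, y_val)],
                     eval_metric="rmse",
                     callbacks=[lgbm.early_stopping(50, verbose=False)])
        
        y_preds = model.predict(X_val)        
        # take the square root explicitly; squared=False is deprecated in recent scikit-learn
        rmse = np.sqrt(mean_squared_error(y_val, y_preds))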
        cv_scores[fold_id] = rmse
        
        print(f"Fold {fold_id} | rmse: {rmse}")
    
    avg_rmse = np.mean(cv_scores)
    print(f"Avg RMSE across folds: {avg_rmse}")
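
To repeat this for other models, one option is to pass in a factory that builds a fresh model for each fold. The sketch below is a variant under stated assumptions, not the function used in this notebook: `cross_validate_model` and `make_model` are hypothetical names, synthetic data stands in for the competition data, and scikit-learn's `LinearRegression` stands in for the boosted models so that the example runs without early stopping.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cross_validate_model(make_model, X, y, X_org=None, y_org=None, n_folds=5):
    """Return per-fold RMSE; extra (original) rows are added to training folds only."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1337)
    scores = np.zeros(n_folds)
    for fold_id, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
        if X_org is not None and y_org is not None:
            X_train = pd.concat([X_train, X_org], axis=0)
            y_train = pd.concat([y_train, y_org], axis=0)
        model = make_model()  # fresh, unfitted model for every fold
        model.fit(X_train, y_train)
        scores[fold_id] = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    return scores

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(X_demo.to_numpy() @ np.array([1.0, -2.0, 0.5])
                   + rng.normal(scale=0.1, size=200))

scores = cross_validate_model(LinearRegression, X_demo, y_demo)
print(f"Avg RMSE across folds: {scores.mean():.4f}")
```

Returning the per-fold scores instead of only printing them makes it easy to compare models or data variants side by side. Keeping the original rows out of the validation folds means the score still reflects the competition data alone.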

### Using competition data only

In [None]:
cross_validate(X, y)

### Using original + competition data

In [None]:
X_original = original.drop(columns="price")
y_original = original.price

In [None]:
cross_validate(X, y, X_original, y_original)

## INSIGHTS: Looks like including the original dataset does help!

# Training Models

In [None]:
# creating a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, shuffle=True, random_state=1337)

In [None]:
# let's add original data to the mix
X_train = pd.concat([X_train, X_original], axis=0)
y_train = pd.concat([y_train, y_original], axis=0)

In [None]:
xgb_model = xgb.XGBRegressor(eval_metric="rmse", early_stopping_rounds=50)
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

In [None]:
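lgbm_model = lgbm.LGBMRegressor(verbose=-1)
# early stopping is passed as a callback; the early_stopping_rounds
# and verbose arguments of fit() were removed in LightGBM 4.0
lgbm_model.fit(X_train, y_train, 
               eval_set=[(X_val, y_val)],
               eval_metric="rmse",
               callbacks=[lgbm.early_stopping(50, verbose=False)])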

In [None]:
cat_model = catboost.CatBoostRegressor(eval_metric="RMSE", early_stopping_rounds=50)
cat_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)

# Making Predictions

In [None]:
y_preds_xgb = xgb_model.predict(test)
y_preds_lgbm = lgbm_model.predict(test)
y_preds_cat = cat_model.predict(test)

# Ensembling
We'll use a simple average for ensembling, but feel free to use more advanced ensembling techniques.

In [None]:
y_preds_final = np.array([y_preds_xgb, y_preds_lgbm, y_preds_cat]).mean(axis=0)
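
One slightly more advanced option is a weighted average whose weights come from each model's validation error. The sketch below uses inverse-RMSE weights on made-up numbers; the `blend` helper and the RMSE values are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def blend(preds, val_rmses):
    """Weighted average of model predictions with weights proportional to 1 / RMSE."""
    preds = np.asarray(preds, dtype=float)         # shape: (n_models, n_samples)
    weights = 1.0 / np.asarray(val_rmses, dtype=float)
    weights /= weights.sum()                        # normalise so the weights sum to 1
    return weights @ preds, weights

# Made-up predictions from three models for four samples.
toy_preds = [[100.0, 200.0, 300.0, 400.0],
             [110.0, 190.0, 310.0, 390.0],
             [ 90.0, 210.0, 290.0, 410.0]]
toy_rmses = [580.0, 575.0, 570.0]                   # hypothetical validation RMSEs

blended, weights = blend(toy_preds, toy_rmses)
print("weights:", weights.round(4))
print("blended:", blended.round(2))
```

With equal RMSEs this reduces to the simple average. Weights derived from the same validation set that was used for early stopping may be somewhat optimistic, so treat them as a starting point.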

# Submission

In [None]:
submission = pd.DataFrame({"id": test_idx, "price": y_preds_final})
submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)