# First Place - Single Model - LB 38.81
I'm excited to share my `First Place - Gold Medal - Single Model` solution to Kaggle's playground backpack prediction competition. We begin with my starter notebook [here][1] and then add more engineered features using [RAPIDS cuDF-Pandas][2]. The notebook you are reading has 138 features. This is a simplified version of my actual final solution, which has 500 features! Nonetheless, the notebook you are reading achieves first place accuracy!

A simple way to boost the performance of the notebook you are reading is to train 20 folds inside 20 folds instead of 7 inside 7 (or 10 in 10, or 15 in 15). We can also decrease the XGBoost learning rate from 0.01 to 0.005 (and change early stopping to 1000). This will of course increase runtime, but it will also improve model accuracy!

Another way to boost this notebook's accuracy is to include the additional 362 features from my actual final solution. The text below is the introduction from my original starter notebook:

[1]: https://www.kaggle.com/code/cdeotte/feature-engineering-with-rapids-lb-38-847
[2]: https://rapids.ai/cudf-pandas/

# Feature Engineering with Fast cuDF-Pandas!
One of the most powerful feature engineering techniques is `groupby(COL1)[COL2].agg(STAT)`. Here we group by the `COL1` column and aggregate (i.e. compute) a statistic `STAT` over another column `COL2`. This is the underlying method to compute `target encoding` and `count encoding`. By computing raw statistics and inputting them into our model, our model can learn more than it could from `TE` or `CE` alone. This notebook illustrates creating 50 engineered features, but we can create hundreds more and improve CV score and LB score!
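
As a minimal sketch of this pattern on toy data (the column names and values below are made up for illustration and are not the competition's data), we can compute several statistics per group and merge them back onto every row as new features. When `COL2` is the target, the `mean` column is a target encoding, and the `count` column is a count encoding.

```python
import pandas as pd

# Toy data: names and values are illustrative only.
df = pd.DataFrame({
    "COL1": ["a", "a", "b", "b", "b"],
    "COL2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per group, one column per statistic.
stats = df.groupby("COL1")["COL2"].agg(["mean", "count", "std"])
stats.columns = [f"COL2_{s}_by_COL1" for s in stats.columns]

# Attach the group statistics to every row as new feature columns.
df = df.merge(stats, on="COL1", how="left")
print(df)
```

Because this is ordinary Pandas code, it should run unchanged on GPU once `%load_ext cudf.pandas` has been executed.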

When our dataset has millions of rows, like Kaggle's Backpack competition, `groupby` operations take time to compute. The fastest way to compute a `groupby` aggregation is to use a GPU with the [RAPIDS cuDF-Pandas][1] library.

There are two ways to use [RAPIDS cuDF][2]. We can write cuDF code, which looks just like Pandas code and starts with `import cudf`. Or we can write normal Pandas code with `import pandas`, but before that we add the magic command `%load_ext cudf.pandas`. After this magic command, all calls to Pandas will use [RAPIDS cuDF][2] behind the scenes, taking advantage of the massive speed boost of GPU!

Alternatively, we can use [cuDF-Polars][3]. To use [cuDF-Polars][3], we write Polars code with a lazy frame. Then the final call includes `.collect(engine="gpu")`, which will run all previous Polars code behind the scenes with [RAPIDS cuDF][2].

[1]: https://rapids.ai/cudf-pandas/
[2]: https://docs.rapids.ai/install/
[3]: https://rapids.ai/polars-gpu-engine/

# RAPIDS v25.02
[RAPIDS v25.02][1] was just released Feb 15, 2025! Instructions for installing RAPIDS are [here][1]. On Kaggle, the easiest way to pip install new libraries is to do it once in a `Utility Script` notebook. Then whenever we attach the `Utility Script` notebook to another Kaggle notebook, the second Kaggle notebook immediately gets the benefit of the pip installed libraries. We created a [RAPIDS 25.02][1] `Utility Script` [here][2], and we attach it to the notebook you are reading. Therefore the notebook you are reading can import RAPIDS v25.02 without needing to pip install!

[1]: https://docs.rapids.ai/install/
[2]: https://www.kaggle.com/code/cdeotte/rapids-cudf-25-02-cuml-25-02

# GPU Acceleration
We activate [cuDF-Pandas][1] with the magic command `%load_ext cudf.pandas` below. Afterward, all calls to Pandas will use fast GPU [RAPIDS cuDF][2] behind the scenes! Since we attached the `Utility Script` notebook [here][3] to the notebook you are reading, we will be using the new [RAPIDS v25.02][2]!

[1]: https://rapids.ai/cudf-pandas/
[2]: https://docs.rapids.ai/install/
[3]: https://www.kaggle.com/code/cdeotte/rapids-cudf-25-02-cuml-25-02

In [None]:
%load_ext cudf.pandas

import numpy as np, pandas as pd, gc
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)

VER=1

# Load Data
We load train, train extra, and test data. The combined train data has 4 million rows! This means we do not need to fear overfitting train. We can make hundreds/thousands of new features and every time our CV improves our LB will improve too!

In [None]:
train = pd.read_csv("/kaggle/input/playground-series-s5e2/train.csv")
print("Train shape", train.shape )
train.head()

In [None]:
train2 = pd.read_csv("/kaggle/input/playground-series-s5e2/training_extra.csv")
print("Extra Train shape", train2.shape )
train2.head()

In [None]:
train = pd.concat([train,train2],axis=0,ignore_index=True)
print("Combined Train shape", train.shape)

In [None]:
test = pd.read_csv("/kaggle/input/playground-series-s5e2/test.csv")
print("Test shape", test.shape )
test.head()

# Feature Engineer Columns

First we will engineer 8 new columns by combining each categorical column with Weight Capacity as was done in my starter notebook. Next we engineer 63 more columns listed below:
* One column to indicate all NANs using a base-2 encoding
* NANs per feature combined with Weight Capacity
* Bin Weight Capacity by rounding it in different ways
* Merge original dataset price
* Extract digits from Weight Capacity
* Combine digit features
* Combine original categorical columns

In [None]:
CATS = list(train.columns[1:-2])
print(f"There are {len(CATS)} categorical columns:")
print( CATS )
print("There is 1 numerical column:")
print( ["Weight Capacity (kg)"] )

In [None]:
COMBO = ["NaNs"]
train["NaNs"] = np.float32(0)
test["NaNs"] = np.float32(0)

for i,c in enumerate(CATS):

    # NEW FEATURE - ENCODE ALL NAN AS ONE BASE-2 FEATURE
    train["NaNs"] += train[c].isna()*2**i
    test["NaNs"] += test[c].isna()*2**i

    # NEW FEATURE - COMBINE EACH COLUMN'S NAN WITH WEIGHT CAPACITY
    n = f"{c}_nan_wc"
    train[n] = train[c].isna()*100 + train["Weight Capacity (kg)"]
    test[n] = test[c].isna()*100 + test["Weight Capacity (kg)"]
    COMBO.append(n)
    
    combine = pd.concat([train[c],test[c]],axis=0)
    combine,_ = pd.factorize(combine)
    train[c] = combine[:len(train)].astype("float32")
    test[c] = combine[len(train):].astype("float32")
    n = f"{c}_wc"
    train[n] = train[c]*100 + train["Weight Capacity (kg)"]
    test[n] = test[c]*100 + test["Weight Capacity (kg)"]
    COMBO.append(n)

In [None]:
# NEW FEATURE - BIN WEIGHT CAPACITY USING ROUNDING
for k in range(7,10):
    n = f"round{k}"
    train[n] = train["Weight Capacity (kg)"].round(k)
    test[n] = test["Weight Capacity (kg)"].round(k)
    COMBO.append(n)

In [None]:
# NEW FEATURE - ORIGINAL DATASET PRICE
NEW_COLS = []
orig = pd.read_csv("/kaggle/input/student-bag-price-prediction-dataset/Noisy_Student_Bag_Price_Prediction_Dataset.csv")
tmp = orig.groupby("Weight Capacity (kg)").Price.mean()
tmp.name = "orig_price"
train = train.merge(tmp, on="Weight Capacity (kg)", how="left")
test = test.merge(tmp, on="Weight Capacity (kg)", how="left")
NEW_COLS.append("orig_price")

In [None]:
# NEW FEATURE - ORIGINAL DATASET PRICE FROM ROUNDED WEIGHT CAPACITY 
for k in range(7,10):
    n = f"round{k}"
    orig[n] = orig["Weight Capacity (kg)"].round(k)
    tmp = orig.groupby(n).Price.mean()
    tmp.name = f"orig_price_r{k}"
    train = train.merge(tmp, on=n, how="left")
    test = test.merge(tmp, on=n, how="left")
    NEW_COLS.append(f"orig_price_r{k}")

In [None]:
# NEW FEATURE - DIGIT EXTRACTION FROM WEIGHT CAPACITY
for k in range(1,10):
    train[f'digit{k}'] = ((train['Weight Capacity (kg)'] * 10**k) % 10).fillna(-1).astype("int8")
    test[f'digit{k}'] = ((test['Weight Capacity (kg)'] * 10**k) % 10).fillna(-1).astype("int8")
DIGITS = [f"digit{k}" for k in range(1,10)]

In [None]:
# NEW FEATURE - COMBINATIONS OF DIGITS 
for i in range(4):
    for j in range(i+1,5):
        n = f"digit_{i+1}_{j+1}"
        train[n] = ((train[f'digit{i+1}']+1)*11 + train[f'digit{j+1}']+1).astype("int8")
        test[n] = ((test[f'digit{i+1}']+1)*11 + test[f'digit{j+1}']+1).astype("int8")
        COMBO.append(n)

In [None]:
# NEW FEATURE - COMBINATIONS OF CATS
PAIRS = []
for i,c1 in enumerate(CATS[:-1]):
    for j,c2 in enumerate(CATS[i+1:]):
        n = f"{c1}_{c2}"
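        # c2 CODES RUN FROM -1 (NAN) TO m2-1, SO c2+1 TAKES m2+1 VALUES
        m2 = train[c2].max()+1
        # ENCODE THE PAIR AS ONE INTEGER EQUAL TO (c1+1)*(m2+1) + (c2+1)
        train[n] = ((train[c1]+1 + (train[c2]+1)/(m2+1))*(m2+1)).astype("int8")
        test[n] = ((test[c1]+1 + (test[c2]+1)/(m2+1))*(m2+1)).astype("int8")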
        COMBO.append(n)
        PAIRS.append(n)

In [None]:
print(f"New Train shape:", train.shape )
train.head()

In [None]:
FEATURES = CATS + ["Weight Capacity (kg)"] + COMBO + DIGITS + NEW_COLS
print(f"We now have {len(FEATURES)} columns:")
print( FEATURES )

# XGBoost with Feature Engineer GroupBy
We train XGBoost with nested folds. We use the inner nested fold to create new features that aggregate the target `Price`. We must do this to prevent target leakage. And we use the outer fold to create new features that do not aggregate the target `Price`. In each k fold loop, we engineer new features using the advanced feature engineering technique `groupby(COL1)[COL2].agg(STAT)`. Since we are using [RAPIDS cuDF-Pandas][1], these groupby computations will run fast on GPU! And we will train our model quickly on GPU using XGBoost!
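
The nested-fold idea can be sketched on synthetic data as follows. This is a simplified illustration, not the notebook's exact code: `split_folds`, `group_mean`, and the `key` column are hypothetical helpers and names, and only the first outer fold is shown. Training rows receive a target mean computed from the other inner folds, so no row sees its own `Price`. Validation rows can use every training row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 3, size=12),    # hypothetical grouping column
    "Price": rng.normal(80, 20, size=12),  # synthetic target
})

def split_folds(n_rows, n_folds, seed):
    """Shuffle row positions and split them into n_folds validation sets."""
    return np.array_split(np.random.default_rng(seed).permutation(n_rows), n_folds)

def group_mean(source, target):
    """Mean Price per key from `source`, looked up for each row of `target`."""
    means = source.groupby("key")["Price"].mean()
    return target["key"].map(means).to_numpy()

# Outer split (only the first outer fold is shown).
outer_folds = split_folds(len(df), n_folds=3, seed=1)
X_valid = df.iloc[outer_folds[0]].reset_index(drop=True)
X_train = df.iloc[np.concatenate(outer_folds[1:])].reset_index(drop=True)

# Inner folds: each training row gets a mean computed without its own fold.
X_train["TE_key_mean"] = np.nan
for inner_valid in split_folds(len(X_train), n_folds=4, seed=2):
    inner_mask = np.zeros(len(X_train), dtype=bool)
    inner_mask[inner_valid] = True
    X_train.loc[inner_mask, "TE_key_mean"] = group_mean(
        X_train[~inner_mask], X_train[inner_mask]
    )

# Outer fold: validation rows may use all training rows.
X_valid["TE_key_mean"] = group_mean(X_train, X_valid)
```

A key that never appears in the other inner folds yields `NaN`, which XGBoost handles natively.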

**NEW FEATURES** Compared with our starter notebook, we add two new types of aggregations. Namely we groupby and compute quantiles. And we groupby and compute histogram bin counts! We also create two new division features. We divide aggregated count by nunique. And we divide aggregated std by count. (We also removed some features from my starter notebook which saves memory).
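
Here is a small sketch of these aggregation types on toy data. The `key` column, the prices, and the bin settings are illustrative assumptions. The histogram is built with a plain loop over groups rather than the `apply` call used in the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "key":   [1, 1, 1, 1, 2, 2],
    "Price": [20.0, 40.0, 60.0, 140.0, 30.0, 30.0],
})
grouped = toy.groupby("key")["Price"]

# Basic statistics plus the two division features.
feats = grouped.agg(["count", "nunique", "std"])
feats["count_per_nunique"] = feats["count"] / feats["nunique"]
feats["std_per_count"] = feats["std"] / feats["count"]

# Quantiles: one new column per requested level.
for q in (10, 90):
    feats[f"quantile_{q}"] = grouped.quantile(q / 100)

# Histogram bin counts over a fixed price range (illustrative settings).
BINS, LO, HI = 3, 15, 150
hist = {k: np.histogram(p, bins=BINS, range=(LO, HI))[0] for k, p in grouped}
hist = pd.DataFrame.from_dict(
    hist, orient="index", columns=[f"histogram_{b}" for b in range(BINS)]
)
feats = feats.join(hist)
print(feats)
```

The quantile and histogram columns describe the shape of each group's price distribution, which a single mean cannot capture.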

**UPDATE** We reduce all engineered features to `float32` to reduce memory usage. But we keep the original column Weight Capacity (and its combinations) as `float64` so as not to lose the original digits.
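
A quick sketch of why the precision matters, using two made-up weights: values that are distinct as `float64` can collapse to the same `float32` value, which would silently merge their `groupby` groups.

```python
import pandas as pd

# Two made-up weights that differ only in the 8th decimal place.
w = pd.Series([21.12345678, 21.12345679])

n_groups_64 = w.nunique()                    # distinct keys as float64
n_groups_32 = w.astype("float32").nunique()  # distinct keys after downcasting
print(n_groups_64, n_groups_32)
```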

[1]: https://rapids.ai/cudf-pandas/

In [None]:
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
import xgboost as xgb
print(f"XGBoost version",xgb.__version__)

In [None]:
# STATISTICS TO AGGREGATE FOR OUR FEATURE GROUPS
STATS = ["mean","std","count","nunique","median","min","max","skew"]
STATS2 = ["mean"]

In [None]:
# QUANTILES AND HISTOGRAM BINS TO AGGREGATE
BINS=10
QUANTILES = [5,10,40,45,55,60,90,95]
def make_histogram(prices, bins=BINS, range_min=15, range_max=150):
    hist, _ = np.histogram(prices, bins=bins, range=(range_min, range_max))
    return hist

In [None]:
%%time

FOLDS = 7
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)

oof = np.zeros((len(train)))
pred = np.zeros((len(test)))

# OUTER K FOLD
for i, (train_index, test_index) in enumerate(kf.split(train)):
    print(f"### OUTER Fold {i+1} ###")

    X_train = train.loc[train_index,FEATURES+['Price']].reset_index(drop=True).copy()
    y_train = train.loc[train_index,'Price']

    X_valid = train.loc[test_index,FEATURES].reset_index(drop=True).copy()
    y_valid = train.loc[test_index,'Price']

    X_test = test[FEATURES].reset_index(drop=True).copy()

    # INNER K FOLD (TO PREVENT LEAKAGE WHEN USING PRICE)
    kf2 = KFold(n_splits=FOLDS, shuffle=True, random_state=42)   
    for j, (train_index2, test_index2) in enumerate(kf2.split(X_train)):
        print(f" ## INNER Fold {j+1} (outer fold {i+1}) ##")

        X_train2 = X_train.loc[train_index2,FEATURES+['Price']].copy()
        X_valid2 = X_train.loc[test_index2,FEATURES].copy()

        ### FEATURE SET 1 (uses price) ###
        col = "Weight Capacity (kg)"
        tmp = X_train2.groupby(col).Price.agg(STATS)
        tmp.columns = [f"TE1_wc_{s}" for s in STATS]
        X_valid2 = X_valid2.merge(tmp, on=col, how="left")
        for c in tmp.columns:
            X_train.loc[test_index2,c] = X_valid2[c].values.astype("float32")

        ### FEATURE SET 2 (uses price) ###
        for col in COMBO:
            tmp = X_train2.groupby(col).Price.agg(STATS2)
            tmp.columns = [f"TE2_{col}_{s}" for s in STATS2]
            X_valid2 = X_valid2.merge(tmp, on=col, how="left")
            for c in tmp.columns:
                X_train.loc[test_index2,c] = X_valid2[c].values.astype("float32")

        # AGGREGATE QUANTILES (uses price)
        for k in QUANTILES:
            result = X_train2.groupby('Weight Capacity (kg)').agg({'Price': lambda x: x.quantile(k/100)})
            result.columns = [f"quantile_{k}"]
            X_valid2 = X_valid2.merge(result, on="Weight Capacity (kg)", how="left")
            X_train.loc[test_index2,f"quantile_{k}"] = X_valid2[f"quantile_{k}"].values.astype("float32")

        # AGGREGATE HISTOGRAMS (uses price)
        tmp = X_train2.loc[~X_train2.orig_price.isna()].groupby("Weight Capacity (kg)")[["Price"]].agg("count")
        tmp.columns = ['ct']
        X_train3 = X_train2.merge(tmp.loc[tmp['ct']>1],on="Weight Capacity (kg)",how="left")
        X_train3 = X_train3.loc[~X_train3['ct'].isna()]
        result = X_train3.groupby("Weight Capacity (kg)")["Price"].apply(make_histogram)
        result = result.to_frame()['Price'].apply(pd.Series)
        result.columns = [f"histogram_{x}" for x in range(BINS)]
        X_valid2 = X_valid2.merge(result, on="Weight Capacity (kg)", how="left")
        for c in [f"histogram_{x}" for x in range(BINS)]:
            X_train.loc[test_index2,c] = X_valid2[c].values.astype("float32")
            
        del result, X_train3, tmp
        del X_train2, X_valid2
        gc.collect()

    ### FEATURE SET 1 (uses price) ###
    col = "Weight Capacity (kg)"
    tmp = X_train.groupby(col).Price.agg(STATS)
    tmp.columns = [f"TE1_wc_{s}" for s in STATS]
    tmp = tmp.astype("float32")
    X_valid = X_valid.merge(tmp, on=col, how="left")
    X_test = X_test.merge(tmp, on=col, how="left")

    ### FEATURE SET 2 (uses price) ###
    for col in COMBO:
        tmp = X_train.groupby(col).Price.agg(STATS2)
        tmp.columns = [f"TE2_{col}_{s}" for s in STATS2]
        tmp = tmp.astype("float32")
        X_valid = X_valid.merge(tmp, on=col, how="left")
        X_test = X_test.merge(tmp, on=col, how="left")

    # AGGREGATE QUANTILES (uses price)
    for k in QUANTILES:
        result = X_train.groupby('Weight Capacity (kg)').agg({'Price': lambda x: x.quantile(k/100)})
        result.columns = [f"quantile_{k}"]
        result = result.astype("float32")
        X_valid = X_valid.merge(result, on="Weight Capacity (kg)", how="left")
        X_test = X_test.merge(result, on="Weight Capacity (kg)", how="left")

    # AGGREGATE HISTOGRAMS (uses price)
    tmp = X_train.loc[~X_train.orig_price.isna()].groupby("Weight Capacity (kg)")[["Price"]].agg("count")
    tmp.columns = ['ct']
    X_train3 = X_train.merge(tmp.loc[tmp['ct']>1],on="Weight Capacity (kg)",how="left")
    X_train3 = X_train3.loc[~X_train3['ct'].isna()]
    result = X_train3.groupby("Weight Capacity (kg)")["Price"].apply(make_histogram)
    result = result.to_frame()['Price'].apply(pd.Series)
    result.columns = [f"histogram_{x}" for x in range(BINS)]
    result = result.astype("float32")
    X_valid = X_valid.merge(result, on="Weight Capacity (kg)", how="left")
    X_test = X_test.merge(result, on="Weight Capacity (kg)", how="left")
    del result, X_train3, tmp

    # COUNT PER NUNIQUE
    X_train['TE1_wc_count_per_nunique'] = X_train['TE1_wc_count']/X_train['TE1_wc_nunique']
    X_valid['TE1_wc_count_per_nunique'] = X_valid['TE1_wc_count']/X_valid['TE1_wc_nunique']
    X_test['TE1_wc_count_per_nunique'] = X_test['TE1_wc_count']/X_test['TE1_wc_nunique']
    
    # STD PER COUNT
    X_train['TE1_wc_std_per_count'] = X_train['TE1_wc_std']/X_train['TE1_wc_count']
    X_valid['TE1_wc_std_per_count'] = X_valid['TE1_wc_std']/X_valid['TE1_wc_count']
    X_test['TE1_wc_std_per_count'] = X_test['TE1_wc_std']/X_test['TE1_wc_count']

    # CONVERT TO CATS SO XGBOOST RECOGNIZES THEM
    X_train[CATS+DIGITS] = X_train[CATS+DIGITS].astype("category")
    X_valid[CATS+DIGITS] = X_valid[CATS+DIGITS].astype("category")
    X_test[CATS+DIGITS] = X_test[CATS+DIGITS].astype("category")

    # DROP PRICE THAT WAS USED FOR TARGET ENCODING
    X_train = X_train.drop(['Price'],axis=1)

    # DROP NON-TE CAT PAIRS
    X_train = X_train.drop(PAIRS,axis=1)
    X_valid = X_valid.drop(PAIRS,axis=1)
    X_test = X_test.drop(PAIRS,axis=1)

    # BUILD MODEL
    model = XGBRegressor(
        device="cuda",
        max_depth=6,  
        colsample_bynode=0.3, 
        subsample=0.8,  
        n_estimators=50_000,  
        learning_rate=0.01,  
        enable_categorical=True,
        min_child_weight=10,
        early_stopping_rounds=500,
    )
    
    # TRAIN MODEL
    COLS = X_train.columns
    model.fit(
        X_train[COLS], y_train,
        eval_set=[(X_valid[COLS], y_valid)],  
        verbose=500,
    )

    # PREDICT OOF AND TEST
    oof[test_index] = model.predict(X_valid[COLS])
    pred += model.predict(X_test[COLS])

    # CLEAR MEMORY
    del X_train, X_valid, X_test
    del y_train, y_valid
    if i != FOLDS-1: del model
    gc.collect()

pred /= FOLDS

# Overall CV Score
Below we display the overall CV score (RMSE) and save the OOF predictions to disk so we can use them later to help find ensemble weights with our other models.
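
As a hedged sketch of how saved OOF predictions might later be used to pick ensemble weights, the example below grid-searches a blend weight between two models. Everything here is synthetic: `oof_a` and `oof_b` are stand-ins for OOF arrays from two models, and in practice they would be loaded from the saved files.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, the same metric as the CV score."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

rng = np.random.default_rng(0)
true = rng.normal(80, 30, size=1000)         # synthetic targets
oof_a = true + rng.normal(0, 10, size=1000)  # stand-in for model A's OOF
oof_b = true + rng.normal(0, 12, size=1000)  # stand-in for model B's OOF

# Try blend weights w in [0, 1] and keep the one with the lowest OOF RMSE.
weights = np.linspace(0, 1, 101)
scores = [rmse(true, w * oof_a + (1 - w) * oof_b) for w in weights]
best_w = float(weights[int(np.argmin(scores))])

print(f"best weight for model A = {best_w:.2f}, blended RMSE = {min(scores):.3f}")
```

The same weight would then be applied to the two models' test predictions.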

In [None]:
# COMPUTE OVERALL CV SCORE
true = train.Price.values
s = np.sqrt(np.mean( (oof-true)**2.0 ) )
print(f"=> Overall CV Score = {s}")

In [None]:
# SAVE OOF TO DISK FOR ENSEMBLES
np.save(f"oof_v{VER}",oof)
print("Saved oof to disk")

# Feature Names
Below we list all our engineered features. We are using 138 features in total!

In [None]:
print(f"\nIn total, we used {len(COLS)} features, Wow!\n")
print( list(COLS) )

# XGB Feature Importance
Here are the top 100 XGBoost feature importances sorted by `gain`.

In [None]:
import xgboost as xgb
fig, ax = plt.subplots(figsize=(10, 20))
xgb.plot_importance(model, max_num_features=100, importance_type='gain',ax=ax)
plt.title("Top 100 Feature Importances (XGBoost)")
plt.show()

# Make Submission CSV
We save our test predictions to submission.csv and plot our predictions. 

In [None]:
sub = pd.read_csv("/kaggle/input/playground-series-s5e2/sample_submission.csv")
sub.Price = pred
sub.to_csv(f"submission_v{VER}.csv",index=False)
sub.head()

In [None]:
plt.figure(figsize=(6,4))
plt.hist(sub.Price,bins=100)
plt.title("Test Predictions")
plt.show()