# Baseline Models
Kaggle's Backpack Prediction Challenge contains very little signal: the target appears to be mostly random, as noted in several discussion topics, including [this one][1]. In this notebook we create two baselines and try to extract a little signal.

Because this competition's dataset is synthetic, the numerical column `Weight Capacity (kg)` will most likely carry some artifact signal. Although the column is numerical, synthetic generation may repeat exact values and give those repeats similar targets. Even if the original targets were random, such repeated value and target pairs would let us predict the target for any row whose weight value also appears in the training data. Thus there may be a little signal here.
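One way to check this idea is to measure how often exact values repeat and how tightly the targets cluster within each repeated value. The snippet below is a minimal sketch on a tiny hand-made frame; the `toy` rows are hypothetical stand-ins, not competition data, and the same lines could be pointed at the real `train` frame once it is loaded.

```python
import pandas as pd

# Hypothetical toy rows standing in for the competition data.
toy = pd.DataFrame({
    "Weight Capacity (kg)": [12.5, 12.5, 12.5, 7.25, 7.25, 19.0, 23.75, 5.0],
    "Price": [80.0, 82.0, 78.0, 40.0, 44.0, 100.0, 60.0, 30.0],
})

col = "Weight Capacity (kg)"
counts = toy[col].value_counts()

# Fraction of rows whose weight value appears more than once.
repeated_frac = toy[col].map(counts).gt(1).mean()

# Target statistics per value, restricted to values that repeat.
stats = toy.groupby(col)["Price"].agg(["count", "mean", "std"])
repeated_stats = stats[stats["count"] > 1]

print(f"Fraction of rows with a repeated value: {repeated_frac:.3f}")
print(repeated_stats)
```

If the within-value standard deviation of `Price` is clearly smaller than the overall standard deviation, the repeated values carry usable signal.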

We will submit baseline 1 (train mean) in version 1 of this notebook and baseline 2 (target encoding of weight capacity) in version 2. Afterward we will have two LB scores to use as baselines when comparing future models.

UPDATE: Baseline 1 achieves LB=39.16450. Now let's see what baseline 2 achieves...

# Load Data

[1]: https://www.kaggle.com/competitions/playground-series-s5e2/discussion/560669

In [None]:
import pandas as pd, numpy as np

train = pd.read_csv("/kaggle/input/playground-series-s5e2/train.csv")
print("Train shape",train.shape)
train_extra = pd.read_csv("/kaggle/input/playground-series-s5e2/training_extra.csv")
print("Extra Train shape",train_extra.shape)
train = pd.concat([train,train_extra],axis=0,ignore_index=True)
print("Combined Train shape",train.shape)
train.head()

# Baseline 1 - Train Mean - CV 38.93, LB 39.16
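Predicting the mean for every row is the best constant prediction under squared error, and its RMSE equals the population standard deviation of the target. The sketch below illustrates this on synthetic numbers; the uniform price range and the `rmse` helper are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical prices; the range is an assumption, not taken from the data.
y = rng.uniform(15.0, 150.0, size=10_000)

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

mean_rmse = rmse(y, y.mean())        # constant prediction = mean
median_rmse = rmse(y, np.median(y))  # any other constant cannot do better

print(f"RMSE of mean prediction   = {mean_rmse:.4f}")
print(f"Population std of target  = {np.std(y):.4f}")
print(f"RMSE of median prediction = {median_rmse:.4f}")
```

This is why the score from this baseline is a useful reference: any model that scores above it has learned nothing beyond the target's average.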

In [None]:
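# Predict the overall mean price for every row
train_mean = train.Price.mean()
train['pred'] = train_mean
# RMSE of a constant mean prediction, computed on the training data itself
s = np.sqrt(np.mean( (train.Price-train.pred)**2.0 ) )
print(f"Validation RMSE using Train Mean = {s}")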

In [None]:
sub = pd.read_csv("/kaggle/input/playground-series-s5e2/sample_submission.csv")
print('Submission shape', sub.shape)
sub['Price'] = train_mean
sub.to_csv("submission_mean.csv",index=False)
sub.head()

# Baseline 2 - Target Encode Weight Capacity - CV 38.71
We will use the [RAPIDS][2] cuML Target Encoder to target encode (TE) the numeric column `Weight Capacity (kg)`.

[2]: https://docs.rapids.ai/install/
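Target encoding replaces each value with the mean target of the rows that share it. To keep the score honest, each row is encoded with statistics from the other folds only. The CPU sketch below shows the general idea in pandas; it is not cuML's implementation, and the helper `oof_target_encode`, its smoothing formula, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def oof_target_encode(values, target, n_folds=5, smooth=20.0, seed=0):
    """Out-of-fold mean target encoding with additive smoothing (illustrative)."""
    values = pd.Series(values).reset_index(drop=True)
    target = pd.Series(target, dtype="float64").reset_index(drop=True)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(values))  # random fold per row
    encoded = np.empty(len(values), dtype="float64")
    for k in range(n_folds):
        in_fold = fold == k
        fit_vals, fit_tgt = values[~in_fold], target[~in_fold]
        prior = fit_tgt.mean()  # mean target of the other folds
        grp = fit_tgt.groupby(fit_vals).agg(["sum", "count"])
        # Shrink each per-value mean toward the prior; rare values shrink most.
        smoothed = (grp["sum"] + smooth * prior) / (grp["count"] + smooth)
        # Values unseen in the other folds fall back to the prior.
        encoded[in_fold] = values[in_fold].map(smoothed).fillna(prior).to_numpy()
    return encoded

# Toy data where the target depends on a repeated numeric value.
rng = np.random.default_rng(1)
x = rng.choice([5.0, 10.0, 15.0], size=3000)
y = 50.0 + 2.0 * x + rng.normal(0.0, 5.0, size=3000)

enc = oof_target_encode(x, y, n_folds=5, smooth=20.0)
rmse_te = np.sqrt(np.mean((y - enc) ** 2))
rmse_mean = np.sqrt(np.mean((y - y.mean()) ** 2))
print(f"RMSE mean baseline = {rmse_mean:.3f}, RMSE target encoding = {rmse_te:.3f}")
```

Because every row is encoded without seeing its own target, the resulting RMSE can be read as a cross-validation estimate rather than a training fit.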

In [None]:
from cuml.preprocessing import TargetEncoder

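# n_folds: rows in each fold are encoded using statistics from the other folds
# smooth: strength of shrinkage toward the overall mean
TE = TargetEncoder(n_folds=25, smooth=20, split_method='random', stat='mean')
# fit_transform returns out-of-fold encodings, so this RMSE acts as a CV score
train['pred'] = TE.fit_transform(train['Weight Capacity (kg)'],train.Price)
s = np.sqrt(np.mean( (train.Price-train.pred)**2.0 ) )
print(f"Validation RMSE using Target Encode Weight Capacity = {s}")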

In [None]:
test = pd.read_csv("/kaggle/input/playground-series-s5e2/test.csv")
# Encode test weights with the encoder fit on train; the encoding is the prediction
sub['Price'] = TE.transform(test['Weight Capacity (kg)'])
sub.to_csv("submission_TE_weight_capacity.csv",index=False)
sub.head()