# 20210825_CatBoost_test
Experiment with CatBoost with just scaling transform.

---------
**Best to date**: From `20210810_XGBRegressor_tree_sweep2` I had found the following was the best result (RMSE = 7.89519 on the LB) thus far:

```python
model = XGBRegressor(
        tree_method='auto',
        booster='dart',
        n_estimators=400, 
        max_depth=3,
        learning_rate=0.1522, 
        subsample=1,
        random_state=42,
        n_jobs=-1, 
        verbosity=1, 
    )
```

Also found that `MaxAbsScaler` performed best, by a smidgen, prior to any feature selection.

**Baseline**: The "control" config for experiments as of today (20210823), yielding RMSE of `7.8619006924521` on an unaltered feature set:

```python
config_defaults = {
    "library": "xgboost",
    "tree_method": "auto", # set to 'gpu_hist' to try GPU if available
    "booster": 'gbtree', # dart may be marginally better, but will opt for this quicker approach as a default
    "n_estimators": 100, # a very low number -- optimal is probably 300ish -- but this will be quicker
    "max_depth": 3,
    "learning_rate": 0.1,
    "test_size": 0.2,
    # "scaler": MaxAbsScaler # determined to be best with the above hyperparameters
}
```

In [1]:
baseline_rmse = 7.8619006924521

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

# general ML tooling
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from sklearn.metrics import mean_squared_error
import wandb
from wandb.xgboost import wandb_callback
# import timm
from pathlib import Path # for handling filenames
import os # for fixing the wandb notebook name issue
import math # for faster-than-numpy 
import seaborn as sns # for viz

# feature engineering tools
from sklearn.feature_selection import chi2, f_classif, f_regression, mutual_info_regression, SelectKBest, SelectPercentile, VarianceThreshold
from scipy import stats # for the yeojohnson transform for right-skewed features with some negative values
from sklearn.preprocessing import MaxAbsScaler, StandardScaler, MinMaxScaler, RobustScaler

# import featuretools as ft

# models
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
# from sklearn.ensemble import RandomForestRegressor
# import lightgbm # fill in
# import catboost # fill in


In [3]:
%matplotlib inline 
%config Completer.use_jedi = False 
os.environ['WANDB_NOTEBOOK_NAME'] = '20210825_CatBoost_test.ipynb' # to avoid notebook name detection issue in JupyterLab

# preliminary configuration for a `wandb` run; use it for the run's metadata
config_run = {
    'name': os.environ['WANDB_NOTEBOOK_NAME'][:-6], # just removes the .ipynb extension, leaving the notebook filename's stem
    'tags': ['CatBoost', 'baseline', 'modeling'],
    'notes': "Experimenting with CatBoost with just scaling transform",
}

Note that while the above dict provides basic metadata for runs, another dict will provide model hyperparameters. **NOTE HOWEVER THAT WANDB DOES NOT SUPPORT CATBOOST** (though there's a workaround posted [here](https://github.com/wandb/client/issues/965).

In [4]:
datapath = Path('/media/sf/easystore/kaggle_data/tabular_playgrounds/202108_august/')

Here, we'll load the unaltered training dataset from a feather file (for improved speed).

In [5]:
# df = pd.read_csv(datapath/'train.csv', index_col='id', low_memory=False)
# df.index.name = None
# df.to_feather(path='./dataset_df.feather')
df = pd.read_feather(path='dataset_df.feather')
df.index.name = 'id'

There are no NaNs or missing datapoints that have to be imputed.

Now, let's just set the basic features vs target

In [6]:
y = df.loss

In [7]:
features = [x for x in df.columns if x != 'loss']

In [8]:
X = df[features]

# Configurations
Below is the configuration I've found best for `XGBRegressor` thus far. It provides some metadata that I want logged in `wandb` (but which is not `wandb`-specific) and some hyperparameters for models.
- Note that the `n_estimators` might actually be far too low (this is the highest I've tried, but in the discussion forum I've seen that people are running 1500-2500 estimators). 
- Note also that `dart` seems to be virtually identical to `gbtree` in terms of performance (potentially actually identical), and takes significantly longer to train.

In [9]:
# config_best = {
#     "library": "xgboost",
#     "tree_method": "auto", # set to 'gpu_hist' to try GPU if available
#     "booster": 'dart', 
#     "n_estimators": 400, 
#     "max_depth": 3,
#     "learning_rate": 0.1522,
#     "test_size": 0.2,
#     "scaler": MaxAbsScaler,
#     "feature_selector": SelectKBest,
#     "k_best": 80,
#     "feature_selection_scoring": f_regression,
#     'random_state': 42,
#     'subsample': 1,
#     'n_jobs': -1,
#     'verbosity': 1,
# }

Below is a faster set of configuration options that trains a bit faster; I used it for baseline testing. 

In [20]:
# as of 20210825, largely borrowing from XGBRegressor baseline
config = {
    "library": "catboost",
    "model": CatBoostRegressor,
    "n_estimators": 10000, # low number for testing purposes; alias for catboost-native "iterations"
    "learning_rate": 0.1, 
    "max_depth": 3, # catboost native is "depth"
    "task_type": "GPU", # should get it running on GPU
    "scaler": MaxAbsScaler, # provisional
    "test_size": 0.2,
    "random_state": 42,
#     "subsample":1,
    "n_jobs":-1,
    "verbosity":1,
#     "bootstrap_type": "Poisson",
#     "device": 0
}
#     "tree_method": "auto", # set to 'gpu_hist' to try GPU if available
#     "booster": 'gbtree', # dart may be marginally better, but will opt for this quicker approach as a default
#     "n_estimators": 100, # a very low number -- optimal is probably 300ish -- but this will be quicker
#     "max_depth": 3,
#     "learning_rate": 0.1,
#     "test_size": 0.2,
#     "scaler": MaxAbsScaler # determined to be best with the above hyperparameters

In [11]:
# ?CatBoostRegressor

# Scaling

We'll permit some data leakage here and scale the entire dataset at once, insofar as we're not being confronted with data leak from the *actual* test set (in the `test.csv` file, sans labels).

I'm undecided right now whether the target should be transformed along with the features -- for now, I will just leave the target as is.

In [12]:
scaler = config['scaler']()
X_scaled = scaler.fit_transform(X)
# y_scaled = scaler.fit_transform(y)

# Training Function

Here is a preliminary training function -- I'm allowing specification of a scaler at least for the time being.

In [16]:
def train(X, y, config):#, scaler): # passed in via config dict for now
#     wandb.init(
#         project="202108_Kaggle_tabular_playground",
#         save_code=True,
#         tags=config_run['tags'],
#         name=config_run['name'],
#         notes=config_run['notes'],
#         config=wandb_config)
    
#     config = wandb.config
        
    # applying hold-out before scaling
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=config['test_size'], random_state=config['random_state'])
    
#     wandb.log({'scaler': MaxAbsScaler})
#     s = MaxAbsScaler()
#     X_train = s.fit_transform(X_train)
#     X_valid = s.fit_transform(X_valid)
    
#     # instantiating the scaler and fitting it
#     if scaler:
#         s = scaler()
#         X_train = s.fit_transform(X_train)
#         X_valid = s.fit_transform(X_valid)
    
    model = CatBoostRegressor(
        n_estimators=config['n_estimators'],
        learning_rate=config['learning_rate'],
        max_depth=config['max_depth'],
        task_type=config['task_type'],
#         n_jobs=config['n_jobs'],
#         verbosity=config['verbosity'],
#         subsample=config['subsample'],
        random_state=config['random_state'],
#         bootstrap_type=config['bootstrap_type'],
#         device:config['device']
    ) 
    
#     wandb.log({'params': model.get_params()})
    model.fit(X_train, y_train)#, callbacks=[wandb.xgboost.wandb_callback()])
    y_preds = model.predict(X_valid)
    mse = mean_squared_error(y_valid, y_preds)
    rmse = math.sqrt(abs(mse))
#     wandb.log({'mse':mse, 'rmse':rmse})
    print(f"MSE is {mse}\nRMSE is {rmse}")   
#     wandb.finish()
    return model, y_preds
    

In [21]:
catboost_test_model, y_valid_preds = train(X_scaled, y, config)

0:	learn: 7.9431099	total: 2.5ms	remaining: 25s
1:	learn: 7.9408891	total: 4.56ms	remaining: 22.8s
2:	learn: 7.9389562	total: 6.45ms	remaining: 21.5s
3:	learn: 7.9372687	total: 8.58ms	remaining: 21.4s
4:	learn: 7.9357530	total: 10.5ms	remaining: 21s
5:	learn: 7.9345422	total: 12.8ms	remaining: 21.2s
6:	learn: 7.9334753	total: 14.8ms	remaining: 21.1s
7:	learn: 7.9324438	total: 16.9ms	remaining: 21.1s
8:	learn: 7.9315389	total: 18.8ms	remaining: 20.9s
9:	learn: 7.9304618	total: 20.9ms	remaining: 20.9s
10:	learn: 7.9295826	total: 23ms	remaining: 20.9s
11:	learn: 7.9286219	total: 25ms	remaining: 20.8s
12:	learn: 7.9278370	total: 27.2ms	remaining: 20.9s
13:	learn: 7.9268187	total: 29.5ms	remaining: 21.1s
14:	learn: 7.9260403	total: 31.6ms	remaining: 21.1s
15:	learn: 7.9252246	total: 33.8ms	remaining: 21.1s
16:	learn: 7.9244233	total: 35.8ms	remaining: 21s
17:	learn: 7.9237296	total: 38ms	remaining: 21.1s
18:	learn: 7.9229234	total: 40.2ms	remaining: 21.1s
19:	learn: 7.9222380	total: 42.3ms	

# Predictions

In [44]:
# y_pred = model.predict(X)



In [45]:
y_pred[:10]

array([7.5242977, 8.158392 , 4.832366 , 7.419939 , 6.553831 , 7.712717 ,
       7.8510847, 7.165042 , 5.162752 , 5.5137396], dtype=float32)

In [46]:
test_df = pd.read_csv(datapath/'test.csv', index_col='id', low_memory=False)

In [47]:
test_df.head()

Unnamed: 0_level_0,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f90,f91,f92,f93,f94,f95,f96,f97,f98,f99
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
250000,0.812665,15,-1.23912,-0.893251,295.577,15.8712,23.0436,0.942256,29.898,1.11394,...,0.446389,-422.332,-1.4463,1.69075,1.0593,-3.01057,1.94664,0.52947,1.38695,8.78767
250001,0.190344,131,-0.501361,0.801921,64.8866,3.09703,344.805,0.807194,38.4219,1.09695,...,0.377179,10352.2,21.0627,1.84351,0.251895,4.44057,1.90309,0.248534,0.863881,11.7939
250002,0.919671,19,-0.057382,0.901419,11961.2,16.3965,273.24,-0.0033,37.94,1.15222,...,0.99014,3224.02,-2.25287,1.551,-0.559157,17.8386,1.83385,0.931796,2.33687,9.054
250003,0.860985,19,-0.549509,0.471799,7501.6,2.80698,71.0817,0.792136,0.395235,1.20157,...,1.39688,9689.76,14.7715,1.4139,0.329272,0.802437,2.23251,0.893348,1.35947,4.84833
250004,0.313229,89,0.588509,0.167705,2931.26,4.34986,1.57187,1.1183,7.75463,1.16807,...,0.862502,2693.35,44.1805,1.5802,-0.191021,26.253,2.68238,0.361923,1.5328,3.7066


In [48]:
X_test = test_df[features] # this is just for naming consistency

In [49]:
y_test_preds = model.predict(X_test)



In [50]:
sample_df = pd.read_csv(datapath/'sample_submission.csv')

In [51]:
sample_df.loc[:, 'loss'] = y_test_preds

In [52]:
sample_df.head()

Unnamed: 0,id,loss
0,250000,8.155357
1,250001,4.386151
2,250002,7.510742
3,250003,7.391403
4,250004,7.604273


In [53]:
sample_df.to_csv('202108062038_XGBoost.csv', index=False)

In [51]:
# wandb.finish()

VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…