# Intro to 4BAI Kaggle 30 days of ML competition

## Contributors
* Labbe, Chris (gclabbe)

### Notes
Online discussions suggest that this dataset scores better when the model is trained on CPU. GPU is still useful for iterating quickly through options, but the final run for a submission should be done on CPU, which can take 30+ minutes depending on the number of folds.
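
One way to keep the GPU/CPU switch in a single place is a small helper that returns only the device-related XGBoost parameters. This is a sketch rather than code from the notebook: `xgb_device_params` is a hypothetical name, and the parameter values assume an XGBoost 1.x release (`gpu_hist`, `gpu_predictor`); newer releases select the device differently.

```python
def xgb_device_params(use_gpu: bool) -> dict:
    """Device-related XGBoost parameters (hypothetical helper, XGBoost 1.x names assumed)."""
    if use_gpu:
        # quick iteration while exploring options
        return {"tree_method": "gpu_hist", "predictor": "gpu_predictor", "gpu_id": 0}
    # final run for a submission
    return {"tree_method": "hist", "predictor": "cpu_predictor", "n_jobs": -1}


# illustrative base parameters; merge the device settings on top of them
base_params = {"n_estimators": 10000, "max_depth": 3, "random_state": 0}
params = {**base_params, **xgb_device_params(use_gpu=False)}
print(params)
```

Flipping `use_gpu` then changes every device-specific key at once, which avoids leaving a stray GPU setting behind when the final CPU version is saved.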

### Revisions to 30_days_abishek_1.ipynb
* V5 - 
* V6 - 10-fold with params from raw GridSearchCV
* V7 - failed compile because of grid search layout
* V8 - need to remember to disable GPU in XGB when saving with CPU
* V9 - 10-fold with params from tutorial ... this version

### Revisions to 30_days_abishek_5_6.ipynb (this notebook)
* V10 

### Plans
* Tutorial 6 -- shows how to pull all of the different models together as a stack

The tutorials lay out the code longhand, with no functions to factor out repetitive tasks. So, instead of forking and running them as-is like others are doing, let's build some support functions and clean up the code.

Also, need to understand how the optimized parameters for a couple of the models were discovered.  Abhishek mentions a kernel that he took them from: https://www.kaggle.com/stevenrferrer/30-days-of-ml-optimized-xgboost-5folds

Here is Abhishek talking about tuning specifically for this challenge: https://www.youtube.com/watch?v=m5YSKPMjkrk  
He has many videos online and mentions a longer parameter tuning video that is not challenge specific.


In [1]:
import plotly.graph_objects as go
fig = go.Figure(data=go.Bar(y=[2, 3, 1]))
fig.show()

ModuleNotFoundError: No module named 'plotly'

Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

### This version of the notebook will follow the tutorials published by Abhishek
https://www.kaggle.com/abhishek/

* part-1: Baseline

### Import the necessary modules

In [3]:
import time
import os
import numpy as np
import pandas as pd
from pathlib import Path

# modules called out in part-1
from sklearn import model_selection
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing


# other modules from previous work
# from sklearn.preprocessing import StandardScaler
# from sklearn.decomposition import PCA
# import statsmodels.api as sm
# from sklearn.feature_selection import RFE
# from sklearn.metrics import confusion_matrix

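# Regressors
# from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor  # used in the stacking cells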
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
# from lightgbm import LGBMRegressor
# from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline

# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# suppress "torch" warning in TPOT and GridSearchCV warning
import warnings
warnings.filterwarnings('ignore')


# Step 2: Load the data

Next, we'll load the training and test data.  

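The original template sets `index_col=0` to use the `id` column as the DataFrame index.  The cell below does not: `id` is kept as a regular column because later cells use it to key the out-of-fold predictions.  (*If you're not sure how this works, try temporarily adding `index_col=0` and see how it changes the result.*)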
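
The effect of `index_col=0` can be seen on a tiny in-memory CSV. The rows below are made up for illustration and are not taken from the competition data.

```python
import io

import pandas as pd

csv_text = "id,cat0,cont0,target\n1,B,0.20,8.1\n2,A,0.55,8.5\n4,B,0.31,7.9\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)  # `id` becomes the index
as_column = pd.read_csv(io.StringIO(csv_text))                # `id` stays a regular column

print(with_index.index.tolist())    # ids live in the index
print(as_column.columns.tolist())   # ids live in a column, index is 0..n-1
```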

In [4]:
# Load the training data
path = ""  # "../input/30-days-of-ml/"

df_train = pd.read_csv(f"{path}train.csv")
df_test = pd.read_csv(f"{path}test.csv")
sample_submission = pd.read_csv(f"{path}sample_submission.csv")


In [None]:
df = pd.read_csv("../input/30days-folds/train_folds.csv")
df_test = pd.read_csv("../input/30-days-of-ml/test.csv")

# merge the train and test predictions of the five base models, one column per model
for i in range(1, 6):
    df_pred = pd.read_csv(f"../input/stacking30days/train_pred_{i}.csv")
    df_pred.columns = ["id", f"pred_{i}"]
    df = df.merge(df_pred, on="id", how="left")

    df_test_pred = pd.read_csv(f"../input/stacking30days/test_pred_{i}.csv")
    df_test_pred.columns = ["id", f"pred_{i}"]
    df_test = df_test.merge(df_test_pred, on="id", how="left")

# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use ordinal encoding, fitted inside each cross-validation fold and applied to `x_train`, `x_valid`, and `x_test`, after first adding target-encoded versions of the categorical columns.
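
As a reminder of what ordinal encoding does, here is a minimal sketch on made-up data (the two toy frames are assumptions, not competition rows). The encoder is fitted on the training rows only and then reused on the held-out rows.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"cat0": ["B", "A", "C", "A"], "cat1": ["A", "A", "B", "B"]})
valid = pd.DataFrame({"cat0": ["C", "B"], "cat1": ["B", "A"]})

encoder = OrdinalEncoder()
train_enc = encoder.fit_transform(train)  # learn the category -> integer mapping on train only
valid_enc = encoder.transform(valid)      # reuse the same mapping on held-out rows

print(encoder.categories_)  # categories are sorted per column
print(valid_enc)
```

By default `transform` raises an error on a category it has not seen during `fit`, so every level needs to appear in the rows the encoder is fitted on.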

### Implement KFold techniques
Posted in the discussions by KGM Abhishek Thakur.


In [23]:
force_refold = True
folds = 10

# create train_folds.csv if it does not exist (or if a refold is forced)
if force_refold or not Path("train_folds.csv").is_file():
    df_train = pd.read_csv(f"{path}train.csv")
    df_train["kfold"] = -1

    kf = model_selection.KFold(
        n_splits=folds,
        shuffle=True,
        random_state=42
    )

    # tag every row with the fold in which it is used for validation
    for fold, (_, valid_indices) in enumerate(kf.split(X=df_train)):
        df_train.loc[valid_indices, "kfold"] = fold

    df_train.to_csv("train_folds.csv", index=False)

# read back from the same location the file was written to
df_train = pd.read_csv("train_folds.csv")
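
A quick sanity check of the fold assignment, sketched on a toy frame (20 made-up rows and 5 folds, both chosen only for illustration): every row should end up in exactly one validation fold, and the folds should be the same size.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

toy = pd.DataFrame({"id": np.arange(20), "target": np.linspace(0.0, 1.0, 20)})
toy["kfold"] = -1

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(kf.split(X=toy)):
    toy.loc[valid_idx, "kfold"] = fold  # positions equal labels on a default RangeIndex

print(toy.kfold.value_counts().sort_index())
```

Running the same check on `df_train.kfold` is a cheap way to confirm that no row was left at `-1`.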


In [28]:
print(df_train.head())

            id cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8  ...     cont6  \
0            1    B    B    B    C    B    B    A    E    C  ...  0.160266   
1            2    B    B    A    A    B    D    A    F    A  ...  0.558922   
2            3    A    A    A    C    B    D    A    D    A  ...  0.375348   
3            4    B    B    A    C    B    D    A    E    C  ...  0.239061   
4            6    A    A    A    C    B    D    A    E    A  ...  0.420667   

In [29]:
useful_features = [c for c in df_train.columns if c not in ("id", "target", "kfold")]
numerical_cols = [col for col in useful_features if col.startswith("cont")]
object_cols = [col for col in useful_features if 'cat' in col]

df_test = df_test[useful_features]

## Feature encoding tests - Tutorial 2

### Log transformation

In [30]:
'''
# 0.72562 std 0.00109
for col in numerical_cols:
    df_train[col] = np.log1p(df_train[col])
    df_test[col] = np.log1p(df_test[col])
'''

pass

### Polynomial transformation

In [31]:
'''
# 0.72963 std 0.00066
poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
train_poly = poly.fit_transform(df_train[numerical_cols])
test_poly = poly.transform(df_test[numerical_cols])  # reuse the transformer fitted on train

df_poly = pd.DataFrame(train_poly, columns=[f"poly_{i}" for i in range(train_poly.shape[1])])
df_test_poly = pd.DataFrame(test_poly, columns=[f"poly_{i}" for i in range(test_poly.shape[1])])

df_train = pd.concat([df_train, df_poly], axis=1)
df_test = pd.concat([df_test, df_test_poly], axis=1)

useful_features = [c for c in df_train.columns if c not in ("id", "target", "kfold")]
object_cols = [col for col in useful_features if 'cat' in col]
df_test = df_test[useful_features]
'''

pass

### Target Encoding (tutorial 3)

In [32]:
# target encoding
for col in object_cols:
    temp_df = []
    temp_test_feat = None
    for fold in range(folds):
        x_train = df_train[df_train.kfold != fold].reset_index(drop=True)
        x_valid = df_train[df_train.kfold == fold].reset_index(drop=True)

        feat = x_train.groupby(col)["target"].agg("mean")
        feat = feat.to_dict()

        x_valid.loc[:, f"tar_enc_{col}"] = x_valid[col].map(feat)
        temp_df.append(x_valid)

        if temp_test_feat is None:
            temp_test_feat = df_test[col].map(feat)
        else:
            temp_test_feat += df_test[col].map(feat)

    temp_test_feat /= folds
    df_test.loc[:, f"tar_enc_{col}"] = temp_test_feat
    df_train = pd.concat(temp_df)
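
The idea behind the cell above is that each row's encoding is computed from the *other* folds only, so a row never sees its own target. A minimal sketch on a made-up six-row frame (the values and the single `cat0` column are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "cat0":   ["A", "A", "B", "B", "A", "B"],
    "target": [1.0, 3.0, 10.0, 20.0, 5.0, 30.0],
    "kfold":  [0, 1, 0, 1, 2, 2],
})

encoded_parts = []
for fold in sorted(toy.kfold.unique()):
    fit_rows = toy[toy.kfold != fold]          # rows used to compute the category means
    held_out = toy[toy.kfold == fold].copy()   # rows that receive the encoding
    means = fit_rows.groupby("cat0")["target"].mean()
    held_out["tar_enc_cat0"] = held_out["cat0"].map(means)
    encoded_parts.append(held_out)

encoded = pd.concat(encoded_parts).sort_index()
print(encoded)
```

For the test set there is no held-out fold, which is why the notebook averages the per-fold mappings instead.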

In [33]:
useful_features = [c for c in df_train.columns if c not in ("id", "target", "kfold")]
numerical_cols = [col for col in useful_features if col.startswith("cont")]
object_cols = [col for col in useful_features if col.startswith("cat")]
df_test = df_test[useful_features]


# Step 4: Train a model

Now that the data is prepared, the next step is to train a model.  

If you took the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course, then you learned about **[Random Forests](https://www.kaggle.com/dansbecker/random-forests)**.  In the code cell below, we instead fit an XGBoost regressor (`XGBRegressor`) on each fold and score it on that fold's validation rows.

In [34]:
final_test_predictions = []
final_valid_predictions = {}
scores = []

for fold in range(folds):
    x_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    x_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    x_test = df_test.copy()

    valid_ids = x_valid.id.values.tolist()

    y_train = x_train.target
    y_valid = x_valid.target

    x_train = x_train[useful_features]
    x_valid = x_valid[useful_features]

    # encode categorical columns
    ordinal_encoder = OrdinalEncoder()
    x_train[object_cols] = ordinal_encoder.fit_transform(x_train[object_cols])
    x_valid[object_cols] = ordinal_encoder.transform(x_valid[object_cols])
    x_test[object_cols] = ordinal_encoder.transform(x_test[object_cols])

    # standardize numerical columns
    
    # 0.725506 std 0.00119
    scaler = preprocessing.StandardScaler()
    x_train[numerical_cols] = scaler.fit_transform(x_train[numerical_cols])
    x_valid[numerical_cols] = scaler.transform(x_valid[numerical_cols])
    x_test[numerical_cols] = scaler.transform(x_test[numerical_cols])
    

    # one-hot encoding of the categorical features (disabled)
    '''
    # 0.72550 std 0.00088
    ohe = preprocessing.OneHotEncoder(sparse=False, handle_unknown="ignore")
    x_train_ohe = ohe.fit_transform(x_train[object_cols])
    x_valid_ohe = ohe.transform(x_valid[object_cols])
    x_test_ohe = ohe.transform(x_test[object_cols])

    x_train_ohe = pd.DataFrame(x_train_ohe, columns=[f"ohe_{i}" for i in range(x_train_ohe.shape[1])])
    x_valid_ohe = pd.DataFrame(x_valid_ohe, columns=[f"ohe_{i}" for i in range(x_valid_ohe.shape[1])])
    x_test_ohe = pd.DataFrame(x_test_ohe, columns=[f"ohe_{i}" for i in range(x_test_ohe.shape[1])])
    '''

    xgb_params_tutorial = {
        'random_state': fold,
        'n_jobs': -1,
    }

    xgb_params_from_gridsearch = {
        # 'tree_method': 'hist',  # 'hist',
        'booster': 'gbtree',
        'predictor': 'cpu_predictor',
        'n_estimators': 10000,
        'learning_rate': 0.03628302216953097,
        'reg_lambda': 0.0008746338866473539,
        'reg_alpha': 23.13181079976304,
        'subsample': 0.7875490025178415,
        'colsample_bytree': 0.11807135201147481,
        'max_depth': 3,
        'random_state': 0,
        # 'n_jobs': -1,
        # 'gpu_id': 0,
        # 'single_precision_histogram': True,
    }

    # xgb_model = XGBRegressor(**xgb_params_tutorial)
    xgb_model = XGBRegressor(**xgb_params_from_gridsearch)

    start = time.time()

    xgb_model.fit(
        x_train, y_train,
        early_stopping_rounds=300,
        eval_set=[(x_valid, y_valid)],
        verbose=1000
    )

    print('time: ', (time.time() - start) / 60.0)

    preds_valid = xgb_model.predict(x_valid)
    test_preds = xgb_model.predict(x_test)
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))
    
    rmse = mean_squared_error(y_valid, preds_valid, squared=False)  # squared=False gives RMSE, not MAE
    scores.append(rmse)

    print(fold, rmse)

print(f"\nAverage RMSE: {np.mean(scores) :0.6f}\tStd: {np.std(scores) :0.4f}")

[0]	validation_0-rmse:7.50936
[1000]	validation_0-rmse:0.72271
[2000]	validation_0-rmse:0.71880
[3000]	validation_0-rmse:0.71722
[4000]	validation_0-rmse:0.71647
[5000]	validation_0-rmse:0.71610
[6000]	validation_0-rmse:0.71593
[6766]	validation_0-rmse:0.71582
time:  4.956782607237498
0 0.715807786165432
[0]	validation_0-rmse:7.49115
[1000]	validation_0-rmse:0.72238
[2000]	validation_0-rmse:0.71843
[3000]	validation_0-rmse:0.71677
[4000]	validation_0-rmse:0.71591
[5000]	validation_0-rmse:0.71553
[6000]	validation_0-rmse:0.71533
[7000]	validation_0-rmse:0.71527
[7139]	validation_0-rmse:0.71529
time:  5.279765991369883
1 0.7152381792086441
[0]	validation_0-rmse:7.49700
[1000]	validation_0-rmse:0.72063
[2000]	validation_0-rmse:0.71684
[3000]	validation_0-rmse:0.71551
[4000]	validation_0-rmse:0.71487
[5000]	validation_0-rmse:0.71465
[6000]	validation_0-rmse:0.71454
[7000]	validation_0-rmse:0.71450
[7756]	validation_0-rmse:0.71455
time:  5.923243029912313
2 0.71447077947362
[0]	validation_0

### Current best score
    Local:         0.71659 (V10 CPU 5-fold optimized params)
    Kaggle-test:   0.????? (V9 CPU 10-fold)
    Kaggle-result: 0.71751 (V9 CPU 10-fold)

K-Fold tutorial run as published results in:

    avg ~= 0.725
    
With XGB params from Sferrer & lesson 5:

    Local:        0.71659 (V10 CPU 5-fold)
    Kaggle-test:  0.71774 (GPU V10 10-fold)
    Kaggle-result: 0.71751 (CPU V9 10-fold)

In [35]:
preds = np.mean(np.column_stack(final_test_predictions), axis=1)
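
What this averaging does is easiest to see with small made-up numbers. Each fold contributes one vector of test predictions; `np.column_stack` lines them up as columns, and the row-wise mean gives one blended prediction per test row.

```python
import numpy as np

# three hypothetical folds, each predicting the same three test rows
fold_predictions = [
    np.array([8.0, 7.0, 9.0]),
    np.array([8.2, 7.4, 8.8]),
    np.array([7.8, 7.2, 9.2]),
]

stacked = np.column_stack(fold_predictions)  # shape (n_rows, n_folds)
blended = stacked.mean(axis=1)               # one value per test row
print(stacked.shape, blended)
```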

In [36]:
# sample_submission.target = preds
# sample_submission.to_csv("submission.csv", index=False)

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.

In [22]:
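# Average the per-fold test predictions instead of using only the last fold's model
predictions = np.mean(np.column_stack(final_test_predictions), axis=1)

# Save the predictions to a CSV file; the ids come from the sample submission
# because x_test no longer carries the `id` column
output = pd.DataFrame({'id': sample_submission.id,
                       'target': predictions})

output.to_csv('tenc_5f.csv', index=False)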

## Stacking the different solutions


In [None]:
sample_submission = pd.read_csv("../input/30-days-of-ml/sample_submission.csv")
useful_features = ["pred_1", "pred_2", "pred_3", "pred_4", "pred_5"]
df_test = df_test[useful_features]

final_test_predictions = []
final_valid_predictions = {}
scores = []
for fold in range(5):
    xtrain =  df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = df_test.copy()

    valid_ids = xvalid.id.values.tolist()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]
    

    params = {
        'random_state': 1, 
        'booster': 'gbtree',
        'n_estimators': 7000,
        'learning_rate': 0.03,
        'max_depth': 2
    }
    
    model = XGBRegressor(
        n_jobs=4,
        **params
    )
    model.fit(xtrain, ytrain, early_stopping_rounds=300, eval_set=[(xvalid, yvalid)], verbose=1000)
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))
    rmse = mean_squared_error(yvalid, preds_valid, squared=False)
    print(fold, rmse)
    scores.append(rmse)

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_1"]
final_valid_predictions.to_csv("level1_train_pred_1.csv", index=False)

sample_submission.target = np.mean(np.column_stack(final_test_predictions), axis=1)
sample_submission.columns = ["id", "pred_1"]
sample_submission.to_csv("level1_test_pred_1.csv", index=False)

In [None]:
sample_submission = pd.read_csv("../input/30-days-of-ml/sample_submission.csv")
useful_features = ["pred_1", "pred_2", "pred_3", "pred_4", "pred_5"]
df_test = df_test[useful_features]

final_test_predictions = []
final_valid_predictions = {}
scores = []
for fold in range(5):
    xtrain =  df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = df_test.copy()

    valid_ids = xvalid.id.values.tolist()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]
    
    model = RandomForestRegressor(n_estimators=500, n_jobs=-1, max_depth=3)
    model.fit(xtrain, ytrain)
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))
    rmse = mean_squared_error(yvalid, preds_valid, squared=False)
    print(fold, rmse)
    scores.append(rmse)

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_2"]
final_valid_predictions.to_csv("level1_train_pred_2.csv", index=False)

sample_submission.target = np.mean(np.column_stack(final_test_predictions), axis=1)
sample_submission.columns = ["id", "pred_2"]
sample_submission.to_csv("level1_test_pred_2.csv", index=False)

In [None]:
sample_submission = pd.read_csv("../input/30-days-of-ml/sample_submission.csv")
useful_features = ["pred_1", "pred_2", "pred_3", "pred_4", "pred_5"]
df_test = df_test[useful_features]

final_test_predictions = []
final_valid_predictions = {}
scores = []
for fold in range(5):
    xtrain =  df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = df_test.copy()

    valid_ids = xvalid.id.values.tolist()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]
    
    model = GradientBoostingRegressor(n_estimators=500, max_depth=3)
    model.fit(xtrain, ytrain)
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))
    rmse = mean_squared_error(yvalid, preds_valid, squared=False)
    print(fold, rmse)
    scores.append(rmse)

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_3"]
final_valid_predictions.to_csv("level1_train_pred_3.csv", index=False)

sample_submission.target = np.mean(np.column_stack(final_test_predictions), axis=1)
sample_submission.columns = ["id", "pred_3"]
sample_submission.to_csv("level1_test_pred_3.csv", index=False)
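
The three level-1 cells above differ only in the model they construct, so the fold loop is a natural candidate for one of the support functions mentioned in the plans. The sketch below is one possible shape for it, not the notebook's code: `run_folds` and its arguments are hypothetical names, the demo data is synthetic, and RMSE is computed with `np.sqrt` so that it does not depend on the `squared=False` argument.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def run_folds(df, df_test, features, make_model, n_folds):
    """Fit one model per fold; return OOF predictions, averaged test predictions, and RMSEs."""
    oof, test_preds, scores = {}, [], []
    for fold in range(n_folds):
        train_part = df[df.kfold != fold]
        valid_part = df[df.kfold == fold]

        model = make_model()  # a fresh, unfitted model for every fold
        model.fit(train_part[features], train_part.target)

        preds_valid = model.predict(valid_part[features])
        test_preds.append(model.predict(df_test[features]))
        oof.update(dict(zip(valid_part.id.tolist(), preds_valid)))
        scores.append(np.sqrt(mean_squared_error(valid_part.target, preds_valid)))
    return oof, np.mean(np.column_stack(test_preds), axis=1), scores


# synthetic stand-in for the merged prediction frame
rng = np.random.default_rng(0)
toy = pd.DataFrame({"id": np.arange(50),
                    "pred_1": rng.normal(size=50),
                    "pred_2": rng.normal(size=50)})
toy["target"] = 0.5 * toy.pred_1 + 0.5 * toy.pred_2
toy["kfold"] = toy.id % 5
toy_test = toy[["pred_1", "pred_2"]].head(10)

oof, test_pred, scores = run_folds(toy, toy_test, ["pred_1", "pred_2"], LinearRegression, n_folds=5)
print(len(oof), test_pred.shape, np.mean(scores))
```

Passing a factory such as `lambda: RandomForestRegressor(n_estimators=500, n_jobs=-1, max_depth=3)` as `make_model` would cover the other cells; models that need extra `fit` arguments, like early stopping, would require a small extension.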

In [None]:
df = pd.read_csv("../input/30days-folds/train_folds.csv")
df_test = pd.read_csv("../input/30-days-of-ml/test.csv")
sample_submission = pd.read_csv("../input/30-days-of-ml/sample_submission.csv")

df1 = pd.read_csv("level1_train_pred_1.csv")
df2 = pd.read_csv("level1_train_pred_2.csv")
df3 = pd.read_csv("level1_train_pred_3.csv")

df_test1 = pd.read_csv("level1_test_pred_1.csv")
df_test2 = pd.read_csv("level1_test_pred_2.csv")
df_test3 = pd.read_csv("level1_test_pred_3.csv")

df = df.merge(df1, on="id", how="left")
df = df.merge(df2, on="id", how="left")
df = df.merge(df3, on="id", how="left")

df_test = df_test.merge(df_test1, on="id", how="left")
df_test = df_test.merge(df_test2, on="id", how="left")
df_test = df_test.merge(df_test3, on="id", how="left")

df.head()

In [None]:
useful_features = ["pred_1", "pred_2", "pred_3"]
df_test = df_test[useful_features]

final_predictions = []
scores = []
for fold in range(5):
    xtrain =  df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = df_test.copy()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]
    
    model = LinearRegression()
    model.fit(xtrain, ytrain)
    
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_predictions.append(test_preds)
    rmse = mean_squared_error(yvalid, preds_valid, squared=False)
    print(fold, rmse)
    scores.append(rmse)

print(np.mean(scores), np.std(scores))
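
A linear model works as the final blender because its coefficients act as learned weights on the level-1 predictions. The sketch below illustrates this on synthetic data only: the three "models" are the target plus different amounts of noise, an assumption made purely to show the effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
target = rng.normal(loc=8.0, scale=1.0, size=200)

# hypothetical level-1 predictions: the target plus increasing amounts of noise
level1 = np.column_stack(
    [target + rng.normal(scale=s, size=200) for s in (0.2, 0.5, 1.0)]
)

meta = LinearRegression().fit(level1, target)
print(meta.coef_, meta.intercept_)  # the least noisy column should get the largest weight
```

Inspecting `model.coef_` after the final cell gives the same kind of view of how much each of `pred_1`, `pred_2`, and `pred_3` contributes to the blend.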