Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

### This version of the notebook will follow the tutorials published by Abishek
https://www.kaggle.com/abhishek/

* part-1: Baseline

### Tensorflow
Including GPU support - sometimes - having trouble keeping tf-gpu working in Anaconda on Windoze


In [1]:
import tensorflow as tf
print(tf.__version__)

config = tf.compat.v1.ConfigProto(
    gpu_options=tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.8),
    device_count={'GPU': 1},
    # session = tf.compat.v1.Session(config=config) 
    # tf.compat.v1.keras.backend.set_session(session)
)

session = tf.compat.v1.Session(config=config)
tf.compat.v1.keras.backend.set_session(session)

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))


2.3.0
Num GPUs Available:  0


### Import the necessary modules

In [2]:
import time
import os
import numpy as np
import pandas as pd

# modules called out in part-1
from sklearn import model_selection
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor


# other modules from previous work
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix

# Regressors
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.linear_model import LinearRegression
from lightgbm import LGBMRegressor
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline

# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# suppress "torch" warning in TPOT
import warnings
warnings.filterwarnings('ignore')


# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)

In [3]:
# Load the training data
path = ""  # "../input/30-days-of-ml/"

df_test = pd.read_csv(f"{path}test.csv")
sample_submission = pd.read_csv(f"{path}sample_submission.csv")


### Implement KFold techniques
posted in the discussions by KGM - Abishek Thakur


In [4]:
# create train_folds.csv if it does not exist
if not Path("train_folds.csv").is_file():
    df_train = pd.read_csv(f"{path}train.csv")
    df_train["kfold"] = -1

    kf = model_selection.KFold(
        n_splits=5,
        shuffle=True,
        random_state=42
    )

    for fold, (train_indicies, valid_indicies) in enumerate(kf.split(X=df_train)):
        df_train.loc[valid_indicies, "kfold"] = fold

    df_train.to_csv("train_folds.csv", index=False)

df_train = pd.read_csv(f"{path}train_folds.csv")


In [5]:
print(df_train.head)

<bound method NDFrame.head of             id cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8  ...     cont6  \
0            1    B    B    B    C    B    B    A    E    C  ...  0.160266   
1            2    B    B    A    A    B    D    A    F    A  ...  0.558922   
2            3    A    A    A    C    B    D    A    D    A  ...  0.375348   
3            4    B    B    A    C    B    D    A    E    C  ...  0.239061   
4            6    A    A    A    C    B    D    A    E    A  ...  0.420667   
...        ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...       ...   
299995  499993    B    B    A    A    B    D    A    E    A  ...  0.450538   
299996  499996    A    B    A    C    B    B    A    E    E  ...  0.508502   
299997  499997    B    B    A    C    B    C    A    E    G  ...  0.372425   
299998  499998    A    B    A    C    B    B    A    E    E  ...  0.424243   
299999  499999    A    A    A    C    A    D    A    E    A  ...  0.328669   

           cont7     cont8     co

??? Why are we stripping test to only the catX columns ???

In [6]:
useful_features = [c for c in df_train.columns if c not in ("id", "target", "kfold")]

object_cols = [col for col in useful_features if 'cat' in col]

df_test = df_test[useful_features]

In [7]:
# Separate target from features
y = df_train['target']
features = df_train.drop(['target'], axis=1)

# Preview features
features.head()

Unnamed: 0,id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,...,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,kfold
0,1,B,B,B,C,B,B,A,E,C,...,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985,0
1,2,B,B,A,A,B,D,A,F,A,...,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083,2
2,3,A,A,A,C,B,D,A,D,A,...,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846,4
3,4,B,B,A,C,B,D,A,E,C,...,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682,3
4,6,A,A,A,C,B,D,A,E,A,...,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823,1


## Tutorial stops before step 3 below ...
Abishek is handling the ordinal encoding and stripping out target on each loop.  Feels like we could do this before the loop and simplify the code


In [10]:
final_predictions = []
results = []

for fold in range(5):
    x_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    x_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    x_test = df_test.copy()

    y_train = x_train.target
    y_valid = x_valid.target

    x_train = x_train[useful_features]
    x_valid = x_valid[useful_features]

    ordinal_encoder = OrdinalEncoder()
    x_train[object_cols] = ordinal_encoder.fit_transform(x_train[object_cols])
    x_valid[object_cols] = ordinal_encoder.transform(x_valid[object_cols])
    x_test[object_cols] = ordinal_encoder.transform(x_test[object_cols])

    xgb_params_tutorial = {
        'random_state': fold,
        'n_jobs': -1,
    }

    xgb_params_from_gridsearch = {
        'n_estimators': 5000,
        'learning_rate': 0.05,
        'subsample': 0.8,
        'colsample_bytree': 0.2,
        'max_depth': 3,
        'booster': 'gbtree',
        'reg_lambda': 0.2,
        'reg_alpha': 15,
        'random_state': fold,
        'n_jobs': -1,
        # 'gpu_id': 0,
        # 'tree_method': 'gpu_hist',
        # 'predictor': 'gpu_predictor'
    }

    # xgb_model = XGBRegressor(**xgb_params_tutorial)
    xgb_model = XGBRegressor(**xgb_params_from_gridsearch)

    %time xgb_model.fit(x_train, y_train)

    preds_valid = xgb_model.predict(x_valid)
    test_preds = xgb_model.predict(x_test)
    final_predictions.append(test_preds)

    mae = mean_squared_error(y_valid, preds_valid, squared=False)
    results.append(mae)

    print(fold, mae)

print(f"Average MAE: {sum(results)/len(results) :0.6f}")

Wall time: 3min 15s
0 0.7157593224363182
Wall time: 3min 19s
1 0.7160153248020477
Wall time: 3min 23s
2 0.7179753915186063
Wall time: 3min 24s
3 0.7177027838963904
Wall time: 3min 24s
4 0.7159340302614421


### Current best score
    Local: 0.71668 (5-fold)
    Kaggle-test: 0.71787 (V5 10-fold)
    Kaggle-result: 0.71853 (V5 10-fold)

Tutorial run as published results in:

    0  0.7242812912900478
    1  0.7232810321072864
    2  0.725452249623988
    3  0.725286377838993
    4  0.7242629367174096
    avg ~= 0.725
    
Upgrading to use settings from GridSearchCV work:

    Local: 0.71668 (5-fold)
    Kaggle-test: 0.71787 (V5 10-fold)
    Kagle-result: 0.71853 (V5 10-fold)

In [11]:
preds = np.mean(np.column_stack(final_predictions), axis=1)

In [None]:
# sample_submission.target = preds
# sample_submission.to_csv("submission.csv", index=False)

# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use ordinal encoding and save our encoded features as new variables `X` and `X_test`.

In [25]:
# List of categorical columns
object_cols = [col for col in features.columns if 'cat' in col]

# ordinal-encode categorical columns
X = features.copy()
X_test = df_test.copy()

ordinal_encoder = OrdinalEncoder()
X[object_cols] = ordinal_encoder.fit_transform(features[object_cols])
X_test[object_cols] = ordinal_encoder.transform(df_test[object_cols])

# Preview the ordinal-encoded features
X.head()

Unnamed: 0,id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,...,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,kfold
0,1,1.0,1.0,1.0,2.0,1.0,1.0,0.0,4.0,2.0,...,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985,0
1,2,1.0,1.0,0.0,0.0,1.0,3.0,0.0,5.0,0.0,...,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083,2
2,3,0.0,0.0,0.0,2.0,1.0,3.0,0.0,3.0,0.0,...,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846,4
3,4,1.0,1.0,0.0,2.0,1.0,3.0,0.0,4.0,2.0,...,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682,3
4,6,0.0,0.0,0.0,2.0,1.0,3.0,0.0,4.0,0.0,...,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823,1


In [7]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

In [8]:
from sklearn import preprocessing
from sklearn import utils

lab_enc = preprocessing.LabelEncoder()
y_train_enc = lab_enc.fit_transform(y_train)

lab_enc = preprocessing.LabelEncoder()
y_valid_enc = lab_enc.fit_transform(y_valid)

# Step 4: Train a model

Now that the data is prepared, the next step is to train a model.  

If you took the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** courses, then you learned about **[Random Forests](https://www.kaggle.com/dansbecker/random-forests)**.  In the code cell below, we fit a random forest model to the data.

In [7]:
run_rf = False

if run_rf:
    model = RandomForestRegressor(random_state=1)

    # Train the model (will take about 10 minutes to run)
    %time model.fit(X_train, y_train)

    pred_rf = model.predict(X_valid)
    print(mean_squared_error(y_valid, pred_rf, squared=False))

In the code cell above, we set `squared=False` to get the root mean squared error (RMSE) on the validation data.

In [None]:
xgb_model.get_params().keys()

In [9]:
run_xgb = False

if run_xgb:
    # Feed the XGB into the model pipeline
    my_pipeline = Pipeline(
        [
         # ('imputer', Imputer()),
         ('xgbrg', XGBRegressor())
        ]
    )

    param_grid = {
        "xgbrg__n_estimators": [5000, 10000],
        "xgbrg__learning_rate": [0.05, 0.1],
        "xgbrg__subsample": [0.8],
        "xgbrg__colsample_bytree": [0.2],
        "xgbrg__max_depth": [3, 5],
        "xgbrg__booster": ['gbtree'],
        "xgbrg__reg_lambda": [0.2, 0.4, 0.6],
        "xgbrg__reg_alpha": [13, 15],
        "xgbrg__random_state": [42],
        "xgbrg__n_jobs": [-1],
        # "xgbrg__gpu_id": [0],
        # "xgbrg__tree_method": ['gpu_hist'],
        # "xgbrg__verbosity": [1]
    }

    '''
    params = {
        'learning_rate': 0.07853392035787837,
        'reg_lambda': 1.7549293092194938e-05,
        'reg_alpha': 14.68267919457715,
        'subsample': 0.8031450486786944,
        'colsample_bytree': 0.170759104940733,
        'max_depth': 3
    }
    '''

    searchCV = GridSearchCV(
        my_pipeline,
        cv=3,
        param_grid=param_grid,
    )

    start = time.time()

    searchCV.fit(
        X_train, y_train,
        xgbrg__early_stopping_rounds=300,
        xgbrg__eval_set=[(X_valid, y_valid)],
        xgbrg__verbose=1000
    )

    print((time.time() - start)/60.0)


In [10]:
# Print the parameters which yield the best model performance
print(searchCV.best_estimator_)
print(searchCV.best_score_)
print(searchCV.best_params_)
# print(pd.DataFrame(grid.cv_results_))


NameError: name 'searchCV' is not defined

In [15]:
xgb_parameters = {
    'n_estimators': 5000,
    'learning_rate': 0.05,
    'n_jobs': -1,
    'subsample': 0.8,
    'colsample_bytree': 0.2,
    'max_depth': 3,
    'booster': 'gbtree',
    'reg_lambda': 0.2,
    'reg_alpha': 15,
    'random_state': 42,
    # 'gpu_id': 0,
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor'
}

'''
params = {
    'learning_rate': 0.07853392035787837,
    'reg_lambda': 1.7549293092194938e-05,
    'reg_alpha': 14.68267919457715,
    'subsample': 0.8031450486786944,
    'colsample_bytree': 0.170759104940733,
    'max_depth': 3
}
'''

xgb_model = XGBRegressor(**xgb_parameters)

start = time.time()

xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    early_stopping_rounds=300,
    verbose=1000,
)

print((time.time()-start)/60.0)

pred_xgb = xgb_model.predict(X_valid)

print(mean_squared_error(y_valid, pred_xgb, squared=False))

[0]	validation_0-rmse:7.39030
[1000]	validation_0-rmse:0.72304
[2000]	validation_0-rmse:0.71998
[3000]	validation_0-rmse:0.71903
[4000]	validation_0-rmse:0.71885
[4239]	validation_0-rmse:0.71888
2.8950045386950176
0.7188357949796684


### Using Light GBM

In [None]:
run_lgbm = False

if run_lgbm:
    from lightgbm import LGBMRegressor

    lgbm_parameters = {
        'metric': 'rmse',
        'n_jobs': -1,
        'n_estimators': 10000,
        'reg_alpha': 10.924491968127692,
        'reg_lambda': 17.396730654687218,
        'colsample_bytree': 0.21497646795452627,
        'subsample': 0.7582562557431147,
        'learning_rate': 0.01,
        'max_depth': 12,
        'num_leaves': 32,
        'min_child_samples': 16,
        'max_bin': 256,
        'cat_l2': 0.025083670064082797
    }

    lgbm_model = LGBMRegressor(**lgbm_parameters)
    lgbm_model.fit(
        X_train, y_train,
        eval_set=((X_valid, y_valid)),
        verbose=-1,
        early_stopping_rounds=64,
        categorical_feature=object_cols
    )

    pred_lgbm = lgbm_model.predict(X_valid)

    print(mean_squared_error(y_valid, pred_lgbm, squared=False))

### TPOT to find best solution

In [14]:
type(y_train)

pandas.core.series.Series

In [9]:
# TPOT for classification
from tpot import TPOTClassifier

# Instantiate and train a TPOT auto-ML classifier
tpot = TPOTClassifier(
    generations=1,
    population_size=5,
    subsample=0.05,
    # config_dict='TPOT cuML',
    verbosity=2,
    n_jobs=-1,
    random_state=42,
)

%time tpot.fit(X_train, y_train_enc)

### END CODE HERE ###

# Export the optimized pipeline as Python code.
tpot.export('tpot_products_pipeline.py')

Optimization Progress:   0%|          | 0/10 [00:00<?, ?pipeline/s]

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.

In [None]:
# Use the model to generate predictions
predictions = model.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})

output.to_csv('submission.csv', index=False)

In [None]:
params = {
    'learning_rate': 0.07853392035787837,
    'reg_lambda': 1.7549293092194938e-05,
    'reg_alpha': 14.68267919457715,
    'subsample': 0.8031450486786944,
    'colsample_bytree': 0.170759104940733,
    'max_depth': 3
}

model = XGBRegressor(
    random_state=0, 
    #tree_method='gpu_hist',
    #gpu_id=0,
    #predictor="gpu_predictor",
    n_estimators=5000,
    **params
)

model.fit(
    xtrain, ytrain,
    early_stopping_rounds=300,
    eval_set=[(xvalid, yvalid)],
    verbose=1000
)

preds_valid = model.predict(xvalid)
test_preds = model.predict(xtest)
final_predictions.append(test_preds)
rmse = mean_squared_error(yvalid, preds_valid, squared=False)
print(fold, rmse)
scores.append(rmse)

print(np.mean(scores), np.std(scores))