Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

### TensorFlow
GPU support is enabled when it is available; keeping `tensorflow-gpu` working under Anaconda on Windows has been unreliable.


In [None]:
import tensorflow as tf
print(tf.__version__)

config = tf.compat.v1.ConfigProto(
    gpu_options=tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.8),
    device_count={'GPU': 1},
)

session = tf.compat.v1.Session(config=config)
tf.compat.v1.keras.backend.set_session(session)

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))


### Import the necessary modules

In [None]:
import time
import os
import numpy as np
import pandas as pd

# data management
from sklearn import model_selection
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix

# Regressors
from sklearn.ensemble import RandomForestRegressor
# from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline

# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# suppress "torch" warning in TPOT
import warnings
warnings.filterwarnings('ignore')


# Step 2: Load the data

Next, we'll load the training and test data.  

The code cell below reads both files with the default integer index, so `id` stays an ordinary column.  (*Passing `index_col=0` to `pd.read_csv` would instead use the `id` column to index the DataFrame; try adding it temporarily and see how it changes the result.*)
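To make the effect of `index_col=0` concrete, here is a minimal sketch that reads a tiny made-up CSV both ways. The two rows and their values are invented for illustration and are not taken from the competition data.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for train.csv (values are illustrative only)
csv_text = "id,cont0,target\n10,0.5,7.1\n11,0.8,8.3\n"

# Default: a fresh 0..n-1 integer index, and `id` stays a regular column
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (`id`) becomes the index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns), list(df_default.index))
print(list(df_indexed.columns), list(df_indexed.index))
```

With the default index, row labels and row positions coincide, which matters for any later code that writes to rows via `.loc` using positional indices.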

In [None]:
# Load the training data
path = ""  # "../input/30-days-of-ml/"
df_train = pd.read_csv(f"{path}train.csv")
df_test = pd.read_csv(f"{path}test.csv")

# Preview the data
df_train.head()

In [None]:
df_train.columns

In [None]:
df_train["kfold"] = -1

kf = model_selection.KFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

for fold, (train_indices, valid_indices) in enumerate(kf.split(X=df_train)):
    df_train.loc[valid_indices, "kfold"] = fold

In [None]:
df_train.to_csv("train_folds.csv", index=False)
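The fold assignment above can be sanity-checked on a toy frame. The sketch below uses a made-up 10-row DataFrame (an assumption for illustration) and confirms that every row ends up in exactly one validation fold.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame with 10 rows and the default 0..9 index (made-up data)
toy = pd.DataFrame({"x": range(10)})
toy["kfold"] = -1

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(kf.split(X=toy)):
    # valid_idx holds row positions; they equal the row labels only
    # because the frame uses the default integer index
    toy.loc[valid_idx, "kfold"] = fold

# Count how many rows were assigned to each fold
fold_sizes = toy["kfold"].value_counts().sort_index()
print(fold_sizes.to_dict())
```

If the DataFrame were indexed by `id` instead, `.loc` would interpret the positions as labels, so `.iloc` with the column position would be the safer choice in that case.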

In [None]:
# Separate target from features; also drop the `kfold` bookkeeping
# column, which does not exist in the test set
y = df_train['target']
features = df_train.drop(['target', 'kfold'], axis=1)

# Preview features
features.head()

# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use ordinal encoding and save our encoded features as new variables `X` and `X_test`.
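As a quick refresher on what ordinal encoding does, the sketch below fits an `OrdinalEncoder` on a made-up single-column frame (the category letters are invented for illustration) and reuses the fitted encoder on a second frame, mirroring the train/test pattern used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data standing in for one `cat` column
train_toy = pd.DataFrame({"cat0": ["B", "A", "C", "A"]})
test_toy = pd.DataFrame({"cat0": ["C", "B"]})

encoder = OrdinalEncoder()
# Learn the category -> integer mapping from the training data only
train_enc = encoder.fit_transform(train_toy[["cat0"]])
# Apply the same mapping to the test data
test_enc = encoder.transform(test_toy[["cat0"]])

print(encoder.categories_)
print(train_enc.ravel().tolist())
print(test_enc.ravel().tolist())
```

By default the encoder raises an error if the test data contains a category it did not see during `fit`, so it is worth checking that the two files share the same category values.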

In [None]:
# List of categorical columns
object_cols = [col for col in features.columns if 'cat' in col]

# ordinal-encode categorical columns
X = features.copy()
X_test = df_test.copy()

ordinal_encoder = OrdinalEncoder()
X[object_cols] = ordinal_encoder.fit_transform(features[object_cols])
X_test[object_cols] = ordinal_encoder.transform(df_test[object_cols])

# Preview the ordinal-encoded features
X.head()

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
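When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for validation. A minimal sketch on made-up arrays (20 rows, invented for illustration) shows the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 feature columns
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Same call shape as above: only random_state is set
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=0)

print(len(X_tr), len(X_va))
```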

In [None]:
from sklearn import preprocessing

# Integer-encode the target values. Each split gets its own encoder,
# so the integer codes are not comparable between the two arrays.
lab_enc = preprocessing.LabelEncoder()
y_train_enc = lab_enc.fit_transform(y_train)

lab_enc = preprocessing.LabelEncoder()
y_valid_enc = lab_enc.fit_transform(y_valid)

# Step 4: Train a model

Now that the data is prepared, the next step is to train a model.  

If you took the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course, then you learned about **[Random Forests](https://www.kaggle.com/dansbecker/random-forests)**.  In the code cell below, we fit a random forest model to the data.

In [None]:
run_rf = False

if run_rf:
    model = RandomForestRegressor(random_state=1)

    # Train the model (will take about 10 minutes to run)
    %time model.fit(X_train, y_train)

    pred_rf = model.predict(X_valid)
    print(mean_squared_error(y_valid, pred_rf, squared=False))

In the code cell above, we set `squared=False` to get the root mean squared error (RMSE) on the validation data.
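RMSE is simply the square root of the mean squared error. The sketch below computes both by hand with NumPy on three made-up values (invented for illustration), which can be handy for double-checking a library result.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Mean of the squared errors, then its square root
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(mse, rmse)
```

Because RMSE is in the same units as the target, it is usually easier to interpret than the squared error.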

In [None]:
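# List the tunable XGBRegressor parameters; a fresh instance is used
# because `xgb_model` is only created in a later cell
XGBRegressor().get_params().keys()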

In [None]:
run_xgb_search = True

if run_xgb_search:
    # Feed the XGB into the model pipeline
    my_pipeline = Pipeline(
        [
         # ('imputer', Imputer()),
         ('xgbrg', XGBRegressor())
        ]
    )

    param_grid = {
        "xgbrg__n_estimators": [5000, 10000],
        "xgbrg__learning_rate": [0.05, 0.1],
        "xgbrg__subsample": [0.8],
        "xgbrg__colsample_bytree": [0.2],
        "xgbrg__max_depth": [3, 5],
        "xgbrg__booster": ['gbtree'],
        "xgbrg__reg_lambda": [0.2, 0.4, 0.6],
        "xgbrg__reg_alpha": [13, 15],
        "xgbrg__random_state": [42],
        "xgbrg__n_jobs": [-1],
        "xgbrg__gpu_id": [0],
        "xgbrg__tree_method": ['gpu_hist'],
        # "xgbrg__verbosity": [1]
    }

    '''
    params = {
        'learning_rate': 0.07853392035787837,
        'reg_lambda': 1.7549293092194938e-05,
        'reg_alpha': 14.68267919457715,
        'subsample': 0.8031450486786944,
        'colsample_bytree': 0.170759104940733,
        'max_depth': 3
    }
    '''

    searchCV = GridSearchCV(
        my_pipeline,
        cv=3,
        param_grid=param_grid,
    )

    start = time.time()

    searchCV.fit(
        X_train, y_train,
        xgbrg__early_stopping_rounds=300,
        xgbrg__eval_set=[(X_valid, y_valid)],
        xgbrg__verbose=1000
    )

    print((time.time() - start)/60.0)


In [None]:
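# Print the parameters which yield the best model performance.
# No `scoring` was passed to GridSearchCV, so best_score_ is the
# estimator's default score (R^2 for regressors), not RMSE.
print(searchCV.best_estimator_)
print(searchCV.best_score_)
print(searchCV.best_params_)
# print(pd.DataFrame(searchCV.cv_results_))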
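Since the competition metric is RMSE, it can be convenient to have the grid search rank candidates by RMSE directly. The sketch below shows one way to do that with `scoring="neg_root_mean_squared_error"`. It runs on synthetic data with a `Ridge` model standing in for the XGBoost pipeline; both are assumptions chosen so the example runs in a fraction of a second.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    Ridge(),  # stand-in for the XGBoost pipeline
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_toy, y_toy)

# Scores are negated so that "higher is better"; flip the sign to get RMSE
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```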


In [None]:
run_xgb = False

if run_xgb:
    xgb_parameters = {
        'n_estimators': 5000,
        'learning_rate': 0.05,
        'n_jobs': -1,
        'subsample': 0.8,
        'colsample_bytree': 0.2,
        'max_depth': 3,
        'booster': 'gbtree',
        'reg_lambda': 0.2,
        'reg_alpha': 15,
        'random_state': 42,
        # 'gpu_id': 0,
        # 'tree_method': 'gpu_hist',
        # 'predictor': 'gpu_predictor'
    }

    '''
    params = {
        'learning_rate': 0.07853392035787837,
        'reg_lambda': 1.7549293092194938e-05,
        'reg_alpha': 14.68267919457715,
        'subsample': 0.8031450486786944,
        'colsample_bytree': 0.170759104940733,
        'max_depth': 3
    }
    '''

    xgb_model = XGBRegressor(**xgb_parameters)

    start = time.time()

    xgb_model.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        early_stopping_rounds=300,
        verbose=1000,
    )

    print((time.time()-start)/60.0)

    pred_xgb = xgb_model.predict(X_valid)

    print(mean_squared_error(y_valid, pred_xgb, squared=False))

### Using LightGBM

In [None]:
run_lgbm = False

if run_lgbm:
    from lightgbm import LGBMRegressor

    lgbm_parameters = {
        'metric': 'rmse',
        'n_jobs': -1,
        'n_estimators': 10000,
        'reg_alpha': 10.924491968127692,
        'reg_lambda': 17.396730654687218,
        'colsample_bytree': 0.21497646795452627,
        'subsample': 0.7582562557431147,
        'learning_rate': 0.01,
        'max_depth': 12,
        'num_leaves': 32,
        'min_child_samples': 16,
        'max_bin': 256,
        'cat_l2': 0.025083670064082797
    }

    lgbm_model = LGBMRegressor(**lgbm_parameters)
    lgbm_model.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        verbose=-1,
        early_stopping_rounds=64,
        categorical_feature=object_cols
    )

    pred_lgbm = lgbm_model.predict(X_valid)

    print(mean_squared_error(y_valid, pred_lgbm, squared=False))

### TPOT to find best solution

In [None]:
type(y_train)

In [None]:
# TPOT for regression (the competition target is continuous)
from tpot import TPOTRegressor

# Instantiate and train a TPOT auto-ML regressor
tpot = TPOTRegressor(
    generations=1,
    population_size=5,
    subsample=0.05,
    # config_dict='TPOT cuML',
    verbosity=2,
    n_jobs=-1,
    random_state=42,
)

%time tpot.fit(X_train, y_train)


# Export the optimized pipeline as Python code.
tpot.export('tpot_products_pipeline.py')

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.

In [None]:
# Use the model to generate predictions
predictions = model.predict(X_test)

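# Save the predictions to a CSV file. The test file was read without
# index_col, so the ids live in the `id` column rather than the index.
output = pd.DataFrame({'Id': df_test['id'],
                       'target': predictions})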

output.to_csv('submission.csv', index=False)
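Before uploading, it can help to read the file back and confirm its shape. The sketch below does this on made-up ids and predictions (invented for illustration) and writes to an in-memory buffer rather than to `submission.csv`, so it does not touch the real file.

```python
import io
import numpy as np
import pandas as pd

# Made-up ids and predictions standing in for the real test set
ids = np.array([0, 5, 15])
preds = np.array([8.1, 7.9, 8.4])

output_toy = pd.DataFrame({"Id": ids, "target": preds})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
output_toy.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic checks: expected columns, one row per id, no missing values
print(list(check.columns))
print(len(check) == len(ids))
print(bool(check.isna().any().any()))
```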

In [None]:
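# Tuned XGBoost hyperparameters
params = {
    'learning_rate': 0.07853392035787837,
    'reg_lambda': 1.7549293092194938e-05,
    'reg_alpha': 14.68267919457715,
    'subsample': 0.8031450486786944,
    'colsample_bytree': 0.170759104940733,
    'max_depth': 3
}

# Train on the ordinal-encoded features, leaving out bookkeeping columns
feature_cols = [col for col in X.columns if col not in ("id", "kfold")]
final_predictions = []
scores = []

# One model per fold: validate on rows tagged with `fold`, train on the rest
for fold in range(5):
    valid_mask = df_train["kfold"] == fold
    xtrain, ytrain = X.loc[~valid_mask, feature_cols], y[~valid_mask]
    xvalid, yvalid = X.loc[valid_mask, feature_cols], y[valid_mask]
    xtest = X_test[feature_cols]

    model = XGBRegressor(
        random_state=0,
        # tree_method='gpu_hist',
        # gpu_id=0,
        # predictor="gpu_predictor",
        n_estimators=5000,
        **params
    )

    model.fit(
        xtrain, ytrain,
        early_stopping_rounds=300,
        eval_set=[(xvalid, yvalid)],
        verbose=1000
    )

    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_predictions.append(test_preds)
    rmse = mean_squared_error(yvalid, preds_valid, squared=False)
    print(fold, rmse)
    scores.append(rmse)

# Mean and spread of the per-fold validation RMSE
print(np.mean(scores), np.std(scores))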
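The cell above collects one set of test predictions per fold in `final_predictions`. A common way to use them, sketched here with made-up numbers for three folds and four test rows, is to average across folds so that each test row gets a single blended prediction.

```python
import numpy as np

# Made-up per-fold test predictions (3 folds, 4 test rows)
final_predictions_toy = [
    np.array([8.0, 7.5, 9.0, 8.2]),
    np.array([8.2, 7.7, 8.8, 8.0]),
    np.array([8.1, 7.6, 8.9, 8.4]),
]

# Stack folds as columns, then average each row across the folds
blended = np.mean(np.column_stack(final_predictions_toy), axis=1)
print(blended)
```

The blended array has one value per test row and could be written to the `target` column of a submission file in place of a single model's predictions.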