# Extra-Trees, DNN, XGBoost 🌱

**Overview**

This notebook explores whether combining the predictions of several models improves inference.
I used the following notebook as guidance: https://www.kaggle.com/dmitryuarov/forest-of-extra-trees-0-9895-up-to-4th-place. Many thanks to the author for sharing.
I applied several modifications to the original code based on my experience in this competition and on other models I have developed.

**Goal:** Explore model blending for TPS February 2022 and improve the LB score.

---
**Notebook Overview**
- Notebook Goals: A few lines on the main objective of this Notebook
- Table of Contents
- Notebook Updates
- Future Ideas to Implement

**Installing Machine Learning Libraries**

**Loading the Required Python Libraries**

**Loading the Dataset Information**

**Understanding the Information Loaded, EDA, and Others**

**Data Pre-Processing**

- Memory Optimization
- Outlier Elimination
- Other Modifications Required (Merge, Join, Others)

**Feature Engineering.**

**Data Processing for Training**

- Label Encoding
- Feature Selection and Creation of Train Dataset and Labels
- Train, Test Split

**Baseline Model.**

**Model Inference and Evaluation.**

**Cross-Validation Loop.**

**Advanced Model Development and Training.**

**Model Inference and Evaluation.**

**Development of Other Model Architectures.**

**Model Ensembling, Blending, or Stacking.**

**Model Inference and Evaluation.**

**Model Submission.**
...

---

**Notebook Updates**

...

**Future Ideas**
...



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

from sklearn.ensemble import ExtraTreesClassifier
from scipy import stats

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '/kaggle/input/tabular-playground-series-feb-2022/'

In [None]:
%%time
# Configure notebook display settings to show 5 decimal places so tables look nicer.
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

In [None]:
%%time
trn_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv')
tst_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv')

In [None]:
%%time
sub = pd.read_csv('../input/tabular-playground-series-feb-2022/sample_submission.csv')

In [None]:
%%time
trn_data.head()

In [None]:
%%time
tst_data.head()

In [None]:
%%time
trn_data.describe()

In [None]:
%%time
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [None]:
%%time
trn_data = reduce_mem_usage(trn_data)

In [None]:
%%time
tst_data = reduce_mem_usage(tst_data)
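
One caveat with the downcast above: `np.float16` keeps only about three significant decimal digits, and values below roughly 6e-5 fall into its subnormal range, where the spacing is coarser still. The sketch below measures the worst-case relative rounding error for a target dtype. The sample values are hypothetical and were chosen only to illustrate small magnitudes, and `max_relative_error` is an illustrative helper that is not part of this notebook.

```python
import numpy as np

def max_relative_error(values, dtype):
    """Return the largest relative error introduced by casting `values` to `dtype`."""
    values = np.asarray(values, dtype=np.float64)
    # Cast down, then back up, so the difference is purely the rounding error.
    roundtrip = values.astype(dtype).astype(np.float64)
    return float(np.max(np.abs(roundtrip - values) / np.abs(values)))

# Hypothetical small-magnitude feature values (not taken from the dataset).
sample = np.array([1e-6, 2.5e-6, 1e-5])

err16 = max_relative_error(sample, np.float16)
err32 = max_relative_error(sample, np.float32)
print(f"float16 max relative error: {err16:.2e}")
print(f"float32 max relative error: {err32:.2e}")
```

If the float16 error turns out to be large for the real columns, one option is to skip the float16 branch of `reduce_mem_usage` and stop at float32.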

# Feature Engineering

In [None]:
%%time
ignore = ['target', 'row_id']
features = [feat for feat in trn_data.columns if feat not in ignore]

In [None]:
%%time
def create_features(df):
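    """
    Add row-wise summary statistics of the original features as new columns.
    """
    df['A_sum'] = df[features].sum(axis = 1)
    df['A_min'] = df[features].min(axis = 1)
    df['A_max'] = df[features].max(axis = 1)
    df['A_std'] = df[features].std(axis = 1)
    # DataFrame.mad was removed in pandas 2.0; compute the mean absolute deviation directly.
    df['A_mad'] = df[features].sub(df[features].mean(axis = 1), axis = 0).abs().mean(axis = 1)
    df['A_var'] = df[features].var(axis = 1)
    df['A_mean'] = df[features].mean(axis = 1)
    df['A_median'] = df[features].median(axis = 1)

    # Count positive values among the original features only. After reduce_mem_usage
    # those columns are no longer float64, so select_dtypes('float64') would miss them.
    df['A_positive'] = df[features].gt(0).sum(axis=1)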
    
    df['q01'] = df[features].quantile(q=0.01, axis=1)
    df['q05'] = df[features].quantile(q=0.05, axis=1)
    df['q10'] = df[features].quantile(q=0.10, axis=1)
    df['q25'] = df[features].quantile(q=0.25, axis=1)
    df['q75'] = df[features].quantile(q=0.75, axis=1)
    df['q90'] = df[features].quantile(q=0.90, axis=1)
    df['q95'] = df[features].quantile(q=0.95, axis=1)
    df['q99'] = df[features].quantile(q=0.99, axis=1)
    df['max'] = df[features].max(axis=1)
    df['min'] = df[features].min(axis=1)
    
    df['std'] = df[features].std(axis=1)
    df['range'] = df['max'] - df['min']
    df['iqr'] = df['q75'] - df['q25']
    df['tails'] = df['range'] / df['iqr']
    df['dispersion'] = df['std'] / df['A_mean']
    df['dispersion_2'] = df['iqr'] / df['A_median']
    df['skew'] = df[features].skew(axis=1)
    df['kurt'] = df[features].kurt(axis=1)
    
    df['median-max'] = df['A_median'] - df['max']
    df['median-min'] = df['A_median'] - df['min']
    df['q99-q95'] = df['q99'] - df['q95']
    df['q99-q90'] = df['q99'] - df['q90']
    df['q01-q05'] = df['q01'] - df['q05']
    df['q01-q10'] = df['q01'] - df['q10']

    return df

In [None]:
%%time
#trn_data = create_features(trn_data)
#tst_data = create_features(tst_data)
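
The ratio features in `create_features` (`tails`, `dispersion`, `dispersion_2`) divide by quantities that can be zero, which produces `inf` or `NaN`. Below is a minimal NumPy sketch of the `tails` case with a guard. The toy matrix and the `safe_tails` helper are hypothetical, and replacing non-finite values with `NaN` is just one possible choice, not what this notebook does.

```python
import numpy as np

def safe_tails(X):
    """Row-wise range / IQR, with non-finite results replaced by NaN."""
    X = np.asarray(X, dtype=np.float64)
    value_range = X.max(axis=1) - X.min(axis=1)
    iqr = np.quantile(X, 0.75, axis=1) - np.quantile(X, 0.25, axis=1)
    # Silence the divide-by-zero warnings; the non-finite results are handled below.
    with np.errstate(divide="ignore", invalid="ignore"):
        tails = value_range / iqr
    return np.where(np.isfinite(tails), tails, np.nan)

toy = np.array([
    [0.0, 0.0, 0.0, 0.0, 5.0],   # IQR is 0 while the range is 5: plain division gives inf
    [1.0, 2.0, 3.0, 4.0, 5.0],   # IQR is 2 and the range is 4
])
tails = safe_tails(toy)
print(tails)
```

Any `NaN` left behind would still need to be imputed before training a model that does not accept missing values.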

# Data Processing for Training

In [None]:
encoder = LabelEncoder()
trn_data['target'] = encoder.fit_transform(trn_data['target'])

In [None]:
X = trn_data[features]
y = trn_data['target']
X_test = tst_data[features]

# Machine Learning Model Development

In [None]:
SPLITS = 5
SEED = 51
SHUFFLE = True

# XGBoost Model

In [None]:
XGB_params = {'max_depth': 8,
              'learning_rate': 0.2478225904887278, 
              'min_child_weight': 8, 
              'gamma': 0.018329940112279165, 
              'alpha': 0.00019394894279195157, 
              'lambda': 0.06161761858777205, 
              'colsample_bytree': 0.6721122683333417, 
              'subsample': 0.6155733760919804,
              'n_estimators': 3000,
              'tree_method': 'gpu_hist',
              'booster': 'gbtree',
              'random_state': 228,
              'use_label_encoder': False,
              'objective': 'multi:softmax',
              'eval_metric': 'mlogloss',
              'predictor': 'gpu_predictor'
             }

In [None]:
scores, predictions = [], []

k = StratifiedKFold(n_splits = SPLITS, random_state = SEED, shuffle = SHUFFLE)

for iteration, (trn_idx, val_idx) in enumerate(k.split(X, y)):
    X_train, X_val = X.iloc[trn_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[trn_idx], y.iloc[val_idx]
    
    model = XGBClassifier(**XGB_params)
    model.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = False, early_stopping_rounds = 30)
    val_pred = model.predict(X_val)
    val_score = accuracy_score(y_val, val_pred)
    print(f'Fold {iteration+1} accuracy score: {round(val_score, 4)}')
    
    scores.append(val_score)
    predictions.append(model.predict(X_test))
    
print('')    
print(f'Mean accuracy - {round(np.mean(scores), 4)}')

In [None]:
sub['target'] = stats.mode(np.column_stack(predictions), axis = 1)[0]
sub.to_csv('xg_boost_submission.csv', index = False)

In [None]:
sub['target']

# Extra Trees Model

In [None]:
EXT_params = {'n_estimators': 2373,
              'max_depth':3691,
              'min_samples_split':3,
              'min_samples_leaf':1,
              'criterion':'gini',
             }

In [None]:
scores, predictions = [], []

k = StratifiedKFold(n_splits = SPLITS, random_state = SEED, shuffle = SHUFFLE)

for iteration, (trn_idx, val_idx) in enumerate(k.split(X, y)):
    X_train, X_val = X.iloc[trn_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[trn_idx], y.iloc[val_idx]
    
    model = ExtraTreesClassifier(**EXT_params)
    model.fit(X_train, y_train)
    val_pred = model.predict(X_val)
    val_score = accuracy_score(y_val, val_pred)
    print(f'Fold {iteration+1} accuracy score: {round(val_score, 4)}')
    
    scores.append(val_score)
    predictions.append(model.predict(X_test))
    
print('')    
print(f'Mean accuracy - {round(np.mean(scores), 4)}')

In [None]:
sub['target'] = stats.mode(np.column_stack(predictions), axis = 1)[0]
sub.to_csv('extra_trees_submission.csv', index = False)

In [None]:
sub['target']

# Deep Neural Network

In [None]:
import tensorflow as tf
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dense, Dropout, Input, Concatenate

In [None]:
def my_dnn_model():
    x_input = Input(shape = (X.shape[-1],), name = "input")
    x1 = Dense(256, activation='selu')(x_input)
    b1 = BatchNormalization()(x1)
    x2 = Dense(128, activation='selu')(b1)
    b2 = BatchNormalization()(x2)
    x3 = Dense(128, activation='selu')(b1)
    b3 = BatchNormalization()(x3)
    
    d1 = Dropout(0.15)(Concatenate()([b2, b3]))
    x4 = Dense(128, activation='relu')(d1) 
    b4 = BatchNormalization()(x4)
    x5 = Dense(64, activation='selu')(b4)
    b5 = BatchNormalization()(x5)
    x6 = Dense(32, activation='selu')(b5)
    b6 = BatchNormalization()(x6)
    output = Dense(10, activation="softmax", name="output")(b6)
    
    model = tf.keras.Model(x_input, output, name='DNN_Model')
    return model

model = my_dnn_model()

In [None]:
plot_model(model, to_file='Super_Model.png', show_shapes=True,show_layer_names=True, dpi = 65)

In [None]:
VERBOSE = False
BATCH_SIZE = 64
EPOCHS = 250

In [None]:
scores, predictions = [], []

k = StratifiedKFold(n_splits = SPLITS, random_state = SEED, shuffle = SHUFFLE)

for iteration, (trn_idx, val_idx) in enumerate(k.split(X, y)):
    X_train, X_val = X.iloc[trn_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[trn_idx], y.iloc[val_idx]
    
    model = my_dnn_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    
    lr = ReduceLROnPlateau(monitor="val_loss", factor=0.6, patience=3, verbose=VERBOSE)
    es = EarlyStopping(monitor="val_loss", patience=7, verbose=VERBOSE, mode="min", restore_best_weights=True)

    save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
    chk_point = ModelCheckpoint(f'./TPS1_model_2022_{iteration+1}C.h5', options=save_locally, monitor='val_loss', verbose=VERBOSE, save_best_only=True, mode='min')
    
    
    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=EPOCHS, verbose=VERBOSE, batch_size=BATCH_SIZE, callbacks=[lr, chk_point, es])
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    model = load_model(f'./TPS1_model_2022_{iteration+1}C.h5', options=load_locally)
    
    y_pred = model.predict(X_val, batch_size=BATCH_SIZE)
    score = accuracy_score(y_val, np.argmax(y_pred, axis=1))
    scores.append(score)
    
    predictions.append(np.argmax(model.predict(X_test, batch_size=BATCH_SIZE), axis=1))
    print(f"Fold-{iteration+1} | OOF Score: {score}")
    
print(f'Mean accuracy on {k.n_splits} folds - {np.mean(scores)}')

In [None]:
sub['target'] = stats.mode(np.column_stack(predictions), axis = 1)[0]
sub.to_csv('dnn_submission.csv', index=False)
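
The loop above takes `argmax` within each fold and then the mode across folds (hard voting). Because the softmax output is already one probability per class, an alternative is to average the fold probabilities first and take `argmax` once (soft voting). The sketch below contrasts the two on made-up probabilities for two samples and three classes. `fold_probs` is hypothetical data, and soft voting is not what this notebook does.

```python
import numpy as np

# Hypothetical per-fold class probabilities, shape (n_folds, n_samples, n_classes).
fold_probs = np.array([
    [[0.40, 0.35, 0.25], [0.10, 0.80, 0.10]],
    [[0.40, 0.35, 0.25], [0.20, 0.70, 0.10]],
    [[0.05, 0.90, 0.05], [0.30, 0.60, 0.10]],
])
n_classes = fold_probs.shape[2]

# Hard voting: argmax within each fold, then the most frequent label per sample.
fold_labels = fold_probs.argmax(axis=2)          # shape (n_folds, n_samples)
hard_vote = np.array([
    np.bincount(sample_labels, minlength=n_classes).argmax()
    for sample_labels in fold_labels.T
])

# Soft voting: average the probabilities over folds, then take a single argmax.
soft_vote = fold_probs.mean(axis=0).argmax(axis=1)

print("hard:", hard_vote)
print("soft:", soft_vote)
```

For the first sample, two folds lean weakly towards class 0 and one fold is very confident in class 1. Hard voting ignores that confidence, while soft voting lets it decide the outcome.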

In [None]:
sub['target']

# Blending Models

In [None]:
submission_01 = pd.read_csv('./extra_trees_submission.csv')
submission_02 = pd.read_csv('./xg_boost_submission.csv')
submission_03 = pd.read_csv('./dnn_submission.csv')

In [None]:
submission_01.head()

In [None]:
submission_02.head()

In [None]:
submission_03.head()

In [None]:
blend_predictions = []
for prediction in [submission_01, submission_02, submission_03]:
    blend_predictions.append(prediction['target'])
    
sub['target'] = encoder.inverse_transform(stats.mode(np.column_stack(blend_predictions), axis = 1)[0])
sub.to_csv('blended_submission.csv', index=False)
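
With only three models voting, all three can disagree on a row, and the mode then picks a label on grounds unrelated to model quality. One possible refinement is to fall back to a single trusted model whenever there is no strict winner. The sketch below is an assumption on my part rather than the method used above; `vote_with_fallback`, `fallback_index`, and the toy labels are hypothetical.

```python
import numpy as np

def vote_with_fallback(preds, fallback_index=0):
    """Majority vote over models; rows without a strict winner use one model's label.

    preds: integer array of shape (n_samples, n_models) with encoded class labels.
    fallback_index: column of the model to trust when the top vote count is shared.
    """
    preds = np.asarray(preds)
    n_classes = preds.max() + 1
    out = np.empty(len(preds), dtype=preds.dtype)
    for i, row in enumerate(preds):
        counts = np.bincount(row, minlength=n_classes)
        # A strict winner exists only if exactly one class reaches the top count.
        if (counts == counts.max()).sum() == 1:
            out[i] = counts.argmax()
        else:
            out[i] = row[fallback_index]
    return out

# Hypothetical encoded labels from three models (columns) for three samples (rows).
toy_preds = np.array([
    [2, 2, 5],   # two models agree on 2
    [7, 3, 1],   # three-way tie, so column 0 decides
    [4, 4, 4],   # unanimous
])
blended = vote_with_fallback(toy_preds, fallback_index=0)
print(blended)
```

In this notebook the columns would follow the order of `blend_predictions`, so `fallback_index` would point at whichever model scored best in cross-validation.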

In [None]:
sub