# Introduction

This notebook details the architecture and training of model 3. Model 3 is an ensemble of 10 models trained on the Ubiquant dataset using stratified 10-fold cross-validation. Other model ensembles will be used in conjunction with model 1 to make the final predictions on the test set for submission.

This notebook is made up of the following sections:

1. Importing libraries
2. Data importation
3. Data wrangling
4. Utility functions
5. Model architecture
6. Model training
7. API Submission
8. Evaluation
9. Next Steps

NB: This notebook is similar to the model 1 and model 2 notebooks; the difference is in the model architecture.

## 1. Importing Libraries


In [None]:
import os                                               # Functions for interacting with the OS
import pandas as pd                                     # Data manipulation
import numpy as np                                      # Mathematical functions
import gc                                               # Automatically releases memory when an object is no longer used
import matplotlib.pyplot as plt                         # Plotting
import tensorflow as tf                                 # Deep learning API
from tensorflow.keras import layers                     # Deep learning (defining layers)
from tensorflow import keras                            # Deep learning framework
from scipy import stats                                 # Scientific computing and technical computing
import random                                           # Generate random numbers
import seaborn as sns                                   # Plotting
from scipy.stats import *                               # Scientific computation and functions
import warnings                                         # Manage warning messages and outputs
import pickle                                           # Serializing object structures
import lightgbm as lgb                                  # high performance gradient boosting framework
from sklearn.model_selection import train_test_split    # Splitting data into train and test set



## 2. Data Importation

This competition's dataset (18.55 GB) is too large to load comfortably in memory. Instead, we use a converted copy of the dataset stored in pickle format at half precision, which uses less memory.
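
To get a feel for what half precision trades away, the sketch below uses only the standard library's `struct` module to compare storage sizes and the round-trip error for a single value. It assumes the converted dataset stores features as IEEE 754 half-precision floats, which the dataset name suggests but this notebook does not verify. `to_half` is a hypothetical helper introduced for illustration.

```python
import struct


def to_half(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (float16)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Bytes needed to store one value at each precision
bytes_per_value = {
    "float16": struct.calcsize("<e"),
    "float32": struct.calcsize("<f"),
    "float64": struct.calcsize("<d"),
}

n_features = 300  # same feature count as in this notebook
for name, size in bytes_per_value.items():
    print(f"{name}: {size * n_features} bytes per row of features")

# Half precision keeps only about three significant decimal digits
original = 0.1
print(original, "->", to_half(original))
```

The memory saving comes at the cost of precision, so small differences between feature values may be lost after conversion.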

In [None]:
%%time
# Track time to load dataset (the %%time cell magic must be the first line of the cell)

# Declare number of anonymized features
n_features = 300

# Select anonymized features
features = [f'f_{i}' for i in range(n_features)]

# Import train set
train = pd.read_pickle('../input/ubiquant-market-prediction-half-precision-pickle/train.pkl')

## 3. Data Wrangling

This section handles:

* Selecting the independent and dependent features from the dataset.
* Dropping the time_id feature (it will not be used in modeling).
* Creating an integer lookup layer for the investment_id feature.

In [None]:
# Dataset dimensions
train.shape

In [None]:
# Select investment_id feature for processing
investment_id = train.pop("investment_id")
investment_id.head()

In [None]:
# Drop time_id feature
_ = train.pop("time_id")

In [None]:
# Select dependent / target feature

y = train.pop("target")
y.head()

### 3.1 IntegerLookup Layer

An integer lookup layer is a preprocessing layer which maps integer features to contiguous ranges. It turns integer categorical values into an encoded representation that can be read by an Embedding layer or Dense layer.

This layer maps a set of arbitrary integer input tokens into indexed integer output via a table-based vocabulary lookup.

The integer lookup layer will be one of the two input branches of the multi-input Keras model. Having a lookup layer enables the Keras deep learning model to handle both categorical and numeric features.
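
The table-based lookup can be sketched in plain Python. This is not the Keras implementation, only an illustration of the idea: known IDs get contiguous indices starting at 1, and index 0 is reserved for IDs that were never seen, which is why the layer needs one more token than there are unique IDs. The IDs below are hypothetical, and the vocabulary is sorted here only for readability; the order Keras assigns when adapting to data may differ.

```python
def build_lookup(vocabulary):
    """Map each known ID to a contiguous index starting at 1 (0 is kept for unknown IDs)."""
    return {token: index for index, token in enumerate(vocabulary, start=1)}


def lookup(table, tokens):
    """Translate raw IDs into indices, sending unseen IDs to the out-of-vocabulary index 0."""
    return [table.get(token, 0) for token in tokens]


# Hypothetical, non-contiguous investment IDs
known_ids = [7, 1042, 3, 3770]
table = build_lookup(sorted(known_ids))

# 9999 was never seen, so it maps to index 0
print(lookup(table, [3, 3770, 9999]))
```

The contiguous indices are what an Embedding layer expects as input, since each index selects one row of the embedding matrix.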

In [None]:
%%time
# Track processing time (the %%time cell magic must be the first line of the cell)

# Create a list of unique investment_ids
investment_ids = list(investment_id.unique())

# maximum tokens
investment_id_size = len(investment_ids) + 1

# initialize layer
investment_id_lookup_layer = layers.IntegerLookup(max_tokens=investment_id_size)

# Adapt layer to data (investment_ids)
investment_id_lookup_layer.adapt(pd.DataFrame({"investment_ids":investment_ids}))

## 4. Utility Functions

This section defines utility functions that convert the data into TensorFlow datasets for model training.

In [None]:
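# Making TensorFlow dataset
def preprocess(X, y):
    
    """
    Pre-process one element of a TensorFlow dataset.
    Currently a pass-through: the inputs are returned unchanged.
    
    Parameters
    ----------
    X : tuple of tensors
        (investment_id, features) for one sample.

    y : tensor
        Target value for the sample.
    
    """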
    return X, y


def make_dataset(feature, investment_id, y, batch_size=1024, mode="train"):
    
    """ Function to create a source dataset compatible with TensorFlow. 
    In addition, a dataset transformation is applied and the data is shuffled 
    if it is part of the training set. 

    Parameters
    ----------
    feature : array, shape = [n, 300]
        300 anonymized features.
    investment_id : list of int, shape = [n]
        List of investment IDs.
    y : array, shape = [n]
        Array containing target values we wish to predict.
    batch_size : int, default = 1024
        Size of batches.
    mode : string, default = "train"
        Variable used to specify if the data is from the training, test or
        validation data sets.
    
    Returns
    -------
    ds : tensorflow dataset, class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset' 
        Dataset in a format compatible with model training.
    
    """
    
    ## Read elements from memory
    ds = tf.data.Dataset.from_tensor_slices(((investment_id, feature), y))
    
    ## Map preprocess function
    ds = ds.map(preprocess)
    
    ## If mode is set to train shuffle data
    if mode == "train":
        ds = ds.shuffle(4096)
        
    # Combine consecutive elements of this dataset into batches.
    # Cache the elements in dataset
    # Allow later elements to be prepared while the current element is being processed (prefetch)
    
    ds = ds.batch(batch_size).cache().prefetch(tf.data.experimental.AUTOTUNE)
    
    return ds

## 5. Modeling

This section defines the model architecture:

* Layers
* Activation functions
* Optimizer
* Loss function
* Metrics to be tracked


The model architecture is a multi-input Keras network with 2 input branches. The first branch handles investment IDs (a categorical feature), while the second branch handles the remaining 300 anonymized features (numeric features).

### 5.1 Activation Function

Swish, defined as x · sigmoid(x), is a smooth, non-monotonic activation function that is unbounded above and bounded below. It has been reported to match or outperform ReLU on deep networks.
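
The shape of Swish can be checked with a few lines of plain Python. This is a minimal sketch of the formula x · sigmoid(x), not the Keras implementation; the sample points are chosen only to show that the function dips below zero for negative inputs and then rises back towards zero, which is what makes it non-monotonic.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


# Negative inputs give small negative outputs; large positive inputs pass through almost unchanged
for x in (-5.0, -1.278, -0.5, 0.0, 2.0, 10.0):
    print(f"swish({x:>6}) = {swish(x): .4f}")
```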


### 5.2 Optimizer

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.

The Adam optimizer will be used with a learning rate of 0.001.
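
The moment estimates can be made concrete with a scalar sketch of the textbook Adam update. This is an illustration under assumptions, not the Keras implementation, which differs in small details such as how epsilon is applied. `adam_step` and the quadratic objective are hypothetical and chosen only for the example; the learning rate matches the one used in this notebook.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one textbook Adam update to a scalar parameter and return (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(w) = (w - 3)^2, starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 10001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
print(round(w, 3))
```

Because the update divides by the square root of the second-moment estimate, the first step has a size close to the learning rate regardless of how large the gradient is.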

### 5.3 Loss Function

The model will attempt to minimize MSE (mean squared error).

### 5.4 Metrics

The following metrics will be tracked during training:

1. **Mean Squared Error (MSE)** : Average squared difference between the estimated values and the actual values.
2. **Mean Absolute Error (MAE)** : Average of the absolute errors between paired observations.
3. **Mean Absolute Percentage Error (MAPE)** : Measures how accurate a forecast is as a percentage of the actual value.
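
The three metrics can be written out in plain Python to show what each one measures. This is a sketch on made-up values rather than the Keras implementations, and the `mape` helper assumes that no true value is exactly zero.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no true value is exactly zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only
y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 3.0]
print("MSE :", mse(y_true, y_pred))
print("MAE :", mae(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))
```

Since MAPE divides by the true value, it becomes very large when targets are close to zero, so it is best read alongside MSE and MAE rather than on its own.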

In [None]:

def get_model():
    
    """ 
    
    Function to define the multi-input keras model architecture. 

    Returns
    -------
    
    model : model, class 'keras.engine.functional.Functional'
        Model groups layers into an object with training and inference features.
    
    """
    investment_id_inputs = tf.keras.Input((1, ), dtype=tf.uint16)
    features_inputs = tf.keras.Input((300, ), dtype=tf.float16)
    
    # Branch 1
    investment_id_x = investment_id_lookup_layer(investment_id_inputs)
    # Turns positive integers (indexes) into dense vectors of fixed size
    investment_id_x = layers.Embedding(investment_id_size, 32, input_length=1)(investment_id_x) 
    investment_id_x = layers.Reshape((-1, ))(investment_id_x)
    investment_id_x = layers.Dense(64, activation='swish')(investment_id_x)
    investment_id_x = layers.Dense(64, activation='swish')(investment_id_x)
    investment_id_x = layers.Dense(64, activation='swish')(investment_id_x)
    
    # Branch 2
    feature_x = layers.Dense(256, activation='swish')(features_inputs)
    feature_x = layers.Dense(256, activation='swish')(feature_x)
    feature_x = layers.Dense(256, activation='swish')(feature_x)
    
    # Takes as input a list of tensors and returns a single tensor that is the concatenation of all inputs
    x = layers.Concatenate(axis=1)([investment_id_x, feature_x])
    x = layers.Dense(512, activation='swish', kernel_regularizer="l2")(x)
    x = layers.Dense(128, activation='swish', kernel_regularizer="l2")(x)
    x = layers.Dense(32, activation='swish', kernel_regularizer="l2")(x)
    
    output = layers.Dense(1)(x)
    
    # Track RMSE as well, so that "rmse" and "val_rmse" appear in the training history
    rmse = keras.metrics.RootMeanSquaredError(name="rmse")
    
    model = tf.keras.Model(inputs=[investment_id_inputs, features_inputs], outputs=[output])
    
    model.compile(optimizer=tf.optimizers.Adam(0.001), loss='mse', metrics=['mse', "mae", "mape", rmse])
    
    return model

In [None]:
model = get_model()
model.summary()
keras.utils.plot_model(model,to_file="model3-architecture.png", show_shapes=True)

## 6. Model Training

* Stratified cross-validation (a resampling procedure) will be applied to split the data into 10 folds, resulting in 10 models.
* Each model will be trained with a batch size of 1024 (the default of the `make_dataset` function).
* Callbacks will be used to avoid overfitting: early stopping with a patience of 10 epochs.
* Each model is trained for at most 30 epochs.
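
The idea behind stratification can be sketched in plain Python: group the row indices by label, then deal each group evenly across the folds. This is a simplified illustration and not the algorithm scikit-learn uses. `stratified_folds` and the labels are hypothetical; in this notebook the stratification label is the investment ID.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Assign each sample index to a fold so that every label is spread evenly across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal this label's samples round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Hypothetical labels: 3 investment IDs with 10 rows each
labels = [101] * 10 + [202] * 10 + [303] * 10
folds = stratified_folds(labels, n_splits=5)
for fold_number, fold in enumerate(folds):
    print(fold_number, Counter(labels[i] for i in fold))
```

Each fold serves once as the validation set while the remaining folds are used for training, which is how 10 folds yield 10 models.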


In [None]:
%%time
# Stratification ensures that each fold has the same proportion of observations for a given label (here, investment_id).

from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(10, shuffle=True, random_state=42)
models = []
for index, (train_indices, valid_indices) in enumerate(kfold.split(train, investment_id)):
    # Split features, investment IDs and targets into training and validation folds
    X_train, X_val = train.iloc[train_indices], train.iloc[valid_indices]
    investment_id_train = investment_id[train_indices]
    y_train, y_val = y.iloc[train_indices], y.iloc[valid_indices]
    investment_id_val = investment_id[valid_indices]
    train_ds = make_dataset(X_train, investment_id_train, y_train)
    valid_ds = make_dataset(X_val, investment_id_val, y_val, mode="valid")
    model = get_model()
    # Save the best model (lowest validation loss) and stop after 10 epochs without improvement
    checkpoint = keras.callbacks.ModelCheckpoint(f"model_{index}", save_best_only=True)
    early_stop = keras.callbacks.EarlyStopping(patience=10)
    history = model.fit(train_ds, epochs=30, validation_data=valid_ds, callbacks=[checkpoint, early_stop])
    # Reload the best checkpoint for this fold and keep it for inference
    models.append(keras.models.load_model(f"model_{index}"))
    
    # Pearson correlation on the validation fold (computed with the final-epoch weights of `model`)
    pearson_score = stats.pearsonr(model.predict(valid_ds).ravel(), y_val.values)[0]
    print('Pearson:', pearson_score)
    pd.DataFrame(history.history, columns=["mse", "val_mse"]).plot()
    plt.title("MSE")
    plt.show()
    pd.DataFrame(history.history, columns=["mae", "val_mae"]).plot()
    plt.title("MAE")
    plt.show()
    pd.DataFrame(history.history, columns=["rmse", "val_rmse"]).plot()
    plt.title("RMSE")
    plt.show()
    del investment_id_train
    del investment_id_val
    del X_train
    del X_val
    del y_train
    del y_val
    del train_ds
    del valid_ds
    gc.collect()

## 7. API Submission

This section preprocesses the test set from the API and makes predictions with the 10 models trained on the cross-validation folds. The 10 predictions are averaged into a single prediction by the inference function.
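
The averaging step is an element-wise mean across models. The plain-Python sketch below mirrors that idea on made-up numbers; `average_predictions` is a hypothetical helper, and the values are for illustration only.

```python
def average_predictions(predictions_per_model):
    """Average element-wise across models; each inner list holds one model's predictions."""
    n_models = len(predictions_per_model)
    return [sum(values) / n_models for values in zip(*predictions_per_model)]


# Hypothetical predictions from 3 models for 2 test rows
predictions = [
    [0.10, -0.20],
    [0.30, 0.00],
    [0.20, -0.10],
]
averaged = average_predictions(predictions)
print(averaged)
```

Averaging gives every model equal weight, which tends to smooth out the errors of individual folds.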

In [None]:
def preprocess_test(investment_id, feature):
    """ Function to pre-process the test set.
    
    Parameters
    ----------
    investment_id : list of int, shape = [n]
        List of investment IDs.
    feature : array, shape = [n, 300]
        300 anonymized features.

    Returns
    -------
    tuple
        ((investment_id, feature), 0), where 0 is a placeholder target.
    """
    return (investment_id, feature), 0



def make_test_dataset(feature, investment_id, batch_size=1024):
    
    """ Function to create a source dataset from the test features compatible 
    with TensorFlow. 
    A dataset transformation is applied; the data is not shuffled.

    Parameters
    ----------
    feature : array, shape = [n, 300]
        300 anonymized features.
    investment_id : list of int, shape = [n]
        List of investment IDs.
    batch_size : int, default = 1024
        Size of batches.
    
    Returns
    -------
    ds : tensorflow dataset, class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset' 
        Dataset in a format compatible with model inference.
    """
    
    ds = tf.data.Dataset.from_tensor_slices(((investment_id, feature)))
    ds = ds.map(preprocess_test)
    ds = ds.batch(batch_size).cache().prefetch(tf.data.experimental.AUTOTUNE)
    return ds

def inference(models, ds):
    
    """ Make predictions using the n models in models and return the mean of the predictions.

    Parameters
    ----------
    models : array like, shape = [n]
        Trained models.
    ds : tensorflow dataset, class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset' 
        Dataset in a format compatible with model inference.
    
    Returns
    -------
    mean_y_pred : array
        Mean of the predictions made by each model in models.
    
    """
    
    y_preds = []
    for model in models:
        y_pred = model.predict(ds)
        y_preds.append(y_pred)
    return np.mean(y_preds, axis=0)

In [None]:
import ubiquant
env = ubiquant.make_env()
iter_test = env.iter_test() 
for (test_df, sample_prediction_df) in iter_test:
    ds = make_test_dataset(test_df[features], test_df["investment_id"])
    sample_prediction_df['target'] = inference(models, ds)
    env.predict(sample_prediction_df) 

## 8. Evaluation

Submissions are evaluated on the mean of the Pearson correlation coefficient for each time ID.
The 10-model ensemble resulted in a score of **0.148**.
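
The evaluation metric can be sketched in plain Python: compute the Pearson correlation separately for each time ID, then average the results. This is an illustration on made-up values, not the official scoring code. `pearson` and `mean_pearson_by_time_id` are hypothetical helpers, and the sketch assumes every time ID has at least two rows with non-constant values.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pearson_by_time_id(time_ids, y_true, y_pred):
    """Average the per-time-ID Pearson correlations."""
    groups = defaultdict(lambda: ([], []))
    for time_id, t, p in zip(time_ids, y_true, y_pred):
        groups[time_id][0].append(t)
        groups[time_id][1].append(p)
    scores = [pearson(t, p) for t, p in groups.values()]
    return sum(scores) / len(scores)


# Toy data: perfect agreement at time 0, perfect disagreement at time 1
time_ids = [0, 0, 0, 1, 1, 1]
y_true = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y_pred = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0]
score = mean_pearson_by_time_id(time_ids, y_true, y_pred)
print(score)
```

Note that the Pearson score printed for each fold during training is computed over the whole validation fold rather than per time ID, so it is not directly comparable to the submission score.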

## 9. Next Steps

The strategy is to combine this ensemble with other model ensembles to improve the score.