# Install *treelite* to accelerate the tree model

In [None]:
!pip --quiet install ../input/treelite/treelite-0.93-py3-none-manylinux2010_x86_64.whl

In [None]:
!pip --quiet install ../input/treelite/treelite_runtime-0.93-py3-none-manylinux2010_x86_64.whl

# Import the required libraries

In [None]:
import os
import time
import pickle
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.metrics import log_loss, roc_auc_score
import gc
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.nn import CrossEntropyLoss, MSELoss
from torch.nn.modules.loss import _WeightedLoss
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score, accuracy_score
import treelite
import treelite_runtime 
import warnings

# Display settings for notebook rows and warnings
1. By default, the notebook truncates a large DataFrame and shows the hidden rows and columns as "...". Raising the display limits makes that data visible when you need to inspect it.
2. Some code emits warnings that do not affect the results. Filtering them out makes the output much more readable.

In [None]:
# Show up to 100 columns and 100 rows when displaying a DataFrame
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# Suppress warnings in the output
warnings.filterwarnings("ignore")

# Hyperparameters definition

In [None]:
DATA_PATH = '../input/jane-street-market-prediction/'
NFOLDS = 5
GPU_FLAG = torch.cuda.is_available()
TRAIN = False
CACHE_PATH = '../input/mlp012003weights'

# Define methods to be used in the training process

## Methods for saving the model

>Saving the model is common practice in competitions. Some models take a very long time to train. If you do not want a power cut or an accidentally closed page to send you back to the start overnight, save the model regularly. A saved model can also be fine-tuned later, or loaded directly for model fusion, which saves the time of retraining.

1. The *dic* parameter of *save_pickle* is the object to save, here the model weights. Because only the weights are stored, the model must be instantiated before they are loaded. *save_path* is the relative path to save to.
2. *with open(save_path, 'wb') as f* opens the file at *save_path* in binary write mode and binds it to *f*. The *with* block closes the file automatically, so no explicit close statement is needed after saving.
3. The *pickle* module implements binary serialization and deserialization of Python object structures. *pickle.dump* writes the incoming weights *dic* to *f*.
4. The weights are loaded by deserializing the file contents with *pickle.load* and returning the result.

In [None]:
def save_pickle(dic, save_path):
    with open(save_path, 'wb') as f:
    # with gzip.open(save_path, 'wb') as f:
        pickle.dump(dic, f)

def load_pickle(load_path):
    with open(load_path, 'rb') as f:
    # with gzip.open(load_path, 'rb') as f:
        message_dict = pickle.load(f)
    return message_dict
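
To illustrate the points above, here is a minimal round trip through the two helpers. The dictionary is a hypothetical stand-in for real model weights, and the temporary directory is only there to keep the sketch self-contained.

```python
import os
import pickle
import tempfile

def save_pickle(dic, save_path):
    with open(save_path, 'wb') as f:
        pickle.dump(dic, f)

def load_pickle(load_path):
    with open(load_path, 'rb') as f:
        return pickle.load(f)

# Hypothetical stand-in for a dictionary of model weights
weights = {'dense1.weight': [[0.1, 0.2], [0.3, 0.4]], 'dense1.bias': [0.0, 0.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'weights.pkl')
    save_pickle(weights, path)      # serialize to disk
    restored = load_pickle(path)    # deserialize from disk

print(restored == weights)
```

The restored object compares equal to the original, which is all that is needed to reload weights into a freshly created model.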

## Set all seeds at once
>During model development it is useful to get reproducible results from run to run, so that you can tell whether a change in performance comes from a change in the model or the dataset, or is simply the result of different random sampling. To make the training process reproducible, define a *seed_everything* function that sets every random seed involved:

1. **random.seed( )** -> Set the seed of Python's *random* module.
2. **os.environ['PYTHONHASHSEED']** -> Set the hash seed. It only affects Python processes started afterwards.
3. **np.random.seed( )** -> Set the *numpy* seed.
4. **torch.manual_seed( )** -> Set the seed for random numbers generated on the CPU.
5. **torch.cuda.manual_seed( )** -> Set the seed for random numbers generated on the current GPU.
6. **torch.cuda.manual_seed_all( )** -> Set the seed on all GPUs when several are used.
7. **torch.backends.cudnn.deterministic** -> When this flag is True, cuDNN uses deterministic algorithms. Otherwise the same network structure and hyperparameters may end up with different weights.

In [None]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # also seed every GPU (step 6 above)
    torch.backends.cudnn.deterministic = True
seed_everything(seed=42)
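
A quick way to see what seeding buys you is to draw random numbers twice with the same seed. This sketch uses a reduced helper, *seed_basic* (a hypothetical name), that covers only *random* and *numpy*. The torch calls are left out to keep it light.

```python
import os
import random
import numpy as np

def seed_basic(seed=42):
    # Reduced variant of seed_everything: Python and NumPy only
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

seed_basic(42)
first = (random.random(), np.random.rand(3))

seed_basic(42)
second = (random.random(), np.random.rand(3))

same_py = first[0] == second[0]
same_np = np.array_equal(first[1], second[1])
print(same_py, same_np)
```

Both draws match after reseeding, which is what makes two training runs comparable.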

# Dataset loading
>The data must be loaded for training and evaluation. With the 'TRAIN' flag, loading can be skipped during the model fusion phase.

The training set for this competition is 5.77 GB, which is slow to load with pandas' standard *read_csv*. There are two ways to speed it up.

1. Using datatable.

In [None]:
import datatable as dt
train = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()

2. Using Parquet with pandas. Write the DataFrame to a Parquet file once, then read that file in later runs.

In [None]:
import pandas as pd
train.to_parquet('train.parquet')
train = pd.read_parquet('./train.parquet')

# EDA (Exploratory Data Analysis)

# Data pre-processing

1. The training set has 130 features, named feature_ followed by a number from 0 to 129. We first extract these column names with a list comprehension. They are used again at submission time.

In [None]:
feat_cols = [f'feature_{i}' for i in range(130)]
# Equivalently, select the feature columns by name
feat_cols = [f for f in train.columns if 'feature_' in f]

2. EDA shows that the cumulative return curve of the first 85 days is inconsistent with the curve from day 86 to day 500. The data from the first 85 days may therefore hurt the generalization of the model. It may also help: that depends on how similar the test set is to those first 85 days. Public leaderboard submissions score better with the first 85 days removed, which suggests that the public leaderboard data does not resemble them. Still, removing this data from training is controversial. Some competitors believe that data similar to the first 85 days may appear again in the future. If the model is not trained on it, the model may overfit the public leaderboard data and generalize worse on the final test data.

In [None]:
train = train.loc[train.date > 85].reset_index(drop=True)

3. This step builds the labels. We construct a 5-dimensional label from the 5 resp columns given in the training set. EDA can be used to check whether resp and the median of the 5 resp values tend to be greater than 0 together. If they do, predicting the median of the 5 labels can improve the generalization of the model, because several labels being predicted wrong at once is less likely than a single label being predicted wrong.

* Convert the value of resp to {0, 1}
    1. **train['resp'] > 0** -> Get a bool sequence.
    2. **( ).astype('int')** -> Convert the bool sequence to integers. True becomes 1 and False becomes 0.

* *y* is generated with a list comprehension, *np.stack* and a transpose
    1. The list comprehension *[train[c] for c in target_cols]* collects the 5 columns of *target_cols* into a list of 5 sequences, each of length n.
    2. *np.stack* joins them along a new axis (axis 0 by default) to get an array of shape (5, n). See https://numpy.org/doc/stable/reference/generated/numpy.stack.html.
    3. After transposing, we get an array of shape (n, 5).

In [None]:
target_cols = ['action', 'action_1', 'action_2', 'action_3', 'action_4']

train['action'] = (train['resp'] > 0).astype('int')
train['action_1'] = (train['resp_1'] > 0).astype('int')
train['action_2'] = (train['resp_2'] > 0).astype('int')
train['action_3'] = (train['resp_3'] > 0).astype('int')
train['action_4'] = (train['resp_4'] > 0).astype('int')

y = np.stack([train[c] for c in target_cols]).T
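
The shapes are easier to follow on a toy example. The sketch below uses a hypothetical 3-row DataFrame with only two resp columns; the names *toy* and *toy_target_cols* are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: 3 rows, 2 return columns
toy = pd.DataFrame({'resp': [0.1, -0.2, 0.3], 'resp_1': [-0.1, -0.3, 0.2]})

# Positive return -> 1, otherwise 0
toy['action'] = (toy['resp'] > 0).astype('int')
toy['action_1'] = (toy['resp_1'] > 0).astype('int')

toy_target_cols = ['action', 'action_1']
stacked = np.stack([toy[c] for c in toy_target_cols])  # shape (2, 3): one row per label
y_toy = stacked.T                                      # shape (3, 2): one row per sample

print(stacked.shape, y_toy.shape)
```

With 5 target columns and n rows the same steps give shapes (5, n) and (n, 5).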

4. Add the features obtained by feature engineering. 

In [None]:
train['cross_41_42_43'] = train['feature_41'] + train['feature_42'] + train['feature_43']
train['cross_1_2'] = train['feature_1'] / (train['feature_2'] + 1e-5)

all_feat_cols = feat_cols + ['cross_41_42_43', 'cross_1_2'] 

5. Load the mean of each feature computed from the training set and use it to fill null values at submission time.

  >Analysis of the features shows that some of them contain null values, which the neural network cannot handle directly. Here the nulls are filled with the feature mean, and the test set is filled with the same means as the training set to ensure consistency.

* There are several options for the data used to calculate the mean
  1. The full training set.
  2. The actual training set plus validation set, with the first 85 days removed.
  3. The training set alone, with the validation data removed.
    
  >If the difference in effect is not significant, any of them will do. Just make sure that the filling at training time is consistent with the filling at prediction time.

  >*np.save* takes a path and an array-like object, so use *.values* to get the array behind the pandas object. See https://numpy.org/doc/stable/reference/generated/numpy.save.html.

In [None]:
if TRAIN:
    # Calculate the mean value of the train_dataset and save it, so that there is no need to load the train_dataset and recalculate it during inference.
    f_mean = train.mean()
    np.save(f'{CACHE_PATH}/f_mean_online.npy', f_mean.values)
else:
    f_mean = np.load(f'{CACHE_PATH}/f_mean_online.npy')
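
The consistency requirement can be sketched on toy data: the means come from the training frame only and are reused unchanged on the test frame. All names here (*toy_train*, *toy_test*, *toy_mean*) are hypothetical. The last lines show one way to apply a saved mean array to a raw NumPy feature matrix, as would be needed once only the *.npy* file is available.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({'feature_0': [1.0, np.nan, 3.0],
                          'feature_1': [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'feature_0': [np.nan, 5.0],
                         'feature_1': [np.nan, np.nan]})

# Means are computed on the training data only (NaNs are skipped)
toy_mean = toy_train.mean()            # feature_0 -> 2.0, feature_1 -> 15.0

train_filled = toy_train.fillna(toy_mean)
test_filled = toy_test.fillna(toy_mean)   # same means as in training

# Array variant: fill NaNs in a raw feature matrix with a mean vector
mean_array = toy_mean.values
x = toy_test.values
x_filled = np.where(np.isnan(x), mean_array, x)

print(test_filled)
```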

# Construct the training and validation sets

Since the official validation set is not labeled (it can be treated as a test set), you need to construct a validation set yourself. Here days 450 to 500 are used for validation. You can choose another period, as long as you are careful not to leak the labels.

>***Personal opinion***: although the data provided by the competition has date as a time index (it is not a training feature), the returns of the individual trades are independent of each other and not related to date. There is therefore no need to preserve the time order of the training data, and the training and validation sets could also be split at random.

In [None]:
train.fillna(f_mean, inplace=True)
valid = train.loc[(train.date >= 450) & (train.date < 500)].reset_index(drop=True)
train = train.loc[train.date < 450].reset_index(drop=True)

# ***Define the network structure of ResNet***
1. Inherit from *nn.Module*
2. Implement *init* and *forward*
  - *init* defines the layers that the model uses
  - *forward* defines how features are turned into label predictions

## *init* defines layers:
1. *super* calls the *init* of the parent class *nn.Module* to do the initialization.
2. *batch_norm* -> *nn.BatchNorm1d* creates a batch normalization layer, which normalizes the data so that the input to each layer keeps a stable distribution during training. Without it, the distribution of the data changes after every layer, which is called Internal Covariate Shift. A constantly changing input distribution in the middle layers makes the model harder to fit. It can also push the outputs towards regions where the gradient of the activation function is small, so that the gradient vanishes. Normalizing the output of each layer counters both effects. Note that the BN layer is placed between the fully connected layer and the activation layer.
3. *dropout* -> *nn.Dropout* creates a dropout layer, which randomly switches off some neurons during training (sets their output to 0) to reduce overfitting.
4. *dense1* -> *nn.Linear* creates a fully connected layer. The first parameter is the input dimension and the second parameter is the output dimension.
5. *LeakyReLU* -> *nn.LeakyReLU* creates an activation layer with the LeakyReLU activation function. The other activations defined in *init* (ReLU, PReLU, RReLU) are not used in *forward*.

## *forward* defines forward propagation:

>Instead of using the classic block structure of ResNet, I just borrow its idea. Each block contains only one hidden layer. The input of the second block is the normalized features concatenated with the output of the first block, and the input of each later layer is the concatenation of the outputs of the two preceding blocks. With the layers defined in *init* stacked in this way, an input of shape n * len(all_feat_cols) (the 130 original features plus the 2 cross features) is transformed layer by layer into an output of shape n * 5.

1. **input layer** -> Contains a *batch_norm* layer and a dropout layer
2. **4 blocks** -> Each contains a fully connected layer, a *batch_norm* layer, an activation layer, and a dropout layer
3. **output layer** -> Contains a fully connected layer

In [None]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.batch_norm0 = nn.BatchNorm1d(len(all_feat_cols))
        self.dropout0 = nn.Dropout(0.2)

        dropout_rate = 0.2
        hidden_size = 256
        self.dense1 = nn.Linear(len(all_feat_cols), hidden_size)
        self.batch_norm1 = nn.BatchNorm1d(hidden_size)
        self.dropout1 = nn.Dropout(dropout_rate)

        self.dense2 = nn.Linear(hidden_size+len(all_feat_cols), hidden_size)
        self.batch_norm2 = nn.BatchNorm1d(hidden_size)
        self.dropout2 = nn.Dropout(dropout_rate)

        self.dense3 = nn.Linear(hidden_size+hidden_size, hidden_size)
        self.batch_norm3 = nn.BatchNorm1d(hidden_size)
        self.dropout3 = nn.Dropout(dropout_rate)

        self.dense4 = nn.Linear(hidden_size+hidden_size, hidden_size)
        self.batch_norm4 = nn.BatchNorm1d(hidden_size)
        self.dropout4 = nn.Dropout(dropout_rate)

        self.dense5 = nn.Linear(hidden_size+hidden_size, len(target_cols))

        self.Relu = nn.ReLU(inplace=True)
        self.PReLU = nn.PReLU()
        self.LeakyReLU = nn.LeakyReLU(negative_slope=0.01, inplace=True)
        # self.GeLU = nn.GELU()
        self.RReLU = nn.RReLU()

    def forward(self, x):
        x = self.batch_norm0(x)
        x = self.dropout0(x)

        x1 = self.dense1(x)
        x1 = self.batch_norm1(x1)
        x1 = self.LeakyReLU(x1)
        x1 = self.dropout1(x1)

        x = torch.cat([x, x1], 1)

        x2 = self.dense2(x)
        x2 = self.batch_norm2(x2)
        x2 = self.LeakyReLU(x2)
        x2 = self.dropout2(x2)

        x = torch.cat([x1, x2], 1)

        x3 = self.dense3(x)
        x3 = self.batch_norm3(x3)
        x3 = self.LeakyReLU(x3)
        x3 = self.dropout3(x3)

        x = torch.cat([x2, x3], 1)

        x4 = self.dense4(x)
        x4 = self.batch_norm4(x4)
        x4 = self.LeakyReLU(x4)
        x4 = self.dropout4(x4)

        x = torch.cat([x3, x4], 1)

        x = self.dense5(x)

        return x
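
The concatenation is what drives the input sizes of *dense2* to *dense5*. A minimal sketch with toy sizes (6 features, hidden size 4, batch of 8; all three numbers are assumptions for illustration) shows how the dimensions line up for the first two blocks, leaving out batch norm, activation and dropout.

```python
import torch
import torch.nn as nn

n_features, hidden = 6, 4            # toy sizes, not the notebook's values

dense1 = nn.Linear(n_features, hidden)
dense2 = nn.Linear(n_features + hidden, hidden)   # input = features + block-1 output

x = torch.randn(8, n_features)       # batch of 8 rows
x1 = dense1(x)                       # (8, 4)
x_cat = torch.cat([x, x1], 1)        # (8, 6 + 4): concatenate along the feature axis
x2 = dense2(x_cat)                   # (8, 4)

print(tuple(x_cat.shape), tuple(x2.shape))
```

If the input size given to *nn.Linear* did not match the concatenated width, the forward pass would fail with a shape error, so it is worth checking this whenever the feature list changes.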

# Define the early-stopping function

>To keep the model from overtraining and overfitting the training set, training is stopped as soon as the evaluation metric on the validation set stops improving or starts to get worse. Here we define our own early-stopping class. You can also use an existing one via *from pytorchtools import EarlyStopping*. The implementation is more or less the same.

Explanation of early-stopping class:
1. Parameters of the *init* function: *patience* is the number of consecutive evaluations without improvement that are tolerated, *mode* states whether the metric should be maximized ("max") or minimized ("min"), and *delta* is a margin for the comparison (the code below does not actually use it).
2. To make instances of a Python class callable, implement the special method *__call__( )*. This is similar to overloading the *( )* operator in a class: the instance can then be called as 'object name()' as if it were a normal function.
3. The logic of early stopping is as follows. After each epoch, the validation metric (AUC is used later) is passed to the early-stopping object. If the metric did not improve, counter = counter + 1. Otherwise the model is saved and the counter is reset to 0. When the counter reaches the tolerance limit (*patience*), the *early_stop* flag is set to True. The training loop can then check this flag to decide whether to stop.
4. *save_checkpoint* saves the model weights. *model.state_dict( )* returns the model weights, and *torch.save* writes them to the path *model_path*. Note that when loading the weights later, you need to create the model first.


In [None]:
class EarlyStopping:
    def __init__(self, patience=7, mode="max", delta=0.001):
        self.patience = patience
        self.counter = 0
        self.mode = mode
        self.best_score = None
        self.early_stop = False
        self.delta = delta
        if self.mode == "min":
            self.val_score = np.inf
        else:
            self.val_score = -np.inf

    def __call__(self, epoch_score, model, model_path):

        if self.mode == "min":
            score = -1.0 * epoch_score
        else:
            score = np.copy(epoch_score)

        if self.best_score is None:
            self.best_score = score
            self.save_checkpoint(epoch_score, model, model_path)
        elif score < self.best_score: #  + self.delta
            self.counter += 1
            # print('EarlyStopping counter: {} out of {}'.format(self.counter, self.patience))
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            # ema.apply_shadow()
            self.save_checkpoint(epoch_score, model, model_path)
            # ema.restore()
            self.counter = 0

    def save_checkpoint(self, epoch_score, model, model_path):
        if epoch_score not in [-np.inf, np.inf, -np.nan, np.nan]:
            # print('Validation score improved ({} --> {}). Saving model!'.format(self.val_score, epoch_score))
            # if not DEBUG:
            torch.save(model.state_dict(), model_path)
        self.val_score = epoch_score
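
The patience logic can be traced on a made-up sequence of validation scores. *PatienceCounter* below is a hypothetical, stripped-down variant of the class above for mode "max": it keeps the counter logic and drops checkpoint saving so that it runs without a model.

```python
class PatienceCounter:
    """Stripped-down patience logic (mode "max", no checkpoint saving)."""

    def __init__(self, patience=3):
        self.patience = patience
        self.counter = 0
        self.best_score = None
        self.early_stop = False

    def __call__(self, score):
        if self.best_score is None or score >= self.best_score:
            # Improvement (or first call): remember the score, reset the counter
            self.best_score = score
            self.counter = 0
        else:
            # No improvement: count it and stop once patience is used up
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True

# Made-up validation AUCs, one per epoch
scores = [0.50, 0.52, 0.51, 0.53, 0.52, 0.52, 0.51, 0.55]

stopper = PatienceCounter(patience=3)
stopped_at = None
for epoch, score in enumerate(scores):
    stopper(score)
    if stopper.early_stop:
        stopped_at = epoch
        break

print(stopped_at, stopper.best_score)
```

Training stops after three consecutive epochs below the best score of 0.53, so the later 0.55 is never reached. That is the trade-off a small *patience* makes.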

# Define the loss function

>Generally you can use the loss functions that come with PyTorch, which provides a rich set such as MSELoss and L1Loss for regression and CrossEntropyLoss for classification. You can also write a custom loss function. It should inherit from *torch.nn.Module*, so that once the *forward* method is defined, the backward pass is derived automatically. The way the built-in loss functions inherit can be seen at https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html.

Our custom loss function incorporates label smoothing (*label_smoothing*). With hard labels, the network is pushed to make the gap between the correct label and the wrong label as large as possible. When there is too little training data to represent all the characteristics of the samples, this leads to overfitting. Label smoothing is a regularization strategy that counters this by adding noise through *soft one-hot* labels: a *one-hot* label is 1 for the true class and 0 for the others, whereas a *soft one-hot* label is slightly less than 1 for the true class and slightly greater than 0 for the others, so that the label is not so absolute. This reduces the weight of the true class in the loss calculation and so suppresses overfitting.

Define a binary cross-entropy loss function with *label_smoothing* -> The custom class inherits from *_WeightedLoss* (class *_WeightedLoss* inherits from *_Loss*, and class *_Loss* inherits from *Module*). It adds the *_smooth* method and redefines the *forward* method.

Explanation of *_smooth* method:
1. A *@staticmethod* takes neither the *self* parameter for the instance nor the *cls* parameter for the class. It can be used like a plain function.
2. *assert* is Python's assertion statement. It is used while debugging and raises an exception if the expression after it is false, which saves wrapping the code in a series of *if* checks.
3. The targets are rescaled as *targets * (1.0 - smoothing) + 0.5 * smoothing*. A target of 1 is scaled down to 1.0 - 0.5 * smoothing, and a target of 0 is scaled up to 0.5 * smoothing.

Explanation of *forward* method:
1. Before calculating the loss, call the *_smooth* method to smooth the labels.
2. Call *torch.nn.functional.binary_cross_entropy_with_logits* to compute the binary cross-entropy loss between inputs and targets.
3. Return the sum or mean of the losses, depending on *reduction*.

In [None]:
##### Model&Data fnc
class SmoothBCEwLogits(_WeightedLoss):
    def __init__(self, weight=None, reduction='mean', smoothing=0.0):
        super().__init__(weight=weight, reduction=reduction)
        self.smoothing = smoothing
        self.weight = weight
        self.reduction = reduction

    @staticmethod
    def _smooth(targets:torch.Tensor, n_labels:int, smoothing=0.0):
        assert 0 <= smoothing < 1
        with torch.no_grad():
            targets = targets * (1.0 - smoothing) + 0.5 * smoothing
        return targets

    def forward(self, inputs, targets):
        targets = SmoothBCEwLogits._smooth(targets, inputs.size(-1),
            self.smoothing)
        # Keep per-element losses here; the reduction is applied below
        loss = F.binary_cross_entropy_with_logits(inputs, targets, self.weight,
                                                  reduction='none')

        if  self.reduction == 'sum':
            loss = loss.sum()
        elif  self.reduction == 'mean':
            loss = loss.mean()

        return loss
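
A small numeric sketch shows the effect of the smoothing formula. It is written in plain NumPy rather than PyTorch, with a hand-written stable form of binary cross-entropy on logits. The helper names *smooth* and *bce_with_logits*, the smoothing value 0.1 and the logits are all assumptions for illustration.

```python
import numpy as np

def smooth(targets, smoothing):
    # Same formula as _smooth: pull hard 0/1 targets towards 0.5
    return targets * (1.0 - smoothing) + 0.5 * smoothing

def bce_with_logits(logits, targets):
    # Numerically stable BCE on logits: max(x, 0) - x * t + log(1 + exp(-|x|))
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))

targets = np.array([0.0, 1.0])
soft = smooth(targets, smoothing=0.1)       # 0 -> 0.05, 1 -> 0.95

# Very confident, correct predictions
logits = np.array([-8.0, 8.0])
hard_loss = bce_with_logits(logits, targets).mean()
soft_loss = bce_with_logits(logits, soft).mean()

print(soft, hard_loss, soft_loss)
```

With hard targets the loss of these extreme logits is almost zero, so the network is rewarded for pushing them even further. With smoothed targets the same logits are penalized, which discourages overconfident outputs.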

# Data loading for PyTorch

The sequence of operations for loading data into the model with PyTorch is as follows:
1. Create a *Dataset* object, in this case of the custom *MarketDataset* type.
2. Create a *DataLoader* object from it.
3. Loop over this *DataLoader* object to feed features and labels into the model for training.

A *Dataset* class needs at least 3 methods:
1. **__init__** -> Initialization. Pass in the data via parameters, or load it inside the method.
2. **__len__** -> Return the total number of items in the dataset.
3. **__getitem__** -> Fetch the item specified by the *idx* parameter, convert it to tensors and return it.

In [None]:
class MarketDataset:
    def __init__(self, df):
        self.features = df[all_feat_cols].values
        self.label = df[target_cols].values.reshape(-1, len(target_cols))

    def __len__(self):
        return len(self.label)

    def __getitem__(self, idx):
        return {
            'features': torch.tensor(self.features[idx], dtype=torch.float),
            'label': torch.tensor(self.label[idx], dtype=torch.float)
        }
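
To see what a *DataLoader* yields for such a dataset, here is a sketch with a hypothetical *ToyDataset* built from random NumPy arrays (10 rows, 4 features, 2 labels). *num_workers=0* is chosen only so that the example runs anywhere.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

class ToyDataset:
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'features': torch.tensor(self.features[idx], dtype=torch.float),
            'label': torch.tensor(self.labels[idx], dtype=torch.float),
        }

features = np.random.rand(10, 4)
labels = np.random.randint(0, 2, size=(10, 2))

loader = DataLoader(ToyDataset(features, labels), batch_size=4, shuffle=False, num_workers=0)

# Each batch is a dict of stacked tensors; the last batch holds the remainder
batch_shapes = [(tuple(b['features'].shape), tuple(b['label'].shape)) for b in loader]
print(batch_shapes)
```

The default collate function stacks the per-item dictionaries key by key, which is why the training loop can index a batch with `data['features']` and `data['label']`.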

# Define the training function

Parameter description:
1. **model** -> The ResNet model defined with PyTorch.
2. **optimizer** -> The optimizer.
3. **scheduler** -> The learning rate scheduler (may be None).
4. **loss_fn** -> The loss function.
5. **dataloader** -> The data loading object.
6. **device** -> The device object.

Training process:
1. **model.train( )** -> Set the model to training mode, in which *batch_normalization* uses batch statistics and *drop_out* is active.
2. **final_loss** -> Accumulates the loss, used to calculate the average loss over one pass through the dataloader.
3. **for loop** -> Take one batch of data at a time from the dataloader.
4. **zero_grad( )** -> Clear the gradients of the previous batch. This step must be performed for each batch, because gradients otherwise accumulate.
5. **to(device)** -> Copy the data to the device (the GPU if one is used).
6. **model(features)** -> Call the *forward* method of the model to get the output of the ResNet.
7. **loss_fn(outputs, label)** -> Use the loss function (the *SmoothBCEwLogits* object) to calculate the loss between the model predictions and the labels. It returns a scalar tensor.
8. **loss.backward( )** -> Backpropagation calculates the gradients.
9. **optimizer.step( )** -> The optimizer updates the model parameters.
10. **scheduler.step( )** -> If a *scheduler* object is passed in, adjust the learning rate.
11. **final_loss += loss.item( )** -> Add up the loss values of all the batches.
12. **final_loss /= len(dataloader)** -> Divide by the number of batches to get the average loss.

In [None]:
def train_fn(model, optimizer, scheduler, loss_fn, dataloader, device):
    model.train()
    final_loss = 0

    for data in dataloader:
        optimizer.zero_grad()
        features = data['features'].to(device)
        label = data['label'].to(device)
        outputs = model(features)
        loss = loss_fn(outputs, label)
        loss.backward()
        optimizer.step()
        if scheduler:
            scheduler.step()

        final_loss += loss.item()

    final_loss /= len(dataloader)

    return final_loss
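
Why *zero_grad( )* has to be called for every batch can be shown with a single parameter. This is a minimal sketch, independent of the model above: it calls *backward* twice without resetting and then once after a reset.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(2.0 * w).sum().backward()
grad_first = w.grad.item()          # d(2w)/dw = 2

(2.0 * w).sum().backward()          # no reset: the new gradient is added to the old one
grad_accumulated = w.grad.item()    # 2 + 2 = 4

w.grad.zero_()                      # reset, the effect optimizer.zero_grad() has on each parameter
(2.0 * w).sum().backward()
grad_after_reset = w.grad.item()    # back to 2

print(grad_first, grad_accumulated, grad_after_reset)
```

Without the reset, each optimizer step would use the sum of the gradients of all batches seen so far instead of the gradient of the current batch.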

# Define the evaluation (inference) function

>The difference between inference and training is that no gradients are calculated and the model parameters are not updated. The parameters are the same as for the training function above, minus the optimizer, scheduler and loss function.

The process of inference:
1. **model.eval()** -> Set the model to evaluation mode, in which dropout is disabled and batch normalization uses its running statistics.
2. **preds** -> Records the prediction results for each validation batch.
3. **for loop** -> As in training, the validation data is fetched batch by batch from the *DataLoader* through a for loop.
4. **to(device)** -> Copy the validation data to the device (the GPU if one is used).
5. **with torch.no_grad()** -> Disable gradient calculation inside the block. This speeds up the computation and saves memory, which is otherwise easily exhausted.
6. **model(features)** -> Predict with the model.
7. **outputs.sigmoid()** -> Apply the sigmoid function to the outputs, because each of the 5 targets is a binary label. Softmax is used for multi-class classification.
8. **.detach()** -> Return a new tensor that is detached from the current computation graph, so no gradient is ever computed for it.
9. **.cpu().numpy()** -> Copy the result from the GPU to the CPU and convert it to a NumPy array.
10. **np.concatenate()** -> Join the prediction results of all the batches and reshape them to *(n, len(target_cols))*.

In [None]:
def inference_fn(model, dataloader, device):
    model.eval()
    preds = []

    for data in dataloader:
        features = data['features'].to(device)

        with torch.no_grad():
            outputs = model(features)

        preds.append(outputs.sigmoid().detach().cpu().numpy())

    preds = np.concatenate(preds).reshape(-1, len(target_cols))

    return preds
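
The choice of sigmoid over softmax in step 7 can be checked numerically. The sketch below applies both to one made-up row of 5 logits, using plain NumPy formulas rather than the PyTorch functions.

```python
import numpy as np

# One row of made-up logits, one per target column
logits = np.array([[2.0, -1.0, 0.5, 0.0, -3.0]])

# Sigmoid: each output is an independent probability for its own binary label
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Softmax: outputs compete with each other and sum to 1 across the row
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(sigmoid.round(3), sigmoid.sum().round(3))
print(softmax.round(3), softmax.sum().round(3))
```

Because several of the 5 actions can be 1 for the same row, the outputs must not be forced to sum to 1, which is why sigmoid fits this multi-label setup.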

# Define the scoring function

>Define the scoring function based on the official evaluation metric.

$p_i = \sum_j(weight_{ij} * resp_{ij} * action_{ij})$

$t = \frac{\sum p_i }{\sqrt{\sum p_i^2}} * \sqrt{\frac{250}{|i|}}$

$u = min(max(t,0),6) \sum p_i$

Explanation of *utility_score_bincount*:
1. The parameters correspond to the date, weight, resp and action columns of the dataset.
2. **np.unique(date)** -> Get the distinct dates. Their count is $|i|$ in the formula.
3. **np.bincount( )** -> Sum *weight * resp * action* per day to obtain the sequence of daily returns $p_i$. An explanation of *bincount* can be found at https://blog.csdn.net/xlinsist/article/details/51346523.
4. **t,u** -> Calculate *t* and the *utility_score* *u* according to the official formula.

In [None]:
def utility_score_bincount(date, weight, resp, action):
    count_i = len(np.unique(date))
    Pi = np.bincount(date, weight * resp * action)
    t = np.sum(Pi) / np.sqrt(np.sum(Pi ** 2)) * np.sqrt(250 / count_i)
    u = np.clip(t, 0, 6) * np.sum(Pi)
    return u
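
A worked example on made-up numbers makes the formula concrete. The function is repeated so that the sketch runs on its own. The four trades, spread over two days, are hypothetical.

```python
import numpy as np

def utility_score_bincount(date, weight, resp, action):
    count_i = len(np.unique(date))
    Pi = np.bincount(date, weight * resp * action)   # summed return per day
    t = np.sum(Pi) / np.sqrt(np.sum(Pi ** 2)) * np.sqrt(250 / count_i)
    u = np.clip(t, 0, 6) * np.sum(Pi)
    return u

# Four hypothetical trades on two days
date = np.array([0, 0, 1, 1])
weight = np.array([1.0, 1.0, 1.0, 1.0])
resp = np.array([0.1, -0.05, 0.2, 0.1])
action = np.array([1, 0, 1, 1])

# Day 0: p = 0.1 (the losing trade is skipped); day 1: p = 0.2 + 0.1 = 0.3
daily = np.bincount(date, weight * resp * action)

# t = 0.4 / sqrt(0.01 + 0.09) * sqrt(250 / 2) is about 14.14, clipped to 6
u = utility_score_bincount(date, weight, resp, action)
print(daily, u)
```

Since *t* is clipped at 6, the score here is 6 times the total return of 0.4. Note that *np.bincount* requires the dates to be non-negative integers.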

# Model training:
1. The TRAIN flag marks whether to train. When TRAIN = False (inference only), this step is skipped.
2. Create *Dataset* objects for the training set *train* and the validation set *valid*.
3. Create a *DataLoader* object for each. *shuffle* controls whether the order is shuffled, and *num_workers* sets the number of worker subprocesses used for loading. The *DataLoader* of the validation set is not shuffled, so that every validation run sees the data in the same order. Otherwise there would be interference when comparing model performance.




In [None]:
if TRAIN:
    train_set = MarketDataset(train)
    train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
    valid_set = MarketDataset(valid)
    valid_loader = DataLoader(valid_set, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)

1. The model is trained NFOLDS times, each time with a different seed, so the *DataLoader* of the training set shuffles the data differently. This yields several models for fusion and better generalization.
2. **torch.cuda.empty_cache( )** -> Release cached GPU memory that is no longer in use.
3. **torch.device('cuda:0')** -> *torch.device* represents the device on which a *torch.Tensor* is allocated. If the GPU is enabled it is *cuda:0*, otherwise *cpu*. The index X of *cuda:X* can be obtained with *torch.cuda.current_device( )*.
4. Create the *model* object of the ResNet and copy it to the device obtained above using *.to(device)*.
5. Instantiate an optimizer, which updates the model parameters. Here the Adam algorithm is used. It is given the parameters of the ResNet model, and the learning rate and weight decay are parameters of the optimizer. For more information about PyTorch optimizers, see https://pytorch-cn.readthedocs.io/zh/latest/package_references/torch-optim/.
6. Optionally instantiate a learning rate scheduler. The *torch.optim.lr_scheduler* module provides methods to adjust the learning rate (*learning_rate*) as training progresses. If the initial learning rate is too small, convergence is slow and training is inefficient, whereas a large rate tends to make training unstable. The *OneCycleLR* scheduler that is commented out in the code first raises the learning rate and then lowers it. This blog post gives a deeper understanding of the module: https://blog.csdn.net/qyhaill/article/details/103043637.
7. Create the loss function object.
8. Create the early-stopping object.
9. *model_weights* is the path where the model weights are saved. The *f"..."* prefix marks an f-string, in which Python expressions inside curly brackets are evaluated and inserted into the string.

In [None]:
    start_time = time.time()
    for _fold in range(NFOLDS):
        print(f'Fold{_fold}:')
        seed_everything(seed=42+_fold)
        torch.cuda.empty_cache()
        device = torch.device("cuda:0" if GPU_FLAG else "cpu")
        model = Model()
        model.to(device)
        # model = nn.DataParallel(model)

        optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
        # optimizer = Nadam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
        # optimizer = Lookahead(optimizer=optimizer, k=10, alpha=0.5)
        scheduler = None
        # scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer=optimizer, pct_start=0.1, div_factor=1e3,
        #                                                 max_lr=1e-2, epochs=EPOCHS, steps_per_epoch=len(train_loader))
        # loss_fn = nn.BCEWithLogitsLoss()
        loss_fn = SmoothBCEwLogits(smoothing=0.005)

        es = EarlyStopping(patience=EARLYSTOP_NUM, mode="max")
        
        model_weights = f"{CACHE_PATH}/online_model{_fold}.pth"

# Training process

>The above processes is in preparation for training, getting the module ready. The we start to train the model. We will train the model EPOCHS times, which is a previously defined hyperparameter.

1. The modules defined above are passed into *train_fn*. The training function was defined earlier, so a single statement is all it takes to train for one epoch. What a surprise!
2. What is the rest of the code for? It evaluates the model on the validation set and stops training early based on that evaluation.
3. First we use *inference_fn* to make predictions on the validation set with the trained model and get the prediction result *valid_pred*.
4. Compute the AUC and log loss of the model on the validation set.
5. The model predicts 5 resp targets. To calculate the *utility_score*, we need to reduce these 5 values to a single action. We take the median of the 5 predictions with *np.median*, threshold it at 0.5, and pass the result to *utility_score_bincount*.
6. Pass the validation AUC, the model and the save path to the early-stopping module. If the early-stopping flag *early_stop* is True, the early-stopping condition is met and the training loop stops.
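
As a small illustration of step 5, the sketch below reduces 5 predicted probabilities per row to a single 0/1 action. The numbers in *valid_pred* are invented for the example; only the *np.median* and *np.where* calls correspond to the training code.

```python
import numpy as np

# Hypothetical sigmoid outputs: 3 validation rows x 5 resp targets
valid_pred = np.array([
    [0.62, 0.55, 0.48, 0.70, 0.51],
    [0.30, 0.45, 0.52, 0.40, 0.49],
    [0.50, 0.50, 0.10, 0.90, 0.50],
])

# Collapse the 5 targets into one probability per row
row_prob = np.median(valid_pred, axis=1)

# Threshold at 0.5 to obtain the action passed to the utility score
action = np.where(row_prob >= 0.5, 1, 0).astype(int)

print(row_prob)  # one probability per row
print(action)    # one 0/1 action per row
```

The median is less sensitive to a single extreme target than the mean, as the third row shows: one very low and one very high value do not move it away from 0.5.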

In [None]:
        for epoch in range(EPOCHS):
            train_loss = train_fn(model, optimizer, scheduler, loss_fn, train_loader, device)

            valid_pred = inference_fn(model, valid_loader, device)
            valid_auc = roc_auc_score(valid[target_cols].values, valid_pred)
            valid_logloss = log_loss(valid[target_cols].values, valid_pred)
            valid_pred = np.median(valid_pred, axis=1)
            valid_pred = np.where(valid_pred >= 0.5, 1, 0).astype(int)
            valid_u_score = utility_score_bincount(date=valid.date.values, weight=valid.weight.values,
                                                   resp=valid.resp.values, action=valid_pred)
            print(f"FOLD{_fold} EPOCH:{epoch:3} train_loss={train_loss:.5f} "
                      f"valid_u_score={valid_u_score:.5f} valid_auc={valid_auc:.5f} "
                      f"time: {(time.time() - start_time) / 60:.2f}min")
            es(valid_auc, model, model_path=model_weights)
            if es.early_stop:
                print("Early stopping")
                break
        torch.save(model.state_dict(), model_weights)
        

# ResNet model loading

>The model is trained and can be loaded directly if we want to use it later. We can use the following code to load the model during model fusion.

Since we saved only the weights of the model, we need to create the model object first. The number of folds used in training is the number of models that need to be loaded.

1. Get the weights of the model with *torch.load(path)*.
2. Use *model.load_state_dict( )* to put the weights into the model.
3. Switch the model to *eval* mode.
4. Append it to *model_list*; later we can directly use the models in *model_list* to make predictions.

In [None]:
model_list = []
for _fold in range(NFOLDS):
    torch.cuda.empty_cache()
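    # Fall back to the CPU when no GPU is available
    device = torch.device("cuda:0" if GPU_FLAG else "cpu")
    model = Model()
    model.to(device)
    model_weights = f"{CACHE_PATH}/online_model{_fold}.pth"
    # map_location lets GPU-saved weights load on whichever device is in use
    model.load_state_dict(torch.load(model_weights, map_location=device))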

    model.eval()
    model_list.append(model)

# ResNet Summary
## Review the main steps
>At this point we have completed the whole process of building -> training -> early stopping and saving -> loading for the PyTorch version of the ResNet model. Let us recall the main steps.

1. Prepare the modules: ResNet model, loss function, early-stopping module, training process function, evaluation (inference) process function, scoring function, etc.
2. Building blocks for model training.

    Outermost loop: train NFOLDS models:
        Prepare the modules for each model: ResNet model, loss function, optimizer, early-stopping module, learning rate adjustment module.

    Innermost loop: each model is trained for up to EPOCHS epochs. In each epoch:
        1) Pass the modules to the training process function to train the model.
        2) Give the trained model and the validation set to the evaluation (inference) process function and get the evaluation result.
        3) Give the model parameters and the evaluation result to the early-stopping module, which saves the model and updates the early-stopping flag according to the evaluation.
        4) Decide whether to stop training based on the early-stopping flag.
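
The control flow of the two loops can be sketched in plain Python. This is a simplified stand-in, not the *EarlyStopping* class used above: the function *run_fold* and the per-epoch scores are hypothetical, and the list of scores replaces the real train and validate steps.

```python
def run_fold(val_scores, patience=3):
    """Walk through per-epoch validation scores with max-mode early stopping.

    Returns (best_epoch, best_score, epochs_run).
    """
    best_score, best_epoch, bad_epochs = float("-inf"), -1, 0
    epochs_run = 0
    for epoch, score in enumerate(val_scores):
        epochs_run += 1
        if score > best_score:
            # A real loop would save the model weights here
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_score, epochs_run


# Invented validation AUCs for two folds
folds = {
    0: [0.50, 0.52, 0.53, 0.529, 0.528, 0.527, 0.60],
    1: [0.51, 0.52, 0.53],
}
results = {fold: run_fold(scores, patience=3) for fold, scores in folds.items()}
for fold, result in results.items():
    print(fold, result)
```

In fold 0 training stops after three epochs without improvement, so the later score of 0.60 is never reached. That is the trade-off a *patience* value controls.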
           
## Review of the main modules
1. **ResNet model** -> The structure of the neural network. The goal of training is to get the weights of the network structure.
2. **Loss function** -> Measures the difference between the model's predictions and the labels.
3. **Optimizer** -> Updates the model weights from the computed gradients according to its update rule.
4. **Early-stopping module** -> Stops training once the validation score stops improving, which helps prevent over-fitting and saves training time.
5. **Training process function** -> Feed the training data to the model by batch. Calculate the loss and update the weights until all batches are completed.
       
## Review of the main hyperparameters

The tuning of the model is done mainly for the following hyperparameters:
1. **layer_size** -> Number of layers in the ResNet structure.
2. **hidden_size** -> Number of neurons per hidden layer.
3. **dropout_rate** -> Proportion of neurons randomly dropped during training.
4. **NFOLDS** -> Number of ResNet models trained.
5. **EPOCHS** -> Maximum number of training epochs per model.
6. **learning_rate** -> Learning rate.
7. **label_smoothing** -> Label smoothing factor.
8. **patience** -> Number of epochs without improvement tolerated before early stopping.
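
To show what the *label_smoothing* factor does to binary targets, here is a minimal numpy sketch. It assumes the common formulation that pulls hard labels toward 0.5, which is what Keras' *BinaryCrossentropy(label_smoothing=...)* applies; the *SmoothBCEwLogits* class defined earlier may differ in detail, and *smooth_labels* is a name chosen for this example.

```python
import numpy as np


def smooth_labels(targets, smoothing):
    """Pull hard 0/1 targets toward 0.5 by the given smoothing factor."""
    targets = np.asarray(targets, dtype=float)
    return targets * (1.0 - smoothing) + 0.5 * smoothing


smoothed = smooth_labels([0, 1], smoothing=0.005)
print(smoothed)  # 0 moves slightly up, 1 moves slightly down
```

With a factor of 0.005 the targets move by only 0.0025, which is enough to stop the loss from rewarding ever more extreme logits.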

# TensorFlow model

>The part above uses PyTorch to implement a ResNet-style model. The following uses TensorFlow to implement a simple neural network (simpleNN). Somewhat unexpectedly, this simple network performs well on this dataset.

## Import the required libraries

In [None]:
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, Concatenate, Lambda, GaussianNoise, Activation
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers.experimental.preprocessing import Normalization
import tensorflow as tf
import tensorflow_addons as tfa

import numpy as np
import pandas as pd
from tqdm import tqdm
from random import choices

## Define hyperparameters

>The role of these hyperparameters is similar to that in the PyTorch model.

In [None]:
SEED = 1111  # must be defined before it is used to seed numpy
np.random.seed(SEED)

epochs = 205
batch_size = 4096
hidden_units = [160, 160, 160]
dropout_rates = [0.2, 0.2, 0.2, 0.2]
label_smoothing = 1e-2
learning_rate = 1e-2

## SimpleNN model

>We build a simple 3-layer neural network using Keras, a high-level neural network API written in Python that runs on top of TensorFlow. The core data structure of Keras is the model, a way of organizing network layers. The simplest model is the *Sequential* model, which is a linear stack of layers. For more complex structures, use the Keras functional API, which allows the construction of arbitrary graphs of layers.

Build models using the Keras functional API:
1. Create an input layer with *tf.keras.layers.Input*. Its *shape* parameter is the number of features in the training set.
2. Add a batch normalization layer and a dropout layer on top of the input layer, using the *BatchNormalization* and *Dropout* APIs provided by *tf.keras.layers*.
3. *hidden_units* is a list with the number of neurons in each hidden layer. The hyperparameter section defines 3 layers with 160 neurons each.
4. Each layer is structured like a sandwich: fully connected layer + BatchNorm layer + activation layer + dropout layer.
5. The activation function is swish, proposed by Google in October 2017: $f(x) = x \cdot \text{sigmoid}(\beta x)$. Swish is unbounded above, bounded below, smooth and non-monotonic. *tf.keras.activations.swish* uses $\beta = 1$. Other activation functions such as ReLU can also be used.
6. **Output layer** -> a fully connected layer with 5 outputs (the number of labels). Each label is a binary classification problem (action is 1 or 0), so we apply the sigmoid activation function to the output of the fully connected layer.
7. Pass the defined input *inp* and output *out* objects to *tf.keras.models.Model* to complete the creation of the model. We can use *keras.utils.plot_model(model, "model_with_shape.png", show_shapes=True)* to generate a picture of the model structure.
8. Use *model.compile* to configure the training process of the model. We need to specify the optimizer, loss function, and evaluation metrics; see: https://keras.io/zh/models/model/#compile.
9. The optimizer is RectifiedAdam, which claims to adjust the learning rate automatically and dynamically, removing the need for the warm-up phase that Adam often requires. It aims to keep convergence fast while helping the model avoid getting trapped in a local optimum.
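
The properties of swish listed in step 5 can be checked numerically. The sketch below is a plain numpy version written for illustration, with $\beta = 1$ as the default; it is not the TensorFlow implementation.

```python
import numpy as np


def swish(x, beta=1.0):
    """swish(x) = x * sigmoid(beta * x), written as x / (1 + exp(-beta * x))."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-beta * x))


xs = np.linspace(-10.0, 10.0, 2001)
ys = swish(xs)

print("minimum value:", ys.min(), "at x =", xs[ys.argmin()])
print("swish(-10) =", ys[0], " swish(10) =", ys[-1])
```

The curve dips to a small negative minimum near x = -1.28 and then rises back toward 0 for very negative inputs, which is why swish is bounded below and non-monotonic, while for large positive inputs it behaves almost like the identity.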

In [None]:
# Build and compile the model
def create_mlp(
    num_columns, num_labels, hidden_units, dropout_rates, label_smoothing, learning_rate
):

    inp = tf.keras.layers.Input(shape=(num_columns,))
    x = tf.keras.layers.BatchNormalization()(inp)
    x = tf.keras.layers.Dropout(dropout_rates[0])(x)
    for i in range(len(hidden_units)):
        x = tf.keras.layers.Dense(hidden_units[i])(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation(tf.keras.activations.swish)(x)
        x = tf.keras.layers.Dropout(dropout_rates[i + 1])(x)
    
    x = tf.keras.layers.Dense(num_labels)(x)
    out = tf.keras.layers.Activation("sigmoid")(x)

    model = tf.keras.models.Model(inputs=inp, outputs=out)
    model.compile(
        optimizer=tfa.optimizers.RectifiedAdam(learning_rate=learning_rate),
        loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing),
        metrics=tf.keras.metrics.AUC(name="AUC"),
    )

    return model

# TensorFlow training model
>The *create_mlp* function encapsulates the model building and training configuration. We can call this method to get a configured TensorFlow model, using the *fit* method to complete the training.

1. If we train with k-fold cross-validation, we should call *clear_session* before building each fold's model to reset the global Keras state. Otherwise state left over from an earlier fold can carry over into the next one, and memory usage tends to grow.
2. The *fit* method takes the training features and labels. *epochs* is the number of passes over the training set, *batch_size* is the number of samples per batch, *verbose* controls how much information about the training process is printed, and *validation_data* is the data used for validation.
3. There are two ways to save models in TensorFlow. The *save* method saves the model structure and weights together. The *save_weights* method saves only the model weights. The advantage of saving the full model is that it can be loaded directly without instantiating a model object first; the advantage of saving only the weights is that the file is smaller.
4. When doing model fusion, load the trained model with the method that matches how it was saved. For a fully saved model, use the *tensorflow.keras.models.load_model* method. For saved weights, instantiate a model object first and then load the weights with *load_weights*.

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(SEED)
clf = create_mlp(
    len(feat_cols), 5, hidden_units, dropout_rates, label_smoothing, learning_rate
    )
history = clf.fit(X_train,y_train
                  , epochs=epochs
                  , batch_size=batch_size
                  , verbose=2
                  , validation_data=(X_valid,y_valid)
                 )
                
clf.save_weights('model.h5')
clf.load_weights('../input/jane-street-with-keras-nn-overfit/model.h5')

tf_models = [clf]

# XGBoost model

## Loading the treelite type XGBoost model

>After the XGBoost model is trained and converted to treelite type, the treelite type model can be loaded and used when doing model fusion.

>treelite is a deployment acceleration tool for tree models. It compiles and optimizes a tree model into a standalone shared library that is easy to deploy. After optimization it can increase the prediction speed of an XGBoost model by 2-6 times. Project address: https://treelite.readthedocs.io/.

In [None]:
treelite_model = treelite_runtime.Predictor('../input/model-tree/model_xgb.so', verbose=True)

## Steps to convert an XGBoost model to treelite type
1. Load the tree model into treelite with the *treelite.Model.load* method.
2. Package the generated source code into a zip archive with the *export_srcpkg* method, to be compiled on the target machine.
3. Alternatively, build the shared library directly with *export_lib* and get a *.so* file. Later we can use *treelite_runtime* to load this standalone library and use it for prediction.

In [None]:
import treelite
model = treelite.Model.load('my_model.model', model_format='xgboost')

# Produce a zipped source directory, containing all model information
# Run `make` on the target machine
model.export_srcpkg(platform='unix', toolchain='gcc',
                    pkgpath='./mymodel.zip', libname='mymodel.so',
                    verbose=True)

# Like export_srcpkg, but generates a shared library immediately
# Use this only when the host and target machines are compatible
model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True)

## Steps to load and use treelite:
1. Load the *.so* file with *treelite_runtime.Predictor*.
2. Use *treelite_runtime.Batch.from_npy2d* to convert the data into treelite's own batch data structure.
3. Predict using the *predict* method of the *Predictor* object to get the probability value.

In [None]:
import treelite_runtime
predictor = treelite_runtime.Predictor('./mymodel.so', verbose=True)
batch = treelite_runtime.Batch.from_npy2d(X)
out_pred = predictor.predict(batch)

# Submit test results
## Strategies for model fusion

The models to be used for fusion are now loaded, so we can start fusing them. The strategy used is a weighted average.
1. Each model has a weight, and the weights sum to 1. If the three models had equal weights, this would reduce to a plain average.
2. The results obtained from all three models are probability values (between 0 and 1), which are multiplied by the weights of the corresponding models to obtain the weighted predicted probabilities.
3. The weighted predictions of the three models are added up and compared with *th* (the threshold that determines whether action is 1). Sums smaller than *th* are marked as 0; all others are marked as 1.
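
The three steps can be sketched with numpy as follows. The weights and the threshold are the ones used in the submission code further down; the per-model probabilities and the helper name *fuse* are invented for the example.

```python
import numpy as np

weights = {"resnet": 0.46, "simple_nn": 0.51, "xgboost": 0.03}
th = 0.493


def fuse(preds, weights, th):
    """Weighted average of per-model probabilities, then a 0/1 decision."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("model weights should sum to 1")
    blended = sum(weights[name] * np.asarray(p, dtype=float)
                  for name, p in preds.items())
    action = np.where(blended >= th, 1, 0).astype(int)
    return blended, action


# Hypothetical probabilities from each model for two rows
preds = {
    "resnet": [0.52, 0.40],
    "simple_nn": [0.48, 0.55],
    "xgboost": [0.90, 0.10],
}
blended, action = fuse(preds, weights, th)
print(blended, action)
```

With a weight of only 0.03, the XGBoost prediction barely moves the blended value even when it is very confident.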

## Features of this competition submission
According to the official submission instructions given:
>You must submit to this competition using the provided python time-series API, which ensures that models do not peek forward in time. To use the API, follow the following template in Kaggle Notebooks:

 ```
 import janestreet
 env = janestreet.make_env() # initialize the environment
 iter_test = env.iter_test() # an iterator which loops over the test set

 for (test_df, sample_prediction_df) in iter_test:
    sample_prediction_df.action = 0 #make your 0/1 prediction here
    env.predict(sample_prediction_df)
 ```

According to the official instructions, we need to use the *janestreet* module to create a runtime environment *env* and get an iterator *iter_test* from it. Each iteration yields one piece of transaction data, *test_df*, and a frame for the result, *sample_prediction_df*. After feeding *test_df* to the model, we set *sample_prediction_df.action* to 0 or 1. Finally we call the official *predict* method to submit the prediction for this transaction.
 
## Steps for model fusion and predictions:
The ResNet model, simpleNN model and XGBoost model are used to make predictions for *test_df* respectively. Then a probability is calculated according to the weighted average strategy. The value of *action* is obtained after comparing with *th* (judgment threshold).

### XGBoost model prediction
1. The XGBoost model was trained with -999 as the substitute for null values, so no *fillna* operation is required when feeding data to the XGBoost model.
2. The XGBoost model has been converted to treelite type, so the data needs to be converted to the format required by treelite.
3. Complete the prediction with the *predict* method. As already mentioned, treelite's *predict* returns a probability value.
 
### ResNet model prediction
1. **Data pre-processing** -> We use the previously calculated and saved *f_mean* to fill in the null values of the *test_df*.
2. **Feature engineering** -> The ResNet model was trained on engineered features, so we need to apply the same feature engineering to the test data and append the results as extra columns with *np.concatenate* (2 more feature columns).
3. NFOLDS ResNet models were trained. The five ResNet models are fused with the average strategy to obtain the ResNet prediction.
4. The final prediction is the median of the 5 targets, taken with *np.median*.
5. The prediction uses GPU acceleration; note the *to(device)* call. We need to copy the result back to the CPU before doing the model fusion.
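
The null filling and feature appending from steps 1 and 2 rely on a compact numpy idiom that is worth seeing on a tiny array. The values of *x_tt* and *f_mean* below are made up, and the cross feature is a simplified stand-in for the ones built in the submission code.

```python
import numpy as np


def fill_with_mean(x, f_mean):
    """Replace each NaN in x with the matching per-column value from f_mean."""
    # nan_to_num turns NaN into 0; the isnan mask then adds f_mean only there
    return np.nan_to_num(x) + np.isnan(x) * f_mean


x_tt = np.array([[1.0, np.nan, 3.0]])     # one test row with a missing value
f_mean = np.array([10.0, 20.0, 30.0])     # per-feature means saved at training time

filled = fill_with_mean(x_tt, f_mean)

# Append an engineered column to the end of the feature matrix
cross = (filled[:, 0] + filled[:, 2]).reshape(-1, 1)
feature_inp = np.concatenate((filled, cross), axis=1)
print(feature_inp)
```

Only the missing entry is replaced by its column mean; observed values are left untouched, and the engineered column is computed after filling so it never contains NaN.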
 
### TensorFlow model prediction
Since the TensorFlow model can also be trained with cross-validation, the code allows for multiple TensorFlow models, which are fused with the average strategy.

1. The simpleNN implemented in TensorFlow was trained without feature engineering, so prediction uses the original test data with null values filled.
2. The inner list comprehension collects the predictions of every model; *np.mean* averages them across models, and *np.median* takes the median of the 5 labels as the result.
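
A small numpy sketch of step 2, with two hypothetical models and invented outputs for a single test row:

```python
import numpy as np

# Hypothetical outputs of two models for one row, each of shape (1, 5)
model_outputs = [
    np.array([[0.40, 0.50, 0.60, 0.70, 0.80]]),
    np.array([[0.50, 0.60, 0.50, 0.90, 0.40]]),
]

# Average across models (axis 0 is the model axis), keeping shape (1, 5)
mean_over_models = np.mean(model_outputs, axis=0)

# Median over the 5 labels; without an axis argument np.median flattens
# its input, which is fine here because there is only one row
tf_pred = np.median(mean_over_models)
print(mean_over_models, tf_pred)
```

If several rows were predicted at once, *np.median* would need *axis=1* to return one value per row instead of a single scalar.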
 
### Integration
The predictions of the three models are scalars or 1-dimensional *np.array* values. Each is multiplied by its model's weight, and the results are summed to obtain the fused prediction.

### Minor optimization
According to the competition description, when a transaction's weight is 0, predicting action as 1 instead of 0 does not contribute to the *utility_score* (and may even be a drag on it). In that case we do not need to run the models on the transaction at all: we directly mark *action* as 0, which reduces computation and speeds up prediction.
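
The sketch below illustrates why zero-weight rows can be skipped. It assumes the utility definition from the competition's evaluation page: a daily value $p_i = \sum_j weight_{ij} \cdot resp_{ij} \cdot action_{ij}$, then $t = \frac{\sum p_i}{\sqrt{\sum p_i^2}} \sqrt{250/|i|}$ and $u = \min(\max(t, 0), 6) \sum p_i$. The function name *utility_score_sketch* and all data values are hypothetical; it is not the notebook's *utility_score_bincount*.

```python
import numpy as np


def utility_score_sketch(date, weight, resp, action):
    """Utility under the assumed competition formula (integer dates >= 0)."""
    n_days = len(np.unique(date))
    # Daily sum of weight * resp * action
    p = np.bincount(date, weights=weight * resp * action)
    t = p.sum() / np.sqrt((p ** 2).sum()) * np.sqrt(250 / n_days)
    return min(max(t, 0), 6) * p.sum()


date = np.array([0, 0, 1, 1])
weight = np.array([1.0, 0.0, 2.0, 0.5])
resp = np.array([0.01, 0.05, -0.02, 0.03])

action_skip = np.array([1, 0, 0, 1])   # zero-weight row marked 0
action_take = np.array([1, 1, 0, 1])   # same, but zero-weight row marked 1

score_skip = utility_score_sketch(date, weight, resp, action_skip)
score_take = utility_score_sketch(date, weight, resp, action_take)
print(score_skip, score_take)
```

Both calls return the same score, because a row with weight 0 adds nothing to any daily sum regardless of its action.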

In [None]:
if not TRAIN:
    import janestreet
    env = janestreet.make_env()
    env_iter = env.iter_test()

    for (test_df, pred_df) in tqdm(env_iter):
        if test_df['weight'].item() > 0:
            x_tt = test_df.loc[:, feat_cols].values
            batch = treelite_runtime.Batch.from_npy2d(x_tt)    
            tree_pred = treelite_model.predict(batch)
            if np.isnan(x_tt.sum()):
                x_tt = np.nan_to_num(x_tt) + np.isnan(x_tt) * f_mean

            cross_41_42_43 = x_tt[:, 41] + x_tt[:, 42] + x_tt[:, 43]
            cross_1_2 = x_tt[:, 1] / (x_tt[:, 2] + 1e-5)
            feature_inp = np.concatenate((
                x_tt,
                np.array(cross_41_42_43).reshape(x_tt.shape[0], 1),
                np.array(cross_1_2).reshape(x_tt.shape[0], 1),
            ), axis=1)

            # torch_pred
            torch_pred = np.zeros((1, len(target_cols)))
            for model in model_list:
                torch_pred += model(torch.tensor(feature_inp, dtype=torch.float).to(device)).sigmoid().detach().cpu().numpy() / NFOLDS
            torch_pred = np.median(torch_pred)
            
            # tf_pred
            tf_pred = np.median(np.mean([model(x_tt, training = False).numpy() for model in tf_models],axis=0))
            
            # avg
            pred = torch_pred*0.46 + tf_pred*0.51 + tree_pred*0.03
            
            pred_df.action = np.where(pred >= 0.493, 1, 0).astype(int)
        else:
            pred_df.action = 0
        env.predict(pred_df)

In [None]:
print('done')