# Modelling Rogue Wave Data with ElasticNet and Random Forest Classification Models

In [1]:
%load_ext autoreload
%autoreload 2

## Setup
### Imports

Import all required packages. The seed and the number of cores to use are set under Parameter Settings.

In [2]:
import os
import pandas as pd
import numpy as np
import sys
import pickle

sys.path.append('./')
import utils

from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

import shap
from fgclustering import FgClustering

import warnings
warnings.filterwarnings("ignore")

### Parameter Settings

In [3]:
print(os.cpu_count()) # how many CPU cores are available on the current machine

256


In [4]:
seed = 42
n_jobs = 15

In [5]:
undersample = True
num_cv = 10
cases = [1,2,3]

## Building an ElasticNet Classification Model

### Instantiating the Model and Setting Hyperparameters

- `C`: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
- `l1_ratio`: The elastic-net mixing parameter, with `0 <= l1_ratio <= 1`. Only used if `penalty='elasticnet'`. Setting `l1_ratio=0` is equivalent to using `penalty='l2'`, while setting `l1_ratio=1` is equivalent to using `penalty='l1'`. For `0 < l1_ratio < 1`, the penalty is a combination of L1 and L2.
- `class_weight`: Weights associated with classes. If not given, all classes are assumed to have weight one. The `"balanced"` mode uses the values of `y` to automatically adjust weights inversely proportional to class frequencies in the input data, as `n_samples / (n_classes * np.bincount(y))`.
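To make the role of `l1_ratio` concrete, the following sketch evaluates the elastic-net penalty term for a fixed weight vector. It assumes the form given in the scikit-learn user guide, `(1 - l1_ratio) / 2 * ||w||_2^2 + l1_ratio * ||w||_1`; the helper `elastic_net_penalty` and the example weights are hypothetical and not part of this notebook.

```python
import numpy as np

def elastic_net_penalty(w, l1_ratio):
    """Elastic-net penalty of weight vector w (assumed scikit-learn form)."""
    w = np.asarray(w, dtype=float)
    l1_term = np.abs(w).sum()       # ||w||_1
    l2_term = 0.5 * np.dot(w, w)    # 1/2 * ||w||_2^2
    return l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term

w = [1.0, -2.0, 0.5]                      # illustrative weights only
pure_l2 = elastic_net_penalty(w, 0.0)     # same as penalty='l2'
pure_l1 = elastic_net_penalty(w, 1.0)     # same as penalty='l1'
mixed = elastic_net_penalty(w, 0.5)       # equal mix of both terms
print(pure_l2, pure_l1, mixed)
```

In `LogisticRegression`, `C` scales the data-fit term relative to this penalty, so a smaller `C` gives the penalty more influence.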

In [6]:
# Define the classifier
classifier = LogisticRegression(solver='saga', penalty = 'elasticnet', random_state=seed, n_jobs=n_jobs)

In [7]:
hyperparameter_grid = {
    'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 20.0, 30.0, 4.0, 5.0, 10.0, 100.0],
    'l1_ratio': [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1],
    'class_weight': ['balanced', None],
}

### Train and Evaluate the Model

For hyperparameter tuning, we use k-fold cross-validation with a stratified splitter, which preserves the class proportions in every training and validation fold.

For evaluation, we use the confusion matrix, the macro F1 score, and the balanced accuracy to account for potential class imbalances.

We iterate over each binarization case:

- case 1: class 0: target < 2.0 and class 1: target > 2.0 
- case 2: class 0: target < 1.5 and class 1: target > 2.0
- case 3: class 0: target < 1.5, class 1: 1.5 < target < 2.0 and class 2: target > 2.0
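The three cases above can be sketched as a small labelling function. This is only an illustration: the real labelling happens in `data_preprocessing.ipynb`, the helper name `discretize_target` is hypothetical, and how values exactly equal to 1.5 or 2.0 are handled is an assumption, since the list above does not specify it.

```python
import numpy as np

def discretize_target(target, case):
    """Return (keep_mask, labels) for one of the three cases.

    Hypothetical helper. Assumption: a value of exactly 2.0 is not
    class "target > 2.0", and exactly 1.5 falls into the middle band.
    """
    target = np.asarray(target, dtype=float)
    keep = np.ones(target.shape, dtype=bool)
    if case == 1:
        labels = (target > 2.0).astype(int)
    elif case == 2:
        # Samples in the band between 1.5 and 2.0 are dropped entirely.
        keep = (target < 1.5) | (target > 2.0)
        labels = (target > 2.0).astype(int)
    elif case == 3:
        labels = np.where(target > 2.0, 2, np.where(target >= 1.5, 1, 0))
    else:
        raise ValueError(f"unknown case: {case}")
    return keep, labels[keep]

example_target = [1.0, 1.7, 2.3]  # illustrative values only
for case in (1, 2, 3):
    keep, labels = discretize_target(example_target, case)
    print(case, keep.tolist(), labels.tolist())
```

Case 2 is the only one that discards samples, which is why its test set in the output below is smaller than that of case 1.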

The data was preprocessed in `data_preprocessing.ipynb`.

In [8]:
for case in cases:

    print(f'Building model for case {case}')

    # Load and unpack the data
    with open(f'../data/data_case{case}{"_undersampled" if undersample else ""}.pickle', 'rb') as handle:
        data = pickle.load(handle)
    
    X_train, X_test, y_train, y_test = data

    scaler = StandardScaler()
    X_train_transformed = scaler.fit_transform(X_train)
    X_test_transformed = scaler.transform(X_test)

    # Tune hyperparameters
    skf = StratifiedKFold(n_splits=num_cv)  # GridSearchCV calls split() itself

    gridsearch_classifier = GridSearchCV(classifier, hyperparameter_grid, cv=skf)
    gridsearch_classifier.fit(X_train_transformed, y_train)

    # Check the results
    print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
    print(gridsearch_classifier.best_params_)

    # Take the best estimator
    model = gridsearch_classifier.best_estimator_
    
    # Predict labels
    y_pred = model.predict(X_test_transformed)

    print(f"Balanced acc: {balanced_accuracy_score(y_test, y_pred)}")
    print(f"Macro F1 score: {f1_score(y_test, y_pred, average='macro')}")
    print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")

    # Save the model
    data_and_model = [X_train_transformed, X_test_transformed, y_train, y_test, model]
    
    with open(f'../models/model_elnet_case{case}{"_undersampled" if undersample else ""}.pickle', 'wb') as handle:
        pickle.dump(data_and_model, handle, protocol=pickle.HIGHEST_PROTOCOL)

Building model for case 1
The mean cross-validated score of the best model is 59.94% accuracy and the parameters of best prediction model are:
{'C': 0.1, 'class_weight': 'balanced', 'l1_ratio': 0}
Balanced acc: 0.48613079320527613
Macro F1 score: 0.2887679178671357
Confusion matrix:
[[ 77919 128195]
 [  1447   2119]]
Building model for case 2
The mean cross-validated score of the best model is 61.54% accuracy and the parameters of best prediction model are:
{'C': 0.1, 'class_weight': 'balanced', 'l1_ratio': 0}
Balanced acc: 0.5490943238308509
Macro F1 score: 0.37021833618012623
Confusion matrix:
[[41069 42659]
 [ 1399  2167]]
Building model for case 3
The mean cross-validated score of the best model is 44.6% accuracy and the parameters of best prediction model are:
{'C': 0.1, 'class_weight': 'balanced', 'l1_ratio': 0.4}
Balanced acc: 0.3557478102475981
Macro F1 score: 0.2747768302403449
Confusion matrix:
[[31859 22280 29588]
 [43902 35415 43070]
 [ 1078  1071  1417]]
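Balanced accuracy and macro F1 can both be recomputed from a confusion matrix, which helps when reading the output above. The sketch below does this in plain Python for the case 1 matrix, assuming the scikit-learn convention that rows are true classes and columns are predicted classes; the helper names are hypothetical.

```python
def per_class_recall(cm):
    """Recall of each class: diagonal entry divided by its row sum."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def per_class_f1(cm):
    """F1 of each class, computed as 2*TP / (2*TP + FP + FN)."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # rest of the true-class row
        fp = sum(cm[r][i] for r in range(n)) - tp   # rest of the predicted column
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores

# Confusion matrix printed above for case 1
cm_case1 = [[77919, 128195],
            [1447, 2119]]

balanced_acc = sum(per_class_recall(cm_case1)) / len(cm_case1)
macro_f1 = sum(per_class_f1(cm_case1)) / len(cm_case1)
print(f"Balanced acc: {balanced_acc:.4f}, macro F1: {macro_f1:.4f}")
```

Because balanced accuracy is the unweighted mean of the per-class recalls, a value close to 0.5 in a two-class problem is what a classifier without any skill would achieve.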


## Building a Random Forest Classification Model

### Instantiating the Model and Setting Hyperparameters

- `n_estimators`: The number of trees in the forest.
- `max_depth`: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- `max_samples`: If bootstrap is True, the number of samples to draw from X to train each base estimator.
- `criterion`: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain.
- `max_features`: The number of features to consider when looking for the best split.
- `class_weight`: Weights associated with classes.
  - The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
  - The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.
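The "balanced" formula quoted above is easy to check by hand. The following sketch applies it to a small made-up label vector with an 8:2 class ratio; the labels are illustrative and not taken from the rogue wave data.

```python
import numpy as np

# Toy labels: 8 samples of class 0, 2 samples of class 1
y_toy = np.array([0] * 8 + [1] * 2)

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))

# n_samples / (n_classes * np.bincount(y)), as in the description above
balanced_weights = n_samples / (n_classes * np.bincount(y_toy))
print(balanced_weights)  # 0.625 for class 0, 2.5 for class 1
```

With these weights, both classes contribute the same total weight (8 * 0.625 = 2 * 2.5 = 5), so the minority class is no longer outvoted by sheer count.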

In [8]:
# Define the classifier. We set the oob_score = True, as OOB is a good approximation of the validation set score
classifier = RandomForestClassifier(oob_score=True, random_state=seed)
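As a minimal illustration of the out-of-bag (OOB) score: each tree is fitted on a bootstrap sample, and the samples it did not see are used to evaluate it. The sketch below uses synthetic data from `make_classification`, not the rogue wave data, and smaller settings than the grid in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: purely for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

# Accuracy estimated only from samples each tree did not train on
oob_accuracy = forest.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

After a grid search, the same attribute should be available on `best_estimator_`, since the refitted forest keeps `oob_score=True`.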

In [9]:
hyperparameter_grid = {
    'n_estimators': [1000],
    'max_depth': [5, 10, 20, 30, 50],
    'max_samples': [0.5, 0.8, 0.95],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2'],
    'class_weight': ['balanced', 'balanced_subsample'],
}
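To get a feeling for the cost of this search, the sketch below counts the parameter combinations in the grid and multiplies by the number of cross-validation folds (`num_cv = 10`, as set in the parameter settings). It only counts model fits and makes no claim about run time.

```python
from itertools import product

rf_grid = {
    'n_estimators': [1000],
    'max_depth': [5, 10, 20, 30, 50],
    'max_samples': [0.5, 0.8, 0.95],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2'],
    'class_weight': ['balanced', 'balanced_subsample'],
}
n_folds = 10  # mirrors num_cv

# Every combination of one value per hyperparameter is one candidate
n_candidates = len(list(product(*rf_grid.values())))
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Each of these fits grows 1000 trees, and with the default `refit=True` `GridSearchCV` fits the best candidate once more on the full training set, which is why the search is parallelised with `n_jobs`.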

### Train and Evaluate the Model

For hyperparameter tuning, we use k-fold cross-validation with a stratified splitter, which preserves the class proportions in every training and validation fold.

For evaluation, we use the confusion matrix, the macro F1 score, and the balanced accuracy to account for potential class imbalances.
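The effect of the stratified splitter can be seen on a small imbalanced toy problem. The sketch below uses made-up labels with a 90:10 ratio and checks how many minority samples end up in each validation fold; none of the names or sizes come from the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 90 samples of class 0, 10 samples of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split

splitter = StratifiedKFold(n_splits=5)
minority_per_fold = [
    int(y_toy[val_idx].sum()) for _, val_idx in splitter.split(X_toy, y_toy)
]
print(minority_per_fold)  # minority samples in each validation fold
```

Each of the five validation folds receives two of the ten minority samples, so no fold is left without examples of the rare class.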

We iterate over each binarization case:

- case 1: class 0: target < 2.0 and class 1: target > 2.0 
- case 2: class 0: target < 1.5 and class 1: target > 2.0
- case 3: class 0: target < 1.5, class 1: 1.5 < target < 2.0 and class 2: target > 2.0

The data was preprocessed in `data_preprocessing.ipynb`.

In [None]:
for case in cases:

    print(f'Building model for case {case}')

    # Load and unpack the data
    with open(f'../data/data_case{case}{"_undersampled" if undersample else ""}.pickle', 'rb') as handle:
        data = pickle.load(handle)
    
    X_train, X_test, y_train, y_test = data

    # Tune hyperparameters
    skf = StratifiedKFold(n_splits=num_cv)  # GridSearchCV calls split() itself
    
    gridsearch_classifier = GridSearchCV(classifier, hyperparameter_grid, cv=skf, verbose=0, n_jobs=n_jobs)
    gridsearch_classifier.fit(X_train, y_train)

    # Check the results
    print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
    print(gridsearch_classifier.best_params_)

    # Take the best estimator
    model = gridsearch_classifier.best_estimator_
    
    # Predict labels
    y_pred = model.predict(X_test)

    print(f"Balanced acc: {balanced_accuracy_score(y_test, y_pred)}")
    print(f"Macro F1 score: {f1_score(y_test, y_pred, average='macro')}")
    print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")

    # Save the model
    data_and_model = [X_train, X_test, y_train, y_test, model]
    
    with open(f'../models/model_randomforest_case{case}{"_undersampled" if undersample else ""}.pickle', 'wb') as handle:
        pickle.dump(data_and_model, handle, protocol=pickle.HIGHEST_PROTOCOL)


Building model for case 1
