# Modelling Rogue Wave Data with ElasticNet and Random Forest Classification Models

In [1]:
%load_ext autoreload
%autoreload 2

## Setup
### Imports

Import all required packages, then define the random seed and the number of CPU cores to use.

In [2]:
import os
import sys
import pickle

sys.path.append('./')
import utils

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

import warnings
warnings.filterwarnings("ignore")

### Parameter Settings

In [3]:
seed = 42
n_jobs = 4
print(f"Using {n_jobs} cores from {os.cpu_count()} available cores.") # how many CPU cores are available on the current machine

Using 4 cores from 8 available cores.


In [4]:
undersample = True
num_cv = 10
case = 2

## Building an ElasticNet Classification Model

### Instantiating the Model and Setting Hyperparameters

- `C`: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
- `l1_ratio`: The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Only used if `penalty='elasticnet'`. Setting `l1_ratio=0` is equivalent to using `penalty='l2'`, while setting `l1_ratio=1` is equivalent to using `penalty='l1'`. For `0 < l1_ratio < 1`, the penalty is a combination of L1 and L2.
- `class_weight`: Weights associated with classes. If not given, all classes are assumed to have weight one. The “balanced” mode uses the values of `y` to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`.
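To make the “balanced” formula concrete, here is a minimal sketch on a small illustrative label vector (the labels are made up for the example and are not the rogue wave data). It evaluates the formula by hand and compares it with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels (assumption): 9 samples of class 0 and 1 sample of class 1.
y = np.array([0] * 9 + [1])

# Formula quoted above: n_samples / (n_classes * np.bincount(y)).
n_samples = len(y)
n_classes = len(np.unique(y))
manual_weights = n_samples / (n_classes * np.bincount(y))

# scikit-learn's helper applies the same heuristic for class_weight="balanced".
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y), y=y
)

print("manual :", manual_weights)
print("sklearn:", sklearn_weights)
```

When both classes are equally frequent, as in the undersampled training sets used in this notebook (14264 samples per class), the formula gives a weight of 1 for each class, so `'balanced'` and `None` weight the full training set identically.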

In [5]:
hyperparameter_grid = {'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 20.0, 30., 4.0, 5.0, 10.0, 100.0], 
            'l1_ratio': [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1],
            'class_weight': ['balanced', None] 
}

In [6]:
def elnet_model(seed, X_train, X_test, y_train_cat, y_test_cat):
    classifier = LogisticRegression(solver='saga', penalty = 'elasticnet', random_state=seed, n_jobs=n_jobs)

    scaler = StandardScaler()
    X_train_transformed = scaler.fit_transform(X_train)
    X_test_transformed = scaler.transform(X_test)

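    # Tune hyperparameters with stratified k-fold cross-validation.
    # No `scoring` is passed, so GridSearchCV uses the classifier's default scorer (plain accuracy).
    skf = StratifiedKFold(n_splits=num_cv).split(X_train_transformed, y_train_cat)

    gridsearch_classifier = GridSearchCV(classifier, hyperparameter_grid, cv=skf)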
    gridsearch_classifier.fit(X_train_transformed, y_train_cat)

    # Check the results
    print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
    print(gridsearch_classifier.best_params_)

    # Take the best estimator
    model = gridsearch_classifier.best_estimator_

    # Predict training labels
    y_pred = model.predict(X_train_transformed)
    y_true = y_train_cat

    print(f"Balanced acc: {balanced_accuracy_score(y_true, y_pred)}")
    print(f"Macro F1 score: {f1_score(y_true, y_pred, average='macro')}")
    print(f"Confusion matrix:\n{confusion_matrix(y_true, y_pred)}")

    # Predict test labels
    y_pred = model.predict(X_test_transformed)
    y_true = y_test_cat

    print(f"Balanced acc: {balanced_accuracy_score(y_true, y_pred)}")
    print(f"Macro F1 score: {f1_score(y_true, y_pred, average='macro')}")
    print(f"Confusion matrix:\n{confusion_matrix(y_true, y_pred)}")
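The function above fits the `StandardScaler` on the full training set before cross-validation, so each validation fold contributes to the scaling statistics. One possible variation, which is not what this notebook does, is to wrap the scaler and the classifier in a `Pipeline` so that scaling is refit inside every fold. The sketch below uses synthetic data from `make_classification`, a deliberately small grid, and `scoring="balanced_accuracy"`; all three are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the rogue wave dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaler and classifier are tuned together, so the scaler only ever
# sees the training part of each cross-validation split.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", penalty="elasticnet",
                               l1_ratio=0.5, max_iter=5000, random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
small_grid = {"clf__C": [0.1, 1.0], "clf__l1_ratio": [0.0, 0.5, 1.0]}

search = GridSearchCV(pipeline, small_grid,
                      cv=StratifiedKFold(n_splits=5),
                      scoring="balanced_accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"Best mean CV balanced accuracy: {search.best_score_:.3f}")
```

The fitted `search.best_estimator_` is itself a pipeline, so calling `predict` on raw test features applies the training-set scaling automatically.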

### Train and Evaluate the Model

For hyperparameter tuning, we use k-fold cross-validation with a stratified splitter, which keeps the class proportions of the training data approximately the same in every training and validation fold.
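As a small illustration of what stratification does, the sketch below splits a made-up, imbalanced label vector (an assumption for the example) with `StratifiedKFold` and reports the share of class 1 in every validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative imbalanced labels (assumption): 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=5)
fold_positive_rates = []
for train_idx, val_idx in skf.split(X, y):
    rate = y[val_idx].mean()  # share of class 1 in this validation fold
    fold_positive_rates.append(rate)
    print(f"validation size={len(val_idx)}, class-1 share={rate:.2f}")
```

Every validation fold keeps the overall 20% share of class 1, which a plain, unshuffled `KFold` would not guarantee for labels sorted like these.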

For evaluation, we use the confusion matrix, the macro F1 score, and balanced accuracy to account for potential class imbalance.
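All three metrics can be read off the confusion matrix. As a sketch, the snippet below recomputes balanced accuracy and macro F1 with NumPy from the test-set confusion matrix that this notebook prints for the ElasticNet model trained on randomly undersampled data (rows are true classes, columns are predicted classes). Plain accuracy is included only for comparison.

```python
import numpy as np

# Test-set confusion matrix reported in this notebook for the ElasticNet model
# with random undersampling (rows: true class, columns: predicted class).
cm = np.array([[58526, 25202],
               [1035, 2531]])

correct = np.diag(cm).astype(float)      # correctly classified samples per class
recall = correct / cm.sum(axis=1)        # per-class recall
precision = correct / cm.sum(axis=0)     # per-class precision
f1 = 2 * precision * recall / (precision + recall)

balanced_acc = recall.mean()             # balanced accuracy = mean per-class recall
macro_f1 = f1.mean()                     # macro F1 = unweighted mean of per-class F1
accuracy = correct.sum() / cm.sum()      # plain accuracy, for comparison

print(f"Balanced acc: {balanced_acc:.4f}")
print(f"Macro F1 score: {macro_f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")
```

Balanced accuracy averages the two recalls, while macro F1 also depends on precision. On this imbalanced test set the precision for class 1 is low, which is why macro F1 sits well below balanced accuracy.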

We load the case 2 data that was preprocessed in `data_preprocessing.ipynb`.  

Case 2: 
- class 0: target < 1.5
- class 1: target > 2.0

In [7]:
undersample_method = "random"

print(f'Building model for case {case}' + f'{f" with {undersample_method} undersampled data" if undersample else ""}.')
data_train, data_test, y_train, y_train_cat, X_train, y_test, y_test_cat, X_test = utils.load_data(case, undersample, undersample_method)
elnet_model(seed, X_train, X_test, y_train_cat, y_test_cat)

Building model for case 2 with random undersampled data.

Training dataset target distribution:
Counter({0: 14264, 1: 14264})

Test dataset target distribution:
Counter({0: 83728, 1: 3566})
The mean cross-validated score of the best model is 70.29% accuracy and the parameters of best prediction model are:
{'C': 0.1, 'class_weight': 'balanced', 'l1_ratio': 1}
Balanced acc: 0.7024326977005048
Macro F1 score: 0.7024266131169831
Confusion matrix:
[[ 9955  4309]
 [ 4180 10084]]
Balanced acc: 0.7043801810933012
Macro F1 score: 0.48931246728229605
Confusion matrix:
[[58526 25202]
 [ 1035  2531]]


In [8]:
undersample_method = "nearmiss"

print(f'Building model for case {case}' + f'{f" with {undersample_method} undersampled data" if undersample else ""}.')
data_train, data_test, y_train, y_train_cat, X_train, y_test, y_test_cat, X_test = utils.load_data(case, undersample, undersample_method)
elnet_model(seed, X_train, X_test, y_train_cat, y_test_cat)

Building model for case 2 with nearmiss undersampled data.

Training dataset target distribution:
Counter({0: 14264, 1: 14264})

Test dataset target distribution:
Counter({0: 83728, 1: 3566})
The mean cross-validated score of the best model is 60.9% accuracy and the parameters of best prediction model are:
{'C': 0.1, 'class_weight': None, 'l1_ratio': 0}
Balanced acc: 0.6198822209758834
Macro F1 score: 0.6198506448894372
Confusion matrix:
[[8972 5292]
 [5552 8712]]
Balanced acc: 0.5512648507213862
Macro F1 score: 0.36924542544745725
Confusion matrix:
[[40822 42906]
 [ 1373  2193]]


## Building a Random Forest Classification Model

### Instantiating the Model and Setting Hyperparameters

- `n_estimators`: The number of trees in the forest.
- `max_depth`: The maximum depth of each tree. If `None`, nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples.
- `max_samples`: If `bootstrap` is `True`, the number of samples to draw from `X` to train each base estimator. A float is interpreted as a fraction of the training samples.
- `criterion`: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain.
- `max_features`: The number of features to consider when looking for the best split.
- `class_weight`: Weights associated with classes.
  - The “balanced” mode uses the values of `y` to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`.
  - The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.
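The following minimal sketch shows how these hyperparameters are passed to `RandomForestClassifier` and how the out-of-bag (OOB) score can be read after fitting. It uses synthetic data from `make_classification` and only 100 trees to stay fast; both choices are assumptions for illustration, not the settings of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), not the rogue wave dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,                   # fewer trees than the notebook's 1000, for speed
    max_depth=10,
    max_samples=0.8,                    # each tree is trained on a bootstrap sample of 80% of the rows
    criterion="gini",
    max_features="sqrt",
    class_weight="balanced_subsample",
    oob_score=True,                     # score each sample using only trees that did not see it
    random_state=0,
)
forest.fit(X, y)

print(f"Trees grown: {len(forest.estimators_)}")
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

Because `bootstrap` defaults to `True`, every tree leaves some rows out of its sample, and `oob_score_` aggregates predictions on exactly those left-out rows.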

In [9]:
hyperparameter_grid = {'n_estimators': [1000], 
            'max_depth': [5, 10, 20, 30, 50], 
            'max_samples': [0.5, 0.8, 0.95],
            'criterion': ['gini', 'entropy'],
            'max_features': ['sqrt','log2'],
            'class_weight': ['balanced', 'balanced_subsample'] 
}
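To get a feeling for the cost of the search, the sketch below counts the candidate combinations in the grid above with `ParameterGrid` and multiplies by the number of cross-validation folds (`num_cv = 10` in this notebook). The grid is copied here so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the random forest grid defined above.
hyperparameter_grid = {'n_estimators': [1000],
            'max_depth': [5, 10, 20, 30, 50],
            'max_samples': [0.5, 0.8, 0.95],
            'criterion': ['gini', 'entropy'],
            'max_features': ['sqrt', 'log2'],
            'class_weight': ['balanced', 'balanced_subsample']
}
num_cv = 10

n_candidates = len(ParameterGrid(hyperparameter_grid))  # all combinations of the listed values
n_fits = n_candidates * num_cv                          # one fit per candidate and fold

print(f"{n_candidates} candidates x {num_cv} folds = {n_fits} fits")
```

Each of these fits grows 1000 trees, and `GridSearchCV` refits the best candidate once more on the full training set afterwards, so the search is computationally heavy.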

In [10]:
def rf_model(seed, X_train, X_test, y_train_cat, y_test_cat):
    # Define the classifier. We set the oob_score = True, as OOB is a good approximation of the validation set score
    classifier = RandomForestClassifier(oob_score=True, random_state=seed)

    # Tune hyperparameters
    skf = StratifiedKFold(n_splits=num_cv).split(X_train, y_train_cat)
    
    gridsearch_classifier = GridSearchCV(classifier, hyperparameter_grid, cv=skf, verbose=0, n_jobs=n_jobs)
    gridsearch_classifier.fit(X_train, y_train_cat)

    # Check the results
    print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
    print(gridsearch_classifier.best_params_)

    # Take the best estimator
    model = gridsearch_classifier.best_estimator_

    # Predict training labels
    y_pred = model.predict(X_train)
    y_true = y_train_cat

    print(f"Balanced acc: {balanced_accuracy_score(y_true, y_pred)}")
    print(f"Macro F1 score: {f1_score(y_true, y_pred, average='macro')}")
    print(f"Confusion matrix:\n{confusion_matrix(y_true, y_pred)}")

    # Predict test labels
    y_pred = model.predict(X_test)
    y_true = y_test_cat

    print(f"Balanced acc: {balanced_accuracy_score(y_true, y_pred)}")
    print(f"Macro F1 score: {f1_score(y_true, y_pred, average='macro')}")
    print(f"Confusion matrix:\n{confusion_matrix(y_true, y_pred)}")

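    # Note: data_train, data_test, case, undersample and undersample_method are not arguments;
    # they are read from the notebook's global scope and must be set before calling rf_model.
    data_and_model = [data_train, data_test, model]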

    with open(f'../models/class_model_randomforest_case{case}{f"_{undersample_method}_undersampled" if undersample else ""}.pickle', 'wb') as handle:
        pickle.dump(data_and_model, handle, protocol=pickle.HIGHEST_PROTOCOL)
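The pickle file stores a list of three objects in the order `[data_train, data_test, model]`, so it can be unpacked in the same order when read back. The sketch below demonstrates the round trip with placeholder objects and a temporary file; the placeholder values and the file name `example_model.pickle` are hypothetical.

```python
import os
import pickle
import tempfile

# Placeholders (assumption) standing in for data_train, data_test and the fitted model.
data_and_model = [{"split": "train"}, {"split": "test"}, "fitted-model-placeholder"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pickle")  # hypothetical file name

    # Write the list exactly as rf_model does.
    with open(path, "wb") as handle:
        pickle.dump(data_and_model, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back and unpack in the same order it was stored.
    with open(path, "rb") as handle:
        data_train, data_test, model = pickle.load(handle)

print(data_train, data_test, model)
```

Unpickling a real scikit-learn model generally requires a compatible scikit-learn version to be installed in the loading environment.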

### Train and Evaluate the Model

For hyperparameter tuning, we use k-fold cross-validation with a stratified splitter, which keeps the class proportions of the training data approximately the same in every training and validation fold.

For evaluation, we use the confusion matrix, the macro F1 score, and balanced accuracy to account for potential class imbalance.

We load the case 2 data that was preprocessed in `data_preprocessing.ipynb`.  

Case 2: 
- class 0: target < 1.5
- class 1: target > 2.0

In [11]:
undersample_method = "random"

print(f'Building model for case {case}' + f'{f" with {undersample_method} undersampled data" if undersample else ""}.')
data_train, data_test, y_train, y_train_cat, X_train, y_test, y_test_cat, X_test = utils.load_data(case, undersample, undersample_method)
rf_model(seed, X_train, X_test, y_train_cat, y_test_cat)

Building model for case 2 with random undersampled data.

Training dataset target distribution:
Counter({0: 14264, 1: 14264})

Test dataset target distribution:
Counter({0: 83728, 1: 3566})
The mean cross-validated score of the best model is 99.62% accuracy and the parameters of best prediction model are:
{'class_weight': 'balanced_subsample', 'criterion': 'gini', 'max_depth': 50, 'max_features': 'sqrt', 'max_samples': 0.95, 'n_estimators': 1000}
Balanced acc: 1.0
Macro F1 score: 1.0
Confusion matrix:
[[14264     0]
 [    0 14264]]
Balanced acc: 0.9971070693994141
Macro F1 score: 0.9711103113974497
Confusion matrix:
[[83314   414]
 [    3  3563]]


In [12]:
undersample_method = "nearmiss"

print(f'Building model for case {case}' + f'{f" with {undersample_method} undersampled data" if undersample else ""}.')
data_train, data_test, y_train, y_train_cat, X_train, y_test, y_test_cat, X_test = utils.load_data(case, undersample, undersample_method)
rf_model(seed, X_train, X_test, y_train_cat, y_test_cat)

Building model for case 2 with nearmiss undersampled data.

Training dataset target distribution:
Counter({0: 14264, 1: 14264})

Test dataset target distribution:
Counter({0: 83728, 1: 3566})
The mean cross-validated score of the best model is 96.93% accuracy and the parameters of best prediction model are:
{'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 30, 'max_features': 'sqrt', 'max_samples': 0.8, 'n_estimators': 1000}
Balanced acc: 1.0
Macro F1 score: 1.0
Confusion matrix:
[[14264     0]
 [    0 14264]]
Balanced acc: 0.8111308354569383
Macro F1 score: 0.4770770780264054
Confusion matrix:
[[52359 31369]
 [   11  3555]]
