# Modelling Rogue Wave Data with an Elastic Net Classification Model

In [1]:
%load_ext autoreload
%autoreload 2

## Setup
### Imports

Import all required packages, then define the random seed and the number of CPU cores to use.

In [2]:
import os
import pandas as pd
import sys
import pickle

sys.path.append('./')
import utils

from collections import Counter
from imblearn.under_sampling import NearMiss
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score


In [3]:
print(os.cpu_count())  # number of CPU cores available on this machine

8


In [4]:
seed = 42
n_jobs = 5

### Model Configuration

Select which case to use:

- case 1: class 0: target < 2.0 and class 1: target > 2.0 
- case 2: class 0: target < 1.5 and class 1: target > 2.0
- case 3: class 0: target < 1.5, class 1: 1.5 < target < 2.0 and class 2: target > 2.0

Also choose whether the training data should be undersampled, and set the number of cross-validation folds.

In [5]:
case = 2
undersample = True

num_cv = 10
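The three cases can be sketched as a small labelling function. This is only an illustration: `label_for_case` is a hypothetical helper, not the code from `data_preprocessing.ipynb`. The sketch assumes that values falling outside the listed ranges (the gap between 1.5 and 2.0 in case 2, and values exactly on a threshold) get no label.

```python
def label_for_case(target, case):
    """Return the class label for a target value, or None if the value
    falls outside the ranges listed for the chosen case.

    Hypothetical helper for illustration only; boundary values are left
    unlabelled because the case definitions use strict inequalities.
    """
    if case == 1:
        if target < 2.0:
            return 0
        if target > 2.0:
            return 1
    elif case == 2:
        if target < 1.5:
            return 0
        if target > 2.0:
            return 1
    elif case == 3:
        if target < 1.5:
            return 0
        if 1.5 < target < 2.0:
            return 1
        if target > 2.0:
            return 2
    else:
        raise ValueError(f"unknown case: {case}")
    return None


# The same target value is treated differently depending on the case
for c in (1, 2, 3):
    print(c, label_for_case(1.7, c))
```

Under these assumptions, a target of 1.7 is class 0 in case 1, unlabelled in case 2, and class 1 in case 3.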

## Loading Rogue Wave Data

Loading the data that was preprocessed in `data_preprocessing.ipynb`.

In [None]:
# Load and unpack the data
with open(f'./data/data_case{case}.pickle', 'rb') as handle:
    data = pickle.load(handle)

X_train = data[0]
X_test = data[1]
y_train = data[2]
y_test = data[3]

To tackle the high class imbalance, we undersample the larger class with the NearMiss undersampler. NearMiss is an undersampling algorithm that balances a dataset by keeping only a subset of the majority-class samples. Unlike random undersampling, it selects these samples by distance: version 1, which is used here, keeps the majority-class samples whose average distance to their `n_neighbors` closest minority-class samples is smallest. With `sampling_strategy='auto'`, the majority class is reduced to the size of the minority class.

For an explanation of the `version` and `n_neighbors` arguments, see https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145
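To make the `version=1` selection rule concrete, here is a minimal pure-Python sketch of the idea on toy 2-D points: for each majority-class point, average the distances to its closest minority-class points, then keep the majority points with the smallest averages. `near_miss_1` and the toy data are made up for illustration; the notebook itself relies on `imblearn.under_sampling.NearMiss`, whose implementation differs in detail.

```python
import math


def near_miss_1(majority, minority, n_neighbors=3):
    """Keep len(minority) majority points with the smallest mean distance
    to their n_neighbors closest minority points (NearMiss-1 idea)."""
    scored = []
    for point in majority:
        # Distances from this majority point to every minority point, ascending
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:n_neighbors]
        scored.append((sum(nearest) / len(nearest), point))
    # Smallest average distance first
    scored.sort(key=lambda pair: pair[0])
    return [point for _, point in scored[:len(minority)]]


# Toy data: six majority points, two minority points
majority = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0), (9.0, 9.0), (10.0, 10.0)]
minority = [(5.5, 5.5), (6.5, 6.5)]

kept = near_miss_1(majority, minority, n_neighbors=2)
print(kept)
```

In this toy example the two majority points closest to the minority cluster are kept, so the retained majority samples concentrate near the class boundary rather than being a random draw.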

In [None]:
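if undersample:
    # Keep a reference to the original training data for the comparison plot
    X_train_original = X_train

    # NearMiss-1: reduce the majority class to the size of the minority class
    nm = NearMiss(version=1, sampling_strategy='auto', n_neighbors=5)
    X_train, y_train = nm.fit_resample(X_train, y_train)

    print('Resampled dataset shape:')
    print(Counter(y_train))

    # Compare the continuous target distribution before and after undersampling
    utils.plot_distributions_target(
        pd.DataFrame({"target_original": X_train_original["AI_10min"]}),
        pd.DataFrame({"target_undersampled": X_train["AI_10min"]}),
    )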

After undersampling, drop the continuous target variable from the dataset as we only use the binarized version for classification.

In [None]:
X_train = X_train.drop(columns=['AI_10min'])
X_test = X_test.drop(columns=['AI_10min'])

Scale the data with the standard scaler: regularized linear models penalize all coefficients in the same way and therefore benefit from features on a common scale. Fit the scaler on the training set only, then apply the fitted scaler to the test set to avoid data leakage between the train and test sets.

In [7]:
scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train)
X_test_transformed = scaler.transform(X_test)
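The fit-on-train, apply-to-test pattern can be illustrated without scikit-learn. The sketch below standardizes a single toy feature by hand; `fit_standardizer` and `standardize` are hypothetical helpers that mimic what `StandardScaler` does per column (mean and population standard deviation), and the numbers are made up.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn mean and population standard deviation from training values only."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # constant column: fall back to 1 to avoid dividing by zero
    return mean, std


def standardize(column, mean, std):
    """Apply previously learned statistics to any column."""
    return [(value - mean) / std for value in column]


train = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy training feature
test = [3.0, 10.0]                 # toy test feature

# Statistics come from the training data only ...
mean, std = fit_standardizer(train)

# ... and are reused unchanged for the test data
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
print(train_scaled)
print(test_scaled)
```

The test values are not centred or scaled to unit variance themselves. That is expected, because no information from the test set is allowed to influence the transformation.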

## Building an Elastic Net Classification Model

### Setting Hyperparameters

- `C`: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
- `l1_ratio`: The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Only used if `penalty='elasticnet'`. Setting `l1_ratio=0` is equivalent to using `penalty='l2'`, while setting `l1_ratio=1` is equivalent to using `penalty='l1'`. For `0 < l1_ratio < 1`, the penalty is a combination of L1 and L2.
- `class_weight`: Weights associated with classes. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
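The "balanced" formula quoted above, `n_samples / (n_classes * np.bincount(y))`, can be checked by hand. The sketch below re-implements it in plain Python on made-up label lists; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(y):
    """Compute n_samples / (n_classes * count) for each class label."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


imbalanced = [0] * 90 + [1] * 10    # toy labels with a 9:1 imbalance
undersampled = [0] * 10 + [1] * 10  # toy labels after balancing

print(balanced_class_weights(imbalanced))
print(balanced_class_weights(undersampled))
```

On the imbalanced toy labels the minority class receives a weight of 5.0 and the majority class about 0.56. On balanced labels both weights are 1.0, which suggests that `'balanced'` and `None` behave the same once the training data has been undersampled to equal class sizes.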

In [8]:
hyperparameter_grid = {'C': [0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 5.0, 10.0, 100.0], 
            'l1_ratio': [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1],
            'class_weight': ['balanced', None] 
}

### Train the Model

Running a regularized logistic regression model with the parameters:

- `penalty`: 'elasticnet' -> both L1 and L2 penalty terms are added
- `solver`: algorithm to use in the optimization problem -> for the 'elasticnet' penalty, only the 'saga' solver is supported
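For reference, the objective that `C` and `l1_ratio` control can be written out. The block below follows the binary form given in the scikit-learn user guide, with labels $y_i \in \{-1, +1\}$, weights $w$, intercept $b$, and $\rho$ standing for `l1_ratio`. The symbols are notation chosen here, and sample and class weights are left out for brevity.

```latex
\min_{w,\, b} \;\;
\frac{1 - \rho}{2}\, \lVert w \rVert_2^2
\;+\; \rho\, \lVert w \rVert_1
\;+\; C \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -y_i \left( x_i^{\top} w + b \right) \right) \right)
```

Because `C` multiplies the data-fit term rather than the penalty, smaller values of `C` correspond to stronger regularization. Setting $\rho = 0$ leaves only the L2 term, and $\rho = 1$ leaves only the L1 term.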

In [None]:
# Define a classifier
classifier = LogisticRegression(solver='saga', penalty='elasticnet', random_state=seed, n_jobs=n_jobs)

In [10]:
# Tune hyperparameters with stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=num_cv)

# The default scoring for a classifier is accuracy
gridsearch_classifier = GridSearchCV(classifier, hyperparameter_grid, cv=skf)
gridsearch_classifier.fit(X_train_transformed, y_train)
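The cost of this search can be estimated before running it. The sketch below repeats the grid and `num_cv` from the cells above and counts the model fits, assuming `GridSearchCV` keeps its default `refit=True` (one extra fit of the best configuration on the full training set).

```python
from itertools import product

# Same grid and fold count as in the notebook
hyperparameter_grid = {
    'C': [0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 5.0, 10.0, 100.0],
    'l1_ratio': [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1],
    'class_weight': ['balanced', None],
}
num_cv = 10

# Every combination of the listed values is one candidate model
combinations = list(product(*hyperparameter_grid.values()))

# One fit per candidate and fold, plus the final refit (assumes refit=True)
n_fits = len(combinations) * num_cv + 1

print(f"{len(combinations)} candidates, {n_fits} fits in total")
```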

### Evaluate the Model

For evaluation, use the confusion matrix, the macro F1 score, and balanced accuracy, which account for class imbalance.

In [11]:
# Check the results
print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
print(gridsearch_classifier.best_params_)

The mean cross-validated score of the best model is 61.54% accuracy and the parameters of best prediction model are:
{'C': 0.1, 'class_weight': 'balanced', 'l1_ratio': 0}


In [12]:
# Take the best estimator
model = gridsearch_classifier.best_estimator_

# Predict labels
y_pred = model.predict(X_test_transformed)

In [13]:
print(f"Balanced acc: {balanced_accuracy_score(y_test, y_pred)}")
print(f"Macro F1 score: {f1_score(y_test, y_pred, average='macro')}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")

Balanced acc: 0.5490943238308509
Macro F1 score: 0.37021833618012623
Confusion matrix:
[[41069 42659]
 [ 1399  2167]]
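Both scores can be re-derived from the confusion matrix printed above, which shows how they relate to the per-class counts. The sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels. Balanced accuracy is the mean of the per-class recalls, and macro F1 is the unweighted mean of the per-class F1 scores.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
confusion = [[41069, 42659],
             [1399, 2167]]

n_classes = len(confusion)
recalls, f1_scores = [], []
for k in range(n_classes):
    tp = confusion[k][k]                           # correctly predicted as class k
    fn = sum(confusion[k]) - tp                    # class k predicted as something else
    fp = sum(row[k] for row in confusion) - tp     # other classes predicted as class k
    recalls.append(tp / (tp + fn))
    f1_scores.append(2 * tp / (2 * tp + fp + fn))

balanced_acc = sum(recalls) / n_classes
macro_f1 = sum(f1_scores) / n_classes

print(f"Per-class recall: {recalls}")
print(f"Per-class F1: {f1_scores}")
print(f"Balanced acc: {balanced_acc}")
print(f"Macro F1 score: {macro_f1}")
```

The per-class values show where the low macro F1 comes from: recall is roughly 0.49 for class 0 and 0.61 for class 1, but the many false positives pull the F1 score of class 1 down to roughly 0.09.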


### Save the Model

In [14]:
# Save the data and the fitted model with pickle
data_and_model = [X_train, X_test, y_train, y_test, model]

with open(f'./models/model_elnet_case{case}_unders_{undersample}.pickle', 'wb') as handle:
    pickle.dump(data_and_model, handle, protocol=pickle.HIGHEST_PROTOCOL)