# Modeling Notebook

In this last phase, we'll load the datasets prepared in the Feature Engineering phase, then implement, train, and test several anomaly detection machine learning algorithms. We'll then evaluate each algorithm by computing relevant metrics, which will help us select the most suitable model for our use case.

#### Import libraries section

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import IsolationForest

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.metrics import make_scorer, roc_auc_score

import warnings
warnings.filterwarnings("ignore")

### 1. Load the processed dataset

In [2]:
consolidated_df = pd.read_csv("../Data/Processed/processed_consolidated_data.csv")
filtered_df = pd.read_csv("../Data/Processed/processed_filtered_data.csv")

In [3]:
consolidated_df.head()

Unnamed: 0,cluster_6,cluster_10,cluster_15,cluster_14,cluster_4,cluster_2,cluster_17,cluster_8,cluster_12,cluster_3,cluster_9,cluster_5,cluster_1,cluster_13,cluster_0,cluster_11,cluster_7,cluster_16,anomaly
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.475738,0.0,0.0,0.0,-0.2981119,0.0,0.0,0.0,0.0,0
1,0.0,-0.038235,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,1.373144,6.981317,0.0,1.181053,0.0,4.596254,6.473705,0.0,0.511663,2.330604,0.593854,-0.341399,0.0,-1.04e-08,-0.088416,17.912556,0.0,2.593812,1
3,1.373144,6.981317,0.0,1.181053,0.0,4.596254,6.473705,0.0,0.511663,2.330604,0.593854,-0.341399,0.0,-1.04e-08,-0.088416,17.912556,0.0,2.593812,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.160446,0.0,0.0,0.0,0.0,-0.088416,0.0,0.0,0.0,1


In [4]:
filtered_df.head()

Unnamed: 0,Assets,EarningsPerShareBasic,NetIncomeLoss,RetainedEarningsAccumulatedDeficit,StockholdersEquity,anomaly
0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.38838,0.0,0.0,0.0,1
3,0.0,0.38838,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,1


### 2. Train and test split and validations
In this section, we split the data into training and test sets so that we can train our machine learning models and validate their performance. We also run the final checks before feeding the data into the models.

In [5]:
# Count and drop duplicate rows in the consolidated dataset
print("Number of duplicates in consolidated_df:", consolidated_df.duplicated().sum())
consolidated_df = consolidated_df.drop_duplicates()
# Count and drop duplicate rows in the filtered dataset
print("Number of duplicates in filtered_df:", filtered_df.duplicated().sum())
filtered_df = filtered_df.drop_duplicates()

# Drop rows whose input columns (everything except the last column, 'anomaly') are all zero
mask = (consolidated_df.iloc[:,:-1] == 0).all(axis=1)
print("Number of full Null rows in inputs:", mask.sum())
consolidated_df = consolidated_df[~mask]
mask = (filtered_df.iloc[:,:-1] == 0).all(axis=1)
print("Number of full Null rows in inputs:", mask.sum())
filtered_df = filtered_df[~mask]

Number of duplicates in consolidated_df: 604
Number of duplicates in filtered_df: 1719
Number of full Null rows in inputs: 2
Number of full Null rows in inputs: 2


In [6]:
# (1) Consolidated dataset

# Define features and target
X_consolidated = consolidated_df.drop(columns=['anomaly'])
y_consolidated = consolidated_df['anomaly']

# Split the data into training and testing sets
X_train_con, X_test_con, y_train_con, y_test_con = train_test_split(X_consolidated, y_consolidated, test_size=0.3, random_state=42, stratify=y_consolidated)

In [7]:
# (2) Filtered dataset

# Define features and target
X_filtered = filtered_df.drop(columns=['anomaly'])
y_filtered = filtered_df['anomaly']

# Split the data into training and testing sets
X_train_filter, X_test_filter, y_train_filter, y_test_filter = train_test_split(X_filtered, y_filtered, test_size=0.3, random_state=42, stratify=y_filtered)
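
Both splits above pass `stratify=` so that the anomaly rate is preserved in the training and test sets. The following is a minimal sketch of how that could be checked. It uses synthetic stand-in data, and the names `X_demo`, `y_demo`, and the 9% anomaly rate are assumptions for illustration, not values taken from the project files.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 9% anomalies (illustrative values)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([1] * 90 + [0] * 910, name="anomaly")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, the anomaly rate should be nearly identical in both splits
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train anomaly rate: {train_rate:.3f}, test anomaly rate: {test_rate:.3f}")
```

The same two `.mean()` calls could be applied to `y_train_con` / `y_test_con` and `y_train_filter` / `y_test_filter` in this notebook.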

### 3. Modelling

In this section, we try different machine learning models to build our anomaly detection classifier and select the best performer. We start with a Dummy Classifier as a baseline to provide a benchmark for comparison. We then train an Isolation Forest, a One-Class SVM, a Random Forest classifier, and an AutoEncoder.

For each model, we'll perform hyperparameter tuning using RandomizedSearchCV to find the best configuration. We evaluate each model using accuracy, confusion matrix, and classification report to understand their performance. Finally, we identify the best performing model based on accuracy and save it to a pickle file for future use. Additionally, we visualize the confusion matrices for the top models to inspect their performance further.
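
Since every model is evaluated with the same three metrics, the reporting could be wrapped in a small helper. The sketch below is one possible shape for it. The function name `evaluate` and the toy labels are hypothetical and are not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def evaluate(name, y_true, y_pred):
    """Print and return the metrics used in this notebook (hypothetical helper)."""
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)
    print(f"{name}:")
    print(f"Accuracy: {acc}\n")
    print(f"Confusion Matrix:\n{cm}\n")
    print(f"Classification Report:\n{classification_report(y_true, y_pred, zero_division=0)}")
    return acc, cm


# Toy labels for illustration only (not taken from the project data)
y_true_demo = np.array([0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 0, 1, 1, 0])
acc_demo, cm_demo = evaluate("Demo", y_true_demo, y_pred_demo)
```

Returning the accuracy and confusion matrix makes it easier to collect the results of all models in one place when choosing the best performer.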

#### 3.1 Dummy Classifier
This is our benchmark model: it always predicts the most frequent class.

In [8]:
# (1) Consolidated dataset

# Instantiate and fit a dummy classifier (e.g., most frequent strategy)
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train_con, y_train_con)

# Predict on test set
y_pred_dummy = dummy_clf.predict(X_test_con)

print("Dummy Classifier:")
print(f"Accuracy: {accuracy_score(y_test_con, y_pred_dummy)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_con, y_pred_dummy)}\n")
print(f"Classification Report:\n{classification_report(y_test_con, y_pred_dummy, zero_division=0)}")

Dummy Classifier:
Accuracy: 0.9111709286675639

Confusion Matrix:
[[1354    0]
 [ 132    0]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1354
           1       0.00      0.00      0.00       132

    accuracy                           0.91      1486
   macro avg       0.46      0.50      0.48      1486
weighted avg       0.83      0.91      0.87      1486



In [9]:
# (2) Filtered dataset

# Instantiate and fit a dummy classifier (e.g., most frequent strategy)
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train_filter, y_train_filter)

# Predict on test set
y_pred_dummy = dummy_clf.predict(X_test_filter)

print("Dummy Classifier:")
print(f"Accuracy: {accuracy_score(y_test_filter, y_pred_dummy)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_filter, y_pred_dummy)}\n")
print(f"Classification Report:\n{classification_report(y_test_filter, y_pred_dummy, zero_division=0)}")

Dummy Classifier:
Accuracy: 0.9097222222222222

Confusion Matrix:
[[1048    0]
 [ 104    0]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1048
           1       0.00      0.00      0.00       104

    accuracy                           0.91      1152
   macro avg       0.45      0.50      0.48      1152
weighted avg       0.83      0.91      0.87      1152
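
The baseline figures can be reproduced by hand from the two confusion matrices printed above. The short sketch below does the arithmetic. It shows that the roughly 91% accuracy is simply the share of the majority class, while no anomaly is detected at all.

```python
import numpy as np

# Confusion matrices printed above (rows: true class, columns: predicted class)
cm_consolidated = np.array([[1354, 0],
                            [132, 0]])
cm_filtered = np.array([[1048, 0],
                        [104, 0]])

results = {}
for name, cm in [("consolidated", cm_consolidated), ("filtered", cm_filtered)]:
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / cm.sum()     # equals the share of class 0 here
    recall_anomaly = tp / (tp + fn)     # fraction of true anomalies that were flagged
    results[name] = (accuracy, recall_anomaly)
    print(f"{name}: accuracy = {accuracy:.4f}, anomaly recall = {recall_anomaly:.2f}")
```

Any later model therefore needs to be judged on its anomaly recall and precision as well as on accuracy, since a model that flags nothing already reaches about 0.91.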



#### 3.2 Isolation Forest

3.2.1 Consolidated dataset

In [10]:
# (1) Consolidated dataset

# Instantiate Isolation Forest
iso_forest_1 = IsolationForest(contamination=0.09)  # Adjust contamination based on your anomaly rate
iso_forest_1.fit(X_train_con)

# Predict outliers/anomalies
y_pred_if = iso_forest_1.predict(X_test_con)  # Anomalies are labeled as -1, normal points as 1
y_pred_if = np.where(y_pred_if == -1, 1, 0)

print("\nIsolation Forest:")
print(f"Accuracy: {accuracy_score(y_test_con, y_pred_if)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_con, y_pred_if)}\n")
print(f"Classification Report:\n{classification_report(y_test_con, y_pred_if, zero_division=0)}")


Isolation Forest:
Accuracy: 0.8304172274562585

Confusion Matrix:
[[1218  136]
 [ 116   16]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.90      0.91      1354
           1       0.11      0.12      0.11       132

    accuracy                           0.83      1486
   macro avg       0.51      0.51      0.51      1486
weighted avg       0.84      0.83      0.84      1486
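
The `contamination` value above is hard-coded. As a hedged alternative, it could be estimated from the labelled anomaly rate of the training set. The sketch below illustrates the idea on synthetic data. All arrays and the 5% rate are assumptions made for the example, and `random_state` is fixed only to make the sketch reproducible.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in data: 950 "normal" points near the origin, 50 shifted far away
X_normal = rng.normal(loc=0.0, scale=1.0, size=(950, 4))
X_anom = rng.normal(loc=6.0, scale=1.0, size=(50, 4))
X_demo = np.vstack([X_normal, X_anom])
y_demo = np.array([0] * 950 + [1] * 50)

# Estimate contamination from the labelled anomaly rate instead of hard-coding it
contamination = float(y_demo.mean())  # 0.05 for this synthetic data

iso = IsolationForest(contamination=contamination, random_state=42)
iso.fit(X_demo)

# predict() returns -1 for anomalies and 1 for normal points; remap to the 1/0 label convention
y_pred = np.where(iso.predict(X_demo) == -1, 1, 0)
print("flagged:", int(y_pred.sum()), "of", len(y_pred))
```

In this notebook the equivalent expression would be `y_train_con.mean()` or `y_train_filter.mean()`, which keeps the threshold tied to the data rather than to a fixed constant.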



3.2.2 Consolidated dataset: hyperparameter tuning

In [11]:
# Instantiate Isolation Forest
iso_forest_1 = IsolationForest(contamination=0.09)

# Hyperparameter tuning for Isolation Forest using RandomizedSearchCV
param_dist_if_1 = {'max_samples': [0.25, 0.5, 0.75, 1.0],  # Fraction of samples drawn to train each tree
    'n_estimators': [50, 100, 200, 300],   # Number of trees in the forest
    'max_features': [5, 10, 15, 1.0]}   # Features drawn per tree (int = count, 1.0 = all); None or a count above the number of columns is rejected

# Define scorer: IsolationForest predicts -1/1 and has no predict_proba, so we rank
# samples by the negated score_samples (higher = more anomalous) against the 0/1 labels
def anomaly_roc_auc(estimator, X, y):
    return roc_auc_score(y, -estimator.score_samples(X))

# Hyperparameter tuning (the labels are passed so the scorer can use them; IsolationForest.fit ignores y)
random_search_if_1 = RandomizedSearchCV(iso_forest_1,  param_distributions=param_dist_if_1, n_iter=10, cv=5, scoring=anomaly_roc_auc, random_state=42, n_jobs=-1)
random_search_if_1.fit(X_train_con, y_train_con)

best_if_clf_1 = random_search_if_1.best_estimator_
y_pred_best_if_1 = best_if_clf_1.predict(X_test_con)
y_pred_best_if_1 = np.where(y_pred_best_if_1 == -1, 1, 0)  # Convert -1 (anomaly) to 1, and 1 (normal) to 0


print("\nBest Isolation Forest:")
print(f"Accuracy: {accuracy_score(y_test_con, y_pred_best_if_1)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_con, y_pred_best_if_1)}\n")
print(f"Classification Report:\n{classification_report(y_test_con, y_pred_best_if_1, zero_division=0)}")


Best Isolation Forest:
Accuracy: 0.8243606998654105

Confusion Matrix:
[[1209  145]
 [ 116   16]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.89      0.90      1354
           1       0.10      0.12      0.11       132

    accuracy                           0.82      1486
   macro avg       0.51      0.51      0.51      1486
weighted avg       0.84      0.82      0.83      1486
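
The report above depends on the threshold implied by `contamination`. A threshold-free view of how well the model ranks anomalies can complement it. The sketch below is an illustration on synthetic data, with all arrays being assumptions for the example. It uses `score_samples`, which returns lower values for more abnormal points, so the sign is flipped before computing ROC AUC and average precision.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(1)
# Synthetic stand-in data: train on normal points, test on 190 normal + 10 shifted points
X_train_demo = rng.normal(size=(500, 3))
X_test_demo = np.vstack([rng.normal(size=(190, 3)),
                         rng.normal(loc=5.0, size=(10, 3))])
y_test_demo = np.array([0] * 190 + [1] * 10)

iso = IsolationForest(random_state=42).fit(X_train_demo)

# score_samples is lower for abnormal points, so negate it to get an anomaly score
anomaly_score = -iso.score_samples(X_test_demo)

roc_auc = roc_auc_score(y_test_demo, anomaly_score)
pr_auc = average_precision_score(y_test_demo, anomaly_score)
print(f"ROC AUC: {roc_auc:.3f}, average precision: {pr_auc:.3f}")
```

With roughly 9% anomalies in the real data, average precision is often the more informative of the two, because it is not inflated by the large number of correctly ranked normal points.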



3.2.3 Filtered dataset

In [12]:
# (2) Filtered dataset

# Instantiate Isolation Forest
iso_forest_2 = IsolationForest(contamination=0.09)  # Adjust contamination based on your anomaly rate
iso_forest_2.fit(X_train_filter)

# Predict outliers/anomalies
y_pred_if = iso_forest_2.predict(X_test_filter)  # Anomalies are labeled as -1, normal points as 1
y_pred_if = np.where(y_pred_if == -1, 1, 0)

print("\nIsolation Forest:")
print(f"Accuracy: {accuracy_score(y_test_filter, y_pred_if)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_filter, y_pred_if)}\n")
print(f"Classification Report:\n{classification_report(y_test_filter, y_pred_if, zero_division=0)}")


Isolation Forest:
Accuracy: 0.8342013888888888

Confusion Matrix:
[[942 106]
 [ 85  19]]

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.90      0.91      1048
           1       0.15      0.18      0.17       104

    accuracy                           0.83      1152
   macro avg       0.53      0.54      0.54      1152
weighted avg       0.85      0.83      0.84      1152



3.2.4 Filtered dataset: hyperparameter tuning

#### 3.3 One-Class SVM

#### 3.4 RandomForest Classifier

#### 3.5 AutoEncoder

### 4. Identify the best performer

### 5. Saving the model
In this last section, we save the best-performing model to a pickle file for later use.

In [None]:
import pickle

# Save the best performing model to a pickle file
# (best_model is expected to be assigned in section 4)
pickle_file_path = '../models/best_class_model.pkl'
with open(pickle_file_path, 'wb') as file:
    pickle.dump(best_model, file)
print(f"Best model saved to {pickle_file_path}")
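
To confirm that a saved model can be reused, it can be loaded back and compared with the original. The following sketch shows such a round trip. It uses a `DummyClassifier` and a temporary directory as stand-ins, since `best_model` and the `../models/` folder are not defined here. Both stand-ins are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.dummy import DummyClassifier

# Stand-in for best_model, fitted on toy data (illustrative only)
X_demo = np.zeros((10, 2))
y_demo = np.array([0] * 9 + [1])
model = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_class_model.pkl")
    # Write the fitted model to disk
    with open(path, "wb") as file:
        pickle.dump(model, file)
    # Read it back into a new object
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should give exactly the same predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.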