# Modeling Notebook

In this final phase, we load the datasets prepared in the Feature Engineering phase, then implement, train, and test several anomaly detection algorithms. We then evaluate each algorithm by computing relevant metrics, which helps us select the most suitable model for our use case.

#### Import libraries section

In [13]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import IsolationForest

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

### 1. Load the processed dataset

In [2]:
consolidated_df = pd.read_csv("../Data/Processed/processed_consolidated_data.csv")
filtered_df = pd.read_csv("../Data/Processed/processed_filtered_data.csv")

In [3]:
consolidated_df.tail()

| Unnamed: 0 | cluster_6 | cluster_10 | cluster_15 | cluster_14 | cluster_4 | cluster_2 | cluster_17 | cluster_8 | cluster_12 | cluster_3 | cluster_9 | cluster_5 | cluster_1 | cluster_13 | cluster_0 | cluster_11 | cluster_7 | cluster_16 | anomaly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5554 | -0.246869 | NaN | NaN | NaN | -0.046225 | NaN | NaN | -0.103407 | -0.696322 | NaN | 0.0 | 0.0 | -1.147162 | NaN | -0.244415 | -0.130097 | 0.048848 | 0.403916 | 0 |
| 5555 | 2.201774 | 3.381788 | 3.21784 | 3.592778 | 0.65135 | 3.189005 | 3.068466 | 0.389782 | 1.660412 | 0.565051 | 0.0 | -1.14595 | 2.688592 | -1.166791 | 1.996624 | NaN | -0.351831 | 1.770652 | 0 |
| 5556 | 0.22435 | 2.113636 | 3.117679 | 2.208783 | 0.0 | 2.630917 | 1.342173 | 0.373364 | 0.389803 | 1.290899 | 0.0 | 0.0 | 0.51175 | 2.654875 | 1.268348 | 2.376257 | -0.130018 | NaN | 0 |
| 5557 | 0.010262 | 1.320406 | 3.001764 | 1.361922 | -0.600823 | 0.828067 | 1.725537 | -0.074663 | 0.726163 | 1.130658 | -0.126752 | 0.0 | 0.837838 | -1.225484 | 0.491605 | 1.876478 | -4.49273 | NaN | 0 |
| 5558 | 0.400114 | -0.751519 | 2.053032 | 0.364424 | 0.0 | 2.149691 | 2.461613 | 1.128856 | 0.314571 | -0.373194 | 0.129812 | 0.0 | 0.696582 | 2.329912 | -0.063248 | NaN | NaN | -1.082809 | 0 |


In [4]:
filtered_df.tail()

| Unnamed: 0 | Assets | EarningsPerShareBasic | NetIncomeLoss | RetainedEarningsAccumulatedDeficit | StockholdersEquity | anomaly |
|---|---|---|---|---|---|---|
| 5554 | -0.504218 | -1.430176 | -0.292595 | -0.289156 | -0.390762 | 0 |
| 5555 | 0.570299 | 0.447677 | 1.328116 | -0.196763 | 0.578382 | 0 |
| 5556 | 0.292185 | 0.743182 | 0.546734 | 1.449525 | -0.331026 | 0 |
| 5557 | -0.060695 | 0.249874 | 0.205876 | -0.390601 | -0.420549 | 0 |
| 5558 | 0.102541 | 0.31104 | -0.010839 | 0.01186 | -0.003716 | 0 |


In [5]:
consolidated_df = consolidated_df.fillna(0)
filtered_df = filtered_df.fillna(0)

### 2. Train and test split and validations
In this section, we split the data into training and test sets so we can train the models and validate their performance. We also run the final checks before feeding the data into the models.

In [6]:
# Ensure there are no duplicates
print("Number of duplicates in consolidated_df:", consolidated_df.duplicated().sum())
consolidated_df = consolidated_df.drop_duplicates()
# Ensure there are no duplicates
print("Number of duplicates in filtered_df:", filtered_df.duplicated().sum())
filtered_df = filtered_df.drop_duplicates()

Number of duplicates in consolidated_df: 654
Number of duplicates in filtered_df: 1629


In [7]:
# (1) Consolidated dataset

# Define features and target
X_consolidated = consolidated_df.drop(columns=['anomaly'])
y_consolidated = consolidated_df['anomaly']

# Split the data into training and testing sets
X_train_con, X_test_con, y_train_con, y_test_con = train_test_split(X_consolidated, y_consolidated, test_size=0.3, random_state=42, stratify=y_consolidated)

In [8]:
# (2) Filtered dataset

# Define features and target
X_filtered = filtered_df.drop(columns=['anomaly'])
y_filtered = filtered_df['anomaly']

# Split the data into training and testing sets
X_train_filter, X_test_filter, y_train_filter, y_test_filter = train_test_split(X_filtered, y_filtered, test_size=0.3, random_state=42, stratify=y_filtered)
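
A quick way to confirm that `stratify` kept the anomaly rate comparable across both splits is to compare the class proportions directly. The sketch below is only illustrative: it uses a small synthetic frame (`demo_df`, with made-up `feature_a` and `feature_b` columns and an assumed anomaly rate of roughly 9%) rather than the project's files.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: ~9% anomalies)
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
    "anomaly": (rng.random(1000) < 0.09).astype(int),
})

X_demo = demo_df.drop(columns=["anomaly"])
y_demo = demo_df["anomaly"]

# Same split settings as the cells above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, the anomaly rate should be nearly identical in both splits
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Train anomaly rate: {train_rate:.4f}")
print(f"Test anomaly rate:  {test_rate:.4f}")
```

The same two `mean()` calls can be run on `y_train_con`/`y_test_con` and `y_train_filter`/`y_test_filter` to check the real splits.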

### 3. Modelling

In this section, we try different machine learning models to build our anomaly detection classifier and select the best performer. We start with a Dummy Classifier as a baseline for comparison. We then train an Isolation Forest, a One-Class SVM, and a Random Forest classifier.

For each model, we tune hyperparameters with RandomizedSearchCV to find the best configuration, then evaluate the result using accuracy, the confusion matrix, and the classification report. Finally, we identify the best-performing model based on accuracy and save it to a pickle file for future use. We also visualize the confusion matrices of the top models to inspect their performance further.
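
The tuning-and-saving workflow described above could look roughly like the sketch below. It is not the notebook's actual search: the data is synthetic (`make_classification` with an assumed ~9% positive rate), the parameter ranges are illustrative, and the file name `best_model_demo.pkl` is hypothetical. It scores with `"f1"` because accuracy is dominated by the majority class on imbalanced data; passing `scoring="accuracy"` would match the selection criterion stated above.

```python
import pickle
import tempfile
from pathlib import Path

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data (assumption: ~9% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.91, 0.09], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Hypothetical search space; the ranges are illustrative only
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled configurations
    scoring="f1",      # swap for "accuracy" to match the stated criterion
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)
print("Best parameters:", search.best_params_)

# Persist the best estimator and reload it to confirm the round trip
model_path = Path(tempfile.gettempdir()) / "best_model_demo.pkl"
with open(model_path, "wb") as fh:
    pickle.dump(search.best_estimator_, fh)

with open(model_path, "rb") as fh:
    restored = pickle.load(fh)
```

Unsupervised detectors such as Isolation Forest need a custom scorer to be used with `RandomizedSearchCV`, because their `predict` output uses -1/1 labels instead of 0/1.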

#### 3.1 Dummy Classifier

In [9]:
# (1) Consolidated dataset

# Instantiate and fit a dummy classifier (e.g., most frequent strategy)
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train_con, y_train_con)

# Predict on test set
y_pred_dummy = dummy_clf.predict(X_test_con)

print("Dummy Classifier:")
print(f"Accuracy: {accuracy_score(y_test_con, y_pred_dummy)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_con, y_pred_dummy)}\n")
print(f"Classification Report:\n{classification_report(y_test_con, y_pred_dummy, zero_division=0)}")

Dummy Classifier:
Accuracy: 0.9110054347826086

Confusion Matrix:
[[1341    0]
 [ 131    0]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1341
           1       0.00      0.00      0.00       131

    accuracy                           0.91      1472
   macro avg       0.46      0.50      0.48      1472
weighted avg       0.83      0.91      0.87      1472



In [10]:
# (2) Filtered dataset

# Instantiate and fit a dummy classifier (e.g., most frequent strategy)
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train_filter, y_train_filter)

# Predict on test set
y_pred_dummy = dummy_clf.predict(X_test_filter)

print("Dummy Classifier:")
print(f"Accuracy: {accuracy_score(y_test_filter, y_pred_dummy)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_filter, y_pred_dummy)}\n")
print(f"Classification Report:\n{classification_report(y_test_filter, y_pred_dummy, zero_division=0)}")

Dummy Classifier:
Accuracy: 0.9117896522476675

Confusion Matrix:
[[1075    0]
 [ 104    0]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1075
           1       0.00      0.00      0.00       104

    accuracy                           0.91      1179
   macro avg       0.46      0.50      0.48      1179
weighted avg       0.83      0.91      0.87      1179



#### 3.2 Isolation Forest

In [11]:
# (1) Consolidated dataset

# Instantiate Isolation Forest
iso_forest = IsolationForest(contamination=0.09)  # Adjust contamination based on your anomaly rate
iso_forest.fit(X_train_con)

# Predict outliers/anomalies
y_pred_if = iso_forest.predict(X_test_con)  # Anomalies are labeled as -1, normal points as 1
y_pred_if = np.where(y_pred_if == -1, 1, 0)

print("\nIsolation Forest:")
print(f"Accuracy: {accuracy_score(y_test_con, y_pred_if)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_con, y_pred_if)}\n")
print(f"Classification Report:\n{classification_report(y_test_con, y_pred_if, zero_division=0)}")

ValueError: Input X contains infinity or a value too large for dtype('float32').
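
The `ValueError` above indicates that the consolidated features contain infinite values, which `fillna(0)` leaves untouched because it only replaces `NaN`. One possible remedy, sketched here on a tiny hypothetical frame (`demo_X`, with made-up columns), is to convert infinities to `NaN` before filling. Whether zero is an appropriate replacement for an infinite value is a modeling assumption that should be checked against the Feature Engineering step.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Tiny illustrative frame containing NaN and +/-inf (hypothetical values)
demo_X = pd.DataFrame({
    "cluster_a": [0.1, np.inf, -0.3, 0.7, 0.2, -0.1],
    "cluster_b": [1.2, 0.4, -np.inf, np.nan, 0.9, 0.3],
})

# Count infinite entries per column to locate the problem
inf_counts = np.isinf(demo_X).sum()
print(inf_counts)

# Treat infinities like missing values, then apply the same zero fill used earlier
clean_X = demo_X.replace([np.inf, -np.inf], np.nan).fillna(0)

# Fitting now succeeds because every value is finite
iso = IsolationForest(contamination=0.09, random_state=42)
iso.fit(clean_X)
```

Applying the same `replace(...).fillna(0)` chain to `consolidated_df` before the split would be one way to let the consolidated run complete.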

In [15]:
# (2) Filtered dataset

# Instantiate Isolation Forest
iso_forest = IsolationForest(contamination=0.09)  # Adjust contamination based on your anomaly rate
iso_forest.fit(X_train_filter)

# Predict outliers/anomalies
y_pred_if = iso_forest.predict(X_test_filter)  # Anomalies are labeled as -1, normal points as 1
y_pred_if = np.where(y_pred_if == -1, 1, 0)

print("\nIsolation Forest:")
print(f"Accuracy: {accuracy_score(y_test_filter, y_pred_if)}\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_test_filter, y_pred_if)}\n")
print(f"Classification Report:\n{classification_report(y_test_filter, y_pred_if, zero_division=0)}")


Isolation Forest:
Accuracy: 0.833757421543681

Confusion Matrix:
[[968 107]
 [ 89  15]]

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.90      0.91      1075
           1       0.12      0.14      0.13       104

    accuracy                           0.83      1179
   macro avg       0.52      0.52      0.52      1179
weighted avg       0.85      0.83      0.84      1179
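
To make the comparison with the baseline explicit, the metrics can be recomputed directly from the two confusion matrices reported for the filtered test set. The helper `summarize` below is a hypothetical name introduced for illustration; the matrix values are copied from the outputs above.

```python
import numpy as np

# Confusion matrices reported above for the filtered test set
# rows = true class (0, 1), columns = predicted class (0, 1)
cm_dummy = np.array([[1075, 0], [104, 0]])
cm_iforest = np.array([[968, 107], [89, 15]])

def summarize(cm):
    """Return accuracy, precision, and recall for the anomaly class (label 1)."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

for name, cm in [("Dummy", cm_dummy), ("Isolation Forest", cm_iforest)]:
    acc, prec, rec = summarize(cm)
    print(f"{name}: accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}")
```

The Dummy Classifier reaches higher accuracy while detecting no anomalies at all, whereas Isolation Forest trades accuracy for a recall of 15/104 on the anomaly class. This is why accuracy alone is a weak selection criterion when about 9% of the samples are anomalies.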



Candidate models to consider:

1. Logistic regression
2. Tree-based models: Random Forest and gradient boosting machines (GBMs) such as XGBoost, LightGBM, and CatBoost
3. Support Vector Machines (SVMs)
4. Neural networks: MLPs, CNNs, and LSTMs
5. Ensemble methods
6. Anomaly detection algorithms: given the nature of anomalies, consider specialized techniques such as Isolation Forest, One-Class SVM, or autoencoders.
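
Of the candidates listed above, One-Class SVM is the closest drop-in companion to Isolation Forest, since scikit-learn's `OneClassSVM.predict` uses the same -1 (anomaly) / 1 (normal) convention. The following is a minimal sketch on synthetic two-dimensional data, not a result from this project: the data, the `nu=0.09` setting (chosen to mirror the contamination value used earlier), and the variable names are all assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Hypothetical data: train on "normal" points, test on a mix with obvious outliers
X_normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_outliers = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X_test_demo = np.vstack([rng.normal(size=(40, 2)), X_outliers])
y_test_demo = np.array([0] * 40 + [1] * 10)

# Scaling matters for RBF kernels; nu bounds the fraction of training points treated as outliers
ocsvm = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.09),
)
ocsvm.fit(X_normal)

# Map -1 (anomaly) / 1 (normal) to the 1 / 0 labels used in this notebook
y_pred_demo = np.where(ocsvm.predict(X_test_demo) == -1, 1, 0)
print("Flagged anomalies:", int(y_pred_demo.sum()))
```

The mapped predictions can then be passed to `accuracy_score`, `confusion_matrix`, and `classification_report` exactly as in the Isolation Forest cells.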