# Data Ingestion and Threshold-Based Anomaly Labeling
---
* Load the dataset from CSV into a Pandas DataFrame.
* Define error margin and quantiles for CPU, RAM, and Disk.
* Compute thresholds for each metric using the specified quantiles, adding the error margin to RAM and Disk.
* Print the computed thresholds as percentages.
* Store thresholds in a dictionary for easy reference.
* Apply rule-based classification: label as `1` (anomaly) if any metric exceeds its threshold, otherwise `0` (normal).
* Print counts of predicted labels (`0` vs `1`).
* Print percentages of predicted labels.


In [14]:
import pandas as pd

# Load dataset
df = pd.read_csv('Data/system_metrics_binary.csv')

error = 0.1

cpu_quantile = 0.925
ram_quantile = 0.9
disk_quantile = 0.85

thresholds = {
    "cpu": df["cpu_ratio"].quantile(cpu_quantile),
    "ram": df["ram_ratio"].quantile(ram_quantile)+error,
    "disk": df["disk_ratio"].quantile(disk_quantile)+error
}
print('----'*15)
print(f"Thresholds: \nCPU: {thresholds['cpu']*100:.2f}%\nRAM: {thresholds['ram']*100:.2f}%\nDISK:{thresholds['disk']*100:.2f}%")
print('----'*15)
# Copy the computed thresholds under a constant-style name for the rules below
THRESHOLDS = dict(thresholds)


# Apply rule-based classification
df["pred_label"] = (
    (df["cpu_ratio"] > THRESHOLDS["cpu"]) |
    (df["ram_ratio"] > THRESHOLDS["ram"]) |
    (df["disk_ratio"] > THRESHOLDS["disk"])
).astype(int)


# Show counts
print("Label counts:")
print(df["pred_label"].value_counts())
print('----'*15)

print("\nLabel percentages:")
print(df["pred_label"].value_counts(normalize=True) * 100)
print('----'*15)


------------------------------------------------------------
Thresholds: 
CPU: 71.40%
RAM: 74.82%
DISK:60.15%
------------------------------------------------------------
Label counts:
pred_label
0    3192
1     422
Name: count, dtype: int64
------------------------------------------------------------

Label percentages:
pred_label
0    88.323188
1    11.676812
Name: proportion, dtype: float64
------------------------------------------------------------
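
The label above only records whether any rule fired, not which one. As a rough sketch of how the contribution of each metric could be inspected, the snippet below applies the same OR rule to synthetic data. The DataFrame `demo` and the values in `demo_thresholds` are assumptions made for illustration and are not the thresholds computed in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (illustrative values only).
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    rng.uniform(0.0, 1.0, size=(1000, 3)),
    columns=["cpu_ratio", "ram_ratio", "disk_ratio"],
)

# Hypothetical thresholds; the notebook derives its own from quantiles.
demo_thresholds = {"cpu": 0.90, "ram": 0.95, "disk": 0.85}

# One boolean column per rule: True where that metric exceeds its threshold.
exceeded = pd.DataFrame({
    name: demo[f"{name}_ratio"] > limit
    for name, limit in demo_thresholds.items()
})

# A row is labeled 1 if any rule fires, mirroring the OR used in the notebook.
demo["pred_label"] = exceeded.any(axis=1).astype(int)

print("Rows exceeding each threshold:")
print(exceeded.sum())
print("\nNumber of rules fired per row:")
print(exceeded.sum(axis=1).value_counts().sort_index())
```

Rows where more than one rule fires are counted once in `pred_label`, so the per-metric counts can add up to more than the number of anomalies.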


# Checking Label Counts
---

- Verifying Class Imbalance before Data Preprocessing

In [15]:
df['pred_label'].value_counts()

pred_label
0    3018
1     399
Name: count, dtype: int64

# Data Preprocessing
---
1. Selecting the Feature Columns and the Target Column (`pred_label` is already integer-encoded as `0`/`1`)
2. Splitting the Data into Train and Test Sets
3. Creating a Preprocessing Pipeline for the Data with the Following Steps:
    > 1. A Column Transformer that Scales the Numeric Data
    > 2. A SMOTE Step in the Main Pipeline that Handles the Class Imbalance
    > 3. A Placeholder Model that Grid Search CV Swaps During Hyperparameter Tuning
4. This Pipeline is designed for Evaluating Supervised Machine Learning Models.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE



In [None]:
features = ["cpu_ratio", "ram_ratio", "disk_ratio"]
X = df[features]
y = df['pred_label']


In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
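
The `stratify=y` argument matters here because the anomaly class is a small minority. A minimal sketch, using a synthetic target with an assumed 12% anomaly share (not the notebook's data), shows that a stratified split keeps that share nearly identical in both partitions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 880 normal rows, 120 anomalous (illustrative).
y_demo = pd.Series([0] * 880 + [1] * 120, name="pred_label")
X_demo = pd.DataFrame({"x": np.arange(len(y_demo))})

# Same split settings as the notebook cell above.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Anomaly share, full :", y_demo.mean())
print("Anomaly share, train:", y_tr.mean())
print("Anomaly share, test :", y_te.mean())
```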


In [19]:

# =========================
# PREPROCESSING
# =========================
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), features),
    ]
)

# =========================
# PIPELINE (THE "clf" STEP IS SWAPPED BY GRID SEARCH)
# =========================
pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression())  # placeholder
])



# Supervised Model Evaluation
---

### **Models Selected:**

1. Logistic Regression
2. Random Forest Classifier
3. Gradient Boosting Classifier


In [20]:
# =========================
# PARAMETER GRID (MULTI-MODEL)
# =========================
param_grid = [

    # ---- Logistic Regression ----
    {
        "clf": [LogisticRegression(
            max_iter=1000,
            solver="liblinear"
        )],
        "clf__C": [0.1, 1.0, 10.0],
    },

    # ---- Random Forest ----
    {
        "clf": [RandomForestClassifier(
            random_state=42,
            n_jobs=-1
        )],
        "clf__n_estimators": [100, 200],
        "clf__max_depth": [None, 10, 20],
        "clf__min_samples_split": [2, 5],
    },

    # ---- Gradient Boosting ----
    {
        "clf": [GradientBoostingClassifier(
            random_state=42
        )],
        "clf__n_estimators": [100, 200],
        "clf__learning_rate": [0.05, 0.1],
        "clf__max_depth": [3, 5],
    }
]

# =========================
# GRID SEARCH CV
# =========================
grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
    verbose=2
)
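
When `param_grid` is a list of dictionaries, each dictionary is expanded separately and the results are concatenated. The sketch below counts the combinations with scikit-learn's `ParameterGrid`. The estimators are replaced by placeholder strings (an assumption made to keep the example light), since only the number of combinations matters.

```python
from sklearn.model_selection import ParameterGrid

# Same grid shapes as the notebook; the "clf" values are placeholder strings.
grid_shapes = [
    {"clf": ["logreg"], "clf__C": [0.1, 1.0, 10.0]},
    {
        "clf": ["random_forest"],
        "clf__n_estimators": [100, 200],
        "clf__max_depth": [None, 10, 20],
        "clf__min_samples_split": [2, 5],
    },
    {
        "clf": ["gradient_boosting"],
        "clf__n_estimators": [100, 200],
        "clf__learning_rate": [0.05, 0.1],
        "clf__max_depth": [3, 5],
    },
]

n_candidates = len(ParameterGrid(grid_shapes))  # 3 + 12 + 8
cv_folds = 5

print(f"{n_candidates} candidates, {n_candidates * cv_folds} fits")
```

This gives 23 candidates and 115 fits for 5-fold cross-validation, which is a quick way to estimate the cost of a search before running it.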


In [21]:

# =========================
# TRAIN (CV RUNS HERE)
# =========================
grid.fit(X_train, y_train)

# =========================
# BEST MODEL
# =========================
print("\nBest model:")
print(grid.best_estimator_["clf"])

print("\nBest parameters:")
print(grid.best_params_)

# =========================
# FINAL EVALUATION
# =========================
best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(
    y_test,
    y_pred,
))


Fitting 5 folds for each of 23 candidates, totalling 115 fits

Best model:
RandomForestClassifier(n_jobs=-1, random_state=42)

Best parameters:
{'clf': RandomForestClassifier(n_jobs=-1, random_state=42), 'clf__max_depth': None, 'clf__min_samples_split': 2, 'clf__n_estimators': 100}

Confusion Matrix:
[[604   0]
 [  0  80]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       604
           1       1.00      1.00      1.00        80

    accuracy                           1.00       684
   macro avg       1.00      1.00      1.00       684
weighted avg       1.00      1.00      1.00       684



**Insights**

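Grid search selected the Random Forest classifier (ranked by cross-validated F1), and it classified all 684 test samples correctly, giving precision, recall and F1 of 1.00 for both classes. Because `pred_label` was itself generated from threshold rules on the same three features, a tree-based model can reproduce those rules almost exactly, so the perfect score shows that the labeling rule was recovered rather than that independently verified anomalies were detected.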
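
A perfect score is plausible whenever the target is a deterministic function of the input features. The sketch below illustrates this on synthetic data: labels are built with the same kind of OR-of-thresholds rule and a single decision tree is then trained on them. The data, the `limits` values and the use of `DecisionTreeClassifier` are assumptions for illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_syn = rng.uniform(0.0, 1.0, size=(3000, 3))

# Hypothetical thresholds; a row is anomalous if any feature exceeds its limit.
limits = np.array([0.90, 0.90, 0.85])
y_syn = (X_syn > limits).any(axis=1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)

# Internal nodes have feature index >= 0; leaves are marked with -2.
learned_splits = np.unique(tree.tree_.threshold[tree.tree_.feature >= 0])

print(f"Test accuracy: {test_acc:.3f}")
print("Split points learned by the tree:", np.round(learned_splits, 3))
```

The learned split points land close to the thresholds used to create the labels, which is why near-perfect test metrics are expected in this setup.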

---

# Unsupervised Model Evaluation
--- 

### **Models Selected:**

1. Isolation Forest
2. Local Outlier Factor

In [22]:
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("clf", IsolationForest(random_state=42))  # placeholder
])

# =========================
# PARAMETER GRID
# =========================
param_grid = [
    # Isolation Forest
    {
        "clf": [IsolationForest(random_state=42)],
        "clf__n_estimators": [100, 200],
        "clf__max_samples": ["auto", 0.8],
        "clf__contamination": [0.1, 0.2],
        "clf__max_features": [1.0, 0.8]
    },
    # Local Outlier Factor (novelty=True to allow predict)
    {
        "clf": [LocalOutlierFactor(novelty=True)],
        "clf__n_neighbors": [20, 35],
        "clf__algorithm": ["auto"],
        "clf__leaf_size": [30, 50],
        "clf__contamination": [0.1, 0.2]
    }
]

# =========================
# GRID SEARCH CV
# =========================
grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring="f1",
    cv=3,
    n_jobs=-1,
    verbose=2
)

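# Fit using training set (labels only for scoring).
# Note: these estimators predict -1/1 while y_train holds 0/1, so the built-in
# "f1" scorer cannot score them directly (see the nan scores in the output).
grid.fit(X_train, y_train)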

# =========================
# BEST MODEL
# =========================
best_model_unsupervised = grid.best_estimator_
y_pred = best_model_unsupervised.predict(X_test)

# Convert -1 anomaly to 1, inliers to 0
y_pred = (y_pred == -1).astype(int)

print("\nBest model:", best_model_unsupervised["clf"])
print("Best parameters:", grid.best_params_)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Fitting 3 folds for each of 24 candidates, totalling 72 fits


[truncated warning: cross-validation test scores are non-finite: ... nan nan nan nan nan nan]



Best model: IsolationForest(contamination=0.1, random_state=42)
Best parameters: {'clf': IsolationForest(random_state=42), 'clf__contamination': 0.1, 'clf__max_features': 1.0, 'clf__max_samples': 'auto', 'clf__n_estimators': 100}

Confusion Matrix:
 [[587  17]
 [ 28  52]]

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.97      0.96       604
           1       0.75      0.65      0.70        80

    accuracy                           0.93       684
   macro avg       0.85      0.81      0.83       684
weighted avg       0.93      0.93      0.93       684



**Insights**

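Isolation Forest was the model returned by the unsupervised search, but it did not meet the requirements: on the test set it reached precision 0.75, recall 0.65 and F1 0.70 for the anomaly class, missing 28 of the 80 anomalies. The cross-validation scores printed above are `nan`, which suggests the search could not rank the candidates; the reported parameters coincide with the first combination in the grid, so they should not be read as a tuned choice.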
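
The `nan` scores are consistent with a label mismatch: `IsolationForest` and `LocalOutlierFactor` predict `-1` for outliers and `1` for inliers, while the target uses `1` for anomalies and `0` for normal rows. One possible remedy, sketched below on synthetic data, is a custom scorer that remaps the predictions before computing F1. The helper `anomaly_f1`, the data and the small grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV


def anomaly_f1(y_true, y_pred):
    """F1 for the anomaly class after mapping -1 (outlier) to 1, inliers to 0."""
    return f1_score(y_true, (y_pred == -1).astype(int))


# Synthetic data: a dense normal cluster plus scattered high-usage rows.
rng = np.random.default_rng(0)
normal = rng.normal(0.3, 0.05, size=(270, 3))
anomalous = rng.uniform(0.7, 1.0, size=(30, 3))
X_syn = np.vstack([normal, anomalous])
y_syn = np.array([0] * 270 + [1] * 30)

# Shuffle so every cross-validation fold contains both classes.
order = rng.permutation(len(y_syn))
X_syn, y_syn = X_syn[order], y_syn[order]

search = GridSearchCV(
    estimator=IsolationForest(n_estimators=50, random_state=42),
    param_grid={"contamination": [0.05, 0.1, 0.2]},
    scoring=make_scorer(anomaly_f1),
    cv=3,
)
search.fit(X_syn, y_syn)  # labels are used only by the scorer

print("Mean CV F1 per candidate:", search.cv_results_["mean_test_score"])
print("Best parameters:", search.best_params_)
```

With finite scores the search can actually compare candidates, so `best_params_` reflects the scoring metric instead of the order of the grid.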

---

# Final Verdict:
---

- The supervised models generalized well to the held-out test data
- The unsupervised models did not reach comparable performance on the same test data
- Therefore, a supervised model will be used in the final deployment

---


## Saving the Best Model with the Hyperparameters

In [23]:
import joblib

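# Saving the best trained supervised pipeline
joblib.dump(best_model, "D:/Client_Projects/dcml2526/models/supervised_pipeline_simple.joblib")

# Saving the best hyperparameters separately.
# `grid` was reassigned to the unsupervised search in an earlier cell, so
# grid.best_params_ no longer belongs to the supervised search. Read the
# classifier's parameters from the fitted supervised pipeline instead.
joblib.dump(best_model["clf"].get_params(), "D:/Client_Projects/dcml2526/models/supervised_best_params_simple.joblib")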


['D:/Client_Projects/dcml2526/models/supervised_best_params_simple.joblib']

In [24]:
joblib.dump(best_model_unsupervised, 'D:/Client_Projects/dcml2526/models/unsupervised_pipeline_simple.joblib')
joblib.dump(grid.best_params_, 'D:/Client_Projects/dcml2526/models/unsupervised_best_params_simple.joblib')

['D:/Client_Projects/dcml2526/models/unsupervised_best_params_simple.joblib']
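
Before relying on a saved pipeline in deployment, it can help to confirm that the file loads and reproduces the same predictions. The sketch below does a round trip with `joblib` in a temporary directory. The small pipeline, the synthetic data and the file name `pipeline_demo.joblib` are assumptions for illustration and stand in for the pipelines saved above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline trained on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_syn = rng.uniform(0.0, 1.0, size=(200, 3))
y_syn = (X_syn[:, 0] > 0.8).astype(int)

model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X_syn, y_syn)

# Save to a temporary directory and load the file back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "pipeline_demo.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print("Predictions identical after reload:", same)
```

Building paths with `pathlib.Path` relative to a project directory, rather than hard-coding an absolute drive path, also makes the notebook easier to run on another machine.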