# Modelling

The main objectives for this notebook are:
* To develop a model that satisfies our modelling objective
* To properly evaluate the developed model
* To have a trained model ready for deployment


## Good Things to Do
1. Have a proper baseline
2. Perform post-modelling steps: threshold selection, explainability, analysis of false positives and false negatives
3. Use MLflow for experiment tracking
4. Build an ML training pipeline using Kedro/ZenML/Metaflow/etc


In [1]:
import os
import sys
import warnings

import joblib
import mlflow
import numpy as np
import plotly.express as px
import polars as pl
import shap
from optuna.integration.mlflow import MLflowCallback
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split



In [3]:
# Path needs to be added manually to read from another folder
path2add = os.path.normpath(
    os.path.abspath(os.path.join(os.path.dirname("__file__"), os.path.pardir, "utils"))
)
if path2add not in sys.path:
    sys.path.append(path2add)

from ml_util_funcs import evaluate_thresholds, tune_hgbt

In [4]:
warnings.filterwarnings("ignore")
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("api_anomaly")
mlflow.sklearn.autolog(disable=True)

2024/11/05 09:55:02 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2024/11/05 09:55:02 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!

## Read Data

In [6]:
data = pl.read_parquet('../data/supervised_clean_data_w_features.parquet')
data.sample(3)

Unnamed: 0_level_0,_id,inter_api_access_duration(sec),api_access_uniqueness,sequence_length(count),vsession_duration(min),ip_type,num_sessions,num_users,num_unique_apis,source,classification,is_anomaly,max_global_source_degrees,avg_global_source_degrees,min_global_dest_degrees,std_local_source_degrees,max_global_dest_degrees,min_global_source_degrees,std_global_source_degrees,n_connections,avg_global_dest_degrees
i64,str,f64,f64,f64,i64,str,f64,f64,f64,str,str,bool,u32,f64,u32,f64,u32,u32,f64,u32,f64
1206,"""bf53ab82-2b4e-3820-8818-6c7f02…",0.60275,0.5,6.0,217,"""default""",2.0,1.0,3.0,"""E""","""outlier""",True,32071,32071.0,1217,,1217,32071,,1,1217.0
341,"""d9d846e3-b048-3388-80c8-a361c3…",9e-06,0.017857,8.823151,4,"""default""",1947.0,622.0,98.0,"""E""","""normal""",False,32071,6817.737037,23,6.879318,22416,21,8789.635104,270,8213.018519
729,"""f1dce7b4-d28d-3d8b-9d01-7ef372…",0.00544,0.018262,26.737089,1859,"""default""",296.0,213.0,104.0,"""E""","""normal""",False,32071,6428.850794,26,7.899586,22416,15,8838.839606,315,7639.907937


## Data Processing for Modelling

During initial modelling, I noticed that the outliers can be predicted perfectly from the provided features. This is not surprising: it is generally quite easy to replicate the results of an unsupervised model in a supervised way. Hence, the modelling goal for this project has shifted from <br><br> `To develop a supervised model to classify behaviour into normal and anomalous` <br><br> to <br><br>`To develop a supervised model using the engineered features to classify behaviour into normal and anomalous`

In [7]:
label = "is_anomaly"
numerical_features = [
    "max_global_source_degrees",
    "avg_global_source_degrees",
    "min_global_dest_degrees",
    "std_local_source_degrees",
    "max_global_dest_degrees",
    "min_global_source_degrees",
    "std_global_source_degrees",
    "n_connections",
    "avg_global_dest_degrees",
]

# Keep only rows with the default IP type, then select the label and the engineered features
data = data.filter(pl.col('ip_type') == 'default').select([label] + numerical_features)
data.sample(3)

is_anomaly,max_global_source_degrees,avg_global_source_degrees,min_global_dest_degrees,std_local_source_degrees,max_global_dest_degrees,min_global_source_degrees,std_global_source_degrees,n_connections,avg_global_dest_degrees
bool,u32,f64,u32,f64,u32,u32,f64,u32,f64
False,32071,5920.831579,3,7.225176,22416,3,8100.117708,380,7046.963158
False,32071,9067.145455,580,2.325094,22416,476,10049.337307,55,8816.490909
False,32071,6867.384615,813,2.277397,22416,403,9907.997773,65,6246.953846


In [8]:
X_train, X_test, y_train, y_test = train_test_split(data[numerical_features], data[label].to_list(), test_size=0.2)

In [9]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (1233, 9)
Test shape: (309, 9)
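
The split in cell 8 passes neither `random_state` nor `stratify`, so the train/test rows change on every run and the share of anomalies in each part can drift. A minimal sketch of a seeded, stratified split follows. It uses synthetic stand-in arrays (`X_demo`, `y_demo`) with an assumed 10% positive rate, not the notebook's data, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, 10% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 20 + [0] * 180)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Applying the same two arguments to the notebook's split would change which rows land in the test set, so any numbers reported from an unseeded run would need to be recomputed.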


## Baseline
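Before building a model, we need to understand whether the problem can be solved without using ML. The rule tried here flags a row as anomalous when its `n_connections` value is at or below a threshold.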

In [10]:
from sklearn.metrics import f1_score

# Sweep every distinct n_connections value seen in training as a candidate threshold
heuristic_f1_scores = []
possible_values = X_train['n_connections'].unique().sort().to_list()
for v in possible_values:
    # Rule: flag a row as anomalous when it has at most v connections
    heuristic_pred = (X_test['n_connections'] <= v).to_list()
    heuristic_f1_scores.append(f1_score(y_test, heuristic_pred))

In [11]:
px.line(
    x=possible_values, 
    y=heuristic_f1_scores, 
    labels={
        "x": "Number of Connections Threshold",
        "y": "F1 Score",
    },
    title='F1 Score for Heuristic Rule'
)

**Insights**
* The best-scoring threshold is 46 connections (rows with `n_connections <= 46` are flagged as anomalous)
* The heuristic rule achieves an F1 score of 0.73 on the test set
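
The sweep in cell 10 scores every candidate threshold on the test set, so the best F1 is picked with knowledge of the test labels and may be optimistic. One alternative, sketched below under stated assumptions, is to choose the threshold on the training data and score it once on the test data. The toy lists and the helpers `f1` and `best_threshold` are hypothetical and not part of the notebook. `f1` is a small hand-written binary F1 used only to keep the sketch dependency-free.

```python
def f1(y_true, y_pred):
    """Binary F1 for boolean labels; returns 0.0 when there are no true positives."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(values, labels):
    """Return the threshold v that maximises F1 of the rule `value <= v`."""
    candidates = sorted(set(values))
    return max(candidates, key=lambda v: f1(labels, [x <= v for x in values]))


# Toy data (hypothetical): low connection counts tend to be anomalous.
train_conn = [1, 2, 3, 50, 60, 70, 4, 80]
train_anom = [True, True, True, False, False, False, False, False]
test_conn = [2, 55, 3, 90]
test_anom = [True, False, True, False]

# Select the threshold on training data only, then score once on the test data.
threshold = best_threshold(train_conn, train_anom)
test_f1 = f1(test_anom, [x <= threshold for x in test_conn])
print(threshold, test_f1)
```

With this arrangement the test F1 is an estimate of how the rule would perform on unseen data, which makes it a fairer baseline to compare a trained model against.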