# 03_Modeling

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import roc_auc_score, average_precision_score
import joblib
import plotly.express as px
from sklearn.decomposition import PCA

## Load data

In [2]:
df = pd.read_csv('../data/processed/windowed_dataset_cleaned.csv')
X = df[df.columns.difference(['is_attack'])]  # Features
y = df['is_attack']  # Labels

## Modeling

In [3]:
contamination = y.mean()
print(f"Contamination (proportion of attacks): {contamination:.4f}")
from sklearn.ensemble import IsolationForest

model = IsolationForest(
    n_estimators=100,
    contamination=contamination,  # expected share of outliers; sets the decision threshold
    max_samples='auto',  # number of samples to draw from X to train each base estimator
    random_state=42,
    n_jobs=-1  # use all available cores
)

model.fit(X)

Contamination (proportion of attacks): 0.0144


Parameter,Value
n_estimators,100
max_samples,'auto'
contamination,np.float64(0....3663820037493)
max_features,1.0
bootstrap,False
n_jobs,-1
random_state,42
verbose,0
warm_start,False


Isolation Forest builds an ensemble of random decision trees, each of which tries to isolate every sample from the rest. The easier a sample is to isolate (the fewer splits it needs), the higher its anomaly score.
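
For reference, this is the score as defined in the original Isolation Forest paper (Liu, Ting and Zhou, 2008). To the best of my knowledge, scikit-learn's `score_samples` returns this quantity with the sign flipped, which is why the notebook negates it:

$$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}
$$

Here $h(x)$ is the path length needed to isolate sample $x$ in one tree, $E[h(x)]$ is its average over all trees, $n$ is the subsample size used per tree (`max_samples`), and $H(i) \approx \ln(i) + 0.5772$ is the $i$-th harmonic number. Scores close to 1 indicate samples that are isolated after very few splits; scores well below 0.5 indicate samples that sit deep in the trees.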

- **n_estimators** is the number of decision trees in the ensemble. The default value (100) was kept.
- The main hyperparameter of this model is **contamination**. It sets the threshold the model uses to decide whether a sample is an anomaly. It does not influence training or the anomaly score (`score_samples`) assigned to each sample. We set it to the proportion of attacks in the data to see whether the model is able to find them.
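
The claim that **contamination** only moves the threshold can be checked with a small sketch. It uses synthetic stand-in data (`X_demo` is hypothetical, not the notebook's dataset) and two forests that differ only in `contamination`:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (hypothetical, not the notebook's dataset):
# 980 "normal" points near the origin plus 20 far-away outliers.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(980, 4)),
    rng.normal(8.0, 1.0, size=(20, 4)),
])

# Two forests that differ only in `contamination`.
low = IsolationForest(contamination=0.01, random_state=42).fit(X_demo)
high = IsolationForest(contamination=0.10, random_state=42).fit(X_demo)

# Same seed and data give the same trees, so the raw scores match...
same_scores = np.allclose(low.score_samples(X_demo), high.score_samples(X_demo))

# ...while the threshold (offset_) and the flagged fraction differ.
flagged_low = (low.predict(X_demo) == -1).mean()
flagged_high = (high.predict(X_demo) == -1).mean()

print("identical scores:", same_scores)
print("offsets:", low.offset_, high.offset_)
print("flagged fraction:", flagged_low, flagged_high)
```

Because the raw scores are identical, the threshold can be tuned after fitting without retraining the forest.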

In [4]:
df['anomaly_score'] = -model.score_samples(X)
df['anomaly_label'] = model.predict(X) == -1
print(roc_auc_score(y, df['anomaly_score']))
print(average_precision_score(y, df['anomaly_score']))

0.9998045446782768
0.992299652602951


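- These metrics (ROC AUC ≈ 0.9998, average precision ≈ 0.992) show that the anomaly score ranks nearly every attack above the normal samples. Both are computed on the same data the model was fitted on.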
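
ROC AUC and average precision only evaluate the ranking induced by the score; they say nothing about the labels produced by the contamination threshold. A confusion matrix on the thresholded predictions complements them. The sketch below uses synthetic stand-in data (`X_demo` and `y_demo` are hypothetical); in this notebook the equivalent inputs would be `y` and `df['anomaly_label']`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Synthetic stand-in data (hypothetical): 980 normal windows, 20 attacks.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(980, 4)),
    rng.normal(8.0, 1.0, size=(20, 4)),
])
y_demo = np.concatenate([np.zeros(980, dtype=int), np.ones(20, dtype=int)])

# Same setup as the notebook: contamination equals the attack proportion.
forest = IsolationForest(contamination=float(y_demo.mean()), random_state=42)
pred_demo = (forest.fit(X_demo).predict(X_demo) == -1).astype(int)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_demo, pred_demo, labels=[0, 1]).ravel()
precision = precision_score(y_demo, pred_demo, zero_division=0)
recall = recall_score(y_demo, pred_demo, zero_division=0)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```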

In [5]:
df.sort_values('anomaly_score', ascending=False).head(5)

Unnamed: 0,n_connections,id.orig_p_mean,id.orig_p_std,id.orig_p_max,id.resp_p_std,orig_bytes_mean,orig_bytes_std,orig_bytes_max,resp_bytes_mean,resp_bytes_std,...,recent_activity_score_max,recent_docker_event_mean,recent_docker_event_std,recent_docker_event_max,time_since_container_start_mean,time_since_container_start_std,time_since_container_start_max,is_attack,anomaly_score,anomaly_label
334,6.0,0.451839,21.567171,0.403767,0.0,-0.349432,85.022406,-0.116995,463.292929,4356.381717,...,4.993675,-0.5,0.0,-1.0,-1.069591,0.0,-1.073349,1.0,0.768074,True
335,6.0,0.451839,21.567171,0.403767,0.0,-0.349432,85.022406,-0.116995,463.292929,4356.381717,...,4.993675,-0.5,0.0,-1.0,-1.069591,0.0,-1.073349,1.0,0.768074,True
336,6.0,0.451839,21.567171,0.403767,0.0,-0.349432,85.022406,-0.116995,463.292929,4356.381717,...,4.993675,-0.5,0.0,-1.0,-1.069591,0.0,-1.073349,1.0,0.768074,True
315,6.0,0.594097,24.34279,0.540395,0.0,-0.348918,85.396554,-0.113417,463.97114,4363.623789,...,4.993675,-0.5,0.0,-1.0,-1.069591,0.0,-1.073349,1.0,0.767387,True
314,6.0,0.594097,24.34279,0.540395,0.0,-0.348918,85.396554,-0.113417,463.97114,4363.623789,...,4.993675,-0.5,0.0,-1.0,-1.069591,0.0,-1.073349,1.0,0.767387,True


In [6]:
df.sort_values('anomaly_score').head(10)

Unnamed: 0,n_connections,id.orig_p_mean,id.orig_p_std,id.orig_p_max,id.resp_p_std,orig_bytes_mean,orig_bytes_std,orig_bytes_max,resp_bytes_mean,resp_bytes_std,...,recent_activity_score_max,recent_docker_event_mean,recent_docker_event_std,recent_docker_event_max,time_since_container_start_mean,time_since_container_start_std,time_since_container_start_max,is_attack,anomaly_score,anomaly_label
20460,0.0,0.440963,0.0,0.391413,0.0,0.482368,0.0,0.027191,0.070707,0.0,...,-0.401829,0.5,0.0,0.0,-0.055757,0.0,-0.08922,0.0,0.379955,False
20461,0.0,0.440963,0.0,0.391413,0.0,0.482368,0.0,0.027191,0.070707,0.0,...,-0.401829,0.5,0.0,0.0,-0.055757,0.0,-0.08922,0.0,0.379955,False
20459,0.0,0.440963,0.0,0.391413,0.0,0.482368,0.0,0.027191,0.070707,0.0,...,-0.401829,0.5,0.0,0.0,-0.055757,0.0,-0.08922,0.0,0.379955,False
14329,0.0,0.057956,0.0,0.022812,0.0,0.481648,0.0,0.026476,0.090909,0.0,...,0.476437,0.5,0.0,0.0,0.540735,0.0,0.489796,0.0,0.382376,False
14327,0.0,0.057956,0.0,0.022812,0.0,0.481648,0.0,0.026476,0.090909,0.0,...,0.476437,0.5,0.0,0.0,0.540735,0.0,0.489796,0.0,0.382376,False
14328,0.0,0.057956,0.0,0.022812,0.0,0.481648,0.0,0.026476,0.090909,0.0,...,0.476437,0.5,0.0,0.0,0.540735,0.0,0.489796,0.0,0.382376,False
5439,0.0,-0.005529,0.0,-0.038285,0.0,0.556135,0.0,0.100537,0.393939,0.0,...,0.005351,0.5,0.0,0.0,0.687272,0.0,0.63204,0.0,0.383666,False
5441,0.0,-0.005529,0.0,-0.038285,0.0,0.556135,0.0,0.100537,0.393939,0.0,...,0.005351,0.5,0.0,0.0,0.687272,0.0,0.63204,0.0,0.383666,False
5440,0.0,-0.005529,0.0,-0.038285,0.0,0.556135,0.0,0.100537,0.393939,0.0,...,0.005351,0.5,0.0,0.0,0.687272,0.0,0.63204,0.0,0.383666,False
2672,0.0,-0.123348,0.0,-0.151673,0.0,0.487765,0.0,0.032558,0.323232,0.0,...,-0.704663,-0.5,0.0,-1.0,-0.172825,0.0,-0.202858,0.0,0.384297,False


## Visualize results

In [7]:
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

print("Explained variance per component:")
print(pca.explained_variance_ratio_)
print("Total explained variance:", pca.explained_variance_ratio_.sum())

# Build a DataFrame for visualization
df_pca_vis = pd.DataFrame({
    'PCA1': X_pca[:, 0],
    'PCA2': X_pca[:, 1],
    'anomaly_label': df['anomaly_label'].map({True: 'Anomaly', False: 'Normal'}),
    'is_attack': y
})

# Interactive plot with Plotly
fig = px.scatter(
    df_pca_vis,
    x='PCA1',
    y='PCA2',
    color='anomaly_label',
    title="PCA (2D) – Isolation Forest Anomaly Detection",
    width=900,
    height=700,
    hover_data=['is_attack'],
    color_discrete_map={'Anomaly': 'red', 'Normal': 'blue'}
)

fig.update_traces(marker=dict(size=6, opacity=0.8))
fig.update_layout(legend_title_text='Predicted Label')
fig.show()

Explained variance per component:
[0.55767155 0.43215243]
Total explained variance: 0.989823983890221


In [8]:
import umap
import plotly.express as px

# -----------------------------
# UMAP embedding
# -----------------------------
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)

X_umap = reducer.fit_transform(X)

# Build a DataFrame for visualization
df_plot = pd.DataFrame({
    'UMAP1': X_umap[:, 0],
    'UMAP2': X_umap[:, 1],
    'anomaly_score': df['anomaly_score']
})

# Optional: include more original columns to show on hover
# For example, to show the first 5 columns of df:
feature_cols = list(df.columns[:5])
# Make sure 'is_attack' is included
feature_cols.append('is_attack')

df_plot = pd.concat([df_plot, df[feature_cols].reset_index(drop=True)], axis=1)

# Create the interactive plot
fig = px.scatter(
    df_plot,
    x='UMAP1',
    y='UMAP2',
    color='anomaly_score',
    color_continuous_scale='viridis',
    hover_data=feature_cols,  # shows these values on mouse hover
    title="UMAP projection – colored by anomaly score",
    width=900,
    height=700
)

fig.update_traces(marker=dict(size=6, opacity=0.8))
fig.show()



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.

Graph is not fully connected, spectral embedding may not work as expected.



- The model assigns each sample an anomaly score. The samples in the central cluster have a low anomaly score, which means they are difficult to separate from the rest of the samples.
- The island on the right has mid-range anomaly scores, which suggests that those samples are easier to isolate than the previous ones. As seen previously, those samples have higher network intensity, more connections, etc.
- Zooming in on the bottom right corner reveals two long islands. The model distinguishes between them in terms of anomaly score, and both are more anomalous than the rest.

## Save Model

In [9]:
joblib.dump(model, "../models/isolation_forest_model.joblib")

['../models/isolation_forest_model.joblib']
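
A saved model is typically reloaded later to score new windows. The sketch below shows the round trip on synthetic stand-in data, using a temporary file rather than the notebook's `../models/` path (`X_train`, `X_new`, and the file name are hypothetical). New data would need the same feature columns, in the same order and with the same preprocessing, as the data used for fitting.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
X_new = rng.normal(size=(5, 4))

fitted = IsolationForest(contamination=0.05, random_state=42).fit(X_train)

# Round trip through joblib, using a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "isolation_forest_demo.joblib")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# Score unseen samples the same way the notebook scores X.
new_scores = -restored.score_samples(X_new)
new_labels = restored.predict(X_new) == -1

print(new_scores)
print(new_labels)
```

It is generally safest to load the file with the same scikit-learn version that was used to save it.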