# PART 5 : Isolation Forest Modelling 

In this final part of the code, we use a machine learning algorithm called "Isolation Forest". It detects anomalies by isolating data points with randomly built binary trees. Because it is an unsupervised algorithm, it needs no labels to separate the outliers from the rest of the data. This method can be used to check for 'Torque Dropout' and 'Overheating'.

Further details are discussed below.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest

# Step 1: Loading the dataset
df = pd.read_csv("cleaned_vehicle_test_logs.csv")


In [2]:
df.head()

Unnamed: 0,timestamp,vehicle_id,propulsion_type,test_scenario,vehicle_speed_mph,acceleration_mphps,engine_rpm,torque,coolant_temp,ambient_temp,power_kw,phase,soc,source_file
0,2025-08-11 21:02:06.665331,V001,EV,Cold Start,0.0,0.0,500.0,99.49,39.87,29.089257,5.21,Idle,99.99,V001_Cold_Start.csv
1,2025-08-11 21:02:07.665331,V001,EV,Cold Start,0.0,0.0,500.0,98.36,39.89,29.086131,5.15,Idle,100.0,V001_Cold_Start.csv
2,2025-08-11 21:02:08.665331,V001,EV,Cold Start,0.0,0.0,500.0,118.5,39.91,29.093595,6.2,Idle,100.0,V001_Cold_Start.csv
3,2025-08-11 21:02:09.665331,V001,EV,Cold Start,0.0,0.0,500.0,108.82,39.93,29.105409,5.7,Idle,100.0,V001_Cold_Start.csv
4,2025-08-11 21:02:10.665331,V001,EV,Cold Start,0.0,0.0,500.0,123.97,39.95,29.106028,6.49,Idle,99.89,V001_Cold_Start.csv


In [3]:
# Step 2: Dropping columns that are not needed for the modelling process
drop_cols = ['timestamp', 'vehicle_id','test_scenario', 'phase', 'ambient_temp','source_file', 'soc']
df_clean = df.drop(columns=drop_cols, errors='ignore')


In [4]:
# Step 3: Encoding categorical variables, since the Isolation Forest model requires numeric input
df_clean['propulsion_type'] = df_clean['propulsion_type'].astype('category').cat.codes
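
To keep track of which integer stands for which propulsion type, the code-to-label mapping can be recovered from the categorical itself. The snippet below is a minimal sketch on a toy Series (the sample values are made up for illustration). pandas orders string categories alphabetically, so in this toy case EV maps to 0 and Hybrid to 1.

```python
import pandas as pd

# Toy stand-in for df_clean['propulsion_type'] (illustrative values only)
propulsion = pd.Series(["Hybrid", "EV", "EV", "Hybrid"])

as_category = propulsion.astype("category")
codes = as_category.cat.codes

# Recover the code -> label mapping so later rules can refer to it safely
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)
print(codes.tolist())
```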

In [5]:
# Step 4: Handling of missing values
df_clean = df_clean.fillna(df_clean.median(numeric_only=True))

In [6]:
# Step 5: Isolation Forest for general anomaly detection
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
df['anomaly'] = model.fit_predict(df_clean)
df['anomaly'] = df['anomaly'].map({1: 0, -1: 1})  # after remapping: 0 = normal, 1 = anomaly

Next, we look at how the algorithm works:
- Each tree randomly picks a feature and a split value; a point’s path length (the number of splits needed to isolate it) is short for outliers and long for normal points.

- The forest averages path lengths across trees to get a score; contamination picks a percentile of that score as the cut-off so that ≈2% of the points are labeled anomalies.

- The process is broken down into the following steps:
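
The link between path length and score can be made concrete with the normalisation from the original Isolation Forest paper: the score is s = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length and c(n) is the expected path length of an unsuccessful binary-search-tree lookup among n points. The sketch below is a hand-rolled illustration of that formula, not scikit-learn's internal code; the helper names are hypothetical, and scikit-learn reports the negated score through score_samples.

```python
import math

EULER_GAMMA = 0.5772156649015329

def expected_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Paper-style score: values near 1 are anomalous, values below 0.5 look normal."""
    return 2.0 ** (-avg_path_length / expected_path_length(n))

n = 256  # subsample size per tree
short_path = anomaly_score(4.0, n)   # isolated after only a few splits
long_path = anomaly_score(12.0, n)   # needs many splits to isolate
print(round(expected_path_length(n), 2), round(short_path, 2), round(long_path, 2))
```

A point whose average path length equals c(n) scores exactly 0.5; shorter paths push the score towards 1.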

I)
- We need a purely numeric feature matrix to feed the model.
- Isolation Forest (IF) in scikit-learn expects numeric input, so categoricals must be encoded upstream (as done above for propulsion_type).

In [7]:
# Building the feature matrix
try:
    X_used = X_iso  # if you already created it elsewhere
except NameError:
    # Using numeric columns from df_clean directly
    X_used = df_clean.select_dtypes(include=[np.number]).copy()

II) 

- The model builds many random trees on random subsamples; anomalies get isolated in fewer splits (shorter path length).
- n_estimators=100: the number of trees; more trees give smoother, more stable scores.
- contamination=0.02: we expect ~2% outliers. This sets the threshold that converts continuous scores into inlier/outlier labels; it doesn’t change the raw scoring mechanics.
- fit_predict returns +1 (inlier) or -1 (outlier). We remap to 0/1 so that 1 = anomaly and 0 = normal.

In [8]:
# 2) (Re)fitting the model if it's missing or its feature count doesn't match X_used
if ('model' not in globals()) or (getattr(model, 'n_features_in_', None) != X_used.shape[1]):
    model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
    df['anomaly'] = model.fit_predict(X_used)
    df['anomaly'] = df['anomaly'].map({1: 0, -1: 1})
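
The role of contamination can be checked on synthetic data: after fitting, the share of training points labeled -1 should be close to the requested fraction. This sketch uses made-up Gaussian data rather than the vehicle logs, and the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 3))  # synthetic stand-in for X_used

demo_model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = demo_model.fit_predict(X_demo)   # +1 = inlier, -1 = outlier
flags = np.where(labels == -1, 1, 0)      # same remapping as above: 1 = anomaly

flagged_share = flags.mean()
print(f"Share flagged as anomalies: {flagged_share:.3f}")
```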

III) 

- score_samples (scikit-learn) is higher for more normal points.
- We negate it so higher = more anomalous, which is intuitive for ranking.
- The score relates to the average path length across trees; shorter paths ⇒ more anomalous.

In [9]:
# 3) Adding anomaly score (higher = more anomalous)
df['anomaly_score'] = -model.score_samples(X_used)
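
In scikit-learn, the labels and the scores are tied together through the fitted offset_ attribute: decision_function equals score_samples minus offset_, and a point is labeled an outlier when that value is negative. The following is a small check on synthetic data (not the vehicle logs), with one deliberately extreme point planted at row 0.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(2000, 4))
X_demo[0] = [8.0, -8.0, 8.0, -8.0]   # one planted, obviously unusual point

demo_model = IsolationForest(
    n_estimators=100, contamination=0.02, random_state=42
).fit(X_demo)

raw = demo_model.score_samples(X_demo)           # higher = more normal
shifted = demo_model.decision_function(X_demo)   # raw minus offset_
labels = demo_model.predict(X_demo)              # -1 where shifted < 0

anomaly_score = -raw                             # higher = more anomalous
rank_of_planted = int((anomaly_score > anomaly_score[0]).sum()) + 1
print("Rank of the planted point:", rank_of_planted)
```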

IV) 

- We keep ID/time/scenario/propulsion and a few operational signals to inspect what was flagged.
- Sort by anomaly_score (most unusual first) and add a rank so you can quickly review the top N.

In [10]:
# 4) Show top anomalies with context
context_cols = [c for c in [
    'vehicle_id', 'timestamp', 'test_scenario', 'propulsion_type',
    'vehicle_speed_mph', 'engine_rpm', 'torque'
] if c in df.columns]

top_anomalies = (
    df[df['anomaly'] == 1]
      .sort_values('anomaly_score', ascending=False)
      .assign(anomaly_rank=lambda s: s['anomaly_score']
              .rank(method='first', ascending=False).astype(int))
      [context_cols + ['anomaly_score', 'anomaly_rank']]
)


V) 

- Center by the median (not mean) and scale by MAD (median absolute deviation).
- The constant 1.4826 makes MAD comparable to standard deviation under a normal assumption.
- This is resistant to outliers (robust), so it highlights which features are unusually large/small for that sample compared to the bulk of the data.

In [11]:
# 5) Lightweight “why”: top deviating features via robust z-scores
X_med = X_used.median()
X_mad = (X_used - X_med).abs().median().replace(0, np.nan).fillna(1e-9)
robust_z = ((X_used - X_med) / (1.4826 * X_mad)).replace([np.inf, -np.inf], np.nan).fillna(0.0)

def explain_row(i, k=3):
    # i is an index label taken from df.index, so look it up with .loc (label-based), not .iloc
    zabs = robust_z.loc[i].abs().sort_values(ascending=False)
    return ", ".join([f"{f} (|z|={v:.2f})" for f, v in zabs.head(k).items()])

df['anomaly_top_features'] = ""
anom_idx = df.index[df['anomaly'] == 1]
df.loc[anom_idx, 'anomaly_top_features'] = [explain_row(i, 3) for i in anom_idx]


For each anomalous row, we list the top k features with the largest |robust z|.
This is a post-hoc diagnostic (not the model’s internal explanation), but it’s a fast and often accurate hint about “what was off” (e.g., torque way below typical, rpm unusually high, etc.).
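
To illustrate why the median/MAD version is preferred here, the sketch below compares a classic z-score with a robust z-score on a tiny invented series containing one extreme value. The numbers are for illustration only.

```python
import pandas as pd

# Toy signal with one extreme value (illustrative numbers only)
values = pd.Series([10, 11, 9, 10, 12, 10, 50], dtype=float)

# Classic z-score: the outlier inflates both the mean and the standard deviation
classic_z = (values - values.mean()) / values.std()

# Robust z-score: median and MAD are barely affected by the outlier
median = values.median()
mad = (values - median).abs().median()
robust_z = (values - median) / (1.4826 * mad)

print(pd.DataFrame({
    "value": values,
    "classic_z": classic_z.round(2),
    "robust_z": robust_z.round(2),
}))
```

The extreme value stands out far more clearly on the robust scale, because it cannot drag the centre and spread estimates towards itself.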

In [12]:
# Step 6: Overheat Event Detection
def detect_overheat(row):
    if row['propulsion_type'] in ['EV', 0]:  # 0 is EV after encoding and 1 is for Hybrid
        return int(row['coolant_temp'] > 57.8)
    else:
        return int(row['coolant_temp'] > 69.8)

df['overheat_event'] = df.apply(detect_overheat, axis=1)

Thresholds differ by propulsion type: EV packs are more temperature-sensitive than hybrids, so we use two limits:

EV: 57.8 °C
Hybrid/other: 69.8 °C

These limits sit slightly above the soft ceilings, which gives a buffer to avoid false positives from rounding/measurement noise while still catching genuinely hot samples.
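
The same per-propulsion rule can also be written without a row-wise apply. The sketch below is an alternative formulation on a toy frame (values are invented); it assumes propulsion_type still holds the original strings, with anything other than 'EV' falling under the hybrid limit.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "propulsion_type": ["EV", "EV", "Hybrid", "Hybrid"],
    "coolant_temp":    [55.0, 58.0, 65.0, 70.5],
})

# Pick the limit per row, then compare in one vectorized step
limit = np.where(toy["propulsion_type"] == "EV", 57.8, 69.8)
toy["overheat_event"] = (toy["coolant_temp"] > limit).astype(int)
print(toy)
```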

In [13]:
# Step 7: Torque dropout Detection
# Marking low torque
df['torque_dropout_flag'] = (df['torque'] < 40).astype(int)

# Rolling window to check for 3+ consecutive low values
df['torque_dropout_event'] = (
    df.groupby('vehicle_id')['torque_dropout_flag']
    .transform(lambda x: x.rolling(window=3, min_periods=1).sum() >= 3)
    .astype(int)
)
df.drop(columns=['torque_dropout_flag'], inplace=True)

Here, a torque dropout is defined as torque staying below 40 Nm (the threshold used in the code above) for 3 consecutive samples, which corresponds to 3 seconds at the one-second logging interval.
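
To see how the rolling check behaves, here is a toy run with two vehicles (torque values invented for illustration, threshold 40 as in the code above). Grouping by vehicle_id keeps a low-torque streak at the end of one vehicle's log from merging with the start of the next.

```python
import pandas as pd

toy = pd.DataFrame({
    "vehicle_id": ["V1"] * 4 + ["V2"] * 4,
    "torque":     [100, 35, 30, 20, 10, 15, 90, 5],
})

low = (toy["torque"] < 40).astype(int)

# Count low readings in each 3-sample window, separately per vehicle
window_count = low.groupby(toy["vehicle_id"]).transform(
    lambda s: s.rolling(window=3, min_periods=1).sum()
)
toy["torque_dropout_event"] = (window_count >= 3).astype(int)
print(toy)
```

Only the fourth sample of V1 is flagged, since it completes three consecutive low readings; V2's low readings are interrupted and never reach three in a row.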

In [14]:
# Printing the top anomalies detected by Isolation Forest
print("\n🔎 Top 20 IsolationForest anomalies:")
print(top_anomalies.head(20))


🔎 Top 20 IsolationForest anomalies:
      vehicle_id                   timestamp test_scenario propulsion_type  \
22712       V002  2025-08-11 21:53:38.941160    High Speed          Hybrid   
54723       V004  2025-08-11 22:05:10.265321    High Speed          Hybrid   
22555       V002  2025-08-11 21:51:01.941160    High Speed          Hybrid   
7921        V001  2025-08-11 21:25:07.800516    High Speed              EV   
7933        V001  2025-08-11 21:25:19.800516    High Speed              EV   
7947        V001  2025-08-11 21:25:33.800516    High Speed              EV   
54642       V004  2025-08-11 22:03:49.265321    High Speed          Hybrid   
7946        V001  2025-08-11 21:25:32.800516    High Speed              EV   
54731       V004  2025-08-11 22:05:18.265321    High Speed          Hybrid   
22435       V002  2025-08-11 21:49:01.941160    High Speed          Hybrid   
70256       V005  2025-08-11 21:50:03.466682    High Speed              EV   
36573       V003  2025-08-1

In [15]:
# Step 8: Anomaly, Overheat Event and Torque Dropout Summary
summary = df.groupby('propulsion_type')[['anomaly', 'overheat_event', 'torque_dropout_event']].sum().reset_index()

print("Failure Summary by Propulsion Type:")
print(summary)

Failure Summary by Propulsion Type:
  propulsion_type  anomaly  overheat_event  torque_dropout_event
0              EV      504              21                     1
1          Hybrid      990              11                     0


In [16]:
# Grouping by propulsion_type and test_scenario, and summing the anomaly column
anomaly_summary = df.groupby(['propulsion_type', 'test_scenario'])['anomaly'].sum().reset_index()

# Renaming column for clarity
anomaly_summary.rename(columns={'anomaly': 'anomaly_count'}, inplace=True)

# Sorting by highest anomalies
anomaly_summary = anomaly_summary.sort_values(by='anomaly_count', ascending=False)

# Summary Display 
print(anomaly_summary)

  propulsion_type  test_scenario  anomaly_count
6          Hybrid     High Speed            880
2              EV     High Speed            478
4          Hybrid     Cold Start             97
0              EV     Cold Start             19
7          Hybrid     Hill Climb             11
3              EV     Hill Climb              4
1              EV  Endurance Run              3
5          Hybrid  Endurance Run              2
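
Raw counts depend on how many samples each group contains, so a per-group anomaly rate can be easier to compare. The snippet below is a hedged sketch on a made-up frame (the numbers are not from the vehicle logs); since anomaly is 0/1, its mean is the flagged fraction. The column names anomaly_count, sample_count and anomaly_rate are chosen here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "propulsion_type": ["EV", "EV", "EV", "EV", "Hybrid", "Hybrid"],
    "test_scenario":   ["High Speed", "High Speed", "Cold Start",
                        "Cold Start", "High Speed", "High Speed"],
    "anomaly":         [1, 0, 0, 0, 1, 1],
})

rate_summary = (
    toy.groupby(["propulsion_type", "test_scenario"])["anomaly"]
       .agg(anomaly_count="sum", sample_count="size", anomaly_rate="mean")
       .reset_index()
       .sort_values("anomaly_rate", ascending=False)
)
print(rate_summary)
```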


In [17]:
# Step 9: Anomaly Summary (by vehicle, scenario, and propulsion)
summary = (
    df.groupby(['vehicle_id', 'test_scenario', 'propulsion_type'])
      [['anomaly', 'overheat_event', 'torque_dropout_event']]
      .sum()
      .reset_index()
)

summary_any = (
    summary[
        (summary['anomaly'] > 0) |
        (summary['overheat_event'] > 0) |
        (summary['torque_dropout_event'] > 0)
    ]
    .assign(total_flags=lambda s: s['anomaly'] + s['overheat_event'] + s['torque_dropout_event'])
    .sort_values(['total_flags', 'anomaly', 'overheat_event', 'torque_dropout_event'], ascending=False)
)

print("Failure Summary by Vehicle, Scenario & Propulsion:")
print(summary_any)


Failure Summary by Vehicle, Scenario & Propulsion:
   vehicle_id  test_scenario propulsion_type  anomaly  overheat_event  \
14       V004     High Speed          Hybrid      539               1   
6        V002     High Speed          Hybrid      341               2   
18       V005     High Speed              EV      186               0   
10       V003     High Speed              EV      173               2   
2        V001     High Speed              EV      119               1   
12       V004     Cold Start          Hybrid       69               1   
4        V002     Cold Start          Hybrid       28               4   
8        V003     Cold Start              EV       18               2   
15       V004     Hill Climb          Hybrid       11               0   
19       V005     Hill Climb              EV        4               2   
9        V003  Endurance Run              EV        1               3   
16       V005     Cold Start              EV        1               3   


Now that the results have been displayed, it's worth noting the advantages and drawbacks of this methodology:

Key hyperparameters
- n_estimators: 100–300 (more trees = smoother scores).
- max_samples: 256 (classic default; sub-sampling is core to IF).
- contamination: expected outlier fraction (e.g., 0.5–2%). Used only to set the decision threshold.
- max_features: 1.0 typically fine; can sub-sample features if very wide.
- random_state: for reproducibility.
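
As an illustration of how these hyperparameters are passed, the sketch below fits a model with each of them set explicitly on synthetic data. The specific values are examples within the ranges listed above, not tuned recommendations for the vehicle logs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))  # synthetic data, for illustration only

tuned_model = IsolationForest(
    n_estimators=200,     # more trees -> smoother scores, longer fit time
    max_samples=256,      # rows drawn per tree
    contamination=0.01,   # only moves the label threshold
    max_features=1.0,     # use all features in every tree
    random_state=42,      # reproducible trees
).fit(X_demo)

print(len(tuned_model.estimators_), tuned_model.max_samples_)
```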

Advantages 
- Fast on large datasets (trees on small subsamples).
- Few assumptions; works with mixed distributions.
- Handles high dimensions better than density methods.
- Works well for Torque dropout and Overheating.

Drawbacks
- Categoricals need encoding (e.g., target/ordinal/one-hot). We used codes for propulsion_type.
- Scaling isn’t required for trees, but features with extreme ranges can dominate split spans; robust scaling can help sometimes.
- Missing values must be imputed (median/most frequent) before fitting.
- Gives poor results for general anomaly detection (e.g., much of the High Speed scenario is flagged as anomalous).
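
The encoding and imputation steps listed above can be bundled so that they always run together with the model. This is a sketch of one possible arrangement on synthetic data, using pandas get_dummies for one-hot encoding and scikit-learn's SimpleImputer inside a Pipeline; the toy column values are invented, and this is not the preprocessing used in the cells above (which used category codes and fillna).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "propulsion_type": rng.choice(["EV", "Hybrid"], size=200),
    "torque": rng.normal(100, 10, size=200),
    "coolant_temp": rng.normal(45, 3, size=200),
})
toy.loc[5, "torque"] = np.nan  # simulate a missing reading

# One-hot encode the categorical column; numeric columns pass through unchanged
features = pd.get_dummies(toy, columns=["propulsion_type"], dtype=float)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("iforest", IsolationForest(n_estimators=100, contamination=0.02, random_state=42)),
])
labels = pipeline.fit_predict(features)  # +1 = inlier, -1 = outlier
print(pd.Series(labels).value_counts())
```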

Note: Please take only Torque dropout and Overheating into consideration!

Future Considerations
- Fine-tuning and evaluation are needed
- Refine the general anomaly detection
- Plan for labeled data
- Add more relevant features