
# SIEM Feature Engineering (Data-Aligned)

## Objective
This notebook engineers features from **actual SIEM alert logs**.
Every transformation uses only columns that exist in the real dataset.

Output:
- ML-ready feature matrix
- Saved to `data/processed/siem_features.csv`


In [1]:

import pandas as pd
import numpy as np
import ipaddress


In [2]:
df = pd.read_csv("siem_alerts.csv")
df.head()


|   | timestamp | alert_type | source_ip | destination_host | user | failed_attempts | success_attempts | process_name | privilege_change | login_time | host_criticality | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-02-02 12:00:00 | MALWARE_EXEC | 192.168.1.150 | SERVER-8 | alice | 0 | 0 | unknown.exe | 0 | day | medium | 1 |
| 1 | 2024-01-16 11:10:00 | MALWARE_EXEC | 192.168.1.205 | SERVER-8 | bob | 0 | 0 | unknown.exe | 0 | day | medium | 1 |
| 2 | 2024-02-02 10:36:00 | PRIV_ESC | 192.168.1.241 | SERVER-5 | alice | 0 | 0 |  | 1 | day | critical | 1 |
| 3 | 2024-01-26 07:10:00 | AUTH_FAILURE | 192.168.1.139 | SERVER-2 | john | 1 | 0 |  | 0 | day | medium | 1 |
| 4 | 2024-02-02 10:24:00 | MALWARE_EXEC | 192.168.1.57 | SERVER-2 | john | 0 | 0 | unknown.exe | 0 | day | critical | 1 |



## 1. Timestamp Features


In [3]:

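# Parse timestamps; unparseable values become NaT instead of raising.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df["hour"] = df["timestamp"].dt.hour
# Off-hours = before 08:00 or from 19:00 onward (hour > 18); NaT rows get 0.
df["is_off_hours"] = ((df["hour"] < 8) | (df["hour"] > 18)).astype(int)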
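
Because `errors="coerce"` turns unparseable timestamps into `NaT`, such rows end up with `is_off_hours = 0` without any warning. The sketch below illustrates this on a toy frame; the `demo` frame, its values, and the `n_unparsed` name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with one unparseable timestamp.
demo = pd.DataFrame(
    {"timestamp": ["2024-02-02 12:00:00", "2024-01-26 07:10:00", "not-a-date"]}
)
demo["timestamp"] = pd.to_datetime(demo["timestamp"], errors="coerce")
demo["hour"] = demo["timestamp"].dt.hour

# Comparisons against NaN evaluate to False, so the bad row is flagged 0.
demo["is_off_hours"] = ((demo["hour"] < 8) | (demo["hour"] > 18)).astype(int)

# Count rows whose timestamp could not be parsed.
n_unparsed = int(demo["timestamp"].isna().sum())
print(demo)
print("Unparseable timestamps:", n_unparsed)
```

Running the same `isna().sum()` check on `df["timestamp"]` would show whether any real rows are affected before the flag is trusted.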



## 2. Authentication Behaviour Features


In [4]:

# The +1 in the denominator avoids division by zero when both counts are 0.
df["auth_failure_ratio"] = df["failed_attempts"] / (df["failed_attempts"] + df["success_attempts"] + 1)
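
A small worked example may help show how the `+1` smoothing behaves. The counts below are invented for illustration. Note that the ratio never reaches 1.0, and a single failure with no successes scores 0.5.

```python
import pandas as pd

# Hypothetical attempt counts, chosen only to illustrate the formula.
demo = pd.DataFrame(
    {"failed_attempts": [0, 1, 9, 0], "success_attempts": [0, 0, 1, 5]}
)

# Same formula as the notebook: failed / (failed + success + 1).
denominator = demo["failed_attempts"] + demo["success_attempts"] + 1
demo["auth_failure_ratio"] = demo["failed_attempts"] / denominator

print(demo)
```

The four rows work out to 0/1, 1/2, 9/11 and 0/6 respectively.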



## 3. Privilege Escalation Indicator


In [5]:

# Integer copy of privilege_change; astype(int) raises if the column contains NaN.
df["privilege_change_flag"] = df["privilege_change"].astype(int)



## 4. Encode Categorical Variables


In [6]:

categorical_cols = ["alert_type", "destination_host", "user", "login_time", "host_criticality"]
# One-hot encode; drop_first=True drops each column's first (sorted) level as the baseline.
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
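
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The `night` value for `login_time` is an assumption for illustration; only `day` appears in the preview above.

```python
import pandas as pd

# Toy frame; "night" is a hypothetical login_time value.
demo = pd.DataFrame(
    {
        "host_criticality": ["medium", "critical", "medium"],
        "login_time": ["day", "day", "night"],
    }
)

# Levels are sorted, and the first one per column is dropped as the baseline:
# "critical" for host_criticality and "day" for login_time.
encoded = pd.get_dummies(
    demo, columns=["host_criticality", "login_time"], drop_first=True
)
print(encoded.columns.tolist())
```

A row with all dummy columns set to false therefore represents the baseline level, not a missing value.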



## 5. Final Feature Selection


In [7]:

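# Drop the raw timestamp (already summarized by hour/is_off_hours) and process_name.
# Note: source_ip is still a string column here and must be encoded or dropped before training.
drop_cols = ["timestamp", "process_name"]
X = df_encoded.drop(columns=drop_cols + ["label"])
y = df_encoded["label"]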

X.head(), y.head()


(       source_ip  failed_attempts  success_attempts  privilege_change  hour  \
 0  192.168.1.150                0                 0                 0    12   
 1  192.168.1.205                0                 0                 0    11   
 2  192.168.1.241                0                 0                 1    10   
 3  192.168.1.139                1                 0                 0     7   
 4   192.168.1.57                0                 0                 0    10   
 
    is_off_hours  auth_failure_ratio  privilege_change_flag  \
 0             0                 0.0                      0   
 1             0                 0.0                      0   
 2             0                 0.0                      1   
 3             1                 0.5                      0   
 4             0                 0.0                      0   
 
    alert_type_AUTH_SUCCESS  alert_type_MALWARE_EXEC  ...  \
 0                    False                     True  ...   
 1              
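
The feature matrix above still carries `source_ip` as a string, which most estimators cannot consume, and the `ipaddress` import from the first cell is never used. One possible way to turn the address into numeric features is sketched below. The helper `ip_features`, the column names `source_ip_int` and `source_ip_is_private`, and the `-1` sentinel are assumptions for illustration, not part of the original notebook.

```python
import ipaddress

import pandas as pd


def ip_features(ip_str):
    """Return (integer form, is_private flag); (-1, 0) if the value is not an IP."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return -1, 0
    return int(ip), int(ip.is_private)


# Toy frame: one address from the preview, one public address, one bad value.
demo = pd.DataFrame({"source_ip": ["192.168.1.150", "8.8.8.8", "not-an-ip"]})

features = pd.DataFrame(
    [ip_features(ip) for ip in demo["source_ip"]],
    columns=["source_ip_int", "source_ip_is_private"],
    index=demo.index,
)

# Replace the raw string column with the numeric features.
demo = pd.concat([demo.drop(columns=["source_ip"]), features], axis=1)
print(demo)
```

Whether the integer form is a useful feature depends on the model; simply dropping `source_ip` is the other straightforward option.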


## 6. Save Features


In [9]:
import os

OUTPUT_PATH = "/content/siem-alert-prioritization/data/processed/siem_features.csv"
# Create the output directory if needed, then write the features plus the label column.
os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)
X.assign(label=y).to_csv(OUTPUT_PATH, index=False)
print("Saved:", OUTPUT_PATH)



Saved: /content/siem-alert-prioritization/data/processed/siem_features.csv
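
A quick round-trip check after saving can catch shape mismatches and leftover non-numeric columns. The sketch below uses a temporary directory and a tiny stand-in frame (`X_demo`, `y_demo`, both hypothetical) so that it runs on its own; in the notebook the same checks would apply to `OUTPUT_PATH`, `X`, and `y`.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-ins for X and y (hypothetical values).
X_demo = pd.DataFrame({"hour": [12, 7], "is_off_hours": [0, 1]})
y_demo = pd.Series([1, 1], name="label")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "siem_features.csv")
    X_demo.assign(label=y_demo).to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Expect the same number of rows and one extra column for the label.
same_shape = reloaded.shape == (len(X_demo), X_demo.shape[1] + 1)

# Columns that are neither numeric nor boolean would need further encoding.
non_numeric = reloaded.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print("Shape OK:", same_shape)
print("Non-numeric columns:", non_numeric)
```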



## Summary
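- Features derived only from columns present in the real SIEM alert dataset
- SOC-relevant behavior encoded: off-hours activity, authentication failure ratio, privilege changes
- Output saved with the `label` column for model training; `source_ip` is still a raw string and needs encoding or removal first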
