# Log Classification Training Notebook
This notebook loads a dataset of log messages, clusters them using sentence embeddings from a BERT-style SentenceTransformer model, labels messages that follow common patterns with regular expressions, and trains machine learning classifiers on the remaining messages.

## 1. Data Loading and Preparation

In [5]:
import pandas as pd
import numpy as np

df = pd.read_csv('dataset/log_messages.csv')
df = df.drop(columns='complexity')  # column is not used anywhere in this notebook
df.head()

Unnamed: 0,timestamp,source,log_message,target_label
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status
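
The `Unnamed: 0` column in the output above is what pandas typically shows when a DataFrame index was written to the CSV and read back as a regular column. As a minimal sketch on a small in-memory CSV (the two rows below are illustrative, not taken from the dataset), passing `index_col=0` to `read_csv` reads that column back as the index instead:

```python
import io

import pandas as pd

# Illustrative CSV text whose first column has an empty header,
# as produced by DataFrame.to_csv() with the default index=True.
csv_text = (
    ",source,log_message\n"
    "0,ModernCRM,Backup completed successfully.\n"
    "1,ModernHR,User User123 logged in.\n"
)

# Default read: the unnamed first column becomes a data column.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```

Whether this is worth doing here depends on whether later steps rely on the column; nothing in this notebook appears to use it.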


## 2. Embeddings and Clustering
- Load a SentenceTransformer model to generate embeddings and use DBSCAN for clustering
- Encode the log messages into embeddings with the SentenceTransformer model
- Apply DBSCAN to the embeddings and add a 'clusters' column to the DataFrame

In [6]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

model = SentenceTransformer('all-MiniLM-L6-v2')




In [7]:
# Generate embeddings for log messages
embeddings = model.encode(df['log_message'].tolist())

In [8]:
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.2, min_samples=1, metric='cosine')
clusters = dbscan.fit_predict(embeddings)

df['clusters'] = clusters
df.head()

Unnamed: 0,timestamp,source,log_message,target_label,clusters
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status,0
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error,1
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert,2
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status,0
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status,0
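
To make the DBSCAN settings more concrete, here is a minimal sketch on five hand-written 2-D vectors (illustrative stand-ins for the real embeddings). It shows what `eps=0.2` means under `metric='cosine'` and what `min_samples=1` implies for isolated points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Five illustrative 2-D vectors: two near the x-axis, two near the y-axis,
# and one pointing away from all of them.
points = np.array([
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
    [0.1, 0.99],
    [-1.0, -1.0],
])

# eps is a cosine *distance* (1 - cosine similarity), so eps=0.2 links
# vectors whose cosine similarity is at least 0.8.
labels = DBSCAN(eps=0.2, min_samples=1, metric='cosine').fit_predict(points)
print(labels)

# With min_samples=1 every point is a core point, so the isolated vector
# becomes a singleton cluster rather than noise (label -1).
toy_has_noise = bool((labels == -1).any())
print(toy_has_noise)
```

A consequence for the notebook is that the `clusters` column never contains `-1`; rare messages show up as many small clusters instead, which is why the later analysis filters on cluster size.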


In [9]:
df.target_label.value_counts()

target_label
HTTP Status            1017
Security Alert          371
System Notification     356
Error                   177
Resource Usage          177
Critical Error          161
User Action             144
Workflow Error            4
Name: count, dtype: int64

In [10]:
# Analyse Large Clusters (>10 samples)
cluster_cnts = df['clusters'].value_counts()
large_clusters = cluster_cnts[cluster_cnts > 10].index

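for cl in large_clusters:
    cluster_rows = df[df['clusters'] == cl]
    print(f"Cluster {cl}:")
    # Label of the cluster's first row only; other rows may carry different labels
    print(cluster_rows['target_label'].head(1).to_string(index=False))
    # First five messages as a sample of the cluster's content
    print(cluster_rows['log_message'].head().to_string(index=False))
    print()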

Cluster 0:
HTTP Status
nova.osapi_compute.wsgi.server [req-b9718cd8-f6...
nova.osapi_compute.wsgi.server [req-4895c258-b2...
nova.osapi_compute.wsgi.server [req-ee8bc8ba-92...
nova.osapi_compute.wsgi.server [req-f0bffbc3-5a...
nova.osapi_compute.wsgi.server [req-2bf7cfee-a2...

Cluster 5:
Resource Usage
nova.compute.claims [req-a07ac654-8e81-416d-bfb...
nova.compute.claims [req-d6986b54-3735-4a42-907...
nova.compute.claims [req-72b4858f-049e-49e1-b31...
nova.compute.claims [req-5c8f52bd-8e3c-41f0-95a...
nova.compute.claims [req-d38f479d-9bb9-4276-968...

Cluster 11:
User Action
User User685 logged out.
 User User395 logged in.
 User User225 logged in.
User User494 logged out.
 User User900 logged in.

Cluster 13:
System Notification
Backup started at 2025-05-14 07:06:55.
Backup started at 2025-02-15 20:00:19.
  Backup ended at 2025-08-08 13:06:23.
Backup started at 2025-11-14 08:27:43.
Backup started at 2025-12-09 10:19:11.

Cluster 7:
Security Alert
Multiple bad login attempts detecte
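
The loop above prints a single label per cluster, which does not show whether a cluster mixes several labels. One way to check this, sketched here on a small made-up DataFrame (the rows are illustrative, not from the dataset), is to cross-tabulate clusters against labels and compute the share of the most frequent label in each cluster:

```python
import pandas as pd

# Illustrative data: cluster 0 mixes two labels, clusters 1 and 2 are pure.
toy = pd.DataFrame({
    'clusters': [0, 0, 0, 1, 1, 2],
    'target_label': ['HTTP Status', 'HTTP Status', 'Error',
                     'User Action', 'User Action', 'Security Alert'],
})

# Count rows per (cluster, label) pair; each row of the table is one cluster.
label_table = pd.crosstab(toy['clusters'], toy['target_label'])

# Purity: share of a cluster's rows that carry its most frequent label.
purity = label_table.max(axis=1) / label_table.sum(axis=1)

print(label_table)
print(purity)
```

Clusters with purity close to 1.0 are reasonable candidates for a single rule or regex; mixed clusters probably need a trained classifier.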

## 4. Regex Classification
- Classify log messages with common patterns using regular expressions

In [11]:
import re

# Patterns are checked in insertion order; the first match determines the label.
# Literal periods are escaped so that "." does not match an arbitrary character.
REGEX_PATTERNS = {
    r"User ([A-Za-z]+[0-9]*|[0-9]+) logged (in|out)\.": "User Action",
    r"Backup (started|ended) at .*:": "System Notification",
    r"Backup completed successfully\.": "System Notification",
    r"System updated to version .*": "System Notification",
    r"File .* uploaded successfully by user .*": "System Notification",
    r"Disk cleanup completed successfully": "System Notification",
    r"System reboot initiated by user .*": "System Notification",
    r"Account with ID .* created by .*": "User Action"
}

def classify_with_regex(log_message):
    """Return the label of the first matching pattern, or None if nothing matches."""
    for pattern, label in REGEX_PATTERNS.items():
        if re.search(pattern, log_message, re.IGNORECASE):
            return label
    return None

In [12]:
print(classify_with_regex("User User1235 logged IN."))

User Action
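
A detail that matters for patterns ending in a period: in a regular expression an unescaped `.` matches any character, so a pattern meant to end with a literal period needs `\.`. The sketch below uses a simplified, hypothetical pattern (`\w+` for the user name) to show the difference on a message that is not a login or logout event:

```python
import re

# Simplified, hypothetical patterns for illustration only.
unescaped = re.compile(r"User \w+ logged (in|out).", re.IGNORECASE)
escaped = re.compile(r"User \w+ logged (in|out)\.", re.IGNORECASE)

# Not a logout event, but "out" followed by any character satisfies the
# unescaped pattern ("outs" in "outside").
message = "User User42 logged outside business hours"

unescaped_hit = unescaped.search(message) is not None
escaped_hit = escaped.search(message) is not None
print(unescaped_hit, escaped_hit)

# A genuine logout message is matched by the escaped pattern.
genuine_hit = escaped.search("User User42 logged out.") is not None
print(genuine_hit)
```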


In [13]:
df['regex_label'] = df['log_message'].apply(classify_with_regex)

# verify regex classification
df[df['target_label'].isin(['Security Alert', 'System Notification'])].head()

Unnamed: 0,timestamp,source,log_message,target_label,clusters,regex_label
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert,2,
7,10/11/2025 8:44,ModernHR,File data_6169.csv uploaded successfully by us...,System Notification,4,System Notification
13,8/4/2025 19:57,ThirdPartyAPI,Multiple bad login attempts detected on user 8...,Security Alert,7,
14,1/4/2025 1:43,ThirdPartyAPI,File data_3847.csv uploaded successfully by us...,System Notification,4,System Notification
15,5/1/2025 9:41,ModernCRM,Backup completed successfully.,System Notification,8,System Notification
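
Beyond inspecting a few rows, two numbers summarize how the regex step behaves: coverage (the share of rows that receive a label) and agreement (the share of labeled rows whose regex label equals `target_label`). A minimal sketch on made-up rows, using the same column names as the notebook:

```python
import pandas as pd

# Illustrative rows: three matched by a regex, one of them mislabeled.
toy = pd.DataFrame({
    'target_label': ['User Action', 'System Notification',
                     'Security Alert', 'User Action'],
    'regex_label': ['User Action', 'System Notification',
                    None, 'System Notification'],
})

# Rows for which some regex pattern matched.
matched = toy['regex_label'].notna()

# Coverage: fraction of all rows labeled by the regex step.
coverage = matched.mean()

# Agreement: among labeled rows, fraction matching the ground-truth label.
agreement = (toy.loc[matched, 'regex_label']
             == toy.loc[matched, 'target_label']).mean()

print(coverage, agreement)
```

On the real DataFrame the same expressions can be applied to `df` directly, since it already has both columns.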


## 5. Further Classification Strategy
For labels not matched by a regex, we can use either BERT-based classification or LLM classification, depending on the number of training samples available.

Use BERT when enough training samples are available, and LLM classification otherwise.
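
The strategy can be sketched as a small dispatcher. Everything here is hypothetical: the function name `route_log`, the stub classifiers, and in particular the order of the checks (regex first, then the LLM for the `LegacyCRM` source that later cells set aside, then the embedding-based classifier) are assumptions for illustration, not the notebook's implementation.

```python
def route_log(source, log_message, regex_fn, bert_fn, llm_fn):
    """Return (stage, label) for one log message.

    Assumed order: regex first, then the LLM for LegacyCRM,
    then the embedding-based classifier for everything else.
    """
    label = regex_fn(log_message)
    if label is not None:
        return 'regex', label
    if source == 'LegacyCRM':
        return 'llm', llm_fn(log_message)
    return 'bert', bert_fn(log_message)


# Stub classifiers, for illustration only.
def regex_stub(msg):
    return 'User Action' if 'logged in' in msg else None


def bert_stub(msg):
    return 'Security Alert'


def llm_stub(msg):
    return 'Workflow Error'


r1 = route_log('ModernHR', 'User User1 logged in.', regex_stub, bert_stub, llm_stub)
r2 = route_log('LegacyCRM', 'Escalation rule failed', regex_stub, bert_stub, llm_stub)
r3 = route_log('ModernCRM', 'Unauthorized access attempt', regex_stub, bert_stub, llm_stub)
print(r1, r2, r3)
```

Passing the three classifiers as arguments keeps the routing logic testable without loading any model.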

In [14]:
df_non_regex = df[df['regex_label'].isnull()].copy()
df_non_regex.head()

Unnamed: 0,timestamp,source,log_message,target_label,clusters,regex_label
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status,0,
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error,1,
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert,2,
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status,0,
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status,0,


### Identify low-frequency labels for LLM Classification

In [15]:
# Labels with freq <= 5 (from source LegacyCRM) have too few samples to train on
# and are better handled by LLM classification
label_cnts = df_non_regex['target_label'].value_counts()
print(label_cnts[label_cnts <= 5].index.tolist())



## 6. BERT Encoding
- Filter out data from the 'LegacyCRM' source (reserved for LLM classification)
- Encode the log messages from the filtered data

In [16]:
# Exclude LegacyCRM rows; the remaining rows are encoded for the classifiers
df_non_legacy = df_non_regex[df_non_regex['source'] != 'LegacyCRM']
print(df_non_legacy.source.unique())

['ModernCRM' 'AnalyticsEngine' 'ModernHR' 'BillingSystem' 'ThirdPartyAPI']


In [17]:
filt_embeds = model.encode(df_non_legacy['log_message'].tolist())
filt_embeds[:5]

array([[-0.10293962,  0.03354594, -0.02202607, ...,  0.00457793,
        -0.04259717,  0.00322621],
       [ 0.00804572, -0.03573923,  0.04938739, ...,  0.01538319,
        -0.06230947, -0.02774666],
       [-0.00908224,  0.13003924, -0.05275568, ...,  0.02014104,
        -0.05117098, -0.02930294],
       [-0.09751046,  0.04911299, -0.03977424, ...,  0.02477502,
        -0.03546079, -0.00018598],
       [-0.10468338,  0.05926038, -0.02488499, ...,  0.02502055,
        -0.037193  , -0.0256891 ]], dtype=float32)

## 7. Train and Evaluate Models
- Random Forest Classification
- Naive Bayes Classification
- Logistic Regression

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Features and target
X = filt_embeds
y = df_non_legacy['target_label']

# Stratified train-test split to maintain label distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Initialize Random Forest with class_weight='balanced' to handle imbalance
rf_model = RandomForestClassifier(
    n_estimators=100,
    criterion='entropy',
    class_weight='balanced',
    random_state=42
)

rf_model.fit(X_train, y_train)

y_preds = rf_model.predict(X_test)

print("Classification Report -- Random Forest:\n")
print(classification_report(y_test, y_preds))

Classification Report -- Random Forest:

                precision    recall  f1-score   support

Critical Error       1.00      0.98      0.99        48
         Error       1.00      1.00      1.00        53
   HTTP Status       1.00      1.00      1.00       305
Resource Usage       1.00      1.00      1.00        53
Security Alert       0.99      1.00      1.00       112

      accuracy                           1.00       571
     macro avg       1.00      1.00      1.00       571
  weighted avg       1.00      1.00      1.00       571
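
The `stratify=y` argument is what keeps the test-set class proportions in line with the full data. A minimal sketch on synthetic labels (70 and 30 samples of two classes; the feature values are placeholders) shows the effect with the same `test_size` and `random_state` as above:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and an imbalanced two-class label vector (70 / 30).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(['HTTP Status'] * 70 + ['Error'] * 30)

# Stratified 70/30 split: each class contributes 30% of its samples to the test set.
_, _, _, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

test_counts = Counter(y_test_toy.tolist())
print(test_counts)
```

Without `stratify`, the class counts in the test set would vary with the random seed, which matters most for the smaller classes.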



In [19]:
from sklearn.naive_bayes import GaussianNB

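# Same split as in the Random Forest cell (identical random_state and stratification),
# repeated so that this cell can run on its own
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)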

gnb_model = GaussianNB()

gnb_model.fit(X_train, y_train)

y_preds = gnb_model.predict(X_test)

print("Classification Report -- Naive-Bayes:\n")
print(classification_report(y_test, y_preds))

Classification Report -- Naive-Bayes:

                precision    recall  f1-score   support

Critical Error       0.94      0.98      0.96        48
         Error       0.89      0.94      0.92        53
   HTTP Status       1.00      0.98      0.99       305
Resource Usage       1.00      1.00      1.00        53
Security Alert       0.99      1.00      1.00       112

      accuracy                           0.98       571
     macro avg       0.96      0.98      0.97       571
  weighted avg       0.98      0.98      0.98       571



In [20]:
from sklearn.linear_model import LogisticRegression

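# Same split as in the Random Forest cell (identical random_state and stratification),
# repeated so that this cell can run on its own
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)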

lr_model = LogisticRegression(
    solver='lbfgs',
    max_iter=500,
    class_weight='balanced',
    random_state=42
)

lr_model.fit(X_train, y_train)

y_preds = lr_model.predict(X_test)

print("Classification Report -- Logistic Regression:\n")
print(classification_report(y_test, y_preds))

Classification Report -- Logistic Regression:

                precision    recall  f1-score   support

Critical Error       1.00      0.98      0.99        48
         Error       1.00      1.00      1.00        53
   HTTP Status       1.00      1.00      1.00       305
Resource Usage       1.00      1.00      1.00        53
Security Alert       0.99      1.00      1.00       112

      accuracy                           1.00       571
     macro avg       1.00      1.00      1.00       571
  weighted avg       1.00      1.00      1.00       571
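
All three reports come from a single train-test split, so scores this close to 1.00 are worth confirming across several splits. A minimal sketch of stratified 5-fold cross-validation follows; it runs on synthetic data from `make_classification` as a stand-in for the embeddings, and the choice of macro F1 as the metric is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (filt_embeds, target_label).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=42
)

# Stratified folds keep class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=500, class_weight='balanced')

# One macro-F1 score per fold.
scores = cross_val_score(clf, X_syn, y_syn, cv=cv, scoring='f1_macro')
print(scores.mean(), scores.std())
```

In the notebook, `X_syn` and `y_syn` would be replaced by `X` and `y`, and the same loop could be repeated for each of the three models.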



## 8. Save trained models

In [21]:
import joblib

joblib.dump(rf_model, '../clf_models/random_forest.joblib')
joblib.dump(gnb_model, '../clf_models/naive_bayes.joblib')
joblib.dump(lr_model, '../clf_models/logistic_regression.joblib')

['../clf_models/logistic_regression.joblib']
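
For completeness, a minimal sketch of the round trip: saving a fitted model and loading it back for prediction. It uses a temporary directory and a small model fitted on random data (both are illustrative stand-ins), and creates the target directory first because `joblib.dump` does not create missing directories.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small illustrative model fitted on random features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = np.array(['Error', 'HTTP Status'] * 20)
clf = LogisticRegression(max_iter=500).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, 'clf_models')
    os.makedirs(model_dir, exist_ok=True)  # make sure the target directory exists
    model_path = os.path.join(model_dir, 'logistic_regression.joblib')

    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same_predictions)
```

At inference time the loaded classifier expects embeddings from the same SentenceTransformer model that was used for training.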