## Overview

This Jupyter notebook trains and evaluates a log classification model. It covers data loading, embedding generation, clustering, regex-based labeling, and model training, using `pandas`, `sentence_transformers`, and scikit-learn to classify log messages.

## Sections

### 1. Data Loading and Preprocessing

- **Load Data**: The synthetic log data is loaded from `dataset/synthetic_logs.csv` using pandas.
- **Display Data**: The first few rows of the dataframe are displayed to understand the structure of the data.
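
Beyond looking at the first rows, it can help to check how the labels are spread across sources before any modeling. The sketch below uses a small hand-made dataframe as a stand-in for `synthetic_logs.csv`; the rows are illustrative and not taken from the dataset, and only the column names `source`, `log_message`, and `target_label` come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself loads dataset/synthetic_logs.csv.
sample = pd.DataFrame(
    {
        "source": ["ModernCRM", "ModernHR", "LegacyCRM", "ModernCRM"],
        "log_message": [
            "GET /v2/servers/detail HTTP/1.1 status: 200",
            "Unauthorized access to data was attempted",
            "Email service experiencing issues with sending",
            "GET /v2/servers/detail HTTP/1.1 status: 404",
        ],
        "target_label": ["HTTP Status", "Security Alert", "Error", "HTTP Status"],
    }
)

# How many messages carry each label
label_counts = sample["target_label"].value_counts()

# Label counts broken down by source system
by_source = pd.crosstab(sample["source"], sample["target_label"])

print(label_counts)
print(by_source)
```

A breakdown like `by_source` makes it easy to see whether a source contributes labels that no other source has, which matters later when one source is filtered out.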

### 2. Clustering Log Messages

- **Generate Embeddings**: The pre-trained sentence transformer model `all-MiniLM-L6-v2` from the `sentence_transformers` library is used to generate embeddings for the log messages.
- **DBSCAN Clustering**: The DBSCAN algorithm is applied to the embeddings to cluster the log messages based on their similarity.
- **Display Clusters**: The dataframe with cluster labels is displayed.
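
To illustrate how DBSCAN behaves with the cosine metric, here is a minimal sketch on toy 2-D vectors. The vectors are hypothetical stand-ins for the real sentence embeddings; the `eps`, `min_samples`, and `metric` values match the ones used in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D "embeddings" (hypothetical); real ones come from all-MiniLM-L6-v2.
vectors = np.array(
    [
        [1.0, 0.0],
        [0.99, 0.05],  # nearly the same direction as the first vector
        [0.0, 1.0],
        [0.05, 0.99],  # nearly the same direction as the third vector
        [-1.0, 0.0],   # points away from everything else
    ]
)

# Same parameters as the notebook: cosine distance threshold 0.2, min_samples=1
labels = DBSCAN(eps=0.2, min_samples=1, metric="cosine").fit_predict(vectors)

# Number of members in each cluster
sizes = {int(label): int((labels == label).sum()) for label in np.unique(labels)}

print(labels.tolist())
print(sizes)
```

With `min_samples=1` every point counts as a core point, so DBSCAN never assigns the noise label `-1`; a message with no close neighbors simply becomes a cluster of size one. Large clusters are therefore the interesting ones to inspect.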

### 3. Regex-Based Classification

- **Define Regex Patterns**: A dictionary of regex patterns and their corresponding labels is defined.
- **Classify with Regex**: A function `classify_with_regex` is implemented to classify log messages based on the defined regex patterns.
- **Apply Regex Classification**: The regex classification function is applied to the log messages, and the results are stored in a new column `regex_label`.
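
One possible variation on the regex step is sketched below: patterns are compiled once and matched against the whole message with `fullmatch`. The function name `classify_with_compiled_regex`, the constant `COMPILED_PATTERNS`, and the reduced pattern list are assumptions made for illustration, not the notebook's own code.

```python
import re

# Patterns are compiled once at import time. Dots are escaped so they match a
# literal period. This subset is illustrative, not the notebook's full list.
COMPILED_PATTERNS = [
    (re.compile(r"User User\d+ logged (in|out)\.", re.IGNORECASE), "User Action"),
    (re.compile(r"Backup (started|ended) at .*", re.IGNORECASE), "System Notification"),
    (re.compile(r"Backup completed successfully\.", re.IGNORECASE), "System Notification"),
]


def classify_with_compiled_regex(log_message):
    """Return the label of the first pattern matching the whole message, else None."""
    for pattern, label in COMPILED_PATTERNS:
        if pattern.fullmatch(log_message):
            return label
    return None


print(classify_with_compiled_regex("User User123 logged in."))
print(classify_with_compiled_regex("Email service experiencing issues with sending"))
```

`re.match` only anchors a pattern at the start of the string, so a message with unexpected trailing text can still match; `fullmatch` requires the entire message to fit the pattern, which is stricter and may or may not be what a given dataset needs.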

### 4. Filter Non-Regex Classified Logs

- **Filter Non-Regex Logs**: Logs that were not classified by the regex patterns are filtered into a new dataframe `df_non_regex`.
- **Filter Non-Legacy Logs**: Logs from the source `LegacyCRM` are excluded to create a new dataframe `df_non_legacy`.
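
The two filters can be checked on a tiny example. The dataframe below is a hypothetical stand-in with made-up rows; the column names and the `LegacyCRM` source value are the ones used in the notebook.

```python
import pandas as pd

# Hypothetical rows: one already labeled by regex, one from LegacyCRM.
sample = pd.DataFrame(
    {
        "source": ["ModernCRM", "LegacyCRM", "BillingSystem", "ModernHR"],
        "log_message": [
            "Unauthorized access to data was attempted",
            "Email service experiencing issues with sending",
            "Backup completed successfully.",
            "Email service affected by failed transmission",
        ],
        "regex_label": [None, None, "System Notification", None],
    }
)

# Step 1: keep only messages that no regex pattern labeled
df_non_regex = sample[sample["regex_label"].isna()].copy()

# Step 2: drop the LegacyCRM source from what remains
df_non_legacy = df_non_regex[df_non_regex["source"] != "LegacyCRM"].copy()

print(len(sample), len(df_non_regex), len(df_non_legacy))
```

Taking a `.copy()` after each filter avoids pandas chained-assignment warnings if columns are added to the filtered frames later.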

### 5. Feature Extraction and Model Training

- **Generate Embeddings**: Embeddings for the non-legacy log messages are generated using the sentence transformer model.
- **Train-Test Split**: The data is split into training (70%) and test (30%) sets with a fixed random seed (`random_state=42`).
- **Train Logistic Regression Model**: A logistic regression model is trained on the training set.
- **Evaluate Model**: The model is evaluated on the test set, and a classification report is printed.
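
The class supports in the classification report range from 47 to 304, so the labels are imbalanced. As a variation that the notebook does not use, `train_test_split` accepts a `stratify` argument that keeps label proportions equal in both splits. The sketch below runs on synthetic vectors standing in for sentence embeddings; the data and the two-class setup are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sentence embeddings (hypothetical data, 8 dimensions).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = np.array(["HTTP Status"] * 160 + ["Error"] * 40)
X[160:] += 2.0  # shift the minority class so the two classes are separable

# stratify=y keeps the 80/20 label ratio in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

print("Error share in test set:", (y_test == "Error").mean())
print("accuracy:", accuracy)
```

Without stratification, a rare label can end up under-represented in the test set purely by chance, which makes its per-class precision and recall less reliable.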

### 6. Save the Model

- **Save Model**: The trained logistic regression model is saved to disk using the `joblib` library.
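
A saved model can be verified with a quick round trip. The sketch below trains a small classifier on synthetic data (an assumption, standing in for the real embeddings), writes it to a temporary directory, and loads it back with `joblib.load`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (hypothetical) so the example runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = np.array(["Error"] * 20 + ["HTTP Status"] * 20)
X[20:] += 3.0

clf = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "models")
    # joblib.dump does not create missing directories, so create it first.
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "log_message_classifier.joblib")
    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)

# The restored model should predict exactly what the original predicts.
same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
print(same_predictions)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.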

## Code Snippets

### Load Data


    import pandas as pd

    df = pd.read_csv('dataset/synthetic_logs.csv')
    df

Output:

| Unnamed: 0 | timestamp | source | log_message | target_label | complexity |
|---|---|---|---|---|---|
| 0 | 2025-06-27 07:20:25 | ModernCRM | nova.osapi_compute.wsgi.server [req-b9718cd8-f... | HTTP Status | bert |
| 1 | 1/14/2025 23:07 | ModernCRM | Email service experiencing issues with sending | Critical Error | bert |
| 2 | 1/17/2025 1:29 | AnalyticsEngine | Unauthorized access to data was attempted | Security Alert | bert |
| 3 | 2025-07-12 00:24:16 | ModernHR | nova.osapi_compute.wsgi.server [req-4895c258-b... | HTTP Status | bert |
| 4 | 2025-06-02 18:25:23 | BillingSystem | nova.osapi_compute.wsgi.server [req-ee8bc8ba-9... | HTTP Status | bert |
| ... | ... | ... | ... | ... | ... |
| 2405 | 2025-08-13 07:29:25 | ModernHR | nova.osapi_compute.wsgi.server [req-96c3ec98-2... | HTTP Status | bert |
| 2406 | 1/11/2025 5:32 | ModernHR | User 3844 account experienced multiple failed ... | Security Alert | bert |
| 2407 | 2025-08-03 03:07:47 | ThirdPartyAPI | nova.metadata.wsgi.server [req-b6d4a270-accb-4... | HTTP Status | bert |
| 2408 | 11/11/2025 11:52 | BillingSystem | Email service affected by failed transmission | Critical Error | bert |

### Generate Embeddings and Perform Clustering


    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import DBSCAN

    # Encode every log message into a dense sentence embedding
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(df['log_message'].tolist())

    # Group messages whose embeddings lie within cosine distance 0.2 of each other
    dbscan = DBSCAN(eps=0.2, min_samples=1, metric='cosine')
    clusters = dbscan.fit_predict(embeddings)
    df['cluster'] = clusters
    df.head()

Output:

| Unnamed: 0 | timestamp | source | log_message | target_label | complexity | cluster |
|---|---|---|---|---|---|---|
| 0 | 2025-06-27 07:20:25 | ModernCRM | nova.osapi_compute.wsgi.server [req-b9718cd8-f... | HTTP Status | bert | 0 |
| 1 | 1/14/2025 23:07 | ModernCRM | Email service experiencing issues with sending | Critical Error | bert | 1 |
| 2 | 1/17/2025 1:29 | AnalyticsEngine | Unauthorized access to data was attempted | Security Alert | bert | 2 |
| 3 | 2025-07-12 00:24:16 | ModernHR | nova.osapi_compute.wsgi.server [req-4895c258-b... | HTTP Status | bert | 0 |
| 4 | 2025-06-02 18:25:23 | BillingSystem | nova.osapi_compute.wsgi.server [req-ee8bc8ba-9... | HTTP Status | bert | 0 |

### Define and Apply Regex Patterns


    import re

    def classify_with_regex(log_message):
        """Return the label of the first matching pattern, or None."""
        # Literal periods are escaped so they do not match arbitrary characters.
        regex_patterns = {
            r"User User\d+ logged (in|out)\.": "User Action",
            r"Backup (started|ended) at .*": "System Notification",
            r"Backup completed successfully\.": "System Notification",
            r"System updated to version .*": "System Notification",
            r"File .* uploaded successfully by user .*": "System Notification",
            r"Disk cleanup completed successfully\.": "System Notification",
            r"System reboot initiated by user .*": "System Notification",
            r"Account with ID .* created by .*": "User Action"
        }
        for pattern, label in regex_patterns.items():
            # re.match anchors the pattern at the start of the message
            if re.match(pattern, log_message, re.IGNORECASE):
                return label
        return None

    df['regex_label'] = df['log_message'].apply(classify_with_regex)
    # Show the messages that no pattern matched
    df[df.regex_label.isna()]

Output (rows with no regex match, so `regex_label` is empty):

| Unnamed: 0 | timestamp | source | log_message | target_label | complexity | cluster | regex_label |
|---|---|---|---|---|---|---|---|
| 0 | 2025-06-27 07:20:25 | ModernCRM | nova.osapi_compute.wsgi.server [req-b9718cd8-f... | HTTP Status | bert | 0 | |
| 1 | 1/14/2025 23:07 | ModernCRM | Email service experiencing issues with sending | Critical Error | bert | 1 | |
| 2 | 1/17/2025 1:29 | AnalyticsEngine | Unauthorized access to data was attempted | Security Alert | bert | 2 | |
| 3 | 2025-07-12 00:24:16 | ModernHR | nova.osapi_compute.wsgi.server [req-4895c258-b... | HTTP Status | bert | 0 | |
| 4 | 2025-06-02 18:25:23 | BillingSystem | nova.osapi_compute.wsgi.server [req-ee8bc8ba-9... | HTTP Status | bert | 0 | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2405 | 2025-08-13 07:29:25 | ModernHR | nova.osapi_compute.wsgi.server [req-96c3ec98-2... | HTTP Status | bert | 0 | |
| 2406 | 1/11/2025 5:32 | ModernHR | User 3844 account experienced multiple failed ... | Security Alert | bert | 7 | |
| 2407 | 2025-08-03 03:07:47 | ThirdPartyAPI | nova.metadata.wsgi.server [req-b6d4a270-accb-4... | HTTP Status | bert | 0 | |
| 2408 | 11/11/2025 11:52 | BillingSystem | Email service affected by failed transmission | Critical Error | bert | 1 | |

### Filter Non-Regex Classified Logs


    df_non_regex = df[df['regex_label'].isnull()].copy()
    df_non_legacy = df_non_regex[df_non_regex.source != 'LegacyCRM']

### Train Logistic Regression Model


    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Embed only the logs that regex did not label and that are not from LegacyCRM
    filtered_embeddings = model.encode(df_non_legacy['log_message'].tolist())
    X = filtered_embeddings
    y = df_non_legacy['target_label']

    # Hold out 30% of the data for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    report = classification_report(y_test, y_pred)
    print(report)

Output:

                    precision    recall  f1-score   support

    Critical Error       0.91      1.00      0.95        48
             Error       0.98      0.89      0.93        47
       HTTP Status       1.00      1.00      1.00       304
    Resource Usage       1.00      1.00      1.00        49
    Security Alert       1.00      0.99      1.00       123

          accuracy                           0.99       571
         macro avg       0.98      0.98      0.98       571
      weighted avg       0.99      0.99      0.99       571

### Save the Model


    import joblib

    joblib.dump(clf, 'models/log_message_classifier.joblib')

Output:

    ['models/log_message_classifier.joblib']

## Conclusion

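This notebook walks through training and evaluating a log classification model that combines regex-based classification with machine learning. Regex patterns label fixed-format messages, and a logistic regression classifier trained on sentence embeddings handles the remaining logs, excluding those from `LegacyCRM`. The classifier reaches 0.99 accuracy on the 571-message test set and is saved with `joblib` for future use in classifying log messages.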
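
As a rough illustration of how the two stages might be combined when classifying a new message, the sketch below tries the regex rules first and falls back to a classifier. Everything except the `classify_with_regex` name is hypothetical: `KeywordClassifier` and `fake_embed` are stand-ins for the saved logistic regression model and the sentence transformer so that the example runs without downloading a model, and `classify_log` is an assumed helper, not part of the notebook.

```python
import re


def classify_with_regex(log_message):
    """Single illustrative pattern; the notebook defines the full set."""
    if re.match(r"Backup completed successfully\.", log_message, re.IGNORECASE):
        return "System Notification"
    return None


class KeywordClassifier:
    """Hypothetical stand-in exposing predict() like a scikit-learn estimator."""

    def predict(self, rows):
        return ["Security Alert" if row[0] > 0 else "Error" for row in rows]


def fake_embed(messages):
    """Hypothetical stand-in for SentenceTransformer.encode: one feature per message."""
    return [[1.0 if "unauthorized" in message.lower() else 0.0] for message in messages]


def classify_log(log_message, embed, clf):
    """Try the regex rules first; fall back to the classifier on embeddings."""
    label = classify_with_regex(log_message)
    if label is not None:
        return label
    return clf.predict(embed([log_message]))[0]


clf = KeywordClassifier()
print(classify_log("Backup completed successfully.", fake_embed, clf))
print(classify_log("Unauthorized access to data was attempted", fake_embed, clf))
```

In a real setup, `embed` would be the `encode` method of the sentence transformer and `clf` the model loaded from `models/log_message_classifier.joblib`.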