# Log System Dataset Overview

This dataset contains synthetic log entries from an imaginary enterprise system. Each record represents a single log message generated by one of several sources within an IT infrastructure.

## Features

- **timestamp**: The date and time when the log entry was generated.
- **source**: The originating system or application (e.g., ModernCRM, AnalyticsEngine, BillingSystem).
- **log_message**: The textual content of the log, describing events, errors, or alerts.
- **target_label**: The manually assigned category or label for the log (e.g., HTTP Status, Critical Error, Security Alert).

## Purpose

The dataset is intended for tasks such as:
- Log classification and anomaly detection
- Clustering and grouping similar log messages
- Building and evaluating NLP models for IT operations

## Size and Structure

- **Total records**: 2,410 log entries
- **Data types**: Combination of categorical, textual, and integer fields

## Loading Data and Exploration

In this section, we load the synthetic log dataset into a pandas DataFrame and perform initial exploration to understand its structure and contents. We inspect the first few records, examine unique values in key columns, and review the distribution of log sources and target labels. This provides a foundation for further analysis and modeling.

In [21]:
import pandas as pd 

df = pd.read_csv(r'C:\Users\nguye\Documents\AI\Natural Language Processing\Log-Classification-System-Using-Hybrid-Engine\data\raw\synthetic_logs.csv')

In [22]:
df.head()

Unnamed: 0,timestamp,source,log_message,target_label
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status
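
The preview shows two timestamp layouts in the same column: `2025-06-27 07:20:25` and `1/14/2025 23:07`. A minimal sketch of normalizing both follows, assuming these are the only two layouts present and that the slash form is month-first (consistent with `1/14/2025`). The helper name `parse_log_timestamp` is hypothetical and not part of the notebook.

```python
from datetime import datetime

# The two layouts visible in df.head(); assumed (not verified) to be the only ones.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M")

def parse_log_timestamp(raw):
    """Try each known layout in turn; return None if nothing matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None

samples = ["2025-06-27 07:20:25", "1/14/2025 23:07", "1/17/2025 1:29"]
parsed = [parse_log_timestamp(s) for s in samples]
```

Returning `None` for unknown layouts keeps unparsed rows visible so they can be inspected instead of silently dropped.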


In [29]:
df.source.unique()

array(['ModernCRM', 'AnalyticsEngine', 'ModernHR', 'BillingSystem',
       'ThirdPartyAPI', 'LegacyCRM'], dtype=object)

In [30]:
df.target_label.unique()

array(['HTTP Status', 'Critical Error', 'Security Alert', 'Error',
       'System Notification', 'Resource Usage', 'User Action',

In [33]:
df.target_label.count()

np.int64(2410)

In [35]:
from sentence_transformers import SentenceTransformer
import numpy as np 
from sklearn.cluster import DBSCAN

model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(df['log_message'].tolist())
embeddings[:2]



array([[-1.02939621e-01,  3.35459411e-02, -2.20260732e-02,
         1.55101740e-03, -9.86917876e-03, -1.78956270e-01,
        -6.34409785e-02, -6.01761639e-02,  2.81109158e-02,
         5.99620491e-02, -1.72618348e-02,  1.43363548e-03,
        -1.49560034e-01,  3.15287686e-03, -5.66030927e-02,
         2.71685235e-02, -1.49891041e-02, -3.54037657e-02,
        -3.62936445e-02, -1.45410765e-02, -5.61491773e-03,
         8.75539035e-02,  4.55120578e-02,  2.50963885e-02,
         1.00187510e-02,  1.24267349e-02, -1.39923573e-01,
         7.68696293e-02,  3.14095505e-02, -4.15247958e-03,
         4.36902344e-02,  1.71250012e-02, -8.00951198e-02,
         5.74006326e-02,  1.89091656e-02,  8.55262503e-02,
         3.96398641e-02, -1.34371817e-01, -1.44360063e-03,
         3.06704035e-03,  1.76854044e-01,  4.44885530e-03,
        -1.69274509e-02,  2.24266481e-02, -4.35049310e-02,
         6.09034160e-03, -9.98169929e-03, -6.23972900e-02,
         1.07372422e-02, -6.04895083e-03, -7.14660808e-0

In [40]:
dbscan = DBSCAN(eps=0.2, min_samples=1, metric='cosine')
clusters = dbscan.fit_predict(embeddings)

df['cluster'] = clusters
df.head()

Unnamed: 0,timestamp,source,log_message,target_label,cluster
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status,0
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error,1
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert,2
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status,0
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status,0
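
With `min_samples=1`, every point counts as a core point, so DBSCAN assigns no noise label (`-1`) and the clusters reduce to connected components of the graph that links any two points whose cosine distance is at most `eps`. The sketch below illustrates that equivalence in plain Python on made-up 2-D vectors. It is not the scikit-learn implementation, and `eps_components` is a hypothetical helper name.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the distance used by metric='cosine'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def eps_components(vectors, eps):
    """Label points by connected component of the 'distance <= eps' graph."""
    labels = [-1] * len(vectors)   # -1 means "not visited yet" here
    next_label = 0
    for start in range(len(vectors)):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:               # depth-first expansion of the component
            i = stack.pop()
            for j in range(len(vectors)):
                if labels[j] == -1 and cosine_distance(vectors[i], vectors[j]) <= eps:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Toy vectors (illustrative only): two near-parallel pairs and one outlier.
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]
toy_labels = eps_components(toy_vectors, eps=0.2)
```

The isolated vector ends up in a cluster of its own, which is why the later cell filters on cluster size before printing samples.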


In [41]:
df[df.cluster==1]

Unnamed: 0,timestamp,source,log_message,target_label,cluster
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error,1
10,8/9/2025 18:58,ModernCRM,Email server encountered a sending fault,Error,1
217,1/22/2025 5:45,BillingSystem,Mail service encountered a delivery glitch,Error,1
248,5/2/2025 23:04,ModernHR,Service disruption caused by email sending error,Critical Error,1
265,3/30/2025 23:53,ModernCRM,Email system had a problem sending emails,Error,1
361,11/19/2025 23:06,BillingSystem,Email service experienced a sending issue,Error,1
450,10/27/2025 5:59,ThirdPartyAPI,Email delivery system encountered an error,Error,1
477,12/2/2025 10:30,AnalyticsEngine,Email transmission error caused service impact,Critical Error,1
570,11/7/2025 18:08,ThirdPartyAPI,Email service impacted by sending failure,Critical Error,1
678,4/28/2025 15:13,AnalyticsEngine,Email delivery problem affected system,Critical Error,1


In [42]:
# Count records per cluster and sort descending
cluster_counts = df['cluster'].value_counts().sort_values(ascending=False)

# Iterate over clusters with more than 10 records and print 5 log messages from each
for cluster_id, count in cluster_counts.items():
    if count > 10:
        print(f"Cluster {cluster_id} (size: {count}):")
        sample_logs = df[df['cluster'] == cluster_id]['log_message'].head(5)
        for log in sample_logs:
            print(f"  - {log}")
        print()

Cluster 0 (size: 1017):
  - nova.osapi_compute.wsgi.server [req-b9718cd8-f65e-49cc-8349-6cf7122af137 113d3a99c3da401fbd62cc2caa5b96d2 54fadb412c4e40cdbaed9335e4c35a9e - - -] 10.11.10.1 "GET /v2/54fadb412c4e40cdbaed9335e4c35a9e/servers/detail HTTP/1.1" status: 200 len: 1893 time: 0.2675118
  - nova.osapi_compute.wsgi.server [req-4895c258-b2f8-488f-a2a3-4fae63982e48 113d3a99c3da401fbd62cc2caa5b96d2 54fadb412c4e40cdbaed9335e4c35a9e - - -] 10.11.10.1 "GET /v2/54fadb412c4e40cdbaed9335e4c35a9e/servers/detail HTTP/1.1" HTTP status code -  200 len: 211 time: 0.0968180
  - nova.osapi_compute.wsgi.server [req-ee8bc8ba-9265-4280-9215-dbe000a41209 113d3a99c3da401fbd62cc2caa5b96d2 54fadb412c4e40cdbaed9335e4c35a9e - - -] 10.11.10.1 "GET /v2/54fadb412c4e40cdbaed9335e4c35a9e/servers/detail HTTP/1.1" RCODE  200 len: 1874 time: 0.2280791
  - nova.osapi_compute.wsgi.server [req-f0bffbc3-5ab0-4916-91c1-0a61dd7d4ec2 113d3a99c3da401fbd62cc2caa5b96d2 54fadb412c4e40cdbaed9335e4c35a9e - - -] 10.11.10.1 "GET /v

### 1. Classification with Regex 

In [45]:
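import re

def classify_with_regex(log_message):
    """Return the label of the first matching pattern, or None if nothing matches."""
    # Patterns are tried in insertion order; matching is case-insensitive.
    regex_patterns = {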
        r"User User\d+ logged (in|out).": "User Action",
        r"Backup (started|ended) at .*": "System Notification",
        r"Backup completed successfully.": "System Notification",
        r"System updated to version .*": "System Notification",
        r"File .* uploaded successfully by user .*": "System Notification",
        r"Disk cleanup completed successfully.": "System Notification",
        r"System reboot initiated by user .*": "System Notification",
        r"Account with ID .* created by .*": "User Action"
    }
    for pattern, label in regex_patterns.items():
        if re.search(pattern, log_message, re.IGNORECASE):
            return label
    return None

In [49]:
classify_with_regex("User User123 logged out.")

'User Action'
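
One detail of the patterns above: the trailing `.` is unescaped, so it matches any single character, not only a literal full stop. The sketch below contrasts the notebook's pattern with a stricter variant. `STRICT` is an illustrative alternative that the notebook does not use, and switching to it could change how many rows the regex stage matches.

```python
import re

LOOSE = r"User User\d+ logged (in|out)."     # pattern as written in the notebook
STRICT = r"User User\d+ logged (in|out)\."   # hypothetical variant: literal full stop

message_with_dot = "User User123 logged out."
message_with_bang = "User User123 logged out!"

loose_hits = [bool(re.search(LOOSE, m, re.IGNORECASE))
              for m in (message_with_dot, message_with_bang)]
strict_hits = [bool(re.search(STRICT, m, re.IGNORECASE))
               for m in (message_with_dot, message_with_bang)]
```

Here the loose pattern accepts both messages, while the strict one rejects the message ending in `!`.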

In [50]:
df['regex_label'] = df['log_message'].apply(classify_with_regex)

In [53]:
df[df.regex_label.isna()]

Unnamed: 0,timestamp,source,log_message,target_label,cluster,regex_label
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status,0,
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error,1,
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert,2,
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status,0,
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status,0,
...,...,...,...,...,...,...
2405,2025-08-13 07:29:25,ModernHR,nova.osapi_compute.wsgi.server [req-96c3ec98-2...,HTTP Status,0,
2406,1/11/2025 5:32,ModernHR,User 3844 account experienced multiple failed ...,Security Alert,7,
2407,2025-08-03 03:07:47,ThirdPartyAPI,nova.metadata.wsgi.server [req-b6d4a270-accb-4...,HTTP Status,0,
2408,11/11/2025 11:52,BillingSystem,Email service affected by failed transmission,Critical Error,1,


In [55]:
df[df.regex_label.notnull()].shape

(500, 6)

### 2. Classification with LLM

In [56]:
df_non_regex = df[df.regex_label.isnull()].copy()
df_non_regex.shape

(1910, 6)
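
The two shapes give the split between the regex stage and everything that falls through to later stages. A quick arithmetic check, using only the counts reported above:

```python
total_records = 2410     # dataset size reported in the overview
regex_matched = 500      # rows where regex_label is not null
regex_unmatched = 1910   # rows passed on to the next stage

# The two subsets should partition the dataset.
assert regex_matched + regex_unmatched == total_records

regex_coverage = regex_matched / total_records
print(f"Regex coverage: {regex_coverage:.1%}")   # prints "Regex coverage: 20.7%"
```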

In [57]:
df_non_regex 

Unnamed: 0,timestamp,source,log_message,target_label,cluster,regex_label
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status,0,
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error,1,
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert,2,
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status,0,
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status,0,
...,...,...,...,...,...,...
2405,2025-08-13 07:29:25,ModernHR,nova.osapi_compute.wsgi.server [req-96c3ec98-2...,HTTP Status,0,
2406,1/11/2025 5:32,ModernHR,User 3844 account experienced multiple failed ...,Security Alert,7,
2407,2025-08-03 03:07:47,ThirdPartyAPI,nova.metadata.wsgi.server [req-b6d4a270-accb-4...,HTTP Status,0,
2408,11/11/2025 11:52,BillingSystem,Email service affected by failed transmission,Critical Error,1,


## Handling Rare Labels

  
Before proceeding with model training and evaluation, it is important to address rare labels in the dataset. Rare labels are those that appear very infrequently (5 or fewer samples in this case). We filter them out for the following reason:

- **Limited training data**: With so few examples, a BERT-based model cannot learn meaningful patterns for these labels.

In this dataset, we identified 2 rare labels:
- Workflow Error (4 samples)
- Deprecation Warning (3 samples)

These will be excluded from training the BERT-based classifier to keep that model robust.
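
The threshold rule can be sketched with a plain counter. The counts for the two rare labels come from the text above. The other two counts are made-up placeholders chosen only to show the boundary behaviour, and `RARE_THRESHOLD` is a hypothetical constant name.

```python
from collections import Counter

RARE_THRESHOLD = 5  # labels with this many samples or fewer are treated as rare

label_counts = Counter({
    "Workflow Error": 4,        # from the text
    "Deprecation Warning": 3,   # from the text
    "Error": 6,                 # placeholder: just above the threshold
    "HTTP Status": 100,         # placeholder
})

rare_labels = sorted(label for label, n in label_counts.items() if n <= RARE_THRESHOLD)
kept_labels = sorted(label for label, n in label_counts.items() if n > RARE_THRESHOLD)
```

Because the comparison is `<=`, a label with exactly 5 samples would also be treated as rare.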

In [62]:
# Count samples per label, then list the labels with 5 or fewer samples
label_counts = df_non_regex['target_label'].value_counts()
rare_labels = label_counts[label_counts <= 5].index
print(df_non_regex[df_non_regex['target_label'].isin(rare_labels)]['target_label'].unique())



#### We use BERT embeddings to train a classifier for the remaining target labels, and more powerful LLMs (DeepSeek R1, Llama 3.2, etc.) for 'Workflow Error' and 'Deprecation Warning'

In [64]:
df_non_legacy = df_non_regex[df_non_regex.source != 'LegacyCRM']
df_non_legacy.source.unique()

array(['ModernCRM', 'AnalyticsEngine', 'ModernHR', 'BillingSystem',
       'ThirdPartyAPI'], dtype=object)

#### BERT Embeddings

In [65]:
filtered_embeddings = model.encode(df_non_legacy['log_message'].tolist())
filtered_embeddings[:2]

array([[-1.02939621e-01,  3.35459411e-02, -2.20260732e-02,
         1.55101740e-03, -9.86917876e-03, -1.78956270e-01,
        -6.34409785e-02, -6.01761639e-02,  2.81109158e-02,
         5.99620491e-02, -1.72618348e-02,  1.43363548e-03,
        -1.49560034e-01,  3.15287686e-03, -5.66030927e-02,
         2.71685235e-02, -1.49891041e-02, -3.54037657e-02,
        -3.62936445e-02, -1.45410765e-02, -5.61491773e-03,
         8.75539035e-02,  4.55120578e-02,  2.50963885e-02,
         1.00187510e-02,  1.24267349e-02, -1.39923573e-01,
         7.68696293e-02,  3.14095505e-02, -4.15247958e-03,
         4.36902344e-02,  1.71250012e-02, -8.00951198e-02,
         5.74006326e-02,  1.89091656e-02,  8.55262503e-02,
         3.96398641e-02, -1.34371817e-01, -1.44360063e-03,
         3.06704035e-03,  1.76854044e-01,  4.44885530e-03,
        -1.69274509e-02,  2.24266481e-02, -4.35049310e-02,
         6.09034160e-03, -9.98169929e-03, -6.23972900e-02,
         1.07372422e-02, -6.04895083e-03, -7.14660808e-0

In [66]:
X = filtered_embeddings
y = df_non_legacy['target_label'].values

### Machine Learning model comparisons

In [69]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score, f1_score
import pandas as pd
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}

# Logistic Regression with GridSearchCV
lr_params = {'C': [0.1, 1, 10]}
lr = LogisticRegression(max_iter=1000)
lr_grid = GridSearchCV(lr, lr_params, cv=cv, scoring='f1_macro', n_jobs=-1)
lr_grid.fit(X_train, y_train)
lr_best = lr_grid.best_estimator_
lr_cv_scores = cross_val_score(lr_best, X_train, y_train, cv=cv, scoring='f1_macro')
y_pred_lr = lr_best.predict(X_test)
results['Logistic Regression'] = {
    'best_params': lr_grid.best_params_,
    'cv_f1_macro_mean': np.mean(lr_cv_scores),
    'cv_f1_macro_std': np.std(lr_cv_scores),
    'test_accuracy': accuracy_score(y_test, y_pred_lr),
    'test_macro_f1': f1_score(y_test, y_pred_lr, average='macro'),
    'report': classification_report(y_test, y_pred_lr, output_dict=True)
}

# Random Forest with GridSearchCV
rf_params = {'n_estimators': [50, 100], 'max_depth': [None, 10, 20]}
rf = RandomForestClassifier(random_state=42)
rf_grid = GridSearchCV(rf, rf_params, cv=cv, scoring='f1_macro', n_jobs=-1)
rf_grid.fit(X_train, y_train)
rf_best = rf_grid.best_estimator_
rf_cv_scores = cross_val_score(rf_best, X_train, y_train, cv=cv, scoring='f1_macro')
y_pred_rf = rf_best.predict(X_test)
results['Random Forest'] = {
    'best_params': rf_grid.best_params_,
    'cv_f1_macro_mean': np.mean(rf_cv_scores),
    'cv_f1_macro_std': np.std(rf_cv_scores),
    'test_accuracy': accuracy_score(y_test, y_pred_rf),
    'test_macro_f1': f1_score(y_test, y_pred_rf, average='macro'),
    'report': classification_report(y_test, y_pred_rf, output_dict=True)
}

# Linear SVM with GridSearchCV
svc_params = {'C': [0.1, 1, 10]}
svc = LinearSVC(max_iter=2000)
svc_grid = GridSearchCV(svc, svc_params, cv=cv, scoring='f1_macro', n_jobs=-1)
svc_grid.fit(X_train, y_train)
svc_best = svc_grid.best_estimator_
svc_cv_scores = cross_val_score(svc_best, X_train, y_train, cv=cv, scoring='f1_macro')
y_pred_svc = svc_best.predict(X_test)
results['Linear SVM'] = {
    'best_params': svc_grid.best_params_,
    'cv_f1_macro_mean': np.mean(svc_cv_scores),
    'cv_f1_macro_std': np.std(svc_cv_scores),
    'test_accuracy': accuracy_score(y_test, y_pred_svc),
    'test_macro_f1': f1_score(y_test, y_pred_svc, average='macro'),
    'report': classification_report(y_test, y_pred_svc, output_dict=True)
}

# Detailed comparison DataFrame
comparison_df = pd.DataFrame({
    model: {
        'Best Params': res['best_params'],
        'CV Macro F1 Mean': res['cv_f1_macro_mean'],
        'CV Macro F1 Std': res['cv_f1_macro_std'],
        'Test Accuracy': res['test_accuracy'],
        'Test Macro F1': res['test_macro_f1']
    }
    for model, res in results.items()
}).T

print(comparison_df)

                                                 Best Params CV Macro F1 Mean  \
Logistic Regression                                {'C': 10}         0.989405   
Random Forest        {'max_depth': None, 'n_estimators': 50}         0.985887   
Linear SVM                                         {'C': 10}         0.994395   

                    CV Macro F1 Std Test Accuracy Test Macro F1  
Logistic Regression        0.007896           1.0           1.0  
Random Forest              0.004234      0.997375      0.994086  
Linear SVM                 0.006884           1.0           1.0  


In [70]:
# Find the model with the highest Test Macro F1 score
best_model = comparison_df['Test Macro F1'].astype(float).idxmax()
print("Best performing model:")
print(comparison_df.loc[best_model])

Best performing model:
Best Params         {'C': 10}
CV Macro F1 Mean     0.989405
CV Macro F1 Std      0.007896
Test Accuracy             1.0
Test Macro F1             1.0
Name: Logistic Regression, dtype: object
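
Note that Logistic Regression and Linear SVM tie at a test macro F1 of 1.0 in the comparison table. `idxmax` returns the first row holding the maximum, which is why Logistic Regression is reported. The sketch below reproduces that behaviour in plain Python with the table's numbers and shows one possible tie-breaker (CV macro F1). The tie-breaker is an assumption for illustration, not the rule the notebook applies.

```python
# (test macro F1, CV macro F1 mean) copied from the comparison table
scores = {
    "Logistic Regression": (1.0, 0.989405),
    "Random Forest": (0.994086, 0.985887),
    "Linear SVM": (1.0, 0.994395),
}

# max() keeps the first maximal item, like idxmax on the table's row order.
by_test_only = max(scores, key=lambda name: scores[name][0])

# Hypothetical tie-breaker: compare test macro F1 first, then CV macro F1.
by_test_then_cv = max(scores, key=lambda name: scores[name])
```

With these numbers the first rule selects Logistic Regression and the tie-broken rule selects Linear SVM, so the choice between the two depends on which criterion is preferred.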


### So we choose Logistic Regression

In [71]:
# Train Logistic Regression with the best hyperparameters found
# Note: this re-splits the data 70/30 and, unlike the comparison above, does not stratify
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
final_lr = LogisticRegression(C=lr_grid.best_params_['C'], max_iter=1000)
final_lr.fit(X_train, y_train)
y_pred = final_lr.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

                precision    recall  f1-score   support

Critical Error       0.92      1.00      0.96        48
         Error       1.00      0.91      0.96        47
   HTTP Status       1.00      1.00      1.00       304
Resource Usage       1.00      1.00      1.00        49
Security Alert       1.00      1.00      1.00       123

      accuracy                           0.99       571
     macro avg       0.98      0.98      0.98       571
  weighted avg       0.99      0.99      0.99       571
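
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows: the macro average is the unweighted mean of the per-class F1 scores, while the weighted average weights each class by its support. A sketch using the rounded values printed in the report, so the results agree only to two decimals:

```python
# (f1-score, support) per class, copied from the report above
per_class = {
    "Critical Error": (0.96, 48),
    "Error": (0.96, 47),
    "HTTP Status": (1.00, 304),
    "Resource Usage": (1.00, 49),
    "Security Alert": (1.00, 123),
}

total_support = sum(support for _, support in per_class.values())
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support
```

The macro figure is lower than the weighted one because the two smaller error classes count as much as the much larger `HTTP Status` class.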



In [74]:
import joblib
joblib.dump(final_lr, r'C:\Users\nguye\Documents\AI\Natural Language Processing\Log-Classification-System-Using-Hybrid-Engine\models\logistic_classifier.joblib')

['C:\\Users\\nguye\\Documents\\AI\\Natural Language Processing\\Log-Classification-System-Using-Hybrid-Engine\\models\\logistic_classifier.joblib']
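
Putting the stages together: the notebook labels regex-matched messages directly, sends the remaining non-LegacyCRM messages to the embedding-based Logistic Regression model, and reserves an LLM for the labels seen only rarely. A minimal routing sketch follows. `classify_log`, `embedding_classifier`, and `llm_classifier` are hypothetical names, and the two classifiers are stubs standing in for the saved model and an LLM call. Only one regex pattern from the notebook is included for brevity.

```python
import re

REGEX_PATTERNS = {
    r"User User\d+ logged (in|out).": "User Action",  # one pattern from the notebook
}

def classify_with_regex(log_message):
    """Return the label of the first matching pattern, or None."""
    for pattern, label in REGEX_PATTERNS.items():
        if re.search(pattern, log_message, re.IGNORECASE):
            return label
    return None

def embedding_classifier(log_message):
    """Stub standing in for encode() followed by the saved model's predict()."""
    return "stub:embedding"

def llm_classifier(log_message):
    """Stub standing in for a call to an LLM."""
    return "stub:llm"

def classify_log(source, log_message):
    """Route one log through regex first, then the source-specific fallback."""
    label = classify_with_regex(log_message)
    if label is not None:
        return label
    if source == "LegacyCRM":
        return llm_classifier(log_message)
    return embedding_classifier(log_message)
```

Trying the regex stage first keeps the cheapest check in front, so the model and the LLM only see messages that the patterns cannot label.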