In [1]:
import pandas as pd

df = pd.read_csv('dataset/synthetic_logs.csv')
df

Unnamed: 0,timestamp,source,log_message,target_label,complexity
0,2025-06-27 07:20:25,ModernCRM,nova.osapi_compute.wsgi.server [req-b9718cd8-f...,HTTP Status,bert
1,1/14/2025 23:07,ModernCRM,Email service experiencing issues with sending,Critical Error,bert
2,1/17/2025 1:29,AnalyticsEngine,Unauthorized access to data was attempted,Security Alert,bert
3,2025-07-12 00:24:16,ModernHR,nova.osapi_compute.wsgi.server [req-4895c258-b...,HTTP Status,bert
4,2025-06-02 18:25:23,BillingSystem,nova.osapi_compute.wsgi.server [req-ee8bc8ba-9...,HTTP Status,bert
...,...,...,...,...,...
2405,2025-08-13 07:29:25,ModernHR,nova.osapi_compute.wsgi.server [req-96c3ec98-2...,HTTP Status,bert
2406,1/11/2025 5:32,ModernHR,User 3844 account experienced multiple failed ...,Security Alert,bert
2407,2025-08-03 03:07:47,ThirdPartyAPI,nova.metadata.wsgi.server [req-b6d4a270-accb-4...,HTTP Status,bert
2408,11/11/2025 11:52,BillingSystem,Email service affected by failed transmission,Critical Error,bert


In [2]:
df.source.unique()

array(['ModernCRM', 'AnalyticsEngine', 'ModernHR', 'BillingSystem',
       'ThirdPartyAPI', 'LegacyCRM'], dtype=object)

In [3]:
df.target_label.unique()

array(['HTTP Status', 'Critical Error', 'Security Alert', 'Error',
       'System Notification', 'Resource Usage', 'User Action',

Let's cluster the log messages with the DBSCAN algorithm, using sentence transformers to embed the text. We'll:
1. Install required packages
2. Create embeddings using sentence-transformers
3. Apply DBSCAN clustering
4. Analyze the results

In [11]:
!pip install --upgrade pip
!pip install sentence-transformers scikit-learn

Collecting pip
  Using cached pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Using cached pip-25.1.1-py3-none-any.whl (1.8 MB)



[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip
ERROR: To modify pip, please run the following command:
C:\Users\othma\PyCharmMiscProject\.venv\Scripts\python.exe -m pip install --upgrade pip







In [12]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN
import numpy as np

# Load the sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
embeddings = model.encode(df['log_message'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 76/76 [00:56<00:00,  1.34it/s]


In [13]:
# Apply DBSCAN clustering
clustering = DBSCAN(eps=0.5, min_samples=5).fit(embeddings)

# Add cluster labels to dataframe
df['cluster'] = clustering.labels_
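
DBSCAN uses Euclidean distance by default, so `eps=0.5` is a Euclidean radius in embedding space. If the embeddings are unit-length (an assumption here; it can be requested explicitly with `normalize_embeddings=True` in `model.encode`), Euclidean and cosine distance are tied by `||a - b||^2 = 2 * (1 - cos(a, b))`. The sketch below checks that identity on random unit vectors and converts the radius; the helper name `euclidean_eps_to_cosine` is hypothetical.

```python
import numpy as np


def euclidean_eps_to_cosine(eps):
    """Convert a Euclidean radius to a cosine distance.

    Only valid for unit-length vectors, where
    ||a - b||^2 = 2 * (1 - cos(a, b)).
    """
    return eps ** 2 / 2.0


# Verify the identity on two random unit vectors (toy data, not the log embeddings).
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

euclidean = np.linalg.norm(a - b)
cosine_distance = 1.0 - a @ b
identity_holds = bool(np.isclose(euclidean ** 2, 2.0 * cosine_distance))

print(identity_holds)
print(euclidean_eps_to_cosine(0.5))
```

Under that assumption, two messages are neighbours when their cosine similarity is at least 0.875.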

In [14]:
# Show the cluster distribution
# Note: np.unique also counts the noise label -1, so this figure is "clusters + 1"
print(f"Number of unique clusters: {len(np.unique(clustering.labels_))}")
print("\nCluster distribution:")
print(df['cluster'].value_counts().sort_index())

Number of unique clusters: 29

Cluster distribution:
cluster
-1     548
 0     809
 1      42
 2      53
 3      84
 4      60
 5      31
 6      15
 7      99
 8      86
 9     206
 10     10
 11     48
 12     42
 13     58
 14     14
 15     29
 16     51
 17     21
 18     15
 19      7
 20     20
 21     17
 22      7
 23      8
 24      6
 25     11
 26      6
 27      7
Name: count, dtype: int64
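
Because the label -1 marks noise rather than a cluster, it can help to report the cluster count and the noise share separately. A minimal sketch follows; `summarize_labels` is a hypothetical helper and the labels passed to it are toy values, not the notebook's results.

```python
from collections import Counter

NOISE_LABEL = -1  # label scikit-learn's DBSCAN assigns to noise points


def summarize_labels(labels):
    """Return the cluster count (noise excluded), noise count and noise share."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(NOISE_LABEL, 0)  # remove noise so it is not counted as a cluster
    total = n_noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "noise_fraction": n_noise / total if total else 0.0,
    }


# Toy labels for illustration; in the notebook, pass clustering.labels_ instead.
summary = summarize_labels([-1, -1, 0, 0, 0, 1])
print(summary)
```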


In [15]:
# Display up to three sample messages from each cluster
# (groupby iterates over the cluster labels in sorted order)
for cluster, group in df.groupby('cluster'):
    print(f"\nCluster {cluster}:")
    for msg in group['log_message'].sample(min(3, len(group))):
        print(f"- {msg}")


Cluster -1:
- Invalid data encountered during execution of module X
- Unauthorized access to data was attempted
- Multiple disk malfunctions in RAID setup identified

Cluster 0:
- nova.osapi_compute.wsgi.server [req-403846bd-dd0d-4472-a78d-6077e720094b 113d3a99c3da401fbd62cc2caa5b96d2 54fadb412c4e40cdbaed9335e4c35a9e - - -] 10.11.10.1 "GET /v2/54fadb412c4e40cdbaed9335e4c35a9e/servers/detail HTTP/1.1" Return code: 200 len: 1893 time: 0.2580168
- nova.osapi_compute.wsgi.server [req-6264d86e-7a74-4647-a528-8e476a256e94 113d3a99c3da401fbd62cc2caa5b96d2 54fadb412c4e40cdbaed9335e4c35a9e - - -] 10.11.10.1 "GET /v2/54fadb412c4e40cdbaed9335e4c35a9e/servers/detail HTTP/1.1" RCODE  200 len: 1893 time: 0.2726710
- nova.osapi_compute.wsgi.server [req-fd4a841b-8d46-4485-b466-117bf8e53a94 113d3a99c3da401fbd62cc2caa5b96d2 54fadb412c4e40cdbaed9335e4c35a9e - - -] 10.11.10.1 "GET /v2/54fadb412c4e40cdbaed9335e4c35a9e/servers/detail HTTP/1.1" HTTP status code -  200 len: 1893 time: 0.2663469

Cluster 1:
-

Note: 
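- Label -1 is not a cluster: DBSCAN assigns it to noise points (outliers) that fall in no dense region. Here that covers 548 messages, so the run found 28 clusters plus noise.
- The eps and min_samples values used here (0.5 and 5) are only a starting point. In general, a smaller eps or a larger min_samples gives tighter clusters and more noise, so tune both for your data.
- The chosen model 'all-MiniLM-L6-v2' is compact and produces 384-dimensional embeddings, which makes it a reasonable trade-off between encoding speed and embedding quality.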
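
One common heuristic for picking eps is the k-distance curve: sort every point's distance to its k-th nearest neighbour and look for the "elbow" where the curve bends sharply. The sketch below is one way to compute that curve with scikit-learn's `NearestNeighbors`; the helper name `k_distance_curve` and the toy data are assumptions for illustration, not part of the notebook above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distance_curve(X, min_samples=5):
    """Sorted distance from each point to its farthest of `min_samples` neighbours.

    Querying the training points themselves returns each point as its own
    nearest neighbour (distance 0). That matches DBSCAN's convention that
    min_samples counts the point itself.
    """
    X = np.asarray(X)
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


# Toy data: two tight blobs in 8 dimensions (not the log embeddings).
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(0.0, 0.05, size=(50, 8)),
    rng.normal(1.0, 0.05, size=(50, 8)),
])

curve = k_distance_curve(toy, min_samples=5)
# Percentiles give a rough numeric view of where the curve starts to rise.
print(np.percentile(curve, [50, 90, 99]))
```

In the notebook this would be called as `k_distance_curve(embeddings, min_samples=5)`. Plotting the returned array and choosing eps near the elbow is a starting point, not a guarantee of good clusters.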
