# System Logs Anomaly Detection using Fine-Tuned LLMs

A fine-tuned LLM that classifies system logs as 'normal' or 'anomalous'.

---

## Install Dependencies

In [97]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


---

## Load Data

In [98]:
import pandas as pd

pd.set_option('future.no_silent_downcasting', True) # Opt in to the future behavior; silences the downcasting FutureWarning raised by .replace()

logs_df = pd.read_csv("../data/logs.csv")
labels_df = pd.read_csv("../data/labels.csv")

print(f"Log entries: {len(logs_df)}")

Log entries: 104815


In [109]:
logs_df.head(1)

Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,BlockId
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,E5,Receiving block <*> src: /<*> dest: /<*>,blk_-1608999687919862906


In [108]:
labels_df.head(1)

Unnamed: 0,BlockId,Label
0,blk_-1608999687919862906,Normal


---

## Data Preprocessing

#### 1. Extract the block ID from `Content` and add it as a new `BlockId` column

In [105]:
logs_df["BlockId"] = logs_df["Content"].str.extract(r'(blk_-?\d+)')
logs_df.head(1)

Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,BlockId
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,E5,Receiving block <*> src: /<*> dest: /<*>,blk_-1608999687919862906
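
To make the pattern `blk_-?\d+` concrete, here is a minimal sketch on a tiny hand-made frame. The sample log lines (including the IP addresses) are hypothetical and not taken from the dataset; the point is only to show that the optional `-` is captured and that lines without a block ID yield `NaN`.

```python
import pandas as pd

# Hypothetical sample lines, invented for illustration only.
sample = pd.DataFrame({
    "Content": [
        "Receiving block blk_-1608999687919862906 src: /10.0.0.1:50010 dest: /10.0.0.2:50010",
        "Verification succeeded for blk_42",
        "Starting DataNode",  # no block ID in this line
    ]
})

# expand=False returns a Series because the pattern has a single capture group.
sample["BlockId"] = sample["Content"].str.extract(r"(blk_-?\d+)", expand=False)

# Count lines where no block ID was found; these would not match any label later.
missing = int(sample["BlockId"].isna().sum())
print(sample["BlockId"].tolist())
print(f"Lines without a block ID: {missing}")
```

Running the same `isna().sum()` check on `logs_df["BlockId"]` is a cheap way to see whether any log lines would be left without a label.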


#### 2. Merge each log entry with its label ('Normal' or 'Anomaly')

In [106]:
new_logs_df = pd.merge(logs_df, labels_df, on="BlockId")
new_logs_df.head(1)

Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,BlockId,Label
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,E5,Receiving block <*> src: /<*> dest: /<*>,blk_-1608999687919862906,Normal
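
`pd.merge` defaults to an inner join, so any log line whose `BlockId` has no entry in the labels table is silently dropped. The sketch below uses toy data (the frames and values are made up) to show one way to surface such rows, assuming each block has at most one label; `validate="many_to_one"` raises an error if that assumption does not hold.

```python
import pandas as pd

# Toy frames, invented for illustration only.
logs = pd.DataFrame({
    "BlockId": ["blk_1", "blk_1", "blk_2", "blk_3"],
    "Content": ["a", "b", "c", "d"],
})
labels = pd.DataFrame({
    "BlockId": ["blk_1", "blk_2"],
    "Label": ["Normal", "Anomaly"],
})

# Default inner join: the blk_3 row disappears without any warning.
inner = pd.merge(logs, labels, on="BlockId")

# Left join with an indicator column keeps every log line and flags unlabeled ones.
# validate="many_to_one" checks that BlockId is unique in the labels frame.
checked = pd.merge(
    logs, labels, on="BlockId", how="left",
    validate="many_to_one", indicator=True,
)
unlabeled = checked[checked["_merge"] == "left_only"]

print(f"Inner join rows: {len(inner)} | Unlabeled rows: {len(unlabeled)}")
```

In the notebook above, comparing `len(new_logs_df)` with `len(logs_df)` gives the same information at a glance.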


#### 3. Map 'Normal' to 1 and 'Anomaly' to 0

In [107]:
new_logs_df["Label"] = new_logs_df["Label"].replace({'Normal': 1, 'Anomaly': 0})
new_logs_df.head(1)

Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,BlockId,Label
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,E5,Receiving block <*> src: /<*> dest: /<*>,blk_-1608999687919862906,1
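
As an alternative to `.replace`, the sketch below uses `Series.map`, which turns any value missing from the dictionary into `NaN` instead of passing it through unchanged. This is a swapped-in technique rather than what the notebook does, shown on a made-up series, and it assumes the only label values are 'Normal' and 'Anomaly'.

```python
import pandas as pd

# Made-up label column for illustration.
labels = pd.Series(["Normal", "Anomaly", "Normal"], name="Label")

label_to_id = {"Normal": 1, "Anomaly": 0}  # same encoding as in the notebook
encoded = labels.map(label_to_id)

# Any label outside the dictionary would show up here as NaN.
if encoded.isna().any():
    raise ValueError("Found labels that are not 'Normal' or 'Anomaly'")

# Make the integer dtype explicit for downstream code.
encoded = encoded.astype("int64")
print(encoded.tolist(), encoded.dtype)
```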


#### 4. Split dataset: Training & Test

In [104]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    new_logs_df,
    test_size=0.2,
    random_state=42,
    stratify=new_logs_df["Label"]
)

print(f"Training Split: {len(train_df)} | Test Split: {len(test_df)}")

Training Split: 83852 | Test Split: 20963
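
The `stratify` argument is meant to keep the class ratio the same in both splits. The following sketch checks that on a toy frame with a made-up 90/10 class balance (not the real distribution of this dataset) by comparing the mean of `Label` in each split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an invented 90/10 class balance.
toy = pd.DataFrame({
    "Content": [f"log line {i}" for i in range(100)],
    "Label": [0] * 10 + [1] * 90,
})

train, test = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    stratify=toy["Label"],
)

# With labels encoded as 0/1, the mean is the share of rows labeled 1.
train_ratio = train["Label"].mean()
test_ratio = test["Label"].mean()
print(f"Share of label 1: train={train_ratio:.2f} | test={test_ratio:.2f}")
```

The same comparison can be run on `train_df` and `test_df` with `value_counts(normalize=True)` to confirm the real split is balanced as intended.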
