# Deep Learning-Based Cyber Threat Detection Using the BETH Dataset

![cyber_photo](cyber_photo.jpg)

The rapid evolution of cyber threats—ranging from malware and phishing to denial-of-service (DoS) attacks—poses a growing challenge for organizations worldwide. Traditional rule-based security systems often struggle to detect these increasingly sophisticated attacks, especially as threat actors adapt and innovate faster than static defenses can respond.

This project focuses on developing a deep learning model for real-time cyber threat detection, leveraging the BETH dataset, a simulated collection of system logs that mirrors real-world behavior. Each log event contains rich contextual features such as process IDs, user actions, return values, and mount namespaces, alongside a binary label (`sus_label`) indicating whether the activity is benign (0) or suspicious (1).

By training a neural network model on this dataset, the goal is to accurately classify and flag potentially malicious events within a system. The use of deep learning allows the system to learn complex patterns and subtle anomalies in behavior that may indicate early signs of intrusion or compromise.

This project demonstrates how intelligent, data-driven methods can be harnessed to build more adaptive and robust cybersecurity solutions, ultimately contributing to safer digital environments for organizations and users alike.

## About the Dataset
Each row in the dataset represents a log event generated by a process running on a machine. Think of it like a snapshot of an action the system took at a certain point in time. These logs come with information that can be used to assess whether the activity is suspicious or not.


| Column     | Description              | Type  |
|------------|--------------------------|-------|
|`processId`|Unique identifier of the process that generated the event. Useful for tracking activity tied to a specific process.|int64|
|`threadId`|ID of the thread that spawned the log. Some threats operate through specific threads to avoid detection.|int64|
|`parentProcessId`|ID of the parent of the process that spawned this log.|int64|
|`userId`|ID of the user that spawned the log. Can help detect a low-privileged user performing admin-level actions.|int64|
|`mountNamespace`|Mount restrictions the logging process works within. Attackers sometimes exploit mount namespaces to hide files or isolate malicious actions.|int64|
|`argsNum`|Number of arguments passed to the event. An unusual number of arguments might indicate tampering or an exploit attempt.|int64|
|`returnValue`|Value returned from the event log. A value of 0 usually indicates success; unexpected or frequent failures may signal probing or brute-force attempts.|int64|
|`sus_label`|Binary label for a suspicious event (1 is suspicious, 0 is not).|int64|

In [1]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
from torchmetrics import Accuracy

from torch.nn import CrossEntropyLoss

In [2]:
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
3,7347,7347,7341,0,4026531840,2,-2,1
4,7347,7347,7341,0,4026531840,4,0,1
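
The header above includes an `Unnamed: 0` column. pandas assigns that name when the first column of a CSV has no header, which typically happens when a row index was saved along with the data. If that is the case here (an assumption, since the raw files are not shown), the column would pass through `drop("sus_label", ...)` and be fed to the model as a feature. A minimal sketch of the difference, using a small in-memory CSV built from the rows above rather than the real file:

```python
import io
import pandas as pd

# First rows of the training set, reproduced from the output above,
# with an unnamed leading column as it would appear in the raw CSV (assumed layout)
csv_text = """,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
"""

# Default read: the unnamed column becomes a regular column called "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead of a feature
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```

If the real files follow this layout, passing `index_col=0` to the three `pd.read_csv` calls would keep the row counter out of the feature matrix.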


In [3]:
train_df.sus_label.value_counts()

0    761875
1      1269
Name: sus_label, dtype: int64
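
These counts show a heavily imbalanced training set, which matters when accuracy is the only metric. The short calculation below uses only the two counts printed above to work out the share of suspicious events and the accuracy that a trivial "always benign" classifier would reach on this split. It is a sanity-check sketch, not part of the original pipeline.

```python
# Class counts copied from the value_counts() output above
n_benign = 761_875
n_suspicious = 1_269
n_total = n_benign + n_suspicious

# Share of suspicious events in the training split
suspicious_rate = n_suspicious / n_total

# Accuracy of a trivial classifier that labels every event benign (0)
majority_baseline = n_benign / n_total

print(f"Suspicious share:  {suspicious_rate:.4%}")
print(f"Majority baseline: {majority_baseline:.6f}")
```

Any training accuracy at or near this baseline (about 0.9983) says little about whether suspicious events are being detected.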

In [4]:
X_train = train_df.drop("sus_label", axis=1)
y_train = train_df['sus_label'].values

X_val = val_df.drop("sus_label", axis=1)
y_val = val_df['sus_label'].values

X_test = test_df.drop("sus_label", axis=1)
y_test = test_df["sus_label"].values

In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

In [6]:
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1,1)

X_val_tensor = torch.tensor(X_val_scaled, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1,1)

X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1,1)

In [7]:
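# Feed-forward binary classifier: two hidden layers with ReLU,
# and a sigmoid output that can be read as P(suspicious)
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid()
)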

In [8]:
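# Binary cross-entropy on probabilities (pairs with the final nn.Sigmoid)
criterion = nn.BCELoss()
# SGD with momentum and L2 regularization (weight_decay)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.95, weight_decay=0.0001)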
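
With roughly 600 benign events for every suspicious one in the training set, an unweighted binary cross-entropy can be driven low by predicting "benign" almost everywhere. One common mitigation, which the original notebook does not use, is `nn.BCEWithLogitsLoss` with a `pos_weight`. The sketch below assumes the class counts reported earlier and uses made-up logits purely for illustration; the ratio of negatives to positives is a starting point, not a tuned value.

```python
import torch
import torch.nn as nn

# Class counts taken from the training-set value_counts() output
n_benign, n_suspicious = 761_875, 1_269

# pos_weight scales the loss contribution of positive (suspicious) examples
pos_weight = torch.tensor([n_benign / n_suspicious])

# BCEWithLogitsLoss applies the sigmoid internally, so a model feeding it
# must end in a plain Linear layer (no nn.Sigmoid at the end)
weighted_criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Tiny illustration with made-up logits and labels
logits = torch.tensor([[2.0], [-1.0], [0.5]])
labels = torch.tensor([[1.0], [0.0], [0.0]])
weighted_loss = weighted_criterion(logits, labels)
plain_loss = nn.BCEWithLogitsLoss()(logits, labels)

print(f"weighted: {weighted_loss.item():.4f}  unweighted: {plain_loss.item():.4f}")
```

Switching to this loss would also mean applying `torch.sigmoid` to the model output before thresholding at evaluation time.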

In [9]:
epochs = 10

In [13]:
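# Full-batch training: each epoch is a single forward/backward pass
# over the entire training set, i.e. one parameter update per epoch
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()

# Switch to inference mode and predict probabilities for every split
model.eval()
with torch.no_grad():
    y_train_predict = model(X_train_tensor)
    y_val_predict = model(X_val_tensor)
    y_test_predict = model(X_test_tensor)

# Binary accuracy; probabilities are thresholded at 0.5 by default
accuracy = Accuracy(task="binary")

train_accuracy = accuracy(y_train_predict, y_train_tensor)
val_accuracy = accuracy(y_val_predict, y_val_tensor)
test_accuracy = accuracy(y_test_predict, y_test_tensor)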
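
The loop above passes the whole training set through the network at once, so 10 epochs amount to 10 parameter updates. `DataLoader` and `TensorDataset` are imported in the first cell but never used; a mini-batch variant could look like the sketch below. It runs on synthetic stand-in data, and the feature count (7), batch size (256), and the `demo_*` names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for X_train_tensor / y_train_tensor (7 features assumed)
X_demo = torch.randn(1024, 7)
y_demo = (X_demo[:, :1] > 1.0).float()  # shape (1024, 1), values 0.0 or 1.0

# Same architecture and optimizer settings as in the notebook
demo_model = nn.Sequential(
    nn.Linear(7, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
demo_criterion = nn.BCELoss()
demo_optimizer = optim.SGD(
    demo_model.parameters(), lr=0.001, momentum=0.95, weight_decay=0.0001
)

# batch_size=256 is an arbitrary choice for illustration
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=256, shuffle=True)

epoch_losses = []
n_updates = 0
for epoch in range(10):
    demo_model.train()
    running = 0.0
    for xb, yb in loader:
        demo_optimizer.zero_grad()
        loss = demo_criterion(demo_model(xb), yb)
        loss.backward()
        demo_optimizer.step()
        running += loss.item() * xb.size(0)  # sum of per-sample losses
        n_updates += 1
    epoch_losses.append(running / len(loader.dataset))  # mean loss this epoch

print(f"{n_updates} updates, final mean loss {epoch_losses[-1]:.4f}")
```

Here 1,024 rows in batches of 256 give 4 updates per epoch, so the same 10 epochs produce 40 updates instead of 10.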

In [14]:
train_accuracy = train_accuracy.item()
val_accuracy = val_accuracy.item()
test_accuracy = test_accuracy.item()

In [15]:
print(f"Train accuracy is {train_accuracy}")
print(f"Validation accuracy is {val_accuracy}")
print(f"Test accuracy is {test_accuracy}")

Train accuracy is 0.9983371496200562
Validation accuracy is 0.9958405494689941
Test accuracy is 0.09265109896659851
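
The training accuracy printed above matches 761875 / 763144 ≈ 0.99834, which is the score of a classifier that labels every training event benign. The very low test accuracy would be consistent with the same behavior on a test split made up mostly of suspicious events, but that is an inference worth confirming with `test_df.sus_label.value_counts()`. Accuracy alone cannot show this, so the sketch below computes confusion counts plus precision, recall, and F1 for the suspicious class. `binary_report` is a hypothetical helper, not part of the original notebook, and the toy tensors stand in for real predictions.

```python
import torch

def binary_report(probs, targets, threshold=0.5):
    """Confusion counts and precision/recall/F1 for the positive class (label 1)."""
    preds = (probs.flatten() > threshold).long()
    targets = targets.flatten().long()
    tp = int(((preds == 1) & (targets == 1)).sum())
    fp = int(((preds == 1) & (targets == 0)).sum())
    fn = int(((preds == 0) & (targets == 1)).sum())
    tn = int(((preds == 0) & (targets == 0)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(targets),
        "precision": precision, "recall": recall, "f1": f1,
    }

# Toy data: 9 benign events and 1 suspicious one; the "model" outputs 0.1 everywhere
toy_targets = torch.tensor([0.0] * 9 + [1.0]).view(-1, 1)
toy_probs = torch.full((10, 1), 0.1)

report = binary_report(toy_probs, toy_targets)
print(report)
```

In the notebook, the same helper could be called as `binary_report(y_test_predict, y_test_tensor)`; a recall of 0 on the suspicious class would confirm that the model never flags anything.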
