![cyber_photo](cyber_photo.jpg)

Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DoS) attacks, compromising sensitive information and disrupting operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, you identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already undergone preprocessing, and we have a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.


### The Data

| Column     | Description              |
|------------|--------------------------|
|`processId`|The unique identifier for the process that generated the event - int64 |
|`threadId`|ID for the thread spawning the log - int64|
|`parentProcessId`|Label for the process spawning this log - int64|
|`userId`|ID of user spawning the log - int64|
|`mountNamespace`|Mounting restrictions the process log works within - int64|
|`argsNum`|Number of arguments passed to the event - int64|
|`returnValue`|Value returned from the event log (usually 0) - int64|
|`sus_label`|Binary label marking a suspicious event (1 is suspicious, 0 is not) - int64|

More information on the dataset: [BETH dataset](accreditation.md)

In [42]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
from torchmetrics import Accuracy
# from sklearn.metrics import accuracy_score  # uncomment to use sklearn

In [43]:
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

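| Unnamed: 0 | processId | threadId | parentProcessId | userId | mountNamespace | argsNum | returnValue | sus_label |
|------------|-----------|----------|-----------------|--------|----------------|---------|-------------|-----------|
| 0 | 381 | 7337 | 1 | 100 | 4026532231 | 5 | 0 | 1 |
| 1 | 381 | 7337 | 1 | 100 | 4026532231 | 1 | 0 | 1 |
| 2 | 381 | 7337 | 1 | 100 | 4026532231 | 0 | 0 | 1 |
| 3 | 7347 | 7347 | 7341 | 0 | 4026531840 | 2 | -2 | 1 |
| 4 | 7347 | 7347 | 7341 | 0 | 4026531840 | 4 | 0 | 1 |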



### Loading and Scaling Data

In [45]:
feature_cols = ['processId', 'threadId', 'parentProcessId', 'userId', 'mountNamespace', 'argsNum', 'returnValue']

# Initialize the scaler
scaler = StandardScaler()

# Work on copies so the raw DataFrames stay untouched
train_scaled = train_df.copy()
val_scaled = val_df.copy()
test_scaled = test_df.copy()

# Fit the scaler on the training set only, then apply it to all three splits
train_scaled[feature_cols] = scaler.fit_transform(train_df[feature_cols])
val_scaled[feature_cols] = scaler.transform(val_df[feature_cols])
test_scaled[feature_cols] = scaler.transform(test_df[feature_cols])

# Optionally, separate features and labels for ML
X_train = train_scaled[feature_cols]
y_train = train_scaled['sus_label']
X_val = val_scaled[feature_cols]
y_val = val_scaled['sus_label']
X_test = test_scaled[feature_cols]
y_test = test_scaled['sus_label']

# Example: print first few rows to verify
print("Scaled training features:\n", X_train.head())
print("Training labels:\n", y_train.head())

Scaled training features:
    processId  threadId  parentProcessId  ...  mountNamespace   argsNum  returnValue
0  -3.301279  0.266761        -0.849092  ...        1.783991  1.736080    -0.054994
1  -3.301279  0.266761        -0.849092  ...        1.783991 -1.246980    -0.054994
2  -3.301279  0.266761        -0.849092  ...        1.783991 -1.992744    -0.054994
3   0.273100  0.271924         2.463837  ...       -0.587102 -0.501215    -0.061272
4   0.273100  0.271924         2.463837  ...       -0.587102  0.990315    -0.054994

[5 rows x 7 columns]
Training labels:
 0    1
1    1
2    1
3    1
4    1
Name: sus_label, dtype: int64
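
To make the fit-on-train, transform-everywhere step above concrete, here is a minimal hand-rolled sketch of the same idea using only the standard library. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default. The training values are the `argsNum` entries from the five displayed rows, and `val_args` is a hypothetical validation column, so the numbers will not match the notebook output, which is computed over the full training set.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population std from the training column only."""
    mean = fmean(column)
    std = pstdev(column)  # population std (ddof=0)
    if std == 0:
        std = 1.0  # guard against constant columns to avoid division by zero
    return mean, std


def standardize(column, mean, std):
    """Apply previously learned statistics to any split."""
    return [(x - mean) / std for x in column]


train_args = [5, 1, 0, 2, 4]  # argsNum from the five rows shown by head()
val_args = [3, 6]             # hypothetical validation values (assumption)

mean, std = fit_standardizer(train_args)       # statistics come from train only
train_z = standardize(train_args, mean, std)
val_z = standardize(val_args, mean, std)       # reuse the train statistics

print(train_z)
print(val_z)
```

Reusing the training statistics for the validation and test splits keeps information about those splits out of preprocessing, which is the reason the notebook calls `fit_transform` once and `transform` afterwards.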


In [46]:
import torch
import torch.nn as nn

class SuspiciousProcessClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim=64, num_layers=2, dropout=0.3):
        super().__init__()
        layers = []
        current_dim = input_dim

        # Hidden layers
        for _ in range(num_layers):
            layers.append(nn.Linear(current_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            current_dim = hidden_dim

        # Output layer (current_dim falls back to input_dim when num_layers == 0)
        layers.append(nn.Linear(current_dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(1)

# Example usage:
# Suppose your features are ['processId', 'threadId', 'parentProcessId', 'userId', 'mountNamespace', 'argsNum', 'returnValue']
input_size = 7
model = SuspiciousProcessClassifier(input_dim=input_size)
print(model)

SuspiciousProcessClassifier(
  (net): Sequential(
    (0): Linear(in_features=7, out_features=64, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.3, inplace=False)
    (3): Linear(in_features=64, out_features=64, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.3, inplace=False)
    (6): Linear(in_features=64, out_features=1, bias=True)
  )
)
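
As a quick sanity check on the printed architecture, the number of trainable parameters can be worked out by hand. The sketch below uses plain Python and takes the layer shapes from the printout above; `linear_params` is a hypothetical helper introduced here for illustration, not part of PyTorch.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameters in a fully connected layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


# (in_features, out_features) for the three Linear layers in the printout
layer_shapes = [(7, 64), (64, 64), (64, 1)]

per_layer = [linear_params(i, o) for i, o in layer_shapes]
total = sum(per_layer)

print(per_layer)  # [512, 4160, 65]
print(total)      # 4737
```

ReLU and Dropout add no parameters. In the notebook, the same total should be obtainable with `sum(p.numel() for p in model.parameters())`.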


In [47]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torchmetrics import Accuracy
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import StandardScaler

# 1. Load and scale data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

X_train = train_df.drop('sus_label', axis=1).values
y_train = train_df['sus_label'].values
X_test = test_df.drop('sus_label', axis=1).values
y_test = test_df['sus_label'].values
X_val = val_df.drop('sus_label', axis=1).values
y_val = val_df['sus_label'].values

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)

X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1)

# 2. Define the model
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid()
)

# 3. Loss and optimizer
criterion = nn.BCELoss()  # BCELoss expects probabilities in [0, 1], which the final Sigmoid provides
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 4. Training loop (full-batch: one gradient step per epoch on the whole training set)
num_epoch = 10
for epoch in range(num_epoch):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}/{num_epoch}, Loss: {loss.item():.4f}")

# 5. Model Evaluation
model.eval()
with torch.no_grad():
    y_predict_train = model(X_train_tensor).round()
    y_predict_test = model(X_test_tensor).round()
    y_predict_val = model(X_val_tensor).round()

# 6. Calculate accuracy using torchmetrics
accuracy = Accuracy(task="binary")
train_accuracy = accuracy(y_predict_train, y_train_tensor.int())
test_accuracy = accuracy(y_predict_test, y_test_tensor.int())
val_accuracy = accuracy(y_predict_val, y_val_tensor.int())

print("Training accuracy: {:.4f}".format(train_accuracy.item()))
print("Validation accuracy: {:.4f}".format(val_accuracy.item()))
print("Testing accuracy: {:.4f}".format(test_accuracy.item()))

Epoch 1/10, Loss: 0.7100
Epoch 2/10, Loss: 0.6772
Epoch 3/10, Loss: 0.6450
Epoch 4/10, Loss: 0.6128
Epoch 5/10, Loss: 0.5814
Epoch 6/10, Loss: 0.5513
Epoch 7/10, Loss: 0.5215
Epoch 8/10, Loss: 0.4922
Epoch 9/10, Loss: 0.4638
Epoch 10/10, Loss: 0.4365
Training accuracy: 0.9983
Validation accuracy: 0.9958
Testing accuracy: 0.0927
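
Accuracy alone says little about why the test score differs so sharply from the training and validation scores. One possible factor, not established by anything above, is a different class balance between splits. The sketch below shows, on made-up toy labels (not taken from the BETH splits), how a confusion-matrix breakdown exposes that effect; `confusion_counts` and `summarize` are hypothetical helpers written for this illustration.

```python
def confusion_counts(preds, labels):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for p, y in zip(preds, labels):
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(preds, labels):
    """Accuracy, precision, recall, and the share of positive labels."""
    tp, fp, tn, fn = confusion_counts(preds, labels)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "positive_rate": (tp + fn) / total,
    }


# Toy data (assumption, for illustration only)
labels_mostly_benign = [0] * 9 + [1]
labels_mostly_suspicious = [1] * 9 + [0]
always_benign = [0] * 10  # a degenerate model that never flags anything

report_a = summarize(always_benign, labels_mostly_benign)
report_b = summarize(always_benign, labels_mostly_suspicious)

print(report_a)  # high accuracy, zero recall
print(report_b)  # same predictions, low accuracy
```

With the notebook's tensors, the inputs could be built with something like `y_predict_test.int().flatten().tolist()` and `y_test_tensor.int().flatten().tolist()`. Comparing `positive_rate` and `recall` across the three splits would show whether the model is actually detecting suspicious events or mostly predicting one class.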
