![cyber_photo](cyber_photo.jpg)

Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DOS) attacks, compromising sensitive information and disrupting operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short due to their inability to adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, you identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already undergone preprocessing, and we have a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.


### The Data

| Column     | Description              |
|------------|--------------------------|
|`processId`|The unique identifier for the process that generated the event - int64 |
|`threadId`|ID for the thread spawning the log - int64|
|`parentProcessId`|Label for the process spawning this log - int64|
|`userId`|ID of user spawning the log|Numerical - int64|
|`mountNamespace`|Mounting restrictions the process log works within - int64|
|`argsNum`|Number of arguments passed to the event - int64|
|`returnValue`|Value returned from the event log (usually 0) - int64|
|`sus_label`|Binary label as suspicous event (1 is suspicious, 0 is not) - int64|

More information on the dataset: [BETH dataset](accreditation.md)

In [155]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
from torchmetrics import Accuracy
# from sklearn.metrics import accuracy_score 

In [156]:
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
3,7347,7347,7341,0,4026531840,2,-2,1
4,7347,7347,7341,0,4026531840,4,0,1


In [157]:
scaler = StandardScaler()

In [158]:
features = train_df.iloc[:, :-1]
target = train_df.iloc[:, -1]
features = scaler.fit_transform(features)
features[:5]

array([[-3.301279  ,  0.26676142, -0.84909243,  2.61170424,  1.78399123,
         1.73607956, -0.0549941 ],
       [-3.301279  ,  0.26676142, -0.84909243,  2.61170424,  1.78399123,
        -1.2469795 , -0.0549941 ],
       [-3.301279  ,  0.26676142, -0.84909243,  2.61170424,  1.78399123,
        -1.99274427, -0.0549941 ],
       [ 0.27310016,  0.27192386,  2.46383729, -0.06090978, -0.58710151,
        -0.50121474, -0.06127163],
       [ 0.27310016,  0.27192386,  2.46383729, -0.06090978, -0.58710151,
         0.99031479, -0.0549941 ]])

In [159]:
target[:5]

0    1
1    1
2    1
3    1
4    1
Name: sus_label, dtype: int64

In [160]:
dataset = TensorDataset(torch.tensor(features, dtype=torch.float32), 
                        torch.tensor(target, dtype=torch.float32))
input, label = dataset[0]
print("input:", input, "\ntarget:", label)

input: tensor([-3.3013,  0.2668, -0.8491,  2.6117,  1.7840,  1.7361, -0.0550]) 
target: tensor(1.)


In [161]:
batch_size = 400
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

In [162]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(7, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 8)
        self.fc5 = nn.Linear(8, 1)      

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.sigmoid(self.fc5(x))
        return x

In [163]:
model = Model()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [164]:
epochs = 10
for epoch in range(epochs):
    for input, label in train_loader:
        optimizer.zero_grad()
        output = model(input)
        loss = criterion(output, label.view(-1, 1))
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Epoch 1, Loss: 0.00016768208297435194
Epoch 2, Loss: 0.00021100473531987518
Epoch 3, Loss: 7.755983097013086e-05
Epoch 4, Loss: 0.00025471311528235674
Epoch 5, Loss: 0.00016263789439108223
Epoch 6, Loss: 8.894057828001678e-05
Epoch 7, Loss: 0.009316994808614254
Epoch 8, Loss: 0.00027685065288096666
Epoch 9, Loss: 0.0010038088075816631
Epoch 10, Loss: 0.00023012021847534925


In [165]:
test_features = test_df.iloc[:, :-1]
test_features = scaler.fit_transform(test_features)
test_target = test_df.iloc[:, -1]
test_features[:5]

array([[-6.27575957, -6.28307638, -3.50731889, -2.12883111, 21.60137574,
         0.16523253,  0.22087792],
       [-6.27846254, -6.28578236, -3.50731889, -2.13165711, 21.54616351,
         0.16523253,  0.22087792],
       [-6.6190371 , -6.62673653, -3.50782583, -2.41425719, -0.04181843,
         1.73244104,  0.18025395],
       [-6.6190371 , -6.62673653, -3.50782583, -2.41425719, -0.04181843,
         1.73244104,  0.22629445],
       [-6.6190371 , -6.62673653, -3.50782583, -2.41425719, -0.04181843,
        -1.40197599,  0.18025395]])

In [166]:
test_dataset = TensorDataset(torch.tensor(test_features, dtype=torch.float32), 
                        torch.tensor(test_target, dtype=torch.float32))
input, label = dataset[0]
print("input:", input, "\ntarget:", label)

input: tensor([-3.3013,  0.2668, -0.8491,  2.6117,  1.7840,  1.7361, -0.0550]) 
target: tensor(1.)


In [167]:
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [168]:
val_features = val_df.iloc[:, :-1]
val_features = scaler.fit_transform(val_features)
val_target = val_df.iloc[:, -1]
val_features[:5]

array([[-3.65879522e+00, -3.66642991e+00, -8.50847686e-01,
         1.50417508e+00,  6.51000888e-01,  4.28087825e-01,
         3.55368471e-03],
       [-3.66049157e+00, -3.66812944e+00, -8.50847686e-01,
         1.48860148e+00,  6.44778472e-01,  4.28087825e-01,
         3.55368471e-03],
       [-3.87366585e+00, -3.88170286e+00, -8.53107939e-01,
        -6.87578855e-02, -1.78818639e+00,  1.20140579e+00,
        -5.10830467e-02],
       [-3.87366585e+00, -3.88170286e+00, -8.53107939e-01,
        -6.87578855e-02, -1.78818639e+00,  1.20140579e+00,
        -7.37366158e-03],
       [-3.87366585e+00, -3.88170286e+00, -8.53107939e-01,
        -6.87578855e-02, -1.78818639e+00, -3.45230137e-01,
        -5.10830467e-02]])

In [169]:
val_dataset = TensorDataset(torch.tensor(val_features, dtype=torch.float32), 
                        torch.tensor(val_target, dtype=torch.float32))
input, label = dataset[0]
print("input:", input, "\ntarget:", label)

input: tensor([-3.3013,  0.2668, -0.8491,  2.6117,  1.7840,  1.7361, -0.0550]) 
target: tensor(1.)


In [170]:
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [171]:
test_acc = Accuracy(task="binary")

model.eval()
with torch.no_grad():
    for input, label in test_loader:
        test_output = model(input)
        test_preds = (test_output >= 0.5).float()
        test_acc(test_preds, label.view(-1, 1))

test_accuracy = float(test_acc.compute())
print("Validation Accuracy", test_accuracy)

Validation Accuracy 0.09265109896659851


In [172]:
val_acc = Accuracy(task="binary")

model.eval()
with torch.no_grad():
    for input, label in val_loader:
        val_output = model(input)
        val_preds = (val_output >= 0.5).float()
        val_acc(val_preds, label.view(-1, 1))

val_accuracy = float(val_acc.compute())
print("Validation Accuracy", test_accuracy)

Validation Accuracy 0.09265109896659851
