<a href="https://colab.research.google.com/github/KunalParkhade/detect-cyber-security-threat/blob/main/Detecting_Cyber_Security_Threats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Cyber Security](https://www.dhl.com/content/dam/dhl/global/core/images/smart-grid-thought-leadership-1375x504/csi-ltr6-cyber-security-trend.jpg)

Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DoS) attacks, which compromise sensitive information and disrupt operations. Because these attacks are growing in sophistication and frequency, organizations need advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, your job is to identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already been preprocessed and includes a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.

In [10]:
!pip install torchmetrics



In [11]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
from torchmetrics import Accuracy
# from sklearn.metrics import accuracy_score  # uncomment to use sklearn

# Step 1: Loading and Scaling Data

In [12]:
# Load preprocessed data
train_df = pd.read_csv('/content/drive/MyDrive/Datasets/Cyber Security Threat Dataset/labelled_train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Datasets/Cyber Security Threat Dataset/labelled_test.csv')
val_df = pd.read_csv('/content/drive/MyDrive/Datasets/Cyber Security Threat Dataset/labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
3,7347,7347,7341,0,4026531840,2,-2,1
4,7347,7347,7341,0,4026531840,4,0,1


In [13]:
## Separate features and labels
X_train = train_df.drop('sus_label', axis=1).values
y_train = train_df['sus_label'].values

X_val = val_df.drop('sus_label', axis=1).values
y_val = val_df['sus_label'].values

X_test = test_df.drop('sus_label', axis=1).values
y_test = test_df['sus_label'].values

In [14]:
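## Scale the features: fit the scaler on the training set only, then reuse
## its statistics for validation and test so their distributions do not leak in
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)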
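
The scaling step can be sketched on synthetic data to show the usual convention: the mean and standard deviation are estimated from the training split once and then reused for every other split. The arrays `train` and `held_out` below are made-up stand-ins, not the BETH data, and the two-feature shape is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real splits (assumption: 2 features)
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(1000, 2))
held_out = rng.normal(loc=12.0, scale=2.0, size=(200, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)     # learns mean/std from train
held_out_scaled = scaler.transform(held_out)   # reuses the train statistics

# Train is centred near 0; the held-out split keeps its offset relative to train
print(train_scaled.mean(axis=0))
print(held_out_scaled.mean(axis=0))
```

Refitting the scaler on the held-out split would instead centre it at zero and hide the shift between the two distributions from the model.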

In [15]:
## Convert data to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)

X_val = torch.tensor(X_val, dtype=torch.float32)
y_val = torch.tensor(y_val, dtype=torch.float32)

X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)

In [16]:
## Create DataLoader for training and validation
train_data = TensorDataset(X_train, y_train)
val_data = TensorDataset(X_val, y_val)
test_data = TensorDataset(X_test, y_test)

train_loader = DataLoader(train_data, shuffle=True, batch_size=64)
val_loader = DataLoader(val_data, shuffle=False, batch_size=64)
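
A held-out test split can be wrapped in a loader the same way as the validation split. The following is a minimal sketch with synthetic tensors; the shapes and the name `test_loader` are assumptions for illustration and do not come from the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 150 events with 7 features and binary labels
X = torch.randn(150, 7)
y = torch.randint(0, 2, (150,)).float()

# No shuffling for evaluation, so results are reproducible batch by batch
test_loader = DataLoader(TensorDataset(X, y), shuffle=False, batch_size=64)

# 150 samples at batch_size=64 give batches of 64, 64 and 22
batch_sizes = [xb.size(0) for xb, _ in test_loader]
print(len(test_loader), batch_sizes)
```

Because the last batch is usually smaller than the rest, metrics computed per batch should be weighted by batch size or accumulated across batches rather than averaged naively.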

# Step 2: Define the Neural Network Model

In [17]:
class CyberAttackDetector(nn.Module):
    def __init__(self):
        super(CyberAttackDetector, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(X_train.shape[1], 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

# Initialize the model, loss function and optimizer
model = CyberAttackDetector()
criterion = nn.BCELoss()       # BCELoss for binary classification
optimizer = optim.Adam(model.parameters(), lr=0.001)
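
One common alternative to pairing `nn.Sigmoid` with `nn.BCELoss` is to have the network output raw logits and use `nn.BCEWithLogitsLoss`, which applies the sigmoid inside the loss in a numerically more stable way. The sketch below uses made-up logits and targets only to show that the two formulations agree on well-behaved inputs; it is not the notebook's setup.

```python
import torch
import torch.nn as nn

# Hypothetical raw model outputs (logits) and binary targets
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Variant 1: explicit sigmoid followed by BCELoss
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Variant 2: BCEWithLogitsLoss applied directly to the logits
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_probs.item(), loss_logits.item())
```

With the second variant, the final `nn.Sigmoid()` layer would be dropped from the model, and predictions would be obtained with `torch.sigmoid(logits) >= 0.5`.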

# Step 3: Train and Evaluate the Model

In [18]:
# Define the accuracy metric
accuracy_metric = Accuracy(task='binary', threshold=0.5)

# Training and validation loop
num_epochs = 20
best_val_accuracy = 0.0

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch).squeeze(1)  # (batch, 1) -> (batch,); also safe for a batch of one
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * X_batch.size(0)

    epoch_loss = running_loss / len(train_loader.dataset)

    # Validation phase: accumulate the metric over all batches so the final
    # (possibly smaller) batch is weighted by its size
    model.eval()
    accuracy_metric.reset()
    with torch.no_grad():
        for X_val_batch, y_val_batch in val_loader:
            val_outputs = model(X_val_batch).squeeze(1)
            val_predictions = (val_outputs >= 0.5).int()
            accuracy_metric.update(val_predictions, y_val_batch.int())

    val_accuracy = accuracy_metric.compute().item()

    # Save the best validation accuracy
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}')

# Report the best validation accuracy as a whole-number percentage (truncated, not rounded)
val_accuracy = int(best_val_accuracy * 100)
print(f'Best Validation Accuracy: {val_accuracy}%')

Epoch 1/20, Loss: 0.0052, Validation Accuracy: 1.0000
Epoch 2/20, Loss: 0.0022, Validation Accuracy: 1.0000
Epoch 3/20, Loss: 0.0021, Validation Accuracy: 1.0000
Epoch 4/20, Loss: 0.0021, Validation Accuracy: 1.0000
Epoch 5/20, Loss: 0.0019, Validation Accuracy: 1.0000
Epoch 6/20, Loss: 0.0018, Validation Accuracy: 1.0000
Epoch 7/20, Loss: 0.0018, Validation Accuracy: 1.0000
Epoch 8/20, Loss: 0.0018, Validation Accuracy: 1.0000
Epoch 9/20, Loss: 0.0020, Validation Accuracy: 1.0000
Epoch 10/20, Loss: 0.0018, Validation Accuracy: 1.0000
Epoch 11/20, Loss: 0.0017, Validation Accuracy: 1.0000
Epoch 12/20, Loss: 0.0019, Validation Accuracy: 1.0000
Epoch 13/20, Loss: 0.0017, Validation Accuracy: 1.0000
Epoch 14/20, Loss: 0.0018, Validation Accuracy: 1.0000
Epoch 15/20, Loss: 0.0016, Validation Accuracy: 1.0000
Epoch 16/20, Loss: 0.0015, Validation Accuracy: 1.0000
Epoch 17/20, Loss: 0.0015, Validation Accuracy: 1.0000
Epoch 18/20, Loss: 0.0022, Validation Accuracy: 1.0000
Epoch 19/20, Loss: 