<center><b><font size=6>Language Models Exploration</font></b></center>

This notebook ...

Experiment with language models to solve the same supervised task as in Section 2. The objective
is to harness the capabilities of language models such as BERT or Word2Vec for supervised learning
(assigning intents to sessions).
Two concepts matter when we use neural networks:
1- Transfer learning is possible: we can take a model that has already been trained on enormous
datasets by Big Tech companies and fine-tune it, i.e., continue training it on our own data
starting from its pre-trained version.
2- In NLP tasks, words/documents are transformed into vectors (encoding). This step is
unsupervised, so it can use a much larger amount of data.

Choose a language model between BERT and Doc2Vec (Word2Vec for documents), then:
1. If you choose Doc2Vec: pretrain Doc2Vec on the body column of the session text. If you choose BERT: take the pretrained BERT model like in this example. (NB: the tutorial uses BertForSequenceClassification, but if you want to continue with step 2 you must use another BERT implementation from HuggingFace.)
2. Add a final Dense layer.
3. Fine-tune the last layer of the network on the supervised training set for N epochs.
4. Plot the learning curves on the training and validation sets. After how many epochs should we stop the training?

<center><b><font size=5>Install Dependencies</font></b></center>

In [12]:
!python ../scripts/install_dependencies.py section4

Installing common packages: pandas, pyarrow
Successfully installed: pandas
Successfully installed: pyarrow
Installing Section 4 packages: matplotlib, scikit-learn, torch, transformers
Successfully installed: matplotlib
Successfully installed: scikit-learn
Successfully installed: torch
Successfully installed: transformers

<center><b><font size=5>Name</font></b></center>

In [1]:
import os
import pickle
import time
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
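from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch implementation
from transformers import BertTokenizer, BertModel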
import matplotlib.pyplot as plt

In [2]:
# Paths for saving preprocessed data
TOKENIZED_TRAIN_PATH = "../data/processed/train_encodings.pkl"
TOKENIZED_VAL_PATH = "../data/processed/val_encodings.pkl"

# 1. Load Dataset
print("Loading the dataset...")
df = pd.read_parquet("../data/processed/ssh_attacks_sampled_decoded.parquet")
print("Dataset loaded successfully!")
print(f"Dataset size: {df.shape[0]} rows")


Loading the dataset...
Dataset loaded successfully!
Dataset size: 23297 rows


In [3]:
# 2. Preprocess Set_Fingerprint column (multi-label encoding)
print("Preprocessing 'Set_Fingerprint' column...")
df['Set_Fingerprint'] = df['Set_Fingerprint'].apply(lambda x: [intent.strip() for intent in x.split(',')])
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['Set_Fingerprint'])
print(f"Classes identified: {mlb.classes_}")


Preprocessing 'Set_Fingerprint' column...
Classes identified: ['Defense Evasion' 'Discovery' 'Execution' 'Harmless' 'Other'
 'Persistence']
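
To make the encoding concrete, here is a minimal pure-Python sketch of what multi-label binarization produces: one column per sorted class name, with a 1 wherever a session carries that intent. The helper `binarize` and the sample strings are illustrative assumptions; the notebook itself relies on scikit-learn's `MultiLabelBinarizer`.

```python
def binarize(label_strings):
    """Turn comma-separated intent strings into multi-hot rows."""
    # Split and strip each intent, mirroring the preprocessing above
    label_lists = [[s.strip() for s in text.split(",")] for text in label_strings]
    # Sorted unique class names define the column order
    classes = sorted({label for labels in label_lists for label in labels})
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_lists]
    return classes, rows

# Hypothetical sessions, for illustration only
classes, rows = binarize(["Discovery, Persistence", "Execution", "Discovery"])
print(classes)
print(rows)
```

A session with several intents therefore gets several 1s in its row, which is why the model is later trained with a per-label binary loss rather than a softmax over classes.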


In [4]:
# 3. Train-test split
print("Splitting the data into training and validation sets...")
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['full_session'], y, test_size=0.2, random_state=42
)
print("Data split complete.")

Splitting the data into training and validation sets...
Data split complete.


In [5]:
# 4. Tokenization with Save/Load Mechanism
print("Loading or performing tokenization...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def save_tokenized_data(filepath, data):
    with open(filepath, 'wb') as f:
        pickle.dump(data, f)

def load_tokenized_data(filepath):
    with open(filepath, 'rb') as f:
        return pickle.load(f)

# Tokenize only if necessary
if os.path.exists(TOKENIZED_TRAIN_PATH) and os.path.exists(TOKENIZED_VAL_PATH):
    print("Loading pre-tokenized data...")
    train_encodings = load_tokenized_data(TOKENIZED_TRAIN_PATH)
    val_encodings = load_tokenized_data(TOKENIZED_VAL_PATH)
else:
    print("Tokenizing data...")
    train_encodings = tokenizer(list(train_texts.fillna("").astype(str)), truncation=True, padding=True, max_length=128)
    val_encodings = tokenizer(list(val_texts.fillna("").astype(str)), truncation=True, padding=True, max_length=128)
    save_tokenized_data(TOKENIZED_TRAIN_PATH, train_encodings)
    save_tokenized_data(TOKENIZED_VAL_PATH, val_encodings)
    print("Tokenization complete and data saved.")


Loading or performing tokenization...
Loading pre-tokenized data...


In [6]:

# 5. Custom Dataset Class
class SSHDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings['input_ids'][idx]),
            'attention_mask': torch.tensor(self.encodings['attention_mask'][idx]),
            'labels': torch.tensor(self.labels[idx], dtype=torch.float)
        }

    def __len__(self):
        return len(self.labels)


In [7]:
# 6. Prepare DataLoaders
print("Creating DataLoaders...")
train_dataset = SSHDataset(train_encodings, train_labels)
val_dataset = SSHDataset(val_encodings, val_labels)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=8, num_workers=4)
print("DataLoaders are ready.")

Creating DataLoaders...
DataLoaders are ready.


## Addition of the dense layer

In [8]:
# 7. Initialize the Model
print("Initializing the BERT model for sequence classification...")
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = BertModel.from_pretrained('bert-base-uncased')

# Add a custom Dense layer for fine-tuning
class CustomBERTModel(torch.nn.Module):
    def __init__(self, bert_model, num_labels):
        super(CustomBERTModel, self).__init__()
        self.bert = bert_model
        self.classifier = torch.nn.Linear(bert_model.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # CLS token output
        logits = self.classifier(cls_output)
        return logits

model = CustomBERTModel(model, num_labels=y.shape[1])
# Move the model to the selected device. As the last expression in the cell,
# this call also echoes the model architecture in the output below.
model.to(device)


Initializing the BERT model for sequence classification...


CustomBERTModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [9]:
# 8. Optimizer and Loss
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.BCEWithLogitsLoss()
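
Step 3 of the task asks to fine-tune only the last layer, whereas passing `model.parameters()` to the optimizer updates every BERT weight as well. A minimal sketch of the last-layer-only variant follows; `TinyEncoderModel` is a hypothetical stand-in with the same `bert`/`classifier` attribute names as `CustomBERTModel`, so the snippet runs without downloading BERT.

```python
import torch

class TinyEncoderModel(torch.nn.Module):
    """Hypothetical stand-in for CustomBERTModel (same attribute names)."""
    def __init__(self, hidden_size=16, num_labels=6):
        super().__init__()
        self.bert = torch.nn.Linear(32, hidden_size)  # placeholder for the pretrained encoder
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        return self.classifier(self.bert(x))

def freeze_encoder(model):
    """Disable gradients for the encoder so only the dense head is trained."""
    for param in model.bert.parameters():
        param.requires_grad = False
    # Return the parameters that are still trainable (the classifier head)
    return [p for p in model.parameters() if p.requires_grad]

demo_model = TinyEncoderModel()
trainable = freeze_encoder(demo_model)
optimizer_demo = torch.optim.AdamW(trainable, lr=5e-5)

n_trainable = sum(p.numel() for p in trainable)
print(f"Trainable parameters: {n_trainable}")
```

In the notebook this would amount to calling `freeze_encoder(model)` before building the optimizer. The forward pass still runs through BERT, but no gradients are computed for the encoder, so the backward pass is cheaper.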



## Fine tuning of the model

In [10]:
# 9. Training Loop with Remaining Time Estimate
train_loss_list, val_loss_list, epoch_times = [], [], []
print("Starting the training process...")

for epoch in range(10):  # Fine-tune for 10 epochs
    print(f"Epoch {epoch+1} / 10")
    model.train()
    total_loss = 0
    start_time = time.time()  # Start epoch timer

    for batch_idx, batch in enumerate(train_loader):
        optimizer.zero_grad()
        input_ids, attention_mask, labels = (
            batch['input_ids'].to(device),
            batch['attention_mask'].to(device),
            batch['labels'].to(device),
        )
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Estimate remaining time from the average batch duration so far
        # (a single batch's duration is too noisy to extrapolate from)
        elapsed_time = time.time() - start_time
        remaining_batches = len(train_loader) - (batch_idx + 1)
        remaining_time = elapsed_time / (batch_idx + 1) * remaining_batches
        print(f"Batch {batch_idx+1}/{len(train_loader)} - Remaining time: {remaining_time:.2f} seconds", end='\r')

    epoch_time = time.time() - start_time
    epoch_times.append(epoch_time)  # Used by the "Training Time Per Epoch" plot
    train_loss_list.append(total_loss / len(train_loader))
    print(f"\nEpoch {epoch+1} Training loss: {train_loss_list[-1]:.4f}, Time: {epoch_time:.2f} seconds")

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids, attention_mask, labels = (
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device),
                batch['labels'].to(device),
            )
            logits = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(logits, labels)
            val_loss += loss.item()
    val_loss_list.append(val_loss / len(val_loader))
    print(f"Epoch {epoch+1} Validation loss: {val_loss_list[-1]:.4f}")
    

Starting the training process...
Epoch 1 / 10
Batch 57/2330 - Remaining time: 380.67 seconds

KeyboardInterrupt: 
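
Once training has run to completion, the learning curves requested in step 4 can be drawn from `train_loss_list` and `val_loss_list`. A minimal sketch; `plot_learning_curves` is a hypothetical helper and the loss values below are placeholders, not results from this notebook.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(train_losses, val_losses):
    """Plot per-epoch training and validation loss on one set of axes."""
    epochs = range(1, len(train_losses) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, train_losses, marker="o", label="Training loss")
    ax.plot(epochs, val_losses, marker="o", label="Validation loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("BCE loss")
    ax.set_title("Learning Curves")
    ax.legend()
    return ax

# Placeholder values for illustration only; pass train_loss_list and val_loss_list instead
ax = plot_learning_curves([0.9, 0.6, 0.4], [0.95, 0.7, 0.75])
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. A validation curve that starts rising while the training curve keeps falling is the usual sign of overfitting.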

In [11]:
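# 10. Evaluation Metrics and Visualizations
import numpy as np
from sklearn.metrics import roc_curve, auc, precision_recall_curve, confusion_matrix, ConfusionMatrixDisplay
import seaborn as sns  # not in the Section 4 install list above; install it separately if missing

# Collect validation labels and predicted probabilities (sigmoid of the logits)
model.eval()
true_batches, prob_batches = [], []
with torch.no_grad():
    for batch in val_loader:
        logits = model(input_ids=batch['input_ids'].to(device),
                       attention_mask=batch['attention_mask'].to(device))
        prob_batches.append(torch.sigmoid(logits).cpu().numpy())
        true_batches.append(batch['labels'].numpy())
y_true, y_pred = np.concatenate(true_batches), np.concatenate(prob_batches)

# 10.1 Plot ROC Curves (one curve per class)
print("Plotting ROC Curves...")
for i, label in enumerate(mlb.classes_):
    fpr, tpr, _ = roc_curve(y_true[:, i], y_pred[:, i])
    plt.plot(fpr, tpr, label=f"{label} (AUC = {auc(fpr, tpr):.2f})")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()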


In [None]:
# 10.2 Plot Precision-Recall Curves
print("Plotting Precision-Recall Curves...")
for i, label in enumerate(mlb.classes_):
    precision, recall, _ = precision_recall_curve([y[i] for y in y_true], [y[i] for y in y_pred])
    plt.plot(recall, precision, label=f"{label}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()

In [None]:
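# 10.3 Plot Confusion Matrix
import numpy as np

print("Plotting Confusion Matrix...")
# argmax reduces each multi-label row to a single class (for y_true, the first positive label),
# so this view is only approximate for sessions with several intents.
class_ids = np.arange(len(mlb.classes_))  # keep every class, even if absent from the validation set
cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=class_ids)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=mlb.classes_)
disp.plot(cmap="viridis")
plt.title("Confusion Matrix")
plt.show()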
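
Because the labels are multi-label, a per-class view may be more faithful than the argmax-based matrix. One option, sketched here under the assumption that probabilities are thresholded at 0.5, is scikit-learn's `multilabel_confusion_matrix`, which returns one 2x2 matrix per class. The arrays below are toy values, not outputs of the model.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy ground truth and predicted probabilities for two classes (illustration only)
y_true_demo = np.array([[1, 0], [0, 1], [1, 1]])
y_prob_demo = np.array([[0.9, 0.2], [0.7, 0.8], [0.1, 0.6]])

# Threshold probabilities to get hard multi-label predictions (0.5 is an assumed cut-off)
y_hat_demo = (y_prob_demo >= 0.5).astype(int)

# Shape (n_classes, 2, 2); each matrix is laid out as [[TN, FP], [FN, TP]]
per_class_cm = multilabel_confusion_matrix(y_true_demo, y_hat_demo)
for name, cm_k in zip(["class 0", "class 1"], per_class_cm):
    print(name, cm_k.tolist())
```

With the notebook's variables, the same call would take `y_true` and a thresholded `y_pred`, with `mlb.classes_` supplying the class names.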

In [None]:
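# 10.4 Feature Importance (Attention Weights Visualization)
print("Visualizing Feature Importance via Attention Weights...")
# BertModel has no get_attention_map(); ask the forward pass for attentions instead.
# Depending on the transformers version, this may require loading BERT with attn_implementation="eager".
sample = val_dataset[0]
model.eval()
with torch.no_grad():
    outputs = model.bert(
        input_ids=sample['input_ids'].unsqueeze(0).to(device),
        attention_mask=sample['attention_mask'].unsqueeze(0).to(device),
        output_attentions=True,
    )
# Last encoder layer, single sample, averaged over heads -> (seq_len, seq_len)
attention_weights = outputs.attentions[-1][0].mean(dim=0)
sns.heatmap(attention_weights.cpu().numpy(), cmap="viridis")
plt.title("Attention Weights Heatmap")
plt.show()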

In [None]:
# 10.5 Training Time Per Epoch
print("Plotting Training Time Per Epoch...")
plt.plot(range(1, len(epoch_times) + 1), epoch_times, label="Epoch Times")
plt.xlabel("Epochs")
plt.ylabel("Time (seconds)")
plt.title("Training Time Per Epoch")
plt.legend()
plt.show()

After how many epochs should we stop the training?
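
One common way to answer this is early stopping: stop once the validation loss has not improved for a few consecutive epochs, and keep the epoch with the lowest validation loss. A minimal sketch, where `early_stopping_epoch`, the `patience` value, and the loss values are illustrative assumptions rather than results from this notebook:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    best_epoch is the epoch with the lowest validation loss seen before stopping;
    stop_epoch is where training halts after `patience` epochs without improvement.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never ran out of patience: training used every available epoch
    return best_epoch, len(val_losses)

# Placeholder validation losses for illustration only
best, stopped = early_stopping_epoch([0.70, 0.55, 0.50, 0.52, 0.56, 0.61], patience=2)
print(f"Best epoch: {best}, training would stop at epoch {stopped}")
```

Applied to `val_loss_list` after a full run, the first returned value is the epoch whose weights one would keep.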
