<center><b><font size=6>Language Models Exploration</font></b></center>

This notebook ...

Experiment with language models to solve the same supervised task as in Section 2. The objective
is to harness the capabilities of language models such as BERT or Word2Vec for supervised learning (assigning
intents to sessions).
Two concepts are relevant when we use neural networks:
1- Transfer learning is possible: we can take a model that has already been trained on
enormous datasets by Big Tech companies and fine-tune it, i.e., continue training the model
starting from its pre-trained weights.
2- In NLP tasks, words/documents are transformed into vectors (encoding). This encoding step is
unsupervised, so it can use a much larger amount of data because no labels are required.
Choose a language model between BERT and Doc2Vec (Word2Vec for documents), then:
1. If you choose Doc2Vec: pretrain Doc2Vec on the body column of the session text. If you choose BERT: take the pretrained BERT model as in this example. (NB: the tutorial uses BertForSequenceClassification, but if you want to continue with step 2, you must use another BERT implementation from HuggingFace.)
2. Add a final Dense layer.
3. Fine-tune the last layer of the network on the supervised training set for N epochs.
4. Plot the learning curves on the training and validation sets. After how many epochs should we stop the training?
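
Steps 2 and 3 can be sketched as a plain `BertModel` encoder with frozen weights and a single trainable linear layer on top. This is one possible reading of the task, not a definitive implementation: the class name `BertWithDenseHead`, the choice of `pooler_output` as the sentence representation, the tiny randomly initialized configuration (used so the sketch runs without downloading weights), and `num_labels=7` are all assumptions. In the real task the encoder would presumably be loaded with `BertModel.from_pretrained('bert-base-uncased')`.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel


class BertWithDenseHead(nn.Module):
    """Frozen BERT encoder followed by one trainable Dense (linear) layer."""

    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder
        # Freeze every encoder weight so only the head is fine-tuned.
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output is the transformed [CLS] vector, shape (batch, hidden_size).
        return self.head(outputs.pooler_output)


# Tiny random encoder (assumption) so the sketch runs offline.
tiny_config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)
sketch_model = BertWithDenseHead(BertModel(tiny_config), num_labels=7)

# Only the head's parameters are handed to the optimizer.
trainable = [p for p in sketch_model.parameters() if p.requires_grad]
sketch_optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# Dummy batch: 4 sequences of 10 token ids each.
dummy_ids = torch.randint(0, 100, (4, 10))
dummy_mask = torch.ones_like(dummy_ids)
logits = sketch_model(dummy_ids, dummy_mask)
print(logits.shape)
```

Because the encoder is frozen, each epoch only updates the weight and bias of the linear head, which keeps training cheap compared with updating all BERT parameters.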

0. **Install Dependencies**
1. ** ... **
2. ** ... **

<center><b><font size=5>Install Dependencies</font></b></center>

In [1]:
!python ../scripts/install_dependencies.py 

Installing packages for section4: transformers
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting regex!=2019.12.17
  Downloading regex-2024.4.16-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (761 kB)
Collecting filelock
  Downloading filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting safetensors>=0.3.1
  Downloading safetensors-0.4.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_

<center><b><font size=5>Name</font></b></center>

text

In [2]:
!pip install PyArrow

Collecting PyArrow
  Using cached pyarrow-12.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.1 MB)
Installing collected packages: PyArrow
Successfully installed PyArrow-12.0.1


In [3]:
!pip install torch

Collecting torch
  Using cached torch-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (887.5 MB)
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Using cached nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Using cached nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
Collecting nvidia-cudnn-cu11==8.5.0.96
  Using cached nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
Collecting nvidia-cublas-cu11==11.10.3.66
  Using cached nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
Installing collected packages: nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cublas-cu11, nvidia-cudnn-cu11, torch
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.1


In [4]:
!pip install transformers

Collecting transformers
  Using cached transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting filelock
  Using cached filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting regex!=2019.12.17
  Using cached regex-2024.4.16-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (761 kB)
Collecting safetensors>=0.3.1
  Using cached safetensors-0.4.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (436 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Using cached tokenizers-0.13.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.14.1
  Using cached huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Installing collected packages: tokenizers, safetensors, regex, filelock, huggingface-hub, transformers
Successfully installed filelock-3.12.2 huggingface-hub-0.16.4 regex-2024.4.16 safetensors-0.4.5 tokenizers-0.13.3 transformers-4.30.2


In [5]:
import pandas as pd
import torch
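from transformers import BertTokenizer, BertForSequenceClassification
# transformers.AdamW is deprecated; use the PyTorch optimizer instead
# (note: its default weight_decay is 0.01 rather than 0.0).
from torch.optim import AdamW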
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from torch.utils.data import DataLoader, Dataset
import matplotlib.pyplot as plt


In [None]:

# 1. Load Dataset
print("Loading the dataset...")
df = pd.read_csv("ssh_attacks_dataset.csv")
print("Dataset loaded successfully!")
print(f"Dataset size: {df.shape[0]} rows")

# 2. Preprocess Set_Fingerprint column (multi-label encoding)
print("Preprocessing 'Set_Fingerprint' column...")
df['Set_Fingerprint'] = df['Set_Fingerprint'].apply(
    lambda x: [intent.strip() for intent in x.split(',')]
)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['Set_Fingerprint'])
print(f"Classes identified: {mlb.classes_}")

# 3. Train-test split
print("Splitting the data into training and validation sets...")
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['full_session'], y, test_size=0.2, random_state=42
)
print("Data split complete.")

# 4. Tokenization
print("Tokenizing the text data...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_texts = train_texts.fillna("").astype(str)
val_texts = val_texts.fillna("").astype(str)

train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(list(val_texts), truncation=True, padding=True, max_length=128)
print("Tokenization complete.")

# 5. Create Custom Dataset Class
print("Creating the custom dataset class...")
class SSHDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings['input_ids'][idx]),
            'attention_mask': torch.tensor(self.encodings['attention_mask'][idx]),
            'labels': torch.tensor(self.labels[idx], dtype=torch.float)
        }

    def __len__(self):
        return len(self.labels)

print("Custom dataset class created.")

# 6. Prepare DataLoader
print("Creating DataLoaders for training and validation...")
train_dataset = SSHDataset(train_encodings, train_labels)
val_dataset = SSHDataset(val_encodings, val_labels)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
print("DataLoaders are ready.")

# 7. Initialize the Model
print("Initializing the BERT model for sequence classification...")
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=y.shape[1])
model.to(device)
print("Model initialized and moved to device:", device)

# 8. Optimizer and Loss
print("Setting up optimizer and loss function...")
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.BCEWithLogitsLoss()
print("Optimizer and loss function are ready.")

# 9. Training Loop
print("Starting the training process...")
train_loss_list, val_loss_list = [], []

for epoch in range(5):  # Fine-tune for 5 epochs
    print(f"Epoch {epoch+1} / 5")
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids, attention_mask, labels = (
            batch['input_ids'].to(device),
            batch['attention_mask'].to(device),
            batch['labels'].to(device),
        )
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    train_loss_list.append(total_loss / len(train_loader))
    print(f"Training loss: {train_loss_list[-1]:.4f}")

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids, attention_mask, labels = (
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device),
                batch['labels'].to(device),
            )
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs.logits, labels)
            val_loss += loss.item()
    val_loss_list.append(val_loss / len(val_loader))
    print(f"Validation loss: {val_loss_list[-1]:.4f}")

print("Training complete!")

# 10. Plot Learning Curves
print("Plotting the learning curves...")
plt.plot(range(1, 6), train_loss_list, label="Training Loss")
plt.plot(range(1, 6), val_loss_list, label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
print("Learning curves plotted successfully.")
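
To answer the question of after how many epochs training should stop, one common heuristic is to pick the epoch with the lowest validation loss, giving up once the loss has failed to improve for a few consecutive epochs (the patience). The helper below is a sketch of that heuristic; the function name `best_epoch`, the `patience` default, and the example loss values are hypothetical and are not results from this notebook.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss seen
    before `patience` consecutive epochs fail to improve on it."""
    best_idx, best_loss, waited = 0, float("inf"), 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_idx, best_loss, waited = idx, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx + 1


# Made-up validation losses that start rising after the third epoch.
example_val_losses = [0.42, 0.31, 0.27, 0.29, 0.33]
print(best_epoch(example_val_losses))  # 3
```

In this notebook it could be called as `best_epoch(val_loss_list)` once training finishes. A validation loss that rises while the training loss keeps falling is the usual sign of overfitting.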
