## Step 1: Setup and Data Loading

In this step, we set up the environment by importing necessary libraries and loading the processed dataset.
We will use only the email `body` text and its associated `label` (0 = legitimate, 1 = phishing) as inputs for the BERT model.


In [9]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Load the processed dataset
df = pd.read_csv("../data/processed/CEAS_08_feature_engineered.csv")

# Keep only the email body and label
df = df[['body', 'label']]

# Quick preview
print(df.head())
print(df['label'].value_counts())


                                                body  label
0  buck up, your troubles caused by small dimensi...      1
1  \nupgrade your sex and pleasures with these te...      1
2  >+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...      1
3  would anyone object to removing .so from this ...      0
4  \nwelcomefastshippingcustomersupport\nhttp://7...      1
label
1    21827
0    17312
Name: count, dtype: int64


## Step 2: Train/Test Split

We will split the data into training and testing sets.
- 80% of the emails will be used for training the BERT model.
- 20% of the emails will be used for final evaluation.

Stratified sampling will be used so that both classes (phishing and legitimate) appear in the same proportions in the training and testing sets.
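
As a quick sanity check on the split, the expected set sizes can be worked out from the label counts printed in Step 1. The sketch below is plain arithmetic rather than a call into scikit-learn, and it assumes the test share is rounded up to a whole number of emails.

```python
import math

# Label counts printed in Step 1 (1 = phishing, 0 = legitimate)
counts = {1: 21827, 0: 17312}
test_size = 0.2

n_total = sum(counts.values())
# Assumption: the test share is rounded up to a whole number of samples
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

# With stratification, each set should keep roughly this phishing share
phishing_share = counts[1] / n_total

print(f"train={n_train}, test={n_test}")
print(f"phishing share in each set: about {phishing_share:.1%}")
```

If the sizes printed by the split cell differ from these, the input DataFrame has probably changed since Step 1.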


In [10]:
# Perform train/test split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['body'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    stratify=df['label'],
    random_state=42
)

# Quick check
print(f"Number of training samples: {len(train_texts)}")
print(f"Number of testing samples: {len(test_texts)}")


Number of training samples: 31311
Number of testing samples: 7828


## Step 3: Tokenization

We will use a pre-trained BERT tokenizer to convert the email text into numerical tokens that BERT can process.
Each email will be tokenized into:
- Input IDs: the numerical representation of each word piece.
- Attention Masks: a binary mask indicating which tokens should be attended to.

We will use the "bert-base-uncased" model from Hugging Face for tokenization.
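
To make the two outputs concrete, here is a toy sketch of truncation, padding, and the resulting attention mask. It is not the Hugging Face implementation: `encode_batch` is a hypothetical helper, and the word-piece IDs in the example are arbitrary placeholders. Only the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) follow the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def encode_batch(token_id_lists, max_length):
    """Toy stand-in for truncation=True, padding=True (hypothetical helper)."""
    # Truncate each sequence, leaving room for [CLS] and [SEP]
    wrapped = [[CLS_ID] + ids[: max_length - 2] + [SEP_ID] for ids in token_id_lists]
    # Pad every sequence to the longest one in the batch
    longest = max(len(ids) for ids in wrapped)
    input_ids, attention_mask = [], []
    for ids in wrapped:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Placeholder word-piece IDs: one short email and one that needs truncating
batch = encode_batch([[7592, 2088], [7592, 2088, 2023, 2003, 1037, 3231]], max_length=6)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

The short sequence ends in padding zeros with a matching run of zeros in its mask, while the long one is cut to `max_length` and has a mask of all ones.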


In [11]:
from transformers import BertTokenizer

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the training and testing texts
train_encodings = tokenizer(
    train_texts,
    truncation=True,
    padding=True,
    max_length=512,  # Truncate longer emails
    return_tensors='pt'
)

test_encodings = tokenizer(
    test_texts,
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors='pt'
)

# Quick check
print("Example tokenized input IDs:", train_encodings['input_ids'][0])
print("Example attention mask:", train_encodings['attention_mask'][0])


Example tokenized input IDs: tensor([  101,  7314,  2024,  2734,  2005,  9441,  6887, 17175,  2705,  2239,
         2263,  1012,  2000,  5589,  1999,  9441,  6887, 17175,  2705,  2239,
         2263,  1010,  3531,  3967,  2472,  1997,  3296,  3228,  7437, 24185,
         3363,  2386,  2011,  1041,  1011,  5653,  2012,  1041,  4140,  2100,
         7677, 13535,  1030, 27302, 28139,  2361,  1012,  8917,  2030,  2011,
         7026,  2012,  6390,  2620,  1011,  4029,  2581,  1011,  9683,  2692,
         4654,  2102,  1012, 19348,  1012,  1996,  5246,  1997,  9441,  6887,
        17175,  2705,  2239,  2263,  2024,  1024, 14803,  2233,  1016,  2233,
         1023, 28401,  2337,  2423,  2233,  1017,  2233,  2184,  9857,  2015,
         2337,  2656,  2233,  1018,  2233,  2340,  9317,  2015,  2337,  2676,
         2233,  2260,  2000,  2424,  2041,  2062,  2055,  2054,  1005,  1055,
         6230,  2012,  1996, 17463,  1010,  3942,  8299,  1024,  1013,  1013,
         7479,  1012, 27302, 28139,

## Step 4: Model Initialization

We will initialize a BERT model pre-trained on English text ("bert-base-uncased") for a binary classification task (phishing vs. legitimate).
The model will output one raw score (logit) for each class (0 or 1); applying a softmax to these logits gives the class probabilities.
We will also set up the datasets needed for training using the tokenized data.
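
The classification head returns raw scores (logits) rather than probabilities, so a softmax is needed to read them as class probabilities. The sketch below shows that conversion in plain Python; the logit values are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one email: [legitimate, phishing]
logits = [-1.2, 2.3]
probs = softmax(logits)
predicted_label = probs.index(max(probs))

print("probabilities:", [round(p, 3) for p in probs])
print("predicted label:", predicted_label)
```

Taking the index of the largest logit gives the same predicted label as taking the index of the largest probability, so the softmax is only needed when the probabilities themselves are of interest.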


In [19]:
from transformers import BertForSequenceClassification

# Load pre-trained BERT model for sequence classification (binary classification)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Define a custom Dataset class
class PhishingDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create the training and testing datasets
train_dataset = PhishingDataset(train_encodings, train_labels)
test_dataset = PhishingDataset(test_encodings, test_labels)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 5: Training the BERT Model

We will fine-tune the BERT model on the phishing email classification dataset.
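Evaluation and checkpointing are controlled through `TrainingArguments`. The accepted argument names depend on the installed `transformers` version: the cell below uses `evaluation_strategy`, which newer releases have replaced with `eval_strategy`, so it fails with a `TypeError` on the version installed here (4.51.3).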
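
The training arguments in the next cell select the best checkpoint by `f1`. To our understanding, `Trainer` only reports such a metric when it is given a `compute_metrics` function, and it looks the value up under the key `eval_f1`. The following is a minimal sketch of such a function, assuming the evaluation output unpacks into `(logits, labels)` arrays and that label 1 (phishing) is the positive class. The example inputs are invented.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # most likely class per email
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented logits and labels for four emails
example_logits = np.array([[2.0, -1.0], [0.1, 1.5], [0.3, 0.2], [-0.5, 0.9]])
example_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((example_logits, example_labels))
print(metrics)
```

A function like this would be passed to the trainer as `Trainer(..., compute_metrics=compute_metrics)`.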


In [23]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',             # where to save model checkpoints
    num_train_epochs=2,                  # number of training epochs
    per_device_train_batch_size=16,      # batch size for training
    per_device_eval_batch_size=64,       # batch size for evaluation
    warmup_steps=500,                    # number of warmup steps
    weight_decay=0.01,                   # strength of weight decay
    evaluation_strategy="epoch",         # evaluate at the end of each epoch
    save_strategy="epoch",               # save model after each epoch
    load_best_model_at_end=True,          # load the best model (highest metric) at the end
    metric_for_best_model="f1",           # optimize for F1-score
    greater_is_better=True,               # higher F1 is better
    logging_dir='./logs',                 # directory for logs
    logging_steps=10,                     # log every 10 steps
)


# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Fine-tune the model
trainer.train()

TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

In [None]:
import transformers
print(transformers.__version__)

4.51.3
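
The version check points to the cause of the `TypeError`: this release no longer accepts `evaluation_strategy`, and the equivalent argument is named `eval_strategy`. One way to keep the notebook working across versions is to pick the keyword by inspecting the constructor. The sketch below is a suggestion, not part of the original run. `eval_strategy_kwarg` is a hypothetical helper, and the two dataclasses are stand-ins so the example runs without `transformers` installed.

```python
import inspect
from dataclasses import dataclass


def eval_strategy_kwarg(args_cls):
    """Return the evaluation-schedule keyword that args_cls accepts."""
    params = inspect.signature(args_cls.__init__).parameters
    if "eval_strategy" in params:
        return "eval_strategy"
    return "evaluation_strategy"


# Stand-in classes (hypothetical) that mimic the two argument spellings
@dataclass
class OldStyleArgs:
    output_dir: str
    evaluation_strategy: str = "no"


@dataclass
class NewStyleArgs:
    output_dir: str
    eval_strategy: str = "no"


for cls in (OldStyleArgs, NewStyleArgs):
    kwargs = {"output_dir": "./results", eval_strategy_kwarg(cls): "epoch"}
    print(cls.__name__, "->", cls(**kwargs))
```

With the real library, the same pattern would be `TrainingArguments(**{eval_strategy_kwarg(TrainingArguments): "epoch"}, ...)`. If only version 4.51.3 needs to be supported, replacing `evaluation_strategy="epoch"` with `eval_strategy="epoch"` in the training cell should be enough to get past this error.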
