# BERT Training for Phishing Email Detection

This notebook fine-tunes a BERT model for phishing email detection. Legitimate emails come from the Enron dataset and phishing emails from a public phishing dataset; the combined data is used for both training and evaluation.

In [1]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import os

# Check if GPU is available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Using device: {device}')

## Load and Preprocess Data

We load the Enron and phishing CSV files, label each email (0 = legitimate, 1 = phishing), combine them into one DataFrame, and create a stratified 80/20 train/test split.

In [2]:
# Load datasets
enron_data = pd.read_csv(os.path.join('..', 'data', 'enron', 'enron_emails.csv'))
phishing_data = pd.read_csv(os.path.join('..', 'data', 'phishing', 'phishing_emails.csv'))

# Combine datasets
enron_data['label'] = 0  # Legitimate emails
phishing_data['label'] = 1  # Phishing emails
data = pd.concat([enron_data, phishing_data], ignore_index=True)

# Shuffle and split the dataset
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42, stratify=data['label'])
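
The cell above labels and splits the data but does not clean it, and the tokenizer used later cannot handle missing text values. The following is a minimal sketch of a cleaning step followed by the same stratified split, run on a small made-up DataFrame. The helper name `clean_emails` and the toy rows are hypothetical, and the sketch assumes the CSV files expose the email body in a `text` column, as the tokenization cell does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined DataFrame (hypothetical rows, not real data).
toy = pd.DataFrame({
    "text": [
        "meeting at 10", "verify your account", None,
        "quarterly report attached", "you won a prize", "lunch tomorrow?",
        "reset your password now", "see the attached agenda",
        "project status update", "invoice overdue, click here",
        "verify your account", "urgent: confirm your login",
    ],
    "label": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1],
})

def clean_emails(df):
    """Drop rows with missing or blank text, then drop exact duplicate emails."""
    df = df.dropna(subset=["text"])
    df = df[df["text"].str.strip() != ""]
    return df.drop_duplicates(subset="text").reset_index(drop=True)

cleaned = clean_emails(toy)

# Same split settings as the notebook: 20% test, fixed seed, stratified by label.
train_df, test_df = train_test_split(
    cleaned, test_size=0.2, random_state=42, stratify=cleaned["label"]
)

print(cleaned["label"].value_counts().to_dict())
print(len(train_df), len(test_df))
```

Removing duplicates before splitting also reduces the chance that the same email appears in both the training and the test set, which would inflate the evaluation scores.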

## Tokenization

We use the BERT tokenizer to convert each email's text into the `input_ids` and `attention_mask` tensors that BERT expects.

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

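# Create Dataset objects (preserve_index=False keeps the shuffled pandas index out of the dataset)
train_dataset = Dataset.from_pandas(train_data[['text', 'label']], preserve_index=False)
test_dataset = Dataset.from_pandas(test_data[['text', 'label']], preserve_index=False)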

# Tokenize datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
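
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch of what fixed-length encoding does to a sequence of token ids. It is an illustration, not the tokenizer's actual implementation: the function name `encode_fixed_length` is hypothetical, the token ids in the examples are arbitrary, and only the special-token ids (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) are taken from the `bert-base-uncased` vocabulary.

```python
# Special-token ids in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_fixed_length(token_ids, max_length):
    """Truncate and pad a list of token ids to exactly max_length positions."""
    # Truncation: keep room for the [CLS] and [SEP] tokens.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    # Padding: fill the remaining positions with [PAD]; the mask marks real tokens.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = encode_fixed_length([7592, 2088], max_length=6)
long = encode_fixed_length(list(range(1000, 1010)), max_length=6)

print(short)
print(long)
```

Because the tokenization cell does not pass `max_length`, every email is padded to the model's maximum of 512 tokens. For corpora made mostly of short emails, passing a smaller `max_length` to the tokenizer may reduce memory use and training time, at the cost of truncating longer messages.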

## Model Training

We will now set up the BERT model for sequence classification and train it using the Trainer API.

In [4]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.to(device)

training_args = TrainingArguments(
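    output_dir=os.path.join('..', 'models', 'fine_tuned_bert_model'),  # portable path; backslashes in a plain string act as escapes
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases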
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
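
The `Trainer` above is created without a `compute_metrics` function, so the per-epoch evaluation reports only the loss. For an imbalanced task such as phishing detection, precision and recall on the phishing class are usually more informative. The sketch below shows one way such a function could look, using only NumPy. It assumes label 1 means phishing, as set in the preprocessing cell.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy plus precision, recall, and F1 for the phishing class (label 1)."""
    logits, labels = eval_pred
    preds = np.argmax(np.asarray(logits), axis=-1)
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))  # phishing caught
    fp = int(np.sum((preds == 1) & (labels == 0)))  # legitimate flagged as phishing
    fn = int(np.sum((preds == 0) & (labels == 1)))  # phishing missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Quick check on hand-made logits (two correct and two wrong predictions).
example = compute_metrics((np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [2.0, 0.0]]),
                           np.array([0, 1, 0, 1])))
print(example)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the returned values then appear in the evaluation logs with an `eval_` prefix.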

## Save the Model

After training, we will save the fine-tuned model for later use.

In [5]:
# Build the save path portably instead of hard-coding backslashes
save_dir = os.path.join('..', 'models', 'fine_tuned_bert_model')
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
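
A note on save paths: in a regular Python string literal, a backslash followed by `f` is parsed as a form-feed character, so a Windows-style path containing `\fine_tuned_bert_model` does not hold the characters it appears to. The sketch below demonstrates the effect and shows one portable alternative using `pathlib`; the variable names are illustrative.

```python
from pathlib import Path

# In a regular string literal, "\f" becomes a single form-feed character.
literal = "models\fine_tuned_bert_model"
has_form_feed = "\x0c" in literal

# A raw string keeps the backslash, but the path is still Windows-specific.
raw = r"models\fine_tuned_bert_model"

# pathlib builds the path from components and picks the separator for the current OS.
save_dir = Path("..") / "models" / "fine_tuned_bert_model"

print(has_form_feed)
print(len(literal), len(raw))
print(save_dir.parts)
```

Path objects like `save_dir` are path-like, so they can generally be passed wherever a directory string is expected.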

## Conclusion

In this notebook, we have fine-tuned a BERT model for phishing email detection. The model is now ready for evaluation and integration into the Streamlit application.
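
For integration, the model's two output logits per email have to be turned into a decision. The following is a minimal sketch of that step, assuming the logits arrive as an array of shape `(n_emails, 2)` and that index 1 is the phishing class, as in the labels used for training. The function names and the 0.5 default threshold are illustrative choices, not part of the notebook.

```python
import numpy as np

def phishing_probability(logits):
    """Softmax over the two class logits; returns the probability of label 1."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[:, 1]

def classify(logits, threshold=0.5):
    """Label each email as 'phishing' or 'legitimate' by thresholding the probability."""
    return ["phishing" if p >= threshold else "legitimate"
            for p in phishing_probability(logits)]

decisions = classify([[1.0, 0.0], [-2.0, 2.0]])
print(decisions)
```

Raising the threshold trades missed phishing emails for fewer false alarms on legitimate mail, so it is worth tuning on held-out data rather than fixing it at 0.5.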