
# 🤖 Phase 2: Adaptive Deep NLP with BERT/DistilBERT

This phase enhances the spear-phishing detection framework using transformer-based models for better contextual understanding.

---

## 🎯 Goal

- Improve detection accuracy, especially on **nuanced and context-aware spear-phishing attacks**
- Replace or complement traditional models with **BERT-family transformers**
- Build an **updatable, fine-tuned classification system**

---

## 🧠 Recommended Models

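| Model           | Description |
|-----------------|-------------|
| DistilBERT      | Distilled version of BERT; smaller and faster, a good default for production |
| BERT (base)     | Larger than DistilBERT; typically more accurate but needs more compute |
| RoBERTa         | BERT with a more robustly optimized pretraining recipe; often more accurate than BERT |
| CamemBERT, etc. | Language-specific models (CamemBERT targets French); optional, for non-English email |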
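
Because these models expose the same sequence-classification interface in `transformers`, trying a different one can come down to changing a checkpoint name. The sketch below assumes the Hugging Face Hub checkpoint IDs shown; the `CHECKPOINTS` mapping and the `load_classifier` helper are hypothetical names used for illustration, not part of the project.

```python
# Hypothetical mapping from the table's model names to Hub checkpoint IDs.
# The specific checkpoints are assumptions; any compatible checkpoint would do.
CHECKPOINTS = {
    "DistilBERT": "distilbert-base-uncased",
    "BERT (base)": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "CamemBERT": "camembert-base",  # French; swap in another language model as needed
}


def load_classifier(model_name: str, num_labels: int = 2):
    """Return (tokenizer, model) for one of the entries in CHECKPOINTS."""
    # Imported lazily so the mapping can be inspected without loading transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = CHECKPOINTS[model_name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Adds a freshly initialized classification head with `num_labels` outputs.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_classifier("DistilBERT")` downloads the checkpoint on first use. Some checkpoints may need extra packages for their tokenizer (for example `sentencepiece`), so treat the mapping as a starting point.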

---

## 🧱 Architecture Plan

```
+----------------+     +----------------+     +-------------------------------+     +----------------+
| Raw Email Text | --> | Preprocessing  | --> | Tokenization (BERT tokenizer) | --> | BERT Encoder   |
|                |     | (clean text)   |     | Input IDs + Attention Mask    |     |                |
+----------------+     +----------------+     +-------------------------------+     +----------------+
                                                                                            |
                                                                                            v
                                                                              +--------------------------+
                                                                              | Classification Head      |
                                                                              | (Linear Layer + Softmax) |
                                                                              +--------------------------+
                                                                                            |
                                                                                            v
                                                                              +--------------------------+
                                                                              | Prediction: Phishing?    |
                                                                              +--------------------------+
```
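
The preprocessing stage is only described as "clean text" here, so the following is a minimal sketch of what such a step might look like, not the project's actual cleaning logic. The function name `clean_email_text`, the `[URL]` placeholder, and the choice of cleaning rules are all assumptions.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")       # crude HTML tag matcher
_URL_RE = re.compile(r"https?://\S+")  # http(s) URLs
_WS_RE = re.compile(r"\s+")            # any run of whitespace


def clean_email_text(raw: str) -> str:
    """Strip markup and normalize whitespace while keeping the wording intact."""
    text = _TAG_RE.sub(" ", raw)          # drop tags (and any URLs inside attributes)
    text = html.unescape(text)            # decode entities such as &amp; and &nbsp;
    text = _URL_RE.sub("[URL]", text)     # assumed placeholder for visible links
    return _WS_RE.sub(" ", text).strip()  # collapse whitespace and trim the ends


print(clean_email_text("<p>Dear  Bob,</p>\n<b>Verify&nbsp;now</b> at http://evil.example/login"))
# prints: Dear Bob, Verify now at [URL]
```

Case is left untouched because an uncased tokenizer lowercases text itself. Whether to mask URLs is a design choice: link text can be a useful phishing signal, so it may be worth comparing both variants.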

---

## 🛠️ Tools & Libraries

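- `transformers` (Hugging Face; pretrained models, tokenizers, and the `Trainer` API)
- `datasets` (optional; dataset loading and batched preprocessing)
- `scikit-learn` (imported as `sklearn`; train/validation split and evaluation metrics)
- `torch` (PyTorch backend)
- `pandas` (loading the cleaned CSV)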

---

## 🧪 Example Code (HuggingFace Style)

```python
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load your cleaned dataset (assumes `label` holds integer class IDs, e.g. 0/1)
df = pd.read_csv("data/phishing_email_clean.csv")
X_train, X_val, y_train, y_val = train_test_split(
    df["clean_text"], df["label"], test_size=0.2, random_state=42
)

# Initialize tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Tokenize data; truncation caps each email at the model's maximum input length
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
val_encodings = tokenizer(list(X_val), truncation=True, padding=True)


# Wrap the encodings and labels in a torch Dataset
class EmailDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One example: input_ids and attention_mask tensors plus its label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = EmailDataset(train_encodings, list(y_train))
val_dataset = EmailDataset(val_encodings, list(y_val))

# Model: pretrained encoder plus a new two-class classification head
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Training
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    # Named `evaluation_strategy` in older transformers releases
    eval_strategy="epoch",
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
```
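
By default the `Trainer` reports only the evaluation loss. To get the metrics needed for benchmarking, a `compute_metrics` callback can be supplied. The sketch below is one possible version using `scikit-learn`; it assumes a binary task where label `1` means phishing, which the source does not state explicitly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Turn the (logits, labels) pair passed by the Trainer into scalar metrics."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # highest-scoring class per example
    # Assumption: label 1 is the phishing (positive) class.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", pos_label=1, zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passing `compute_metrics=compute_metrics` to `Trainer(...)` adds these values to each evaluation report. For phishing detection, recall on the phishing class is usually worth watching alongside accuracy, since a missed attack tends to cost more than a false alarm.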

---

## ✅ Outcome of Phase 2

- A BERT-family model fine-tuned on the spear-phishing dataset
- The model evaluated and benchmarked against the MVP
- The model ready for integration into a scalable pipeline

