In [None]:
# Install necessary libraries
!pip install -q transformers datasets evaluate accelerate huggingface_hub wandb

In [2]:
from transformers import (AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments,
                          DataCollatorWithPadding, Trainer, pipeline)
import torch, wandb, evaluate, huggingface_hub
from datasets import load_dataset
from tqdm.auto import tqdm
import numpy as np

# Finetune BERT For Phishing Detection

BERT is a model that pre-trains deep bidirectional representations from unlabeled text using masked language modeling and next sentence prediction objectives. It can be fine-tuned with just one additional output layer to create state-of-the-art models for various natural language processing tasks, such as text classification, token classification, and question answering. BERT suits classification tasks such as phishing detection because it captures the context and semantics of a text from both the left and the right direction and learns to predict the correct label on top of its pre-trained knowledge. BERT also prepends a special [CLS] token to every input; its final hidden state serves as a representation of the whole sequence and is fed to the classifier layer. BERT has achieved strong results on text classification benchmarks such as GLUE, which includes the SST-2 and CoLA tasks.

This project will show how to:

- Finetune BERT on a custom phishing dataset to determine whether a text is phishing or benign.
- Use the finetuned model for inference.

We need to log in to a Hugging Face account. The `bert-large-uncased` checkpoint is public, but pushing the finetuned model to the Hub later requires authentication:

In [None]:
huggingface_hub.login()

To monitor metrics during training and evaluation, we can also log in to a Weights & Biases (Wandb) account:

In [None]:
wandb.login()

## Load phishing dataset

We start by loading the phishing dataset:

In [4]:
dataset = load_dataset("ealvaradob/phishing-dataset")

Let's see a sample of the dataset:

In [5]:
dataset['train'][0]

{'label': 1, 'text': 'https://vpoasss-ne-inbex.gynsujh.cn/'}

There are two fields:

- `text`: Text that can contain URLs, HTML code, emails, and SMS messages.
- `label`: 0 for benign, 1 for phishing

## Tokenize Dataset

The next step is to load a BERT tokenizer to preprocess the `text` field. BERT expects input data in a specific format, and the tokenizer is responsible for converting the text into that format. It first splits the text into tokens, the basic units that the model works with. It then adds special tokens, such as [CLS] and [SEP], to mark the beginning and the end of the text or the separation between two sentences. Finally, it converts the tokens into numerical indices that correspond to the model's vocabulary, and these indices are fed to the model as inputs. Using the same tokenizer that was used to pre-train the model ensures that the model can process the text correctly and produce meaningful outputs.

In [6]:
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
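
To make the three tokenizer steps concrete, here is a toy sketch in plain Python. It is not BERT's tokenizer: whitespace splitting stands in for WordPiece, the word ids and the `toy_encode` helper are made up for illustration, and only the special-token ids follow the `bert-*-uncased` vocabularies.

```python
# Toy illustration only: real BERT uses WordPiece subword splitting and a
# vocabulary of roughly 30,000 entries. The word ids below are made up; the
# special-token ids (0, 100, 101, 102) follow the bert-*-uncased vocabularies.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "verify": 1000, "your": 1001, "account": 1002, "now": 1003}

def toy_encode(text):
    # 1) Lowercase and split into tokens (whitespace split stands in for WordPiece).
    tokens = text.lower().split()
    # 2) Wrap the sequence in the special tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 3) Map each token to its vocabulary index; unknown tokens become [UNK].
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, input_ids

tokens, input_ids = toy_encode("Verify your account NOW please")
print(tokens)
print(input_ids)
```

The word "please" is missing from the toy vocabulary, so it maps to the [UNK] id. The real tokenizer would instead break an unseen word into known subword pieces where possible.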

We will create a preprocessing function to tokenize `text` and truncate sequences to be no longer than BERT's maximum input length:

In [7]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_dataset = dataset.map(preprocess_function, batched=True)

We will create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [8]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
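
The saving from dynamic padding can be illustrated with a small plain-Python sketch. The `pad_batch` helper and the token ids are hypothetical and only mimic what a padding collator does; the real collator returns tensors and handles more fields.

```python
# Hypothetical tokenized examples of different lengths (ids are made up).
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 7, 102]]
PAD_ID = 0  # id of the [PAD] token in the BERT vocabulary

def pad_batch(sequences, target_len=None, pad_id=PAD_ID):
    # Pad to target_len if given, otherwise to the longest sequence in the batch.
    if target_len is None:
        target_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (target_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

dynamic_ids, dynamic_mask = pad_batch(batch)     # pad to longest in batch
fixed_ids, _ = pad_batch(batch, target_len=512)  # pad to BERT's maximum length

dynamic_cells = sum(len(row) for row in dynamic_ids)  # 3 rows x 7 tokens
fixed_cells = sum(len(row) for row in fixed_ids)      # 3 rows x 512 tokens
print(dynamic_cells, fixed_cells)
```

For this toy batch the model would process 21 token positions with dynamic padding instead of 1,536 with fixed-length padding.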

## Evaluation metrics

Including a metric during training is often helpful for evaluating your model's performance. For phishing detection, the most important metrics are:

- **True-Positive Rate (TPR) or Recall**: The ratio of phishing emails or websites that the model correctly identifies as phishing to the total number of phishing emails or websites. It measures how well the model detects phishing attacks and avoids false negatives. A high TPR means that the model catches most phishing attempts and protects users from falling victim to them.

- **False-Positive Rate (FPR)**: The ratio of legitimate emails or websites that the model incorrectly identifies as phishing to the total number of legitimate emails or websites. It measures how often the model flags benign messages or sites as malicious. A low FPR means that the model avoids unnecessary alerts and reduces user frustration and the security team's workload.
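
Both rates follow directly from the confusion-matrix counts. The sketch below computes them in plain Python on made-up labels and predictions; `tpr_fpr` is a hypothetical helper for illustration, not part of `evaluate`.

```python
def tpr_fpr(labels, predictions):
    # Label convention from the dataset: 1 = phishing (positive), 0 = benign.
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # phishing caught
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # phishing missed
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # benign flagged
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # benign passed
    tpr = tp / (tp + fn)  # share of phishing samples that were detected
    fpr = fp / (fp + tn)  # share of benign samples that were flagged
    return tpr, fpr

# Made-up example: 4 phishing and 6 benign samples.
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(labels, predictions)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}")
```

Here one of four phishing samples is missed (TPR 0.75) and one of six benign samples is flagged (FPR about 0.167).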

In [9]:
metrics = evaluate.combine(["accuracy", "precision", "recall", "ealvaradob/false_positive_rate"])

The following function converts the model's logits into predicted class ids and passes them, together with the labels, to `metrics.compute` to calculate accuracy, precision, recall, and FPR:

In [10]:
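def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the reference labels
    predictions, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)
    return metrics.compute(predictions=predictions, references=labels)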
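
As a small illustration of the `argmax` step, the sketch below uses hypothetical logits for three texts. The numbers are made up, and accuracy is computed by hand rather than through `evaluate`.

```python
import numpy as np

# Hypothetical logits for three texts; columns are [benign, phishing].
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  3.0],
                   [ 0.2,  0.1]])
labels = np.array([0, 1, 1])

# The highest-scoring column per row becomes the predicted class id.
predictions = np.argmax(logits, axis=1)
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

The third text is narrowly scored as benign although its label is phishing, so two of the three predictions are correct.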

## Training

Before we start training our model, we'll create a map of the expected ids to their labels with `id2label` and `label2id`:

In [11]:
id2label = {0: "benign", 1: "phishing"}
label2id = {"benign": 0, "phishing": 1}

In addition, we'll start monitoring with Wandb

In [12]:
wandb.init(project="BERT-FINETUNING")

[34m[1mwandb[0m: Currently logged in as: [33mealvarado[0m. Use [1m`wandb login --relogin`[0m to force relogin


Load the pre-trained BERT model (`bert-large-uncased`) from Hugging Face:

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The pre-trained model, whose classification head is newly initialized (see the warning above), achieves an accuracy of only 51%, which is inadequate. We can improve its performance by fine-tuning it on our dataset. To do that, we have to specify the training arguments.

In [14]:
training_args = TrainingArguments(
    output_dir="bert-finetuned-phishing",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    torch_compile=True,
    fp16=True,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model='recall',
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [15]:
# Train model
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | False Positive Rate |
|---|---|---|---|---|---|---|
| 1 | 0.1457 | 0.123448 | 0.961916 | 0.958353 | 0.950845 | 0.030036 |
| 2 | 0.0771 | 0.13446 | 0.969093 | 0.972727 | 0.953303 | 0.019428 |
| 3 | 0.0249 | 0.168286 | 0.970387 | 0.9664 | 0.963134 | 0.024341 |



TrainOutput(global_step=11598, training_loss=0.09884800195981766, metrics={'train_runtime': 2599.746, 'train_samples_per_second': 71.371, 'train_steps_per_second': 4.461, 'total_flos': 1.7224450545564864e+17, 'train_loss': 0.09884800195981766, 'epoch': 3.0})
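
As a sanity check on the log above, the step count can be related to the size of the training set. This is a rough sketch that assumes a single device and no gradient accumulation. Neither is stated in the notebook, although the ratio of reported samples per second to steps per second is consistent with an effective batch size of 16.

```python
import math

# Values reported in the training output and the training arguments.
global_step = 11598
num_train_epochs = 3
per_device_train_batch_size = 16

steps_per_epoch = global_step // num_train_epochs

# Assumption: one device and no gradient accumulation, so that
# steps_per_epoch == ceil(num_examples / batch_size).
max_examples = steps_per_epoch * per_device_train_batch_size
min_examples = (steps_per_epoch - 1) * per_device_train_batch_size + 1

# Consistency check: samples/s divided by steps/s gives the effective batch size.
effective_batch = round(71.371 / 4.461)

print(steps_per_epoch, min_examples, max_examples, effective_batch)
```

Under these assumptions, each epoch runs 3,866 optimizer steps, which corresponds to a training split of between 61,841 and 61,856 examples.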

Once training is completed, we can share the model to the Hub with the `push_to_hub()` method so everyone can use our model:

In [16]:
trainer.push_to_hub()
wandb.finish()

| Metric | Value |
|---|---|
| eval/accuracy | 0.97039 |
| eval/false_positive_rate | 0.02434 |
| eval/loss | 0.16829 |
| eval/precision | 0.9664 |
| eval/recall | 0.96313 |
| eval/runtime | 88.2346 |
| eval/samples_per_second | 175.283 |
| eval/steps_per_second | 10.959 |
| train/epoch | 3.0 |
| train/global_step | 11598.0 |


## Testing model

Now that we've finetuned the model, we can use it for phishing detection! Consider this example text:

In [None]:
text = (
    "Text: Dear hotmail user. We noticed a login to your Hotmail account "
    "from an unrecognized device on Tuesday, August 15, 2023 (GMT-5) 7:32 A.M. "
    "Lima, Peru. Was it you? If so, ignore the rest of this email. If it was not "
    "you, follow the links below to keep your Hotmail account secure and "
    "provide the necessary information to keep your account active. CLICK HERE."
    "Thank you, Hotmail Team."
    "\nURL: https://ec-ec.squarespace.com"
)

The simplest way to try out the finetuned model for phishing detection is to use it in a `pipeline()`. Instantiate a `pipeline` for text classification with your model, and pass your text to it:

In [None]:
classifier = pipeline("text-classification", model="ealvaradob/bert-finetuned-phishing")
classifier(text)

[{'label': 'phishing', 'score': 0.9901213645935059}]
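
The `score` in this output is a probability. For a single-label model like this one, the `text-classification` pipeline applies a softmax to the two logits by default and reports the most probable class with its `id2label` name. The sketch below mimics that post-processing in plain Python. The `postprocess` helper and the logits are hypothetical and are not the model's real values for the text above.

```python
import math

id2label = {0: "benign", 1: "phishing"}

def postprocess(logits):
    # Numerically stable softmax over the class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Report the most probable class and its probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits in the order [benign, phishing].
result = postprocess([-2.0, 2.5])
print(result)
```

A larger gap between the two logits pushes the score closer to 1, which is why a confidently classified text like the example above scores about 0.99.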