<a href="https://colab.research.google.com/github/oya163/bert-llm/blob/master/cyber_security_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cyber Security Named Entity Recognition (NER) with BERT Fine-Tuning

This project identifies cybersecurity entities in text with a fine-tuned BERT-style model (the `xlm-roberta-large` checkpoint). I use the CyNER dataset, whose labels cover malware, systems, organizations, indicators, and vulnerabilities. Automating this kind of entity extraction is a step towards faster cybersecurity analysis and better threat intelligence.

## 🔧 Environment Setup

In [1]:
!pip install -U huggingface_hub accelerate transformers datasets evaluate seqeval



In [2]:
# Wrap the text in ipython notebook
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# 📊 Dataset Loading & Preprocessing

For this project, I used the MITRE-based CyNER dataset to train a custom NER model. It comes as three plain-text files (train, validation, and test) plus a loading script, `load_ner.py`, that exposes each example as a list of tokens with matching NER tags.

## Download the MITRE (CyNER) Dataset

I download the dataset files directly from the web with `wget`.


In [None]:
!wget https://raw.githubusercontent.com/Aditya2344s/LLM_Aditya/main/train.txt
!wget https://raw.githubusercontent.com/Aditya2344s/LLM_Aditya/main/valid.txt
!wget https://raw.githubusercontent.com/Aditya2344s/LLM_Aditya/main/test.txt
!wget https://raw.githubusercontent.com/Aditya2344s/LLM_Aditya/main/load_ner.py


In [4]:
from datasets import load_dataset

data_files = {
    "train": "train.txt",
    "validation": "valid.txt",
    "test": "test.txt",
}

raw_datasets = load_dataset("load_ner.py", data_files=data_files)

Check the basic structure of the loaded dataset:

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2811
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 813
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 748
    })
})

Check a sample of tokens from the training set:

In [6]:
print(raw_datasets["train"][0]["tokens"])

['Super', 'Mario', 'Run', 'Malware', '#', '2', '–', 'DroidJack', 'RAT', 'Gamers', 'love', 'Mario', 'and', 'Pokemon', ',', 'but', 'so', 'do', 'malware', 'authors', '.']


Check the NER tag IDs of the same sample:

In [7]:
print(raw_datasets["train"][0]["ner_tags"])

[1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]


In [8]:
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-Malware', 'I-Malware', 'B-System', 'I-System', 'B-Organization', 'I-Organization', 'B-Indicator', 'I-Indicator', 'B-Vulnerability', 'I-Vulnerability'], id=None), length=-1, id=None)

Check the label names in the dataset:

In [9]:
label_names = ner_feature.feature.names
label_names

['O',
 'B-Malware',
 'I-Malware',
 'B-System',
 'I-System',
 'B-Organization',
 'I-Organization',
 'B-Indicator',
 'I-Indicator',
 'B-Vulnerability',
 'I-Vulnerability']

Display the tokens with their labels:

In [10]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]

# Print each token next to its label name, one pair per line
for word, label in zip(words, labels):
    print(word, '\t', label_names[label])

Super 	 B-Malware
Mario 	 I-Malware
Run 	 I-Malware
Malware 	 I-Malware
# 	 O
2 	 O
– 	 O
DroidJack 	 B-Malware
RAT 	 I-Malware
Gamers 	 O
love 	 O
Mario 	 B-System
and 	 O
Pokemon 	 B-System
, 	 O
but 	 O
so 	 O
do 	 O
malware 	 O
authors 	 O
. 	 O
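To read the tags as entities rather than token by token, the BIO labels can be grouped into spans. The following is a minimal sketch, not part of the original notebook: `bio_to_spans` is a hypothetical helper, and the token and tag lists are copied from the outputs above so that it runs without the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag (or a stray I- tag of another type) opens a new span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the span that is currently open.
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


label_names = ['O', 'B-Malware', 'I-Malware', 'B-System', 'I-System',
               'B-Organization', 'I-Organization', 'B-Indicator', 'I-Indicator',
               'B-Vulnerability', 'I-Vulnerability']
tokens = ['Super', 'Mario', 'Run', 'Malware', '#', '2', '–', 'DroidJack', 'RAT',
          'Gamers', 'love', 'Mario', 'and', 'Pokemon', ',', 'but', 'so', 'do',
          'malware', 'authors', '.']
tag_ids = [1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]

spans = bio_to_spans(tokens, [label_names[i] for i in tag_ids])
for entity_type, text in spans:
    print(entity_type, '\t', text)
```

For this sample the sketch yields two Malware spans and two System spans.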


## Tokenization

In [11]:
from transformers import AutoTokenizer

model_checkpoint = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [12]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
print(inputs.tokens())

['<s>', '▁Super', '▁Mario', '▁Run', '▁Mal', 'ware', '▁#', '▁2', '▁–', '▁Dro', 'id', 'Jack', '▁', 'RAT', '▁Gam', 'ers', '▁love', '▁Mario', '▁and', '▁Pokemon', '▁', ',', '▁but', '▁so', '▁do', '▁malware', '▁author', 's', '▁', '.', '</s>']


In [13]:
print(inputs.word_ids())

[None, 0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11, 12, 13, 14, 14, 15, 16, 17, 18, 19, 19, 20, 20, None]
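The `word_ids()` output maps each subword piece back to the index of the word it came from, with `None` for the special tokens. As a rough illustration, the sketch below regroups the pieces per word; `group_subwords` is a hypothetical helper, and both lists are copied from the two outputs above.

```python
from itertools import groupby

tokens = ['<s>', '▁Super', '▁Mario', '▁Run', '▁Mal', 'ware', '▁#', '▁2', '▁–',
          '▁Dro', 'id', 'Jack', '▁', 'RAT', '▁Gam', 'ers', '▁love', '▁Mario',
          '▁and', '▁Pokemon', '▁', ',', '▁but', '▁so', '▁do', '▁malware',
          '▁author', 's', '▁', '.', '</s>']
word_ids = [None, 0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11, 12, 13,
            14, 14, 15, 16, 17, 18, 19, 19, 20, 20, None]


def group_subwords(tokens, word_ids):
    """Return one list of subword pieces per original word, skipping special tokens."""
    pairs = [(w, t) for w, t in zip(word_ids, tokens) if w is not None]
    return [[t for _, t in group] for _, group in groupby(pairs, key=lambda p: p[0])]


pieces = group_subwords(tokens, word_ids)
print(len(pieces))  # 21, one entry per original word
print(pieces[3])    # ['▁Mal', 'ware']
print(pieces[7])    # ['▁Dro', 'id', 'Jack']
```

Words that split into several pieces, such as `Malware` and `DroidJack`, are the ones whose labels have to be spread across more than one token.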


## Data Preprocessing

In [14]:
# Align the number of labels and the tokens
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels
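The `label % 2 == 1` test works because of how this dataset orders its labels: every `B-` tag has an odd ID and its `I-` counterpart is the next ID. Here is a small check of that assumption, with `continuation_label` as a hypothetical helper name and the label list copied from the earlier output:

```python
label_names = ['O', 'B-Malware', 'I-Malware', 'B-System', 'I-System',
               'B-Organization', 'I-Organization', 'B-Indicator', 'I-Indicator',
               'B-Vulnerability', 'I-Vulnerability']


def continuation_label(label_id):
    """Map a B-XXX id to its I-XXX id; leave O and I-XXX ids unchanged."""
    return label_id + 1 if label_id % 2 == 1 else label_id


# Every odd id is a B- tag, and the id right after it is the matching I- tag.
for label_id, name in enumerate(label_names):
    if label_id % 2 == 1:
        assert name.startswith("B-")
        assert label_names[label_id + 1] == "I-" + name[2:]

print([label_names[continuation_label(i)] for i in range(len(label_names))])
```

With a label list in a different order, the parity trick would silently produce wrong labels, so a check like this is worth running before reusing the function on another dataset.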

In [15]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]
[-100, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [16]:
# Helper function to tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [17]:
# Tokenize all the examples from the datasets
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Map:   0%|          | 0/2811 [00:00<?, ? examples/s]

Map:   0%|          | 0/813 [00:00<?, ? examples/s]

Map:   0%|          | 0/748 [00:00<?, ? examples/s]

# Fine Tuning

## Data Collation

Prepare the data collator, which pads the inputs and labels of each batch to a common length:

In [18]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    1,    2,    2,    2,    2,    0,    0,    0,    1,    2,    2,
            2,    2,    0,    0,    0,    3,    0,    3,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    3,
            1,    2,    0,    0,    0,    0,    0,    0,    0,    3,    4,    4,
            0,    0,    3,    0,    0, -100, -100]])

In [19]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 2, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 0, 0, 3, 0, 0, -100]
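The second row of the tensor ends in two `-100` values while the raw label list ends in one, because the collator pads the shorter sequence, and positions labelled `-100` are skipped by the loss. Below is a simplified pure-Python sketch of that padding step (an illustration of the idea, not the collator's actual code; `pad_labels` is a hypothetical helper):

```python
IGNORE_INDEX = -100  # label value that the loss function ignores


def pad_labels(label_lists, pad_value=IGNORE_INDEX):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


first = [-100, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 3, 0, 3,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
second = [-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 2, 0, 0, 0, 0, 0, 0,
          0, 3, 4, 4, 0, 0, 3, 0, 0, -100]

padded = pad_labels([first, second])
print([len(row) for row in padded])  # [31, 31]
print(padded[1][-2:])                # [-100, -100]
```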


## Setup Evaluation

In [20]:
import evaluate
import numpy as np

metric = evaluate.load("seqeval")


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]
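`seqeval` scores whole entity spans, so a prediction counts only if both its type and its boundaries match, whereas the accuracy it reports is per token. The sketch below is a simplified re-implementation for illustration only (`extract_spans` and `entity_scores` are hypothetical helpers, and the two tag sequences are made-up toy data). It shows how one truncated entity halves the F1 while token accuracy stays at 90%.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in one BIO tag list."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" closes an open span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            spans.add((current, start, i))
            current = None
        if tag != "O" and not continues:
            start, current = i, tag[2:]
    return spans


def entity_scores(references, predictions):
    """Entity-level precision, recall and F1 with exact span matching."""
    gold = predicted = correct = 0
    for ref_tags, pred_tags in zip(references, predictions):
        ref_spans, pred_spans = extract_spans(ref_tags), extract_spans(pred_tags)
        gold += len(ref_spans)
        predicted += len(pred_spans)
        correct += len(ref_spans & pred_spans)
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: the model finds the malware name but cuts it one token short.
references = [["B-Malware", "I-Malware", "O", "O", "O", "B-System", "O", "O", "O", "O"]]
predictions = [["B-Malware", "O", "O", "O", "O", "B-System", "O", "O", "O", "O"]]

token_accuracy = sum(
    r == p for ref, pred in zip(references, predictions) for r, p in zip(ref, pred)
) / sum(len(ref) for ref in references)

print(entity_scores(references, predictions))  # (0.5, 0.5, 0.5)
print(token_accuracy)                          # 0.9
```

Because most tokens are tagged `O`, accuracy tends to sit well above the entity-level scores, which makes F1 the more informative number for this task.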

In [21]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [22]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
model.config.num_labels

11

## Training

In [24]:
from huggingface_hub import notebook_login

notebook_login()


In [25]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

Epoch  Training Loss  Validation Loss  Precision  Recall    F1        Accuracy
-----  -------------  ---------------  ---------  --------  --------  --------
1      No log         0.166373         0.628019   0.663265  0.645161  0.96226
2      0.205700       0.126472         0.700483   0.739796  0.719603  0.97015
3      0.063800       0.13374          0.724726   0.758929  0.741433  0.971442


TrainOutput(global_step=1056, training_loss=0.1305677407618725, metrics={'train_runtime': 811.0243, 'train_samples_per_second': 10.398, 'train_steps_per_second': 1.302, 'total_flos': 1373294813753520.0, 'train_loss': 0.1305677407618725, 'epoch': 3.0})
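The step count and throughput in this output can be checked by hand. The sketch below assumes the library defaults of `per_device_train_batch_size=8` and `logging_steps=500` on a single device, since the cell above does not override them:

```python
import math

num_train_examples = 2811   # size of the train split shown earlier
num_epochs = 3
batch_size = 8              # assumption: library default, single device
logging_steps = 500         # assumption: library default

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 352 1056

# The first training-loss log lands after epoch 1 has ended, hence "No log" there.
print(logging_steps > steps_per_epoch)  # True

train_runtime = 811.0243    # seconds, from TrainOutput
samples_per_second = num_train_examples * num_epochs / train_runtime
print(round(samples_per_second, 3))  # 10.398
```

Under those assumptions the numbers agree with `global_step=1056` and `train_samples_per_second` in the output.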

In [26]:
trainer.evaluate()

{'eval_loss': 0.13373979926109314,
 'eval_precision': 0.7247259439707674,
 'eval_recall': 0.7589285714285714,
 'eval_f1': 0.7414330218068537,
 'eval_accuracy': 0.9714415389449429,
 'eval_runtime': 13.1094,
 'eval_samples_per_second': 62.017,
 'eval_steps_per_second': 7.781,
 'epoch': 3.0}

## Save the model

In [27]:
# Assumes Google Drive is already mounted at /content/drive (Colab)
saved_model_path = '/content/drive/MyDrive/bert_finetuned_ner/'
trainer.save_model(saved_model_path)

## Evaluation

In [28]:
predictions = trainer.predict(tokenized_datasets["test"])

In [29]:
from tabulate import tabulate

metrics = ['precision', 'recall', 'f1', 'accuracy']
prediction_results = []

for key, val in predictions.metrics.items():
    if any(item in key for item in metrics):
        # Format as a percentage directly; round(val, 4) * 100 can leave float noise
        prediction_results.append([key, f"{val:.2%}"])

print(tabulate(prediction_results, headers=['Metric', 'Score']))

Metric          Score
--------------  -------
test_precision  65.16%
test_recall     70.94%
test_f1         67.93%
test_accuracy   96.56%
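F1 is the harmonic mean of precision and recall, so the reported scores can be cross-checked from the other two columns. Here is a small worked check, using the validation metrics from `trainer.evaluate()` and the rounded test percentages above (`f1_score` is a hypothetical helper name):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Validation set: values copied from the trainer.evaluate() output.
val_f1 = f1_score(0.7247259439707674, 0.7589285714285714)
print(f"{val_f1:.6f}")  # 0.741433

# Test set: values copied from the rounded table, so only two decimals are reliable.
test_f1 = f1_score(0.6516, 0.7094)
print(f"{test_f1:.2%}")  # 67.93%
```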


## Inference

In [30]:
from transformers import pipeline

token_classifier = pipeline(
    "token-classification", model=saved_model_path, aggregation_strategy="simple"
)
results = token_classifier("vulnerabilities reported BLU Products, founded in 2009, makes lower-end Android-powered smartphones that sell for as little as $50 on Amazon company.")

In [31]:
prediction_results = []
for each_entity in results:
    prediction_results.append([each_entity['word'], each_entity['entity_group']])

print(tabulate(prediction_results, headers=['Word', 'Predictions']))


Word           Predictions
-------------  -------------
BLU Products   Organization
Android-power  System
Amazon         Organization
