<p align="center"> <img src="https://upload.wikimedia.org/wikipedia/en/thumb/f/f1/Logo-IPSA.png/200px-Logo-IPSA.png" alt="IPSA Logo" style="width:150px;height:auto;"> </p>

# IPSA - Ma 513 - Hands-on Machine Learning for Cybersecurity

## Named Entity Recognition for Cybersecurity

### Project Info
Teacher: 
Students: Valentin DESFORGES (Valouuuu24); Sylvain LAGARENNE (); Pierre VAUDRY (Rsky-20)

### Project Overview
As part of the Statistical Learning course at IPSA, this project focuses on Named Entity Recognition (NER) in the cybersecurity domain. The main objective is to develop a system capable of recognizing and classifying critical cybersecurity-related entities in text, such as malware, attacks, and threat actors, using state-of-the-art Natural Language Processing (NLP) techniques.

### Dataset
The dataset is sourced from SemEval-2018 Task 8 ("SecureNLP") and is formatted in JSON Lines. Each entry contains:

- unique_id: Unique identifier for the sentence.
- tokens: List of tokens (strings) forming the text.
- ner_tags: Named Entity Recognition tags following the IOB2 convention.

Example Entry:

[json format]
> {
>
>   "unique_id": 4775,
>
>   "tokens": ["This", "collects", ":", "Collected", "data", "will", "be", "uploaded", "to", "a", "DynDNS", "domain", "currently", "hosted", "on", "a", "US", "webhosting", "service", "."],
>
>  "ner_tags": ["B-Entity", "B-Action", "O", "B-Entity", "I-Entity", "O", "B-Action", "I-Action", "B-Modifier", "B-Entity", "I-Entity", "I-Entity", "I-Entity", "I-Entity", "I-Entity", "I-Entity", "I-Entity", "I-Entity", "I-Entity", "O"]
> 
>}

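As a minimal sketch of how one such record can be read and sanity-checked, the snippet below parses a single JSON Lines entry and verifies that every token has exactly one tag. The record is a shortened, hypothetical example, and `parse_record` and `VALID_TAGS` are illustrative names that are not part of the project code.

```python
import json

# One record in the JSON Lines format described above (shortened, hypothetical example).
line = (
    '{"unique_id": 1, '
    '"tokens": ["Collected", "data", "will", "be", "uploaded", "."], '
    '"ner_tags": ["B-Entity", "I-Entity", "O", "B-Action", "I-Action", "O"]}'
)

# The seven IOB2 labels used in this project.
VALID_TAGS = {
    "B-Action", "I-Action",
    "B-Entity", "I-Entity",
    "B-Modifier", "I-Modifier",
    "O",
}

def parse_record(raw_line):
    """Parse one JSON Lines record and check that tokens and tags line up."""
    record = json.loads(raw_line)
    tokens = record["tokens"]
    tags = record.get("ner_tags")  # may be absent in an unlabeled split
    if tags is not None:
        if len(tags) != len(tokens):
            raise ValueError(
                f"record {record['unique_id']}: "
                f"{len(tokens)} tokens but {len(tags)} tags"
            )
        unknown = set(tags) - VALID_TAGS
        if unknown:
            raise ValueError(f"unknown tags: {sorted(unknown)}")
    return record

record = parse_record(line)
for token, tag in zip(record["tokens"], record["ner_tags"]):
    print(f"{token:<12}{tag}")
```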
Named entity recognition (NER): Find the entities (such as persons, locations, or organizations) in a sentence. This can be formulated as attributing a label to each token by having one class per entity and one class for “no entity.”

Part-of-speech tagging (POS): Mark each word in a sentence as corresponding to a particular part of speech (such as noun, verb, adjective, etc.).

Chunking: Find the tokens that belong to the same entity. This task (which can be combined with POS or NER) can be formulated as attributing one label (usually B-) to any tokens that are at the beginning of a chunk, another label (usually I-) to tokens that are inside a chunk, and a third label (usually O) to tokens that don’t belong to any chunk.

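In the generic tag set used in many NER tutorials (the CoNLL-2003 entity types):

O means the word doesn’t correspond to any entity.

B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.

B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.

B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.

B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.

This project's dataset applies the same B-/I-/O scheme to three entity types (Action, Entity, and Modifier), which gives the seven labels B-Action, I-Action, B-Entity, I-Entity, B-Modifier, I-Modifier, and O.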
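
The B-/I-/O scheme described above can be turned back into entity spans with a short loop. The following is a minimal sketch, not part of the project code: `iob2_to_spans` is a hypothetical helper, and the tokens and tags are the first six of the example entry shown earlier.

```python
def iob2_to_spans(tags):
    """Group IOB2 tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same type.
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside and current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        # An I- tag with no matching open span is ignored in this sketch.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans

tokens = ["This", "collects", ":", "Collected", "data", "will"]
tags = ["B-Entity", "B-Action", "O", "B-Entity", "I-Entity", "O"]

spans = iob2_to_spans(tags)
for entity_type, start, end in spans:
    print(entity_type, tokens[start:end])
```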

In [1]:
# install packages
! pip install transformers datasets tokenizers seqeval -q
! pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
! pip install -r requirements.txt

Looking in indexes: https://download.pytorch.org/whl/cu124


### **Import Libraries**

In [2]:

import json

import accelerate
import evaluate
import numpy as np
import torch
import transformers
from datasets import DatasetDict, Dataset, Features, Sequence, ClassLabel, Value
from transformers import (
    AutoModelForTokenClassification,     # loads a pretrained model with a token-classification head
    BertTokenizerFast,
    DataCollatorForTokenClassification,  # pads inputs and labels per batch at runtime
    Trainer,
    TrainingArguments,
    pipeline,
)

print(torch.__version__)  # check the PyTorch version
if torch.cuda.is_available():
    print(f"{torch.cuda.get_device_name(0)}")  # print the name of the GPU


2.5.1+cu121
NVIDIA GeForce RTX 4060 Ti



### **Global variable**

In [3]:
MAX_LEN = 131
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 32
EPOCHS = 3
LEARNING_RATE = 2e-5
BASE_MODEL_PATH = "dslim/bert-large-NER"
MODEL_PATH = "./data/bert-base-NER_model"
TRAINING_FILE = "./data/NER-TRAINING.jsonlines"
VALIDATION_FILE = "./data/NER-VALIDATION.jsonlines"
TESTING_FILE = "./data/NER-TESTING.jsonlines"
TESTING_OUTPUT_FILE = "./data/NER-TESTING-PREDICTED.jsonlines"
VALIDATION_OUTPUT_FILE = "./data/NER-VALIDATION-PREDICTED.jsonlines"

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {DEVICE}")

TOKENIZER = transformers.BertTokenizer.from_pretrained(BASE_MODEL_PATH, do_lower_case=True)
# Label mapping
ID2LABEL = {
    0: "B-Action",
    1: "B-Entity",
    2: "B-Modifier",
    3: "I-Action",
    4: "I-Entity",
    5: "I-Modifier",
    6: "O"
}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

Using device: cuda


### **Load dataset**

In [4]:

# Load a JSON Lines file and return its records
def load_and_prepare_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f]
    return data

# Convert a data split into a Dataset object with ClassLabel tags
def convert_to_dataset_with_labels(data_section, labels):
    # Check whether "ner_tags" is present in the data
    has_ner_tags = "ner_tags" in data_section[0]

    # Define the features accordingly
    if has_ner_tags:
        features = Features({
            "id": Value("int64"),
            "tokens": Sequence(Value("string")),
            "ner_tags": Sequence(ClassLabel(names=labels))
        })
    else:
        features = Features({
            "id": Value("int64"),
            "tokens": Sequence(Value("string"))
        })

    # Prepare the columns
    dataset_dict = {
        "id": [example["unique_id"] for example in data_section],
        "tokens": [example["tokens"] for example in data_section],
    }
    if has_ner_tags:
        dataset_dict["ner_tags"] = [example["ner_tags"] for example in data_section]

    # Create the dataset
    dataset = Dataset.from_dict(dataset_dict, features=features)
    return dataset



# Load the raw data
train_data = load_and_prepare_data(TRAINING_FILE)
validation_data = load_and_prepare_data(VALIDATION_FILE)
test_data = load_and_prepare_data(TESTING_FILE)

# List of labels
ner_labels = ["B-Action", "B-Entity", "B-Modifier", "I-Action", "I-Entity", "I-Modifier", "O"]

# Create a DatasetDict with labels
ner_data = DatasetDict({
    "train": convert_to_dataset_with_labels(train_data, ner_labels),
    "validation": convert_to_dataset_with_labels(validation_data, ner_labels),
    "test": convert_to_dataset_with_labels(test_data, ner_labels)
})

### **Know Your Data**

In [5]:
#dataset information
ner_data

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 4876
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1044
    })
    test: Dataset({
        features: ['id', 'tokens'],
        num_rows: 1046
    })
})

In [6]:
# structure of train data
ner_data['train'][0]

{'id': 6506,
 'tokens': ['Later',
  'in',
  'May',
  'of',
  '2010',
  'within',
  'a',
  'Pakistani',
  'Senate',
  'question',
  'and',
  'answer',
  'session',
  ',',
  'the',
  'Gulshan-e-Jinnah',
  'Complex',
  'was',
  'cited',
  'under',
  'Federal',
  'Lodges',
  '/',
  'Hostels',
  'in',
  'Islamabad',
  'under',
  'the',
  'control',
  'of',
  'Pakistan',
  'Ministry',
  'for',
  'Housing',
  'and',
  'Works',
  '.'],
 'ner_tags': [6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6]}

In [7]:
#structure of test data
ner_data['test'][0]

{'id': 1357,
 'tokens': ['Stage',
  '3',
  'exports',
  'hundreds',
  'of',
  'methods',
  ',',
  'organized',
  'into',
  '12',
  'different',
  'major',
  'groups',
  '.']}

In [8]:
#structure of validation data
ner_data['validation'][0]

{'id': 6422,
 'tokens': ['Just',
  '1',
  'year',
  'later',
  ',',
  'after',
  'beginning',
  'their',
  'enterprise',
  'on',
  '3',
  'servers',
  'they',
  'had',
  'filled',
  '2',
  'server',
  'racks',
  'with',
  'happy',
  'clients',
  'receiving',
  'quality',
  'U.S',
  'support',
  '.'],
 'ner_tags': [6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6]}

In [9]:
# name of the ner-tags
ner_data['train'].features["ner_tags"]

Sequence(feature=ClassLabel(names=['B-Action', 'B-Entity', 'B-Modifier', 'I-Action', 'I-Entity', 'I-Modifier', 'O'], id=None), length=-1, id=None)

In [10]:
#description of the dataset
ner_data["train"].info.description = "This is the training dataset for Named Entity Recognition (NER)."
ner_data["validation"].info.description = "This is the validation dataset for Named Entity Recognition (NER)."
ner_data["test"].info.description = "This is the test dataset for Named Entity Recognition (NER)."


ner_data['train'].description

'This is the training dataset for Named Entity Recognition (NER).'

### **Hugging Face bert-large-uncased model**

In [11]:
# initializing the tokenizer from the BERT model
tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")

In [12]:
tokenizer

BertTokenizerFast(name_or_path='bert-large-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [13]:
example_text = ner_data['train'][0]
tokenized_input = tokenizer(example_text['tokens'],is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
word_ids = tokenized_input.word_ids()

In [14]:
print(tokenized_input)
print("\n")
print(tokens)
print("\n")
print(word_ids)

{'input_ids': [101, 2101, 1999, 2089, 1997, 2230, 2306, 1037, 9889, 4001, 3160, 1998, 3437, 5219, 1010, 1996, 19739, 4877, 4819, 1011, 1041, 1011, 9743, 15272, 3375, 2001, 6563, 2104, 2976, 26767, 1013, 21071, 2015, 1999, 26905, 2104, 1996, 2491, 1997, 4501, 3757, 2005, 3847, 1998, 2573, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


['[CLS]', 'later', 'in', 'may', 'of', '2010', 'within', 'a', 'pakistani', 'senate', 'question', 'and', 'answer', 'session', ',', 'the', 'gu', '##ls', '##han', '-', 'e', '-', 'jin', '##nah', 'complex', 'was', 'cited', 'under', 'federal', 'lodges', '/', 'hostel', '##s', 'in', 'islamabad', 'under', 'the', 'control', 'of', 'pakistan', 'ministry', 'for', 'housing', 'and', 'work

In [15]:
print(f'Length of the tokens is : {len(tokens)}')
print(f'Length of the ner tags is: {len(ner_data["train"][0]["ner_tags"])}')

Length of the tokens is : 47
Length of the ner tags is: 37


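* The token list and the tag list have different lengths: the tokenizer adds the special tokens [CLS] and [SEP] and splits some words into several word pieces (for example, 'Gulshan-e-Jinnah' becomes 'gu', '##ls', '##han', '-', 'e', '-', 'jin', '##nah'). To align them, the special tokens get the label -100 and each word piece gets the label of the word it comes from.

* During training, the loss function ignores every position labeled -100.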
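
To make the effect of the -100 label concrete, here is a small pure-Python sketch of a cross-entropy loss that skips ignored positions. It is meant to mirror the behaviour of PyTorch's cross-entropy with its default `ignore_index=-100` and mean reduction; the function name `masked_cross_entropy` and the logits are hypothetical and only serve the illustration.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens contribute nothing to the loss
        # log-sum-exp with max subtraction for numerical stability
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else 0.0

# Hypothetical logits for 4 positions and 3 classes: [CLS], two real tokens, [SEP]
logits = [[5.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 9.0]]
labels = [-100, 0, 1, -100]

loss_masked = masked_cross_entropy(logits, labels)
loss_real_tokens_only = masked_cross_entropy(logits[1:3], labels[1:3])
print(loss_masked, loss_real_tokens_only)
```

Both values are identical: whatever the model predicts at the [CLS] and [SEP] positions has no influence on the loss.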

In [16]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    # Check the input format
    if isinstance(examples, list):  # a list of example dictionaries
        examples = {
            "tokens": [example["tokens"] for example in examples],
            "ner_tags": [example["ner_tags"] for example in examples]
        }
    elif isinstance(examples["tokens"], list) and isinstance(examples["tokens"][0], str):
        # a single example was passed: wrap it in a batch of size 1
        examples = {
            "tokens": [examples["tokens"]],
            "ner_tags": [examples["ner_tags"]]
        }

    # Tokenization
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


In [17]:
ner_data['train'][4:5]


{'id': [3114],
 'tokens': [['The',
   "regime's",
   'CSTIA',
   'relies',
   'on',
   'Russia',
   'as',
   'one',
   'of',
   'several',
   'sources',
   'for',
   'technical',
   'data',
   '.']],
 'ner_tags': [[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]]}

In [18]:
q = tokenize_and_align_labels(ner_data['train'][4:5])
print(q)


{'input_ids': [[101, 1996, 6939, 1005, 1055, 20116, 10711, 16803, 2006, 3607, 2004, 2028, 1997, 2195, 4216, 2005, 4087, 2951, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, -100]]}


Before tokenize_and_align_labels() is applied, the tokenizer output has 3 keys:

* input_ids
* token_type_ids
* attention_mask

After tokenize_and_align_labels() is applied, there is a fourth key: 'labels'.

In [19]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
the_____________________________________ 6
regime__________________________________ 6
'_______________________________________ 6
s_______________________________________ 6
cs______________________________________ 6
##tia___________________________________ 6
relies__________________________________ 6
on______________________________________ 6
russia__________________________________ 6
as______________________________________ 6
one_____________________________________ 6
of______________________________________ 6
several_________________________________ 6
sources_________________________________ 6
for_____________________________________ 6
technical_______________________________ 6
data____________________________________ 6
._______________________________________ 6
[SEP]___________________________________ -100
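
The listing above goes from word labels to word-piece labels. Going the other way, from word-piece predictions back to one label per original word, can be sketched as follows. This is an illustrative helper under stated assumptions, not the project's prediction code: `subword_to_word_labels` is a hypothetical name, the `word_ids` list is written by hand to match the split shown above, and the predicted labels are invented.

```python
def subword_to_word_labels(word_ids, subword_labels, num_words, default="O"):
    """Collapse word-piece labels to word labels, keeping the first piece of each word."""
    word_labels = [default] * num_words
    seen = set()
    for word_id, label in zip(word_ids, subword_labels):
        if word_id is None or word_id in seen:
            continue  # skip special tokens and non-initial word pieces
        seen.add(word_id)
        word_labels[word_id] = label
    return word_labels

tokens = ["The", "regime's", "CSTIA", "relies"]
# Word index of each piece in: [CLS] the regime ' s cs ##tia relies [SEP]
word_ids = [None, 0, 1, 1, 1, 2, 2, 3, None]
# Hypothetical per-piece predictions, for illustration only
subword_labels = ["O", "O", "B-Entity", "I-Entity", "I-Entity",
                  "B-Entity", "I-Entity", "B-Action", "O"]

word_labels = subword_to_word_labels(word_ids, subword_labels, len(tokens))
for token, label in zip(tokens, word_labels):
    print(f"{token:<10}{label}")
```

Keeping the first piece is one common convention; taking a majority vote over the pieces of a word would be an alternative design.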


In [20]:
# Apply .map() only to "train" and "validation"
tokenized_datasets = DatasetDict({
    key: dataset.map(tokenize_and_align_labels, batched=True) if key != "test" else dataset
    for key, dataset in ner_data.items()
})

# Check the result
print(tokenized_datasets)

Map:   0%|          | 0/4876 [00:00<?, ? examples/s]

Map:   0%|          | 0/1044 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 4876
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1044
    })
    test: Dataset({
        features: ['id', 'tokens'],
        num_rows: 1046
    })
})


In [21]:
tokenized_datasets['train'][0]

{'id': 6506,
 'tokens': ['Later',
  'in',
  'May',
  'of',
  '2010',
  'within',
  'a',
  'Pakistani',
  'Senate',
  'question',
  'and',
  'answer',
  'session',
  ',',
  'the',
  'Gulshan-e-Jinnah',
  'Complex',
  'was',
  'cited',
  'under',
  'Federal',
  'Lodges',
  '/',
  'Hostels',
  'in',
  'Islamabad',
  'under',
  'the',
  'control',
  'of',
  'Pakistan',
  'Ministry',
  'for',
  'Housing',
  'and',
  'Works',
  '.'],
 'ner_tags': [6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6,
  6],
 'input_ids': [101,
  2101,
  1999,
  2089,
  1997,
  2230,
  2306,
  1037,
  9889,
  4001,
  3160,
  1998,
  3437,
  5219,
  1010,
  1996,
  19739,
  4877,
  4819,
  1011,
  1041,
  1011,
  9743,
  15272,
  3375,
  2001,
  6563,
  2104,
  2976,
  26767,
  1013,
  21071,
  2015,
  1999,
  26905,
  2104,
  1996,
  2491,
  1997,
  4501,
  3757,
  2005,
  3847,
  1998,
 

### **Defining the model**

In [22]:
# Defining model
ner_model = AutoModelForTokenClassification.from_pretrained("bert-large-uncased", num_labels=7)
ner_model.to(DEVICE)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024

In [23]:
#! pip install -U accelerate
#! pip install -U transformers


In [24]:
#!pip install accelerate
#!pip install 'accelerate>=0.20.1,<0.21'


In [25]:
transformers.__version__, accelerate.__version__

('4.47.1', '1.2.1')

In [26]:
#Define training args
args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
)


print("TrainingArguments created successfully!")

TrainingArguments created successfully!




In [27]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [28]:
# Load the "seqeval" metric for sequence-labeling evaluation
metric = evaluate.load("seqeval")

In [29]:
label_list = ner_data["train"].features["ner_tags"].feature.names
label_list


['B-Action',
 'B-Entity',
 'B-Modifier',
 'I-Action',
 'I-Entity',
 'I-Modifier',
 'O']

Compute Metrics

This compute_metrics() function first takes the argmax of the logits to convert them to predictions (softmax is monotonic, so the largest logit is also the largest probability and we don’t need to apply the softmax). Then it converts both labels and predictions from integers to strings, removes every position where the label is -100, and passes the results to the metric.compute() method:
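
The steps in this paragraph can be traced on a toy input without NumPy or the metric library. The sketch below uses hypothetical logits for one short sequence; `softmax` and `argmax` are small illustrative helpers, and only `label_list` matches the notebook.

```python
import math

label_list = ["B-Action", "B-Entity", "B-Modifier",
              "I-Action", "I-Entity", "I-Modifier", "O"]

def softmax(row):
    """Convert logits to probabilities."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in row."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical logits: 4 positions ([CLS], two tokens, [SEP]) x 7 classes
logits = [
    [0.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.3],
    [0.2, 2.5, 0.1, 0.0, 0.3, 0.0, 1.0],
    [1.9, 0.1, 0.0, 0.4, 0.2, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
]
labels = [-100, 1, 6, -100]

# Argmax on logits and on probabilities picks the same class
pred_ids = [argmax(row) for row in logits]
same_as_softmax = pred_ids == [argmax(softmax(row)) for row in logits]

# Drop ignored positions, then map ids to tag strings
predictions = [label_list[p] for p, l in zip(pred_ids, labels) if l != -100]
references = [label_list[l] for l in labels if l != -100]
print(predictions, references)
```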

In [30]:
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds

    # Softmax preserves the ordering of the logits,
    # so the argmax of the logits is already the predicted class
    pred_ids = np.argmax(pred_logits, axis=2)

    # Remove all positions where the label is -100 and map ids to tag strings
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for l in label if l != -100]
        for label in labels
    ]

    results = metric.compute(predictions=predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

### **Train the model**

In [31]:
trainer = Trainer(
   ner_model,
   args,
   train_dataset=tokenized_datasets["train"],
   eval_dataset=tokenized_datasets["validation"],
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)



In [None]:
trainer.train()

  0%|          | 0/459 [00:00<?, ?it/s]
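
The total of 459 steps in the progress bar follows from the dataset size and the settings above. A quick check, assuming a single device and no gradient accumulation (`total_training_steps` is a hypothetical helper, not a Trainer API):

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

# 4876 training rows, TRAIN_BATCH_SIZE = 32, EPOCHS = 3
steps = total_training_steps(4876, 32, 3)
print(steps)  # 153 steps per epoch * 3 epochs = 459
```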

### **Save Model**

In [33]:
## Save model
ner_model.save_pretrained("ner_model")

In [None]:
## Save tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer\\tokenizer_config.json',
 'tokenizer\\special_tokens_map.json',
 'tokenizer\\vocab.txt',
 'tokenizer\\added_tokens.json',
 'tokenizer\\tokenizer.json')

In [35]:
id2label = {
    str(i): label for i,label in enumerate(label_list)
}
label2id = {
    label: str(i) for i,label in enumerate(label_list)
}

In [36]:
id2label

{'0': 'B-Action',
 '1': 'B-Entity',
 '2': 'B-Modifier',
 '3': 'I-Action',
 '4': 'I-Entity',
 '5': 'I-Modifier',
 '6': 'O'}

In [37]:
label2id

{'B-Action': '0',
 'B-Entity': '1',
 'B-Modifier': '2',
 'I-Action': '3',
 'I-Entity': '4',
 'I-Modifier': '5',
 'O': '6'}

### **Load model & prediction**

In [38]:
import json

In [39]:
# Read the saved config, add the label mappings, and write it back
with open("ner_model/config.json") as f:
    config = json.load(f)
config["id2label"] = id2label
config["label2id"] = label2id
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)

In [40]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("ner_model")


In [41]:
from transformers import pipeline

In [None]:
# Labels defined for the model
ner_labels = ["B-Action", "B-Entity", "B-Modifier", "I-Action", "I-Entity", "I-Modifier", "O"]

# Initialize the NER pipeline
nlp = pipeline("ner", model=model_fine_tuned.to(DEVICE), tokenizer=tokenizer, device=0 if DEVICE == 'cuda' else -1)

# Convert label indices to label strings
def convert_indices_to_labels(indices, label_list):
    """
    Convert a list of label indices to label strings using label_list.
    """
    return [label_list[int(idx)] for idx in indices]  # explicit conversion to int

# Generate predictions and write them to a JSON Lines file
def predict_and_save(ner_dataset, output_file):
    results = []
    for example in ner_dataset:
        # Run the pipeline on the sentence rebuilt from its tokens
        tokens = example["tokens"]
        ner_results = nlp(" ".join(tokens))

        # Start with every token predicted as "O"
        ner_tags_predicted = ["O"] * len(tokens)

        for ner_result in ner_results:
            label = ner_result["entity"]

            # Find the token matching the piece returned by the pipeline.
            # Note: this is an exact string match on the first occurrence, so a
            # lowercased or split word piece that differs from the original token is skipped.
            word = ner_result["word"]
            try:
                token_idx = tokens.index(word)
                ner_tags_predicted[token_idx] = label
            except ValueError:
                # No matching token: skip this result
                continue

        # Add the result in JSON Lines format
        results.append({
            "unique_id": int(example["id"]),  # explicit conversion to int
            "tokens": tokens,
            "ner_tags": convert_indices_to_labels(example.get("ner_tags", []), ner_labels),  # indices to labels
            "ner_tags_predicted": ner_tags_predicted
        })

    # Write the results to the JSON Lines file
    with open(output_file, "w", encoding="utf-8") as f:
        for item in results:
            f.write(json.dumps(item) + "\n")
    print(f"Predictions saved to {output_file}")

# Run predictions on the test set
predict_and_save(ner_data["test"], "./data/NER-TESTING-PREDICTED.jsonlines")

# Run predictions on the validation set
predict_and_save(ner_data["validation"], "./data/NER-VALIDATION-PREDICTED.jsonlines")

# Evaluate the predictions
def evaluate_predictions(validation_file, label_list):
    """
    Evaluate predictions by comparing ner_tags_predicted with ner_tags.
    """
    # Load the predictions
    with open(validation_file, "r", encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]

    metric = evaluate.load("seqeval")
    all_predictions = []
    all_references = []

    for example in examples:
        if "ner_tags" in example and "ner_tags_predicted" in example:
            references = example["ner_tags"]
            predictions = example["ner_tags_predicted"]

            all_predictions.append(predictions)
            all_references.append(references)

    # Compute the metrics
    results = metric.compute(predictions=all_predictions, references=all_references)
    print("Evaluation results:")
    print(json.dumps(results, indent=2, default=str))  # default=str serializes non-JSON types such as NumPy integers
    return results

# Evaluate on the validation set
validation_results = evaluate_predictions("./data/NER-VALIDATION-PREDICTED.jsonlines", ner_labels)


Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Predictions saved to ./data/NER-TESTING-PREDICTED.jsonlines
Predictions saved to ./data/NER-VALIDATION-PREDICTED.jsonlines
Evaluation results:
{
  "Action": {
    "precision": 0.6049046321525886,
    "recall": 0.5336538461538461,
    "f1": 0.5670498084291188,
    "number": "416"
  },
  "Entity": {
    "precision": 0.10482019892884469,
    "recall": 0.14842903575297942,
    "f1": 0.12286995515695068,
    "number": "923"
  },
  "Modifier": {
    "precision": 0.6015037593984962,
    "recall": 0.5714285714285714,
    "f1": 0.586080586080586,
    "number": "280"
  },
  "overall_precision": 0.2675257731958763,
  "overall_recall": 0.32056825200741196,
  "overall_f1": 0.29165495925821855,
  "overall_accuracy": 0.8628137540325753
}
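
The reported scores can be cross-checked with a few lines of arithmetic. The sketch below recomputes each F1 as the harmonic mean of precision and recall, then rebuilds the overall precision and recall from per-class counts; the counts are derived from the reported values and supports, so this only shows that the numbers are consistent with micro-averaging, not how the metric library computes them internally. `f1_score` and `reported` are illustrative names.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, f1, support) copied from the evaluation output above
reported = {
    "Action": (0.6049046321525886, 0.5336538461538461, 0.5670498084291188, 416),
    "Entity": (0.10482019892884469, 0.14842903575297942, 0.12286995515695068, 923),
    "Modifier": (0.6015037593984962, 0.5714285714285714, 0.586080586080586, 280),
}
overall = (0.2675257731958763, 0.32056825200741196, 0.29165495925821855)

# 1) Each reported F1 matches 2PR / (P + R)
for name, (p, r, f1, _) in reported.items():
    print(name, round(f1_score(p, r), 4), round(f1, 4))

# 2) Derive true-positive and predicted-span counts per class
true_pos = {n: round(r * s) for n, (p, r, f1, s) in reported.items()}
predicted = {n: round(true_pos[n] / p) for n, (p, r, f1, s) in reported.items()}

# 3) Micro-average: pool the counts over all classes
micro_p = sum(true_pos.values()) / sum(predicted.values())
micro_r = sum(true_pos.values()) / sum(s for (_, _, _, s) in reported.values())
print(micro_p, micro_r, f1_score(micro_p, micro_r))
```

Under these derived counts, the Entity class accounts for most of the predicted spans, which is why its low precision pulls the overall precision down.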
