***
># __NER with BERT__
*Fine-tuning BERT for Named Entity Recognition*
***

This notebook demonstrates the use of BERT models for natural language processing tasks, specifically named entity recognition (NER). We fine-tune pre-trained models available on Hugging Face to adapt them to our NER dataset. The models explored in this notebook include:
-  __[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)__
-  __[BERT-Base Cased](https://huggingface.co/google-bert/bert-base-cased)__
***

### __Notebook Overview__

- #### *PART I: Fine-Tuning the Models*
    - Import the required libraries and datasets.
    - Preprocess the data for compatibility with the pre-trained models.
    - Train and fine-tune the models for the NER task.

- #### *PART II: Make Predictions*
    - Use the fine-tuned models to make predictions on new text data.
***


<div class="alert alert-block alert-info">
<b>Authors:</b> Adrien CORDONNIER & Mustapha KOYTCHA
</div>


# *PART I: Fine-Tuning the Models*

In [1]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report
from safetensors.torch import load_file
from datasets import Dataset
import transformers as tfm
import pandas as pd
import numpy as np
import torch as to
import evaluate
import json
import os

metric = evaluate.load('seqeval')



In [2]:
# If necessary
os.chdir("C:/Users/Mustapha/OneDrive - IPSA/Aéro 5/Ma513 Hands-on Machine Learning for Cybersecurity/NER4Cyber-IPSA-main") # <---- Change the path
os.listdir()  # jsonfiles need to be in the folder data

['2_NER_Training___NLP_with_HuggingFace_Tutorial.ipynb',
 '461930.pdf',
 'Completed_testing.jsonlines',
 'data',
 'distilbert-finetuned-ner',
 'graph.py',
 'ner_model',
 'output_decision_tree.jsonlines',
 'output_SVM_poly_c1.jsonlines',
 'Projet-BERT_Adrien.ipynb',
 'README.md',
 'SVM&DECISIONTree.ipynb']

In [3]:
# Open the jsonlines files as pandas DataFrame and remove NAN values
with open("./data/NER-TRAINING.jsonlines", 'r') as f:
    training_data = [json.loads(l) for l in list(f)]
training_data = pd.DataFrame(training_data).dropna()

with open("./data/NER-TESTING.jsonlines", 'r') as f:
    testing_data = [json.loads(l) for l in list(f)]
testing_data = pd.DataFrame(testing_data).dropna()

with open("./data/NER-VALIDATION.jsonlines", 'r') as f:
    validation_data = [json.loads(l) for l in list(f)]
validation_data = pd.DataFrame(validation_data).dropna()

print(f"training_data.shape = {training_data.shape}\nvalidation_data.shape = {validation_data.shape}\ntesting_data.shape = {testing_data.shape}")

training_data.head()

training_data.shape = (4876, 3)
validation_data.shape = (1044, 3)
testing_data.shape = (1046, 2)


Unnamed: 0,unique_id,tokens,ner_tags
0,6506,"[Later, in, May, of, 2010, within, a, Pakistan...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,5221,"[In, 2008, ,, Tom, Donahue, ,, a, senior, Cent...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,1923,"[On, the, spectrum, of, state, responsibility,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,5905,"[If, we, observe, the, network, communications...","[O, O, O, O, O, O, O, O, O, O, B-Entity, O, B-..."
4,3114,"[The, regime's, CSTIA, relies, on, Russia, as,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
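The three loading blocks above repeat the same pattern, so they could be folded into one helper. The following is a minimal sketch, assuming a hypothetical `read_jsonlines` function (it is not part of the original notebook) and using an in-memory buffer with made-up records in place of the real files:

```python
import io
import json

def read_jsonlines(handle):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]

# Stand-in for a file such as ./data/NER-TRAINING.jsonlines (made-up records)
sample = io.StringIO(
    '{"unique_id": 1, "tokens": ["APT28", "lures"], "ner_tags": ["B-Entity", "O"]}\n'
    '\n'
    '{"unique_id": 2, "tokens": ["Hello"], "ner_tags": ["O"]}\n'
)

records = read_jsonlines(sample)
print(len(records), records[0]["tokens"])
```

The returned list can be passed to `pd.DataFrame(...).dropna()` in the same way as in the cell above.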


In [4]:
# Create the dictionary with a number for each unique value
unique_tags = set(tag for tags_list in training_data["ner_tags"] for tag in tags_list)

id_to_tag = {index: tag for index, tag in enumerate(unique_tags)}
tag_to_id = {tag: index for index, tag in enumerate(unique_tags)}
print(f"id_to_tag = {id_to_tag}")
print(f"tag_to_id = {tag_to_id}")

# Add a new column ner_tags_numeric to train and validation dataset containing the numeric value of ner_tags
training_data["ner_tags_numeric"] = training_data["ner_tags"].apply(lambda tags_list: [tag_to_id[tag] for tag in tags_list])
validation_data["ner_tags_numeric"] = validation_data["ner_tags"].apply(lambda tags_list: [tag_to_id[tag] for tag in tags_list])

training_data.head()

id_to_tag = {0: 'I-Modifier', 1: 'B-Entity', 2: 'B-Modifier', 3: 'O', 4: 'I-Entity', 5: 'I-Action', 6: 'B-Action'}
tag_to_id = {'I-Modifier': 0, 'B-Entity': 1, 'B-Modifier': 2, 'O': 3, 'I-Entity': 4, 'I-Action': 5, 'B-Action': 6}


Unnamed: 0,unique_id,tokens,ner_tags,ner_tags_numeric
0,6506,"[Later, in, May, of, 2010, within, a, Pakistan...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ..."
1,5221,"[In, 2008, ,, Tom, Donahue, ,, a, senior, Cent...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ..."
2,1923,"[On, the, spectrum, of, state, responsibility,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ..."
3,5905,"[If, we, observe, the, network, communications...","[O, O, O, O, O, O, O, O, O, O, B-Entity, O, B-...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 6, 1, 4, ..."
4,3114,"[The, regime's, CSTIA, relies, on, Russia, as,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]"
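Because `unique_tags` is a Python `set` of strings, its iteration order can differ between interpreter sessions (string hashing is randomized by default), so the numeric ids printed above are not guaranteed to be reproduced on a re-run. A minimal sketch of a deterministic mapping, using toy tag lists rather than the real `training_data["ner_tags"]` column:

```python
# Toy tag sequences standing in for training_data["ner_tags"]
tag_lists = [["O", "B-Entity", "I-Entity"], ["B-Action", "O", "B-Modifier"]]

# Sorting the set fixes the enumeration order across interpreter sessions
unique_tags = sorted({tag for tags in tag_lists for tag in tags})

id_to_tag = dict(enumerate(unique_tags))
tag_to_id = {tag: index for index, tag in id_to_tag.items()}
print(tag_to_id)
```

With a sorted list, a checkpoint trained in one session and label lists built in another would agree on what each id means.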


In [5]:
def tokenize_and_align_labels_from_dataframe(row, tokenizer):
    """
    Tokenises a line in the DataFrame and aligns with the ner_tags_numeric.
    Args:
        row: A row in the DataFrame containing token, ner_tags, ner_tags_numeric.
        tokenizer: The Hugging Face tokenizer to use.
    Returns:
        A dictionary containing the tokenised data and aligned tags.
    """
    # Tokenize the row
    tokenized_inputs = tokenizer(row['tokens'], truncation=True, is_split_into_words=True, padding = True)

    # Aligning labels with subtokens
    word_ids = tokenized_inputs.word_ids()
    numeric_labels = row['ner_tags_numeric']
    
    aligned_labels = []
    for word_id in word_ids:
        if word_id is None:  # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them
            aligned_labels.append(-100)
        else:  # Every sub-token inherits the label of the word it belongs to
            aligned_labels.append(numeric_labels[word_id])

    # Add aligned_labels to tokenised entries
    tokenized_inputs['labels'] = aligned_labels

    return tokenized_inputs
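To see what the alignment loop produces without loading a tokenizer, here is a stand-alone sketch. The `align_labels` helper and its `label_all_subtokens` flag are hypothetical, and the `word_ids` list is made up to mimic a sentence whose second word is split into three sub-tokens. With the flag left at `True` it reproduces the notebook's behaviour; setting it to `False` shows a common alternative in which only the first sub-token of each word keeps its label:

```python
def align_labels(word_ids, word_labels, label_all_subtokens=True):
    """Hypothetical stand-alone version of the alignment loop."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                      # special token
        elif label_all_subtokens or word_id != previous:
            aligned.append(word_labels[word_id])      # labelled sub-token
        else:
            aligned.append(-100)                      # masked continuation piece
        previous = word_id
    return aligned

# Made-up word_ids: [CLS], word 0, word 1 split into three pieces, [SEP]
word_ids = [None, 0, 1, 1, 1, None]
word_labels = [3, 1]

all_subtokens = align_labels(word_ids, word_labels)
first_only = align_labels(word_ids, word_labels, label_all_subtokens=False)
print(all_subtokens)
print(first_only)
```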

In [6]:
# Import the pre-trained model and apply tokenization
model_checkpoint = "bert-base-cased" # Rerun everything after this with 'distilbert-base-cased' or 'jackaduma/SecBERT'
tokenizer = tfm.AutoTokenizer.from_pretrained(model_checkpoint)
model = tfm.AutoModelForTokenClassification.from_pretrained(model_checkpoint, id2label = id_to_tag, label2id = tag_to_id)

training_data['tokenized'] = training_data.apply(lambda row: tokenize_and_align_labels_from_dataframe(row, tokenizer), axis=1)
validation_data['tokenized'] = validation_data.apply(lambda row: tokenize_and_align_labels_from_dataframe(row, tokenizer), axis=1)

training_data.head()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Unnamed: 0,unique_id,tokens,ner_tags,ner_tags_numeric,tokenized
0,6506,"[Later, in, May, of, 2010, within, a, Pakistan...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[input_ids, token_type_ids, attention_mask, la..."
1,5221,"[In, 2008, ,, Tom, Donahue, ,, a, senior, Cent...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[input_ids, token_type_ids, attention_mask, la..."
2,1923,"[On, the, spectrum, of, state, responsibility,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[input_ids, token_type_ids, attention_mask, la..."
3,5905,"[If, we, observe, the, network, communications...","[O, O, O, O, O, O, O, O, O, O, B-Entity, O, B-...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 6, 1, 4, ...","[input_ids, token_type_ids, attention_mask, la..."
4,3114,"[The, regime's, CSTIA, relies, on, Russia, as,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]","[input_ids, token_type_ids, attention_mask, la..."


In [7]:
# Create variables that match the model specifications
tokenized_training_data = training_data.copy()
tokenized_training_data["input_ids"] = tokenized_training_data["tokenized"].apply(lambda x: x["input_ids"])
tokenized_training_data["attention_mask"] = tokenized_training_data["tokenized"].apply(lambda x: x["attention_mask"])
tokenized_training_data["labels"] = tokenized_training_data["tokenized"].apply(lambda x: x["labels"])
tokenized_training_data = tokenized_training_data.drop(columns=["tokenized", "unique_id","tokens","ner_tags","ner_tags_numeric"])

tokenized_validation_data = validation_data.copy()
tokenized_validation_data["input_ids"] = tokenized_validation_data["tokenized"].apply(lambda x: x["input_ids"])
tokenized_validation_data["attention_mask"] = tokenized_validation_data["tokenized"].apply(lambda x: x["attention_mask"])
tokenized_validation_data["labels"] = tokenized_validation_data["tokenized"].apply(lambda x: x["labels"])
tokenized_validation_data = tokenized_validation_data.drop(columns=["tokenized", "unique_id","tokens","ner_tags","ner_tags_numeric"])

# Convert the DataFrame into a Hugging Face Dataset
hf_tokenized_training_data = Dataset.from_pandas(tokenized_training_data)
hf_tokenized_validation_data = Dataset.from_pandas(tokenized_validation_data)
hf_tokenized_testing_data = Dataset.from_pandas(testing_data)

hf_tokenized_training_data

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 4876
})

In [8]:
def pad_and_truncate(examples, max_length=50):
    """
    Pads and truncates input sequences to a fixed maximum length.
    This function processes batches of examples by truncating or padding their input IDs, attention masks,
    and labels to ensure uniform sequence lengths. Padding tokens are added to reach the maximum length,
    and truncation is applied if a sequence exceeds the specified limit.

    Args:
        examples (dict): A dictionary containing the following keys:
            - "input_ids" (list of lists): Tokenized input sequences.
            - "attention_mask" (list of lists): Attention masks corresponding to the input sequences.
            - "labels" (list of lists): Label sequences corresponding to the inputs.
        max_length (int): The desired fixed length for each sequence. Default is 50.

    Returns:
        dict: A dictionary with the same keys as the input, where each sequence has been padded
              or truncated to the specified max_length:
              - "input_ids" (list of lists): Padded/truncated tokenized input sequences.
              - "attention_mask" (list of lists): Padded/truncated attention masks.
              - "labels" (list of lists): Padded/truncated label sequences, with padding values set to -100
                to ensure they are ignored during loss computation.
    """
    padded_examples = {"input_ids": [], "attention_mask": [], "labels": []}
    for input_ids, attention_mask, labels in zip(examples["input_ids"], examples["attention_mask"], examples["labels"]):
        # Truncation
        input_ids = input_ids[:max_length]
        attention_mask = attention_mask[:max_length]
        labels = labels[:max_length]

        # Padding
        input_ids += [0] * (max_length - len(input_ids))
        attention_mask += [0] * (max_length - len(attention_mask))
        labels += [-100] * (max_length - len(labels))  # -100 to ignore padding in loss computation

        # Append to the padded examples
        padded_examples["input_ids"].append(input_ids)
        padded_examples["attention_mask"].append(attention_mask)
        padded_examples["labels"].append(labels)
    
    return padded_examples

# Apply the function to the DataFrame
hf_tokenized_training_data = pad_and_truncate(hf_tokenized_training_data)
hf_tokenized_validation_data = pad_and_truncate(hf_tokenized_validation_data)

hf_tokenized_training_data = Dataset.from_dict(hf_tokenized_training_data)
hf_tokenized_validation_data = Dataset.from_dict(hf_tokenized_validation_data)

hf_tokenized_training_data

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 4876
})
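The effect of `pad_and_truncate` on a single sequence can be checked on toy data. The sketch below uses a hypothetical `pad_or_truncate` helper that handles one list at a time and returns a new list; the token ids are made up:

```python
def pad_or_truncate(sequence, max_length, pad_value):
    """Return a new list of exactly max_length items (hypothetical helper)."""
    clipped = list(sequence[:max_length])
    return clipped + [pad_value] * (max_length - len(clipped))

input_ids = [101, 6160, 124, 102]   # made-up token ids
labels = [-100, 3, 3, -100]

padded_ids = pad_or_truncate(input_ids, 6, 0)        # padded with 0, as in the notebook
padded_labels = pad_or_truncate(labels, 6, -100)     # padded with the ignore index
truncated_ids = pad_or_truncate(input_ids, 3, 0)     # truncated: the final token is lost
print(padded_ids, padded_labels, truncated_ids)
```

As the last call shows, any sequence longer than `max_length` loses its trailing tokens and their labels. `transformers` also provides `DataCollatorForTokenClassification`, which pads each batch dynamically; whether it would suit this notebook is not evaluated here.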

In [9]:
label_names = list(tag_to_id.keys())

def compute_metrics(eval_preds):
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  true_labels = [[label_names[l] for l in label if l!=-100] for label in labels]
  true_predictions = [[label_names[p] for p,l in zip(prediction, label) if l!=-100]
                      for prediction, label in zip(predictions, labels)]

  all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
  return {"precision": all_metrics['overall_precision'],
          "recall": all_metrics['overall_recall'],
          "f1": all_metrics['overall_f1'],
          "accuracy": all_metrics['overall_accuracy']}

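The filtering done in `compute_metrics` can be traced on a tiny example. This sketch uses made-up logits and a toy three-label set (not the notebook's seven labels), and replaces `np.argmax` with a small pure-Python `argmax` so that it runs without NumPy or `seqeval`:

```python
label_names = ["O", "B-Entity", "I-Entity"]  # toy label set, not the notebook's

# One sequence of four positions, three scores per position (made-up logits)
logits = [[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 0.9], [0.5, 0.4, 0.3]]]
labels = [[-100, 1, 2, -100]]  # first and last positions are special tokens

def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)

predictions = [[argmax(scores) for scores in sequence] for sequence in logits]

# Same filtering as in compute_metrics: positions labelled -100 are dropped
true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
true_predictions = [[label_names[p] for p, l in zip(prediction, label) if l != -100]
                    for prediction, label in zip(predictions, labels)]
print(true_labels, true_predictions)
```

Only the two positions with real labels reach the metric, so predictions made on `[CLS]`, `[SEP]` and padding never affect the scores.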

In [10]:
# If necessary
os.chdir('C:/Users/Mustapha/Documents/IPSA-Cours/Aero5/Models')
print(os.getcwd())

C:\Users\Mustapha\Documents\IPSA-Cours\Aero5\Models


In [11]:
hf_tokenized_validation_data

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 1044
})

In [None]:
# Prepare the trainer and then start the computation
training_args = tfm.TrainingArguments(output_dir="./bert-base-cased", # <----Change the output folder name
                                    eval_strategy = "epoch",
                                    save_strategy="epoch",
                                    learning_rate = 2e-5,
                                    num_train_epochs=5,  # <----Change epoch
                                    weight_decay=0.01)   # weight-decay coefficient (the default AdamW optimizer applies it as decoupled decay, not as an L2 term added to the loss)
trainer = tfm.Trainer(
    model=model,
    eval_dataset=hf_tokenized_validation_data,
    train_dataset=hf_tokenized_training_data,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()


# *PART II: Make Predictions*

The two following cells are not runnable: we tried to load a saved checkpoint and make predictions with it, but did not succeed.

In [31]:
"""def tokenize_test_data_from_dataframe(row, tokenizer):
    tokenized_inputs = tokenizer(row['tokens'], 
                                is_split_into_words=True, 
                                padding="max_length",
                                return_tensors="pt",
                                truncation=True,
                                max_length=16)
    
    tokenized_inputs["word_ids"] = tokenized_inputs.word_ids()
    return tokenized_inputs

hf_tokenized_testing_data = hf_tokenized_testing_data.map(lambda row: tokenize_test_data_from_dataframe(row, tokenizer), 
                                                       remove_columns=['tokens', 'unique_id'])

hf_tokenized_testing_dataset = hf_tokenized_testing_data.map(remove_columns=['word_ids',  'token_type_ids'])

hf_tokenized_testing_dataset"""

Map: 100%|██████████| 1046/1046 [00:00<00:00, 3342.28 examples/s]
Map: 100%|██████████| 1046/1046 [00:00<00:00, 54304.27 examples/s]


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 1046
})

In [None]:
"""os.chdir("C:/Users/Mustapha/Documents/IPSA-Cours/Aero5/Models/")
models_path = ['bert-base-cased', 'distilbert-base-cased']

for path in models_path:
    for folder in os.listdir(f"./{path}"):
        output = []
        if folder != "runs":
            # Specify the path to the model directory
            checkpoint_dir = f"./{path}/{folder}"

            # Load custom configuration
            config = tfm.AutoConfig.from_pretrained(checkpoint_dir)
            tokenizer = tfm.AutoTokenizer.from_pretrained(checkpoint_dir)
            model = tfm.AutoModelForTokenClassification.from_pretrained(checkpoint_dir, config=config)

             # Apply the tokenisation function
            hf_tokenized_testing_dataset = hf_tokenized_testing_data.map(
                lambda row: tokenize_test_data_from_dataframe(row, tokenizer))

            for token in testing_data['tokens']:
                inputs = tokenizer(                 # Tokenization
                    token,
                    truncation=True,
                    padding=True,
                    max_length=128,
                    return_tensors="pt"
                )

                print("\n INPUT :", inputs.keys())
            
                # Make predictions
                model.eval()

                with to.no_grad():
                    outputs = model(**inputs)

                # Retrieve the logits and apply argmax to obtain the predicted IDs
                logits = outputs.logits # Logits : [num_samples, seq_len, num_labels]
                print("input size (sentence) : ", len(inputs['input_ids']))
                print("\n LOGITS shape :", logits.shape)
                predicted_ids = to.argmax(logits, dim=2)
                print("prediction shape : ", predicted_ids.shape, predicted_ids)

                id2label = model.config.id2label

                predicted_labels = [[id2label[int(idx.item())] for idx in sequence] for sequence in predicted_ids]

                aligned_predictions = []

                for i, pred_sequence in enumerate(predicted_labels):
                    word_ids = hf_tokenized_testing_dataset[i]["word_ids"]
                    aligned_labels = []
                    previous_word_id = None
                    for word_id, label in zip(word_ids, pred_sequence):
                        if word_id is None:
                            continue
                        if word_id != previous_word_id:
                            aligned_labels.append(label)
                        previous_word_id = word_id
                    aligned_predictions.append(aligned_labels)
                
                aligned_predictions_final = []
                for lab in aligned_predictions:
                    aligned_predictions_final.append(lab[0])
          
                output.append({"tokens":token, "ner_tags": aligned_predictions_final})

        
        with open(f"C:/Users/Mustapha/Documents/IPSA-Cours/Aero5/Models/complete_testing_file/{path}_{folder}.jsonlines", "w") as f:
            for entry in output:
                f.write(json.dumps(entry) + "\n")
    
        print("File 'output.jsonlines' created.")"""

Instead, we use the trainer's `predict` method, which relies on the most recently trained model.

In [None]:
def tokenize_test_data_from_dataframe(row, tokenizer):
    """
    Tokenises a row in the DataFrame for the test data.
    Args:
        row: A row in the DataFrame containing only the "tokens" column.
        tokenizer: The Hugging Face tokenizer to use.
    Returns:
        A dictionary containing the tokenised data.
    """
    tokenized_inputs = tokenizer(
        row['tokens'], 
        truncation=True, 
        is_split_into_words=True, 
        padding=True
    )
    
    tokenized_inputs["word_ids"] = tokenized_inputs.word_ids()
    return tokenized_inputs

test_dataset = Dataset.from_pandas(testing_data)

tokenized_test_dataset = test_dataset.map(lambda row: tokenize_test_data_from_dataframe(row, tokenizer),
                                          remove_columns=['tokens', 'unique_id'])
tokenized_test_dataset_1 = tokenized_test_dataset.map(remove_columns=['word_ids'])

Map: 100%|██████████| 1046/1046 [00:00<00:00, 5643.10 examples/s]
Map: 100%|██████████| 1046/1046 [00:00<00:00, 37240.30 examples/s]

{'input_ids': [101, 6160, 124, 15765, 5229, 1104, 4069, 117, 3366, 1154, 1367, 1472, 1558, 2114, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}





In [62]:
predictions = trainer.predict(tokenized_test_dataset_1)
logits = predictions.predictions  # Logits : [num_samples, seq_len, num_labels]

# Convert the logits into predicted label indices
predicted_indices = np.argmax(logits, axis=-1)  # [num_samples, seq_len]
predictions_text = [[id_to_tag[int(idx)] for idx in sequence] for sequence in predicted_indices]
print(predicted_indices)
print(predictions_text)

100%|██████████| 131/131 [00:22<00:00,  5.81it/s]

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [3 1 3 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 3 ... 0 0 0]]
[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'




In [56]:
aligned_predictions = []

for i, pred_sequence in enumerate(predictions_text):
    word_ids = tokenized_test_dataset[i]["word_ids"]  # Get the word_ids for this sequence
    aligned_labels = []
    previous_word_id = None
    for word_id, label in zip(word_ids, pred_sequence):
        if word_id is None:  # Ignore special sub-tokens ([CLS], [SEP])
            continue
        if word_id != previous_word_id:  # Add a label for each unique token
            aligned_labels.append(label)
        previous_word_id = word_id
    aligned_predictions.append(aligned_labels)
print(aligned_predictions)

[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-Entity', 'I-Entity', 'I-Entity', 'O', 'O', 'B-Entity', 'I-Entity', 'I-Entity', 'I-Entity', 'I-Entity', 'B-Action', 'I-Action', 'B-Modifier', 'B-Entity', 'I-Entity', 'B-Modifier', 'B-Modifier', 'B-Modifier', 'B-Entity', 'I-Entity', 'I-Entity', 'I-Entity', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-Entity', 'I-Entity', 'O', 'B-Action', 'B-Entity', 'I-Entity', 'I-Entity', 'B-Entity', 'I-Entity', 'B-Modifier', 'B-Entity', 'I-Entity', 'O'], ['O', 'O', 'O', 'O', 'O', 'I-Entity', 'I-Entity', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O
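The word-level alignment above keeps the prediction of the first sub-token of each word and discards the rest. Here is a self-contained sketch on made-up data, with a hypothetical `first_subtoken_labels` helper. The label list is deliberately longer than `word_ids` to mimic predictions padded to a longer length, which `zip` silently cuts off:

```python
def first_subtoken_labels(word_ids, token_labels):
    """Keep one predicted label per word: the one on its first sub-token."""
    aligned, previous = [], None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:          # [CLS] / [SEP]
            continue
        if word_id != previous:      # first sub-token of a new word
            aligned.append(label)
        previous = word_id
    return aligned

# Made-up example: word 1 is split into two sub-tokens
word_ids = [None, 0, 1, 1, 2, None]
token_labels = ["O", "O", "B-Entity", "I-Entity", "O", "O", "O", "O"]  # padded tail

word_labels = first_subtoken_labels(word_ids, token_labels)
print(word_labels)
```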

In [57]:
test = pd.DataFrame({'ner_tags':aligned_predictions})
print(test)

                                               ner_tags
0            [O, O, O, O, O, O, O, O, O, O, O, O, O, O]
1     [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
2     [O, O, O, O, O, O, O, O, O, O, O, O, O, O, I-E...
3     [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
4     [O, O, O, O, O, I-Entity, I-Entity, O, O, O, O...
...                                                 ...
1041  [O, O, O, O, O, O, B-Entity, O, O, O, B-Entity...
1042  [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
1043  [B-Entity, I-Entity, I-Entity, I-Entity, I-Ent...
1044  [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
1045  [B-Entity, I-Entity, O, O, O, O, O, O, O, O, O...

[1046 rows x 1 columns]


In [None]:
testing_data['ner_tags'] = test['ner_tags']

# Count the rows whose predicted tag sequence does not match the token count
a = 0
indices = []
for i in range(len(testing_data)):
    if len(testing_data.iloc[i]['ner_tags']) != len(testing_data.iloc[i]['tokens']):
        a = a+1
        indices.append(i)
print("number of mismatched rows: ", a)

testing_data = testing_data.drop(indices).reset_index(drop=True)
print(testing_data)

      unique_id                                             tokens  \
0          1357  [Stage, 3, exports, hundreds, of, methods, ,, ...   
1          3016  [These, campaigns, leverage, the, phenomenon, ...   
2          6936  [Interestingly, ,, most, of, the, affected, vi...   
3          4538  [The, framework, is, notable, for, a, number, ...   
4          4327  ['', APT28, 's, lures, and, domain, registrati...   
...         ...                                                ...   
1041       4724  [From, captured, traffic, it, appears, that, t...   
1042        545  [As, the, Syrian, Civil, War, continues, ,, Sy...   
1043       4105  [The, communication, between, the, Gen, 2, mal...   
1044       3139  [In, 2007, ,, Israel, launched, an, airstrike,...   
1045       1251  [The, sample, ,, which, was, deployed, against...   

                                               ner_tags  
0            [O, O, O, O, O, O, O, O, O, O, O, O, O, O]  
1     [O, O, O, O, O, O, O, O, O, O, O, O, 
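A length mismatch can arise, for example, when the tokenizer truncates a long sentence, so that its trailing words never receive a label. The cell above drops such rows. An alternative, shown here only as a sketch with a hypothetical `fit_tags_to_tokens` helper and made-up data, is to keep every row and fill the missing positions with a default tag (assumed to be `"O"`):

```python
def fit_tags_to_tokens(tokens, tags, fill="O"):
    """Clip or extend tags so that len(tags) == len(tokens) (hypothetical helper)."""
    return list(tags[:len(tokens)]) + [fill] * (len(tokens) - len(tags))

tokens = ["Stage", "3", "exports", "methods"]

too_short = fit_tags_to_tokens(tokens, ["O", "O"])
too_long = fit_tags_to_tokens(tokens, ["O", "O", "B-Action", "O", "O"])
print(too_short, too_long)
```

Whether filling with `"O"` is acceptable depends on how the output file is scored, which the notebook does not specify.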

In [60]:
os.getcwd()
os.chdir("C:/Users/Mustapha/OneDrive - IPSA/Aéro 5/Ma513 Hands-on Machine Learning for Cybersecurity/NER4Cyber-IPSA-main") # <---- Change the path


'C:\\Users\\Mustapha\\Documents\\IPSA-Cours\\Aero5\\Models'

In [None]:
with open("output_bert-base-cased.jsonlines", "w") as f:
    for _, row in testing_data.iterrows():
        f.write(json.dumps(row.to_dict()) + "\n")

print("File 'output_bert-base-cased.jsonlines' created.")

File 'output_BERT.jsonlines' created.
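The JSON Lines format written above can be checked with a quick round trip. This sketch writes a single made-up row to a temporary file (the path and the file name `output_example.jsonlines` are illustrative only) and reads it back:

```python
import json
import os
import tempfile

# Made-up row with the same keys as the notebook's output
rows = [{"unique_id": 1357, "tokens": ["Stage", "3"], "ner_tags": ["O", "O"]}]

path = os.path.join(tempfile.mkdtemp(), "output_example.jsonlines")

# One JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm that nothing was lost
with open(path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]

print(reloaded == rows)
```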
