## Master's Degree in Computer Engineering (AI & Robotics)

**Stella Francesco 2124359** \
**Lavezzi Luca 2154256**

# Named Entity Recognition (NER)

In this notebook we address the task of Named Entity Recognition (NER), which consists of identifying and classifying entities such as persons, organizations, locations, and miscellaneous names within text. NER is a fundamental problem in Natural Language Processing (NLP), with applications ranging from information extraction to question answering.

For our experiments, we use the widely adopted CoNLL-2003 dataset, which contains annotated newswire articles with four types of named entities: persons, organizations, locations, and miscellaneous. This dataset is a standard benchmark for evaluating NER systems in the general domain.

To explore zero-shot and few-shot NER, we experiment with two instruction-tuned large language models (LLMs): **Gemma 3 27B** (`gemma-3-27b-it`), which is accessed via API, and **DeepSeek-R1:14b**, which is executed locally using the Ollama framework.

Following the guidelines of the project, we implement and evaluate the baseline method from Xie et al. (2023), as well as one additional zero-shot prompting strategy inspired by their work. We refine our prompts and methods using a training split of the dataset, and report results on a separate test split, following best practices for NER evaluation.

The notebook is organized as follows:

- Introduction of the dataset and domain
- Documentation of any additional annotation or external libraries used
- Introduction to the models used
- Description of each zero-shot and few-shot method
- Evaluation and critical discussion of the results

# CoNLL 2003 Dataset

For this project, we used the CoNLL-2003 shared task dataset on language-independent named entity recognition (NER). This dataset includes annotated data for two languages, German and English, but we exclusively used the English portion for training and testing.

The English data is sourced from the Reuters Corpus, comprising newswire articles published between August 1996 and August 1997. Specifically, the training set consists of articles from a ten-day period toward the end of August 1996, while the test set includes articles from December 1996.

Each token in the dataset is annotated with one of four named entity types, or marked as lying outside any entity:

* PER – Person
* ORG – Organization
* LOC – Location
* MISC – Miscellaneous entities not covered by the other categories
* O – Outside of a named entity

The CoNLL-2003 dataset is widely regarded as a benchmark for evaluating NER systems due to its standardized format, high-quality annotations, and well-defined evaluation protocol (typically based on precision, recall, and F1-score). It continues to serve as a critical resource for comparing both traditional machine learning and modern deep learning approaches in sequence labeling tasks.

Using the Hugging Face Datasets library, we loaded the dataset; each instance is a dictionary with the following fields:

* id – The identifier of the sample
* tokens – A list containing the tokenized sentence
* pos_tags – A list of part-of-speech (POS) tags associated with each token
* chunk_tags – A list of syntactic chunk tags
* ner_tags – A list of named entity recognition tags corresponding to the tokens

In [1]:
from datasets import load_dataset
import random

# Set the random seed for reproducibility
random.seed(0)
# Load the dataset
dataset = load_dataset("eriktks/conll2003")

# Access the train, validation, and test splits
train_data = dataset["train"]
validation_data = dataset["validation"]
test_data = dataset["test"]

# Sample
print("Example:", train_data[0])
print("\nTraining set size:", len(train_data))
print("Validation set size:", len(validation_data))
print("Test set size:", len(test_data))

Example: {'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

Training set size: 14041
Validation set size: 3250
Test set size: 3453


For both the training and testing phases, we used seed 0 to ensure reproducibility, and the samples were taken from the training and test splits provided by the dataset.
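The integer `ner_tags` in the example printed above can be decoded into readable labels. The sketch below assumes the id-to-label mapping that appears in the prompts of this notebook (`O` = 0, `B-PER` = 1, ..., `I-MISC` = 8). The helper name `decode_ner_tags` is ours and is introduced only for illustration.

```python
# Label order ASSUMED to match the mapping used in the prompts of this notebook
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_ner_tags(tag_ids):
    """Map integer tag ids to their string labels (hypothetical helper)."""
    return [NER_LABELS[i] for i in tag_ids]


# First training sentence, copied from the output above
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

for token, label in zip(tokens, decode_ner_tags(ner_tags)):
    print(f"{token:10s} {label}")
```

Under this mapping, `EU` is decoded as `B-ORG`, and `German` and `British` as `B-MISC`.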

# Libraries
## Google library
The Google Gen AI Python SDK provides an interface for integrating Google's generative models into Python applications. We used this library because we tested our prompts using Gemma 3. The SDK offers access to a variety of large language models from Google, through a user-friendly structure and flexible configuration options. It supports features such as adjusting the temperature for response creativity and sending batch requests for more efficient processing.

## Hugging Face
Hugging Face is a leading machine learning and data science platform and community that enables users to build, train, and deploy machine learning models with ease. It offers a wide range of pretrained large language models (LLMs) that can be used for tasks such as text generation, summarization, translation, classification, and more.

Developers can also fine-tune these models on custom datasets or even build new models from scratch using the tools provided by Hugging Face. A key component of the platform is the Transformers library, which we used in this project. It provides seamless integration with popular deep learning frameworks like PyTorch and TensorFlow, making it easier to experiment and scale models across different environments.

In addition to models, Hugging Face hosts datasets such as CoNLL-2003.

## SeqEval
Seqeval is a Python library designed for evaluating sequence labeling tasks, such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and chunking. It provides a simple and efficient way to compute standard evaluation metrics, including precision, recall, and F1-score, specifically tailored for structured prediction tasks where labels follow formats like IOB/IOB2. We adopt this library to evaluate our results and compare the various methods.
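To make the entity-level metrics concrete, the following sketch shows how precision, recall, and F1 can be computed over whole entity spans in an IOB2 sequence. It is a simplified illustration written for this notebook and handles a single sentence; it is not seqeval's actual implementation, and the function names `extract_entities` and `entity_f1` are hypothetical.

```python
def extract_entities(labels):
    """Return the set of (type, start, end) spans in an IOB2 label sequence."""
    entities, start, etype = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        inside = label.startswith("I-") and label[2:] == etype
        if not inside:
            if etype is not None:
                entities.add((etype, start, i - 1))
            if label == "O":
                start, etype = None, None
            else:
                # "B-X" (or a stray "I-X") opens a new entity of type X
                start, etype = i, label[2:]
    return entities


def entity_f1(true_labels, pred_labels):
    """Entity-level precision, recall, and F1 for one sentence."""
    true_entities = extract_entities(true_labels)
    pred_entities = extract_entities(pred_labels)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


true = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]   # the PER span is cut short
print(entity_f1(true, pred))
```

In this example the truncated `PER` span counts as an error even though its first token is tagged correctly, which is why entity-level scores are usually lower than token-level accuracy.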

## Ollama
The ollama library provides a Python interface to interact with large language models (LLMs) running locally via the Ollama server. It allows users to send prompts, receive model-generated responses, and manage inference workflows directly from Python code. This makes it easy to integrate advanced generative AI capabilities into data analysis pipelines, research experiments, or application development, all while keeping computation on the local machine.

## spaCy
spaCy is an open-source software library for advanced natural language processing (NLP), written in Python and Cython. Unlike libraries such as NLTK, which are often used for teaching and research, spaCy is designed specifically for production use, offering fast and robust tools for tasks like tokenization, part-of-speech tagging, dependency parsing, text categorization, and named entity recognition (NER).

spaCy supports deep learning workflows and can integrate with popular machine learning libraries such as TensorFlow and PyTorch via its own backend, Thinc. It provides prebuilt neural network models for 23 languages and supports tokenization for over 65 languages, making it suitable for both multilingual applications and custom model training.


In [5]:
%pip install -q -U google-genai
%pip install seqeval
%pip install google-generativeai
%pip install ipywidgets --upgrade
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install pydantic
!pip install datasets[all]
!pip install pandas
!pip install ollama
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')



A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "c:\Users\lucal\.conda\envs\insetti\Lib\site-packages\spacy\__init__.py", line 6, in <module>
  File "c:\Users\lucal\.conda\envs\insetti\Lib\site-packages\spacy\errors.py", line 3, in <module>
    from .compat import Literal
  File "c:\Users\lucal\.conda\envs\insetti\Lib\site-packages\spacy\compat.py", line 4, in <module>
    from thinc.util imp

# Models

For the development and evaluation of the prompts in this project, we used two different large language models: Gemma 3 (27B) and a distilled DeepSeek-R1 (14B).

## Gemma 3 (27b)

We chose Gemma 3 because it is a recently released model by Google, offering several notable advantages:

* Computational efficiency: Gemma 3 is a relatively lightweight model that achieves reasonable performance compared to state-of-the-art (SOTA) models like Gemini 2.5 Pro, which is believed to be considerably larger (its parameter count has not been published).
* API limits: Unlike many other models with free-tier APIs, Gemma 3 allows for a higher number of requests per day, making it more suitable for iterative development and testing.

However, these advantages come with a significant trade-off:

* Lower performance: In our tests, Gemma 3 underperformed compared to more advanced models. When evaluated against Gemini 2.0 Flash, we observed a performance gap of up to 20% in F1-score. Despite this, using higher-performing models was impractical due to API rate limitations, which would have required a considerable amount of additional time to complete equivalent testing.

The lower accuracy of Gemma 3 also influenced our implementation: model outputs required additional validation and post-processing. In some cases, responses did not adhere to the expected output format, necessitating extra checks and error handling in our code.

## DeepSeek-R1 (distilled)
In this section, we present and discuss our implementation based on the distilled DeepSeek-R1 model.

### How to run DeepSeek-R1 locally
Go to https://ollama.com, then download and install Ollama.
Next, browse the available DeepSeek-R1 models: https://ollama.com/library/deepseek-r1.
Choose a model and copy the command shown on its page to run it in a terminal (for instance: `ollama run deepseek-r1:14b`).


### Model selection
Our choice of model was based on two factors: generation time and performance.
We decided to use `deepseek-r1:14b` because it generated tokens relatively fast while still performing almost as well as gemma-3-27b-it, the model we had used previously and whose performance was acceptable.
Larger models such as deepseek-r1:32b were impractical because token generation was very slow, and the gain in performance over deepseek-r1:14b was minimal.
We also tried smaller models, deepseek-r1:7b and deepseek-r1:8b, but they performed much worse.



## **Implemented methods**
In this section, we list and describe the methods we implemented using deepseek-r1:14b. Each prompt corresponds to a specific approach, and the results for each method are saved in a CSV file identified by the method's acronym.

### - Word-level named entity reflection (WLNER)

In this method, we extend the vanilla prompt with an instruction asking the model to generate a short summary for each word in the sentence. The summary has to be very short (around ten words) and should explain why the word can or cannot be a candidate named entity.

This approach explicitly leverages the **chain of thought** capability of large language models: by prompting the model to reflect and reason about each token individually, it is encouraged to articulate its decision-making process step by step. This not only helps the model to make more informed tagging decisions, but also provides interpretability, as the generated summaries reveal the rationale behind each prediction.

However, this method performed slightly worse compared to the baseline reported in the reference paper.

>This method was designed and developed by us specifically for this project.

### - Multi-turn adaptive refinement (MTAR)

In this method we asked the model to name the words that are candidate named entities. The model had to give an explanation for every word it believed to be a candidate, and then give the results in the correct format.

Similarly to the previously mentioned method, this approach explicitly leverages the **chain of thought** capabilities of large language models. By prompting the model to reason step by step about which words are likely entities and to justify its choices, we encourage a more structured and interpretable decision-making process. This not only helps the model focus on the most relevant tokens, but also reduces the overall size of the output text generated, since the prompt specifies to just list the candidate NER tags and their explanations.

This method performed slightly better than the baseline.

>This method was designed and developed by us specifically for this project.

### - Dependency-based entity validation (DBEV)
In this method we added the dependency tree in the prompt, obtained by using spaCy (the dependency tree was not part of the dataset).

The motivation behind this approach was to provide the model with additional syntactic information about the relationships between words in the sentence. By including the dependency tree, the model could potentially leverage the grammatical structure to better identify entities, especially in complex sentences where context and word dependencies play a crucial role.

To implement this, we used spaCy to generate the dependency tree for each sentence, representing the syntactic relations in a structured format. This tree was then inserted into the prompt alongside the tokens to be tagged. The expectation was that the model, with access to this extra layer of linguistic information, would be able to make more informed tagging decisions, particularly in cases where surface-level cues were insufficient.

However, in our experiments, the F1 score of this method was slightly worse.

It is also important to note that the dependency trees were not originally part of the dataset and had to be computed using spaCy. This introduces a potential source of error, as the quality of the dependency parsing depends on the accuracy of spaCy's model. If the dataset had included gold-standard dependency trees, or if the dependency information had been annotated and curated specifically for the NER task, the results could have been better and the model might have been able to exploit this information more effectively.

>This method was already used in the reference paper.

### - POS-guided named entity recognition (POSGNER)
In this method we added the POS tags for every token of the sentence, which are part of the dataset.

The main idea behind this approach is to provide the model with explicit information about the grammatical role of each token in the sentence. Part-of-speech (POS) tags indicate whether a word is a noun, verb, adjective, proper noun, etc., and can be very helpful for named entity recognition because certain entity types are strongly associated with specific POS tags (for example, proper nouns are often persons, organizations, or locations).

By including the POS tags in the prompt, we aimed to help the model disambiguate cases where the token alone might not be sufficient to determine the correct entity type. For instance, the word "Apple" could refer to a fruit (common noun) or a company (proper noun), and the POS tag can provide a useful clue for the model to make the right choice.

In practice, we formatted the prompt so that the model received both the list of tokens and the corresponding POS tags in order. This additional information allowed the model to make more informed predictions, especially in sentences with ambiguous or rare words. Our experiments showed that this method generally improved the F1 score compared to the vanilla approach, confirming the usefulness of syntactic information for NER tasks.

The performance scores indicate that this method was better than the baseline method implemented in the reference paper.

>This method was already used in the reference paper.
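In the dataset the POS tags are stored as integer ids, as the printed examples show. The sketch below illustrates how the ids could be decoded into tag names before being placed in a prompt. The partial table is our reading of the dataset's Penn Treebank style label list for the ids of the first training sentence and is an assumption: it should be checked against the dataset metadata (we expect `train_data.features["pos_tags"].feature.names` to expose the full list). The helper `decode_pos` is hypothetical. The prompt used in this notebook passes the ids as stored; decoding them is an optional variation that we did not evaluate.

```python
# Partial id -> tag table, ASSUMED from the dataset's label list; verify before use
POS_NAMES = {7: ".", 16: "JJ", 21: "NN", 22: "NNP", 35: "TO", 37: "VB", 42: "VBZ"}


def decode_pos(pos_ids, names=POS_NAMES):
    """Replace each integer POS id with its tag name; unknown ids stay as strings."""
    return [names.get(i, str(i)) for i in pos_ids]


# First training sentence, copied from the dataset example
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
pos_tags = [22, 42, 16, 21, 35, 37, 16, 21, 7]

print(list(zip(tokens, decode_pos(pos_tags))))
```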

### - POS-dependency hybrid NER (POSDHNER)
In this method we added to the prompt both the POS tags and the dependency tree.

The rationale behind this approach is to combine two complementary sources of linguistic information: the part-of-speech (POS) tags, which indicate the grammatical role of each token, and the dependency tree, which describes the syntactic relationships between words in the sentence. By providing both types of annotations, we aimed to give the model a richer context for making NER predictions, especially in cases where either POS or dependency information alone might not be sufficient.

In practice, the prompt was structured to include the list of tokens, their corresponding POS tags, and the full dependency tree (as computed by spaCy) for each sentence. This allowed the model to consider not only the type of each word but also how words are connected and which tokens are likely to form multi-word entities based on their syntactic structure.

However, our experiments showed that the performance was close to the dependency-based entity validation method. Additionally, as with the dependency-based method, the quality of the dependency tree depends on the accuracy of spaCy's parser. If gold-standard dependency annotations had been available in the dataset, the results might have been better and the model could have leveraged this information more effectively.

>This method was designed and developed by us specifically for this project.

### - Example-driven POS NER (EDPOSNER)
In this method we added the POS tags for every token of the sentence. We also added three complete examples, each consisting of the sentence tokens, their POS tags, and the NER tags associated with those tokens. This can therefore be considered a **few-shot learning** method.

The main motivation for this approach was to provide the model with concrete, context-rich demonstrations of the NER task, making it easier for the model to generalize the tagging strategy to new sentences. By including three full examples in the prompt—each showing the tokens, their corresponding POS tags, and the correct NER tags—the model could observe how the tagging should be performed in practice, even for more complex or ambiguous cases.

In our experiments, this method performed similarly to the same prompt without examples.

>This method was designed and developed by us specifically for this project.


### Evaluation Without BIO Tag Distinction

In addition to the standard evaluation using the full BIO tagging scheme (where each token is labeled as the Beginning (B) of an entity, Inside (I) an entity, or Outside (O) any entity), we also conducted experiments by removing the distinction between B- and I- tags. In this alternative evaluation, entity types are considered without regard to whether a token is at the beginning or inside of an entity span. For example, both `B-PER` and `I-PER` are mapped to a single `PER` category.

This approach allows us to consider a prediction as correct even if the model predicts a `B` tag instead of an `I` tag (or vice versa), as long as the entity type itself is correct. The motivation behind this evaluation is to focus on the model's ability to recognize the correct entity type, regardless of its position within the entity span. This can be particularly useful for assessing the robustness of the model in scenarios where the precise span boundaries are less critical than the correct identification of entity types.

To implement this, we mapped all `B-` and `I-` tags of the same entity type to a single label. The evaluation metrics were then computed on these simplified labels, providing an additional perspective on the model's performance.

The CSV files obtained by ignoring the BIO tagging have names ending in **NOBIO**.
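The mapping described above can be sketched as follows. The snippet assumes the integer label ids used in the prompts of this notebook and compares labels token by token only to illustrate the effect of the mapping; the scores we report are not produced by this snippet, and the names `ID_TO_TYPE` and `collapse_bio` are hypothetical.

```python
# Both the B- and the I- id of an entity type map to the bare type name
ID_TO_TYPE = {0: "O", 1: "PER", 2: "PER", 3: "ORG", 4: "ORG",
              5: "LOC", 6: "LOC", 7: "MISC", 8: "MISC"}


def collapse_bio(tag_ids):
    """Drop the B-/I- distinction from a sequence of integer tag ids."""
    return [ID_TO_TYPE[t] for t in tag_ids]


true = [0, 1, 2, 0, 5]   # O, B-PER, I-PER, O, B-LOC
pred = [0, 1, 1, 0, 6]   # O, B-PER, B-PER, O, I-LOC

strict_matches = sum(t == p for t, p in zip(true, pred))
relaxed_matches = sum(t == p for t, p in zip(collapse_bio(true), collapse_bio(pred)))
print(strict_matches, relaxed_matches)   # 3 5
```

The two B/I confusions are counted as errors under the strict comparison and as correct once the labels are collapsed.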

### Model's response parsers

Here we list the functions for parsing and storing the model's response.

In [2]:
import csv
import os


def parse_response(tokens: list, response_labels: str, true_labels: list) -> list:
    '''
    Parse a response of the form "ner_tags: 0, 1, 2, ..." and return a list of
    [token, predicted label, true label] triples, one per token.
    '''
    # Keep only the comma-separated labels that follow the first colon
    response_labels = response_labels.split(":")[1].strip('\n').split(',')
    # Force exactly one label per token: truncate extras, pad missing ones with 'O' (0)
    if len(response_labels) > len(tokens):
        response_labels = response_labels[:len(tokens)]
    elif len(response_labels) < len(tokens):
        response_labels = response_labels + ['0'] * (len(tokens) - len(response_labels))

    temp = []
    for i in range(len(tokens)):
        pred_label = int(response_labels[i].strip())
        # Out-of-range labels are only reported here, not rejected
        if pred_label < 0 or pred_label > 8:
            print(f"Token: {tokens[i]}, Predicted Label: {pred_label}, True Label: {true_labels[i]}")
        #assert (pred_label >= 0 and pred_label <= 8), "Predicted label is out of range"
        temp.append([tokens[i], pred_label, true_labels[i]])

    return temp

def save_to_csv_vanilla(tokens: list, pred_labels: list, true_labels: list, filename: str) -> None:
    # Create the output directory if needed, so that opening the file cannot fail
    directory = os.path.dirname(filename)
    if directory:
        os.makedirs(directory, exist_ok=True)
    file_exists = os.path.isfile(filename)
    # Write the header only if the file did not exist before
    if not file_exists:
        with open(filename, 'a', newline='') as csvfile:
            header = ['token', 'pred', 'true']
            writer = csv.writer(csvfile)
            writer.writerow(header)
    if len(pred_labels) == 0 and len(true_labels) == 0:
        return
    # Keep only tokens for which the prediction or the gold label is an entity tag
    data = [[tokens[i], pred_labels[i], true_labels[i]] for i in range(len(tokens)) if pred_labels[i] != 0 or true_labels[i] != 0]
    # Append the rows to the file for later analysis
    with open(filename, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(data)
        
        
        

# Label ids produced by the no-BIO prompt: one id per entity type (the B- ids)
begin_tags = {1, 3, 5, 7}

def parse_response_no_bio(tokens: list, response_labels: str, true_labels: list) -> list:
    '''
    Parse a response produced without the B-/I- distinction and rebuild BIO labels:
    a token that continues an entity of the same type receives the I- label (id + 1).
    Returns a list of [token, predicted label, true label] triples, one per token.
    '''
    response_labels = response_labels.split(":")[1].strip('\n').split(',')
    # Force exactly one label per token: truncate extras, pad missing ones with 'O' (0)
    if len(response_labels) > len(tokens):
        response_labels = response_labels[:len(tokens)]
    elif len(response_labels) < len(tokens):
        response_labels = response_labels + ['0'] * (len(tokens) - len(response_labels))

    temp = []
    for i in range(len(tokens)):
        label = int(response_labels[i].strip())
        prev_label = temp[i-1][1] if i > 0 else 0
        if label in begin_tags and prev_label in (label, label + 1):
            # The previous token is the B- or I- label of the same type: continuation.
            # A neighbour of a different type starts a new entity and keeps its B- id.
            pred_label = label + 1
        else:
            # Otherwise, we take the label as is
            pred_label = label
        if pred_label < 0 or pred_label > 8:
            print(f"Token: {tokens[i]}, Predicted Label: {pred_label}, True Label: {true_labels[i]}")
        #assert (pred_label >= 0 and pred_label <= 8), "Predicted label is out of range"
        temp.append([tokens[i], pred_label, true_labels[i]])

    return temp


# Take predicted labels and for each token save the label in a list to be used for voting
def store_predicted_labels(pred_labels : list, votes : list) -> None:
    for i in range(len(pred_labels)):
        votes[i].append(pred_labels[i])
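`store_predicted_labels` accumulates, for each token, the labels predicted over several runs. One plausible way to turn these votes into a final prediction is a per-token majority vote. The sketch below is an assumption, since the aggregation step is not shown in this section; `majority_vote` is a hypothetical helper and the three runs are made-up values.

```python
from collections import Counter


def majority_vote(votes):
    """Pick the most frequent label per token; ties go to the label seen first."""
    return [Counter(token_votes).most_common(1)[0][0] for token_votes in votes]


# One (initially empty) list of votes per token of a three-token sentence
votes = [[] for _ in range(3)]

# Three made-up runs of the model on the same sentence
for run in ([3, 0, 7], [3, 0, 0], [4, 0, 7]):
    for i, label in enumerate(run):
        votes[i].append(label)   # same accumulation as store_predicted_labels

print(majority_vote(votes))   # [3, 0, 7]
```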


### Training data sampling

Here we draw a random sample of 100 sentences from the training set using a fixed seed.

In [3]:
import random

random.seed(0)

# Sample 100 random elements from the training set
sampled_train_data = random.sample(list(train_data), 100)

# Print the first few samples to verify
for i, sample in enumerate(sampled_train_data[:5]):  # Display the first 5 samples
    print(f"Sample {i + 1}:")
    print(sample)
    print("\n")

Sample 1:
{'id': '13835', 'tokens': ['RESERVES', '(', '$', 'MLN', ')', 'Jul', '+1,161', '+400.9', '+310.4', '-', '54,703.0'], 'pos_tags': [24, 4, 3, 22, 5, 22, 11, 11, 11, 8, 11], 'chunk_tags': [11, 0, 11, 12, 0, 11, 12, 12, 12, 12, 12], 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


Sample 2:
{'id': '6311', 'tokens': ['7.', 'Sebastian', 'Lindholm', '(', 'Finland', ')', 'Ford', 'Escort', '5:17'], 'pos_tags': [22, 22, 22, 4, 22, 5, 22, 22, 11], 'chunk_tags': [11, 12, 12, 0, 11, 0, 11, 12, 12], 'ner_tags': [0, 1, 2, 0, 5, 0, 7, 8, 0]}


Sample 3:
{'id': '12418', 'tokens': ['7', '-', 'Chris', 'Walker', '(', 'England', ')', 'beat', 'Amr', 'Shabana', '(', 'Egypt', ')', '15-13', '15-10', '15-6'], 'pos_tags': [11, 8, 22, 22, 4, 22, 5, 37, 22, 22, 4, 22, 5, 11, 16, 11], 'chunk_tags': [11, 12, 12, 12, 0, 11, 0, 21, 11, 12, 0, 11, 0, 11, 12, 12], 'ner_tags': [0, 0, 1, 2, 0, 5, 0, 0, 1, 2, 0, 5, 0, 0, 0, 0]}


Sample 4:
{'id': '6890', 'tokens': ['OB', '2', 'Lotte', '1'], 'pos_tags': [22, 11, 22

### Dependency tree computation

Here we report the code for computing the dependency trees for the training data using spaCy.

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")

dependency_trees = []

for sample in sampled_train_data:
    tokens = sample["tokens"]
    sentence = " ".join(tokens)
    doc = nlp(sentence)
    tree = [
        {
            "text": token.text,
            "dep": token.dep_,
            "head": token.head.text,
            "pos": token.pos_,
            "index": token.i,
            "head_index": token.head.i
        }
        for token in doc
    ]
    dependency_trees.append(tree)

# Link the two lists
for i, sample in enumerate(sampled_train_data):
    sample['dependency_tree'] = dependency_trees[i]

# Now each sample has the tree inside
print(sampled_train_data[0]['dependency_tree'])

[{'text': 'RESERVES', 'dep': 'nsubj', 'head': '+1,161', 'pos': 'NOUN', 'index': 0, 'head_index': 6}, {'text': '(', 'dep': 'punct', 'head': 'RESERVES', 'pos': 'PUNCT', 'index': 1, 'head_index': 0}, {'text': '$', 'dep': 'nmod', 'head': 'MLN', 'pos': 'SYM', 'index': 2, 'head_index': 3}, {'text': 'MLN', 'dep': 'appos', 'head': 'RESERVES', 'pos': 'X', 'index': 3, 'head_index': 0}, {'text': ')', 'dep': 'punct', 'head': 'MLN', 'pos': 'PUNCT', 'index': 4, 'head_index': 3}, {'text': 'Jul', 'dep': 'appos', 'head': 'RESERVES', 'pos': 'PROPN', 'index': 5, 'head_index': 0}, {'text': '+1,161', 'dep': 'ROOT', 'head': '+1,161', 'pos': 'VERB', 'index': 6, 'head_index': 6}, {'text': '+400.9', 'dep': 'advmod', 'head': '+1,161', 'pos': 'ADV', 'index': 7, 'head_index': 6}, {'text': '+310.4', 'dep': 'dep', 'head': '+400.9', 'pos': 'NUM', 'index': 8, 'head_index': 7}, {'text': '-', 'dep': 'punct', 'head': '54,703.0', 'pos': 'PUNCT', 'index': 9, 'head_index': 10}, {'text': '54,703.0', 'dep': 'dobj', 'head': '
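Because the sentence is rebuilt with `" ".join(tokens)` and then tokenized again by spaCy, the resulting tree is not guaranteed to contain exactly one node per dataset token. A quick sanity check along the following lines can flag such sentences before the tree is placed in a prompt. This is a minimal sketch: `tree_is_aligned` is a hypothetical helper, the first tree is a stand-in for spaCy output, and the misaligned case is invented.

```python
def tree_is_aligned(tokens, tree):
    """True if the tree has exactly one node per dataset token, in the same order."""
    return [node["text"] for node in tree] == list(tokens)


# Stand-in for a spaCy result that kept the dataset tokenization
tokens = ['RESERVES', '(', '$', 'MLN', ')']
tree = [{"text": t} for t in tokens]
print(tree_is_aligned(tokens, tree))          # True

# Invented case: the parser splits one dataset token into three nodes
split_tree = [{"text": "15"}, {"text": "-"}, {"text": "13"}]
print(tree_is_aligned(["15-13"], split_tree))  # False
```

Sentences that fail the check could be skipped or handled separately, since their `index` and `head_index` values no longer refer to dataset token positions.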

### Clean response function

This function extracts the predicted tag numbers from the raw response of a DeepSeek model, discarding the reasoning block and any surrounding text.

In [5]:
import re
import os
import csv
import json
from datasets import load_dataset
from ollama import Client


def clean_response(text):
    """Cleans the model's output to extract only the numbers."""
    # Remove <think>...</think> blocks
    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()

    # Look for the pattern 'ner_tags: 0, 1, 2, ...'
    match = re.search(r'ner_tags\s*:\s*([0-9,\s]+)', cleaned)
    if match:
        number_str = match.group(1)
    else:
        # If 'ner_tags:' is not found, but there are still numbers, extract them all
        number_str = cleaned

    # Extract all integers as strings
    number_list = re.findall(r'\d+', number_str)
    return number_list
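To illustrate the two paths through `clean_response`, the snippet below applies it to two made-up responses. The function is repeated in compact form so that the snippet runs on its own; the response texts are invented for illustration.

```python
import re


def clean_response(text):
    """Compact copy of the function above: extract the tag numbers as strings."""
    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
    match = re.search(r'ner_tags\s*:\s*([0-9,\s]+)', cleaned)
    number_str = match.group(1) if match else cleaned
    return re.findall(r'\d+', number_str)


# Normal path: the reasoning block is dropped and the tags after 'ner_tags:' are kept
raw = "<think>Token 1 is an organization, so tag 3.</think>\nner_tags: 3, 0, 7, 0"
print(clean_response(raw))                          # ['3', '0', '7', '0']

# Fallback path: without 'ner_tags:' every number in the text is collected,
# including numbers that are not labels
print(clean_response("The 4 tags are 3, 0, 7, 0"))  # ['4', '3', '0', '7', '0']
```

The second call shows why the fallback can return more values than there are tokens; the length check in `parse_response` then truncates the list, which may shift the labels.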



### Format example function

This function is used for the few-shot prompts, giving them a formatted example.

In [6]:
def format_example(ex):
    return f"""Tokens: {ex['tokens']}
POS tags: {ex['pos_tags']}
NER tags: {ex['ner_tags']}"""

# Using 3 random examples in the prompt (different from the ones used in the training)
example1 = train_data[11000]
example2 = train_data[12000]
example3 = train_data[13000]
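The formatted examples can then be joined into a single block and placed in the prompt. The exact layout of the few-shot prompt is not shown in this section, so the sketch below is an assumed layout: `build_few_shot_block` is a hypothetical helper, `format_example` is repeated so that the snippet runs on its own, and the first training sentence is used as a stand-in for the three selected examples.

```python
def format_example(ex):
    return f"""Tokens: {ex['tokens']}
POS tags: {ex['pos_tags']}
NER tags: {ex['ner_tags']}"""


def build_few_shot_block(examples):
    """Join formatted examples into one numbered block (hypothetical helper)."""
    parts = [f"Example {k}:\n{format_example(ex)}"
             for k, ex in enumerate(examples, start=1)]
    return "\n\n".join(parts)


# Stand-in example: the first training sentence shown earlier in the notebook
demo = {
    'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
    'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
    'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
}

few_shot_block = build_few_shot_block([demo, demo])
print(few_shot_block)
```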

## Best prompt (POS-guided named entity recognition)

Here we report the best prompt found using deepseek-r1:14b and the code used to generate the responses.

In [None]:
# Initialize client
client = Client()

for j in range(len(sampled_train_data)):
    tokens = sampled_train_data[j]['tokens']
    pos_tags = sampled_train_data[j]['pos_tags']  # integer POS ids as stored in the dataset
    true_labels = sampled_train_data[j]['ner_tags']
    # The dependency tree is not used by this prompt, so it is not loaded here

    prompt = f"""You are a strict NER tagging system.

    Given the following NER tags:
    {{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}}

    Your task is to assign the correct tag number to each token in this sentence:
    {tokens}

    You are also given the POS tag (part-of-speech) for each token.
    POS tags (in order, one per token):
    {pos_tags}

    Respond ONLY with:
    ner_tags: x, x, x, ..., x  ← (exactly {len(tokens)} integers)

    Do NOT include explanations, thoughts, or any other content.
    Do NOT write anything before or after "ner_tags: ...".
    Just print the sequence in the format specified.
    """

    try:
        response = client.generate(model="deepseek-r1:14b", prompt=prompt)
        raw_text = response.response
        pred_tags_str = clean_response(raw_text)

        # Parsing (conversion and validation)
        parsed_data = parse_response(tokens, f"ner_tags: {','.join(pred_tags_str)}", true_labels)

        # Debug print
        print(f"[✓] Sentence {j}")
        print("Tokens:    ", tokens)
        print("Predicted: ", [x[1] for x in parsed_data])
        print("True:      ", [x[2] for x in parsed_data])
        print("---")

        # Save to file
        save_to_csv_vanilla(tokens, [x[1] for x in parsed_data], true_labels, "data100_train_ds/vanilla_train_100_ds_14b_POSGNER.csv")

    except Exception as e:
        print(f"[!] Error at sentence {j}: {e}")
        print(f"Raw response: {response.response if 'response' in locals() else 'None'}")
        print("---")


### Using the same model without the BIO tagging

Here we present the code used for the same prompt without BIO tagging.
In this case, the F1 score obtained with the no-BIO format (0.517) was lower than that of its BIO counterpart (0.565).

In [None]:
# Initialize client
client = Client()

for j in range(len(sampled_train_data)):
    tokens = sampled_train_data[j]['tokens']
    pos_tags = sampled_train_data[j]['pos_tags']
    true_labels = sampled_train_data[j]['ner_tags']

    prompt = f"""You are a strict NER tagging system.

    Given the following NER tags:
    {{'O': 0, 'B-PER': 1, 'B-ORG': 3, 'B-LOC': 5, 'B-MISC': 7}}

    Your task is to assign the correct tag number to each token in this sentence:
    {tokens}

    You are also given the POS tag (part-of-speech) for each token.
    POS tags (in order, one per token):
    {pos_tags}

    Respond ONLY with:
    ner_tags: x, x, x, ..., x  ← (exactly {len(tokens)} integers)

    Do NOT include explanations, thoughts, or any other content.
    Do NOT write anything before or after "ner_tags: ...".
    Just print the sequence in the format specified.
    """

    try:
        response = client.generate(model="deepseek-r1:14b", prompt=prompt)
        raw_text = response.response
        pred_tags_str = clean_response(raw_text)

        # Parsing (conversion and validation)
        parsed_data = parse_response_no_bio(tokens, f"ner_tags: {','.join(pred_tags_str)}", true_labels)

        # Debug print
        print(f"[✓] Sentence {j}")
        print("Tokens:    ", tokens)
        print("Predicted: ", [x[1] for x in parsed_data])
        print("True:      ", [x[2] for x in parsed_data])
        print("---")

        # Save to file
        save_to_csv_vanilla(tokens, [x[1] for x in parsed_data], true_labels, "data100_train_ds/vanilla_train_100_ds_14b_POSGNER_NOBIO.csv")

    except Exception as e:
        print(f"[!] Error at sentence {j}: {e}")
        print(f"Raw response: {response.response if 'response' in locals() else 'None'}")
        print("---")
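
To clarify how the no-BIO label space relates to the BIO ids, here is a minimal sketch of one possible conversion in both directions. The functions `collapse_bio` and `restore_bio` are hypothetical names introduced for illustration; the notebook's own `parse_response_no_bio` may handle this differently.

```python
def collapse_bio(bio_tags):
    """Map BIO ids (0..8) to flat type ids (0, 1, 3, 5, 7): I-X becomes B-X."""
    return [tag - 1 if tag > 0 and tag % 2 == 0 else tag for tag in bio_tags]


def restore_bio(flat_tags):
    """Rebuild BIO ids: a type repeated on consecutive tokens becomes I-X."""
    bio_tags = []
    previous = 0
    for tag in flat_tags:
        if tag != 0 and tag == previous:
            bio_tags.append(tag + 1)  # continuation of the same entity type
        else:
            bio_tags.append(tag)      # 'O' or the first token of an entity
        previous = tag
    return bio_tags


flat = collapse_bio([0, 3, 4, 4, 0, 1, 2])
print(flat)               # [0, 3, 3, 3, 0, 1, 1]
print(restore_bio(flat))  # [0, 3, 4, 4, 0, 1, 2]
```

A known limitation of this heuristic is that two distinct adjacent entities of the same type are merged into one span, since the flat format carries no boundary information.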


### Removal of Rows with Invalid Prediction Labels
To ensure the correctness of our evaluation, we removed all rows from the CSV files where the prediction label (pred) was not an integer between 0 and 8 (inclusive). Specifically, any row with a pred value outside this range, or with a non-integer value, was discarded.

In [None]:
import os
import csv

input_folder = 'data100_train_ds'
output_folder = 'data100_ds_corrected'
os.makedirs(output_folder, exist_ok=True)

for filename in os.listdir(input_folder):
    if filename.endswith('.csv'):
        input_path = os.path.join(input_folder, filename)
        output_path = os.path.join(output_folder, filename)
        rows = []
        with open(input_path, newline='', encoding='utf-8') as infile:
            reader = csv.DictReader(infile)
            fieldnames = reader.fieldnames
            for row in reader:
                try:
                    pred = int(row['pred'])
                    if 0 <= pred <= 8:
                        row['pred'] = str(pred)
                        rows.append(row)
                    # Otherwise, the row is discarded
                except ValueError:
                    # Row is discarded if pred cannot be converted to int
                    continue

        with open(output_path, 'w', newline='', encoding='utf-8') as outfile:
            writer = csv.DictWriter(outfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
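
The filtering rule can be checked in isolation on a few in-memory rows. The sketch below is a simplified stand-in for the cell above: the helper `is_valid_pred` and the `token` column name are assumptions, and the rows are invented for the example.

```python
import csv
import io


def is_valid_pred(value, low=0, high=8):
    """True if value parses as an integer within the allowed tag range."""
    try:
        return low <= int(value) <= high
    except (TypeError, ValueError):
        return False


# Invented rows: two valid predictions, one out of range, one non-integer
sample_csv = io.StringIO(
    "token,pred,true\n"
    "Blackpool,5,3\n"
    "0,0,0\n"
    "Hednesford,12,3\n"
    "1,O,0\n"
)
all_rows = list(csv.DictReader(sample_csv))
kept = [row for row in all_rows if is_valid_pred(row["pred"])]
dropped = len(all_rows) - len(kept)
print(f"kept {len(kept)} rows, dropped {dropped}")  # kept 2 rows, dropped 2
```

Keeping track of the number of dropped rows is useful when interpreting the scores, because discarded tokens shorten the evaluated sequences.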

### Comparison between the prompts

Here we report the F1 scores obtained with the various prompts tested on the training data, sorted from best to worst.

In [7]:
import os
import csv
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

label_mapping = {
    0: 'O',
    1: 'B-PER',
    2: 'I-PER',
    3: 'B-ORG',
    4: 'I-ORG',
    5: 'B-LOC',
    6: 'I-LOC',
    7: 'B-MISC',
    8: 'I-MISC'
}

folder = 'data100_ds_corrected'
best_f1 = -1
best_file = None
results = []

for filename in os.listdir(folder):
    if filename.endswith('.csv'):
        true_seqs = []
        pred_seqs = []
        current_true = []
        current_pred = []
        filepath = os.path.join(folder, filename)
        with open(filepath, newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                try:
                    true_label = label_mapping[int(row['true'])]
                    pred_label = label_mapping[int(row['pred'])]
                    current_true.append(true_label)
                    current_pred.append(pred_label)
                except Exception:
                    continue
            if current_true and current_pred:
                true_seqs.append(current_true)
                pred_seqs.append(current_pred)
        if true_seqs and pred_seqs:
            f1 = f1_score(true_seqs, pred_seqs)
            results.append((filename, f1))
            if f1 > best_f1:
                best_f1 = f1
                best_file = filename

# Prints the results sorted by F1 score
for fname, f1 in sorted(results, key=lambda x: x[1], reverse=True):
    print(f"{fname}: F1 Score = {f1}")

print("\nBest file:", best_file)
print("Best F1 Score:", best_f1)

vanilla_train_100_ds_14b_ADVANCEDPOS_NOBIO.csv: F1 Score = 0.6074498567335244
vanilla_train_100_ds_14b_POSGNER.csv: F1 Score = 0.5649717514124294
vanilla_train_100_ds_14b_ADVANCED_NOBIO.csv: F1 Score = 0.5630498533724341
vanilla_train_100_ds_14b_EDPOSNER_NOBIO.csv: F1 Score = 0.5574712643678161
vanilla_train_100_ds_14b_NOBIO.csv: F1 Score = 0.5531914893617021
vanilla_train_100_ds_14b_MTAR_NOBIO.csv: F1 Score = 0.5406824146981627
vanilla_train_100_ds_14b_POSGNER_NOBIO.csv: F1 Score = 0.516795865633075
vanilla_train_100_ds_14b_MTAR.csv: F1 Score = 0.5111111111111112
vanilla_train_100_ds_14b.csv: F1 Score = 0.5013774104683195
vanilla_train_100_ds_14b_EDPOSNER.csv: F1 Score = 0.49853372434017595
vanilla_train_100_ds_14b_POSDHNER_NOBIO.csv: F1 Score = 0.4876847290640394
vanilla_train_100_ds_14b_WLNER_NOBIO.csv: F1 Score = 0.48292682926829267
vanilla_train_100_ds_14b_WLNER.csv: F1 Score = 0.47916666666666663
vanilla_train_100_ds_14b_DBEV_NOBIO.csv: F1 Score = 0.4766839378238342
vanilla_train
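
The scores above are entity-level: a predicted entity counts as correct only if both its type and its exact span match the gold annotation. The following is a simplified, self-contained sketch of that idea for a single sequence. It is not seqeval's implementation, and the helper names `extract_spans` and `entity_f1` are hypothetical.

```python
def extract_spans(labels):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    spans = set()
    start, current_type = None, None
    # A trailing 'O' sentinel closes any span still open at the end
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, entity_type = label.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if current_type is not None and not continues:
            spans.add((current_type, start, i - 1))
            start, current_type = None, None
        if prefix in ("B", "I") and current_type is None:
            start, current_type = i, entity_type
    return spans


def entity_f1(true_labels, pred_labels):
    """Entity-level F1: only exact (type, span) matches count as correct."""
    true_spans = extract_spans(true_labels)
    pred_spans = extract_spans(pred_labels)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


true = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(entity_f1(true, pred))  # 0.5
```

In the example the organization is only partially recovered, so it counts as both a missed entity and a spurious one, which is why token-level accuracy can look much better than entity-level F1.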

## **Comparison between our models, prompt switching and prompt mixing**
After evaluating the various prompts for our models, we chose to focus on those approaches that provided the highest F1 scores.
The next step consisted of applying the best-performing prompt from one model to the other model. In this way, we could directly compare whether the improvements were due to the prompt itself or to the specific model architecture.

In particular, we took the prompt that gave the best results with the first model and used it as input for the second model, and vice versa. This allowed us to assess the transferability and general effectiveness of each prompt, independently of the model for which it was originally designed.

In the end, we also experimented with combining the two best prompts, integrating their most effective elements into a single hybrid prompt. This "prompt mixing" approach aimed to leverage the strengths of both strategies, further improving the overall performance and robustness of our NER system.

We now present the results obtained by running deepseek-r1:14b with the best prompt originally designed for Gemma 3-27b-it. On the sampled training data, this prompt reaches an F1 score of 0.563 (`ADVANCED_NOBIO`), very close to the 0.565 obtained with the POS-guided named entity recognition prompt (`POSGNER`), suggesting that both prompts are effective for this task.

In [None]:
# Initialize client
client = Client()

for j in range(len(sampled_train_data)):
    tokens = sampled_train_data[j]['tokens']
    pos_tags = sampled_train_data[j]['pos_tags']
    true_labels = sampled_train_data[j]['ner_tags']

    prompt = f"""You are an expert in Named Entity Recognition (NER). Your task is to assign an entity tag ID to each token in the sentence using the strict schema below.

    NER Tag IDs:
    - 0 = O → Not an entity  
    - 1 = PERSON → Real names, personal titles (e.g., "Barack", "Dr.", "Angela")  
    - 3 = ORGANIZATION → Companies, institutions, teams, agencies (e.g., "Google", "United Nations", "Lakers", "Juventus")  
    - 5 = LOCATION → Cities, countries, natural landmarks, buildings (e.g., "Paris", "Mount Everest", "Empire State Building")  
    - 7 = MISCELLANEOUS → Nationalities, languages, events, products, titles of works (e.g., "Italian", "Olympics", "iPhone", "French", "The Matrix")

    ---

    Rules:
    - Use only contextual evidence to determine entity types.
    - If the token does not clearly match a defined type, assign `0 (O)`.
    - Please do not assign tags based on assumptions or incomplete context.
    - Please be as precise as possible

    ---

    Now tag the sentence below:
    Sentence: {tokens}  
    This sentence contains exactly {len(tokens)} tokens.

    Your answer MUST follow this format:  
    ner_tags: 0, 1, 0, 5, 5, 0  
    (Must return exactly {len(tokens)} tag IDs in order, no extra text.)
    """

    try:
        response = client.generate(model="deepseek-r1:14b", prompt=prompt)
        raw_text = response.response
        pred_tags_str = clean_response(raw_text)

        # Parsing (conversion and validation)
        parsed_data = parse_response_no_bio(tokens, f"ner_tags: {','.join(pred_tags_str)}", true_labels)

        # Debug print
        print(f"[✓] Sentence {j}")
        print("Tokens:    ", tokens)
        print("Predicted: ", [x[1] for x in parsed_data])
        print("True:      ", [x[2] for x in parsed_data])
        print("---")

        # Save to file
        save_to_csv_vanilla(tokens, [x[1] for x in parsed_data], true_labels, "data100_train_ds/vanilla_train_100_ds_14b_ADVANCED_NOBIO.csv")

    except Exception as e:
        print(f"[!] Error at sentence {j}: {e}")
        print(f"Raw response: {response.response if 'response' in locals() else 'None'}")
        print("---")


We now present the results obtained by running the model with a hybrid prompt that combines the best elements from the Gemma 3-27b-it prompt and the POS-guided named entity recognition approach. In this prompt, the model is provided with both detailed definitions for each entity type and the part-of-speech (POS) tags for every token in the sentence. The instructions explicitly require the model to use only contextual evidence, to avoid making assumptions, and to be as precise as possible when assigning entity tags.

By integrating both the explicit entity schema and the syntactic information from POS tags, this prompt aims to maximize the model's ability to accurately identify and classify entities. The F1 score achieved with this combined prompt (0.607, `ADVANCEDPOS_NOBIO`) is higher than those obtained with either the Gemma 3-27b-it prompt (0.563) or the POS-guided prompt alone (0.565). This improvement suggests that the two strategies are complementary: the detailed entity definitions help clarify the categories, while the POS tags provide valuable grammatical context for disambiguating entity types.

In [None]:
# Initialize client
client = Client()

for j in range(len(sampled_train_data)):
    tokens = sampled_train_data[j]['tokens']
    pos_tags = sampled_train_data[j]['pos_tags']
    true_labels = sampled_train_data[j]['ner_tags']

    prompt = f"""You are an expert in Named Entity Recognition (NER). Your task is to assign an entity tag ID to each token in the sentence using the strict schema below.

    NER Tag IDs:
    - 0 = O → Not an entity  
    - 1 = PERSON → Real names, personal titles (e.g., "Barack", "Dr.", "Angela")  
    - 3 = ORGANIZATION → Companies, institutions, teams, agencies (e.g., "Google", "United Nations", "Lakers", "Juventus")  
    - 5 = LOCATION → Cities, countries, natural landmarks, buildings (e.g., "Paris", "Mount Everest", "Empire State Building")  
    - 7 = MISCELLANEOUS → Nationalities, languages, events, products, titles of works (e.g., "Italian", "Olympics", "iPhone", "French", "The Matrix")

    ---

    Rules:
    - Use only contextual evidence to determine entity types.
    - If the token does not clearly match a defined type, assign `0 (O)`.
    - Please do not assign tags based on assumptions or incomplete context.
    - Please be as precise as possible

    ---

    Now tag the sentence below:
    Sentence: {tokens}  
    This sentence contains exactly {len(tokens)} tokens.
    
    You are also given the POS tag (part-of-speech) for each token.
    POS tags (in order, one per token):
    {pos_tags}

    Your answer MUST follow this format:  
    ner_tags: 0, 1, 0, 5, 5, 0  
    (Must return exactly {len(tokens)} tag IDs in order, no extra text.)
    """

    try:
        response = client.generate(model="deepseek-r1:14b", prompt=prompt)
        raw_text = response.response
        pred_tags_str = clean_response(raw_text)

        # Parsing (conversion and validation)
        parsed_data = parse_response_no_bio(tokens, f"ner_tags: {','.join(pred_tags_str)}", true_labels)

        # Debug print
        print(f"[✓] Sentence {j}")
        print("Tokens:    ", tokens)
        print("Predicted: ", [x[1] for x in parsed_data])
        print("True:      ", [x[2] for x in parsed_data])
        print("---")

        # Save to file
        save_to_csv_vanilla(tokens, [x[1] for x in parsed_data], true_labels, "data100_train_ds/vanilla_train_100_ds_14b_ADVANCEDPOS_NOBIO.csv")

    except Exception as e:
        print(f"[!] Error at sentence {j}: {e}")
        print(f"Raw response: {response.response if 'response' in locals() else 'None'}")
        print("---")
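
Since the hybrid prompt differs from the Gemma-derived prompt only by the extra POS block, the two variants can be generated from one helper. The sketch below uses a condensed schema text rather than the full prompt wording above, and `build_prompt` and `ENTITY_SCHEMA` are hypothetical names introduced for illustration.

```python
# Condensed version of the tag schema (the full prompt also lists examples)
ENTITY_SCHEMA = """NER Tag IDs:
- 0 = O (not an entity)
- 1 = PERSON
- 3 = ORGANIZATION
- 5 = LOCATION
- 7 = MISCELLANEOUS"""


def build_prompt(tokens, pos_tags=None):
    """Assemble the prompt; the POS block is added only when pos_tags is given."""
    parts = [
        "You are an expert in Named Entity Recognition (NER).",
        ENTITY_SCHEMA,
        f"Sentence: {tokens}",
        f"This sentence contains exactly {len(tokens)} tokens.",
    ]
    if pos_tags is not None:
        parts.append(f"POS tags (in order, one per token):\n{pos_tags}")
    parts.append(
        "Your answer MUST follow this format:\n"
        "ner_tags: 0, 1, 0, 5, 5, 0\n"
        f"(Must return exactly {len(tokens)} tag IDs in order, no extra text.)"
    )
    return "\n\n".join(parts)


tokens = ["Blackpool", "0", "Hednesford", "1"]
schema_only_prompt = build_prompt(tokens)
hybrid_prompt = build_prompt(tokens, pos_tags=[22, 11, 22, 11])  # placeholder POS ids
print(hybrid_prompt)
```

Building both prompts from the same function keeps the shared text identical, so any difference in F1 can be attributed to the POS block alone.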


## Running the model on the test set

In this section, we apply the best-performing prompt to the test set to evaluate the generalization ability of our models. This allows us to assess the final performance of each approach on unseen data and compare the results obtained during training with those on the test set.

In [7]:
random.seed(0)

# Sample 100 random elements from the test set
sampled_test_data = random.sample(list(test_data), 100)
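
Fixing the seed makes the sampled subset reproducible across runs, so every prompt is evaluated on the same sentences. A minimal sketch of this property, using a dedicated `random.Random` instance and a hypothetical stand-in `population` in place of `test_data`:

```python
import random

# Hypothetical stand-in for the test split
population = list(range(1000))

# Two independent generators with the same seed yield the same sample
sample_a = random.Random(0).sample(population, 100)
sample_b = random.Random(0).sample(population, 100)

print(sample_a == sample_b)  # True
print(len(set(sample_a)))    # 100 (sampling is without replacement)
```

Using a dedicated generator instead of the global `random.seed` avoids the sample changing when other cells consume random numbers in between.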

In [8]:
# Initialize client
client = Client()

for j in range(len(sampled_test_data)):
    tokens = sampled_test_data[j]['tokens']
    pos_tags = sampled_test_data[j]['pos_tags']
    true_labels = sampled_test_data[j]['ner_tags']

    prompt = f"""You are an expert in Named Entity Recognition (NER). Your task is to assign an entity tag ID to each token in the sentence using the strict schema below.

    NER Tag IDs:
    - 0 = O → Not an entity  
    - 1 = PERSON → Real names, personal titles (e.g., "Barack", "Dr.", "Angela")  
    - 3 = ORGANIZATION → Companies, institutions, teams, agencies (e.g., "Google", "United Nations", "Lakers", "Juventus")  
    - 5 = LOCATION → Cities, countries, natural landmarks, buildings (e.g., "Paris", "Mount Everest", "Empire State Building")  
    - 7 = MISCELLANEOUS → Nationalities, languages, events, products, titles of works (e.g., "Italian", "Olympics", "iPhone", "French", "The Matrix")

    ---

    Rules:
    - Use only contextual evidence to determine entity types.
    - If the token does not clearly match a defined type, assign `0 (O)`.
    - Please do not assign tags based on assumptions or incomplete context.
    - Please be as precise as possible

    ---

    Now tag the sentence below:
    Sentence: {tokens}  
    This sentence contains exactly {len(tokens)} tokens.
    
    You are also given the POS tag (part-of-speech) for each token.
    POS tags (in order, one per token):
    {pos_tags}

    Your answer MUST follow this format:  
    ner_tags: 0, 1, 0, 5, 5, 0  
    (Must return exactly {len(tokens)} tag IDs in order, no extra text.)
    """

    try:
        response = client.generate(model="deepseek-r1:14b", prompt=prompt)
        raw_text = response.response
        pred_tags_str = clean_response(raw_text)

        # Parsing (conversion and validation)
        parsed_data = parse_response_no_bio(tokens, f"ner_tags: {','.join(pred_tags_str)}", true_labels)

        # Debug print
        print(f"[✓] Sentence {j}")
        print("Tokens:    ", tokens)
        print("Predicted: ", [x[1] for x in parsed_data])
        print("True:      ", [x[2] for x in parsed_data])
        print("---")

        # Save to file
        save_to_csv_vanilla(tokens, [x[1] for x in parsed_data], true_labels, "data100_test_ds/vanilla_test_100_ds_14b_ADVANCEDPOS_NOBIO.csv")

    except Exception as e:
        print(f"[!] Error at sentence {j}: {e}")
        print(f"Raw response: {response.response if 'response' in locals() else 'None'}")
        print("---")


[✓] Sentence 0
Tokens:     ['"', 'The', 'recent', 'level', 'of', 'the', 'yen', 'exchange', 'rate', 'has', 'been', 'stable', ',', 'and', 'it', 'does', 'not', 'appear', 'to', 'be', 'moving', 'towards', 'a', 'further', 'depreciation', 'of', 'the', 'yen', 'immediately', ',', 'so', 'import', 'prices', 'are', 'likely', 'to', 'stabilise', 'at', 'current', 'levels', ',', '"', 'Matsushita', 'said', 'in', 'an', 'interview', 'with', 'the', 'Nihon', 'Keizai', 'Shimbun', '.']
Predicted:  [0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0]
True:       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 4, 4, 0]
---
[✓] Sentence 1
Tokens:     ['Blackpool', '0', 'Hednesford', '1']
Predicted:  [5, 0, 5, 0]
True:       [3, 0, 3, 0]
---
[✓] Sentence 2
Tokens:     ['Wheat', '387.4', '4677.8', '4553.6',

### Performance on test set

In [10]:
import csv
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

label_mapping = {
    0: 'O',
    1: 'B-PER',
    2: 'I-PER',
    3: 'B-ORG',
    4: 'I-ORG',
    5: 'B-LOC',
    6: 'I-LOC',
    7: 'B-MISC',
    8: 'I-MISC'
}

category_to_index = {
    'O': 0,
    'B-PER': 1, 
    'I-PER': 2, 
    'B-ORG': 3, 
    'I-ORG': 4, 
    'B-LOC': 5, 
    'I-LOC': 6, 
    'B-MISC': 7, 
    'I-MISC': 8
    }

valid_labels = set(label_mapping.values())

def evaluate_predictions(filename: str) -> None:
    true_seqs = []
    pred_seqs = []
    current_true = []
    current_pred = []
    invalid_count = 0

    with open(filename, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            true_index = int(row['true'])
            pred_index = int(row['pred'])

            true_label = label_mapping.get(true_index, 'O')
            pred_label = label_mapping.get(pred_index, 'INVALID')

            # Count and skip rows whose predicted id is outside the 0-8 tag range
            if pred_label not in valid_labels:
                invalid_count += 1
                continue

            current_true.append(true_label)
            current_pred.append(pred_label)

        true_seqs.append(current_true)
        pred_seqs.append(current_pred)

    print(f"Invalid predictions: {invalid_count}\n")

    print("Precision:", precision_score(true_seqs, pred_seqs))
    print("Recall:", recall_score(true_seqs, pred_seqs))
    print("F1 Score:", f1_score(true_seqs, pred_seqs))

    print("\nDetailed classification report:\n")
    print(classification_report(true_seqs, pred_seqs))


evaluate_predictions('data100_test_ds/vanilla_test_100_ds_14b.csv')

Invalid predictions: 29

Precision: 0.37089201877934275
Recall: 0.5197368421052632
F1 Score: 0.4328767123287671

Detailed classification report:

              precision    recall  f1-score   support

         LOC       0.49      0.75      0.59        59
        MISC       0.09      0.25      0.13        12
         ORG       0.21      0.16      0.18        49
         PER       0.46      0.75      0.57        32

   micro avg       0.37      0.52      0.43       152
   macro avg       0.31      0.48      0.37       152
weighted avg       0.36      0.52      0.42       152
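
As a sanity check on the report above, the micro-averaged scores can be related to raw entity counts. The counts in the sketch below are inferred from the printed precision, recall and support, not read from the CSV file.

```python
support = 152                     # gold entities (from the report)
recall = 0.5197368421052632       # printed recall
precision = 0.37089201877934275   # printed precision

# recall = correct / support, so the number of correctly predicted entities is
correct = round(recall * support)       # 79
# precision = correct / predicted, so the number of predicted entities is
predicted = round(correct / precision)  # 213

# Micro F1 expressed directly in counts: 2 * TP / (predicted + gold)
f1 = 2 * correct / (predicted + support)
print(correct, predicted, f1)
```

Under these inferred counts, the model proposed 213 entities for 152 gold ones and got 79 exactly right, which explains why precision is noticeably lower than recall.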



### Running the vanilla prompt on the test set

In [9]:
# Initialize client
client = Client()

for j in range(len(sampled_test_data)):
    tokens = sampled_test_data[j]['tokens']
    pos_tags = sampled_test_data[j]['pos_tags']
    true_labels = sampled_test_data[j]['ner_tags']


    prompt = f"""You are a strict NER tagging system.

    Given the following NER tags:
    {{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}}

    Your task is to assign the correct tag number to each token in this sentence:
    {tokens}

    This sentence contains exactly {len(tokens)} tokens.

    Respond ONLY with:
    ner_tags: x, x, x, ..., x  ← (exactly {len(tokens)} integers)

    Do NOT include explanations, thoughts, or any other content.
    Do NOT write anything before or after "ner_tags: ...".
    Just print the sequence in the format specified.
    """

    try:
        response = client.generate(model="deepseek-r1:14b", prompt=prompt)
        raw_text = response.response
        pred_tags_str = clean_response(raw_text)

        # Parsing (conversion and validation)
        parsed_data = parse_response(tokens, f"ner_tags: {','.join(pred_tags_str)}", true_labels)

        # Debug print
        print(f"[✓] Sentence {j}")
        print("Tokens:    ", tokens)
        print("Predicted: ", [x[1] for x in parsed_data])
        print("True:      ", [x[2] for x in parsed_data])
        print("---")

        # Save to file
        save_to_csv_vanilla(tokens, [x[1] for x in parsed_data], true_labels, "data100_test_ds/vanilla_test_100_ds_14b.csv")

    except Exception as e:
        print(f"[!] Error at sentence {j}: {e}")
        print(f"Raw response: {response.response if 'response' in locals() else 'None'}")
        print("---")


[✓] Sentence 0
Tokens:     ['"', 'The', 'recent', 'level', 'of', 'the', 'yen', 'exchange', 'rate', 'has', 'been', 'stable', ',', 'and', 'it', 'does', 'not', 'appear', 'to', 'be', 'moving', 'towards', 'a', 'further', 'depreciation', 'of', 'the', 'yen', 'immediately', ',', 'so', 'import', 'prices', 'are', 'likely', 'to', 'stabilise', 'at', 'current', 'levels', ',', '"', 'Matsushita', 'said', 'in', 'an', 'interview', 'with', 'the', 'Nihon', 'Keizai', 'Shimbun', '.']
Predicted:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0]
True:       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 4, 4, 0]
---
[✓] Sentence 1
Tokens:     ['Blackpool', '0', 'Hednesford', '1']
Predicted:  [5, 0, 5, 0]
True:       [3, 0, 3, 0]
---
[✓] Sentence 2
Tokens:     ['Wheat', '387.4', '4677.8', '4553.6',