## Master's Degree in Computer Engineering (AI & Robotics)

**Stella Francesco 2124359** \
**Lavezzi Luca 2154256**

# Named Entity Recognition (NER)

In this notebook we address the task of Named Entity Recognition (NER), which consists of identifying and classifying entities such as persons, organizations, locations, and miscellaneous names within text. NER is a fundamental problem in Natural Language Processing (NLP), with applications ranging from information extraction to question answering.

For our experiments, we use the widely adopted CoNLL-2003 dataset, which contains annotated newswire articles with four types of named entities: persons, organizations, locations, and miscellaneous. This dataset is a standard benchmark for evaluating NER systems in the general domain.

To explore zero-shot and few-shot NER, we experiment with two instruction-tuned large language models (LLMs): **Gemma 3-27b-it**, which is accessed via API, and **DeepSeek-R1:14b**, which is executed locally using the Ollama framework.

Following the guidelines of the project, we implement and evaluate the baseline method from Xie et al. (2023), as well as one additional zero-shot prompting strategy inspired by their work. We refine our prompts and methods using a training split of the dataset, and report results on a separate test split, following best practices for NER evaluation.

The notebook is organized as follows:

- Introduction of the dataset and domain
- Documentation of any additional annotation or external libraries used
- Introduction to the models used
- Description of each zero-shot and few-shot method
- Evaluation and critical discussion of the results

# CoNLL 2003 Dataset

For this project, we used the CoNLL-2003 shared task dataset on language-independent named entity recognition (NER). This dataset includes annotated data for two languages, German and English, but we exclusively used the English portion for training and testing.

The English data is sourced from the Reuters Corpus, comprising newswire articles published between August 1996 and August 1997. Specifically, the training set consists of articles from a ten-day period toward the end of August 1996, while the test set includes articles from December 1996.

Each token in the dataset is annotated with one of four named entity types, or with O when it does not belong to any entity:

* PER – Person
* ORG – Organization
* LOC – Location
* MISC – Miscellaneous entities not covered by the other categories
* O – Outside of a named entity

The CoNLL-2003 dataset is widely regarded as a benchmark for evaluating NER systems due to its standardized format, high-quality annotations, and well-defined evaluation protocol (typically based on precision, recall, and F1-score). It continues to serve as a critical resource for comparing both traditional machine learning and modern deep learning approaches in sequence labeling tasks.

Using the Hugging Face Datasets library, we loaded the dataset. Each instance is a dictionary with the following fields (the three tag fields are stored as lists of integer class ids):

* id – The identifier of the sample
* tokens – A list containing the tokenized sentence
* pos_tags – A list of part-of-speech (POS) tags associated with each token
* chunk_tags – A list of syntactic chunk tags
* ner_tags – A list of named entity recognition tags corresponding to the tokens
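
Because `ner_tags` holds integer ids rather than label strings, it helps to decode them before reading or scoring a sentence. The sketch below is a minimal illustration: the label order is hard-coded under the assumption that it matches the dataset card, and `decode_ner_tags` is a hypothetical helper name, not part of any library. In the notebook the same list should be obtainable from `train_data.features["ner_tags"].feature.names`.

```python
# Label order assumed to follow the dataset card (an assumption of this sketch).
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_ner_tags(tag_ids):
    """Map integer class ids to IOB2 label strings."""
    return [NER_LABELS[i] for i in tag_ids]


# First training sentence, as printed by the loading cell.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

decoded = decode_ner_tags(ner_tags)
for token, label in zip(tokens, decoded):
    print(f"{token}\t{label}")
```

Under this mapping, "EU" is decoded as B-ORG and "German" and "British" as B-MISC, while every other token is O.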

In [9]:
from datasets import load_dataset
import random

# Set the random seed for reproducibility
random.seed(0)
# Load the dataset
dataset = load_dataset("eriktks/conll2003")

# Access the train, validation, and test splits
train_data = dataset["train"]
validation_data = dataset["validation"]
test_data = dataset["test"]

# Sample
print("Example:", train_data[0])
print("\nTraining set size:", len(train_data))
print("Validation set size:", len(validation_data))
print("Test set size:", len(test_data))

Example: {'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

Training set size: 14041
Validation set size: 3250
Test set size: 3453


For both the training and testing phases, we used seed 0 to ensure reproducibility, and the samples were taken from the training and test splits provided by the dataset.
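
A seeded draw of this kind can be sketched as follows. This is only an illustration: the helper `sample_indices` and the subset size `k=5` are hypothetical, and the sketch uses a local `random.Random` generator instead of the global `random.seed(0)` call in the cell above, so it will not necessarily reproduce the exact indices used in our runs.

```python
import random


def sample_indices(split_size, k, seed=0):
    """Return k distinct example indices drawn with a fixed seed."""
    rng = random.Random(seed)  # local generator, independent of global state
    return sorted(rng.sample(range(split_size), k))


# Split sizes as printed above; k=5 is an illustrative subset size only.
train_idx = sample_indices(14041, k=5)
test_idx = sample_indices(3453, k=5)
print("train subset:", train_idx)
print("test subset:", test_idx)
```

Calling the helper twice with the same seed returns the same indices, which is the property that makes a run repeatable.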

# Libraries
## Google library
The Google Gen AI Python SDK provides an interface for integrating Google's generative models into Python applications. We used this library because we tested our prompts using Gemma 3. The SDK offers access to a variety of large language models from Google, through a user-friendly structure and flexible configuration options. It supports features such as adjusting the temperature for response creativity and sending batch requests for more efficient processing.

## Hugging Face
Hugging Face is a leading machine learning and data science platform and community that enables users to build, train, and deploy machine learning models with ease. It offers a wide range of pretrained large language models (LLMs) that can be used for tasks such as text generation, summarization, translation, classification, and more.

Developers can also fine-tune these models on custom datasets or even build new models from scratch using the tools provided by Hugging Face. A key component of the platform is the Transformers library, which we used in this project. It provides seamless integration with popular deep learning frameworks like PyTorch and TensorFlow, making it easier to experiment and scale models across different environments.

In addition to models, Hugging Face hosts datasets such as CoNLL-2003.

## SeqEval
Seqeval is a Python library designed for evaluating sequence labeling tasks, such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and chunking. It provides a simple and efficient way to compute standard evaluation metrics, including precision, recall, and F1-score, specifically tailored for structured prediction tasks where labels follow formats like IOB/IOB2. We adopt this library to evaluate our results and compare the various methods.
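
To make clear what entity-level scoring means, the sketch below computes micro-averaged precision, recall, and F1 over IOB2 sequences in plain Python. It is not seqeval's implementation, and the function names are hypothetical. One simplification is an assumption of this sketch: an `I-` tag that does not continue an open span of the same type is ignored, which may differ from how seqeval treats malformed sequences.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB2 tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        # Close the open span unless this tag continues it.
        if start is not None and tag != f"I-{etype}":
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_prf(gold_sentences, pred_sentences):
    """Micro-averaged entity-level precision, recall and F1."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)  # a span counts only if type and boundaries both match
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Gold tags of the first training sentence versus a prediction missing one entity.
gold = [["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]]
pred = [["B-ORG", "O", "B-MISC", "O", "O", "O", "O", "O", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here two of the three gold entities are found and no spurious entity is predicted, so precision is 1, recall is 2/3, and F1 is 0.8.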

## Ollama
The ollama library provides a Python interface to interact with large language models (LLMs) running locally via the Ollama server. It allows users to send prompts, receive model-generated responses, and manage inference workflows directly from Python code. This makes it easy to integrate advanced generative AI capabilities into data analysis pipelines, research experiments, or application development, all while keeping computation on the local machine.

## spaCy
spaCy is an open-source software library for advanced natural language processing (NLP), written in Python and Cython. Unlike libraries such as NLTK, which are often used for teaching and research, spaCy is designed specifically for production use, offering fast and robust tools for tasks like tokenization, part-of-speech tagging, dependency parsing, text categorization, and named entity recognition (NER).

spaCy supports deep learning workflows and can integrate with popular machine learning libraries such as TensorFlow and PyTorch via its own backend, Thinc. It provides prebuilt neural network models for 23 languages and supports tokenization for over 65 languages, making it suitable for both multilingual applications and custom model training.


In [5]:
%pip install -q -U google-genai
%pip install seqeval
%pip install google-generativeai
%pip install ipywidgets --upgrade
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install pydantic
!pip install datasets[all]
!pip install pandas
!pip install ollama
!pip install spacy
!python -m spacy download en_core_web_sm




# Models

For the development and evaluation of the prompts in this project, we used two different large language models: Gemma 3 (27B) and a distilled DeepSeek-R1 model (14B).

## Gemma 3 (27b)

We chose Gemma 3 because it is a recently released model by Google, offering several notable advantages:

* Computational efficiency: Gemma 3 is a relatively lightweight model that achieves reasonable performance compared to much larger state-of-the-art (SOTA) models such as Gemini 2.5 Pro.
* API limits: Unlike many other models with free-tier APIs, Gemma 3 allows for a higher number of requests per day, making it more suitable for iterative development and testing.

However, these advantages come with a significant trade-off:

* Lower performance: In our tests, Gemma 3 underperformed compared to more advanced models. Against Gemini 2.0 Flash, we observed a performance gap of up to 20% in F1-score. Even so, using higher-performing models was impractical: their API rate limits would have required a considerable amount of additional time to complete equivalent testing.

The lower accuracy of Gemma 3 also influenced our implementation: model outputs required additional validation and post-processing. In some cases, responses did not adhere to the expected output format, necessitating extra checks and error handling in our code.
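
The kind of check this requires can be sketched as follows. The sketch assumes the model is asked to answer with a JSON list containing one label per token. That output format, the helper name `parse_tags`, and the all-`O` fallback are assumptions made for illustration, not a description of our exact code.

```python
import json
import re

VALID_LABELS = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                "B-LOC", "I-LOC", "B-MISC", "I-MISC"}


def parse_tags(response_text, n_tokens):
    """Extract n_tokens valid labels from raw model text, else fall back to all 'O'."""
    fallback = ["O"] * n_tokens
    # Take the first [...] block, ignoring any prose around it.
    match = re.search(r"\[.*?\]", response_text, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        tags = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(tags, list) or len(tags) != n_tokens:
        return fallback
    # Replace labels outside the tag set with 'O' instead of rejecting the answer.
    return [t if isinstance(t, str) and t in VALID_LABELS else "O" for t in tags]


print(parse_tags('Here are the tags: ["B-ORG", "O", "B-MISC"]', 3))
print(parse_tags('["B-ORG", "O"]', 3))            # wrong length
print(parse_tags("I cannot answer that.", 3))     # no list at all
print(parse_tags('["B-ORG", "O", "B-FOO"]', 3))   # unknown label
```

Falling back to `O` keeps the predicted sequence aligned with the gold sequence, so a malformed answer lowers recall instead of breaking the evaluation.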

## DeepSeek-R1 (distilled)
In this section, we present and discuss our implementation based on the distilled DeepSeek-R1 model.

### How to run DeepSeek-R1 locally
Go to https://ollama.com, then download and install Ollama.
Next, browse the available DeepSeek-R1 models at https://ollama.com/library/deepseek-r1.
Select the model you wish to download and copy the corresponding command to run in the terminal (for instance: ollama run deepseek-r1:14b).


### Model selection
Our choice of model was based on two factors: generation time and performance.
We decided to use deepseek-r1:14b because it generated tokens relatively fast while still performing almost as well as gemma-3-27b-it, the model we had used previously, whose performance was acceptable.
Running bigger models such as deepseek-r1:32b was not feasible due to the very long token generation times, and the increase in performance compared to deepseek-r1:14b was minimal.
This led us to try even smaller models, such as deepseek-r1:7b and deepseek-r1:8b, but these performed much worse.



## **Implemented methods**
In this section, we list and explain the methods we implemented using deepseek-r1:14b.

### - Word-level named entity reflection

In this method, we extend the vanilla prompt with an instruction asking the model to generate a short summary for each word in the sentence. Each summary must be very short (around ten words) and explain why the word is or is not a plausible candidate for an entity tag.

This approach explicitly leverages the **chain of thought** capability of large language models: by prompting the model to reflect and reason about each token individually, it is encouraged to articulate its decision-making process step by step. This not only helps the model to make more informed tagging decisions, but also provides interpretability, as the generated summaries reveal the rationale behind each prediction.

However, this method performed slightly worse than the baseline reported in the reference paper.

>This method was designed and developed by us specifically for this project.

### - Multi-turn adaptive refinement

In this method we asked the model to name the words that are potential candidates for an entity tag. The model had to explain every candidate it proposed and then give the results in the correct format.

Like the previous method, this approach explicitly leverages the **chain of thought** capabilities of large language models. By prompting the model to reason step by step about which words are likely entities and to justify its choices, we encourage a more structured and interpretable decision-making process. This helps the model focus on the most relevant tokens and also shortens the generated output, since the prompt asks only for the candidate entities and their explanations.

This method performed slightly better than the baseline.

>This method was designed and developed by us specifically for this project.

### - Dependency-based entity validation
In this method we added to the prompt the dependency tree of each sentence, obtained with spaCy (the dependency tree is not part of the dataset).

The motivation behind this approach was to provide the model with additional syntactic information about the relationships between words in the sentence. By including the dependency tree, the model could potentially leverage the grammatical structure to better identify entities, especially in complex sentences where context and word dependencies play a crucial role.

To implement this, we used spaCy to generate the dependency tree for each sentence, representing the syntactic relations in a structured format. This tree was then inserted into the prompt alongside the tokens to be tagged. The expectation was that the model, with access to this extra layer of linguistic information, would be able to make more informed tagging decisions, particularly in cases where surface-level cues were insufficient.
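
One possible structured format is one line per token, naming its relation and its head. The sketch below shows only that rendering step. The line layout and the helper `format_dependencies` are hypothetical, and the arcs are written by hand for illustration rather than taken from real parser output. With spaCy, such triples could be collected from a parsed `doc` as `[(t.text, t.dep_, t.head.text) for t in doc]`.

```python
def format_dependencies(arcs):
    """Render (token, relation, head) triples as one line per token."""
    return "\n".join(f"{tok} --{rel}--> {head}" for tok, rel, head in arcs)


# Hand-written arcs for "EU rejects German call" (illustrative, not parser output).
arcs = [
    ("EU", "nsubj", "rejects"),
    ("rejects", "ROOT", "rejects"),
    ("German", "amod", "call"),
    ("call", "dobj", "rejects"),
]

tokens = [tok for tok, _, _ in arcs]
dependency_block = format_dependencies(arcs)

# Hypothetical prompt fragment: tokens first, then the dependency lines.
prompt_fragment = (
    f"Tokens: {' '.join(tokens)}\n"
    f"Dependencies:\n{dependency_block}"
)
print(prompt_fragment)
```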

However, in our experiments, this method achieved a slightly lower F1 score.

It is also important to note that the dependency trees were not originally part of the dataset and had to be computed using spaCy. This introduces a potential source of error, as the quality of the dependency parsing depends on the accuracy of spaCy's model. If the dataset had included gold-standard dependency trees, or if the dependency information had been annotated and curated specifically for the NER task, the results could have been better and the model might have been able to exploit this information more effectively.

>This method was already used in the reference paper.

### - POS-guided named entity recognition
In this method we added the POS tags for every token of the sentence, which are part of the dataset.

The main idea behind this approach is to provide the model with explicit information about the grammatical role of each token in the sentence. Part-of-speech (POS) tags indicate whether a word is a noun, verb, adjective, proper noun, etc., and can be very helpful for named entity recognition because certain entity types are strongly associated with specific POS tags (for example, proper nouns are often persons, organizations, or locations).

By including the POS tags in the prompt, we aimed to help the model disambiguate cases where the token alone might not be sufficient to determine the correct entity type. For instance, the word "Apple" could refer to a fruit (common noun) or a company (proper noun), and the POS tag can provide a useful clue for the model to make the right choice.

In practice, we formatted the prompt so that the model received both the list of tokens and the corresponding POS tags in order. This additional information allowed the model to make more informed predictions, especially in sentences with ambiguous or rare words. Our experiments showed that this method generally improved the F1 score compared to the vanilla approach, confirming the usefulness of syntactic information for NER tasks.
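
A prompt of this shape can be assembled as in the sketch below. The instruction wording and the helper `build_pos_prompt` are hypothetical and do not reproduce our exact prompt. The POS tags are assumed to be already decoded to strings; in the notebook the integer ids would first be mapped through `train_data.features["pos_tags"].feature.names`.

```python
def build_pos_prompt(tokens, pos_tags):
    """Return a prompt that lists each token next to its POS tag, in order."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    pairs = "\n".join(f"{tok}\t{pos}" for tok, pos in zip(tokens, pos_tags))
    return (
        "Tag each token with one of: O, B-PER, I-PER, B-ORG, I-ORG, "
        "B-LOC, I-LOC, B-MISC, I-MISC.\n"
        "Each line below holds a token and its part-of-speech tag.\n\n"
        f"{pairs}\n\n"
        "Answer with a JSON list containing one label per token."
    )


# First training sentence with its POS ids decoded to Penn Treebank tags.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
pos_tags = ["NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "."]

prompt = build_pos_prompt(tokens, pos_tags)
print(prompt)
```

Keeping one token per line makes the pairing between token and tag explicit, and the length check catches misaligned inputs before any request is sent.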

The performance scores indicate that this method outperformed the baseline method implemented in the reference paper.

>This method was already used in the reference paper.

### - POS-dependency hybrid NER
In this method we added to the prompt both the POS tags and the dependency tree.

The rationale behind this approach is to combine two complementary sources of linguistic information: the part-of-speech (POS) tags, which indicate the grammatical role of each token, and the dependency tree, which describes the syntactic relationships between words in the sentence. By providing both types of annotations, we aimed to give the model a richer context for making NER predictions, especially in cases where either POS or dependency information alone might not be sufficient.

In practice, the prompt was structured to include the list of tokens, their corresponding POS tags, and the full dependency tree (as computed by spaCy) for each sentence. This allowed the model to consider not only the type of each word but also how words are connected and which tokens are likely to form multi-word entities based on their syntactic structure.

However, our experiments showed that the performance was close to that of the dependency-based entity validation method. Additionally, as with the dependency-based method, the quality of the dependency tree depends on the accuracy of spaCy's parser. If gold-standard dependency annotations had been available in the dataset, the results might have been better and the model could have leveraged this information more effectively.

>This method was designed and developed by us specifically for this project.

### - Example-driven POS NER
In this method we added the POS tags for every token of the sentence. We also added three complete examples, each consisting of the sentence tokens, their POS tags, and the NER tags associated with those tokens. This makes it a **few-shot learning** method.

The main motivation for this approach was to provide the model with concrete, context-rich demonstrations of the NER task, making it easier for the model to generalize the tagging strategy to new sentences. By including three full examples in the prompt—each showing the tokens, their corresponding POS tags, and the correct NER tags—the model could observe how the tagging should be performed in practice, even for more complex or ambiguous cases.
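
The structure of such a prompt can be sketched as follows. The instruction text, the `Tokens:`/`POS:`/`NER:` layout, and both helper names are hypothetical. The first two demonstrations are toy sentences written for this sketch, while the third is the first training sentence shown earlier. None of this reproduces the exact examples we used.

```python
INSTRUCTION = (
    "Label every token with one of: O, B-PER, I-PER, B-ORG, I-ORG, "
    "B-LOC, I-LOC, B-MISC, I-MISC."
)


def format_example(tokens, pos_tags, ner_tags=None):
    """Render one sentence; leave the NER line open when no labels are given."""
    lines = [f"Tokens: {' '.join(tokens)}", f"POS: {' '.join(pos_tags)}"]
    lines.append("NER: " + " ".join(ner_tags) if ner_tags is not None else "NER:")
    return "\n".join(lines)


def build_few_shot_prompt(examples, query_tokens, query_pos):
    """Concatenate the instruction, the solved examples and the open query."""
    shots = "\n\n".join(format_example(*example) for example in examples)
    query = format_example(query_tokens, query_pos)
    return f"{INSTRUCTION}\n\n{shots}\n\n{query}"


examples = [
    # Toy demonstrations written for this sketch (not taken from CoNLL-2003).
    (["Paris", "is", "sunny"], ["NNP", "VBZ", "JJ"], ["B-LOC", "O", "O"]),
    (["Alice", "joined", "Acme"], ["NNP", "VBD", "NNP"], ["B-PER", "O", "B-ORG"]),
    # First training sentence of the dataset.
    (["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
     ["NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "."],
     ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]),
]

prompt = build_few_shot_prompt(examples, ["Bob", "visited", "Rome"], ["NNP", "VBD", "NNP"])
print(prompt)
```

Ending the prompt on an open `NER:` line mirrors the solved examples, which nudges the model to continue in the same format.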

In our experiments, this method generally performed similarly to the original prompt without examples.

>This method was designed and developed by us specifically for this project.


## **Comparison of our models, prompt switching and prompt mixing**
After evaluating the various prompts for our models, we chose to focus on those approaches that provided the highest F1 scores.
The next step consisted of applying the best-performing prompt from one model to the other model. This let us check directly whether the improvements were due to the prompt itself or to the specific model.

In particular, we took the prompt that gave the best results with the first model and used it as input for the second model, and vice versa. This allowed us to assess the transferability and general effectiveness of each prompt, independently of the model for which it was originally designed.

Finally, we also experimented with combining the two best prompts, integrating their most effective elements into a single hybrid prompt. This "prompt mixing" approach aimed to leverage the strengths of both strategies, with the goal of further improving the overall performance and robustness of our NER system.