# Students
Omar Arab, ID number: 2141200, Computer Engineering, Curriculum: AI & Robotics
\
Juri Farruku, ID number: 2157856, Computer Engineering, Curriculum: AI & Robotics

# Dataset
In this case of study we decided to use **CoNLL-2003**, which is a well known and widely used dataset in named entity recognition tasks.
This dataset doesn't concern a specific domain, indeed the entities contained in it can be classified in four generic groups: **"People", "Location", "Organization" and "Miscellaneous"**. Each sample is a dictionary containing 5 key-value pairs where each key is a string and each value is a list of different elements, depending on which is the key.
Here is an example of the key-value pairs that constitute a sample: <br>
* key = **chunk_tags** : value = **[11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0]**, which describes the syntactic structure in chunks
* key = **id** : value = **0**, which indicates the index associated to the sample in the entire dataset
* key = **ner_tags** : value = **[0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]**, which contains NER labels that can be converted in BIO format
* key = **pos_tags** : value = **[12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7]**, which contains Part-of-Speech tags
* key = **token** : value **["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "."]**, which contains the tokens of the sentence

This kind of structure was helpful in developing the methods in the following sections. For example, in the **Tool Augmentation** method, we had already the list of true PoS tags, which are essential in order to obtain good results. <br>
In order to trasform the int values contained in the **ner_tags** list we didn't use the given dictionary of the dataset, which distinguishes between B and I labels, instead we created an ad-hoc dictionary which maps an integer value in one of the four possible labels aforementioned.

# Evaluation Metrics
For this project we have decided to use three metrics: **Precision**, **Recall** and the **F1-score**.
Once we have stored the results in a suitable format, we then begin to analyze them in order to evaluate the three aformentioned values. For this reason we have developed a function that calculates **True Positives**, **False Positives** and **False Negatives**. This function, called **confusion_matrix_calc**, takes the response of a specific prompt on a given sentence, the ground-truth associated with that sentence, the tokens of the sentence and the set of possible labels. <br>
The first step that the function does is preprocessing the given response. In fact we create two different lists from that: one used to compute true and false positives and one to compute false negatives. Elements of the first list are tuples of the kind (label, list of entities associated to the label), meanwhile the second list contains just the labeled entities, disregarding the label that the chatBot associated to them. In order to improve the correctness of the method we also lowercase both the strings in the response, both the tokens obtained from the dataset.

In order to compute true and false positives we iterate through the first aforementioned list, and for each element, which is a tuple, we iterate through each element in the list. For each element we check whether the string is contained in the tokens of the sentence ( this is done since sometimes the chatBot gives us back different tokens from what it was given) and find its associated index in the list of tokens. We then use this index and check whether the token has a label associated or not and this is done accessing the ground truth list. If the ground truth values equals the label which is being processed we have a true positive, otherwise we have a false positive. <br>

Instead, for the computation of the false negatives, we used the second list. We iterate through all the tokens and search for one that has a label associated to it, being different from zero (since this value means that it is not a named entity). Once we found one, we then iterate through the elements of the list and check whether we are able to find a string which is equal to the token of the original sentence: if we have a positive match, we don't have a false negative, otherwise yes. <br>

In the end, it returns a tuple of 3 elements that contains the aforementioned values.

In [None]:
from datasets import load_dataset
import google.generativeai as genai
import time
import json
import ast
import re


#Setting the interface with the chatBot used
genai.configure(api_key="key to be inserted")
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest")

def confusion_matrix_calc(labels, ground_truth, current_tokens, label_type):

    entities = []
    entities_for_f_n = []
    true_positive = 0
    false_positive = 0
    false_negative = 0

    for i in range (len(labels)): #For alla dictionaries contained in labels
        temp_list_string = labels[i]["response"] #Get list of entities labeled by the chatbot

        try:
            final_list = ast.literal_eval(temp_list_string) #Transform string in a real list
        except (SyntaxError, ValueError):
            final_list = re.findall(r"'([^']*)'", temp_list_string)

        final_list = [s.lower() for s in final_list] #Lower all words in the list of entities
        final_list = [word for s in final_list for word in s.split()] #Split strings made of 2 words
        tuple_label_entities = (label_type[i], final_list) #Create a tuple of the kind (label, list of entities)
        entities.append(tuple_label_entities)   #Append it in the list containing all lists
        entities_for_f_n = entities_for_f_n + final_list #Store entities recognised without labels, will be used for false negatives computation

    current_tokens = [s.lower() for s in current_tokens]


    for current_label, entity_list in entities:

        if not entity_list: #If list is empty, go to next element in the list of tuples
            continue

        for element in entity_list: #Check each entity in the list

            if element in current_tokens:   #If tokens given by responde are in the sample sentence
                index = current_tokens.index(element)       #Take index 
                if ground_truth[index] == current_label:    #Check label and see if it is the same as ground truth or not
                    true_positive += 1
                else:
                    false_positive += 1
            else:
                false_positive += 1
                


    for i in range(len(current_tokens)): #For each token 
        labeled = False #Used to indicate if the token has been labeled by the chatbot or not

        if ground_truth[i] != 0:    #If token has a labeled assigned
            if not entities_for_f_n:
                break

            labeled_token = current_tokens[i]   #Get the actual token which is labeled

            for entity in entities_for_f_n: #For each entity retrieved from chatbot
                if entity in current_tokens:
                    if entity == labeled_token:    #If it is equal 
                        labeled = True  #Set the label to true

            if not labeled: #If, after checking all entities retrieved by chatbot we haven't found it
                false_negative = false_negative + 1     #Chabot has not labeled it, but it has a ground-truth label assigned

            


    return(true_positive, false_positive, false_negative)

# Implementation


In our case study we decided to use Gemini as chatBot used to retrieve the responses and in particular the model we chose is gemini-1.5-flash.

### 1-Baseline method(Vanilla)



The baseline method, also referred to as the Vanilla approach, represents the most straightforward strategy to complete the Named Entity Recognition task in a zero shot setting. In fact in this method, unlike the others which we will see, no additional syntactic or structural information is provided to the language model. Instead, the prompt simply includes the raw sentence from the development set and a high level instruction which contains informations on the format of the output we would like to obtain and the set of labels based on which the task is computed.
In particular the set of prompts of the baseline method include a prompt with the label set given in the beginning, a second prompt in which the label set is given in the end of the instructions and a final prompt in which the tone of the request is more conversational.
It's important to notice that we ask the LLM to return all the named entities of all different labels at once. <br>

We also specify to the chatBot to give us back the named entities retrieved as a python list of the kind ['Entity_1',..., 'entity_n'], without giving us any other element, such as python code or json files. This is done in order to make the parsing and analysis of the responses more similar and easy. The prompts are the following:

In [4]:
vanilla_prompts = [
    "Given entity label set: Person, Organization, Location, Miscellaneous.\n"
    "Return a dictionary (not in JSON or Python code) with the label as key and a list of entities associated with that label as value, "
    "like the following: {{'Person': ['entity1', ...], 'Organization': ['entity5', ...], ...}}.\n"
    "The order of the keys of the dictionary must strictly be the following : Person, Organization, Location, Miscellaneous.\n"
    "DO NOT justify the output. DO NOT return JSON or Python code.\n"
    "Extract entities from the following sentence: {sentence}", #Prompt with labels in the beginning

    "You have to perform a NER task"
    "Return a dictionary (not in JSON or Python code) with the label as key and a list of entities associated with that label as value, "
    "like the following: {{'Person': ['entity1', ...], 'Organization': ['entity5', ...], ...}}.\n"
    "Extract entities from the following sentence: {sentence}.\n"
    "The entity label set is: Person, Organization, Location, Miscellaneous.\n"
    "The order of the keys of the dictionary must strictly be the following : Person, Organization, Location, Miscellaneous.\n" 
    "DO NOT justify the output. DO NOT return JSON or Python code.\n",#Prompts with labels in the end


 
    "Hi! Could you please help me identify named entities in the sentence below?\n"
    "We're interested in the following categories: Person, Organization, Location, Miscellaneous.\n"
    "Return a dictionary(not in JSON or python code) with the label as a key and a list of entities associated with that label as value. For example: {{'Person': ['entity1', ...], 'Organization': ['entity5',...], ...}}.\n"
    "The order of the keys of the dictionary must strictly be the following:  Person, Organization, Location, Miscellaneous.\n"
    "DO NOT justify the output and DO NOT return JSON or Python code.\n"
    "Here is the sentence: {sentence}. Thank you! \n" #Prompt with a conversational tone

] 

In [5]:
def Vanilla():

    dataset = load_dataset("conll2003", trust_remote_code=True) #Loader of dataset
    dev_samples = 50
    test_samples = 15
    dev_sentences = dataset["validation"][:dev_samples]  #Take "dev_samples" sentences from the dataset, in order to choose the best prompt
    test_sentences = dataset["test"][:test_samples]
    label_transformation = {0: 0, 1: "Person", 2: "Person", 3: "Organization", 4: "Organization", 5: "Location", 6: "Location", 7: "Miscellaneous", 8: "Miscellaenous"} #Dictionary used to transform ground truth
    

    
    label_type = ['Person', 'Organization', 'Location', 'Miscellaneous'] #Labels of dataset
    label_number = len(label_type)  #Number of labels
    prompt_number= len(vanilla_prompts)

    resulting_labels_tool = [[[] for _ in range(prompt_number)] for _ in range(dev_samples)]

    for i in range (dev_samples):   #For each "training" sample

        tokens = dev_sentences["tokens"][i] #Get tokens of current sample
         
        sentence = " ".join(tokens) # sentence given to the prompt with speech of tags for each token
        
    
        
        for j in range(prompt_number):
            prompt = vanilla_prompts[j].format(sentence = sentence)
            print (f"[INFO] Sentence {i+1}/{dev_samples} | Prompt {j+1}/{prompt_number}") #For each possible prompt type
            response = model.generate_content(prompt)
            raw_response = response.text.strip()

            cleaned_response = re.sub(r"```(?:json)?\s*","", raw_response)#Response filtered in order to get the output format we requested
            cleaned_response = re.sub(r"\s*```","", cleaned_response)

            try:
                parsed_response = ast.literal_eval(cleaned_response.strip())
            except (SyntaxError, ValueError):
                print("[WARNING] Could not parse response:", cleaned_response.strip())
                parsed_response = {}

    

            for k in range(label_number): #For each possible entity label

                label= label_type[k]
                entities= parsed_response.get(label,[])# if current label sought not present in the dictionary return empty list, if not return list with entities

                resulting_labels_tool[i][j].append({
                "label": label_type[k],
                "response": str(entities)
                })

    print(resulting_labels_tool)

    


                


    true_p = [0] * prompt_number
    false_p = [0] * prompt_number
    false_n = [0] * prompt_number


    for i in range (dev_samples):
        ground_truth = dev_sentences["ner_tags"][i] #Get labels of current sample sentence
        tokens = dev_sentences["tokens"][i] #Get tokens of current sample
        
        for j in range(len(ground_truth)):
            ground_truth[j] = label_transformation[ground_truth[j]] #Transform ground truth in a more suitable way in order to evaluate metrics


        for j in range(prompt_number):
            prompt_response = resulting_labels_tool[i][j] #Get response of prompt j when given sentence i
            results_tool = confusion_matrix_calc( prompt_response, ground_truth, tokens, label_type) #Get tp, fp and fn
            true_p[j] = true_p[j] + results_tool[0]    
            false_p[j] = false_p[j] + results_tool[1]
            false_n[j] = false_n[j] + results_tool[2]




    recall_tool= [0] * prompt_number
    precision_tool= [0] * prompt_number
    F1_score_tool = [0] * prompt_number


    for i in range(prompt_number):  #For each prompt compute metrics
        recall_tool[i] = true_p[i] / (true_p[i] + false_n[i])
        precision_tool[i] = true_p[i] / (true_p[i] + false_p[i])
        F1_score_tool[i] = ( 2 * precision_tool[i] * recall_tool[i] ) / ( recall_tool[i] + precision_tool[i] )


    print(f"Recall = {recall_tool}\n")
    print(f"Precision = {precision_tool}\n")
    print(f"F1-Score = {F1_score_tool}\n")

    best_prompt_index_tool = F1_score_tool.index(max(F1_score_tool))
    print(best_prompt_index_tool)




    #test best prompt tool augmentation 

        #TEST PART
    best_prompt_tool_resulting_labels =  [[] for _ in range(test_samples)]

    for i in range (test_samples):   #For each "training" sample

        tokens = test_sentences["tokens"][i] #Get tokens of current sample
        sentence = " ".join(tokens) # sentence given to the prompt with 
        
        prompt = vanilla_prompts[best_prompt_index_tool].format(sentence = sentence)
        response = model.generate_content(prompt)

        raw_response = response.text.strip()

        cleaned_response = re.sub(r"```(?:json)?\s*","", raw_response)
        cleaned_response = re.sub(r"\s*```","", cleaned_response)

        try:
                parsed_response = ast.literal_eval(cleaned_response.strip())

        except (SyntaxError, ValueError):
                
                print("[WARNING] Could not parse response:", cleaned_response.strip())
                parsed_response = {}
        
        print(f"[INFO] Sentence {i+1}/{test_samples} | Best Prompt ")

        for k in range(label_number): #For each possible entity label
            label= label_type[k]
            entities= parsed_response.get(label,[])
            
            best_prompt_tool_resulting_labels[i].append({
            "label": label_type[k],
            "response": str(entities)
            })



    final_true_p = 0 
    final_false_p = 0 
    final_false_n = 0 

    for i in range (test_samples):
        ground_truth = test_sentences["ner_tags"][i] #Get labels of current sample sentence
        tokens = test_sentences["tokens"][i] #Get tokens of current sample
        
        for j in range(len(ground_truth)):
            ground_truth[j] = label_transformation[ground_truth[j]] #Transform ground truth in a more suitable way in order to evaluate metrics

        prompt_response = best_prompt_tool_resulting_labels[i] #Get response of prompt j when given sentence i
        results = confusion_matrix_calc( prompt_response, ground_truth, tokens, label_type) #Get tp, fp and fn
        final_true_p+= results[0]    
        final_false_p += results[1]
        final_false_n += results[2]

    final_recall_tool = final_true_p / (final_true_p + final_false_n)
    final_precision_tool = final_true_p / (final_true_p + final_false_p)
    final_F1_score_tool = ( 2 * final_precision_tool * final_recall_tool ) / ( final_recall_tool + final_precision_tool )


    print("\n=== FINAL RESULTS ===")
    print(f"Best prompt: {vanilla_prompts[best_prompt_index_tool]}")
    print(f"Precision: {final_precision_tool:.4f}")
    print(f"Recall:    {final_recall_tool:.4f}")
    print(f"F1-score:  {final_F1_score_tool:.4f}")


                


Vanilla()









[INFO] Sentence 1/50 | Prompt 1/3
[INFO] Sentence 1/50 | Prompt 2/3
[INFO] Sentence 1/50 | Prompt 3/3
[INFO] Sentence 2/50 | Prompt 1/3
[INFO] Sentence 2/50 | Prompt 2/3
[INFO] Sentence 2/50 | Prompt 3/3
[INFO] Sentence 3/50 | Prompt 1/3
[INFO] Sentence 3/50 | Prompt 2/3
[INFO] Sentence 3/50 | Prompt 3/3
[INFO] Sentence 4/50 | Prompt 1/3
[INFO] Sentence 4/50 | Prompt 2/3
[INFO] Sentence 4/50 | Prompt 3/3
[INFO] Sentence 5/50 | Prompt 1/3
[INFO] Sentence 5/50 | Prompt 2/3
[INFO] Sentence 5/50 | Prompt 3/3
[INFO] Sentence 6/50 | Prompt 1/3
[INFO] Sentence 6/50 | Prompt 2/3
[INFO] Sentence 6/50 | Prompt 3/3
[INFO] Sentence 7/50 | Prompt 1/3
[INFO] Sentence 7/50 | Prompt 2/3
[INFO] Sentence 7/50 | Prompt 3/3
[INFO] Sentence 8/50 | Prompt 1/3
[INFO] Sentence 8/50 | Prompt 2/3
[INFO] Sentence 8/50 | Prompt 3/3
[INFO] Sentence 9/50 | Prompt 1/3
[INFO] Sentence 9/50 | Prompt 2/3
[INFO] Sentence 9/50 | Prompt 3/3
[INFO] Sentence 10/50 | Prompt 1/3
[INFO] Sentence 10/50 | Prompt 2/3
[INFO] Sente

### 2-Syntactic Prompting
One of the methods that we decided to implement from the paper 'Empirical Study of Zero-Shot NER with ChatGPT' by Xie et al. 2023, was the **Syntactic Prompting**: prompts do not give any syntactic structure of the sentence from which entities should be extracted, instead we give to the chatBot some **syntactic reasoning hints**.<br>
We decided to test basically 4 kind of prompts, each one having two different variations, obtaining thus eight different prompts. The four kinds are the following: <br>
- Prompt in which we don't ask the chatBot to find any underlying syntactic in the sentence before extracting named entities
- Prompt in which we ask the chatBot to perform PoS tagging in the sentence before extracting named entities
- Prompt in which we ask the chatBot to create a dependency tree before extracting named entities
- Prompt in which we ask the chatBot to create a constituency tree before extracting named entities


For each one of the aforementioned prompts we created a variant in which we tell the chaBot to primarily focus on words beginning with capital letters, since they have higher probability of being named entities. Each prompt is given to the chatBot several times for the same sentence: this is done because we extract named entities of a label one at a time. In this case of study we therefore give to a chatBot the same prompt four times in order to analyze a given sentence. The format of the response of the chatBot is the same as the one described for the Vanilla method. <br>

The prompts are the following:

In [7]:
#List which contains the skeleton prompt
skeleton_syntactic_promp = [
    "You have to do an NER task. The label set is: Person, Organization, Location, Misc.\n"
    "Given the following sentence, extract the named entities and assign each to one of the categories.\n "
    "Sentence: {sentence}\n"
]

#List of differente prompts to be tested
variant_syntactic_prompts = [
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "Focus primarily on words that begin with a capital letter.\n" 
    "What are the named entities labeled as {label}?",

    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "What are the named entities labeled as {label}?",

    "Before starting, do Part-of-Speech tagging and then using it to correctly recognise entities\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "Focus primarily on words that begin with a capital letter.\n"
    "What are the named entities labeled as {label}?",

    "Before starting, do Part-of-Speech tagging and then using it to correctly recognise entities\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "What are the named entities labeled as {label}?",

    "Before starting, create a dependency tree and then use it to correctly recognise entities\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "Focus primarily on words that begin with a capital letter.\n"
    "What are the named entities labeled as {label}?",

    "Before starting, create a dependency tree and then use it to correctly recognise entities\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "What are the named entities labeled as {label}?",

    "Before starting, create a constituency tree and then use it to correctly recognise entities\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "Focus primarily on words that begin with a capital letter.\n"
    "What are the named entities labeled as {label}?",

    "Before starting, create a constituency tree and then use it to correctly recognise entities\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "What are the named entities labeled as {label}?",

]

#### 2.1-Helper Functions
In order to create a prompt we defined a simple method **prompt_builder** which given in input one of the prompts of the variant list, a sentence and the label being processed, builds a prompt correctly formatted and ready to be given to the chatBot.




In [9]:
#Function that builds the complete prompt, given the second part, the sentence and the current label being processed
def prompt_builder ( second_part_prompt, sentence, label):

    prompt = skeleton_syntactic_promp[0] + second_part_prompt
    return prompt.format(sentence=sentence, label=label)

#### 2.2-Response retrieving
In the main part of the method, when we use our prompts on the development set, we extract a set of samples from the dataset. This number is stored in **dev_samples** and in our case it is set to 50. We then iterate on each of them, extracting the tokens and obtaining the original sentence. We then select one prompt from the list aforementioned and iterate on the number of labels that can be possibly associated to a named entity. We give the prompt, which is correctly formatted in order to contain the sample sentence and the label being processed, to the chatBot, and store the response, defined as a list of entity in **resulting_labels**, which is a 2-dimensional list with one entry corresponding to the index of the sentence and the second corresponding to a prompt given to the chatBot. Each entry is a list that will be filled with dictionaries, one for each label and contanining entities associated with the label of the dictionary.<br> 
Also, the dictionary **label_transformation** is used to transform the ground-truth NER labels of a sample in suitable values in order to analyze the results obtained.
<br> 

#### 2.3-Metrics Evaluation
Once we obtained the responses and stored them, we begin to compute the metrics used the aforementioned helper functions. We create a list, one for each element of the confusion matrix needed, where each list has **prompt_number** elements, thus element with index i of one of these lists contains metrics regarding the i-th prompt. <br>

We iterate through all developtment sentences and for each one we store two lists: the tokens and the ground-truth of the given sentence. We also transform the ground-truth using the **label_transformation** dictionary before computing the metrics. Then, we iterate through all prompts, invoke the function **confusion_matrix_cal** and store the results obtained in the lists defined before. <br>

In the last steps we simply compute **Precision**, **Recall** and **F1-score** for each prompt and in the end we obtain the index of the prompt we the highest F1-score.

#### 2.4-Test Part
In the last part we basically do the same steps of before but instead of using a set of prompt we use the best prompt retrieved before and test it on a new set of sentences, taken by the test portion of the dataset. <br>
In the end we compute the usual metrics and print them.

In [11]:



def Syntactic_Prompting():

    dataset = load_dataset("conll2003", trust_remote_code=True) #Loader of dataset
    dev_samples = 50
    test_samples = 15
    dev_sentences = dataset["validation"][:dev_samples]  #Take "dev_samples" sentences from the dataset, in order to choose the best prompt
    test_sentences = dataset["test"][:test_samples] #Same but for test set
    label_transformation = {0: 0, 1: "Person", 2: "Person", 3: "Organization", 4: "Organization", 5: "Location", 6: "Location", 7: "Miscellaneous", 8: "Miscellaenous"} #Dictionary used to transform ground truth


    #BEGIN OF 2.2
    label_type = ['Person', 'Organization', 'Location', 'Miscellaneous'] #Labels of dataset
    label_number = len(label_type)  #Number of labels
    prompt_number = len(variant_syntactic_prompts)  #Number of prompts to try
    resulting_labels = [[[] for _ in range(prompt_number)] for _ in range(dev_samples)] 


    for i in range (dev_samples):   #For each "training" sample

        tokens = dev_sentences["tokens"][i] #Get tokens of current sample
        sentence = " ".join(tokens) #Create sentence
        
        for j in range(prompt_number):  #For each possible prompt type

            for k in range(label_number): #For each possible entity label
                prompt = prompt_builder(variant_syntactic_prompts[j], sentence, label_type[k])  #Create prompt with sentence and label
                print(f"[INFO] Sentence {i+1}/{dev_samples} | Prompt {j+1}/{prompt_number} | Label: {label_type[k]}") #LOG
                response = model.generate_content(prompt) #Generate response
                resulting_labels[i][j].append({
                "label": label_type[k],
                "response": response.text.strip()
                })
    #END OF 2.2   





    #BEGIN OF 2.3
    true_positive = [0] * prompt_number
    false_positive = [0] * prompt_number
    false_negative = [0] * prompt_number

    for i in range (dev_samples):
        ground_truth = dev_sentences["ner_tags"][i] #Get labels of current sample sentence
        tokens = dev_sentences["tokens"][i] #Get tokens of current sample
        
        for j in range(len(ground_truth)):
            ground_truth[j] = label_transformation[ground_truth[j]] #Transform ground truth in a more suitable way in order to evaluate metrics


        for j in range(prompt_number):
            prompt_response = resulting_labels[i][j] #Get response of prompt j when given sentence i
            results = confusion_matrix_calc( prompt_response, ground_truth, tokens, label_type) #Get tp, fp and fn
            true_positive[j] = true_positive[j] + results[0]    
            false_positive[j] = false_positive[j] + results[1]
            false_negative[j] = false_negative[j] + results[2]




    recall = [0] * prompt_number
    precision = [0] * prompt_number
    F1_score = [0] * prompt_number

    for i in range(prompt_number):  #For each prompt compute metrics
        recall[i] = true_positive[i] / (true_positive[i] + false_negative[i])
        precision[i] = true_positive[i] / (true_positive[i] + false_positive[i])
        F1_score[i] = ( 2 * precision[i] * recall[i] ) / ( recall[i] + precision[i] )


    print(f"Recall = {recall}\n")
    print(f"Precision = {precision}\n")
    print(f"F1-Score = {F1_score}\n")

    best_prompt_index = F1_score.index(max(F1_score))
    #END OF 2.3



    #START OF 2.4
    best_prompt_resulting_labels =  [[] for _ in range(test_samples)]

    for i in range (test_samples):   #For each "training" sample

        tokens = test_sentences["tokens"][i] #Get tokens of current sample
        sentence = " ".join(tokens) #Create sentence
        

        for k in range(label_number): #For each possible entity label
            prompt = prompt_builder(variant_syntactic_prompts[best_prompt_index], sentence, label_type[k])  #Create prompt with sentence and label
            print(f"[INFO] Sentence {i+1}/{test_samples} | Best Prompt | Label: {label_type[k]}") #LOG
            response = model.generate_content(prompt) #Generate response
            best_prompt_resulting_labels[i].append({
            "label": label_type[k],
            "response": response.text.strip()
            })



    final_true_positive = 0 
    final_false_positive = 0 
    final_false_negative = 0 

    for i in range (test_samples):
        ground_truth = test_sentences["ner_tags"][i] #Get labels of current sample sentence
        tokens = test_sentences["tokens"][i] #Get tokens of current sample
        
        for j in range(len(ground_truth)):
            ground_truth[j] = label_transformation[ground_truth[j]] #Transform ground truth in a more suitable way in order to evaluate metrics

        prompt_response = best_prompt_resulting_labels[i] #Get response of prompt j when given sentence i
        results = confusion_matrix_calc( prompt_response, ground_truth, tokens, label_type) #Get tp, fp and fn
        final_true_positive += results[0]    
        final_false_positive += results[1]
        final_false_negative += results[2]

    final_recall_synt = final_true_positive / (final_true_positive + final_false_negative)
    final_precision_synt = final_true_positive / (final_true_positive + final_false_positive)
    final_F1_score_synt = ( 2 * final_precision_synt * final_recall_synt ) / ( final_recall_synt + final_precision_synt )


    print("\n=== FINAL RESULTS ===")
    print(f"Best promp: {variant_syntactic_prompts[best_prompt_index]}")
    print(f"Precision: {final_precision_synt:.4f}")
    print(f"Recall:    {final_recall_synt:.4f}")
    print(f"F1-score:  {final_F1_score_synt:.4f}")

    #END OF 2.4

Syntactic_Prompting()

[INFO] Sentence 1/50 | Prompt 1/8 | Label: Person
[INFO] Sentence 1/50 | Prompt 1/8 | Label: Organization
[INFO] Sentence 1/50 | Prompt 1/8 | Label: Location
[INFO] Sentence 1/50 | Prompt 1/8 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 2/8 | Label: Person
[INFO] Sentence 1/50 | Prompt 2/8 | Label: Organization
[INFO] Sentence 1/50 | Prompt 2/8 | Label: Location
[INFO] Sentence 1/50 | Prompt 2/8 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 3/8 | Label: Person
[INFO] Sentence 1/50 | Prompt 3/8 | Label: Organization
[INFO] Sentence 1/50 | Prompt 3/8 | Label: Location
[INFO] Sentence 1/50 | Prompt 3/8 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 4/8 | Label: Person
[INFO] Sentence 1/50 | Prompt 4/8 | Label: Organization
[INFO] Sentence 1/50 | Prompt 4/8 | Label: Location
[INFO] Sentence 1/50 | Prompt 4/8 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 5/8 | Label: Person
[INFO] Sentence 1/50 | Prompt 5/8 | Label: Organization
[INFO] Sentence 1/50 | Prompt 5/8 

### 3-Tool Augmentation


The second method that we decided to implement is  **tool augmentation**. Along with the instructions, in this method we provide to the chatbot a sentence in which each token is followed by its part of speech tag. The idea is to enrich the input sentence with additional linguistic features, in this case **part of speech (POS) tags** to provide the language model with syntactic cues that may help disambiguate and identify entities more accurately.

Some prompts focus on basic instructions, others leverage capitalization or POS tags for better accuracy. One includes an example to demonstrate the expected output format. All the prompts emphasize strict output formatting, just a list of entities, without code or explanations. As in syntactic prompting each prompt is given to the chatbot exactly four times because we focus on retrieving entities one label per time.

In [12]:
#List of prompts used to get a response from the chatbot 

prompts = [
 "You have to do a NER task. The label set is : Person, Organization, Location, Miscellaneous.\n"
 "Given the following sentence, extract the named entities and assign each to one of the categories .\n"
 "Sentence . {sentence}\n"
 "Retrieve the information just returning a list, without any comment or function and strictly adhering to the example output format, of the kind: ['Entity_1',..., 'entity_n']\n"
 "What are the named entities labeled as {label}?",


 "You have to do a NER task. The label set is : Person, Organization, Location, Miscellaneous.\n"
 "Given the following sentence, extract the named entities and assign each to one of the categories .\n"
 "Sentence . {sentence}\n"
 "Retrieve the information just returning a list, without any comment or function and strictly adhering to the example output format, of the kind: ['Entity_1',..., 'entity_n']\n"
 "Focus primarily in on words that begin with a capital letter\n"
 "What are the named entities labeled as {label}?",


 "You have to do a NER task. The label set is : Person, Organization, Location, Miscellaneous.\n"
 "Given the following sentence, extract the named entities and assign each to one of the categories .\n"
 "Sentence . {sentence}\n"
 "Retrieve the information just returning a list, without any comment or function and strictly adhering to the example output format, of the kind: ['Entity_1',..., 'entity_n']\n"
 "I'll give you an example on how it should work: Barack/NNP Obama/NNP visited/VBD Kenya/NNP./.\n"
 " You should return (if the label is Person) : ['Barack', 'Obama']\n"
 "If the label is Location you returns ['Kenya']\n"
 "What are the named entities labeled as {label}", 


 "You have to do a NER task. The label set is : Person, Organization, Location, Miscellaneous.\n"
 "Given the following sentence, extract the named entities and assign each to one of the categories .\n"
 "Sentence . {sentence}\n"
 "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n",
 

 "You have to do a NER task. The label set is : Person, Organization, Location, Miscellaneous.\n"
 "Given the following sentence, extract the named entities and assign each to one of the categories .\n"
 "Sentence . {sentence}\n"
 "Using POS tags, exclude coordinating conj (CCONJ) and determiners (DT). Extract entities with label: {label} from:\n"
 "{sentence}\n"
 "List them as : ['Entity_1', ...], DO NOT justify the output, DO NOT give json and DO NOT give python code",


 "You are a named entity recognizer.\n"
 "Each word in the sentence is tagged with its POS.Use the clues to detect entities. The following entities are sought: Person,Organization, Location, Miscellaneous.\n"
 "Sentence. {sentence}\n"
 "Search for the entities in the text which are labeled as {label} and return them in a list without any comment  or function and strictly adhering to the example output format ['Entity_1',..., 'entity_n']. \n "
 "DO NOT justify the output,  DO NOT give json and DO NOT give python code."
]

#### 3.1-Response retrieving


To retrieve the response of the chatbot we use the same steps exploited in syntactic prompting. The major difference can be seen in the construction of the sentence for each sample in the development set. In fact we introduce a dictionary **pos_id2tag** which transforms the id of each Pos tag associated with a token into a literal version understandable by the chatbot. We could have used a library to obtain the sentence with the PoS tags ready, but to provide a more accurate formulation we have decided to use the tags given in the dataset.
<br> 

#### 3.2 & 3.3-Metrics Evaluation and test part

These two steps are identical to the previous ones seen in syntactic prompting. Eventually we will get the best prompt and by using a test set different from the one used in the development phase we will return the metrics of tool augmentation method.

In [13]:
def Tool_Augmentation():

    dataset = load_dataset("conll2003", trust_remote_code=True) #Loader of dataset
    dev_samples = 50
    test_samples = 15
    dev_sentences = dataset["validation"][:dev_samples]  #Take "dev_samples" sentences from the dataset, in order to choose the best prompt
    test_sentences = dataset["test"][:test_samples]

    label_transformation = {0: 0, 1: "Person", 2: "Person", 3: "Organization", 4: "Organization", 5: "Location", 6: "Location", 7: "Miscellaneous", 8: "Miscellaenous"} #Dictionary used to transform ground truth
    
    #Dictionary used to retrieve for each token its PoS tag 
    pos_id2tag = {0: "''",1: "#",2: "$",3: "''",4: "(",5: ")",6: "," , 7: ".",8: ":",9 :"CC",10: "CD",11: "DT",12: "EX",13: "FW",14: "IN",15: "JJ",16: "JJR",17: "JJS",18: "LS",
        19: "MD", 20: "NN", 21: "NNP", 22: "NNPS", 23: "NNS", 24: "PDT", 25: "POS", 26: "PRP", 27: "PRP$", 28: "RB", 29: "RBR", 30: "RBS", 31: "RP", 32: "SYM", 33: "TO",34: "UH",
        35: "VB",36: "VBD",37: "VBG",38: "VBN",39: "VBP",40: "VBZ",41: "WDT",42: "WP",43: "WP$",44: "WRB",45: "``",46: "-NONE-"}

    

    #BEGINNING OF 3.1
    label_type = ['Person', 'Organization', 'Location', 'Miscellaneous'] #Labels of dataset
    label_number = len(label_type)  #Number of labels
    prompt_number_tool= len(prompts)
    resulting_labels_tool = [[[] for _ in range(prompt_number_tool)] for _ in range(dev_samples)]


    for i in range (dev_samples):   #For each "training" sample

        tokens = dev_sentences["tokens"][i] #Get tokens of current sample
         #Create sentence
        pos_tags = dev_sentences["pos_tags"][i]
        for j in range(len(pos_tags)):
            pos_tags[j] = pos_id2tag[pos_tags[j]]#Transform the id tags of each token into a literal version understandable by the chatbot


        sentence = " ".join(f"{t}/{p}" for t, p in zip(tokens, pos_tags)) # Sentence given to the prompt with speech of tags for each token
        
    
        
        for j in range(prompt_number_tool):  #For each possible prompt type

            for k in range(label_number): #For each possible entity label
                prompt = prompts[j].format( sentence = sentence, label = label_type[k] )  #Create prompt with sentence and label
                print(f"[INFO] Sentence {i+1}/{dev_samples} | Prompt {j+1}/{prompt_number_tool} | Label: {label_type[k]}") 
                response = model.generate_content(prompt) #Generate response
                resulting_labels_tool[i][j].append({
                "label": label_type[k],
                "response": response.text.strip()
                })
        
    #END OF 3.1


    #BEGINNING OF 3.2
    true_p = [0] * prompt_number_tool
    false_p = [0] * prompt_number_tool
    false_n = [0] * prompt_number_tool



    for i in range (dev_samples):
        ground_truth = dev_sentences["ner_tags"][i] #Get labels of current sample sentence
        tokens = dev_sentences["tokens"][i] #Get tokens of current sample
        
        for j in range(len(ground_truth)):
            ground_truth[j] = label_transformation[ground_truth[j]] #Transform ground truth in a more suitable way in order to evaluate metrics


        for j in range(prompt_number_tool):
            prompt_response = resulting_labels_tool[i][j] #Get response of prompt j when given sentence i
            results_tool = confusion_matrix_calc( prompt_response, ground_truth, tokens, label_type) #Get tp, fp and fn
            true_p[j] = true_p[j] + results_tool[0]    
            false_p[j] = false_p[j] + results_tool[1]
            false_n[j] = false_n[j] + results_tool[2]




    recall_tool= [0] * prompt_number_tool
    precision_tool= [0] * prompt_number_tool
    F1_score_tool = [0] * prompt_number_tool


    for i in range(prompt_number_tool):  #For each prompt compute metrics
        recall_tool[i] = true_p[i] / (true_p[i] + false_n[i])
        precision_tool[i] = true_p[i] / (true_p[i] + false_p[i])
        F1_score_tool[i] = ( 2 * precision_tool[i] * recall_tool[i] ) / ( recall_tool[i] + precision_tool[i] )


    print(f"Recall = {recall_tool}\n")
    print(f"Precision = {precision_tool}\n")
    print(f"F1-Score = {F1_score_tool}\n")

    best_prompt_index_tool = F1_score_tool.index(max(F1_score_tool))

    #END OF 3.2



    #BEGINNING OF 3.3

    best_prompt_tool_resulting_labels =  [[] for _ in range(test_samples)]

    for i in range (test_samples):   #For each "training" sample

        tokens = test_sentences["tokens"][i] #Get tokens of current sample
        pos_tags = test_sentences["pos_tags"][i]

        for j in range(len(pos_tags)):
            pos_tags[j] = pos_id2tag[pos_tags[j]]

        sentence = " ".join(f"{t}/{p}" for t, p in zip(tokens, pos_tags)) # sentence given to the prompt with 
        

        for k in range(label_number): #For each possible entity label
            prompt =  prompts[best_prompt_index_tool].format(sentence = sentence , label = label_type[k] )  #Create prompt with sentence and label
            print(f"[INFO] Sentence {i+1}/{test_samples} | Best Prompt | Label: {label_type[k]}") #LOG
            response = model.generate_content(prompt) #Generate response
            best_prompt_tool_resulting_labels[i].append({
            "label": label_type[k],
            "response": response.text.strip()
            })



    final_true_p = 0 
    final_false_p = 0 
    final_false_n = 0 

    for i in range (test_samples):
        ground_truth = test_sentences["ner_tags"][i] #Get labels of current sample sentence
        tokens = test_sentences["tokens"][i] #Get tokens of current sample
        
        for j in range(len(ground_truth)):
            ground_truth[j] = label_transformation[ground_truth[j]] #Transform ground truth in a more suitable way in order to evaluate metrics

        prompt_response = best_prompt_tool_resulting_labels[i] #Get response of prompt j when given sentence i
        results = confusion_matrix_calc( prompt_response, ground_truth, tokens, label_type) #Get tp, fp and fn
        final_true_p+= results[0]    
        final_false_p += results[1]
        final_false_n += results[2]

    final_recall_tool = final_true_p / (final_true_p + final_false_n)
    final_precision_tool = final_true_p / (final_true_p + final_false_p)
    final_F1_score_tool = ( 2 * final_precision_tool * final_recall_tool ) / ( final_recall_tool + final_precision_tool )


    print("\n=== FINAL RESULTS ===")
    print(f"Best promp: {prompts[best_prompt_index_tool]}")
    print(f"Precision: {final_precision_tool:.4f}")
    print(f"Recall:    {final_recall_tool:.4f}")   
    print(f"F1-score:  {final_F1_score_tool:.4f}")
    #END OF 3.3



                


Tool_Augmentation()





[INFO] Sentence 1/50 | Prompt 1/6 | Label: Person
[INFO] Sentence 1/50 | Prompt 1/6 | Label: Organization
[INFO] Sentence 1/50 | Prompt 1/6 | Label: Location
[INFO] Sentence 1/50 | Prompt 1/6 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 2/6 | Label: Person
[INFO] Sentence 1/50 | Prompt 2/6 | Label: Organization
[INFO] Sentence 1/50 | Prompt 2/6 | Label: Location
[INFO] Sentence 1/50 | Prompt 2/6 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 3/6 | Label: Person
[INFO] Sentence 1/50 | Prompt 3/6 | Label: Organization
[INFO] Sentence 1/50 | Prompt 3/6 | Label: Location
[INFO] Sentence 1/50 | Prompt 3/6 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 4/6 | Label: Person
[INFO] Sentence 1/50 | Prompt 4/6 | Label: Organization
[INFO] Sentence 1/50 | Prompt 4/6 | Label: Location
[INFO] Sentence 1/50 | Prompt 4/6 | Label: Miscellaneous
[INFO] Sentence 1/50 | Prompt 5/6 | Label: Person
[INFO] Sentence 1/50 | Prompt 5/6 | Label: Organization
[INFO] Sentence 1/50 | Prompt 5/6 

### Analysis of the metrics

We evaluate vanilla method, tool augmentation and syntactic prompting considering 10 runs for each one of them. For each method we fixed the development set to be composed of 50 samples while the test set consists of 15 samples. We report the average F1 score and the best prompt.

**F1 score** is the harmonic mean of precision and recall. **Precision** indicates how many of the predicted positives are actually correct while **recall** indicates how many of the actual positives the model correctly identified.
<br>
<br>



* Vanilla : **F1 = 0.6141** and the best prompt is:

    "You have to perform a NER task"
    "Return a dictionary (not in JSON or Python code) with the label as key and a list of entities associated with that label as value, "
    "like the following: {{'Person': ['entity1', ...], 'Organization': ['entity5', ...], ...}}.\n"
    "Extract entities from the following sentence: {sentence}.\n"
    "The entity label set is: Person, Organization, Location, Miscellaneous.\n"
    "The order of the keys of the dictionary must strictly be the following : Person, Organization, Location, Miscellaneous.\n" 
    "DO NOT justify the output. DO NOT return JSON or Python code.\n"
<br>
<br>


* Syntactic prompting: **F1 = 0.6336** and the best prompt is:

    "You have to do an NER task. The label set is: Person, Organization, Location, Misc.\n"
    "Given the following sentence, extract the named entities and assign each to one of the categories.\n "
    "Sentence: {sentence}\n"
    "Before starting, create a constituency tree and then use it to correctly recognise entities\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output, DO NOT give JSON and DO NOT give Python Code.\n"
    "Focus primarily on words that begin with a capital letter.\n"
    "What are the named entities labeled as {label}?"
    <br>
    <br>




* Tool augmentation: **F1 = 0.6588** and the best prompt is:

    "You have to do a NER task. The label set is : Person, Organization, Location, Miscellaneous.\n"
    "Given the following sentence, extract the named entities and assign each to one of the categories .\n"
    "Sentence . {sentence}\n"
    "Return a list with entities strictly associated ONLY to label {label}, like the following: ['Entity_1',..., 'entity_n']. DO NOT justify the output,  DO NOT give JSON and DO NOT give Python Code.\n"

<br>
<br>

In our experiments, Tool Augmentation slightly outperformed both Syntactic Prompting and the Vanilla method, although the differences in F1-scores were not dramatic. Several reasons could have led to the obtained results. First of all the instruction-tuned LLM used may already infer part of speech tags or syntactic relations without being instructed to do so. That makes the prompts created for the methods we implemented marginally useful.<br>

A possible reason on why Syntactic Prompting underperforms with respect to Tool Augmentation may derive from the fact that in the syntactic prompts we ask the API to perform more than one task for each sentence. In fact before extracting the named entities we are looking for, the model has to build costituency trees, derive grammatical and syntactic relations and obtain part of speech tags. If these first tasks are not performed well they may lead to a propagation of the error also in our goal task leading thus to a degradation of the performance. 
In zero-shot setting the effectiveness of the methods can be heavily influenced by the specific prompt used to represent it. In all three methods we can visualize a common trait. The best result are obtained with the prompts in which tha label placement is towards the end of the instruction. This may be due to the fact that the LLM is more likely to retain the task-specfic instruction reducing the chance of drifting off task. Also, another detail is that by suggesting to focus on words beginning with a capital letter we get better results. <br>

It is interesting to notice that all the different methods have a high value of **Recall**, meaning that the majority of the tokens which are actually a Named Entity are recognized. The primary problem with of all the prompts is that they, in most cases, label tokens which are not actually Named Entities, and this translates in a low **Precision** value.

We can conclude that in zero-shot settings the prompt design is a crucial component of NER performance. One should carefully consider the complexity and structure of the prompt, aiming to get balance between providing sufficient specific guidance and avoiding instruction overload that may distract the model from its primary task.









