# Named Entity Recognition with LLMs

In this notebook, we show how we applied NER to a set of literary-historical travelogues using [GPT4All](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python). 
This tool lets you experiment with generative Large Language Models on your local computer. It is privacy-friendly, because your data is not sent through an API, and results are reproducible. 
Be aware that running this code on a local computer may take a long time. Consider using a supercomputer or an external server!

Using the Jupyter Notebooks and data provided on our GitHub, you can reproduce our results by re-running the code below, or (which is likely why you're here ;)) adapt the prompts to your own data!

This code was developed to experiment with and evaluate different models and prompts across four languages: English, French, Dutch and German. 
The prompts became incrementally more complex:

1. **Base prompt**: the model (in this case mistral, hermes or llama) is given a simple task: extract the fauna and flora entities from the text and return them as JSON. After that, we feed the output to mistral to validate the JSON. We chose mistral for this step because our tests showed it was the most reliable of the three models at producing valid JSON.
2. **Persona**: we add a persona to the model ("You are a NER system trained on historical text material.").
3. **Annotation guide**: we add a more detailed explanation of the entities we expect to be tagged. 
4. **Context**: we add more context (metadata) by including the title and author of the book the sentence came from.
5. **Few-shot examples**: we add a couple of labeled examples to show the model what we expect.

In this notebook, you will find the code for the last prompt, as it is the most elaborate one and contains all the parts of prompts 1-4.
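
To make the five tiers concrete, here is a minimal sketch of how such prompts could be assembled incrementally. The helper `build_prompt`, its `level` parameter and the wording of the base task are hypothetical and only illustrate the idea. The persona and annotation guide strings are the ones used later in this notebook.

```python
# Hypothetical helper: names and exact wording are illustrative, not the prompts used in our experiments.
PERSONA = "You are a named entity recognition system trained to recognize fauna and flora in historical texts."
GUIDE = (
    "FAUNA: common and scientific names of animals, taxa and animal species.\n"
    "FLORA: common and scientific names of plants, taxa and plant species."
)

def build_prompt(level, sentence, author=None, title=None, examples=()):
    """Assemble a prompt for tier 1 (base) up to tier 5 (few-shot)."""
    parts = []
    if level >= 2:  # tier 2 adds the persona
        parts.append(PERSONA)
    parts.append("Extract the fauna and flora entities from the sentence and return them as JSON.")
    if level >= 3:  # tier 3 adds the annotation guide
        parts.append(GUIDE)
    if level >= 4:  # tier 4 adds metadata about the source
        parts.append(f"The author of the text is {author}.\nThe text is titled {title}.")
    if level >= 5:  # tier 5 adds labeled examples
        for example_sentence, example_answer in examples:
            parts.append(f"Sentence: {example_sentence}\nAnswer: {example_answer}")
    parts.append(f"Sentence: <<<{sentence}>>>\nAnswer: ")
    return "\n\n".join(parts)

base_prompt = build_prompt(1, "Sparus Lin .")
full_prompt = build_prompt(
    5,
    "Sparus Lin .",
    author="Macklot, H.",
    title="Verslag van het land",
    examples=[("Un exemple.", '{"entities": {"fauna": [], "flora": []}}')],
)
print(full_prompt)
```

Each tier only appends to the previous one, which makes it easy to compare the effect of a single added component.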

## Load packages

In [1]:
import gpt4all
from gpt4all import GPT4All
import json
import pandas as pd
from sklearn.model_selection import train_test_split

import string
import re
import ast
from nervaluate import Evaluator
import nervaluate
import numpy as np

In [2]:
mistral_model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")
hermes_model = GPT4All("nous-hermes-llama2-13b.Q4_0.gguf")
llama_model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

## Prepare data

### Load data, split in dev, test and few-shot sample sets

In this section, we will split the data into development (dev), test and few-shot sample sets.
We stratify the data on the century column so that each set roughly preserves the distribution of the centuries in the dataset (18, 19 and 20).
For this experiment, we extract 100 samples for the dev set, 25 for a separate test set, and 10 for the few-shot samples. 


**STEPS**
1. Load the data (LANG_fauna_flora_df_context_merged.csv), which can be found on our GitHub in the **Data** file.
2. Split into dev_1 and test set, stratified on century.
3. Split dev_1 into dev and few-shot samples.

In [168]:
df = pd.read_csv("/home/tess/generative_exp/Deliverables/fauna_flora_data/NL_fauna_flora_df_context_merged.csv")

In [169]:
#df.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [154]:
df.sample(5)

Unnamed: 0,source_file,text_aspect,_sentence_text,aspect_cat,ID,century,title,text_full,author,context
216,BHL_7_sample_Dutch_19.0.txt,"['middendorens', 'Melocactus', 'middendorens']",In zeer enkele gevallen komt er een tweede der...,"['FLORA', 'FLORA', 'FLORA']",BHL_7,19,Botanische excursie naar Nederlandsch West-Ind...,"mm pil ii pr lir,iv m 7/rV,! m Ü50 m :: m ""'.‘...","Suringar, W. F. R.",als ééne soort . De Candolle heeft ( Revue de ...
106,BHL_794_sample_Dutch_18.0.txt,"['vifch', 'vleefch', 'dieren']",Van vifch en het vleefch van wilde dieren make...,"['FAUNA', 'FAUNA', 'FAUNA']",BHL_794,18,Reis door Noord Amerika,■^ » -. ■i^^'Z /Üè <=> UNIVERSITY OF PITTSBUR...,"Kalm, Pehr,","antwoord was , dat zy die ver naar 't Noorden ..."
120,BHL_794_sample_Dutch_18.0.txt,"['garft', 'vee']","Men zait maar wei- nig garft , en dat nog alle...","['FLORA', 'FAUNA']",BHL_794,18,Reis door Noord Amerika,■^ » -. ■i^^'Z /Üè <=> UNIVERSITY OF PITTSBUR...,"Kalm, Pehr,",geploegd . Men zait gemeenlyk omtrent den 15 ....
295,BHL_957_sample_Dutch_19.0.txt,['Sparus Lin'],Sparus Lin .,['FAUNA'],BHL_957,19,"Verslag van het land, de bewoners en voortbren...","( i4a ) VERSLAG van het Land, de Bewoners en ...","Macklot, H.","i s s c h e n. de meeste Visschcn , welke wij ..."
320,BHL_957_sample_Dutch_19.0.txt,['Sideroxylon orichalciwnZ'],Sideroxylon orichalciwnZ .,['FLORA'],BHL_957,19,"Verslag van het land, de bewoners en voortbren...","( i4a ) VERSLAG van het Land, de Bewoners en ...","Macklot, H.",11 ( ? ) monogyna . Sagus mierocarpa .. Licual...


In [170]:
# sentences annotated per century
df["century"].value_counts()

century
19    261
18     99
20     18
Name: count, dtype: int64
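
To see what stratifying on century means for a small sample, the sketch below computes a proportional allocation of 25 test sentences over the three centuries, using the counts printed above. The function `proportional_allocation` is a hypothetical helper that uses largest-remainder rounding; scikit-learn uses its own allocation routine, so the exact numbers of a real split may differ slightly.

```python
from math import floor

def proportional_allocation(counts, sample_size):
    """Distribute sample_size over the groups in proportion to their counts (largest-remainder rounding)."""
    total = sum(counts.values())
    exact = {group: sample_size * n / total for group, n in counts.items()}
    allocation = {group: floor(share) for group, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # hand the remaining slots to the groups with the largest fractional parts
    by_remainder = sorted(exact, key=lambda group: exact[group] - allocation[group], reverse=True)
    for group in by_remainder[:leftover]:
        allocation[group] += 1
    return allocation

century_counts = {19: 261, 18: 99, 20: 18}  # taken from the value_counts output above
test_allocation = proportional_allocation(century_counts, 25)
print(test_allocation)
```

With only 18 sentences from the 20th century, a 25-sentence test set contains very few of them, which is worth keeping in mind when interpreting per-century results.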

In [171]:
df["ID"].unique()

array(['DBNL_100', 'DBNL_102', 'DBNL_123', 'DBNL_149', 'DBNL_152',
       'DBNL_170', 'DBNL_176', 'DBNL_189', 'DBNL_194', 'DBNL_199',
       'DBNL_241', 'DBNL_259', 'DBNL_27', 'DBNL_33', 'DBNL_34', 'DBNL_43',
       'DBNL_53', 'DBNL_87', 'BHL_61', 'BHL_794', 'BHL_7', 'BHL_957',
       'DBNL_10', 'DBNL_151'], dtype=object)

In [172]:
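# cast aspect_cat and text_aspect from strings to list objects
# ast.literal_eval only parses Python literals, so it is safer than eval on data read from a file

df["aspect_cat"] = df.aspect_cat.apply(ast.literal_eval)
df["text_aspect"] = df.text_aspect.apply(ast.literal_eval)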
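
The list columns are stored as plain strings in the CSV, so they have to be parsed back into Python lists. As a small illustration, the sketch below parses one such cell with `ast.literal_eval`, which accepts only Python literals and is therefore a safer choice than `eval`. The helper name `parse_list_cell` is hypothetical.

```python
import ast

def parse_list_cell(cell):
    """Parse a stringified list such as "['garft', 'vee']" into a real list."""
    value = ast.literal_eval(cell)  # raises ValueError or SyntaxError on anything that is not a literal
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value

raw_cell = "['garft', 'vee']"  # how a list cell looks after being read from the CSV
parsed_cell = parse_list_cell(raw_cell)
print(parsed_cell, type(parsed_cell))
```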

In [178]:
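# stratify by century
y = df[["century"]]

# 25 test sentences; the other 110 are split below into 100 dev and 10 few-shot samples
dev_1, test = train_test_split(df, test_size = 25, train_size = 110, stratify = y, shuffle = True, random_state = 42)

In [179]:
# hold out 10 few-shot samples, which leaves 100 sentences in the dev set
dev, few_shot_samples = train_test_split(dev_1, test_size = 10, random_state = 42)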

In [None]:
# save test file
test.to_csv("test_Dutch.csv", index = False)

# save few-shot examples
few_shot_samples.to_csv("fewshot_Dutch.csv")

# save development set
dev.to_csv("dev_French.csv")

In [3]:
# read in our dev set (the outputs shown below come from the French data)
dev = pd.read_csv("dev_French.csv")

# Prompting

In this section, we'll start prompting! From our few-shot samples, we extract two examples manually. We add samples for both the fauna and the flora categories. 🐱🍀



In [5]:
pd.get_option("display.max_colwidth")  # this is for your info only

pd.set_option("display.width", None)

In [6]:
few_shot_samples

Unnamed: 0.1,Unnamed: 0,source_file,text_aspect,_sentence_text,aspect_cat,ID,century,title,text_full,author,context
0,107,BHL_466_sample_French_20.0.txt,"['CERATORHINÉS', 'Rhinocéros ( Ceratorhinus']",SoUS-FAMlLLE DES CERATORHINÉS Rhinocéros ( Cer...,"['FAUNA', 'FAUNA']",BHL_466,20,Étude sommaire des mammifères fossiles des Fal...,"QBT •■A ?•""■■ G- Mri ?^â 9'û 9 5-L^s- l^arbar...","Mayet, Lucien,","P2-M3 de R , siniorrensis conservée au Muséum ..."
1,84,BHL_466_sample_French_20.0.txt,"['Brachyodus onoideus', 'coquilles marines']",La présence du Brachyodus onoideus^ dans le fa...,"['FAUNA', 'FAUNA']",BHL_466,20,Étude sommaire des mammifères fossiles des Fal...,"QBT •■A ?•""■■ G- Mri ?^â 9'û 9 5-L^s- l^arbar...","Mayet, Lucien,",pas à les aborder ici . Fio . i8 . — Dicroceru...
2,11,BHL_1025_sample_French_20.0.txt,"[""pétales d'oran- ger à fruit aigre""]",Je voudrais pouvoir envoyer en Europe de l'eau...,['FLORA'],BHL_1025,20,"Aimé Bonpland, médecin et naturaliste, explo...",AIMÉ BONPLAND MÉDECIN ET NATURALISTE EXPLORAT...,"Hamy, E. T.",fin de sep- tembre je serais de retour ici . A...
3,90,BHL_466_sample_French_20.0.txt,['faune'],Nous rapportons plus volontiers . cette pièce ...,['FAUNA'],BHL_466,20,Étude sommaire des mammifères fossiles des Fal...,"QBT •■A ?•""■■ G- Mri ?^â 9'û 9 5-L^s- l^arbar...","Mayet, Lucien,",".3 , 4 , 5 , 6 a , 6 h , y , pi . VII , fig . ..."
4,99,BHL_466_sample_French_20.0.txt,"['Brachypodinés', 'Acérathérinés']",Les Brachypodinés forment un groupe bien diffé...,"['FAUNA', 'FAUNA']",BHL_466,20,Étude sommaire des mammifères fossiles des Fal...,"QBT •■A ?•""■■ G- Mri ?^â 9'û 9 5-L^s- l^arbar...","Mayet, Lucien,",". Collection Lecointre , à Grillemont . Grande..."
5,6,BHL_1025_sample_French_20.0.txt,['végétaux'],Il est rare pour moi depuis des an- nées de tr...,['FLORA'],BHL_1025,20,"Aimé Bonpland, médecin et naturaliste, explo...",AIMÉ BONPLAND MÉDECIN ET NATURALISTE EXPLORAT...,"Hamy, E. T.",tout ce qui lui restait de forces « pour répon...
6,140,GB-141_sample_French_18.txt,['barbet-là'],Ce barbet-là est probablement celui qui a suiv...,['FAUNA'],GB_141,18,"Souvenirs d'un sexagénaire, Tome IV",﻿The Project Gutenberg EBook of Souvenirs d'un...,Antoine Vincent Arnault,J'espérais qu'après tout son malheur n'était p...
7,25,BHL_1025_sample_French_20.0.txt,"['Mosquito', 'Culex pipiens']",Sans doute que Cuvier croit que le Mosquito es...,"['FAUNA', 'FAUNA']",BHL_1025,20,"Aimé Bonpland, médecin et naturaliste, explo...",AIMÉ BONPLAND MÉDECIN ET NATURALISTE EXPLORAT...,"Hamy, E. T.",à d'autres sans tenir parole ; je n'agis pas a...
8,216,GB-267_sample_French_20.txt,['feuilles de palmiers'],"Il est d'usage pour les femmes , à ce moment-l...",['FLORA'],GB_267,20,L'Égypte d'hier et d'aujourd'hui,"﻿The Project Gutenberg EBook of L'égypte, by W...",Walter Tyndale,fruits sont éclairés d'une douce lumière d'un ...
9,21,BHL_2292_sample_French_18.xmi,['bananes'],Un régime contient ordinairement depuis trente...,['FLORA'],BHL_2292,18,Nouveau voyage aux isles de l'Amerique : conte...,NOUVEAU ^VOYAGE AUX ISLES DE L'AMERIQUE....,"Labat, Jean Baptiste,",couvre prelque tout de petits boutons d'un jau...


In [7]:
few_shot_samples.iloc[9]["_sentence_text"]

"Un régime contient ordinairement depuis trente jusqu'à cinquante bananes selon la bonté du terrain."

In [11]:
# example 1 we'll feed to the model. 
example_sent_1 = few_shot_samples.iloc[0]["_sentence_text"]
example_output_1 = {
    "entities": {
        "fauna": ["CERATORHINÉS", "Rhinocéros ( Ceratorhinus"],
        "flora": [],
        },
}

In [12]:
# example 2 we'll feed to the model. 

example_sent_2= few_shot_samples.iloc[9]["_sentence_text"]
example_output_2 = {
    "entities": {
        "fauna": [],
        "flora": ["bananes"],
        },
}

## Global template strings

Below, we add some extra information which the model will need to make the output more predictable and easier to parse.
First, we'll give the model a JSON-schema which we expect it to replicate. Then, we'll add the categories (annotation guide), personality (persona), and the question (task). 

In [13]:
# Global template strings

schema_entity = {
    "entities": {
        "fauna": ["string"],
        "flora": ["string"],
        },
}

#category names with small global introduction/definition
categories = """FAUNA: common and scientific names of animals, taxa and animal species. 
FLORA: common and scientific names of plants, taxa and plant species."""

personality = "You are a named entity recognition system trained to recognize fauna and flora in historical texts."

question = "Extract the relevant named entities from the sentence."
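
One detail worth knowing: when a Python dictionary is inserted into an f-string, it is rendered with single quotes, which is not valid JSON. The sketch below shows the difference between that rendering and `json.dumps`. Serializing the schema and examples with `json.dumps` is an optional variation, not what we did in our experiments, and the helper `is_valid_json` is hypothetical.

```python
import json

example_output = {"entities": {"fauna": ["CERATORHINÉS"], "flora": []}}

# What an f-string inserts: the Python repr, with single quotes.
as_python_repr = f"{example_output}"

# A JSON rendering of the same dictionary; ensure_ascii=False keeps accented characters readable.
as_json = json.dumps(example_output, ensure_ascii=False)

def is_valid_json(text):
    """Return True if text can be parsed by json.loads."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(as_python_repr, is_valid_json(as_python_repr))
print(as_json, is_valid_json(as_json))
```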

## LLM functions

Here we construct our LLM pipe! The function ***text_extractor_llm*** takes a text and (if all goes well) outputs the entities in JSON format.
It then feeds the resulting JSON object to our function ***json_to_lists***, which tries to parse the object and returns a tuple (categories, entity texts). 
***text_extractor_llm*** also returns the raw output of the first LLM call. This was a design choice, because we wanted to analyze the ramblings of the LLM. 
Remove it if this is not what you're interested in!

In [14]:
def json_to_lists(json_obj):
    """Transform the JSON output into two lists: aspect_cat_llm and aspect_text_llm."""
    cats = []
    output = []
    try:
        for category, result in json_obj["entities"].items():
            for text_result in result:
                cats.append(category.upper())
                output.append(text_result)

        return (cats, output)
    except (AttributeError, KeyError, TypeError):
        # the object does not follow the expected schema
        print(json_obj)
        return "Failed to parse"
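
As a quick illustration of what the parsing step produces, the sketch below runs the same flattening logic on a hand-written object. The function is restated here without error handling so that the block runs on its own, and the input object is made up for the example.

```python
def json_to_lists(json_obj):
    """Flatten {"entities": {category: [texts]}} into parallel lists of categories and texts."""
    cats, output = [], []
    for category, result in json_obj["entities"].items():
        for text_result in result:
            cats.append(category.upper())
            output.append(text_result)
    return (cats, output)

parsed = {"entities": {"fauna": ["Mosquito", "Culex pipiens"], "flora": ["bananes"]}}
cats, texts = json_to_lists(parsed)
print(cats)
print(texts)
```

The two lists are aligned by position, which matches the layout of the `aspect_cat` and `text_aspect` columns in the annotated data.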

In [15]:
# Apply the LLM pipe to all sentences
# The function takes a text, runs the whole pipeline on it and outputs a JSON response

def text_extractor_llm(txt, author, title, example_sent_1, example_output_1, example_sent_2, example_output_2, main_model = mistral_model):
    """This function takes a text and applies the following steps:
    1) PROMPT: extract the entities using a global template variable + model (global)
    2) PROMPT: extract the JSON string from the output
    3) PROMPT: validate the JSON string 
    4) JSON_LOADS: cast to JSON object
    5) apply parsing function
    """
    
    sentence = txt
    
    template = f"""{personality} Your task is to identify the named entities in a sentence. 
Named entities include {categories}. 
Structure the answer according to {schema_entity}. 
Only look at the sentence, do not add anything else.
The sentence is indicated by <<<>>>. 

The author of the text is {author}.
The text is titled {title}.

Here are examples to help you:
Sentence: {example_sent_1}
Answer: {example_output_1}

Sentence: {example_sent_2}
Answer: {example_output_2}

Question: {question}.
Sentence: <<<{sentence}>>>

Answer: """
    
    output = main_model.generate(template, max_tokens = 200, temp = 0.01)
    print(output) # comment this line out if you do not want to follow along with the output of our main model.
    
    #output = output.split("\n")
    
    #first_output = output[0]
    
    #first_output = "$" + first_output + "$"
    
    # wrap the output in $...$ so that it matches the delimiter announced in the prompt
    temp_json = f"""Extract the first JSON from the string. The string is indicated by $$.
Output: ${output}$.
JSON: """

    extract_json = mistral_model.generate(temp_json, max_tokens = 200, temp = 0.01)
    print(extract_json) # comment this line out if you do not want to double-check that the output was transformed to a JSON object.
    
    
    temp_validate = f"""Transform the string to valid JSON. 
    Do not hallucinate new sentences.
The string is indicated by <<<>>>.
String: <<<{extract_json}>>>.
Answer: """

    validate_json = mistral_model.generate(temp_validate, max_tokens= 200, temp = 0.01)
    print(validate_json) # comment this line out if you do not want to check that the returned JSON object was validated correctly.
    
    
    try:
        json_obj = json.loads(validate_json)
        return (json_to_lists(json_obj), output)
    except json.JSONDecodeError: # If casting to a JSON object fails, print a warning and return the placeholder JSON object below.
        json_obj =  """{
"entities": {
    "fauna": [
        "string"
    ]
}
}"""
        json_obj = json.loads(json_obj)
        print("PARSING FAILED. Original result:")
        print(validate_json)
        return (json_to_lists(json_obj), output)
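
The function above relies on two extra LLM calls to isolate and validate the JSON. As a possible alternative for the extraction step, the sketch below pulls the first JSON object out of a raw model response with the standard library only, using `json.JSONDecoder.raw_decode`. This is not the method used in our experiments. The helper name `first_json_object` is hypothetical, and it only handles valid JSON (double quotes), not Python-style dictionaries with single quotes.

```python
import json

def first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores whatever follows it
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # this "{" did not start a valid object, so try the next one
            start = text.find("{", start + 1)
    return None

raw_response = 'Sure! {"entities": {"fauna": ["Macrotherium sansaniense"], "flora": []}} Let me know if you need more.'
extracted = first_json_object(raw_response)
print(extracted)
```

A deterministic extraction like this could be tried first, with the LLM-based repair kept as a fallback for malformed output.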

## Apply the function to our Pandas Dataframe.

In [16]:
dev["results_LLM"] = dev.apply(lambda x: text_extractor_llm(x._sentence_text, x.author, x.title, example_sent_1, example_output_1, example_sent_2, example_output_2), axis = 1)


{'entities': {'fauna': ['dunes crevées', 'pierres énormes', 'maisons lézardées', 'gabares de pêche'], 'flora': ['palus', 'jardins maraîchers']}}

{"entities": {"fauna": ["dunes crevées", "pierres énormes", "maisons lézardées", "gabares de pêche"], "flora": ["palus", "jardins maraîchers"]}}

{
    "entities": {
        "fauna": [
            "dunes crevées",
            "pierres énormes",
            "maisons lézardées",
            "gabares de pêche"
        ],
        "flora": [
            "palus",
            "jardins maraîchers"
        ]
    }
}

{'entities': {'fauna': ['Macrotherium sansaniense'], 'flora': []}}

{"entities": {"fauna": ["Macrotherium sansaniense"], "flora": []}}

{
"entities": {
"fauna": [
"Macrotherium sansaniense"
],
"flora": []
}
}

{'entities': {'fauna': ['Rhinocéros ( Ceratorhinus)', 'cf. simorrensis'], 'flora': []}}

{"entities": {"fauna": ["Rhinoceros (Diceros bicornis)", "Crocodylus niloticus"], "flora": ["Euphorbia antiquorum"]}}

{
    "entities": {
   

In [19]:
dev.to_csv("/home/tess/generative_exp/Local_LLM/output_loops_LLM/dev_French_prompt_V_mistral.csv", index= False)

# Evaluate the results of the LLM

💡 Done! We have saved the results of our LLM NER extraction system to a CSV file.
If you are interested in how we evaluated the LLM results, go to the notebook ***Evaluate_LLM_output***.