# Named Entity Recognition for Ingredients Classification

Noah Meißner 18.05.2025

This Jupyter notebook analyses four different Named Entity Recognition (NER) approaches to see which one fits this use case best.

We use:
1. Hand labelling as ground truth
2. GPT-4o mini (ChatGPT)
3. Gemini (free version)
4. spaCy (training our own model)

### Imports

In [None]:
from train_ner import train_model, test_model
from request.request_gemini import request
from request.request_gpt import chat_with_openai
import pandas as pd
from DataLoader_Ingredients import DataLoader
from prompts import ingredients_extraction
import json
import time
import os
import ast
from data_structure.DataType import DataType
from data_structure.model_name import ModelName
from loader.load_ner import ner_loader, ner_safer
import re
from evaluation.kappa_comparison import evaluate_annotations
from request.recipe_api import request_api
from request.request_modern_bert import Bert
from data_structure.paths import RAW_SET, NER_EVAL



### Create Dataset
- The following cell loads the dataset and splits the unique ingredients into a train and a test set
- Train: 50000
- Test: 1000

In [2]:
df_whole = pd.read_csv(RAW_SET)
loader_obj = DataLoader(df_whole)
train, test = loader_obj.get_set()

## Analysis

We now analyse the different approaches, starting with Gemini. We split the test data into batches of 20 ingredients per prompt to reduce the number of requests. Gemini and OpenAI receive the same prompt, so that the comparison evaluates the models rather than the prompt.

In [3]:
def pack_twenty(data):
    return [data[i:i+20] for i in range(0, len(data), 20)]
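
As a quick illustration of the batching, here is a sketch with made-up ingredient strings (the sample data is hypothetical and not taken from the dataset): 45 items yield two full batches and a final batch of 5.

```python
def pack_twenty(data):
    # Slice the list into consecutive chunks of at most 20 items
    return [data[i:i + 20] for i in range(0, len(data), 20)]


# Hypothetical sample data, only for illustration
sample = [f"ingredient {n}" for n in range(45)]
batches = pack_twenty(sample)

print(len(batches))               # 3
print([len(b) for b in batches])  # [20, 20, 5]
```

Assuming `test` holds the 1000 test ingredients, this results in 50 requests per model.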

In [4]:
prompt = ingredients_extraction.prompt_ingredients

In [5]:
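def clean_model_output(text):
    # Grab everything between the first "[" and the last "]" of the reply
    match = re.search(r'\[(.*)\]', text, re.DOTALL)
    if not match:
        print("Parsing error: no list found in model output")
        return []
    cleaned_json = "[" + match.group(1).strip() + "]"
    try:
        # literal_eval safely parses Python literals (lists, dicts, strings, numbers)
        return ast.literal_eval(cleaned_json)
    except Exception as e:
        print("Parsing error:", e)
        return []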
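
One caveat worth keeping in mind: `ast.literal_eval` only accepts Python literals, so a reply that is valid JSON but contains `null`, `true` or `false` fails to parse. Below is a minimal sketch of a fallback, assuming the model may answer in either notation. `parse_list_reply` and the sample reply are hypothetical and not part of the notebook.

```python
import ast
import json
import re


def parse_list_reply(text):
    """Extract the outermost [...] from a reply and parse it (hypothetical helper)."""
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match is None:
        return []
    payload = match.group(0)
    # Try JSON first, then fall back to Python literal syntax
    for parser in (json.loads, ast.literal_eval):
        try:
            return parser(payload)
        except (ValueError, SyntaxError):
            continue
    return []


# Made-up reply with text around the list and a JSON-only token (null)
reply = 'Here is the result:\n[{"text": "2 cups flour", "type": null}]\nDone.'
print(parse_list_reply(reply))  # [{'text': '2 cups flour', 'type': None}]
```

Returning an empty list on failure keeps a later `res.extend(...)` call from raising a `TypeError`.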

### Gemini 

We are using Gemini 2.0 Flash, which Google offers with free API access.

In [6]:
if not os.path.exists(NER_EVAL / "Gemini.json"):
    # Annotate only once; later runs reuse the saved file
    print("Start Annotation process")
    packages = pack_twenty(test)
    res = []
    for p in packages:
        try:
            # Insert the current batch into the shared prompt template
            new_prompt = prompt.replace("$DATA$", str(p))
            res.extend(clean_model_output(request(new_prompt)))
        except Exception as e:
            # A failed batch is reported and skipped
            print("Model Error:", e)
    ner_safer(ModelName.Gemini, DataType.eval, res)

### ChatGPT

We are using GPT-4o mini as a comparison model; it offers a good trade-off between accuracy and cost.

In [7]:
if not os.path.exists(NER_EVAL / "OpenAI.json"):
    print("Start Annotation process")
    packages = pack_twenty(test)
    res = []
    for p in packages:
        try:
            new_prompt = prompt.replace("$DATA$", str(p))
            res.extend(clean_model_output(chat_with_openai(new_prompt)))
        except Exception as e:
            print("Model Error:", e)
    ner_safer(ModelName.OpenAI, DataType.eval, res)
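
The Gemini and ChatGPT cells differ only in the request function and the output file, so the shared loop could be factored out. Below is a rough sketch under that assumption. `annotate_in_batches`, `fake_model`, the template text and the record format are hypothetical stand-ins; the real `request`, `chat_with_openai` and `clean_model_output` would be passed in instead.

```python
import ast


def annotate_in_batches(items, prompt_template, ask, parse, batch_size=20):
    """Send items to a model in batches and collect the parsed annotations."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        try:
            reply = ask(prompt_template.replace("$DATA$", str(batch)))
            results.extend(parse(reply))
        except Exception as e:
            # Report the failed batch and continue with the next one
            print("Model Error:", e)
    return results


# Hypothetical stand-in for a model call: echoes one record per ingredient
def fake_model(prompt):
    batch = ast.literal_eval(prompt.split("DATA: ", 1)[1])
    return str([{item: {"entities": []}} for item in batch])


items = [f"ingredient {n}" for n in range(45)]
annotations = annotate_in_batches(items, "DATA: $DATA$", fake_model, ast.literal_eval)
print(len(annotations))  # 45
```

Passing the request function as a parameter keeps the prompt and the batching identical across models, which is what the comparison relies on.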

### University API

The University of Regensburg provides an open-source API for ingredient classification. We include it as well to see how it performs in comparison:
https://smarthome.ur.de/naehrwertrechner/

**! Hint: Because the annotated data is represented differently, this approach is evaluated in the Detection part**

In [8]:
if not os.path.exists(NER_EVAL / "UNI.json"):
    print("Start Annotation process")
    res = []
    for p in test:
        try:
            output = request_api(str(p))
            res.append({p: output})
            # Pause between requests
            time.sleep(0.5)
        except Exception as e:
            print("Model Error:", e)
    ner_safer(ModelName.UNI, DataType.eval, res)

### spaCy and ModernBERT

We annotated 50000 ingredients using Gemini 2.0 Flash and used them to train a spaCy model and to fine-tune ModernBERT, to see how well both perform in comparison with Gemini and OpenAI.

#### spaCy

In [9]:
if not os.path.exists(NER_EVAL / "Spacy.json"):
    print("Start Annotation process")
    res = []
    for obj in test:
        try:
            result = test_model(str(obj))
            # One record per ingredient: {text: {"entities": [...]}}
            record = {obj: {"entities": result}}
            res.append(record)
        except Exception as e:
            print("Model Error:", e)
    ner_safer(ModelName.Spacy, DataType.eval, res)

#### ModernBERT

In [10]:
if not os.path.exists(NER_EVAL / "ModernBert.json"):
    print("Start Annotation process")
    res = []
    bert = Bert()
    for obj in test:
        try:
            result = bert.predict_and_return(str(obj))
            res.append(result)
        except Exception as e:
            print("Model Error:", e)
    # Use the enum member, matching the ner_loader call in the evaluation
    ner_safer(ModelName.ModernBert, DataType.eval, res)

## Evaluation

### How does the evaluation work?
1. **Input:**
   - Two annotation lists: `ls_one`, `ls_two`

2. **make_one_dict()**
   - Converts each list into a dictionary:
     `{text: [entities]}`

3. **prepare_labels_for_texts()**
   - Tokenizes each text
   - Matches entities to tokens
   - Assigns labels for both annotators
   - Truncates label sequences to the same length

4. **Check class diversity**
   - Ensures both label lists contain at least two unique classes

5. **Compute Cohen's Kappa**
   - Uses `cohen_kappa_score(labels_one, labels_two)`
   - Measures inter-annotator agreement adjusted for chance

6. **Output:**
   - Final Kappa score
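
The steps above can be sketched roughly as follows. This is not the code in `evaluation.kappa_comparison`: the annotation format (`{text: {"entities": [[span, label], ...]}}`), the whitespace tokenization and the helper bodies are assumptions made for illustration. Kappa is computed by hand here, whereas the notebook relies on `cohen_kappa_score`.

```python
from collections import Counter


def make_one_dict(annotations):
    """Merge a list of {text: {"entities": [...]}} records into one dict."""
    merged = {}
    for record in annotations:
        for text, value in record.items():
            merged[text] = value["entities"]
    return merged


def label_tokens(text, entities):
    """Give each whitespace token the label of the entity span containing it, else 'O'."""
    labels = []
    for token in text.split():
        label = "O"
        for span, span_label in entities:
            if token in span.split():
                label = span_label
                break
        labels.append(label)
    return labels


def cohen_kappa(labels_one, labels_two):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), on sequences cut to equal length."""
    n = min(len(labels_one), len(labels_two))
    labels_one, labels_two = labels_one[:n], labels_two[:n]
    if len(set(labels_one)) < 2 or len(set(labels_two)) < 2:
        raise ValueError("each annotator needs at least two distinct classes")
    p_o = sum(a == b for a, b in zip(labels_one, labels_two)) / n
    count_one, count_two = Counter(labels_one), Counter(labels_two)
    p_e = sum(count_one[c] * count_two[c] for c in count_one) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Two made-up annotators that disagree on one token ("wheat")
text = "2 cups wheat flour"
ann_one = [{text: {"entities": [["2", "number"], ["cups", "units"], ["wheat flour", "ingredients"]]}}]
ann_two = [{text: {"entities": [["2", "number"], ["cups", "units"], ["wheat", "type"], ["flour", "ingredients"]]}}]

dict_one, dict_two = make_one_dict(ann_one), make_one_dict(ann_two)
labels_one, labels_two = [], []
for key in dict_one.keys() & dict_two.keys():
    labels_one += label_tokens(key, dict_one[key])
    labels_two += label_tokens(key, dict_two[key])

print(labels_one)  # ['number', 'units', 'ingredients', 'ingredients']
print(labels_two)  # ['number', 'units', 'type', 'ingredients']
print(round(cohen_kappa(labels_one, labels_two), 3))  # 0.667
```

In this toy case the annotators agree on 3 of 4 tokens (p_o = 0.75), chance agreement is p_e = 0.25, and kappa is 0.5 / 0.75 ≈ 0.667.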

In [11]:
openai_ann = ner_loader(ModelName.OpenAI,DataType.eval)
spacy_ann = ner_loader(ModelName.Spacy,DataType.eval)
gemini_ann = ner_loader(ModelName.Gemini,DataType.eval)
modern_bert = ner_loader(ModelName.ModernBert, DataType.eval)
ground_truth = ner_loader(ModelName.GroundTruth, DataType.eval)

In [12]:
print("Ground Truth vs. Gemini")
evaluate_annotations(ground_truth,gemini_ann)


Ground Truth vs. Gemini

📈 Metrics:
╒═══════════════╤═════════╕
│ Metric        │   Score │
╞═══════════════╪═════════╡
│ Accuracy      │  0.8475 │
├───────────────┼─────────┤
│ F1            │  0.8461 │
├───────────────┼─────────┤
│ MCC           │  0.8101 │
├───────────────┼─────────┤
│ Jaccard       │  0.7417 │
├───────────────┼─────────┤
│ Cohen's Kappa │  0.806  │
╘═══════════════╧═════════╛

🧮 Confusion matrix:

                O  ingredients  number  type  units
O            1208          153       9   176    143
ingredients    38         1263       4    48      0
number          7            4     971     3      1
type           17          183       2   525      8
units          19           21       9     5    758
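
As a sanity check, accuracy and Cohen's kappa can be recomputed from the confusion matrix printed above. The sketch below copies the Gemini matrix by hand and uses plain Python. Both metrics are unchanged when the matrix is transposed, so it does not matter which annotator is on the rows.

```python
# Confusion matrix "Ground Truth vs. Gemini"
# Label order: O, ingredients, number, type, units
matrix = [
    [1208, 153, 9, 176, 143],
    [38, 1263, 4, 48, 0],
    [7, 4, 971, 3, 1],
    [17, 183, 2, 525, 8],
    [19, 21, 9, 5, 758],
]

total = sum(sum(row) for row in matrix)
row_sums = [sum(row) for row in matrix]
col_sums = [sum(col) for col in zip(*matrix)]

# Observed agreement: share of tokens on the diagonal (equals accuracy)
p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
# Agreement expected by chance from the marginal label frequencies
p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
kappa = (p_o - p_e) / (1 - p_e)

print(total)            # 5575
print(round(p_o, 4))    # 0.8475
print(round(kappa, 3))  # 0.806
```

Both values match the reported accuracy and kappa for Gemini.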


In [13]:
print("OpenAI vs. Ground Truth")
evaluate_annotations(ground_truth,openai_ann)

OpenAI vs. Ground Truth

📈 Metrics:
╒═══════════════╤═════════╕
│ Metric        │   Score │
╞═══════════════╪═════════╡
│ Accuracy      │  0.7137 │
├───────────────┼─────────┤
│ F1            │  0.7018 │
├───────────────┼─────────┤
│ MCC           │  0.6389 │
├───────────────┼─────────┤
│ Jaccard       │  0.569  │
├───────────────┼─────────┤
│ Cohen's Kappa │  0.6295 │
╘═══════════════╧═════════╛

🧮 Confusion matrix:

                O  ingredients  number  type  units
O            1139          353      21   136    148
ingredients   145         1172       4    31      1
number         76            5     897     8      0
type           63          503       1   161      7
units          78           21       1    25    687


In [14]:
print("ModernBert vs. Ground Truth")
evaluate_annotations(ground_truth, modern_bert)

ModernBert vs. Ground Truth

📈 Metrics:
╒═══════════════╤═════════╕
│ Metric        │   Score │
╞═══════════════╪═════════╡
│ Accuracy      │  0.8444 │
├───────────────┼─────────┤
│ F1            │  0.8392 │
├───────────────┼─────────┤
│ MCC           │  0.8028 │
├───────────────┼─────────┤
│ Jaccard       │  0.7406 │
├───────────────┼─────────┤
│ Cohen's Kappa │  0.7992 │
╘═══════════════╧═════════╛

🧮 Confusion matrix:

                O  ingredients  number  type  units
O            1379          173       7   110     12
ingredients    86         1254       4     9      0
number          4            7     969     6      0
type          112          264       3   345     11
units          22           25       8     3    754


In [15]:
print("Spacy vs. Ground Truth")
evaluate_annotations(ground_truth, spacy_ann)

Spacy vs. Ground Truth

📈 Metrics:
╒═══════════════╤═════════╕
│ Metric        │   Score │
╞═══════════════╪═════════╡
│ Accuracy      │  0.788  │
├───────────────┼─────────┤
│ F1            │  0.7913 │
├───────────────┼─────────┤
│ MCC           │  0.7284 │
├───────────────┼─────────┤
│ Jaccard       │  0.6622 │
├───────────────┼─────────┤
│ Cohen's Kappa │  0.7252 │
╘═══════════════╧═════════╛

🧮 Confusion matrix:

                O  ingredients  number  type  units
O            1421          101       1   153      5
ingredients   282          987       1    83      0
number        113            0     868     5      0
type           74          153       0   499      9
units         174           11       9     6    612


#### Cohen's Kappa Interpretation

| Value (X)           | Interpretation           |
|---------------------|--------------------------|
| X < 0.00            | Poor Agreement           |
| 0.00 ≤ X ≤ 0.20     | Slight Agreement         |
| 0.21 ≤ X ≤ 0.40     | Fair Agreement           |
| 0.41 ≤ X ≤ 0.60     | Moderate Agreement       |
| 0.61 ≤ X ≤ 0.80     | Substantial Agreement    |
| 0.81 ≤ X ≤ 1.00     | Almost Perfect Agreement |
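
The bands can also be applied programmatically. Below is a small sketch, assuming the commonly used Landis and Koch cut-offs and rounding kappa to two decimals before the lookup. `interpret_kappa` is a hypothetical helper; the scores are the ones reported in this notebook.

```python
def interpret_kappa(kappa):
    """Map a Cohen's kappa value to its agreement band (bands defined at two decimals)."""
    k = round(kappa, 2)
    if k < 0.0:
        return "Poor Agreement"
    bands = [
        (0.20, "Slight Agreement"),
        (0.40, "Fair Agreement"),
        (0.60, "Moderate Agreement"),
        (0.80, "Substantial Agreement"),
        (1.00, "Almost Perfect Agreement"),
    ]
    for upper, name in bands:
        if k <= upper:
            return name
    raise ValueError("kappa cannot exceed 1.0")


scores = {"Gemini": 0.806, "GPT-4o mini": 0.6295, "Spacy": 0.7252, "ModernBert": 0.7992}
for model, score in scores.items():
    print(model, "->", interpret_kappa(score))
```

With this rounding convention, Gemini (0.806, rounded 0.81) falls into the almost perfect band, while ModernBert (0.7992, rounded 0.80) stays just inside the substantial band.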

#### Results
| Cohen's Kappa | Gemini 2.0 Flash | GPT-4o mini | spaCy (*) | ModernBERT (*) |
|---------------|------------------|-------------|-----------|----------------|
| Ground Truth  | 0.806            | 0.6295      | 0.7252    | 0.7992         |

(*) trained on Gemini-annotated data

- As an annotator, Gemini agrees with the ground truth considerably better than GPT-4o mini (0.806 vs. 0.6295)
- Gemini sits at the threshold of almost perfect agreement (0.806, i.e. 0.81 when rounded to two decimals; see table: Cohen's Kappa Interpretation)
- ModernBERT shows substantial agreement with the ground truth (0.7992), right at the border to almost perfect agreement

=> We continue using ModernBERT to find the entities in the text