
# Test the LLM annotation with Pydantic and Instructor  ☑

### ⚠️ Background  
Previous attempts used **XML tags** (e.g., `<MOL>POPC</MOL>`) to annotate entities in Molecular Dynamics (MD) texts.  
However, this approach had several issues:  
- LLMs sometimes **modified the input text**  
- The **XML structure** was often **broken** (missing tags)  
- Parsing results were **fragile and inconsistent**


### 🎯 Objectives  
- 🧾 **Enforce structured JSON** outputs instead of XML  
- 🔍 **Control schema** using:
  - **Pydantic** → defines the expected JSON structure  
  - **Instructor** → forces the LLM to comply with that structure
- ✅ **Validate** the content of the output against the original text (to reduce hallucinated entities)
- 📊 **Compare the performance** of small open-weight models (e.g., Llama-8B) against larger proprietary models (e.g., OpenAI’s GPT-4o-mini)
__________
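
To make the "control schema" objective concrete, here is a minimal sketch of what a Pydantic schema for the annotations could look like. The class names `Entity` and `EntityList` and the restricted label set are assumptions for illustration (only labels that appear in this notebook are listed); the actual schema lives in `utils` and may differ.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class Entity(BaseModel):
    # Assumed label set: only the labels that appear in this notebook.
    label: Literal["MOL", "SOFTNAME", "FFM"]
    text: str


class EntityList(BaseModel):
    entities: list[Entity]


good = '{"entities": [{"label": "SOFTNAME", "text": "AMBER"}]}'
bad = '{"entities": [{"label": "UNKNOWN", "text": "AMBER"}]}'

# A well-formed payload is parsed into typed objects.
parsed = EntityList.model_validate_json(good)
print(parsed.entities[0].label, parsed.entities[0].text)

# A payload with an unexpected label is rejected.
try:
    EntityList.model_validate_json(bad)
    rejected = False
except ValidationError as err:
    rejected = True
    print(f"rejected with {err.error_count()} error(s)")
```

Restricting `label` with `Literal` is one possible design choice: it turns an unknown label into a validation error instead of silently accepting it.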

## Package version

In [1]:
%load_ext watermark
%watermark
%watermark --packages numpy,pandas,matplotlib,pydantic,openai,groq,instructor

Last updated: 2025-10-23T16:11:45.215184+02:00

Python implementation: CPython
Python version       : 3.13.7
IPython version      : 8.13.2

Compiler    : GCC 14.3.0
OS          : Linux
Release     : 6.14.0-32-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 32
Architecture: 64bit

numpy     : 2.3.1
pandas    : 2.3.1
matplotlib: 3.10.3
pydantic  : 2.11.7
openai    : 1.93.2
groq      : 0.29.0
instructor: 1.11.3



______

In [41]:

# Import libraries
import os
from tqdm import tqdm
import textwrap

import instructor
from groq import Groq
from dotenv import load_dotenv

# To auto-reload modules when they are changed
%load_ext autoreload  
%autoreload 2
# Import utility constants
from utils import PROMPT
# Import utility functions
from utils import (
    annotate,
    visualize_entities,
    validate_annotation_output_format,
    compare_annotation_validation,
    report_hallucinated_entities,
    find_hallucinated_entities,
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# Constants
MODEL_GROQ = "llama-3.1-8b-instant"
MODEL_OPENAI = "gpt-4o-mini-2024-07-18"
MODEL_OPENROUTER = "meta-llama/llama-3.1-8b-instruct"

EX_MD_TEXTS = [
    "The protein was simulated using AMBER with the TIP3P water model.",
    "Visualization was done with VMD, and the MD run used GROMOS96 parameters.",
    "A POPC:CHOL membrane was built using CHARMM-GUI.",
    """Improved Coarse-Grained Modeling of Cholesterol-Containing Lipid Bilayers
Cholesterol trafficking, which is an essential function in mammalian cells, is intimately connected to molecular-scale interactions through cholesterol modulation of membrane structure and dynamics and interaction with membrane receptors. Since these effects of cholesterol occur on micro- to millisecond time scales, it is essential to develop accurate coarse-grained simulation models that can reach these time scales. Cholesterol has been shown experimentally to thicken the membrane and increase phospholipid tail order between 0 and 40% cholesterol, above which these effects plateau or slightly decrease. Here, we showed that the published MARTINI coarse-grained force-field for phospholipid (POPC) and cholesterol fails to capture these effects. Using reference atomistic simulations, we systematically modified POPC and cholesterol bonded parameters in MARTINI to improve its performance. We showed that the corrections to pseudobond angles between glycerol and the lipid tails and around the oleoyl double bond particle (the angle-corrected model ) slightly improves the agreement of MARTINI with experimentally measured thermal, elastic, and dynamic properties of POPC membranes. The angle-corrected model improves prediction of the thickening and ordering effects up to 40% cholesterol but overestimates these effects at higher cholesterol concentration. In accordance with prior work that showed the cholesterol rough face methyl groups are important for limiting cholesterol self-association, we revised the coarse-grained representation of these methyl groups to better match cholesterol-cholesterol radial distribution functions from atomistic simulations. In addition, by using a finer-grained representation of the branched cholesterol tail than MARTINI, we improved predictions of lipid tail order and bilayer thickness across a wide range of concentrations. 
Finally, transferability testing shows that a model incorporating our revised parameters into DOPC outperforms other CG models in a DOPC/cholesterol simulation series, which further argues for its efficacy and generalizability. These results argue for the importance of systematic optimization for coarse-graining biologically important molecules like cholesterol with complicated molecular structure.
"""
]

In [5]:
# Configs
# Load environment variables from .env file
load_dotenv()

# Initialize the Groq client (https://python.useinstructor.com/integrations/groq/)
# from_provider creates the underlying Groq client; GROQ_API_KEY is read from the environment loaded above
client_groq = instructor.from_provider(f"groq/{MODEL_GROQ}", mode=instructor.Mode.JSON)

# Initialize the OpenAI client
client_openai = instructor.from_provider(f"openai/{MODEL_OPENAI}", mode=instructor.Mode.JSON)

# Initialize the OpenRouter client (https://python.useinstructor.com/integrations/openrouter/)
client_openrouter = instructor.from_provider(f"openrouter/{MODEL_OPENROUTER}", mode=instructor.Mode.JSON)

_____________

# I. Simple example

## I.1 Without validation 🚫

The LLM output is returned **as-is**, without enforcing a schema.  
This helps check if the model produces **valid JSON** and adheres to the expected structure.

In [6]:
text_to_annotate = EX_MD_TEXTS[0]
response = annotate(text_to_annotate, MODEL_OPENROUTER, client=client_openrouter, validation=False)

print("=" * 80)
print("📝 INPUT TEXT ")
print("=" * 80)
print(text_to_annotate)
print("\n")

print("=" * 80)
print(f"🤖 MODEL RESPONSE ({MODEL_OPENROUTER})")
print("=" * 80)
print(response.choices[0].message.content)
print("\n")


if validate_annotation_output_format(response):
    print("Valid annotation format ✅")
else:
    print("Invalid annotation format ❌")

📝 INPUT TEXT 
The protein was simulated using AMBER with the TIP3P water model.


🤖 MODEL RESPONSE (meta-llama/llama-3.1-8b-instruct)
Here is the extracted entity list in the required JSON format:

{"entities": [{"label": "SOFTNAME", "text": "AMBER"}, {"label": "FFM", "text": "TIP3P"}]}


Invalid annotation format ❌


 ➡️ In this case, the JSON payload itself is valid, but the model prepends "***Here is the extracted entity list in the required JSON format:***" instead of strictly following the instruction to output only JSON, so the format check fails.
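
The failure mode is easy to reproduce with the standard library alone. The sketch below uses a hypothetical helper, `is_strict_json_annotation`, which is not the `validate_annotation_output_format` function from `utils`; it only assumes that a valid response must be a bare JSON object with an `entities` list.

```python
import json


def is_strict_json_annotation(content: str) -> bool:
    """Return True only if the whole string is a JSON object with an 'entities' list."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("entities"), list)


payload = '{"entities": [{"label": "SOFTNAME", "text": "AMBER"}, {"label": "FFM", "text": "TIP3P"}]}'
chatty = "Here is the extracted entity list in the required JSON format:\n\n" + payload

bare_ok = is_strict_json_annotation(payload)    # bare JSON parses
chatty_ok = is_strict_json_annotation(chatty)   # the preamble breaks parsing
print(bare_ok, chatty_ok)
```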

## I.2 With validation ✅

Now we test the same query with Pydantic and Instructor.
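
Conceptually, Instructor validates the model's reply against a Pydantic model and asks again when validation fails. The following is a rough sketch of that idea, not Instructor's actual implementation: `ask_with_retries` is a hypothetical helper, the schema classes are assumed, and the LLM call is replaced by a stand-in that is chatty on the first attempt and clean on the second.

```python
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError


class Entity(BaseModel):
    label: str
    text: str


class EntityList(BaseModel):
    entities: list[Entity]


def ask_with_retries(
    ask: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[EntityList]:
    """Call `ask`, validate the reply, and re-ask with the error message on failure."""
    message = prompt
    for _ in range(max_attempts):
        reply = ask(message)
        try:
            return EntityList.model_validate_json(reply)
        except ValidationError as err:
            # Feed the validation error back so the model can correct itself.
            message = f"{prompt}\n\nYour previous reply was invalid:\n{err}\nReturn JSON only."
    return None  # every attempt failed validation


# Stand-in for a real LLM call (assumption for illustration only).
replies = iter([
    'Here is the JSON: {"entities": []}',
    '{"entities": [{"label": "SOFTNAME", "text": "AMBER"}]}',
])
result = ask_with_retries(lambda _message: next(replies), "Annotate the text.")
print(result)
```

Returning `None` after the last attempt mirrors the situation where a validated annotation still fails after all retries.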

In [7]:
text_to_annotate = EX_MD_TEXTS[0]
response = annotate(text_to_annotate, MODEL_OPENROUTER, client=client_openrouter, validation=True)
response_json = response.model_dump_json(indent=2)

print("=" * 80)
print("📝 INPUT TEXT ")
print("=" * 80)
print(text_to_annotate)
print("\n")

print("=" * 80)
print(f"🤖 MODEL RESPONSE ({MODEL_OPENROUTER})")
print("=" * 80)
print(response_json)
print("\n")

# test if the response is a valid JSON
if validate_annotation_output_format(response):
    print("Valid annotation format ✅")
else:
    print("Invalid annotation format ❌")

📝 INPUT TEXT 
The protein was simulated using AMBER with the TIP3P water model.


🤖 MODEL RESPONSE (meta-llama/llama-3.1-8b-instruct)
{
  "entities": [
    {
      "label": "SOFTNAME",
      "text": "AMBER"
    },
    {
      "label": "FFM",
      "text": "TIP3P"
    }
  ]
}


Valid annotation format ✅


In [8]:
print("=" * 80)
print("🧐 VISUALIZATION OF EXTRACTED ENTITIES")
print("=" * 80)
# Visualize entities using displacy
visualize_entities(text_to_annotate, response_json)

🧐 VISUALIZATION OF EXTRACTED ENTITIES
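
For readers without access to `utils`, here is a hedged sketch of how extracted entities could be turned into the dictionary format that displaCy accepts in manual mode (`spacy.displacy.render(doc, style="ent", manual=True)`). The helper `to_displacy_manual` is hypothetical, and it assumes that highlighting the first occurrence of each entity is good enough; `visualize_entities` may work differently.

```python
def to_displacy_manual(text: str, entities: list[dict]) -> dict:
    """Locate each entity's first occurrence and build displaCy's manual 'ent' input."""
    ents = []
    for ent in entities:
        start = text.find(ent["text"])
        if start == -1:
            continue  # skip entities that do not occur in the text
        ents.append({"start": start, "end": start + len(ent["text"]), "label": ent["label"]})
    ents.sort(key=lambda e: e["start"])  # keep spans in reading order
    return {"text": text, "ents": ents}


text = "The protein was simulated using AMBER with the TIP3P water model."
entities = [
    {"label": "FFM", "text": "TIP3P"},
    {"label": "SOFTNAME", "text": "AMBER"},
]
doc = to_displacy_manual(text, entities)
for span in doc["ents"]:
    print(span["label"], text[span["start"]:span["end"]])
```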


# II. Example from our annotation dataset

Now we want to validate our observations statistically. We’ll query the model 100 times and compare the results obtained with and without format validation.

In [49]:
text_to_annotate = EX_MD_TEXTS[3]
annotation_stats = compare_annotation_validation(text_to_annotate, MODEL_OPENROUTER, client_openrouter, num_iterations=100)

Running annotations without validation schema: 100%|██████████| 100/100 [11:03<00:00,  6.64s/it]
Running annotations with validation schema:  11%|█         | 11/100 [00:36<03:45,  2.53s/it]

⚠️ Validated annotation failed after 3 attempts.


Running annotations with validation schema: 100%|██████████| 100/100 [07:31<00:00,  4.52s/it]


📝 Input text:
Improved Coarse-Grained Modeling of Cholesterol-Containing Lipid Bilayers Cholesterol trafficking,
which is an essential function in mammalian cells, is intimately connected to molecular-scale
interactions through cholesterol modulation of membrane structure and dynamics and interaction with
membrane receptors. Since these effects of cholesterol occur on micro- to millisecond time scales,
it is essential to develop accurate coarse-grained simulation models that can reach these time
scales. Cholesterol has been shown experimentally to thicken the membrane and increase phospholipid
tail order between 0 and 40% cholesterol, above which these effects plateau or slightly decrease.
Here, we showed that the published MARTINI coarse-grained force-field for phospholipid (POPC) and
cholesterol fails to capture these effects. Using reference atomistic simulations, we systematically
modified POPC and cholesterol bonded parameters in MARTINI to improve its performance. We showed
that




➡️ For the same query (same text, model, and prompt), only a few responses (**17%**) are valid JSON without hallucinated text when no schema is enforced. As expected, adding a validation schema with Pydantic and Instructor significantly improves JSON validity, which reaches **49%** of the responses.
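
A percentage like the ones above is simply the share of responses that pass a validity check. The sketch below shows the arithmetic on made-up outputs; `validity_rate` and `parses_as_json` are hypothetical helpers, not the functions used by `compare_annotation_validation`.

```python
import json
from typing import Callable, Sequence


def parses_as_json(raw: str) -> bool:
    """True if the whole string is valid JSON."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


def validity_rate(outputs: Sequence[str], is_valid: Callable[[str], bool]) -> float:
    """Percentage of outputs accepted by `is_valid` (0.0 for an empty sequence)."""
    if not outputs:
        return 0.0
    return 100.0 * sum(is_valid(o) for o in outputs) / len(outputs)


# Made-up responses for illustration: only the first one is bare JSON.
outputs = [
    '{"entities": []}',
    'Sure! {"entities": []}',
    '{"entities": [',
    "no json here",
]
rate = validity_rate(outputs, parses_as_json)
print(f"{rate:.0f}% valid")
```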

Here is an example of hallucinated entities (entities that do not appear in the original text to annotate):

In [74]:
annotations = annotation_stats["without_validation"]["examples"]
hallucinated_entities = find_hallucinated_entities(annotations, text_to_annotate)
report_hallucinated_entities(hallucinated_entities, text_to_annotate)


📝 Original text:
Improved Coarse-Grained Modeling of Cholesterol-Containing Lipid Bilayers Cholesterol trafficking, which is an essential
function in mammalian cells, is intimately connected to molecular-scale interactions through cholesterol modulation of
membrane structure and dynamics and interaction with membrane receptors. Since these effects of cholesterol occur on
micro- to millisecond time scales, it is essential to develop accurate coarse-grained simulation models that can reach
these time scales. Cholesterol has been shown experimentally to thicken the membrane and increase phospholipid tail
order between 0 and 40% cholesterol, above which these effects plateau or slightly decrease. Here, we showed that the
published MARTINI coarse-grained force-field for phospholipid (POPC) and cholesterol fails to capture these effects.
Using reference atomistic simulations, we systematically modified POPC and cholesterol bonded parameters in MARTINI to
improve its performance. We showed t
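
The idea behind this check can be sketched in a few lines: an entity counts as hallucinated when its text is not an exact substring of the source. `find_missing_entities` is a hypothetical stand-in for `find_hallucinated_entities`, whose real matching rules (case handling, whitespace normalization) are not shown in this notebook, and the labels in the example are assumed.

```python
def find_missing_entities(entities: list[dict], source_text: str) -> list[dict]:
    """Return the entities whose text is not an exact substring of the source."""
    return [ent for ent in entities if ent["text"] not in source_text]


source = "A POPC:CHOL membrane was built using CHARMM-GUI."
entities = [
    {"label": "MOL", "text": "POPC"},
    {"label": "SOFTNAME", "text": "CHARMM-GUI"},
    {"label": "SOFTNAME", "text": "GROMACS"},  # does not occur in the sentence
]
missing = find_missing_entities(entities, source)
print(missing)
```

Exact substring matching is strict on purpose: it also flags entities that the model rewrote slightly, which is the text-modification problem described in the background section.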

In [82]:
print("=" * 80)
print("🧐 VISUALIZATION OF EXTRACTED ENTITIES")
print("=" * 80)
# Visualize entities using displacy
example_annotation = annotation_stats["with_validation"]["examples"][16]
visualize_entities(text_to_annotate, example_annotation)

🧐 VISUALIZATION OF EXTRACTED ENTITIES


## II.3 Testing with OpenAI models

In [83]:
text_to_annotate = EX_MD_TEXTS[3]
annotation_stats = compare_annotation_validation(text_to_annotate, MODEL_OPENAI, client_openai, num_iterations=100)

Running annotations without validation schema: 100%|██████████| 100/100 [07:08<00:00,  4.29s/it]
Running annotations with validation schema: 100%|██████████| 100/100 [04:21<00:00,  2.62s/it]


📝 Input text:
Improved Coarse-Grained Modeling of Cholesterol-Containing Lipid Bilayers Cholesterol trafficking, which is an essential
function in mammalian cells, is intimately connected to molecular-scale interactions through cholesterol modulation of
membrane structure and dynamics and interaction with membrane receptors. Since these effects of cholesterol occur on
micro- to millisecond time scales, it is essential to develop accurate coarse-grained simulation models that can reach
these time scales. Cholesterol has been shown experimentally to thicken the membrane and increase phospholipid tail
order between 0 and 40% cholesterol, above which these effects plateau or slightly decrease. Here, we showed that the
published MARTINI coarse-grained force-field for phospholipid (POPC) and cholesterol fails to capture these effects.
Using reference atomistic simulations, we systematically modified POPC and cholesterol bonded parameters in MARTINI to
improve its performance. We showed that




➡️ With an OpenAI model, 88% of the responses are correct even without the validation schema, and 100% are correct with validation.

# Conclusion 💡

The tests show that validation helps ensure the output follows the expected schema.

With OpenAI models, however, the JSON is often already well structured. In contrast, smaller models like Llama 3.1 8B (tested here as `meta-llama/llama-3.1-8b-instruct`) tend to produce less consistent outputs. The strong performance of larger models is also partly due to effective prompt engineering techniques, such as few-shot examples.


In short, combining good prompt design with schema validation through Pydantic and Instructor provides the structure needed for models to generate consistent and trustworthy outputs 🌟