# Testing LLM annotation with Pydantic and Instructor ☑

### ⚠️ Background  
Previous attempts used **XML tags** (e.g., `<MOL>POPC</MOL>`) to annotate entities in Molecular Dynamics (MD) texts.  
However, this approach had several issues:  
- LLMs sometimes **modified the input text**  
- The **XML structure** was often **broken** (missing tags)  
- Parsing results were **fragile and inconsistent**
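
The two failure modes above can be detected mechanically. The following is a minimal sketch, not the project's actual checker: the `<SOFT>` tag name and both helper functions are hypothetical, and only `<MOL>` appears in the original description.

```python
import re

# Matches opening and closing entity tags such as <MOL> and </MOL>.
TAG_PATTERN = re.compile(r"</?([A-Z]+)>")


def strip_tags(annotated: str) -> str:
    """Remove XML-style entity tags, leaving only the underlying text."""
    return TAG_PATTERN.sub("", annotated)


def tags_are_balanced(annotated: str) -> bool:
    """Return True if every opening tag is closed by a matching tag, in order."""
    stack = []
    for match in TAG_PATTERN.finditer(annotated):
        name = match.group(1)
        if match.group(0).startswith("</"):
            # A closing tag must match the most recent opening tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack


original = "A POPC:CHOL membrane was built using CHARMM-GUI."
# Hypothetical well-formed annotation (SOFT is an assumed tag name).
good = "A <MOL>POPC</MOL>:<MOL>CHOL</MOL> membrane was built using <SOFT>CHARMM-GUI</SOFT>."
# Hypothetical faulty annotation: one missing </MOL> and "using" changed to "with".
broken = "A <MOL>POPC:<MOL>CHOL</MOL> membrane was built with <SOFT>CHARMM-GUI</SOFT>."

text_preserved_good = strip_tags(good) == original
text_preserved_broken = strip_tags(broken) == original
```

Here `broken` fails both checks: its tags do not balance, and stripping them does not give back the input text.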


### 🎯 Objectives  
- ✅ **Improve** LLM annotation outputs for MD text  
- 🧾 **Enforce structured JSON** outputs instead of XML  
- 🔍 **Validate** and **control schema** using:
  - **Pydantic** → defines the expected JSON structure  
  - **Instructor** → forces the LLM to comply with that structure
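
As an illustration of what "defines the expected JSON structure" means, here is a minimal sketch assuming Pydantic v2. The `Entity` and `Annotation` classes and their field names are assumptions made for this example; the schema actually used by `utils.annotate` is not shown in this notebook.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Entity(BaseModel):
    """One annotated span (hypothetical schema)."""

    label: str = Field(description="Entity category, e.g. MOL")
    text: str = Field(description="Exact span copied from the input text")


class Annotation(BaseModel):
    """Top-level object the LLM is expected to return (hypothetical schema)."""

    entities: List[Entity]


raw_ok = '{"entities": [{"label": "MOL", "text": "POPC"}]}'
raw_bad = '{"entities": [{"label": "MOL"}]}'  # the "text" field is missing

# Well-formed JSON that matches the schema is parsed into typed objects.
parsed = Annotation.model_validate_json(raw_ok)

# JSON that parses but violates the schema raises a ValidationError.
try:
    Annotation.model_validate_json(raw_bad)
    bad_rejected = False
except ValidationError:
    bad_rejected = True
```

With Instructor, a class like this would typically be passed as the `response_model` argument of the patched client's completion call; whether `utils.annotate` does exactly that is not visible here.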
__________

## Package version

In [None]:
%load_ext watermark
%watermark
%watermark --packages numpy,pandas,matplotlib,pydantic,openai,groq,instructor

______

In [None]:
# Import libraries
import os
import json
from tqdm import tqdm

import instructor
from groq import Groq
from openai import OpenAI
from dotenv import load_dotenv
from pydantic import BaseModel, Field

# Import utility functions and constants
from utils import annotate, check_json_validity, PROMPT, visualize_entities

In [None]:
# Constants
MODEL_GROQ = "llama-3.1-8b-instant"
MODEL_OPENAI = "o3-mini-2025-01-31"

EX_MD_TEXTS = [
    "The protein was simulated using AMBER with the TIP3P water model.",
    "Visualization was done with VMD, and the MD run used GROMOS96 parameters.",
    "A POPC:CHOL membrane was built using CHARMM-GUI.",
    """Improved Coarse-Grained Modeling of Cholesterol-Containing Lipid Bilayers
Cholesterol trafficking, which is an essential function in mammalian cells, is intimately connected to molecular-scale interactions through cholesterol modulation of membrane structure and dynamics and interaction with membrane receptors. Since these effects of cholesterol occur on micro- to millisecond time scales, it is essential to develop accurate coarse-grained simulation models that can reach these time scales. Cholesterol has been shown experimentally to thicken the membrane and increase phospholipid tail order between 0 and 40% cholesterol, above which these effects plateau or slightly decrease. Here, we showed that the published MARTINI coarse-grained force-field for phospholipid (POPC) and cholesterol fails to capture these effects. Using reference atomistic simulations, we systematically modified POPC and cholesterol bonded parameters in MARTINI to improve its performance. We showed that the corrections to pseudobond angles between glycerol and the lipid tails and around the oleoyl double bond particle (the angle-corrected model ) slightly improves the agreement of MARTINI with experimentally measured thermal, elastic, and dynamic properties of POPC membranes. The angle-corrected model improves prediction of the thickening and ordering effects up to 40% cholesterol but overestimates these effects at higher cholesterol concentration. In accordance with prior work that showed the cholesterol rough face methyl groups are important for limiting cholesterol self-association, we revised the coarse-grained representation of these methyl groups to better match cholesterol-cholesterol radial distribution functions from atomistic simulations. In addition, by using a finer-grained representation of the branched cholesterol tail than MARTINI, we improved predictions of lipid tail order and bilayer thickness across a wide range of concentrations. 
Finally, transferability testing shows that a model incorporating our revised parameters into DOPC outperforms other CG models in a DOPC/cholesterol simulation series, which further argues for its efficacy and generalizability. These results argue for the importance of systematic optimization for coarse-graining biologically important molecules like cholesterol with complicated molecular structure.
"""
]

In [None]:
# Configs
# Load environment variables from .env file
load_dotenv()

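# Initialize the Groq client (https://python.useinstructor.com/integrations/groq/)
# A separate Groq(api_key=...) client is not needed: its value would be overwritten here.
# from_provider builds its own client and is expected to read GROQ_API_KEY from the environment.
client_groq = instructor.from_provider(f"groq/{MODEL_GROQ}", mode=instructor.Mode.JSON)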

# Initialize the OpenAI client
client_openai = instructor.from_provider(f"openai/{MODEL_OPENAI}", mode=instructor.Mode.JSON)

_____________

# I. Simple example

## I.1 Without validation 🚫

The LLM output is returned **as-is**, without enforcing a schema.  
This helps check if the model produces **valid JSON** and adheres to the expected structure.
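
The implementation of `check_json_validity` lives in `utils` and is not shown here. A check of this kind can be sketched with the standard library alone; the function name `is_valid_json` below is a hypothetical stand-in, not the project's helper.

```python
import json


def is_valid_json(raw) -> bool:
    """Return True if `raw` is a string (or bytes) that parses as JSON."""
    try:
        json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON text.
        # TypeError: `raw` is not str/bytes (e.g. None or an object).
        return False
    return True


valid = is_valid_json('{"entities": [{"label": "SOFTWARE", "text": "AMBER"}]}')
truncated = is_valid_json('{"entities": [{"label": "SOFTWARE", "text": "AMB')
fenced = is_valid_json('```json\n{"entities": []}\n```')
```

Note that a check like this only tests syntax: a response wrapped in Markdown fences is rejected, while syntactically valid JSON with the wrong fields would still pass.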

In [None]:
texts_to_annotate = EX_MD_TEXTS[0]
response = annotate(texts_to_annotate, MODEL_GROQ, client=client_groq, validation=False)

print("=" * 80)
print("📝 INPUT TEXT ")
print("=" * 80)
print(texts_to_annotate)
print("\n")

print("=" * 80)
print(f"🤖 MODEL RESPONSE ({MODEL_GROQ})")
print("=" * 80)
print(response)
print("\n")

# Check whether the response is valid JSON
if check_json_validity(response):
    print("Valid JSON ✅")
else:
    print("Invalid JSON ❌")

## I.2 With validation ✅

Now we run the same input with Pydantic and Instructor enforcing the schema.
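
Conceptually, validation with re-asking works as a loop: generate, validate against the schema, and on failure send the error back to the model. The sketch below illustrates that idea with a stubbed model, assuming Pydantic v2. It is not Instructor's implementation; `annotate_with_retries`, `fake_llm`, and the schema are all hypothetical.

```python
from typing import Callable, List

from pydantic import BaseModel, ValidationError


class Entity(BaseModel):
    label: str
    text: str


class Annotation(BaseModel):
    entities: List[Entity]


def annotate_with_retries(
    generate: Callable[[str], str], prompt: str, max_retries: int = 2
) -> Annotation:
    """Call `generate`, validate its output, and re-ask with the error on failure."""
    current_prompt = prompt
    for attempt in range(max_retries + 1):
        raw = generate(current_prompt)
        try:
            return Annotation.model_validate_json(raw)
        except ValidationError as err:
            if attempt == max_retries:
                raise  # out of retries: surface the validation error
            # Feed the validation error back so the model can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour previous answer was invalid:\n{err}\n"
                "Return corrected JSON only."
            )
    raise ValueError("max_retries must be >= 0")


calls = []


def fake_llm(prompt: str) -> str:
    """Stub model: the first answer misses a field, the second one is correct."""
    calls.append(prompt)
    if len(calls) == 1:
        return '{"entities": [{"label": "SOFTWARE"}]}'
    return '{"entities": [{"label": "SOFTWARE", "text": "AMBER"}]}'


result = annotate_with_retries(fake_llm, "Annotate: The protein was simulated using AMBER.")
```

In this run the stub is called twice: the first output fails validation, and the second prompt carries the error message back to the model.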

In [None]:
texts_to_annotate = EX_MD_TEXTS[0]
response = annotate(texts_to_annotate, MODEL_OPENAI, client=client_openai, validation=True)

print(f"Input:\n{texts_to_annotate}\n")
print(f"Response with validation ({MODEL_OPENAI}):\n{response}")
# Check whether the response is valid JSON
if check_json_validity(response):
    print("Valid JSON ✅")
else:
    print("Invalid JSON ❌")

In [None]:
# Visualize entities using displacy
visualize_entities(texts_to_annotate, response)  

In [None]:
texts_to_annotate = EX_MD_TEXTS[0]
response = annotate(texts_to_annotate, MODEL_GROQ, client=client_groq, validation=True)

print(f"Input:\n{texts_to_annotate}\n")
print(f"Response with validation ({MODEL_GROQ}):\n{response}")
# Check whether the response is valid JSON
if check_json_validity(response):
    print("Valid JSON ✅")
else:
    print("Invalid JSON ❌")


# II. Example from our annotation dataset

To check these observations more systematically, we query the model repeatedly on a longer text from our annotation dataset (up to 100 runs per configuration) and compare how many responses are valid JSON with and without format validation.
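
When comparing validity counts, the number of runs matters. One possible way to quantify this, which this notebook does not itself use, is a Wilson score interval on the proportion of valid responses. The sketch below uses only the standard library; the counts passed in are placeholders for an "all runs valid" scenario, not measured results.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Placeholder counts: every response valid, with 3 runs versus 100 runs.
low_3, high_3 = wilson_interval(3, 3)
low_100, high_100 = wilson_interval(100, 100)
```

Even when every response is valid, the lower bound for 3 runs is far below the one for 100 runs, so configurations with very different run counts should be compared with caution.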

## II.1 Without validation 🚫


In [None]:
texts_to_annotate = EX_MD_TEXTS[3]
stat_num_iterations = 100
valid_count = 0
invalid_count = 0
entities_list = []

for _ in tqdm(range(stat_num_iterations), desc="Running annotations"):
    entities_list.append(annotate(texts_to_annotate, MODEL_GROQ, client_groq, validation=False))
    if check_json_validity(entities_list[-1]):
        valid_count += 1
    else:
        invalid_count += 1

print("\n" + "="*80)
print(f"📝 Input text:\n{texts_to_annotate}\n")
print(f"📊 Summary after {stat_num_iterations} runs with {MODEL_GROQ}:")
print(f"✔ Valid responses:   {valid_count}")
print(f"❌ Invalid responses: {invalid_count}")
print("="*80)

In [None]:
entities_list

## II.2 With validation ✅


In [None]:
texts_to_annotate = EX_MD_TEXTS[3]
stat_num_iterations = 3
valid_count = 0
invalid_count = 0
entities_validated_list = []

for _ in tqdm(range(stat_num_iterations), desc="Running annotations"):
    entities_validated_list.append(annotate(texts_to_annotate, MODEL_GROQ, client_groq, validation=True))
    # time.sleep(30)  # Uncomment to avoid rate limiting (requires `import time`)
    if check_json_validity(entities_validated_list[-1]):
        valid_count += 1
    else:
        invalid_count += 1

print("\n" + "="*80)
print(f"📝 Input text:\n{texts_to_annotate}\n")
print(f"📊 Summary after {stat_num_iterations} runs with {MODEL_GROQ}:")
print(f"✔ Valid responses:   {valid_count}")
print(f"❌ Invalid responses: {invalid_count}")

In [None]:
entities_validated_list

In [None]:
len(entities_validated_list)

# III. OpenAI models

In [None]:
texts_to_annotate = EX_MD_TEXTS[3]
stat_num_iterations = 100
valid_count = 0
invalid_count = 0

for _ in tqdm(range(stat_num_iterations), desc="Running annotations"):
    response = annotate(texts_to_annotate, MODEL_OPENAI, client_openai, validation=False)
    if check_json_validity(response):
        valid_count += 1
    else:
        invalid_count += 1

print("\n" + "="*80)
print(f"📝 Input text:\n{texts_to_annotate}\n")
print(f"📊 Summary after {stat_num_iterations} runs with {MODEL_OPENAI}:")
print(f"✔ Valid responses:   {valid_count}")
print(f"❌ Invalid responses: {invalid_count}")

In [None]:
texts_to_annotate = EX_MD_TEXTS[3]
stat_num_iterations = 100
valid_count = 0
invalid_count = 0

for _ in tqdm(range(stat_num_iterations), desc="Running annotations"):
    response = annotate(texts_to_annotate, MODEL_OPENAI, client_openai, validation=True)
    if check_json_validity(response):
        valid_count += 1
    else:
        invalid_count += 1

print("\n" + "="*80)
print(f"📝 Input text:\n{texts_to_annotate}\n")
print(f"📊 Summary after {stat_num_iterations} runs with {MODEL_OPENAI}:")
print(f"✔ Valid responses:   {valid_count}")
print(f"❌ Invalid responses: {invalid_count}")

# Conclusion 💡

These tests indicate that validation helps ensure the output follows the expected schema.

With OpenAI models, however, the JSON is often already well-structured without validation. In contrast, smaller models such as llama-3.1-8b-instant tend to produce less consistent outputs. The strong performance of larger models is also partly due to effective prompt engineering techniques, such as few-shot examples.


In short, combining good prompt design with schema validation through Pydantic and Instructor provides the structure needed for models to generate consistent and trustworthy outputs 🌟