![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Medical_Language_Models/Healthcare_NLP_Agents_with_LLMs.ipynb)

# **JohnSnowLabs Healthcare NLP Agents with LLMs (Certification Trainings, April 2025)**

# Coding an LLM Agent with John Snow Labs Library (Healthcare NLP)  
This notebook demonstrates how to build healthcare information extraction agents by combining general-purpose LLMs with specialized John Snow Labs medical NLP models. We extract entities from clinical texts, build tools for analyzing medical conditions and medications, and create agents that can filter, analyze, and highlight critical findings in patient records.

# Initial Setup with John Snow Labs Libraries

In [None]:
import json
import os
from google.colab import files

# Upload license keys if needed
if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

# Installing required packages
!pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION
!pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET
!pip install -q spark-nlp-display

# Installing modern LLM and agent packages
!pip install -q langchain pydantic~=2.5 transformers accelerate langchain-huggingface langchain-community langchain-core langchain-openai
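
The install commands above read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from the license file, so a missing key would only surface later as a confusing pip error. A minimal sketch of an early check, assuming those three keys are the ones this notebook needs (the helper name `missing_license_keys` and the placeholder values are hypothetical):

```python
# Keys referenced by the install and session-start cells in this notebook
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [k for k in required if not keys.get(k)]

# Illustrative placeholder values, not real credentials
example_keys = {"SECRET": "xxx", "PUBLIC_VERSION": "5.5.0"}
print(missing_license_keys(example_keys))
```

Running a check like this right after `json.load` would let the notebook stop before any package installation starts.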

Now let's set up the Spark session and import necessary libraries:

In [None]:
# Set up Spark session
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)
spark

Spark NLP Version : 5.5.0
Spark NLP_JSL Version : 5.5.0


# **Information Extraction with LLMs**

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import json
import re

# Set random seed for reproducibility
torch.random.manual_seed(0)

# Load the model
model_id = "microsoft/Phi-4-mini-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generation arguments, passed to the pipeline at call time
generation_args = {
    "max_new_tokens": 2048,
    "return_full_text": False,
    "do_sample": False,  # greedy decoding; temperature has no effect when sampling is off
    "repetition_penalty": 1.1,
    "pad_token_id": tokenizer.eos_token_id,
}

text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


## **Simple Entity Extraction with LLM**

In [None]:
# Sample clinical text for demonstration
clinical_text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.
'''

# Define a simple prompt for entity extraction
simple_prompt = """
You are a highly experienced and skilled medical annotator who has been working on medical texts to label medical and pharma related entities.
Given the following text, return all the Problem, Test and Treatment entities:

"""

# Run the simple extraction
user_prompt = f"{simple_prompt} {clinical_text}"
output = text_generation_pipeline(user_prompt, **generation_args)
print(output[0]['generated_text'])

The patient underwent endoscopic retrograde cholangiopancreatography that revealed multiple dilated intrahepatic bile ducts without evidence of stones; however, there appeared to be some mild inflammation in both common hepatic duct branches as well as distal right posterior pancreatic ductal system.
Ultrasound showed normal gallbladder wall thickness but diffuse increased echogenicity throughout liver parenchyma consistent with steatosis. A CT scan confirmed these results showing marked fatty infiltration into hepatocytes bilaterally along with moderate splenomegaly.

Problem:
- Gestational Diabetes Mellitus
- Type Two Diabetes Mellitus (T2DM)
- Hypertriglyceridemia

Test:
- Metformin
- Glipizide
- Dapagliflozin
- Atorvastatin
- Gemfibrozil
- Serum glucose test result - 111 mg/dl
- Creatinine level - 0.4 mg/dL
- Triglycerides measurement - 508 mg/dL
- Total Cholesterol Level - 122 mg/dL
- Venous blood gas analysis - pH 7.27
- Endoscopic Retrograde Cholangio Pancreatography revealing i
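
The free-form response above includes findings, such as an endoscopic retrograde cholangiopancreatography, that never appear in the input note. A minimal sketch of a grounding check that flags such spans, assuming a case-insensitive substring match is an acceptable test (the helper `ungrounded_chunks` is hypothetical and would not catch paraphrases):

```python
def ungrounded_chunks(chunks, source_text):
    """Return the chunks that do not occur verbatim (ignoring case) in source_text."""
    haystack = " ".join(source_text.lower().split())  # normalize whitespace
    return [c for c in chunks if " ".join(c.lower().split()) not in haystack]

note = "She was on metformin, glipizide, and dapagliflozin for T2DM."
candidates = ["metformin", "Dapagliflozin", "endoscopic retrograde cholangiopancreatography"]
print(ungrounded_chunks(candidates, note))
```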

## **Detailed Entity Extraction with Structured Format**

In [None]:
def extract_medical_entities(text, pipe):
    """
    Extract medical entities (Problem, Test, Treatment) from clinical text

    Args:
        text: Clinical text to analyze
        pipe: HuggingFace pipeline for text generation

    Returns:
        Dictionary containing categorized medical entities
    """
    # More explicit prompt with specific instructions and examples
    prompt = f"""As a medical annotator, extract all the Problem, Test, and Treatment entities from this text.
Be thorough and identify ALL entity types:

- Problems: medical conditions, symptoms, diagnoses (e.g., diabetes, hypertension)
- Tests: laboratory values, diagnostic procedures (e.g., blood glucose, MRI, physical examination)
- Treatments: medications, therapies, procedures (e.g., metformin, surgery, therapy)

Text to analyze:
{text}

Respond ONLY with a single, valid JSON object in this exact format:
{{
  "entities": [
    {{"entity_type": "Problem", "chunk": "relevant text"}},
    {{"entity_type": "Test", "chunk": "relevant text"}},
    {{"entity_type": "Treatment", "chunk": "relevant text"}}
  ]
}}
"""

    # Get model response
    # Greedy decoding (do_sample=False); a temperature setting would be ignored here
    response = pipe(prompt, max_new_tokens=2048, return_full_text=False, do_sample=False)
    output_text = response[0]['generated_text']

    # Extract the first valid JSON object
    try:
        # Look for the first complete JSON object
        pattern = r'\{(?:[^{}]|(?:\{(?:[^{}]|(?:\{[^{}]*\}))*\}))*\}'
        match = re.search(pattern, output_text)

        if match:
            json_str = match.group(0)
            parsed_data = json.loads(json_str)

            # Verify we have the expected format
            if "entities" in parsed_data:
                # Check if we have all entity types
                entity_types = [e["entity_type"] for e in parsed_data["entities"]]

                if "Problem" not in entity_types:
                    print("Warning: No Problem entities identified")
                if "Test" not in entity_types:
                    print("Warning: No Test entities identified")
                if "Treatment" not in entity_types:
                    print("Warning: No Treatment entities identified")

                return parsed_data
            else:
                return {"error": "Invalid format", "entities": []}
        else:
            return {"error": "No JSON found", "raw": output_text[:200]}
    except Exception as e:
        return {"error": f"Parsing error: {str(e)}", "raw": output_text[:200]}

# Test the structured extraction function
result = extract_medical_entities(clinical_text, text_generation_pipeline)

# Pretty print the result
if "error" not in result:
    print(json.dumps(result, indent=2))
else:
    print(f"Error: {result['error']}")
    if "raw" in result:
        print(f"Raw output preview: {result['raw']}...")

{
  "entities": [
    {
      "entity_type": "Problem",
      "chunk": "gestational diabetes mellitus, type two diabetes mellitus (T2DM), HTG-induced pancreatitis, acute hepatitis, polyuria, poor appetite, vomiting"
    },
    {
      "entity_type": "Test",
      "chunk": "serum glucose, creatinine, triglycerides, total cholesterol, venous pH"
    },
    {
      "entity_type": "Treatment",
      "chunk": "metformin, glipizide, dapagliflozin, atorvastatin, gemfibrozil"
    }
  ]
}
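
The regular expression in `extract_medical_entities` only matches braces nested up to three levels deep. A possible alternative, sketched here with the standard library's `json.JSONDecoder.raw_decode`, tries to decode a JSON object starting at each `{` in the response (the helper name `first_json_object` is hypothetical):

```python
import json

def first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None

reply = 'Here you go: {"entities": [{"entity_type": "Problem", "chunk": "polyuria"}]} Done.'
print(first_json_object(reply))
```

Because the decoder stops at the end of the first complete object, trailing text after the JSON does not need to be stripped first.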


# **Information Extraction with Healthcare NLP by John Snow Labs**

## **Set up the JSL NLP pipeline for clinical entity extraction**

In [None]:

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Note: this rebinds the name `tokenizer`, which previously held the Hugging Face tokenizer
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")\
    .setLabelCasing("upper")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
nlp_model = nlpPipeline.fit(empty_data)

# Check the available entity classes
print("Available entity classes:")
print(clinical_ner.getClasses())

# Create a light model for easier usage
light_model = LightPipeline(nlp_model)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical_large download started this may take some time.
[OK!]
Available entity classes:
['O', 'B-TREATMENT', 'I-TREATMENT', 'B-PROBLEM', 'I-PROBLEM', 'B-TEST', 'I-TEST']
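
The classes printed above use the IOB scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. A short sketch of collapsing those tags into the chunk-level labels that appear in the converter output (the helper name `chunk_labels` is hypothetical):

```python
def chunk_labels(iob_classes):
    """Strip B-/I- prefixes and drop the 'O' tag, returning sorted unique labels."""
    return sorted({c.split("-", 1)[1] for c in iob_classes if c != "O"})

classes = ['O', 'B-TREATMENT', 'I-TREATMENT', 'B-PROBLEM', 'I-PROBLEM', 'B-TEST', 'I-TEST']
print(chunk_labels(classes))  # ['PROBLEM', 'TEST', 'TREATMENT']
```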


Define a clean entity extraction function

In [None]:
def extract_entities(text, light_model):
    """
    Extract entities from text using JSL models and return structured results

    Args:
        text: Clinical text to analyze
        light_model: The LightPipeline model to use for extraction

    Returns:
        List of dictionaries containing the extracted entities and metadata
    """
    results = light_model.fullAnnotate(text)
    entities = []

    for chunk in results[0]['ner_chunk']:
        entity = {
            "chunk": chunk.result,
            "begin": int(chunk.begin),
            "end": int(chunk.end),
            "ner_label": chunk.metadata['entity'],
            "confidence": chunk.metadata['confidence']
        }
        entities.append(entity)

    return entities

In [None]:
# Test the extraction function
clinical_entities = extract_entities(clinical_text, light_model)
clinical_df = pd.DataFrame(clinical_entities)

# Display the results
print("\nExtracted Clinical Entities:")
display(clinical_df)

# Count entities by type
print("\nEntity Types Distribution:")
display(clinical_df.value_counts("ner_label"))


Extracted Clinical Entities:


|   | chunk | begin | end | ner_label | confidence |
|---|-------|-------|-----|-----------|------------|
| 0 | gestational diabetes mellitus | 40 | 68 | PROBLEM | 0.91976666 |
| 1 | subsequent type two diabetes mellitus | 118 | 154 | PROBLEM | 0.75924003 |
| 2 | T2DM | 157 | 160 | PROBLEM | 0.9917 |
| 3 | HTG-induced pancreatitis | 185 | 208 | PROBLEM | 0.97535 |
| 4 | an acute hepatitis | 265 | 282 | PROBLEM | 0.9440667 |
| 5 | polyuria | 322 | 329 | PROBLEM | 0.9728 |
| 6 | poor appetite | 332 | 344 | PROBLEM | 0.9934 |
| 7 | vomiting | 351 | 358 | PROBLEM | 0.9854 |
| 8 | metformin | 372 | 380 | TREATMENT | 0.9998 |
| 9 | glipizide | 383 | 391 | TREATMENT | 0.9999 |



Entity Types Distribution:


| ner_label | count |
|-----------|-------|
| PROBLEM | 14 |
| TEST | 7 |
| TREATMENT | 6 |
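
Each entity carries a confidence score, which the pipeline returns as a string in the chunk metadata. A minimal sketch of thresholding on it, assuming a cutoff of 0.9 purely for illustration (the helper `filter_by_confidence` is hypothetical; the sample rows are taken from the extraction results above):

```python
def filter_by_confidence(entities, min_confidence=0.9):
    """Keep entities whose confidence (stored as a string) is at least min_confidence."""
    return [e for e in entities if float(e["confidence"]) >= min_confidence]

sample = [
    {"chunk": "gestational diabetes mellitus", "ner_label": "PROBLEM", "confidence": "0.91976666"},
    {"chunk": "subsequent type two diabetes mellitus", "ner_label": "PROBLEM", "confidence": "0.75924003"},
    {"chunk": "T2DM", "ner_label": "PROBLEM", "confidence": "0.9917"},
]
print([e["chunk"] for e in filter_by_confidence(sample)])
```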


## **Setting Up Posology NER Model for Medication Extraction**
Now, let's set up a specialized model for medication/posology extraction:

In [None]:
# NER model for medication/posology extraction
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

posology_ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    posology_ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")
posology_model = nlpPipeline.fit(empty_data)

# Create a light model for easier usage
light_posology_model = LightPipeline(posology_model)

# Check available entity classes
print("Available posology entity classes:")
print(posology_ner.getClasses())

# Test the posology pipeline on the same clinical text
posology_entities = extract_entities(clinical_text, light_posology_model)
posology_df = pd.DataFrame(posology_entities)

# Display the results
print("\nExtracted Medication/Posology Entities:")
display(posology_df)

ner_posology download started this may take some time.
[OK!]
Available posology entity classes:
['O', 'B-DOSAGE', 'B-STRENGTH', 'I-STRENGTH', 'B-ROUTE', 'B-FREQUENCY', 'I-FREQUENCY', 'B-DRUG', 'I-DRUG', 'B-FORM', 'I-DOSAGE', 'B-DURATION', 'I-DURATION', 'I-FORM', 'I-ROUTE']

Extracted Medication/Posology Entities:


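|   | chunk | begin | end | ner_label | confidence |
|---|-------|-------|-----|-----------|------------|
| 0 | metformin | 372 | 380 | DRUG | 1.0 |
| 1 | glipizide | 383 | 391 | DRUG | 0.9995 |
| 2 | dapagliflozin | 398 | 410 | DRUG | 0.9996 |
| 3 | atorvastatin | 425 | 436 | DRUG | 1.0 |
| 4 | gemfibrozil | 442 | 452 | DRUG | 0.9997 |
| 5 | dapagliflozin | 479 | 491 | DRUG | 0.9989 |
| 6 | for six months | 493 | 506 | DURATION | 0.7467 |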
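
The posology model returns drugs and their attributes as independent chunks, so `for six months` is not yet tied to a particular drug. One rough heuristic, sketched below, attaches each non-drug chunk to the nearest `DRUG` chunk by character offset. This is an assumption for illustration only (the helpers `gap` and `attach_to_nearest_drug` are hypothetical), and a dedicated relation extraction step would likely be more reliable on real notes.

```python
def gap(a, b):
    """Character distance between two spans; 0 if they overlap."""
    if a["end"] < b["begin"]:
        return b["begin"] - a["end"]
    if b["end"] < a["begin"]:
        return a["begin"] - b["end"]
    return 0

def attach_to_nearest_drug(entities):
    """Group non-DRUG chunks under the closest DRUG chunk."""
    drugs = [dict(e, attributes=[]) for e in entities if e["ner_label"] == "DRUG"]
    for e in entities:
        if e["ner_label"] != "DRUG" and drugs:
            nearest = min(drugs, key=lambda d: gap(d, e))
            nearest["attributes"].append((e["ner_label"], e["chunk"]))
    return drugs

# Offsets copied from the posology results above
sample = [
    {"chunk": "gemfibrozil", "begin": 442, "end": 452, "ner_label": "DRUG"},
    {"chunk": "dapagliflozin", "begin": 479, "end": 491, "ner_label": "DRUG"},
    {"chunk": "for six months", "begin": 493, "end": 506, "ner_label": "DURATION"},
]
for d in attach_to_nearest_drug(sample):
    print(d["chunk"], d["attributes"])
```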


# **Building Modern NER Tools and Agents**

In [None]:
# Import required libraries for tools and agents
from pydantic import BaseModel, Field
from typing import List, Dict, Any, Optional, Union
from langchain.tools import BaseTool, StructuredTool, tool
from langchain_core.tools import Tool
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.memory import ConversationBufferMemory
from langchain_core.messages import HumanMessage, SystemMessage

# Set up the OpenAI model for agent use
# (ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

## **Creating Clinical NER Tool**

In [None]:
@tool
def clinical_ner_tool(text: str) -> Dict[str, List[Dict[str, Any]]]:
    """
    Extract clinical named entities (Problem, Test, Treatment) from medical text.
    Useful for identifying medical conditions, diagnostic procedures, and treatments in clinical notes.

    Args:
        text: The medical text to analyze

    Returns:
        A dictionary containing extracted clinical entities
    """
    entities = extract_entities(text, light_model)
    # Format as required for the agent
    return {"entities": entities}

# Test the tool
test_text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
'''

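# Tools created with @tool are run through .invoke(); calling them directly is deprecated
clinical_result = clinical_ner_tool.invoke({"text": test_text})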
print("Clinical NER Tool Test Results:")
print(json.dumps(clinical_result, indent=2))

Clinical NER Tool Test Results:
{
  "entities": [
    {
      "chunk": "gestational diabetes mellitus",
      "begin": 40,
      "end": 68,
      "ner_label": "PROBLEM",
      "confidence": "0.91976666"
    },
    {
      "chunk": "subsequent type two diabetes mellitus",
      "begin": 118,
      "end": 154,
      "ner_label": "PROBLEM",
      "confidence": "0.75924003"
    },
    {
      "chunk": "T2DM",
      "begin": 157,
      "end": 160,
      "ner_label": "PROBLEM",
      "confidence": "0.9917"
    },
    {
      "chunk": "HTG-induced pancreatitis",
      "begin": 185,
      "end": 208,
      "ner_label": "PROBLEM",
      "confidence": "0.97535"
    },
    {
      "chunk": "an acute hepatitis",
      "begin": 265,
      "end": 282,
      "ner_label": "PROBLEM",
      "confidence": "0.9440667"
    },
    {
      "chunk": "polyuria",
      "begin": 322,
      "end": 329,
      "ner_label": "PROBLEM",
      "confidence": "0.9728"
    },
    {
      "chunk": "poor appetite",
      "b

## **Creating Posology NER Tool**

In [None]:
@tool
def medication_ner_tool(text: str) -> Dict[str, List[Dict[str, Any]]]:
    """
    Extract medication/posology related entities (drug, dosage, frequency, etc.) from medical text.
    Useful for identifying medications, dosages, frequencies, and routes in clinical notes.

    Args:
        text: The medical text to analyze

    Returns:
        A dictionary containing extracted medication entities
    """
    # Extract entities using the posology model
    entities = extract_entities(text, light_posology_model)
    return {"entities": entities}

# Test the tool
medication_text = '''
The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely for 3 months.
'''

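# Tools created with @tool are run through .invoke(); calling them directly is deprecated
medication_result = medication_ner_tool.invoke({"text": medication_text})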
print("\nMedication NER Tool Test Results:")
print(json.dumps(medication_result, indent=2))


Medication NER Tool Test Results:
{
  "entities": [
    {
      "chunk": "1",
      "begin": 28,
      "end": 28,
      "ner_label": "DOSAGE",
      "confidence": "0.9992"
    },
    {
      "chunk": "capsule",
      "begin": 30,
      "end": 36,
      "ner_label": "FORM",
      "confidence": "0.9897"
    },
    {
      "chunk": "Advil",
      "begin": 41,
      "end": 45,
      "ner_label": "DRUG",
      "confidence": "0.997"
    },
    {
      "chunk": "for 5 days",
      "begin": 47,
      "end": 56,
      "ner_label": "DURATION",
      "confidence": "0.71383333"
    },
    {
      "chunk": "40 units",
      "begin": 126,
      "end": 133,
      "ner_label": "DOSAGE",
      "confidence": "0.84085"
    },
    {
      "chunk": "insulin glargine",
      "begin": 138,
      "end": 153,
      "ner_label": "DRUG",
      "confidence": "0.87525"
    },
    {
      "chunk": "at night",
      "begin": 155,
      "end": 162,
      "ner_label": "FREQUENCY",
      "confidence": "0.76839995"
   

## **Creating a Healthcare NLP Agent**

In [None]:
# Define tools for the agent
tools = [
    Tool(
        name="clinical_entity_extraction",
        func=clinical_ner_tool,
        description="Extract Problem, Test, and Treatment entities from clinical notes. Use this when the request is about medical conditions, procedures, or clinical treatments."
    ),
    Tool(
        name="medication_entity_extraction",
        func=medication_ner_tool,
        description="Extract medication-related entities like drug names, dosages, frequencies, etc. Use this when the request is specifically about medications, drug regimens, or pharmaceutical information."
    )
]

# Define a prompt for the healthcare agent
healthcare_agent_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert healthcare AI assistant specialized in extracting and analyzing medical entities from text.

You have access to the following tools:

1. clinical_entity_extraction: Extracts clinical entities (Problem, Test, Treatment) from medical text
2. medication_entity_extraction: Extracts medication entities (drug, dosage, frequency, etc.) from medical text

When a user wants to extract clinical entities like problems, tests, or treatments, use clinical_entity_extraction.
When a user wants to extract medication information, use medication_entity_extraction.

Always think step by step about the user request:
1. Determine what entities the user is looking for
2. Choose the appropriate extraction tool
3. Present the results in a clear, readable table format

Be precise and professional in your responses.
"""),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

# Set up memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Create the agent
agent = create_openai_functions_agent(llm, tools, healthcare_agent_prompt)

# Create the agent executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True,
    handle_parsing_errors=True
)

# Test the agent with a clinical entity extraction request
clinical_query = """Extract all clinical problems from the following text:

A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting."""

print("\nTesting Healthcare Agent with Clinical Query:")
clinical_agent_result = agent_executor.invoke({"input": clinical_query})
print(clinical_agent_result['output'])


Testing Healthcare Agent with Clinical Query:


> Entering new AgentExecutor chain...

Invoking: `clinical_entity_extraction` with `A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.`


[0m[36;1m[1;3m{'entities': [{'chunk': 'gestational diabetes mellitus', 'begin': 39, 'end': 67, 'ner_label': 'PROBLEM', 'confidence': '0.91976666'}, {'chunk': 'subsequent type two diabetes mellitus', 'begin': 117, 'end': 153, 'ner_label': 'PROBLEM', 'confidence': '0.75924003'}, {'chunk': 'T2DM', 'begin': 156, 'end': 159, 'ner_label': 'PROBLEM', 'confidence': '0.9917'}, {'chunk': 'HTG-induced pancreatitis', 'begin': 184, 'end': 207, 'ner_label': 'PROBLEM', 'confidence': '0.9753

In [None]:
# Test the agent with a follow-up query
followup_query = """From the same clinical text, can you extract information about duration of the problems mentioned?"""

print("\nTesting Healthcare Agent with Follow-up Query:")
followup_result = agent_executor.invoke({"input": followup_query})
print(followup_result['output'])


Testing Healthcare Agent with Follow-up Query:


> Entering new AgentExecutor chain...
The extracted information regarding the duration of the problems mentioned in the text is as follows:

| Clinical Problem                  | Duration                          |
|-----------------------------------|-----------------------------------|
| Gestational diabetes mellitus     | Eight years prior to presentation |
| Type two diabetes mellitus (T2DM) | Not specified                     |
| HTG-induced pancreatitis          | Three years prior to presentation |
| Acute hepatitis                   | Not specified                     |
| Polyuria                          | One week                          |
| Poor appetite                     | One week                          |
| Vomiting                          | One week                          |

> Finished chain.
The extracted information regarding the duration of the problems mentio

# **Building a Multi-Purpose Healthcare Agent**

Let's take a simpler, more robust approach by implementing tools that combine extraction and filtering:

In [None]:
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_functions
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.tools import Tool
from langchain.tools.render import format_tool_to_openai_function
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

In [None]:
def extract_and_filter_clinical_entities(text: str, entity_types: Optional[str] = None) -> Dict:
    """
    Extract clinical entities from text and optionally filter by type.

    Args:
        text: The clinical text to analyze
        entity_types: Optional comma-separated string of entity types to filter by (e.g., "PROBLEM,TEST")
                     If not provided, all entities will be returned

    Returns:
        Dictionary with extracted and optionally filtered entities
    """
    try:
        # First extract all entities
        entities = extract_entities(text, light_model)

        # If entity_types is provided, filter the entities
        if entity_types:
            # Split the entity_types string by comma
            types_to_keep = [t.strip().upper() for t in entity_types.split(",")]

            # Filter entities
            filtered_entities = [
                entity for entity in entities
                if entity.get("ner_label", "").upper() in types_to_keep
            ]

            return {
                "entities": filtered_entities,
                "filtered_by": types_to_keep,
                "total_filtered": len(filtered_entities),
                "total_original": len(entities)
            }

        # Return all entities if no filtering is requested
        return {"entities": entities, "total": len(entities)}

    except Exception as e:
        return {"error": str(e)}

In [None]:
def extract_and_filter_medication_entities(text: str, entity_types: Optional[str] = None) -> Dict:
    """
    Extract medication entities from text and optionally filter by type.

    Args:
        text: The clinical text to analyze
        entity_types: Optional comma-separated string of entity types to filter by (e.g., "DRUG,DOSAGE")
                     If not provided, all entities will be returned

    Returns:
        Dictionary with extracted and optionally filtered entities
    """
    try:
        # First extract all entities
        entities = extract_entities(text, light_posology_model)

        # If entity_types is provided, filter the entities
        if entity_types:
            # Split the entity_types string by comma
            types_to_keep = [t.strip().upper() for t in entity_types.split(",")]

            # Filter entities
            filtered_entities = [
                entity for entity in entities
                if entity.get("ner_label", "").upper() in types_to_keep
            ]

            return {
                "entities": filtered_entities,
                "filtered_by": types_to_keep,
                "total_filtered": len(filtered_entities),
                "total_original": len(entities)
            }

        # Return all entities if no filtering is requested
        return {"entities": entities, "total": len(entities)}

    except Exception as e:
        return {"error": str(e)}
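
The two functions above differ only in the `LightPipeline` they call. The shared filtering step could be factored into one helper, sketched here under the assumption that entities are the dictionaries returned by `extract_entities` (the name `filter_by_types` is hypothetical):

```python
from typing import Dict, List, Optional

def filter_by_types(entities: List[Dict], entity_types: Optional[str] = None) -> Dict:
    """Filter entity dicts by a comma-separated list of labels (case-insensitive)."""
    if not entity_types:
        return {"entities": entities, "total": len(entities)}
    types_to_keep = [t.strip().upper() for t in entity_types.split(",")]
    kept = [e for e in entities if e.get("ner_label", "").upper() in types_to_keep]
    return {
        "entities": kept,
        "filtered_by": types_to_keep,
        "total_filtered": len(kept),
        "total_original": len(entities),
    }

sample = [
    {"chunk": "polyuria", "ner_label": "PROBLEM"},
    {"chunk": "serum glucose", "ner_label": "TEST"},
    {"chunk": "metformin", "ner_label": "TREATMENT"},
]
result = filter_by_types(sample, "problem, test")
print(result["filtered_by"], result["total_filtered"], result["total_original"])
```

Each extraction function would then reduce to a call to `extract_entities` with the appropriate pipeline followed by this helper.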

In [None]:
# Create tools from these functions.
# StructuredTool is used because both functions take two arguments (text, entity_types);
# the single-input Tool class can only pass one string to the wrapped function.
clinical_extraction_tool = StructuredTool.from_function(
    func=extract_and_filter_clinical_entities,
    name="extract_clinical_entities",
    description="Extract clinical entities (PROBLEM, TEST, TREATMENT) from text and optionally filter by type. Use entity_types parameter to filter (e.g., 'PROBLEM,TEST')."
)

medication_extraction_tool = StructuredTool.from_function(
    func=extract_and_filter_medication_entities,
    name="extract_medication_entities",
    description="Extract medication entities (DRUG, DOSAGE, FREQUENCY, etc.) from text and optionally filter by type. Use entity_types parameter to filter."
)

# Update our tools list
tools = [clinical_extraction_tool, medication_extraction_tool]

# Update functions for the agent
functions = [format_tool_to_openai_function(t) for t in tools]

# Update the system prompt with simpler instructions
multi_healthcare_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert healthcare AI assistant specializing in medical information extraction.

You can analyze clinical texts using these specialized tools:

1. extract_clinical_entities: Extract clinical entities like PROBLEM, TEST, and TREATMENT from text
   - You can optionally filter by entity types by adding the entity_types parameter
   - Example: extract_clinical_entities(text="patient note", entity_types="PROBLEM,TEST")

2. extract_medication_entities: Extract medication entities like DRUG, DOSAGE, and FREQUENCY
   - You can optionally filter by entity types by adding the entity_types parameter
   - Example: extract_medication_entities(text="patient note", entity_types="DRUG")

Choose the appropriate tool based on what the user is asking for. If they want clinical entities, use extract_clinical_entities. If they want medication information, use extract_medication_entities.

Present your results in a clear, organized format, grouping entities by their type.
"""),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

Recreate the agent with the updated prompt and tools:

In [None]:
def create_advanced_healthcare_agent():
    agent = (
        {
            "input": lambda x: x["input"],
            "chat_history": lambda x: x.get("chat_history", []),
            "agent_scratchpad": lambda x: format_to_openai_functions(x.get("intermediate_steps", [])),
        }
        | multi_healthcare_prompt
        | llm.bind(functions=functions)
        | OpenAIFunctionsAgentOutputParser()
    )

    return AgentExecutor(
        agent=agent,
        tools=tools,
        memory=memory,
        verbose=True,
        handle_parsing_errors=True
    )

# Recreate the agent
advanced_agent = create_advanced_healthcare_agent()

## **Testing the Multi-Purpose Healthcare Agent**

In [None]:
# Test with a comprehensive query that could use either tool
comprehensive_query = """Analyze this patient note and tell me about both clinical conditions and medications:

A 65-year-old male with a history of congestive heart failure and type 2 diabetes mellitus presented with increasing shortness of breath. He reported chest pain radiating to his left arm. The patient has been on lisinopril 10mg daily, metformin 1000mg twice daily, and was recently started on furosemide 40mg daily for peripheral edema. Laboratory tests showed elevated troponin levels and an ECG revealed ST-segment elevation in leads V1-V4."""

print("\nTesting Multi-Purpose Healthcare Agent with Comprehensive Query:")
comprehensive_result = advanced_agent.invoke({"input": comprehensive_query})
print(comprehensive_result['output'])


Testing Multi-Purpose Healthcare Agent with Comprehensive Query:


> Entering new AgentExecutor chain...
Invoking: `extract_clinical_entities` with `A 65-year-old male with a history of congestive heart failure and type 2 diabetes mellitus presented with increasing shortness of breath. He reported chest pain radiating to his left arm. The patient has been on lisinopril 10mg daily, metformin 1000mg twice daily, and was recently started on furosemide 40mg daily for peripheral edema. Laboratory tests showed elevated troponin levels and an ECG revealed ST-segment elevation in leads V1-V4.`


[0m[36;1m[1;3m{'entities': [{'chunk': 'congestive heart failure', 'begin': 37, 'end': 60, 'ner_label': 'PROBLEM', 'confidence': '0.9814334'}, {'chunk': 'type 2 diabetes mellitus', 'begin': 66, 'end': 89, 'ner_label': 'PROBLEM', 'confidence': '0.88175005'}, {'chunk': 'increasing shortness of breath', 'begin': 106, 'end': 135, 'ner_label': 'PROBLEM', 'confidence': '0.8943'}, {'c

In [None]:
# Test with a specific entity filtering request
filtered_query = """From the following text, extract only Problem and Test entities:

A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
Physical examination on presentation was significant for dry oral mucosa; significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27."""

print("\nTesting Multi-Purpose Healthcare Agent with Filtered Query:")
filtered_result = advanced_agent.invoke({"input": filtered_query})
print(filtered_result['output'])


Testing Multi-Purpose Healthcare Agent with Filtered Query:


> Entering new AgentExecutor chain...
Invoking: `extract_clinical_entities` with `A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting. Physical examination on presentation was significant for dry oral mucosa; significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.`


[0m[36;1m[1;3m{'entities': [{'chunk': 'gestational diabetes mellitus', 'begin': 39, 'end': 67, 'ner_label': 'PROBLEM', 'confiden

# **Handling Complex Entity Filtering Requirements**  
Let's upgrade our agent to handle more complex requirements like entity statistics and analysis:

In [None]:
# Function to create a summary of entity statistics
@tool
def analyze_medical_entities(entities_data: Union[str, Dict]) -> Dict:
    """
    Generate statistics and analysis about extracted medical entities.

    Args:
        entities_data: Output from extract_clinical_entities or extract_medication_entities
                      (can be a dictionary or a JSON string)

    Returns:
        Dictionary with statistics and analysis about entities
    """
    try:
        # Handle string input (convert to dict)
        if isinstance(entities_data, str):
            try:
                entities_data = json.loads(entities_data)
            except json.JSONDecodeError:
                return {"error": f"Invalid JSON string: {entities_data[:50]}..."}

        # Now we should have a dictionary
        if not isinstance(entities_data, dict):
            return {"error": f"Invalid input type: {type(entities_data)}, expected dictionary or JSON string"}

        entities = entities_data.get("entities", [])

        if not entities:
            return {"statistics": {"total": 0, "message": "No entities found"}}

        # Count by entity type
        type_counts = {}
        confidence_by_type = {}
        entities_by_type = {}

        for entity in entities:
            entity_type = entity.get("ner_label", "UNKNOWN")
            confidence = float(entity.get("confidence", 0.0))

            if entity_type not in type_counts:
                type_counts[entity_type] = 0
                confidence_by_type[entity_type] = []
                entities_by_type[entity_type] = []

            type_counts[entity_type] += 1
            confidence_by_type[entity_type].append(confidence)
            entities_by_type[entity_type].append(entity)

        # Calculate average confidence by type
        avg_confidence = {
            entity_type: sum(confidence_by_type[entity_type]) / len(confidence_by_type[entity_type])
            for entity_type in confidence_by_type
        }

        # Find highest and lowest confidence entities
        all_entities_with_confidence = [(e.get("chunk"), float(e.get("confidence", 0))) for e in entities]
        highest_confidence = max(all_entities_with_confidence, key=lambda x: x[1]) if all_entities_with_confidence else None
        lowest_confidence = min(all_entities_with_confidence, key=lambda x: x[1]) if all_entities_with_confidence else None

        # Prepare result
        result = {
            "statistics": {
                "total": len(entities),
                "by_type": type_counts,
                "avg_confidence": avg_confidence,
                "highest_confidence": {
                    "entity": highest_confidence[0],
                    "confidence": highest_confidence[1]
                } if highest_confidence else None,
                "lowest_confidence": {
                    "entity": lowest_confidence[0],
                    "confidence": lowest_confidence[1]
                } if lowest_confidence else None
            },
            "entities_by_type": entities_by_type
        }

        return result

    except Exception as e:
        return {"error": f"Error analyzing entities: {str(e)}"}
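To make the expected input shape and the resulting statistics concrete, the sketch below restates the core counting logic without any LangChain dependency and runs it on the first three entities from the agent output shown earlier in this notebook. `summarize_entities` is a hypothetical, simplified restatement of the tool above, not a replacement for it.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarize_entities(entities: List[Dict]) -> Dict:
    """Count entities per label and average their confidence scores."""
    confidences = defaultdict(list)
    for entity in entities:
        label = entity.get("ner_label", "UNKNOWN")
        # Confidence arrives as a string in the tool output, so cast it to float.
        confidences[label].append(float(entity.get("confidence", 0.0)))
    return {
        "total": len(entities),
        "by_type": {label: len(vals) for label, vals in confidences.items()},
        "avg_confidence": {label: mean(vals) for label, vals in confidences.items()},
    }


# First three entities from the extraction output shown earlier.
entities = [
    {"chunk": "congestive heart failure", "ner_label": "PROBLEM", "confidence": "0.9814334"},
    {"chunk": "type 2 diabetes mellitus", "ner_label": "PROBLEM", "confidence": "0.88175005"},
    {"chunk": "increasing shortness of breath", "ner_label": "PROBLEM", "confidence": "0.8943"},
]
summary = summarize_entities(entities)
print(summary["total"], summary["by_type"])
print(round(summary["avg_confidence"]["PROBLEM"], 4))
```

Note that the confidence values are strings in the extraction output, which is why both this sketch and the tool above cast them with `float` before averaging.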

In [None]:
# Create a tool to highlight potentially critical medical findings
@tool
def highlight_critical_findings(clinical_entities: Union[str, Dict]) -> Dict:
    """
    Analyze clinical entities and highlight potentially critical medical findings.

    Args:
        clinical_entities: Output from extract_clinical_entities
                         (can be a dictionary or a JSON string)

    Returns:
        Dictionary with highlighted critical findings
    """
    try:
        # Handle string input (convert to dict)
        if isinstance(clinical_entities, str):
            try:
                clinical_entities = json.loads(clinical_entities)
            except json.JSONDecodeError:
                return {"error": f"Invalid JSON string: {clinical_entities[:50]}..."}

        # Now we should have a dictionary
        if not isinstance(clinical_entities, dict):
            return {"error": f"Invalid input type: {type(clinical_entities)}, expected dictionary or JSON string"}

        entities = clinical_entities.get("entities", [])

        # Define critical keywords
        critical_problems = [
            "infarction", "failure", "arrhythmia", "stroke", "hemorrhage", "sepsis",
            "shock", "emboli", "aneurysm", "ischemia", "elevation", "troponin", "stenosis"
        ]

        # Find potentially critical entities
        critical_findings = []
        for entity in entities:
            chunk = entity.get("chunk", "").lower()
            label = entity.get("ner_label", "")

            # Check for critical problems
            if label == "PROBLEM" and any(keyword in chunk for keyword in critical_problems):
                critical_findings.append({
                    "entity": entity.get("chunk"),
                    "type": label,
                    "confidence": entity.get("confidence"),
                    "reason": "Contains critical keyword"
                })

            # Check for any extremely high confidence problems
            if label == "PROBLEM" and float(entity.get("confidence", 0)) > 0.98:
                if not any(finding["entity"] == entity.get("chunk") for finding in critical_findings):
                    critical_findings.append({
                        "entity": entity.get("chunk"),
                        "type": label,
                        "confidence": entity.get("confidence"),
                        "reason": "High confidence problem"
                    })

        return {
            "critical_findings": critical_findings,
            "total_critical": len(critical_findings)
        }

    except Exception as e:
        return {"error": f"Error highlighting critical findings: {str(e)}"}
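The two flagging rules above are easier to reason about on a few concrete inputs. The sketch below isolates them in a hypothetical helper, `flag_reason`, using a subset of the keyword list and the same 0.98 threshold. It is a simplified per-entity restatement: unlike the tool above, it does not deduplicate repeated chunks across a list. The last two examples are hand-written, not model output.

```python
from typing import Dict, Optional

CRITICAL_KEYWORDS = ["infarction", "failure", "stroke", "sepsis", "stenosis"]  # subset of the list above
HIGH_CONFIDENCE = 0.98


def flag_reason(entity: Dict) -> Optional[str]:
    """Return why a PROBLEM entity would be flagged, or None if it would not be."""
    if entity.get("ner_label") != "PROBLEM":
        return None  # only PROBLEM entities are ever considered
    chunk = entity.get("chunk", "").lower()
    if any(keyword in chunk for keyword in CRITICAL_KEYWORDS):
        return "Contains critical keyword"
    if float(entity.get("confidence", 0)) > HIGH_CONFIDENCE:
        return "High confidence problem"
    return None


examples = [
    {"chunk": "congestive heart failure", "ner_label": "PROBLEM", "confidence": "0.9814334"},
    {"chunk": "type 2 diabetes mellitus", "ner_label": "PROBLEM", "confidence": "0.88175005"},
    # Hand-written cases, not model output:
    {"chunk": "seasonal allergies", "ner_label": "PROBLEM", "confidence": "0.99"},
    {"chunk": "troponin", "ner_label": "TEST", "confidence": "0.99"},
]
reasons = {e["chunk"]: flag_reason(e) for e in examples}
for chunk, reason in reasons.items():
    print(f"{chunk!r}: {reason}")
```

Two properties of the rules are worth keeping in mind when reading the agent's output. The confidence rule flags any confidently recognized problem regardless of clinical severity, and keyword matching is a plain substring test on PROBLEM entities only, so a keyword appearing in a TEST entity is not flagged.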

Update our tools list with the new analysis tools:

In [None]:

enhanced_tools = [
    clinical_extraction_tool,
    medication_extraction_tool,
    Tool(
        name="analyze_medical_entities",
        func=analyze_medical_entities,
        description="Generate statistics and analysis about extracted medical entities. Works with the direct output from entity extraction."
    ),
    Tool(
        name="highlight_critical_findings",
        func=highlight_critical_findings,
        description="Analyze clinical entities and highlight potentially critical medical findings. Works with the direct output from entity extraction."
    )
]


# Update functions for the agent
enhanced_functions = [format_tool_to_openai_function(t) for t in enhanced_tools]

# Create an enhanced system prompt that describes all four tools
# and the extract-then-analyze workflow the agent should follow
enhanced_healthcare_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert healthcare AI assistant specializing in medical information extraction and analysis.

You can use these specialized healthcare tools:

1. extract_clinical_entities: Extract clinical entities (PROBLEM, TEST, TREATMENT)
   - Can filter: extract_clinical_entities(text="note", entity_types="PROBLEM,TEST")

2. extract_medication_entities: Extract medication entities (DRUG, DOSAGE, etc.)
   - Can filter: extract_medication_entities(text="note", entity_types="DRUG,FREQUENCY")

3. analyze_medical_entities: Generate statistics about extracted entities
   - Will automatically handle the output from extraction tools
   - Just pass the entire result: analyze_medical_entities(entities_data=result)

4. highlight_critical_findings: Identify potentially critical medical issues
   - Will automatically handle the output from extraction tools
   - Just pass the entire result: highlight_critical_findings(clinical_entities=result)

Follow this workflow for advanced analysis:
1. First extract entities using the appropriate extraction tool
2. Then analyze the results using analyze_medical_entities or highlight_critical_findings
3. Present results in a clear, organized clinical format

When analyzing patient notes, look for patterns and relationships between clinical problems and medications. Highlight critical findings that may require immediate attention.
"""),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])
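The workflow in this prompt relies on the model passing an extraction result into the analysis tools, usually as a string. One practical wrinkle is that the verbose output in this notebook prints the result as a Python dict repr with single quotes, which `json.loads` rejects. The sketch below shows one way input handling could be made more tolerant, assuming a hypothetical helper named `parse_entities_payload`. It is not part of the notebook's tools.

```python
import ast
import json
from typing import Dict, Union


def parse_entities_payload(payload: Union[str, Dict]) -> Dict:
    """Accept a dict, a JSON string, or a Python-repr string of a dict."""
    if isinstance(payload, dict):
        return payload
    try:
        parsed = json.loads(payload)
    except json.JSONDecodeError:
        # Fall back to a safe literal parse for single-quoted dict reprs.
        try:
            parsed = ast.literal_eval(payload)
        except (ValueError, SyntaxError):
            return {"error": "Payload is neither valid JSON nor a dict literal"}
    if not isinstance(parsed, dict):
        return {"error": f"Expected a dictionary, got {type(parsed).__name__}"}
    return parsed


as_json = '{"entities": [{"chunk": "hypertension", "ner_label": "PROBLEM"}]}'
as_repr = "{'entities': [{'chunk': 'hypertension', 'ner_label': 'PROBLEM'}]}"
print(parse_entities_payload(as_json) == parse_entities_payload(as_repr))
```

`ast.literal_eval` only evaluates Python literals, so it is a much safer fallback than `eval` for text produced by a model.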

In [None]:
# Create an enhanced healthcare agent
def create_enhanced_healthcare_agent():
    agent = (
        {
            "input": lambda x: x["input"],
            "chat_history": lambda x: x.get("chat_history", []),
            "agent_scratchpad": lambda x: format_to_openai_functions(x.get("intermediate_steps", [])),
        }
        | enhanced_healthcare_prompt
        | llm.bind(functions=enhanced_functions)
        | OpenAIFunctionsAgentOutputParser()
    )

    return AgentExecutor(
        agent=agent,
        tools=enhanced_tools,
        memory=memory,
        verbose=True,
        handle_parsing_errors=True
    )

# Create the enhanced agent
enhanced_agent = create_enhanced_healthcare_agent()

## **Testing the Enhanced Healthcare Agent**

In [None]:
# Test with a complex analysis request
analysis_query = """Analyze this clinical note, extract all medical entities, then provide statistics and highlight any critical findings:

A 58-year-old female with hypertension, hyperlipidemia, and type 2 diabetes presented with sudden onset chest pain and shortness of breath. ECG showed ST elevation in leads II, III, and aVF. Cardiac enzymes were elevated with troponin I of 2.3 ng/mL. The patient was treated with aspirin 325mg, clopidogrel 600mg loading dose, and started on heparin drip. She underwent emergent cardiac catheterization which revealed 90% stenosis in the right coronary artery. A drug-eluting stent was placed. Post-procedure, she was prescribed aspirin 81mg daily, clopidogrel 75mg daily, atorvastatin 40mg at night, metoprolol 25mg twice daily, and was instructed to continue her home medications including metformin 1000mg twice daily and lisinopril 10mg daily."""

print("\nTesting Enhanced Healthcare Agent with Complex Analysis Query:")
analysis_result = enhanced_agent.invoke({"input": analysis_query})
print(analysis_result['output'])


Testing Enhanced Healthcare Agent with Complex Analysis Query:


> Entering new AgentExecutor chain...
Invoking: `extract_clinical_entities` with `A 58-year-old female with hypertension, hyperlipidemia, and type 2 diabetes presented with sudden onset chest pain and shortness of breath. ECG showed ST elevation in leads II, III, and aVF. Cardiac enzymes were elevated with troponin I of 2.3 ng/mL. The patient was treated with aspirin 325mg, clopidogrel 600mg loading dose, and started on heparin drip. She underwent emergent cardiac catheterization which revealed 90% stenosis in the right coronary artery. A drug-eluting stent was placed. Post-procedure, she was prescribed aspirin 81mg daily, clopidogrel 75mg daily, atorvastatin 40mg at night, metoprolol 25mg twice daily, and was instructed to continue her home medications including metformin 1000mg twice daily and lisinopril 10mg daily.`


[0m[36;1m[1;3m{'entities': [{'chunk': 'hypertension', 'begin': 26, 'end': 37

# **Conclusion**

In this notebook, we've successfully integrated John Snow Labs' specialized healthcare NLP models with modern LLM frameworks to create a comprehensive medical information extraction system.   
We demonstrated how to extract clinical entities (problems, tests, treatments) and medication-specific information (drugs, dosages, frequencies) from medical texts, first using direct LLM prompting and then using John Snow Labs' purpose-built medical NER models.   
By leveraging an agent framework, we transformed these extraction capabilities into interactive tools that can intelligently process, filter, and analyze medical text. The resulting system showcases advanced features including entity filtering by type, statistical analysis of extracted information, and identification of potentially critical medical findings, all presented in a clinically relevant format.  
This approach demonstrates how domain-specific NLP models can be enhanced with modern agent architectures to create powerful, practical tools for healthcare information processing that could assist medical professionals in quickly analyzing patient records and clinical documentation.  
