# Multi-Agent Clinical Metadata Extraction

This project provides a multi-agent system for extracting and structuring clinical metadata from a set of text chunks, such as those taken from medical case studies.

The system is composed of two main components:

## 1. RetrieverAgent

The `RetrieverAgent` is responsible for iteratively querying a vector database to retrieve the relevant text chunks for a given metadata category. It takes the following inputs:

- A language model (LLM) for generating the queries
- A `RetrieverManager`, which contains tools for querying the vector database and ranking the retrieved chunks
- A `ChunkManager`, which contains tools for managing the storage of retrieved text chunks

The `RetrieverAgent` follows an iterative process to extract the relevant text chunks:

1. The agent uses the initial query and the `RetrieverManager` to retrieve and rank candidate chunks.
2. The agent checks whether the retrieved chunks are relevant and, if so, stores them with the `ChunkManager` for later use by the `MetadataExtractionAgent`.
3. If the retrieved chunks are not relevant, the agent refines the query and retrieves again.
4. The process continues until the agent is confident that it has extracted all relevant chunks, or until several consecutive refined queries return no new chunks.
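
The loop described above can be pictured in plain Python. This is only a minimal sketch, not the project's implementation: `search`, `is_relevant`, and `refine` are hypothetical stand-ins for the vector database query, the LLM's relevance judgment, and the LLM's query rewriting, and the toy corpus is invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Chunk = Tuple[str, str]  # (chunk_id, text)


def retrieve_until_stable(
    query: str,
    search: Callable[[str], List[Chunk]],
    is_relevant: Callable[[str], bool],
    refine: Callable[[str, int], str],
    patience: int = 2,
    max_rounds: int = 10,
) -> Dict[str, str]:
    """Save relevant chunks, refining the query until no new ones appear."""
    saved: Dict[str, str] = {}
    stale_rounds = 0
    for round_no in range(max_rounds):
        found_new = False
        for chunk_id, text in search(query):
            if chunk_id not in saved and is_relevant(text):
                saved[chunk_id] = text
                found_new = True
        # Stop once `patience` consecutive rounds add nothing new.
        stale_rounds = 0 if found_new else stale_rounds + 1
        if stale_rounds >= patience:
            break
        query = refine(query, round_no)
    return saved


# Toy stand-ins for the vector store and the LLM (illustration only).
CORPUS = {
    "c1": "A CT scan was performed.",
    "c2": "The patient likes gardening.",
    "c3": "A colonoscopy was performed.",
}
FOLLOW_UPS = ["colonoscopy", "biopsy"]


def toy_search(query: str) -> List[Chunk]:
    # Return every chunk that shares at least one word with the query.
    words = set(query.lower().split())
    return [(cid, text) for cid, text in CORPUS.items()
            if words & set(text.lower().rstrip(".").split())]


def toy_is_relevant(text: str) -> bool:
    return any(term in text.lower() for term in ("scan", "colonoscopy", "biopsy"))


def toy_refine(query: str, round_no: int) -> str:
    return FOLLOW_UPS[round_no % len(FOLLOW_UPS)]


result = retrieve_until_stable("ct scan", toy_search, toy_is_relevant, toy_refine)
print(sorted(result))
```

In the real agent these decisions are made by the LLM inside the ReAct loop, so the stopping rule is a judgment call by the model rather than a fixed counter like `patience`.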

## 2. MetadataExtractionAgent

The `MetadataExtractionAgent` is responsible for the overall process of extracting and structuring the clinical metadata. It takes the following inputs:

- A language model (LLM) for generating the metadata
- A `ChunkManager`, which contains tools for managing the storage of retrieved text chunks
- A `MetadataManager`, which contains tools for managing the sentence snippets extracted from each chunk
- An `OutputValidator`, a tool for validating the final structured metadata output

The `MetadataExtractionAgent` follows a multi-step process to produce the final metadata:

1. It builds a detailed prompt that instructs the LLM to analyze the text chunks and extract relevant information for the specified metadata categories.
2. The agent analyzes each chunk and extracts the sentence snippets that are relevant to each metadata category.
3. It uses the `MetadataManager` tools to store all extracted sentence snippets, and later generates the final structured output from them.
4. The agent validates the generated output using the `OutputValidator` tool. If errors or missing fields are detected, the agent retries (the prompt allows up to 3 attempts) until the output is valid.
5. The final, validated metadata is returned as a structured dictionary.
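
The validate-and-retry step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's `OutputValidator`: `validate` and `generate_with_retries` are hypothetical helpers, the validation rule (every required category present, each value a list) is a simplification, and `toy_generate` stands in for the LLM.

```python
from typing import Callable, Dict, List, Tuple

Metadata = Dict[str, List[str]]


def validate(metadata: Metadata, required: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the output is valid."""
    problems = []
    for category in required:
        if category not in metadata:
            problems.append(f"missing field: {category}")
        elif not isinstance(metadata[category], list):
            problems.append(f"{category} must be a list")
    return problems


def generate_with_retries(
    generate: Callable[[List[str]], Metadata],
    required: List[str],
    max_attempts: int = 3,
) -> Tuple[Metadata, int]:
    """Regenerate until validation passes, feeding the problems back each time."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = validate(candidate, required)
        if not feedback:
            return candidate, attempt
    raise ValueError(f"no valid output after {max_attempts} attempts: {feedback}")


def toy_generate(feedback: List[str]) -> Metadata:
    # Stand-in for the LLM: the first draft omits "Gender" and the
    # second draft adds it once the validator has complained about it.
    draft = {"Age": ["placeholder age snippet"]}
    if any("Gender" in problem for problem in feedback):
        draft["Gender"] = ["placeholder gender snippet"]
    return draft


metadata, attempts = generate_with_retries(toy_generate, ["Age", "Gender"])
print(attempts, sorted(metadata))
```

Passing the validator's messages back into the next generation call is the design choice that matters here: without that feedback a retry would simply repeat the same mistake.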

## How the System Works

The complete system works as follows:

1. The `RetrieverAgent` is used to extract relevant text chunks for a subset of the metadata categories.
2. The `MetadataExtractionAgent` is then used to generate the final structured metadata, leveraging the text chunks retrieved by the `RetrieverAgent`.
3. The `OutputValidator` is used to ensure the generated metadata conforms to the expected data model.

This two-step approach separates retrieval from generation: the `RetrieverAgent` handles chunk retrieval and the `MetadataExtractionAgent` handles the final metadata generation, so each agent works with a focused prompt and a small set of tools.


## Imports

In [18]:
import os
import json

from typing import Dict
from dotenv import load_dotenv

from llama_index.llms.openai import OpenAI
from llama_index.core.tools import FunctionTool
from llama_index.core.agent import ReActAgent 

from clinical_ie.tools import ChunkManager, RetrieverManager, MetadataManager, OutputValidator, ClinicalMetadata
from clinical_ie.simple_rag_pipeline.rag_utils import get_retriever, get_embedding_model, load_reranker_model

In [2]:
load_dotenv()

base_path = "clinical_ie/MACCR/"

pdf_fn = "Tsuchiya et al. - 2017 - A case of concomitant colitic cancer and intrahepa.pdf"
pdf_file = os.path.join(base_path, pdf_fn)

top_k = 3  # top-k used for reranking
node_parsing_method = "semantic"

OPEN_AI_MODEL_NAME = "gpt-4o-2024-08-06"  # alternative: "gpt-4o-mini-2024-07-18"
KEY = os.environ.get("OPENAI_API_KEY")


## Model Setup

In [3]:
llm = OpenAI(model=OPEN_AI_MODEL_NAME, api_key=KEY)

embed_model = get_embedding_model()

retriever = get_retriever(pdf_file, embed_model, node_parsing_method, top_k=10)

reranker_model = load_reranker_model()

Found 6 total number of pages from the document clinical_ie/MACCR/Tsuchiya et al. - 2017 - A case of concomitant colitic cancer and intrahepa.pdf, after cleaning 5 left
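
The setup above asks the retriever for `top_k=10` candidates, while `top_k = 3` is reserved for ranking. Assuming this means a retrieve-then-rerank flow, the idea can be sketched with toy scoring functions. The names `retrieve_then_rerank`, `overlap`, and `density` are hypothetical; the real pipeline uses an embedding model and a reranker model rather than word counts.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    documents: List[str],
    first_stage_score: Scorer,
    rerank_score: Scorer,
    fetch_k: int = 10,
    top_k: int = 3,
) -> List[str]:
    """Fetch `fetch_k` candidates with a cheap score, keep the `top_k` best by a second score."""
    candidates = sorted(
        documents, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:fetch_k]
    return sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )[:top_k]


def overlap(query: str, doc: str) -> float:
    # Toy first stage: number of words shared by query and document.
    return float(len(set(query.split()) & set(doc.split())))


def density(query: str, doc: str) -> float:
    # Toy reranker: shared words relative to document length.
    return overlap(query, doc) / len(doc.split())


docs = [
    "ct scan of the liver",
    "liver biopsy",
    "patient history of smoking",
    "ct scan",
]
top = retrieve_then_rerank("ct scan liver", docs, overlap, density, fetch_k=3, top_k=2)
print(top)
```

The first stage is tuned for recall and the second for precision, which is why the candidate pool is larger than the number of chunks finally kept.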


## Retriever Agent: Extract All Relevant Chunks

In [4]:
class RetrieverAgent(ReActAgent):
    def __init__(self, retriever_manager: RetrieverManager, llm, chunk_manager: ChunkManager):
        self.llm = llm
        self.chunk_manager = chunk_manager
        self.retriever_manager = retriever_manager
        
        # Expose chunk storage and retrieval as tools the ReAct loop can call.
        tools = [
            FunctionTool.from_defaults(fn=self.chunk_manager.save_chunks),
            FunctionTool.from_defaults(fn=self.chunk_manager.get_chunks),
            FunctionTool.from_defaults(fn=self.retriever_manager.retrieve_chunks),
        ]
        super().__init__(tools=tools, llm=self.llm, memory=None, max_iterations=20, verbose=True)

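    def extract_metadata_chunks(self, metadata_category: str, initial_query: str) -> None:
        """Run the retrieval loop for one category; relevant chunks are saved through the ChunkManager tools."""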
        prompt = f"""You are an advanced clinical information retrieval system with expertise in analyzing medical case studies. Your task is to iteratively generate queries to extract relevant chunks of information for specific metadata categories from a clinical case study. You will later use these chunks to compile comprehensive metadata for each category.

        Metadata Category:
        {metadata_category}

        Initial Query:
        {initial_query}

        Instructions:
        1. Start with the initial query for the metadata category {metadata_category}. Use the retrieve_chunks tool to search for relevant chunks.

        2. For each retrieved chunk:
        a. Analyze its relevance to the current metadata category.
        b. If relevant, use the save_chunks tool to store it.
        c. If not relevant, briefly explain why and use this information to refine your next query.

        3. Iterative Query Refinement Process:
        a. If initial queries don't yield sufficient relevant information:
            - Use synonyms or related medical terms.
            - Consider indirect references or contextual clues.
            - Broaden the search to include related concepts.
        b. If queries yield too much irrelevant information:
            - Make your queries more specific.
        c. If you find partial information:
            - Generate follow-up queries to find missing details.

        4. Consider these aspects when searching:
        - Direct mentions of the category or related terms.
        - Contextual information that implies relevant data.

        5. Continue this process until:
        a. You are confident you have extracted all relevant information.
        b. Multiple consecutive refined queries yield no new unique chunks.

        Remember:
        - Accuracy and comprehensiveness are crucial.
        """
        _ = self.chat(prompt)


## Metadata Extraction Agent: Extract Important Metadata from Retrieved Chunks

In [5]:
class MetadataExtractionAgent(ReActAgent):
    def __init__(self, llm, chunk_manager: ChunkManager, metadata_manager: MetadataManager, output_validator: OutputValidator):
        self.llm = llm
        self.chunk_manager = chunk_manager
        self.metadata_manager = metadata_manager
        self.output_validator = output_validator
        # Tools for reading chunks, storing snippets, and validating the final output.
        tools = [
            FunctionTool.from_defaults(fn=self.chunk_manager.get_chunk),
            FunctionTool.from_defaults(fn=self.metadata_manager.add_metadatas),
            FunctionTool.from_defaults(fn=self.metadata_manager.get_all_metadata),
            FunctionTool.from_defaults(fn=self.output_validator.validate_generated_metadata),
        ]
        super().__init__(tools=tools, llm=self.llm, memory=None, max_iterations=25, verbose=True)

    def extract_metadata(self, categories: list, chunk_ids: str) -> str:
        """Extract metadata for `categories` from the listed chunks and return the agent's final response text."""
        prompt = f"""You are an expert clinical information specialist with extensive experience in analyzing medical case studies and extracting precise metadata.
        
        Your task is to analyze clinical case study chunks and extract sentence snippets relevant to {categories}. Follow these steps:
        
        1. Use the `get_chunk` tool to get relevant chunks using the chunk id.

        2. Analyze each chunk sequentially, focusing on extractive rather than generative approaches:
        - Extract sentence snippets or pieces of text that fit into the following clinical metadata categories: {categories}.
        - Use the sentence saving tool `add_metadatas` to store relevant snippets for all categories.

        3. After analyzing all chunks:
        - Get all sentence snippets saved in the previous step using the tool `get_all_metadata`.
        - Combine sentence snippets where that makes sense, and refine the others with minimal paraphrasing if they don't make sense on their own.
        - Discard any information that doesn't make sense in the context of the category.

        4. Final output should be in json format containing category as a key and list of metadata as a value.
        Use the `validate_generated_metadata` tool to parse and validate your final output.

        5. If the parser indicates errors or missing fields:
        - Review the extracted information.
        - Make necessary adjustments.
        - Try the parser again.
        - Repeat this process up to 3 times if needed.

        ## Important Notes:

        - Prioritize extraction over generation. Only paraphrase when necessary for clarity.
        - Ensure all extracted information is relevant to its assigned category.
        - Be thorough in your analysis, but avoid redundancy in the final output.
        - The final output must conform to the structure checked by the `validate_generated_metadata` tool.

        Remember, your goal is to produce accurate, structured clinical metadata from the given case study chunks.


        Chunk ids to analyze:
        {chunk_ids}
        """
        
        response = self.chat(prompt)
        return response.response
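
The agent above relies on `add_metadatas` and `get_all_metadata` to accumulate snippets between steps. The real `MetadataManager` lives in `clinical_ie.tools` and is not shown here, so the following is only a hypothetical sketch of what such a store could look like: the class name `SnippetStore` and the method signatures are assumptions, and only the two method names come from the notebook.

```python
from typing import Dict, List


class SnippetStore:
    """Hypothetical per-category snippet store (not the clinical_ie implementation)."""

    def __init__(self, categories: List[str]) -> None:
        self._snippets: Dict[str, List[str]] = {c: [] for c in categories}

    def add_metadatas(self, category: str, snippets: List[str]) -> str:
        """Append new snippets to a category, skipping blanks and duplicates."""
        if category not in self._snippets:
            # Return a message instead of raising so an agent could read it
            # as a tool observation and correct itself.
            return f"Unknown category: {category}"
        for snippet in snippets:
            snippet = snippet.strip()
            if snippet and snippet not in self._snippets[category]:
                self._snippets[category].append(snippet)
        return f"Saved snippets for {category}."

    def get_all_metadata(self) -> Dict[str, List[str]]:
        """Return a copy so callers cannot mutate the stored lists."""
        return {c: list(v) for c, v in self._snippets.items()}


store = SnippetStore(["Age", "Gender"])
first = store.add_metadatas("Age", ["  snippet A ", "snippet A"])
unknown = store.add_metadatas("Weight", ["snippet B"])
print(store.get_all_metadata())
```

Initializing every category with an empty list means the final output always has a key per category, which makes missing-field checks straightforward.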

In [6]:
# Descriptions of the metadata categories; each one doubles as the initial retrieval query
 
metadata_categories = {
    'Life Style': 'Describe the patient\'s lifestyle, including smoking, alcohol consumption, diet, and exercise habits.',
    'Family History': 'Provide the patient\'s family medical history, including any hereditary conditions or diseases.',
    'Social History': 'Detail the patient\'s social history, including occupation, living situation, and social support systems.',
    'Medical/Surgical History': 'Outline the patient\'s past medical and surgical history, including previous diagnoses, treatments, and surgeries.',
    'Signs and Symptoms': 'List the signs and symptoms the patient presented with.',
    'Comorbidities': 'List any comorbid conditions the patient has.',
    'Diagnostic Techniques and Procedures': 'Describe the diagnostic techniques and procedures used to evaluate the patient.',
    'Diagnosis': 'What was the final diagnosis given to the patient?',
    'Laboratory Values': 'List the laboratory values obtained from tests conducted on the patient.',
    'Pathology': 'Provide details about any pathology findings from the patient\'s case.',
    'Pharmacological Therapy': 'Detail the prescribed pharmacological therapy, including medications and dosages.',
    'Interventional Therapy': 'Describe any interventional therapies administered to the patient.',
    'Patient Outcome Assessment': 'Describe the outcomes following the patient\'s treatment.',
    'Age': 'What is the patient\'s age?',
    'Gender': 'What is the patient\'s gender?'
}


categories_subset = ["Diagnostic Techniques and Procedures", "Medical/Surgical History"]
chunk_manager = ChunkManager()
retriever_manager = RetrieverManager(retriever, reranker_model, top_k=top_k)

# Use a fresh agent per category; all agents share one ChunkManager.
for category in categories_subset:
    query_agent = RetrieverAgent(retriever_manager, llm, chunk_manager)
    query_agent.extract_metadata_chunks(category, metadata_categories[category])

# Persist the retrieved chunks so extraction can run in a later session.
with open('clinical_ie/chunks.json', 'w') as f:
    json.dump(chunk_manager.chunks, f)


> Running step af244fed-8745-47b2-870d-a8054b67297c. Step input: You are an advanced clinical information retrieval system with expertise in analyzing medical case studies. Your task is to iteratively generate queries to extract relevant chunks of information for specific metadata categories from a clinical case study. You will later use these chunks to compile comprehensive metadata for each category.

        Metadata Category:
        Diagnostic Techniques and Procedures

        Initial Query:
        Describe the diagnostic techniques and procedures used to evaluate the patient.

        Instructions:
        1. Start with a initial query for the metadata category Diagnostic Techniques and Procedures. Use the vector_db_query tool to search for relevant chunks.

        2. Retrieved chunk:
        a. Analyze its relevance to the current metadata category.
        b. If relevant, use the save_chunks tool to store it.
        c. If not relevant, briefly explain why and use this infor

In [13]:
# Reload the chunks saved by the retrieval step.
with open('clinical_ie/chunks.json', 'r') as f:
    chunks_dict = json.load(f)


chunk_manager = ChunkManager()
chunk_manager.save_chunks(chunks_dict=list(chunks_dict.items()))

'All chunks saved.'
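
The save-and-reload round trip above works because the chunk store is a plain mapping that serializes to JSON. The sketch below mimics the behavior observed in this notebook (a `chunks` dict, `save_chunks` taking `(id, text)` pairs and returning `'All chunks saved.'`). The class `InMemoryChunkStore` is a hypothetical stand-in, and everything beyond those observed details is an assumption about `ChunkManager`.

```python
import json
from typing import Dict, List, Tuple


class InMemoryChunkStore:
    """Hypothetical stand-in for ChunkManager (not the clinical_ie implementation)."""

    def __init__(self) -> None:
        self.chunks: Dict[str, str] = {}

    def save_chunks(self, chunks_dict: List[Tuple[str, str]]) -> str:
        """Store (chunk_id, text) pairs; a repeated id overwrites the earlier text."""
        for chunk_id, text in chunks_dict:
            self.chunks[chunk_id] = text
        return "All chunks saved."

    def get_chunk(self, chunk_id: str) -> str:
        """Return one chunk, or a readable message if the id is unknown."""
        return self.chunks.get(chunk_id, f"No chunk found for id {chunk_id}")

    def get_chunks(self) -> Dict[str, str]:
        """Return a copy of every stored chunk."""
        return dict(self.chunks)


store = InMemoryChunkStore()
status = store.save_chunks([("id-1", "first chunk text"), ("id-2", "second chunk text")])

# Round trip through JSON, as the notebook does with chunks.json.
restored = InMemoryChunkStore()
restored.save_chunks(list(json.loads(json.dumps(store.chunks)).items()))
print(status, restored.get_chunk("id-2"))
```

Keying chunks by id makes saving idempotent, so the agent can save the same chunk in several iterations without creating duplicates.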

In [17]:
all_categories = [
    'Life Style', 'Family History', 'Social History', 'Medical/Surgical History', 'Signs and Symptoms',
    'Comorbidities', 'Diagnostic Techniques and Procedures', 'Diagnosis', 'Laboratory Values', 'Pathology',
    'Pharmacological Therapy', 'Interventional Therapy', 'Patient Outcome Assessment', 'Age', 'Gender',
]

output_validator = OutputValidator(ClinicalMetadata)
metadata_manager = MetadataManager(categories=all_categories)

if chunk_manager.chunks:
    extraction_agent = MetadataExtractionAgent(llm, chunk_manager, metadata_manager, output_validator)
    # One "id: <chunk id>" line per stored chunk, passed to the agent inside the prompt.
    chunk_ids = "\n".join(f"id: {k}" for k in chunk_manager.chunks)
    metadata = extraction_agent.extract_metadata(all_categories, chunk_ids)
else:
    print('No relevant chunks found')

> Running step f10026b6-fe17-42e9-aeea-23a6ebfedb87. Step input: You are an expert clinical information specialist with extensive experience in analyzing medical case studies and extracting precise metadata.
        
        Your task is to analyze clinical case study chunks and extract sentence snippets relevant to ['Life Style', 'Family History', 'Social History', 'Medical/Surgical History', 'Signs and Symptoms', 'Comorbidities', 'Diagnostic Techniques and Procedures', 'Diagnosis', 'Laboratory Values', 'Pathology', 'Pharmacological Therapy', 'Interventional Therapy', 'Patient Outcome Assessment', 'Age', 'Gender']. Follow these steps:
        
        1. Use the `get_chunk` tool to get relevant chunks using the chunk id.

        2. Analyze each chunk sequentially, focusing on extractive rather than generative approaches:
        - Extract sentence snippet or piece of text that fit into the following clinical metadata categories: ['Life Style', 'Family History', 'Social History', 'Med

SyntaxError: invalid syntax (<string>, line 1)