## **Library and Module Imports**  
  
The following code block imports all the essential libraries and frameworks required for the Clinical Intelligence System. These libraries support environment configuration, AI model integration, vector storage, data loading, evaluation, and performance metrics.  
  
### **Key Imports and Their Purpose**  
  
1. **Core AI and Environment Setup**  
   - `import openai` – Provides access to OpenAI's API for natural language processing and model integration.  
   - `from dotenv import load_dotenv` – Loads environment variables from a `.env` file to securely store API keys and configuration values.  
  
2. **Vector Storage and Embeddings**  
   - `from langchain.vectorstores import Chroma` – Manages vector databases for semantic search and retrieval.  
   - `from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI` – Integrates Azure-hosted OpenAI models for embeddings and chat-based language models.  
  
3. **Document Handling**  
   - `from langchain.schema import Document` – Defines structured document objects for processing.  
   - `from langchain_community.document_loaders.csv_loader import CSVLoader` – Loads CSV files into a document structure for further processing.  
  
4. **Utilities and Data Structures**  
   - `from typing import List, Tuple, Dict` – Provides type hints for function parameters and return values.  
   - `import numpy as np` – Supports numerical operations, arrays, and mathematical computations.  
   - `import time` – Enables time tracking and performance measurement.  
  
5. **Retry Mechanisms**  
   - `from tenacity import retry, stop_after_attempt, wait_random_exponential` – Implements robust retry logic to handle API timeouts and transient failures.  
  
6. **Evaluation Framework**  
   - `from deepeval.models.base_model import DeepEvalBaseLLM` – Base class for evaluating large language models.  
   - `from deepeval.test_case import LLMTestCase, LLMTestCaseParams` – Defines structured test cases for model evaluation.  
   - `from deepeval import evaluate as deepeval_evaluate` – Runs evaluation processes for AI model outputs.  
  
7. **Evaluation Metrics**  
   - `from deepeval.metrics import (ContextualPrecisionMetric, ContextualRecallMetric, ContextualRelevancyMetric, AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric)`    
     - **ContextualPrecisionMetric** – Measures whether relevant retrieved chunks are ranked above irrelevant ones.
     - **ContextualRecallMetric** – Measures how much of the information needed for the expected answer is present in the retrieved context.
     - **ContextualRelevancyMetric** – Evaluates how much of the retrieved context is relevant to the query.
     - **AnswerRelevancyMetric** – Assesses how relevant the generated answer is to the question.
     - **FaithfulnessMetric** – Checks that the claims in the answer are supported by the retrieved context.
     - **HallucinationMetric** – Detects statements in the response that contradict or are unsupported by the supplied context.
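
As a rough illustration of a rank-aware retrieval metric such as contextual precision, the sketch below scores a list of per-chunk relevance flags so that relevant chunks near the top of the ranking count for more. In DeepEval the relevance verdicts come from an LLM judge; here they are hard-coded booleans, and the function name `weighted_cumulative_precision` is hypothetical. Whether this matches DeepEval's exact computation is an assumption.

```python
from typing import List


def weighted_cumulative_precision(relevance: List[bool]) -> float:
    """Rank-aware precision over retrieved chunks, given in retrieval order.

    For every relevant chunk at rank k, add (relevant chunks seen so far / k),
    then average over the number of relevant chunks.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # nothing relevant was retrieved

    score = 0.0
    relevant_so_far = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / rank
    return score / total_relevant


# Hypothetical verdicts: the same three chunks in two different orders.
good_order = weighted_cumulative_precision([True, True, False])
bad_order = weighted_cumulative_precision([False, True, True])
print(f"relevant chunks first: {good_order:.3f}")
print(f"irrelevant chunk first: {bad_order:.3f}")
```

The first ordering scores higher because both relevant chunks appear before the irrelevant one.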
  
---  
  
**Summary:**    
This set of imports lays the groundwork for:  
- **NLP processing** (OpenAI, Azure OpenAI, LangChain)  
- **Data management** (Chroma DB, CSV loaders, Document schema)  
- **Robustness** (retry mechanisms)  
- **Evaluation** (DeepEval metrics and test cases)    
These tools collectively enable the system to process clinical queries, retrieve relevant data, generate responses, and evaluate their quality.  
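
The retry behaviour that `tenacity` provides can be pictured with a small standard-library sketch: retry a call a fixed number of times, waiting a random interval drawn from a window that doubles after each failure. This is not tenacity's implementation; `call_with_backoff` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with randomized exponential waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The wait window doubles each attempt and is capped at max_delay.
            window = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, window))
    raise RuntimeError("unreachable")


# Simulated flaky call: fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


waits: List[float] = []
# Record the waits instead of sleeping so the demo finishes instantly.
result = call_with_backoff(flaky, max_attempts=5, sleep=waits.append)
print(result, calls["count"], len(waits))
```

Randomizing the wait spreads out retries from concurrent callers, which helps when the failure is caused by rate limiting.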

In [None]:
import openai
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain.schema import Document
from langchain_community.document_loaders.csv_loader import CSVLoader

import os  # used later for environment variables and file-path checks
import time
from typing import List, Tuple, Dict

import numpy as np

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)

from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval import evaluate as deepeval_evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)

### Create Model Client and Set Up Authentication

The following code authenticates against the UAIS environment and connects to the Azure OpenAI service. It requests an OAuth2 access token with the client-credentials grant, loads the endpoint and deployment settings from an environment file, and creates two clients: one for chat completions and one for embeddings. The embeddings client generates vector representations of input text for downstream tasks such as semantic search and similarity comparison.

| Requirement           | Description                                                        |  
|-----------------------|--------------------------------------------------------------------|  
| Large Language Models (LLM) | OpenAI LLM API (`gpt-4o-mini_2024-07-18`)                           |  
| Embedding Models      | Preferred embedding model is `text-embedding-3-small_1`            |  
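
The next cell reads its connection settings from environment variables. A minimal sketch of validating them up front is shown here, so that a missing value produces one clear error listing every gap rather than a `KeyError` partway through setup. The variable names come from that cell; the helper `read_required_settings` and the placeholder values are hypothetical.

```python
import os
from typing import Dict, Iterable, Mapping

REQUIRED_VARS = (
    "MODEL_ENDPOINT",
    "API_VERSION",
    "CHAT_MODEL_NAME",
    "PROJECT_ID",
    "EMBEDDINGS_MODEL_NAME",
)


def read_required_settings(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Return the requested settings, or raise one error naming all missing ones."""
    names = list(names)
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}


# Demonstration with a deliberately incomplete, made-up environment.
example_env = {
    "MODEL_ENDPOINT": "https://example.invalid",
    "API_VERSION": "placeholder-version",
}
message = ""
try:
    read_required_settings(REQUIRED_VARS, env=example_env)
except RuntimeError as err:
    message = str(err)
print(message)
```

In the notebook, the same helper could be called with no `env` argument after `load_dotenv(...)` has run.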

In [None]:
# Authentication: request an OAuth2 access token via the client-credentials grant.
import httpx

auth = "https://api.com/oauth2/token"
# `dbutils` is provided by the Databricks notebook runtime.
client_id = dbutils.secrets.get(scope="AIML", key="client_id")
client_secret = dbutils.secrets.get(scope="AIML", key="client_secret")
scope = "https://api.com/.default"
grant_type = "client_credentials"

# Top-level `async with` / `await` relies on the notebook's running event loop.
async with httpx.AsyncClient() as client:
    body = {
        "grant_type": grant_type,
        "scope": scope,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    resp = await client.post(auth, headers=headers, data=body, timeout=120)
    # Fail early on a non-2xx response instead of hitting a KeyError below.
    resp.raise_for_status()
    token = resp.json()["access_token"]


load_dotenv("./Data/vars.env")

AZURE_OPENAI_ENDPOINT = os.environ["MODEL_ENDPOINT"]
OPENAI_API_VERSION = os.environ["API_VERSION"]
CHAT_DEPLOYMENT_NAME = os.environ["CHAT_MODEL_NAME"]
PROJECT_ID = os.environ["PROJECT_ID"]
EMBEDDINGS_DEPLOYMENT_NAME = os.environ["EMBEDDINGS_MODEL_NAME"]

chat_client = openai.AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=OPENAI_API_VERSION,
    azure_deployment=CHAT_DEPLOYMENT_NAME,
    azure_ad_token=token,
    default_headers={
        "projectId": PROJECT_ID
    }
)

embeddings_client = openai.AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=OPENAI_API_VERSION,
    azure_deployment=EMBEDDINGS_DEPLOYMENT_NAME,
    azure_ad_token=token,
    default_headers={ 
        "projectId": PROJECT_ID
    }
)

## Azure OpenAI Model & Embeddings Setup  
  
This section initializes the Azure OpenAI resources required for the RAG pipeline:  
  
- **`AzureChatOpenAI`** – Configures a chat-based LLM endpoint using Azure OpenAI, enabling conversational interactions and contextual responses.  
- **`AzureOpenAIEmbeddings`** – Sets up an embedding model to convert text into high-dimensional vectors for semantic search and retrieval.  
- Both components share:  
  - The same **API version** and **Azure endpoint**.  
  - **Azure AD token authentication** for secure access.  
  - Custom **`projectId`** in request headers for project-level tracking.  
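
To make the role of the embedding vectors concrete, the sketch below ranks two documents against a query by cosine similarity. The three-dimensional vectors are hand-made stand-ins, not real model output, and cosine similarity is only one common choice; the distance measure actually used depends on how the vector store is configured.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (assumed values for illustration).
query_vector: List[float] = [1.0, 0.0, 1.0]
document_vectors: Dict[str, List[float]] = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Most similar document first.
ranked = sorted(
    document_vectors,
    key=lambda name: cosine_similarity(query_vector, document_vectors[name]),
    reverse=True,
)
print(ranked)
```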

In [None]:
chat_model = AzureChatOpenAI(
    openai_api_version=OPENAI_API_VERSION,
    azure_deployment=CHAT_DEPLOYMENT_NAME,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_ad_token=token,
    default_headers={"projectId": PROJECT_ID},
)

embeddings = AzureOpenAIEmbeddings(
    azure_deployment=EMBEDDINGS_DEPLOYMENT_NAME,
    api_version=OPENAI_API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_ad_token=token,
    default_headers={
        "projectId": PROJECT_ID
    }
)

### Tiktoken Cache Configuration
 
> This code sets a custom cache directory for Tiktoken through the `TIKTOKEN_CACHE_DIR` environment variable.  
> Tiktoken stores its downloaded encoding files in that directory, so later tokenization or embedding runs can load them from disk instead of fetching them again.  
> The same cell also sets `ANONYMIZED_TELEMETRY` to `False` to turn off ChromaDB's anonymized telemetry.
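
A small variation on this setup is to create the cache directory before pointing the environment variable at it, so that a permissions problem surfaces immediately. The helper name `configure_tiktoken_cache` is hypothetical, and the demonstration uses a throwaway temporary directory rather than the project path.

```python
import os
import tempfile


def configure_tiktoken_cache(cache_dir: str) -> str:
    """Create cache_dir if needed and point TIKTOKEN_CACHE_DIR at it."""
    resolved = os.path.abspath(cache_dir)
    os.makedirs(resolved, exist_ok=True)
    os.environ["TIKTOKEN_CACHE_DIR"] = resolved
    return resolved


# Demonstration in a scratch directory; the previous value is restored afterwards.
previous = os.environ.get("TIKTOKEN_CACHE_DIR")
with tempfile.TemporaryDirectory() as scratch:
    cache_path = configure_tiktoken_cache(os.path.join(scratch, "tiktoken_cache"))
    created = os.path.isdir(cache_path)
    env_matches = os.environ["TIKTOKEN_CACHE_DIR"] == cache_path
if previous is None:
    os.environ.pop("TIKTOKEN_CACHE_DIR", None)
else:
    os.environ["TIKTOKEN_CACHE_DIR"] = previous
print(created, env_matches)
```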

In [None]:
tiktoken_cache_dir = os.path.abspath("./.setup/tiktoken_cache/")
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir

# Telemetry has to be disabled to use ChromaDB.
# See https://docs.trychroma.com/docs/overview/telemetry for more information.
os.environ["ANONYMIZED_TELEMETRY"] = "False"

## DocumentProcessor Class  
  
The `DocumentProcessor` class leverages **LangChain's CSVLoader** to efficiently ingest CSV datasets and convert them into LangChain `Document` objects. This design ensures:  
  
- **Built-in CSV parsing** with automatic conversion to Document objects.    
- **Native LangChain document structure** for smooth integration into RAG pipelines.    
- **Standardized metadata extraction** adhering to LangChain conventions.    
- **Direct compatibility** with LangChain’s ecosystem of loaders, retrievers, and vector stores.  
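
To show roughly what the loader produces, the sketch below maps CSV rows to document-like objects using only the standard library. It approximates the behaviour of `CSVLoader` with `content_columns`, `metadata_columns`, and `source_column`; the exact output format of the real loader is an assumption here, and `SketchDocument`, `rows_to_documents`, and the sample row are hypothetical stand-ins.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class SketchDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def rows_to_documents(
    csv_text: str,
    content_columns: Sequence[str],
    metadata_columns: Sequence[str],
    source_column: str,
) -> List[SketchDocument]:
    """Turn each CSV row into one document with 'column: value' content."""
    documents = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content = "\n".join(f"{col}: {row[col]}" for col in content_columns)
        metadata: Dict[str, object] = {"source": row[source_column], "row": index}
        metadata.update({col: row[col] for col in metadata_columns})
        documents.append(SketchDocument(content, metadata))
    return documents


# Made-up sample row using the column names from this notebook's dataset.
sample_csv = (
    "document_id,document_url,context\n"
    "1,https://example.invalid/doc1,Placeholder text for chunk one.\n"
)
docs = rows_to_documents(
    sample_csv,
    content_columns=["context"],
    metadata_columns=["document_id", "document_url"],
    source_column="document_url",
)
print(docs[0].page_content)
print(docs[0].metadata)
```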


In [None]:
class DocumentProcessor:
    """
    Streamlined Document Processing Engine for RAG pipeline using LangChain CSVLoader.
    """
    
    def __init__(self):
        """
        Initialize the DocumentProcessor.
        """
        self.langchain_docs = []  # List of LangChain Document objects for RAG
        
    def load_csv_with_langchain(self, csv_path: str) -> List[Document]:
        """
        Load CSV data using LangChain CSVLoader.
        
        Args:
            csv_path (str): Path to the CSV file to load
            
        Returns:
            List[Document]: List of LangChain Document objects
        """
        # Configure CSVLoader with our dataset structure
        loader = CSVLoader(
            file_path=csv_path,
            source_column="document_url",  # Use document_url as source
            metadata_columns=["document_id", "document_url"],  # Include these in metadata
            content_columns=["context"]  # Use context as main content
        )
        
        # Load documents using LangChain
        documents = loader.load()
        
        print(f"Successfully loaded {len(documents)} document chunks from CSV using LangChain CSVLoader.")
        return documents
        
    def load_dataset(self, dataset_path: str) -> List[Document]:
        """
        Load documents from CSV using LangChain CSVLoader and return Documents ready for vector storage.
        
        Args:
            dataset_path (str): Path to the CSV dataset file
            
        Returns:
            List[Document]: List of LangChain Document objects
            
        Raises:
            FileNotFoundError: If the dataset file is not found
            Exception: If there's an error loading the CSV file
        """
        try:
            # Check if file exists
            if not os.path.exists(dataset_path):
                raise FileNotFoundError(f"Dataset file not found: {dataset_path}")
            
            print(f"📁 Loading dataset from: {dataset_path}")
            
            # Use LangChain CSVLoader 
            self.langchain_docs = self.load_csv_with_langchain(dataset_path)
            
            print("✅ Dataset loaded successfully using LangChain CSVLoader")
            print(f"📚 Loaded {len(self.langchain_docs)} LangChain Document objects")
            
            # Display first document for verification
            if self.langchain_docs:
                print("\n🔍 Sample LangChain Document:")
                sample_doc = self.langchain_docs[0]
                print(f"   Content: {sample_doc.page_content[:100]}...")
                print(f"   Metadata: {sample_doc.metadata}")
            
            return self.langchain_docs
            
        except FileNotFoundError as e:
            print(f"❌ File not found: {e}")
            raise
        except Exception as e:
            print(f"❌ Error loading dataset with CSVLoader: {e}")
            raise
    
    def get_langchain_documents(self) -> List[Document]:
        """
        Get the loaded LangChain Documents ready for vector storage.
        
        Returns:
            List[Document]: List of LangChain Document objects
        """
        return self.langchain_docs
    
    def get_document_count(self) -> int:
        """Get the number of loaded documents."""
        return len(self.langchain_docs)
    
    def display_dataset_info(self) -> None:
        """Display essential information about the loaded dataset for RAG."""
        if not self.langchain_docs:
            print("⚠️ No documents loaded. Call load_dataset() first.")
            return
        
        print("\n📊 === Dataset Information for RAG Pipeline ===")
        print(f"📚 Total documents: {len(self.langchain_docs)}")
        
        # The early return above guarantees at least one document here.
        # Show sample LangChain document structure
        print("\n📄 === Sample LangChain Document ===")
        sample_doc = self.langchain_docs[0]
        print(f"page_content: {sample_doc.page_content[:200]}...")
        print(f"metadata: {sample_doc.metadata}")
        
        # Show content length statistics
        content_lengths = [len(doc.page_content) for doc in self.langchain_docs]
        print("\n📊 === Content Statistics ===")
        print(f"Average content length: {sum(content_lengths) / len(content_lengths):.0f} characters")
        print(f"Shortest document: {min(content_lengths)} characters")
        print(f"Longest document: {max(content_lengths)} characters")

print("✅ LangChain CSVLoader-based DocumentProcessor defined successfully!")