# Procedure Comparison Chatbot - Overview

This Jupyter Notebook implements a bilingual procedure comparison chatbot that processes PDF documents in English and French, extracts procedures, and enables querying and comparison through a chatbot interface. The project follows the **CRISP-DM methodology**, a structured and iterative framework for data mining, ensuring clear objectives, systematic development, and thorough evaluation. The notebook is organized into six CRISP-DM phases: **Business Understanding**, **Data Understanding**, **Data Preparation**, **Modeling**, **Evaluation**, and **Deployment**.

## Project Goals

The primary goal is to create a robust chatbot capable of:
- **Processing Bilingual PDFs**: Extract procedures from PDF documents in English and French, handling both text-based and scanned documents.
- **Query Handling**: Support user queries to list, describe, execute, or compare procedures with accurate intent recognition.
- **Bilingual Support**: Provide seamless responses in English and French, respecting user language preferences.
- **Scalable Deployment**: Deliver the chatbot through a web interface that is user-friendly and production-ready.
- **Performance Evaluation**: Assess the system’s accuracy in intent classification, response quality, and response time to ensure reliability.

The system aims to streamline access to procedural information, making it easier for users to understand and compare processes across languages.

## CRISP-DM Methodology and Process

The project leverages the **CRISP-DM methodology**, which provides a structured approach to data mining and ensures iterative improvement. The process is divided into the following phases:
- **Business Understanding**: Define objectives, configure logging, set up data models, and establish bilingual prompt templates for consistent responses.
- **Data Understanding**: Verify and explore PDF files in English and French directories to understand the data structure and availability.
- **Data Preparation**: Process PDFs to extract text and procedures, optimize text chunking parameters, and store data for efficient retrieval.
- **Modeling**: Implement a query handler that uses intent classification, document retrieval, and large language model (LLM) integration to generate responses.
- **Evaluation**: Assess chatbot performance with precision, recall, and F1-score for intent classification, cosine similarity for response quality, and measured response times.
- **Deployment**: Launch the chatbot as a web application using Flask and Waitress, ensuring scalability and user accessibility.

Each phase builds on the previous one, with iterative refinements to improve performance and usability.

## Choice of Methods

The project employs a combination of methods to achieve its goals:
- **PDF Processing**: Uses `PyPDFLoader` for extracting text from text-based PDFs and `pdf2image` with `pytesseract` for OCR on scanned documents, ensuring robust handling of diverse PDF formats.
- **Language Detection**: Utilizes `fasttext` for accurate detection of English and French in documents and queries, enabling bilingual support.
- **Intent Classification**: Employs `DistilBERT`, a lightweight transformer model, for efficient and accurate classification of user intents (e.g., list, question, detail, execute, compare).
- **Text Chunking and Retrieval**: Combines `RecursiveCharacterTextSplitter` for splitting documents into manageable chunks and `EnsembleRetriever` (combining BM25 and Chroma) for hybrid search, balancing keyword-based and semantic retrieval.
- **Parameter Optimization**: Applies `optuna` to optimize chunking and retrieval parameters, ensuring efficient document processing and query performance.
- **LLM Integration**: Integrates `HuggingFaceEndpoint` with Mixtral-8x7B for generating natural and contextually relevant responses.
- **Web Interface**: Uses Flask for a lightweight web framework and Waitress for production-grade serving, providing a scalable and user-friendly interface.

These methods were chosen for their efficiency, scalability, and ability to handle bilingual procedural data effectively.
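
To make the hybrid search idea concrete, the following is a minimal standalone sketch of weighted reciprocal rank fusion, one common way to merge a keyword ranking with a semantic ranking. It is not LangChain's code: the function name, the document ids, and the constant `k=60` are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each list contributes weight / (k + rank) to a document's score,
    so documents ranked highly by several retrievers rise to the top.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id to keep the output deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword search and a vector search
keyword_hits = ["proc_a", "proc_b", "proc_c"]
semantic_hits = ["proc_b", "proc_d", "proc_a"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits], weights=[0.5, 0.5])
print(fused)
```

Documents returned by both retrievers (`proc_a`, `proc_b`) outrank those returned by only one, which is the balance between keyword and semantic evidence that the hybrid setup aims for.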

## Choice of Packages

The following packages were selected to support the project’s technical requirements:
- **langchain**: Facilitates document loading (`PyPDFLoader`), text splitting (`RecursiveCharacterTextSplitter`), and retrieval (`EnsembleRetriever`), streamlining data processing and search.
- **fasttext**: Provides fast and accurate language detection for bilingual support.
- **transformers**: Powers intent classification (`DistilBERT`) with pretrained NLP models. LLM integration goes through `HuggingFaceEndpoint`, which is provided by LangChain's Hugging Face integration rather than by `transformers` itself.
- **pdf2image** and **pytesseract**: Enable OCR for scanned PDFs, ensuring text extraction from non-text-based documents.
- **spacy**: Supports NLP tasks like tokenization and entity recognition for intent classification and query expansion in both English and French.
- **optuna**: Offers hyperparameter optimization for chunking and retrieval, improving system performance.
- **sentence-transformers**: Produces the sentence embeddings whose cosine similarity is used to evaluate response quality, checking that responses align with expected outputs.
- **flask** and **waitress**: Provide a lightweight web framework and production-ready server for deploying the chatbot interface.
- **pydantic**: Ensures structured data handling with robust models for procedures and responses, enhancing code reliability.

These packages were chosen for their compatibility, performance, and community support, ensuring a robust and maintainable system.

# Phase 1: Business Understanding - Configuration and Utilities

## CRISP-DM Phase Overview
The **Business Understanding** phase in CRISP-DM sets the project's goals and foundation. For the bilingual procedure comparison chatbot, this phase defines the need to process English and French PDFs, handle user queries, and ensure a scalable, reliable system with performance evaluation.

## Code Purpose
This code cell sets up the core infrastructure for the chatbot by initializing logging, defining file paths, creating data models, and providing utility functions for language detection and evaluation.

### What the Code Does
- **Logging**: Configures a log file (`procedure_comparison.log`) to track system events and errors.
- **Paths**: Defines directories for English/French PDFs, test data, and evaluation results, plus paths for tools like Tesseract and Poppler.
- **Data Models**: Uses Pydantic to create `Procedure` (for extracted PDF procedures) and `ChatbotResponse` (for query responses) models, ensuring structured data.
- **Utility Functions**:
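  - `detect_language_from_prompt`: Looks for an explicit language instruction in the query (for example "answer in french" or "répondez en anglais") and returns `'fr'` or `'en'` accordingly. It does not analyze the language of the text itself and defaults to English when no instruction is found.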
  - `evaluate_intent_classification`: Calculates precision, recall, and F1-score for intent classification accuracy.
  - `evaluate_response_quality`: Measures response quality using cosine similarity with `SentenceTransformer`.
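
As a rough illustration of what the two evaluation helpers measure, here is a dependency-free sketch of cosine similarity and of a support-weighted F1-score. The function names and the toy labels are assumptions for this example; the notebook itself relies on `scikit-learn` and `SentenceTransformer` for the real computation.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_f1(true_labels, predicted_labels):
    """Per-class F1 averaged with weights equal to each class's share of the true labels."""
    support = Counter(true_labels)
    total = len(true_labels)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(1 for t, p in pairs if t == p == label)
        fp = sum(1 for t, p in pairs if p == label and t != label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / total
    return score


# Toy intent labels, invented for illustration
true_intents = ["list", "list", "compare", "detail"]
predicted_intents = ["list", "compare", "compare", "detail"]
print(round(weighted_f1(true_intents, predicted_intents), 4))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Weighting by support means frequent intents dominate the score, so a rare intent that is always misclassified can be hidden by good results on common ones.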

### Why It Matters
This cell establishes the groundwork for a reliable, traceable system that supports bilingual processing and performance monitoring, enabling the chatbot to handle PDFs and queries effectively.

In [None]:
import logging
import os
import re
from pathlib import Path
from pydantic import BaseModel
from typing import List, Optional, Dict, Any
from sklearn.metrics import precision_score, recall_score, f1_score
from statistics import mean
import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize logging
logging.basicConfig(
    level=logging.INFO,
    filename='procedure_comparison.log',
    filemode='a',
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Define paths
BASE_DIR = Path('data').absolute()
ENGLISH_DIR = BASE_DIR / 'english'
FRENCH_DIR = BASE_DIR / 'french'
TEST_DATA_DIR = BASE_DIR / 'test_data'
EVAL_RESULTS_DIR = BASE_DIR / 'evaluation_results'
# Machine-specific locations of external tools; update these for your environment
POPPLER_PATH = r"C:\Users\khalf\Downloads\Release-24.08.0-0\poppler-24.08.0\Library\bin"
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
TESSDATA_PREFIX = r"C:\Program Files\Tesseract-OCR\tessdata"
FASTTEXT_MODEL_PATH = r"path/to/fasttext/model"  # Update with actual path
os.makedirs(EVAL_RESULTS_DIR, exist_ok=True)

# Pydantic models
class Procedure(BaseModel):
    id: int
    title: str
    section: str
    steps: List[str]
    responsible: str
    source: str
    filename: str
    language: str

class ChatbotResponse(BaseModel):
    query: str
    intent: str
    response: str
    language: str
    procedure_name: Optional[str] = None
    error: Optional[str] = None
    analytics: Optional[Dict[str, Any]] = None

# Utility functions
def detect_language_from_prompt(text: str) -> str:
    try:
        if not isinstance(text, str) or not text.strip():
            logger.warning("Invalid or empty text for language detection. Defaulting to 'en'.")
            return 'en'
        text_lower = text.lower().strip()
        if 'answer in french' in text_lower or 'répondez en français' in text_lower:
            return 'fr'
        if 'answer in english' in text_lower or 'répondez en anglais' in text_lower:
            return 'en'
        return 'en'  # Default to English if no instruction found
    except Exception as e:
        logger.warning(f"Language detection failed: {str(e)}. Defaulting to 'en'.")
        return 'en'

def evaluate_intent_classification(true_intents: List[str], predicted_intents: List[str]) -> Dict[str, float]:
    try:
        precision = precision_score(true_intents, predicted_intents, average='weighted', zero_division=0)
        recall = recall_score(true_intents, predicted_intents, average='weighted', zero_division=0)
        f1 = f1_score(true_intents, predicted_intents, average='weighted', zero_division=0)
        return {'precision': precision, 'recall': recall, 'f1_score': f1}
    except Exception as e:
        logger.error(f"Intent classification evaluation error: {str(e)}")
        return {'precision': 0.0, 'recall': 0.0, 'f1_score': 0.0}

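def evaluate_response_quality(responses: List[str], expected_responses: List[str]) -> Dict[str, float]:
    try:
        # zip() would silently drop unmatched items, so reject mismatched inputs up front
        if len(responses) != len(expected_responses):
            raise ValueError("responses and expected_responses must have the same length")
        if not responses:
            return {'avg_cosine_similarity': 0.0, 'std_cosine_similarity': 0.0}
        model = SentenceTransformer('all-MiniLM-L6-v2')
        response_embeddings = model.encode(responses)
        expected_embeddings = model.encode(expected_responses)
        similarities = []
        for r, e in zip(response_embeddings, expected_embeddings):
            norm_product = np.linalg.norm(r) * np.linalg.norm(e)
            # Guard against zero-length vectors to avoid division by zero
            similarities.append(float(np.dot(r, e) / norm_product) if norm_product else 0.0)
        # Cast to built-in floats so the result matches Dict[str, float] and serializes to JSON
        return {
            'avg_cosine_similarity': float(np.mean(similarities)),
            'std_cosine_similarity': float(np.std(similarities))
        }
    except Exception as e:
        logger.error(f"Response quality evaluation error: {str(e)}")
        return {'avg_cosine_similarity': 0.0, 'std_cosine_similarity': 0.0}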

logger.info("Business Understanding: Initialized logging, paths, models, and utilities.")
print("Configuration and utilities initialized successfully.")

# Phase 1: Business Understanding - Bilingual Prompt Templates

## CRISP-DM Phase Overview
The **Business Understanding** phase in CRISP-DM clarifies the project's goals, ensuring the bilingual procedure comparison chatbot delivers consistent, accurate, and user-tailored responses in English and French. This phase defines how the chatbot interacts with users by creating a single, flexible prompt template that dynamically handles any user intent and sentiment without relying on predefined intent lists, mappings, or keyword matching.

## Code Purpose
This code cell defines bilingual (English and French) prompt templates using LangChain’s `PromptTemplate` and `FewShotPromptTemplate`. A single, general-purpose template supports dynamic intent and sentiment classification, processing queries based on inferred intents (via zero-shot classification) and sentiment (via transformer-based analysis). The template ensures seamless bilingual interaction, tailoring response tone to user sentiment (e.g., empathetic for negative, enthusiastic for positive).

### What the Code Does
- **Imports**: Uses `langchain.prompts` for `PromptTemplate` and `FewShotPromptTemplate` to create structured prompts.
- **Example Prompt**: Defines a reusable `EXAMPLE_PROMPT` template for formatting query-response pairs.
- **Bilingual Templates**: Creates a dictionary (`PROMPT_TEMPLATES`) with a single `general` template for each language (English and French), accessed as `PROMPT_TEMPLATES[lang]['general']['general']`:
  - **General Template**: Handles all intents and sentiments dynamically by instructing the LLM to respond according to the inferred `intent` and `sentiment` passed in, using the supplied `context` and `procedure_name` for procedure-specific information and `chat_history` for conversational context.
  - **Structure**: Includes:
    - **Examples**: Sample query-response pairs for various intents (e.g., listing, questioning, comparing) and sentiments (e.g., positive, negative, neutral) to guide the LLM.
    - **Prefix**: Provides instructions for the chatbot, context, chat history, inferred intent, and sentiment-driven tone.
    - **Suffix**: Standardizes the response start (“Answer: ” in English, “Réponse : ” in French).
    - **Input Variables**: Supports dynamic inputs like `context`, `chat_history`, `question`, `procedure_name`, `intent`, and `sentiment`.
- **Logging**: Logs template initialization and prints a confirmation message.

### Why It Matters
The single, flexible prompt template ensures consistent, language-appropriate, and sentiment-tailored responses for any user intent, eliminating the need for predefined intent or sentiment categories. This supports dynamic intent and sentiment classification using zero-shot and transformer-based methods, enabling the chatbot to handle diverse and novel queries with an adaptive tone while processing procedural data from PDFs.
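
The prefix, examples, and suffix described above end up as one block of text sent to the LLM. The sketch below imitates that assembly with plain string formatting so the final layout is easy to see. It is a simplified stand-in, not LangChain's implementation: `build_few_shot_prompt` is a hypothetical helper, and the blank-line separator between parts is an assumption.

```python
def build_few_shot_prompt(prefix, examples, suffix, separator="\n\n", **variables):
    """Join a formatted prefix, rendered examples, and a suffix into one prompt string."""
    example_template = "Query: {query}\nResponse: {response}"
    rendered_examples = [example_template.format(**ex) for ex in examples]
    # Only the prefix carries runtime variables in this simplified version
    parts = [prefix.format(**variables), *rendered_examples, suffix]
    return separator.join(parts)


prompt = build_few_shot_prompt(
    prefix="Intent: {intent}. Sentiment: {sentiment}.\nQuestion: {question}",
    examples=[
        {"query": "List all procedures", "response": "Here are the procedures: ..."},
    ],
    suffix="Answer: ",
    intent="list",
    sentiment="neutral",
    question="Which procedures exist?",
)
print(prompt)
```

Because the prompt ends with the suffix, the model's completion begins exactly where the answer should start, which keeps response formatting consistent across languages.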

In [None]:
from langchain.prompts import PromptTemplate, FewShotPromptTemplate

# Bilingual prompt templates
EXAMPLE_PROMPT = PromptTemplate(
    input_variables=["query", "response"],
    template="Query: {query}\nResponse: {response}"
)

PROMPT_TEMPLATES = {
    'en': {
        'general': {
            'general': FewShotPromptTemplate(
                examples=[
                    {"query": "List all procedures", "response": "I'm excited to help! Here are the procedures: 1. Sample Procedure\n2. Test Procedure"},
                    {"query": "What is the purpose of Sample Procedure?", "response": "Great question! The purpose of Sample Procedure is to demonstrate the chatbot."},
                    {"query": "Details of Sample Procedure", "response": "Happy to provide details! **Procedure**: Sample Procedure\n**Purpose**: Demonstrate chatbot\n**Steps**:\n1. Start process\n2. Complete task\n**Responsible**:\n- Not specified"},
                    {"query": "I'm frustrated, why is Sample Procedure so unclear?", "response": "I'm sorry to hear you're frustrated. Let me clarify: Sample Procedure aims to demonstrate the chatbot with steps like starting the process and completing the task. Can I assist further?"},
                    {"query": "Compare Sample Procedure in English and French", "response": "I'm thrilled to compare for you! | Aspect | English | French |\n|--------|---------|--------|\n| Purpose | Demonstrate chatbot | Démontrer le chatbot |\n| Steps | 2 | 2 |"},
                    {"query": "Tell me about the system", "response": "Glad you're curious! The system is a bilingual chatbot designed to process and compare procedural documents in English and French."}
                ],
                example_prompt=EXAMPLE_PROMPT,
                prefix="""You are an advanced bilingual chatbot. Process the query in English based on the inferred intent: '{intent}' and sentiment: '{sentiment}'.
Provide a clear, accurate, and concise response using the provided context and chat history.
Adjust the tone based on sentiment: empathetic and supportive for negative, enthusiastic for positive, neutral for neutral.
If no procedure is found, respond: "No procedure found."
Context: {context}
Chat History: {chat_history}
Question: {question}
Procedure Name (if applicable): {procedure_name}
""",
                suffix="Answer: ",
                input_variables=["context", "chat_history", "question", "intent", "sentiment", "procedure_name"]
            )
        }
    },
    'fr': {
        'general': {
            'general': FewShotPromptTemplate(
                examples=[
                    {"query": "Liste toutes les procédures", "response": "Je suis ravi de vous aider ! Voici les procédures : 1. Procédure Exemple\n2. Procédure Test"},
                    {"query": "Quel est le but de la Procédure Exemple ?", "response": "Excellente question ! Le but de la Procédure Exemple est de démontrer le chatbot."},
                    {"query": "Détails de la Procédure Exemple", "response": "Heureux de fournir des détails ! **Procédure** : Procédure Exemple\n**Objectif** : Démontrer le chatbot\n**Étapes** :\n1. Démarrer le processus\n2. Compléter la tâche\n**Responsable** :\n- Non spécifié"},
                    {"query": "Je suis agacé, pourquoi la Procédure Exemple est-elle si vague ?", "response": "Je suis désolé que vous soyez agacé. Permettez-moi de clarifier : la Procédure Exemple vise à démontrer le chatbot avec des étapes comme démarrer le processus et compléter la tâche. Puis-je vous aider davantage ?"},
                    {"query": "Comparez la Procédure Exemple en anglais et français", "response": "Ravi de comparer pour vous ! | Aspect | Anglais | Français |\n|--------|---------|----------|\n| Objectif | Démontrer le chatbot | Démontrer le chatbot |\n| Étapes | 2 | 2 |"},
                    {"query": "Parlez-moi du système", "response": "Content que vous soyez curieux ! Le système est un chatbot bilingue conçu pour traiter et comparer des documents procéduraux en anglais et français."}
                ],
                example_prompt=EXAMPLE_PROMPT,
                prefix="""Vous êtes un chatbot bilingue avancé. Traitez la requête en français en fonction de l'intention déduite : '{intent}' et du sentiment : '{sentiment}'.
Fournissez une réponse claire, précise et concise en utilisant le contexte et l'historique de conversation fournis.
Adaptez le ton selon le sentiment : empathique et soutenant pour négatif, enthousiaste pour positif, neutre pour neutre.
Si aucune procédure n'est trouvée, répondez : "Aucune procédure trouvée."
Contexte : {context}
Historique : {chat_history}
Question : {question}
Nom de la procédure (si applicable) : {procedure_name}
""",
                suffix="Réponse : ",
                input_variables=["context", "chat_history", "question", "intent", "sentiment", "procedure_name"]
            )
        }
    }
}

logger.info("Business Understanding: Defined bilingual prompt templates with sentiment support.")
print("Bilingual prompt templates with sentiment support defined.")

# Phase 2: Data Understanding - Explore Directories

## CRISP-DM Phase Overview
The **Data Understanding** phase in CRISP-DM focuses on exploring and assessing the available data to ensure it meets the project's needs. For the bilingual procedure comparison chatbot, this phase verifies the presence and organization of PDF files in English and French directories, confirming the data foundation for procedure extraction.

## Code Purpose
This code cell defines a function to check and list PDF files in the `data/english` and `data/french` directories, ensuring the chatbot has access to the necessary documents. It logs and displays the results for transparency and debugging.

### What the Code Does
- **Directory Check**: Verifies that the `ENGLISH_DIR` and `FRENCH_DIR` (defined earlier) exist, raising a `FileNotFoundError` if either is missing.
- **File Collection**: Gathers and sorts all `.pdf` files from both directories using `pathlib.Path.iterdir()` and filters by file extension.
- **Logging**: Logs the number of files and their paths in both directories to `procedure_comparison.log` for traceability.
- **Output**: Prints the count and names of PDF files in each directory for user visibility.
- **Return**: Returns lists of English and French PDF file paths for use in subsequent phases.

### Why It Matters
This function ensures the chatbot’s data sources (PDFs) are accessible and properly organized before processing. By logging and displaying file details, it aids debugging and confirms the data is ready for extraction, supporting the project’s goal of processing bilingual procedural documents.
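
The filtering rule can be tried safely on a throwaway directory. The sketch below is an assumption-laden miniature of the check: `list_pdfs` is a hypothetical helper, and the file names are invented. It shows that the extension match is case-insensitive and that directories are skipped even when their name ends in `.pdf`.

```python
import tempfile
from pathlib import Path


def list_pdfs(directory: Path) -> list[Path]:
    """Return the PDF files directly inside `directory`, sorted by path."""
    if not directory.exists():
        raise FileNotFoundError(f"Directory not found: {directory}")
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["b_procedure.PDF", "a_procedure.pdf", "notes.txt"]:
        (root / name).write_bytes(b"")
    (root / "archive.pdf").mkdir()  # a directory that merely looks like a PDF
    found_names = [p.name for p in list_pdfs(root)]

print(found_names)
```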

In [None]:
def explore_directories():
    if not ENGLISH_DIR.exists():
        logger.error(f"English directory missing: {ENGLISH_DIR}")
        raise FileNotFoundError("English directory not found")
    if not FRENCH_DIR.exists():
        logger.error(f"French directory missing: {FRENCH_DIR}")
        raise FileNotFoundError("French directory not found")

    english_files = sorted({f for f in ENGLISH_DIR.iterdir() if f.is_file() and f.suffix.lower() == '.pdf'})
    french_files = sorted({f for f in FRENCH_DIR.iterdir() if f.is_file() and f.suffix.lower() == '.pdf'})

    logger.info(f"English directory ({ENGLISH_DIR}): {len(english_files)} files found: {[str(f) for f in english_files]}")
    logger.info(f"French directory ({FRENCH_DIR}): {len(french_files)} files found: {[str(f) for f in french_files]}")

    print(f"English PDFs: {len(english_files)}")
    for f in english_files:
        print(f" - {f.name}")
    print(f"French PDFs: {len(french_files)}")
    for f in french_files:
        print(f" - {f.name}")

    return english_files, french_files

english_files, french_files = explore_directories()

# Phase 3: Data Preparation - Parameter Optimizer

## CRISP-DM Phase Overview
The **Data Preparation** phase in CRISP-DM focuses on transforming raw data into a suitable format for modeling. For the bilingual procedure comparison chatbot, this phase optimizes text chunking parameters to ensure efficient document processing, enhancing the chatbot's ability to extract and query procedural information from PDFs.

## Code Purpose
This code cell defines the `ParameterOptimizer` class, which optimizes chunk size and overlap for splitting PDF text into manageable pieces for retrieval and processing. It uses document structure analysis and embeddings to tailor parameters, ensuring accurate and efficient handling of bilingual (English and French) documents.

### What the Code Does
- **Imports**: Includes libraries for JSON handling (`json`), NLP (`spacy`), embeddings (`HuggingFaceEmbeddings`), clustering (`hdbscan`), and statistical analysis (`statistics`).
- **ParameterOptimizer Class**:
  - **Initialization**: Sets up embeddings, a parameter cache, performance history, and a JSON file (`parameters.json`) for storing optimized parameters.
  - **`load_parameters`**: Loads cached parameters from `parameters.json` if available, logging success or errors.
  - **`save_parameters`**: Saves optimized parameters to `parameters.json` for reuse, with error handling.
  - **`analyze_document_structure`**: Analyzes text using `spacy` (English or French models) to extract sentences, paragraphs, headings, and list items, returning metrics like average sentence/paragraph length, heading frequency, and list density.
  - **`optimize_chunk_params`**: Determines optimal chunk size and overlap based on document length, language, structure (e.g., sentence length, heading frequency), and clustering of sentence embeddings using `hdbscan`. Returns defaults (1000, 200) on errors or empty text.
  - **`get_optimized_params`**: Retrieves or computes chunking parameters, caching results to avoid redundant processing.
- **Logging and Output**: Logs and prints confirmation of `ParameterOptimizer` initialization.
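
For intuition about the structure metrics, here is a small standard-library sketch that counts paragraphs and list items in a text. It deliberately simplifies the approach: the regular expression is a reduced version of the list pattern, and the ratio is computed per non-empty line rather than per spaCy sentence. Both choices are assumptions made to keep the example dependency-free.

```python
import re
from statistics import mean

# Simplified list marker: "1." / "1)" numbering or a "-", "*", "•" bullet at line start
LIST_ITEM = re.compile(r"^(?:\d+[.)]|[-*•])\s+\S", re.MULTILINE)


def structure_metrics(text: str) -> dict:
    """Compute rough paragraph and list statistics for a block of text."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    list_items = LIST_ITEM.findall(text)
    non_empty_lines = [ln for ln in text.splitlines() if ln.strip()]
    return {
        "paragraph_count": len(paragraphs),
        "avg_paragraph_length": mean(len(p) for p in paragraphs) if paragraphs else 0,
        "list_item_count": len(list_items),
        "list_line_ratio": len(list_items) / len(non_empty_lines) if non_empty_lines else 0.0,
    }


sample = (
    "Purpose\n\n"
    "This procedure explains onboarding.\n\n"
    "1. Create the account\n"
    "2. Assign a mentor\n"
    "- Send the welcome email"
)
metrics = structure_metrics(sample)
print(metrics)
```

A high list ratio suggests step-heavy procedural text, which is the kind of signal that pushes the optimizer toward smaller chunks.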

### Why It Matters
The `ParameterOptimizer` ensures text is split into chunks that balance context preservation and processing efficiency, critical for accurate retrieval in the chatbot’s RAG pipeline. By tailoring chunking to document structure and language, it enhances the quality of procedure extraction and query responses.
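
To show what chunk size and overlap mean in practice, the following sketch splits a string with a fixed-size sliding window. It is a simplification for illustration only: `chunk_text` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it ignores separators such as paragraph or sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing fragment of pure overlap
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough. This is why the optimizer caps the overlap at half the chunk size: beyond that, most of every chunk would be duplicated content.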

In [None]:
import json
import spacy
from langchain_huggingface import HuggingFaceEmbeddings
import hdbscan
from statistics import mean, median

class ParameterOptimizer:
    def __init__(self, embeddings: HuggingFaceEmbeddings):
        self.embeddings = embeddings
        self.parameter_cache = {}
        self.performance_history = []
        self.parameter_file = 'parameters.json'
        self.load_parameters()

    def load_parameters(self):
        try:
            if os.path.exists(self.parameter_file):
                with open(self.parameter_file, 'r') as f:
                    self.parameter_cache = json.load(f)
                logger.info("Loaded parameters from parameters.json")
        except Exception as e:
            logger.error(f"Failed to load parameters: {str(e)}")

    def save_parameters(self):
        try:
            with open(self.parameter_file, 'w') as f:
                json.dump(self.parameter_cache, f, indent=2)
            logger.info("Saved parameters to parameters.json")
        except Exception as e:
            logger.error(f"Failed to save parameters: {str(e)}")

    def analyze_document_structure(self, text: str, language: str) -> dict:
        try:
            nlp = spacy.load('en_core_web_sm', disable=['ner']) if language == 'en' else spacy.load('fr_core_news_sm', disable=['ner'])
            doc = nlp(text[:15000])
            sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
            sentence_lengths = [len(sent) for sent in sentences]
            paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
            paragraph_lengths = [len(p) for p in paragraphs]
            heading_pattern = r'^(?:[A-Z][\w\s\-:]{5,50}(?:\n|$)|(?:Section|Chapitre|Part|Partie)\s*\d+[\:\-\.]?\s*.+?(?:\n|$))'
            headings = [m.group(0).strip() for m in re.finditer(heading_pattern, text, re.MULTILINE)]
            heading_freq = len(headings) / (len(text) / 1000) if len(text) > 0 else 0
            list_pattern = r'^(?:\d+\.|\d+[\-\:]|[a-zA-Z]\.|[\-\*\+>·◦•✓✔]|[IVXLCDM]+\.)\s*(.*?)(?:\n|$)' 
            list_items = [m.group(0).strip() for m in re.finditer(list_pattern, text, re.MULTILINE)]
            list_density = len(list_items) / len(sentences) if sentences else 0
            return {
                'sentences': sentences,
                'sentence_lengths': sentence_lengths,
                'avg_sentence_length': mean(sentence_lengths) if sentence_lengths else 100,
                'median_sentence_length': median(sentence_lengths) if sentence_lengths else 100,
                'paragraphs': paragraphs,
                'avg_paragraph_length': mean(paragraph_lengths) if paragraph_lengths else 500,
                'heading_freq': heading_freq,
                'list_density': list_density
            }
        except Exception as e:
            logger.warning(f"Document structure analysis failed: {str(e)}. Using defaults.")
            return {
                'sentences': [],
                'sentence_lengths': [],
                'avg_sentence_length': 100,
                'median_sentence_length': 100,
                'paragraphs': [],
                'avg_paragraph_length': 500,
                'heading_freq': 0,
                'list_density': 0
            }

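    def optimize_chunk_params(self, text: str, language: Optional[str] = None) -> tuple[int, int]:
        try:
            if not text.strip():
                logger.warning("Empty text provided. Using default params.")
                return 1000, 200
            # detect_language_from_prompt only recognizes explicit instructions such as
            # "answer in french", so for raw document text it falls back to 'en'.
            # Callers that know the document language should pass it in.
            if language is None:
                language = detect_language_from_prompt(text)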
            doc_length = len(text)
            structure = self.analyze_document_structure(text, language)
            sentences = structure['sentences']
            avg_sentence_length = structure['avg_sentence_length']
            median_sentence_length = structure['median_sentence_length']
            avg_paragraph_length = structure['avg_paragraph_length']
            heading_freq = structure['heading_freq']
            list_density = structure['list_density']
            if len(sentences) < 3:
                logger.info("Too few sentences for clustering. Using structure-based params.")
                chunk_size = min(max(int(avg_paragraph_length), 250), 2000)
                chunk_overlap = min(int(chunk_size * 0.2) + (50 if language == 'fr' else 0), chunk_size // 2)
                return chunk_size, chunk_overlap
            sentence_embeddings = self.embeddings.embed_documents(sentences)
            if not sentence_embeddings:
                logger.warning("Failed to generate embeddings. Using structure-based params.")
                chunk_size = min(max(int(avg_paragraph_length), 250), 2000)
                chunk_overlap = min(int(chunk_size * 0.2) + (50 if language == 'fr' else 0), chunk_size // 2)
                return chunk_size, chunk_overlap
            clusterer = hdbscan.HDBSCAN(min_cluster_size=3, min_samples=2, cluster_selection_method='eom')
            cluster_labels = clusterer.fit_predict(sentence_embeddings)
            # Sort the cluster ids so that sizes and labels share one stable ordering
            unique_clusters = sorted(set(cluster_labels) - {-1})
            n_clusters = len(unique_clusters)
            cluster_sizes = [int(np.sum(cluster_labels == c)) for c in unique_clusters]
            largest_cluster_size = max(cluster_sizes) if cluster_sizes else len(sentences)
            cluster_density = largest_cluster_size / len(sentences) if sentences else 0.1
            if unique_clusters:
                # np.argmax returns a position in cluster_sizes, so map it back to the cluster label
                largest_cluster = unique_clusters[int(np.argmax(cluster_sizes))]
                cluster_indices = [i for i, c in enumerate(cluster_labels) if c == largest_cluster]
                if len(cluster_indices) > 1:
                    cluster_embeddings = [sentence_embeddings[i] for i in cluster_indices]
                    distances = [
                        np.linalg.norm(np.array(cluster_embeddings[i]) - np.array(cluster_embeddings[i+1]))
                        for i in range(len(cluster_embeddings)-1)
                    ]
                    avg_embedding_distance = mean(distances) if distances else 0.5
                else:
                    avg_embedding_distance = 0.5
            else:
                avg_embedding_distance = 0.5
            chunk_sizes = [250, 500, 750, 1000, 1250, 1500, 2000]
            if list_density > 0.5 or heading_freq > 0.5:
                chunk_size = chunk_sizes[0] if avg_sentence_length < 75 else chunk_sizes[1]
            elif avg_paragraph_length < 500 or n_clusters > 3:
                chunk_size = chunk_sizes[2] if avg_sentence_length < 150 else chunk_sizes[3]
            else:
                chunk_size = chunk_sizes[4] if avg_sentence_length < 200 else chunk_sizes[5 if avg_sentence_length < 300 else 6]
            if doc_length < 1500:
                chunk_size = min(chunk_size, 500)
            elif doc_length > 15000:
                chunk_size = max(chunk_size, 1000)
            base_overlap = 100 if cluster_density < 0.5 else 150
            if avg_embedding_distance > 0.7:
                base_overlap += 50
            if heading_freq > 0.5:
                base_overlap = min(base_overlap, 100)
            if language == 'fr':
                base_overlap += 50
            chunk_overlap = min(base_overlap, chunk_size // 2)
            logger.debug(
                f"Chunk params: size={chunk_size}, overlap={chunk_overlap}, "
                f"doc_length={doc_length}, n_clusters={n_clusters}, "
                f"avg_sentence_length={avg_sentence_length:.2f}, "
                f"median_sentence_length={median_sentence_length:.2f}, "
                f"avg_paragraph_length={avg_paragraph_length:.2f}, "
                f"heading_freq={heading_freq:.2f}, list_density={list_density:.2f}, "
                f"cluster_density={cluster_density:.2f}, "
                f"avg_embedding_distance={avg_embedding_distance:.2f}, language={language}"
            )
            return chunk_size, chunk_overlap
        except Exception as e:
            logger.error(f"Error in optimize_chunk_params: {str(e)}. Returning default parameters.")
            return 1000, 200  # Default chunk size and overlap
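    def get_optimized_params(self, param_type: str, input_data: Any) -> Any:
        import hashlib
        # hash() on str is salted per Python process, so cache keys built from it would not
        # match across sessions if save_parameters() persists the cache; use a stable digest.
        digest = hashlib.sha256(str(input_data).encode('utf-8')).hexdigest()
        cache_key = f"{param_type}_{digest}"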
        if cache_key in self.parameter_cache:
            logger.debug(f"Using cached params for {cache_key}: {self.parameter_cache[cache_key]}")
            return self.parameter_cache[cache_key]
        if param_type == 'chunk':
            best_params = self.optimize_chunk_params(input_data)
            self.parameter_cache[cache_key] = best_params
            self.performance_history.append({'param_type': param_type, 'objective': None})
            self.save_parameters()
            return best_params
        else:
            raise ValueError(f"Unknown parameter type: {param_type}")

logger.info("Data Preparation: Initialized ParameterOptimizer.")
print("ParameterOptimizer initialized.")

# Phase 3: Data Preparation - PDF Processor

## CRISP-DM Phase Overview
The **Data Preparation** phase in CRISP-DM transforms raw data into a usable format for modeling. For the bilingual procedure comparison chatbot, this phase processes PDF documents in English and French to extract procedures and store them for efficient retrieval, enabling the chatbot to query and compare procedural information.

## Code Purpose
This code cell defines the `PDFProcessor` class, which extracts text from PDFs (direct text extraction, with OCR as a fallback for scanned documents), identifies procedures, and stores the results in a `Chroma` vectorstore (for semantic search) and in a plain list of texts that later feeds a `BM25` retriever (for keyword search). Each stored document carries a language tag so that English and French content can be filtered at query time.
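
The extract-then-fall-back idea described above can be sketched in isolation. This is a minimal illustration, not the notebook's implementation: `extract_with_fallback`, `fake_loader`, and `fake_ocr` are hypothetical names, and the two callables stand in for a text-layer loader and an OCR pipeline.

```python
from typing import Callable


def extract_with_fallback(
    path: str,
    direct_extract: Callable[[str], str],
    ocr_extract: Callable[[str], str],
    min_chars: int = 1,
) -> tuple[str, str]:
    """Return (text, method), using OCR when direct extraction fails or yields too little text."""
    try:
        text = direct_extract(path)
    except Exception:
        text = ""  # treat a loader failure like an empty result
    if len(text.strip()) >= min_chars:
        return text, "direct"
    return ocr_extract(path), "ocr"


# Hypothetical stand-ins for a text-layer loader and an OCR pipeline.
def fake_loader(path: str) -> str:
    return "" if path.endswith("scanned.pdf") else "1. Open the valve."


def fake_ocr(path: str) -> str:
    return "1. Ouvrir la vanne."


print(extract_with_fallback("manual.pdf", fake_loader, fake_ocr))
print(extract_with_fallback("scanned.pdf", fake_loader, fake_ocr))
```

Checking for empty output as well as exceptions matters because a scanned PDF typically loads without error but contains no text layer.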

### What the Code Does
- **Imports**: Includes libraries for JSON handling (`json`), text processing (`re`, `spacy`), async I/O (`aiofiles`, `asyncio`), PDF processing (`PyPDFLoader`, `pdf2image`, `pytesseract`), embeddings (`HuggingFaceEmbeddings`), and document storage (`Chroma`).
- **PDFProcessor Class**:
  - **Initialization**: Sets up Tesseract OCR, `spacy` models (English/French), embeddings (`all-MiniLM-L6-v2`), `Chroma` vectorstore, `BM25` document list, and a `ParameterOptimizer` instance. Validates tool paths and logs initialization.
  - **`optimize_chunk_size`**: Uses `ParameterOptimizer` to determine optimal text chunk size and overlap for splitting documents.
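  - **`infer_document_type`**: Returns `procedure` when a "Functioning" or "Fonctionnement" section heading is found. Otherwise it analyzes text structure (regex features plus HDBSCAN clustering of sentence embeddings) to label the document as list-heavy, long-text, or generic, which determines the extraction pattern used later.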
  - **`preprocess_image`**: Enhances PDF images for OCR using OpenCV (grayscale, thresholding).
  - **`extract_text_with_ocr`**: Converts PDFs to images and extracts text using Tesseract, supporting English/French.
  - **`extract_procedures`**: Identifies procedures in text using regex patterns for lists or sections, creating `Procedure` objects with metadata.
  - **`process_pdf`**: Extracts text from a PDF (via `PyPDFLoader` or OCR fallback), splits it into chunks, extracts procedures, and stores them in `Chroma` and `BM25` with metadata.
  - **`process_all_pdfs`**: Asynchronously processes multiple PDFs, handling errors and logging results.
- **Logging and Output**: Logs and prints initialization and processing outcomes for traceability.
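
The list-based extraction in `extract_procedures` can be pictured with a simplified, line-by-line version. The sketch below is an assumption-laden illustration: `STEP_MARKER` and `extract_steps` are hypothetical names, and the marker set is deliberately smaller than the pattern used in the notebook.

```python
import re

# Simplified marker set: "1.", "a.", "IV.", or a bullet character at the start of a line.
STEP_MARKER = re.compile(r'^\s*(\d+\.|[a-zA-Z]\.|[IVXLCDM]+\.|[-*+•])\s+(.*\S)\s*$')


def extract_steps(text: str) -> list[str]:
    """Collect the text of every line that starts with a list marker."""
    steps = []
    for line in text.splitlines():
        match = STEP_MARKER.match(line)
        if match:
            steps.append(match.group(2))
    return steps


sample = """Fonctionnement
1. Ouvrir la session
2. Saisir le code
- Valider la demande
Texte libre sans marqueur."""

print(extract_steps(sample))
```

Matching one line at a time keeps the pattern easy to reason about, at the cost of ignoring steps that wrap across several lines.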

### Why It Matters
The `PDFProcessor` turns raw PDF content into chunked, language-tagged documents that the retrieval layer can search. Storing the same documents for both `Chroma` and `BM25` lets later phases combine semantic and keyword matching, and each extraction step logs its errors and returns an empty result instead of aborting the whole batch.
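
The chunking step that feeds this storage depends on two numbers, chunk size and chunk overlap. The sketch below shows what they mean using a plain sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators such as paragraph breaks), and `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap repeats more context between neighbouring chunks, which helps retrieval when a procedure step straddles a chunk boundary but increases the number of stored documents.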

In [None]:
import json
import os
import uuid
import re
import spacy
import aiofiles
import asyncio
from pathlib import Path
from statistics import mean
from typing import List, Dict, Any
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
import hdbscan
import pdf2image
import pytesseract
import cv2
import numpy as np
from PIL import Image

class PDFProcessor:
    def __init__(self):
        try:
            pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
            if not os.path.exists(TESSERACT_PATH):
                raise FileNotFoundError(f"Tesseract executable not found at: {TESSERACT_PATH}")
            if not os.path.exists(TESSDATA_PREFIX):
                raise FileNotFoundError(f"Tessdata directory not found at: {TESSDATA_PREFIX}")
            os.environ['TESSDATA_PREFIX'] = TESSDATA_PREFIX
            self.nlp_en = spacy.load('en_core_web_sm', disable=['ner'])
            self.nlp_fr = spacy.load('fr_core_news_sm', disable=['ner'])
            self.embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')
            self.vectorstore = Chroma(collection_name='procedures', embedding_function=self.embeddings)
            self.bm25_docs = []
            self.metadata_index = {}
            self.performance_metrics = {'chunk_sizes': [], 'retrieval_scores': [], 'latencies': []}
            self.query_cache = {}
            self.optimizer = ParameterOptimizer(self.embeddings)
            self.text_splitter = None
            if not os.path.exists(POPPLER_PATH):
                raise FileNotFoundError(f"Poppler path does not exist: {POPPLER_PATH}")
            logger.info("PDFProcessor initialized successfully.")
            print("PDFProcessor initialized successfully.")
        except Exception as e:
            logger.error(f"PDFProcessor init error: {str(e)}")
            raise ValueError(f"Failed to initialize PDFProcessor: {str(e)}")

    def optimize_chunk_size(self, text: str) -> tuple[int, int]:
        chunk_size, chunk_overlap = self.optimizer.get_optimized_params('chunk', text)
        self.performance_metrics['chunk_sizes'].append(chunk_size)
        logger.info(f"Optimized chunk_size: {chunk_size}, chunk_overlap: {chunk_overlap}")
        return chunk_size, chunk_overlap

    def infer_document_type(self, text: str) -> str:
        try:
            language = detect_language_from_prompt(text)
            nlp = self.nlp_en if language == 'en' else self.nlp_fr
            section_pattern = r'^(?:Functioning|Fonctionnement)\s*[:\-]?\s*(.*?)(?=\n\n|\n|$|\s{2,})'
            if re.search(section_pattern, text, re.I | re.M):
                logger.info("Found 'functioning' or 'fonctionnement' section. Type: procedure")
                return 'procedure'
            structure = self.optimizer.analyze_document_structure(text, language)
            sentences = structure['sentences']
            list_density = structure['list_density']
            heading_freq = structure['heading_freq']
            avg_paragraph_length = structure['avg_paragraph_length']
            avg_sentence_length = structure['avg_sentence_length']
            sentence_embeddings = self.embeddings.embed_documents(sentences) if sentences else []
            if not sentence_embeddings or len(sentences) < 3:
                logger.info("Insufficient sentences for clustering. Using structure-based inference.")
                if list_density > 0.5:
                    return f'type_list_heavy_{int(list_density * 100)}'
                elif avg_paragraph_length > 500:
                    return f'type_long_text_{int(avg_paragraph_length // 100)}'
                return f'type_generic_{int(avg_sentence_length // 50)}'
            features = []
            for i, sent in enumerate(sentences):
                sent_features = [
                    len(sent) / 1000.0,
                    list_density,
                    heading_freq,
                    avg_paragraph_length / 1000.0,
                    1 if re.match(r'^(?:\d+\.|[-*+>·◦•✓✔]|\d+[\-\:]|[a-zA-Z]\.|[IVXLCDM]+\.)', sent) else 0
                ]
                features.append(sent_features + sentence_embeddings[i])
            clusterer = hdbscan.HDBSCAN(min_cluster_size=3, min_samples=2, cluster_selection_method='eom')
            cluster_labels = clusterer.fit_predict(features)
            unique_clusters = set(cluster_labels) - {-1}
            if not unique_clusters:
                logger.info("No clusters formed. Using structure-based inference.")
                if list_density > 0.5:
                    return f'type_list_heavy_{int(list_density * 100)}'
                elif avg_paragraph_length > 500:
                    return f'type_long_text_{int(avg_paragraph_length // 100)}'
                return f'type_generic_{int(avg_sentence_length // 50)}'
            cluster_properties = []
            for c in unique_clusters:
                cluster_indices = [i for i, label in enumerate(cluster_labels) if label == c]
                cluster_sentences = [sentences[i] for i in cluster_indices]
                cluster_list_density = sum(1 for s in cluster_sentences if re.match(r'^(?:\d+\.|[-*+>·◦•✓✔]|\d+[\-\:]|[a-zA-Z]\.|[IVXLCDM]+\.)', s)) / len(cluster_sentences) if cluster_sentences else 0
                cluster_headings = sum(1 for s in cluster_sentences if re.match(r'^[A-Z][\w\s\-:]{5,50}(?:\n|$)', s)) / len(cluster_sentences) if cluster_sentences else 0
                cluster_avg_length = mean([len(s) for s in cluster_sentences]) if cluster_sentences else avg_sentence_length
                cluster_properties.append({
                    'size': len(cluster_indices),
                    'list_density': cluster_list_density,
                    'heading_density': cluster_headings,
                    'avg_sentence_length': cluster_avg_length
                })
            max_cluster = max(cluster_properties, key=lambda x: x['size']) if cluster_properties else {'list_density': 0, 'avg_sentence_length': avg_sentence_length}
            if max_cluster['list_density'] > 0.5:
                return f"type_list_heavy_{int(max_cluster['list_density'] * 100)}"
            elif max_cluster['avg_sentence_length'] > 200:
                return f"type_long_text_{int(max_cluster['avg_sentence_length'] // 50)}"
            return f"type_generic_{int(max_cluster['avg_sentence_length'] // 50)}"
        except Exception as e:
            logger.error(f"Document type inference error: {str(e)}")
            return 'type_generic_100'

    async def preprocess_image(self, image: Image.Image) -> Image.Image:
        try:
            img_array = np.array(image)
            gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
            thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
            return Image.fromarray(thresh)
        except Exception as e:
            logger.error(f"Image preprocessing error: {str(e)}")
            return image

    async def extract_text_with_ocr(self, file_path: str) -> str:
        try:
            images = pdf2image.convert_from_path(file_path, poppler_path=POPPLER_PATH)
            text = ''
            for img in images:
                processed_img = await self.preprocess_image(img)
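                # No text exists yet on the first page, so let Tesseract try both languages
                # instead of assuming English; later pages use the language detected so far.
                if text.strip():
                    ocr_lang = 'eng' if detect_language_from_prompt(text[:1000]) == 'en' else 'fra'
                else:
                    ocr_lang = 'eng+fra'
                text += pytesseract.image_to_string(processed_img, lang=ocr_lang) + '\n'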
            return text.strip()
        except Exception as e:
            logger.error(f"OCR extraction error for {file_path}: {str(e)}")
            return ''

    def extract_procedures(self, text: str, filename: str, language: str) -> List[Procedure]:
        try:
            doc_type = self.infer_document_type(text)
            procedures = []
            if doc_type.startswith('type_list_heavy'):
                pattern = r'^(\d+\.|[-*+>·◦•✓✔]|[a-zA-Z]\.|[IVXLCDM]+\.)\s*(.*?)(?=\n(?:\d+\.|[-*+>·◦•✓✔]|[a-zA-Z]\.|[IVXLCDM]+\.)|\n\n|$)' 
                matches = re.finditer(pattern, text, re.MULTILINE | re.DOTALL)
                for i, match in enumerate(matches):
                    step_text = match.group(2).strip()
                    procedures.append(Procedure(
                        id=i,
                        title=f"Procedure {i+1}",
                        section='Unknown',
                        steps=[step_text],
                        responsible='Unknown',
                        source=filename,
                        filename=filename,
                        language=language
                    ))
            else:
                pattern = r'(?:(?:Section|Chapitre|Procedure|Procédure)\s*\d+[\:\-\.]?\s*(.*?)(?=\n\n|\n|$|\s{2,}))'
                matches = re.finditer(pattern, text, re.MULTILINE | re.DOTALL)
                for i, match in enumerate(matches):
                    section = match.group(1).strip()
                    procedures.append(Procedure(
                        id=i,
                        title=section,
                        section=section,
                        steps=['Unknown'],
                        responsible='Unknown',
                        source=filename,
                        filename=filename,
                        language=language
                    ))
            return procedures
        except Exception as e:
            logger.error(f"Procedure extraction error: {str(e)}")
            return []

    async def process_pdf(self, file_path: Path) -> List[Document]:
        try:
            loader = PyPDFLoader(str(file_path))
            text = ''
            try:
                pages = loader.load()
                text = '\n'.join(page.page_content for page in pages)
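            except Exception as e:
                logger.warning(f"PyPDFLoader failed for {file_path}: {str(e)}. Falling back to OCR.")
            if not text.strip():
                # Scanned PDFs usually load without error but yield no text, so OCR must cover that case too
                logger.info(f"Running OCR for {file_path}")
                text = await self.extract_text_with_ocr(str(file_path))
            if not text.strip():
                logger.warning(f"No text extracted from {file_path}")
                return []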
            language = detect_language_from_prompt(text[:1000])
            chunk_size, chunk_overlap = self.optimize_chunk_size(text)
            self.text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                length_function=len
            )
            chunks = self.text_splitter.split_text(text)
            documents = [
                Document(page_content=chunk, metadata={'source': str(file_path), 'language': language})
                for chunk in chunks
            ]
            procedures = self.extract_procedures(text, file_path.name, language)
            for proc in procedures:
                proc_id = str(uuid.uuid4())
                self.metadata_index[proc_id] = proc.dict()
                documents.append(Document(
                    page_content='\n'.join(proc.steps),
                    metadata={'source': proc.source, 'language': proc.language, 'procedure_id': proc_id}
                ))
            self.vectorstore.add_documents(documents)
            self.bm25_docs.extend([doc.page_content for doc in documents])
            logger.info(f"Processed {file_path}: {len(documents)} documents added")
            return documents
        except Exception as e:
            logger.error(f"Error processing {file_path}: {str(e)}")
            return []

    async def process_all_pdfs(self, english_files: List[Path], french_files: List[Path]):
        tasks = [self.process_pdf(file) for file in english_files + french_files]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for file, result in zip(english_files + french_files, results):
            if isinstance(result, Exception):
                logger.error(f"Failed to process {file}: {str(result)}")
            else:
                logger.info(f"Successfully processed {file}")


# Phase 4: Modeling - Query Handler

## CRISP-DM Phase Overview
The **Modeling** phase in CRISP-DM focuses on building and applying models to achieve project goals. For the bilingual procedure comparison chatbot, this phase implements the `QueryHandler` class, which processes user queries using Retrieval-Augmented Generation (RAG), zero-shot intent classification, and sentiment analysis. Candidate intents are generated at query time from the extracted procedure titles rather than from a predefined intent list, mapping, or keyword rules, so responses in English and French can adapt to whatever procedures the documents contain.

## Code Purpose
This code cell defines the `QueryHandler` class, which handles user queries by classifying their intent dynamically using zero-shot classification (`facebook/bart-large-mnli`), analyzing sentiment using a transformer-based model (`distilbert-base-uncased-finetuned-sst-2-english`), retrieving relevant documents, and generating responses using a large language model (LLM). It optimizes retrieval and LLM parameters to ensure efficient and accurate query processing, with intent and sentiment classification driven by query content and `metadata_index`.

### What the Code Does
- **Imports**: Includes libraries for regex (`re`), timing (`time`), async processing (`asyncio`), NLP (`spacy`), retrieval (`Chroma`, `BM25Retriever`, `EnsembleRetriever`), sentiment analysis (`transformers.pipeline`), and LLMs (`HuggingFaceEndpoint`).
- **QueryHandler Class**:
  - **Initialization**: Sets up `Chroma` vectorstore, `BM25Retriever`, and an `EnsembleRetriever`. Initializes `spacy` models (English/French), a `facebook/bart-large-mnli` classifier for intent, a `distilbert-base-uncased-finetuned-sst-2-english` classifier for sentiment, a Mixtral-8x7B LLM, and stores prompt templates, metadata index, and analytics.
  - **`detect_language`**: Identifies query language using a predefined function.
  - **`classify_intent`**: Uses `facebook/bart-large-mnli` to classify query intents dynamically by generating candidate labels from `metadata_index` (e.g., 'query about Cryptography Policy') and a 'general query' fallback.
  - **`classify_sentiment`**: Uses `distilbert-base-uncased-finetuned-sst-2-english` to label the query POSITIVE or NEGATIVE with a confidence score, and maps any result with a score of 0.7 or below to neutral. The model is trained on English text, so its output on French queries is approximate.
  - **`optimize_retrieval_params`**: Sets `k` to the number of retrieved documents (capped at 5, defaulting to 3 when none are found) and keeps the retriever weights at 0.5/0.5.
  - **`optimize_temperature`**: Sets LLM temperature from query length (0.7 for queries longer than 10 words, 0.5 otherwise) and uses a fixed 0.6 when the sentiment is negative.
  - **`expand_query`**: Rephrases queries using the LLM to improve retrieval accuracy.
  - **`translate_response`**: Translates fallback messages (e.g., 'No procedure found') to match the query language.
  - **`retrieve_documents`**: Retrieves relevant documents using `EnsembleRetriever` from within an async method (the retriever call itself is synchronous), then keeps only documents whose language matches the query.
  - **`generate_response`**: Generates responses using the LLM with a single, general prompt template, incorporating context, chat history, inferred intent, and sentiment-driven tone.
  - **`handle_query`**: Orchestrates query processing: cleans query, classifies intent and sentiment, retrieves documents, generates responses, and tracks analytics (e.g., response time, success rate).
- **Logging and Output**: Logs and prints initialization and query handling outcomes for traceability.
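
The three small decision rules described above (candidate label generation, sentiment mapping, and temperature selection) can be isolated from the models that feed them. The thresholds below mirror the code cell, while the function names `build_candidate_intents`, `map_sentiment`, and `pick_temperature` are hypothetical and exist only for this sketch.

```python
def build_candidate_intents(metadata_index: dict) -> list[str]:
    """One zero-shot label per known procedure title, plus a generic fallback."""
    labels = [f"query about {proc['title']}" for proc in metadata_index.values()]
    labels.append("general query")
    return labels


def map_sentiment(label: str, score: float, threshold: float = 0.7) -> str:
    """Collapse a binary POSITIVE/NEGATIVE prediction into three classes using its confidence."""
    label = label.lower()
    if label in ("positive", "negative") and score > threshold:
        return label
    return "neutral"


def pick_temperature(query: str, sentiment: str) -> float:
    """Fixed 0.6 for negative sentiment; otherwise 0.7 for queries over 10 words, else 0.5."""
    if sentiment == "negative":
        return 0.6
    return 0.7 if len(query.split()) > 10 else 0.5


index = {"p1": {"title": "Cryptography Policy"}}
print(build_candidate_intents(index))
print(map_sentiment("NEGATIVE", 0.62))
print(pick_temperature("How do I reset my access badge?", "neutral"))
```

Keeping these rules as pure functions makes them easy to unit-test without loading the classifier or the LLM.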

### Why It Matters
The `QueryHandler` is the core of the chatbot’s query processing, integrating dynamic intent classification, sentiment analysis, document retrieval, and response generation to deliver accurate, bilingual, and sentiment-tailored responses. Its RAG approach, sentiment-driven tone adjustment, and optimization ensure efficient and user-centric answers, supporting the project’s goal of streamlined procedural information access.
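
To make the hybrid retrieval step more concrete, the sketch below merges a semantic ranking and a keyword ranking with weighted Reciprocal Rank Fusion, the general technique behind ensemble retrievers of this kind. It illustrates the idea rather than reproducing the library's code: `weighted_rrf`, the document IDs, and the smoothing constant `c = 60` are assumptions made for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list adds weight / (rank + c) to a document's score."""
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search order
keyword = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 order
print(weighted_rrf([semantic, keyword], [0.5, 0.5]))
```

Documents that appear near the top of both lists rise above documents that only one retriever found, which is why equal weights are a reasonable starting point before any tuning.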

In [None]:
import json
import re
import time
import asyncio
from typing import List, Dict, Optional
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_huggingface import HuggingFaceEndpoint
from langchain.schema import Document
from transformers import pipeline
import spacy
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

class QueryHandler:
    def __init__(self, vectorstore: Chroma, bm25_docs: List[str], prompt_templates: Dict, metadata_index: Dict):
        try:
            self.vectorstore = vectorstore
            self.bm25_retriever = BM25Retriever.from_texts(bm25_docs)
            self.bm25_retriever.k = 3
            self.retriever = EnsembleRetriever(
                retrievers=[self.vectorstore.as_retriever(search_kwargs={'k': 3}), self.bm25_retriever],
                weights=[0.5, 0.5]
            )
            self.nlp_en = spacy.load('en_core_web_sm', disable=['ner'])
            self.nlp_fr = spacy.load('fr_core_news_sm', disable=['ner'])
            self.intent_classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
            self.sentiment_classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
            self.llm = HuggingFaceEndpoint(
                repo_id='mistralai/Mixtral-8x7B-Instruct-v0.1',
                temperature=0.7,  # Default temperature
                max_new_tokens=512,
                huggingfacehub_api_token=os.getenv('HUGGINGFACEHUB_API_TOKEN')
            )
            self.prompt_templates = prompt_templates
            self.metadata_index = metadata_index
            self.chat_history = []
            self.analytics = {'queries': 0, 'response_times': [], 'retrieval_scores': [], 'errors': [], 'success': 0}
            logger.info("QueryHandler initialized with sentiment analysis support.")
            print("QueryHandler initialized with sentiment analysis support.")
        except Exception as e:
            logger.error(f"QueryHandler init error: {str(e)}")
            raise ValueError(f"Failed to initialize QueryHandler: {str(e)}")

    def detect_language(self, query: str) -> str:
        return detect_language_from_prompt(query)

    def classify_intent(self, query: str) -> str:
        try:
            language = self.detect_language(query)
            nlp = self.nlp_en if language == 'en' else self.nlp_fr
            doc = nlp(query)
            query_clean = ' '.join(token.text.lower() for token in doc if not token.is_punct)
            # Generate candidate intents dynamically from metadata_index
            candidate_intents = [f"query about {proc['title']}" for proc in self.metadata_index.values()]
            candidate_intents.append("general query")  # Fallback for non-procedure-specific queries
            # Use zero-shot classification to infer intent
            result = self.intent_classifier(query_clean, candidate_labels=candidate_intents, multi_label=False)
            predicted_intent = result['labels'][0]
            confidence = result['scores'][0]
            logger.debug(f"Intent classified: {predicted_intent}, confidence={confidence}")
            return predicted_intent
        except Exception as e:
            logger.warning(f"Intent classification failed: {str(e)}. Defaulting to 'general query'.")
            return "general query"

    def classify_sentiment(self, query: str) -> str:
        try:
            language = self.detect_language(query)
            nlp = self.nlp_en if language == 'en' else self.nlp_fr
            doc = nlp(query)
            query_clean = ' '.join(token.text.lower() for token in doc if not token.is_punct)
            # Use sentiment classifier to infer sentiment
            result = self.sentiment_classifier(query_clean)
            label = result[0]['label'].lower()  # POSITIVE or NEGATIVE
            score = result[0]['score']
            # Map to positive, negative, or neutral based on score
            if label == 'positive' and score > 0.7:
                sentiment = 'positive'
            elif label == 'negative' and score > 0.7:
                sentiment = 'negative'
            else:
                sentiment = 'neutral'
            logger.debug(f"Sentiment classified: {sentiment}, confidence={score}")
            return sentiment
        except Exception as e:
            logger.warning(f"Sentiment classification failed: {str(e)}. Defaulting to 'neutral'.")
            return "neutral"

    def optimize_retrieval_params(self, query: str, docs: List[Document]) -> tuple[int, List[float]]:
        try:
            k = min(len(docs), 5) if docs else 3
            weights = [0.5, 0.5]
            self.analytics['retrieval_scores'].append(1.0 if docs else 0.0)
            logger.info(f"Optimized retrieval params: k={k}, weights={weights}")
            return k, weights
        except Exception as e:
            logger.error(f"Retrieval optimization error: {str(e)}")
            return 3, [0.5, 0.5]

    def optimize_temperature(self, query: str, sentiment: str) -> float:
        try:
            # Adjust temperature based on query length and sentiment
            base_temperature = 0.7 if len(query.split()) > 10 else 0.5
            if sentiment == 'negative':
                temperature = 0.6  # Fixed mid-range value for negative sentiment, regardless of query length
            else:
                temperature = base_temperature
            logger.info(f"Optimized temperature: {temperature}, sentiment={sentiment}")
            return temperature
        except Exception as e:
            logger.error(f"Temperature optimization error: {str(e)}")
            return 0.7

    def expand_query(self, query: str) -> str:
        try:
            if not isinstance(query, str) or not query.strip():
                logger.warning("Invalid or empty query for expansion. Returning original query.")
                return str(query)
            prompt = f"Rephrase and expand the following query with synonyms to improve retrieval: '{query}'"
            expanded = self.llm.invoke(prompt)
            return str(expanded).strip()
        except Exception as e:
            logger.error(f"Query expansion error: {str(e)}")
            return str(query)

    def translate_response(self, response: str, target_lang: str) -> str:
        try:
            if target_lang == 'fr' and 'No procedure found' in response:
                return "Aucune procédure trouvée."
            elif target_lang == 'en' and 'Aucune procédure trouvée.' in response:
                return "No procedure found."
            return response
        except Exception as e:
            logger.error(f"Translation error: {str(e)}")
            return response

    async def retrieve_documents(self, query: str, language: str) -> List[Document]:
        try:
            start_time = time.time()
            docs = self.retriever.get_relevant_documents(query)
            filtered_docs = [doc for doc in docs if doc.metadata.get('language') == language]
            latency = time.time() - start_time
            self.analytics['retrieval_scores'].append(1.0 if filtered_docs else 0.0)
            logger.info(f"Retrieved {len(filtered_docs)} documents in {latency:.2f} seconds")
            return filtered_docs
        except Exception as e:
            logger.error(f"Document retrieval failed: {str(e)}")
            return []

    async def generate_response(self, query: str, intent: str, sentiment: str, context: str, procedure_name: Optional[str] = None) -> str:
        try:
            language = self.detect_language(query)
            prompt_template = self.prompt_templates[language]['general']
            chat_history_str = '\n'.join([f"Q: {h['query']}\nA: {h['response']}" for h in self.chat_history[-3:]])
            prompt = prompt_template.format(
                context=context,
                chat_history=chat_history_str,
                question=query,
                intent=intent,
                sentiment=sentiment,
                procedure_name=procedure_name or 'Unknown'
            )
            self.llm.temperature = self.optimize_temperature(query, sentiment)
            response = self.llm.invoke(prompt)
            response = self.translate_response(response, language)
            logger.debug(f"Generated response for intent={intent}, sentiment={sentiment}, language={language}")
            return response.strip()
        except Exception as e:
            logger.error(f"Response generation failed: {str(e)}")
            return "Error generating response."

    async def handle_query(self, query: str, session_id: str = 'default') -> str:
        start_time = time.time()
        self.analytics['queries'] += 1
        try:
            language = self.detect_language(query)
            query_clean = re.sub(r'(?i)(answer in (english|french)|répondez en (français|anglais))', '', query).strip()
            intent = self.classify_intent(query_clean)
            sentiment = self.classify_sentiment(query_clean)
            procedure_name = None
            if intent.startswith('query about'):
                nlp = self.nlp_en if language == 'en' else self.nlp_fr
                doc = nlp(query_clean)
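                # Note: both spaCy pipelines are loaded with disable=['ner'], so doc.ents is empty
                # and the noun-based fallback below is what actually sets procedure_name.
                for ent in doc.ents: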
                    if ent.label_ in ['PRODUCT', 'ORG', 'PROCEDURE']:
                        procedure_name = ent.text
                        break
                if not procedure_name:
                    procedure_name = ' '.join(token.text for token in doc if token.pos_ == 'NOUN' and token.text.lower() not in ['procedure', 'procédure'])
            expanded_query = self.expand_query(query_clean)
            docs = await self.retrieve_documents(expanded_query, language)
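            # These settings apply to the next retrieval; the documents above were fetched with the previous values
            k, weights = self.optimize_retrieval_params(expanded_query, docs)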
            self.retriever.retrievers[0].search_kwargs['k'] = k
            self.retriever.weights = weights
            context = '\n'.join([doc.page_content for doc in docs])[:2000]
            response_text = await self.generate_response(query_clean, intent, sentiment, context, procedure_name)
            response = ChatbotResponse(
                query=query,
                intent=intent,
                response=response_text,
                language=language,
                procedure_name=procedure_name,
                analytics={'retrieval_time': time.time() - start_time, 'sentiment': sentiment}
            )
            self.chat_history.append({'query': query, 'response': response_text})
            self.analytics['response_times'].append((time.time() - start_time) * 1000)
            self.analytics['success'] += 1
            logger.info(f"Handled query: intent={intent}, sentiment={sentiment}, language={language}, response_length={len(response_text)}")
            return json.dumps(response.dict())
        except Exception as e:
            logger.error(f"Query handling failed: {str(e)}")
            self.analytics['errors'].append(str(e))
            response = ChatbotResponse(
                query=query,
                intent='error',
                response=f"Error: {str(e)}",
                language=self.detect_language(query),
                error=str(e),
                analytics={'retrieval_time': time.time() - start_time, 'sentiment': 'neutral'}
            )
            return json.dumps(response.dict())

# Phase 5: Evaluation - Performance Assessment

## CRISP-DM Phase Overview
The **Evaluation** phase in CRISP-DM assesses the chatbot’s performance to ensure it meets project goals. For the bilingual procedure comparison chatbot, this phase evaluates intent classification accuracy, sentiment analysis accuracy, and response quality in English and French, using metrics like precision, recall, F1-score, cosine similarity, ROUGE, BLEU, and METEOR to validate effectiveness.

## Code Purpose
This code cell evaluates the chatbot by processing test queries, measuring intent classification accuracy, sentiment classification accuracy, response quality, and response time. It uses a comprehensive set of metrics (precision, recall, F1-score for intent and sentiment, and cosine similarity, ROUGE, BLEU, METEOR for response quality) to assess performance, saving results for analysis.


### What the Code Does
- **Imports**: Includes libraries for JSON handling (`json`), timing (`time`), async processing (`asyncio`), statistics (`statistics`), embeddings (`sentence_transformers`), and evaluation metrics (`rouge_score`, `nltk.translate.bleu_score`, `nltk.translate.meteor_score`).
- **Functions**:
  - **`load_test_data`**: Loads test queries from `test_queries.json`, each with a query, expected intent, expected response, and language. If the file is missing, it creates five sample queries and saves them to that file.
  - **`evaluate_response_quality`**: Computes:
    - Cosine similarity (via `sentence_transformers`) for semantic similarity.
    - ROUGE-1, ROUGE-2, ROUGE-L for text overlap.
    - BLEU for n-gram precision.
    - METEOR for synonym-aware, order-sensitive text quality.
  - **`run_evaluation`**: Processes test queries concurrently (`asyncio.gather`) using `QueryHandler`, evaluates intent classification (precision, recall, F1-score, via `evaluate_intent_classification`, which must already be defined in an earlier cell), response quality (cosine similarity, ROUGE, BLEU, METEOR), and response times, then saves results to `evaluation_results.json`.
- **Execution**: Initializes `PDFProcessor` and `QueryHandler`, runs evaluations, and logs/prints results.
- **Logging and Output**: Logs metrics and errors, prints a summary of intent classification, response quality, average response time, and error count.
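
The intent metrics listed above can be sketched as macro-averaged precision, recall, and F1. This is a minimal sketch under assumptions: the function name `macro_prf` is hypothetical, and macro averaging is one plausible choice. The notebook's own `evaluate_intent_classification` is not shown in this section and may average differently.

```python
from collections import Counter
from typing import Dict, List

def macro_prf(true_labels: List[str], predicted_labels: List[str]) -> Dict[str, float]:
    """Macro-averaged precision, recall, and F1 over all labels seen (hypothetical helper)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    labels = sorted(set(true_labels) | set(predicted_labels))
    if not labels:
        return {'precision': 0.0, 'recall': 0.0, 'f1_score': 0.0}
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        'precision': sum(precisions) / n,
        'recall': sum(recalls) / n,
        'f1_score': sum(f1s) / n,
    }
```

The returned keys mirror the `intent_metrics` dictionary used in `run_evaluation`, so a function of this shape could be dropped in where no other implementation is available.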


### Why It Matters
This evaluation checks that the chatbot interprets user intents correctly and generates high-quality responses in both languages. Combining embedding-based cosine similarity with ROUGE, BLEU, and METEOR gives complementary views of response quality: semantic closeness on one side, surface n-gram overlap on the other. This supports the project’s goal of reliable procedural information delivery.
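
To make the overlap metrics concrete, here is a simplified sketch of a ROUGE-1 style F-measure based on unigram overlap. It is an illustration only: the name `rouge1_f` is hypothetical, tokenization is a plain whitespace split, and no stemming is applied, so its numbers will not match the `rouge_score` package exactly.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F-measure between a reference and a candidate text (simplified)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection: each shared word counts up to its minimum frequency
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

For example, a candidate that shares three of its four words with a five-word reference has precision 0.75 and recall 0.6, giving an F-measure of about 0.67.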

In [None]:
import json
import time  # import the module, not the function: QueryHandler calls time.time(), which `from time import time` would shadow
from typing import List, Dict, Any
import asyncio
import os
from statistics import mean, stdev
import numpy as np
from sentence_transformers import SentenceTransformer
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import single_meteor_score
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data name used by newer NLTK releases
nltk.download('wordnet', quiet=True)

def load_test_data(test_file: str = str(TEST_DATA_DIR / 'test_queries.json')) -> List[Dict]:
    """Load test queries from disk, creating a small sample file if none exists."""
    try:
        if not os.path.exists(test_file):
            logger.warning(f"Test file {test_file} not found. Creating sample test data.")
            sample_data = [
                {
                    "query": "List all procedures",
                    "intent": "list",
                    "expected_response": "1. Cryptography Policy\n2. Data Backup Concept\n3. ISMS Risk Management",
                    "language": "en"
                },
                {
                    "query": "What is the purpose of Cryptography Policy?",
                    "intent": "question",
                    "expected_response": "The purpose of the Cryptography Policy is to ensure secure data encryption.",
                    "language": "en"
                },
                {
                    "query": "Détails de la procédure 019110107Fr+",
                    "intent": "detail",
                    "expected_response": "**Procédure**: 019110107Fr+\n**Objectif**: Démontrer des processus\n**Étapes**:\n1. Démarrer\n2. Compléter",
                    "language": "fr"
                },
                {
                    "query": "Compare Cryptography Policy in English and French",
                    "intent": "compare",
                    "expected_response": "| Aspect | English | French |\n|--------|---------|--------|\n| Purpose | Secure encryption | Chiffrement sécurisé |",
                    "language": "en"
                },
                {
                    "query": "Execute Data Backup Concept",
                    "intent": "execute",
                    "expected_response": "Executing first step of 'Data Backup Concept': Initialize backup process",
                    "language": "en"
                }
            ]
            os.makedirs(os.path.dirname(test_file), exist_ok=True)
            with open(test_file, 'w', encoding='utf-8') as f:
                json.dump(sample_data, f, indent=2, ensure_ascii=False)
            logger.info(f"Created sample test data with {len(sample_data)} queries at {test_file}")
            return sample_data
        with open(test_file, 'r', encoding='utf-8') as f:
            test_data = json.load(f)
        logger.info(f"Loaded {len(test_data)} test queries from {test_file}")
        return test_data
    except Exception as e:
        logger.error(f"Error loading test data: {str(e)}")
        return []

def evaluate_response_quality(responses: List[str], expected_responses: List[str]) -> Dict[str, float]:
    """Score responses against expected responses with cosine similarity, ROUGE, BLEU, and METEOR."""
    try:
        # Note: all-MiniLM-L6-v2 is trained mainly on English text, so similarity scores for French may be less reliable
        model = SentenceTransformer('all-MiniLM-L6-v2')
        response_embeddings = model.encode(responses)
        expected_embeddings = model.encode(expected_responses)
        # Convert to built-in float so the results stay JSON-serializable
        similarities = [
            float(np.dot(r, e) / (np.linalg.norm(r) * np.linalg.norm(e)))
            for r, e in zip(response_embeddings, expected_embeddings)
        ]
        
        # Initialize ROUGE scorer
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
        bleu_scores = []
        meteor_scores = []
        # Smoothing avoids hard-zero BLEU when a short text has no higher-order n-gram matches
        smoothing = SmoothingFunction().method1

        for resp, exp in zip(responses, expected_responses):
            # ROUGE scores (reference first, then prediction)
            scores = scorer.score(exp, resp)
            rouge_scores['rouge1'].append(scores['rouge1'].fmeasure)
            rouge_scores['rouge2'].append(scores['rouge2'].fmeasure)
            rouge_scores['rougeL'].append(scores['rougeL'].fmeasure)
            
            # BLEU score
            ref_tokens = word_tokenize(exp.lower())
            hyp_tokens = word_tokenize(resp.lower())
            bleu = sentence_bleu([ref_tokens], hyp_tokens, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothing)
            bleu_scores.append(bleu)
            
            # METEOR score (reference first, then hypothesis; both pre-tokenized)
            meteor = single_meteor_score(ref_tokens, hyp_tokens)
            meteor_scores.append(meteor)

        return {
            'avg_cosine_similarity': mean(similarities) if similarities else 0.0,
            'std_cosine_similarity': stdev(similarities) if len(similarities) > 1 else 0.0,
            'avg_rouge1': mean(rouge_scores['rouge1']) if rouge_scores['rouge1'] else 0.0,
            'avg_rouge2': mean(rouge_scores['rouge2']) if rouge_scores['rouge2'] else 0.0,
            'avg_rougeL': mean(rouge_scores['rougeL']) if rouge_scores['rougeL'] else 0.0,
            'avg_bleu': mean(bleu_scores) if bleu_scores else 0.0,
            'avg_meteor': mean(meteor_scores) if meteor_scores else 0.0
        }
    except Exception as e:
        logger.error(f"Response quality evaluation error: {str(e)}")
        return {
            'avg_cosine_similarity': 0.0,
            'std_cosine_similarity': 0.0,
            'avg_rouge1': 0.0,
            'avg_rouge2': 0.0,
            'avg_rougeL': 0.0,
            'avg_bleu': 0.0,
            'avg_meteor': 0.0
        }

def run_evaluation(query_handler, test_data: List[Dict]) -> Dict[str, Any]:
    """Run all test queries, compute intent and response-quality metrics, and save the results."""
    results = {
        'intent_metrics': {'precision': 0.0, 'recall': 0.0, 'f1_score': 0.0},
        'response_quality': {
            'avg_cosine_similarity': 0.0,
            'std_cosine_similarity': 0.0,
            'avg_rouge1': 0.0,
            'avg_rouge2': 0.0,
            'avg_rougeL': 0.0,
            'avg_bleu': 0.0,
            'avg_meteor': 0.0
        },
        'response_times': [],
        'errors': []
    }
    true_intents = []
    predicted_intents = []
    responses = []
    expected_responses = []

    async def evaluate_single_query(query_data: Dict) -> None:
        try:
            start_time = time.time()
            query = query_data['query']
            true_intent = query_data['intent']
            expected_response = query_data['expected_response']
            response_json = await query_handler.handle_query(query)
            response_data = json.loads(response_json)
            predicted_intent = response_data['intent']
            response_text = response_data['response']

            # Appended together after the await, so the four lists stay aligned
            true_intents.append(true_intent)
            predicted_intents.append(predicted_intent)
            responses.append(response_text)
            expected_responses.append(expected_response)

            response_time = (time.time() - start_time) * 1000  # Convert to milliseconds
            results['response_times'].append(response_time)
            logger.info(f"Evaluated query: {query}, Intent: {true_intent}, Response time: {response_time:.2f}ms")
        except Exception as e:
            logger.error(f"Evaluation error for query '{query_data.get('query', 'unknown')}': {str(e)}")
            results['errors'].append(str(e))

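    async def run_async_evaluations():
        # Queries run concurrently, so a measured response time may include time spent waiting on other queries
        tasks = [evaluate_single_query(query_data) for query_data in test_data]
        await asyncio.gather(*tasks, return_exceptions=True)

    # Run evaluations
    # Note: asyncio.run() raises RuntimeError if an event loop is already running, as in a Jupyter kernel
    asyncio.run(run_async_evaluations())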

    # Evaluate intent classification
    if true_intents and predicted_intents:
        intent_metrics = evaluate_intent_classification(true_intents, predicted_intents)
        results['intent_metrics'] = intent_metrics
        logger.info(f"Intent classification metrics: {intent_metrics}")

    # Evaluate response quality
    if responses and expected_responses:
        quality_metrics = evaluate_response_quality(responses, expected_responses)
        results['response_quality'] = quality_metrics
        logger.info(f"Response quality metrics: {quality_metrics}")

    # Save results
    try:
        results_file = EVAL_RESULTS_DIR / 'evaluation_results.json'
        os.makedirs(EVAL_RESULTS_DIR, exist_ok=True)
        with open(results_file, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        logger.info(f"Evaluation results saved to {results_file}")
    except Exception as e:
        logger.error(f"Failed to save evaluation results: {str(e)}")
        results['errors'].append(f"Failed to save results: {str(e)}")

    # Print summary
    print("Evaluation Summary:")
    print(f"Intent Classification: {results['intent_metrics']}")
    print(f"Response Quality: {results['response_quality']}")
    print(f"Average Response Time: {mean(results['response_times']):.2f}ms" if results['response_times'] else "No response times recorded")
    print(f"Errors: {len(results['errors'])} encountered")

    return results

# Initialize components and run evaluation
try:
    pdf_processor = PDFProcessor()
    asyncio.run(pdf_processor.process_all_pdfs(english_files, french_files))
    query_handler = QueryHandler(pdf_processor.vectorstore, pdf_processor.bm25_docs, PROMPT_TEMPLATES)
    test_data = load_test_data()
    evaluation_results = run_evaluation(query_handler, test_data)
    logger.info("Evaluation completed successfully.")
    print("Evaluation completed successfully.")
except Exception as e:
    logger.error(f"Evaluation pipeline error: {str(e)}")
    print(f"Evaluation failed: {str(e)}")
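
The cell above calls `asyncio.run()`, which raises `RuntimeError` when an event loop is already running, as is the case inside a Jupyter kernel. One possible workaround, sketched below with standard-library tools only, is to run the coroutine on a fresh loop in a worker thread. The helper name `run_coroutine_sync` is hypothetical and not part of the notebook, and the approach assumes the coroutine does not depend on objects bound to the caller's event loop.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine

def run_coroutine_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code (hypothetical helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread (plain script): asyncio.run is fine
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): run the coroutine on a fresh
    # loop in a worker thread and block until it finishes
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

Under these assumptions, a call such as `run_coroutine_sync(pdf_processor.process_all_pdfs(english_files, french_files))` could stand in for the corresponding `asyncio.run(...)` call. In a notebook, awaiting the coroutine directly at the top level of a cell is another option.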

# Phase 6: Deployment - Web Application

## CRISP-DM Phase Overview
The **Deployment** phase in CRISP-DM focuses on integrating the model into a production environment for end-user access. For the bilingual procedure comparison chatbot, this phase deploys the system as a web application, enabling users to interact with the chatbot via a browser in English and French.

## Code Purpose
This code cell sets up a Flask web application with a user-friendly interface and API endpoint for query processing, served by Waitress for production-grade performance. It ensures bilingual support and robust error handling for reliable deployment.

### What the Code Does
- **Imports**: Includes libraries for web development (`flask`), production serving (`waitress`), and async processing (`asyncio`).
- **Flask Application**:
  - Initializes a Flask app with two routes:
    - `/`: Serves an HTML interface (`INDEX_HTML`) with a chat window, input field, and JavaScript for sending queries.
    - `/query`: Handles POST requests, processes queries using `QueryHandler`, and returns JSON responses.
  - Uses `render_template_string` for the HTML interface and `jsonify` for API responses.
- **HTML Interface**: Provides a styled chat interface with user and bot message displays, supporting query input via button or Enter key.
- **Query Handling**: Processes queries asynchronously with `QueryHandler`, supports session IDs, and includes error handling for invalid requests or server issues.
- **Waitress Server**: Runs the Flask app on port 5000 with `waitress.serve` for stable, production-ready hosting.
- **Initialization**: Sets up `PDFProcessor` and `QueryHandler`, processes PDFs, and starts the server.
- **Logging and Output**: Logs server startup, query processing, and errors; prints deployment status.
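
The `/query` endpoint described above can also be called without the browser interface. The following sketch builds and sends such a request with the standard library. The helper names `build_query_request` and `send_query` and the base URL `http://localhost:5000` are assumptions for illustration; only the route, the JSON fields `query` and `session_id`, and the port come from the notebook.

```python
import json
import urllib.request
from typing import Any, Dict

def build_query_request(query: str, session_id: str = "default",
                        base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build a POST request for the /query endpoint (hypothetical helper)."""
    payload = json.dumps({"query": query, "session_id": session_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_query(query: str, session_id: str = "default",
               base_url: str = "http://localhost:5000", timeout: float = 30.0) -> Dict[str, Any]:
    """Send the query and decode the JSON reply; requires the server to be running."""
    request = build_query_request(query, session_id, base_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx replies
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server started, `send_query("List all procedures")` would return the same JSON object that the chat page receives, which can be handy for scripted smoke tests.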

### Why It Matters
This deployment makes the chatbot accessible to users via a web interface, supporting bilingual queries and ensuring reliability through robust error handling and production-grade serving. It completes the project’s goal of delivering procedural information interactively and efficiently.

In [None]:
from flask import Flask, request, jsonify, render_template_string
from waitress import serve
import asyncio

app = Flask(__name__)

# HTML template for the chatbot interface
INDEX_HTML = """
<!DOCTYPE html>
<html lang='en'>
<head>
    <meta charset='UTF-8'>
    <title>Procedure Comparison Chatbot</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 20px; background-color: #f4f4f9; }
        h1 { color: #333; }
        #chat-container { max-width: 600px; margin: auto; padding: 20px; background: white; border-radius: 8px; box-shadow: 0 0 10px rgba(0,0,0,0.1); }
        #chat-history { height: 300px; overflow-y: auto; border: 1px solid #ccc; padding: 10px; margin-bottom: 10px; background: #fafafa; }
        #user-input { width: 80%; padding: 10px; margin-right: 10px; border: 1px solid #ccc; border-radius: 4px; }
        #submit-btn { padding: 10px 20px; background: #007bff; color: white; border: none; border-radius: 4px; cursor: pointer; }
        #submit-btn:hover { background: #0056b3; }
        .message { margin: 5px 0; padding: 10px; border-radius: 4px; }
        .user-message { background: #d4edda; }
        .bot-message { background: #e9ecef; }
    </style>
</head>
<body>
    <div id='chat-container'>
        <h1>Procedure Comparison Chatbot</h1>
        <div id='chat-history'></div>
        <input type='text' id='user-input' placeholder='Enter your query (e.g., List all procedures)'>
        <button id='submit-btn' onclick='sendQuery()'>Send</button>
    </div>
    <script>
        async function sendQuery() {
            const input = document.getElementById('user-input').value;
            if (!input.trim()) return;
            addMessage('You: ' + input, 'user-message');
            try {
                const response = await fetch('/query', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ query: input, session_id: 'default' })
                });
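                const data = await response.json();
                // Error replies (HTTP 400/500) carry an `error` field instead of `response`
                addMessage('Bot: ' + (data.response || data.error || 'No response received.'), 'bot-message');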
            } catch (error) {
                addMessage('Bot: Error processing query.', 'bot-message');
            }
            document.getElementById('user-input').value = '';
        }
        function addMessage(text, className) {
            const chatHistory = document.getElementById('chat-history');
            const messageDiv = document.createElement('div');
            messageDiv.className = 'message ' + className;
            messageDiv.innerText = text;
            chatHistory.appendChild(messageDiv);
            chatHistory.scrollTop = chatHistory.scrollHeight;
        }
        document.getElementById('user-input').addEventListener('keydown', function(e) {
            if (e.key === 'Enter') sendQuery();
        });
    </script>
</body>
</html>
"""

@app.route('/')
def index():
    return render_template_string(INDEX_HTML)

# Async views require Flask 2.0+ installed with the async extra (pip install "flask[async]")
@app.route('/query', methods=['POST'])
async def handle_query():
    try:
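        # silent=True returns None on missing or malformed JSON, so the check below answers with 400 instead of 500
        data = request.get_json(silent=True)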
        if not data or 'query' not in data:
            logger.error("Invalid query request: Missing query field")
            return jsonify({'error': 'Query field is required'}), 400
        query = data['query']
        session_id = data.get('session_id', 'default')
        response_json = await query_handler.handle_query(query, session_id)
        response_data = json.loads(response_json)
        logger.info(f"Processed query via API: {query}, Session: {session_id}")
        return jsonify(response_data)
    except Exception as e:
        logger.error(f"API query error: {str(e)}")
        return jsonify({'error': str(e)}), 500

def run_server():
    try:
        logger.info("Starting Flask server with Waitress on port 5000...")
        serve(app, host='0.0.0.0', port=5000)
    except Exception as e:
        logger.error(f"Failed to start server: {str(e)}")
        print(f"Server failed to start: {str(e)}")

if __name__ == '__main__':
    try:
        # Ensure components are initialized
        pdf_processor = PDFProcessor()
        asyncio.run(pdf_processor.process_all_pdfs(english_files, french_files))
        query_handler = QueryHandler(pdf_processor.vectorstore, pdf_processor.bm25_docs, PROMPT_TEMPLATES)
        run_server()
    except Exception as e:
        logger.error(f"Deployment initialization error: {str(e)}")
        print(f"Deployment failed: {str(e)}")
