# 🤖 Restaurant Intelligence Chatbot - Production System

**Enterprise-Grade Conversational AI with RAG**

This system integrates:
- ✅ Sentiment analysis with a transformer model and a rule-based fallback
- ✅ Independent aspect-based analysis
- ✅ Vector retrieval (RAG)
- ✅ LLM-driven recommendations (a mock LLM stands in for a real model here)
- ✅ Hallucination prevention through a grounded prompt
- ✅ Error handling with logging and fallbacks

---

## System Architecture

```
User Query → Intent Router → Vector Retrieval → LLM Reasoning → Structured Response
                ↓                    ↓                ↓
         Context Memory      Semantic Search    Grounded Generation
```

## 📦 Installation & Dependencies

In [1]:
!pip install -q \
    chromadb \
    langchain \
    langchain-community \
    sentence-transformers \
    transformers \
    accelerate \
    python-dotenv

## ⚙️ Configuration & Environment Setup

In [2]:
# Core Imports
import os
import warnings
import logging
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass, field
import json

# Data Processing
import pandas as pd
import numpy as np
from tqdm import tqdm

# ML & NLP
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# Vector DB & RAG
import chromadb
from chromadb.config import Settings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.llms.base import LLM

# Validation
from pydantic import BaseModel, Field, validator

# Configuration
warnings.filterwarnings('ignore')
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
tqdm.pandas()

print("✅ All dependencies loaded successfully")

✅ All dependencies loaded successfully


In [3]:
@dataclass
class SystemConfig:
    """Central configuration for paths, models, retrieval, and aspect keywords"""
    
    # Paths
    data_path: str = "/kaggle/input/datasets/shahriard07/restaurant-review/dhaka_restaurants.csv"
    vector_db_path: str = "/kaggle/working/restaurant_vector_db"
    
    # Model Configuration
    sentiment_model: str = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
    embedding_model: str = "all-MiniLM-L6-v2"
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    
    # LLM Configuration
    llm_temperature: float = 0.2
    max_tokens: int = 512
    
    # RAG Configuration
    retrieval_k: int = 5
    similarity_threshold: float = 0.7
    
    # Aspect Keywords
    aspects: Dict[str, List[str]] = field(default_factory=lambda: {
        "food": ["food", "taste", "meal", "dish", "cuisine", "flavor", "delicious", "খাবার", "স্বাদ"],
        "service": ["service", "staff", "waiter", "waitress", "manager", "server", "সার্ভিস", "কর্মী"],
        "price": ["price", "cost", "expensive", "cheap", "value", "affordable", "দাম", "মূল্য"],
        "ambience": ["ambience", "atmosphere", "environment", "decor", "vibe", "পরিবেশ"],
        "cleanliness": ["clean", "hygiene", "sanitary", "dirty", "পরিষ্কার"]
    })
    
    # Negative Triggers
    negative_triggers: List[str] = field(default_factory=lambda: [
        "late", "slow", "rude", "bad", "cold", "delay", "terrible", "awful",
        "disappointing", "poor", "worst", "খারাপ", "দেরি", "ঠান্ডা"
    ])
    
    # Production Settings
    batch_size: int = 32
    enable_logging: bool = True
    fallback_enabled: bool = True

# Initialize configuration
config = SystemConfig()
logger.info(f"System initialized on device: {config.device}")
print(f"🔧 Configuration loaded - Device: {config.device}")

🔧 Configuration loaded - Device: cuda
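
`SystemConfig` hard-codes its values. If overrides from the environment are wanted, one possible approach is sketched below. `MiniConfig`, `config_from_env`, and the `RESTAURANT_` variable prefix are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import os
from dataclasses import dataclass, fields, replace


@dataclass
class MiniConfig:
    """Hypothetical stand-in for a few SystemConfig fields."""
    retrieval_k: int = 5
    llm_temperature: float = 0.2
    embedding_model: str = "all-MiniLM-L6-v2"


def config_from_env(base: MiniConfig, prefix: str = "RESTAURANT_") -> MiniConfig:
    """Return a copy of `base` with fields overridden by PREFIX + FIELD_NAME env vars."""
    overrides = {}
    for f in fields(base):
        raw = os.environ.get(prefix + f.name.upper())
        if raw is not None:
            # Cast with the type of the current value (int, float, or str here).
            # Boolean fields would need explicit parsing, since bool("False") is True.
            overrides[f.name] = type(getattr(base, f.name))(raw)
    return replace(base, **overrides)


os.environ["RESTAURANT_RETRIEVAL_K"] = "8"  # assumed variable name
cfg = config_from_env(MiniConfig())
print(cfg.retrieval_k, cfg.llm_temperature)
```

Fields without a matching variable keep their defaults, so the notebook would behave as before when nothing is set.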


## 📊 Data Layer - Schema Validation & Cleaning

In [4]:
class ReviewSchema(BaseModel):
    """Pydantic schema for review validation"""
    business_name: str = Field(..., min_length=1)
    review_text: str = Field(..., min_length=10)
    review_rating: float = Field(..., ge=1.0, le=5.0)
    business_address: Optional[str] = None
    
    @validator('review_text')
    def validate_text(cls, v):
        if not isinstance(v, str) or len(v.strip()) < 10:
            raise ValueError('Review text must be at least 10 characters')
        return v.strip()

class DataPipeline:
    """Production-grade data processing pipeline"""
    
    def __init__(self, config: SystemConfig):
        self.config = config
        self.df_raw = None
        self.df_cleaned = None
        
    def load_and_validate(self, path: str) -> pd.DataFrame:
        """Load the CSV and check that the required columns are present"""
        try:
            logger.info(f"Loading data from {path}")
            df = pd.read_csv(path)
            self.df_raw = df.copy()
            
            # Validate required columns
            required_cols = ['business_name', 'review_text', 'review_rating']
            missing = set(required_cols) - set(df.columns)
            if missing:
                raise ValueError(f"Missing required columns: {missing}")
            
            logger.info(f"✅ Loaded {len(df)} rows")
            return df
            
        except Exception as e:
            logger.error(f"Data loading failed: {e}")
            raise
    
    def clean_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Clean and normalize data"""
        logger.info("Starting data cleaning pipeline")
        
        # Drop nulls
        df_clean = df.dropna(subset=['review_text', 'business_name']).copy()
        logger.info(f"Removed {len(df) - len(df_clean)} null rows")
        
        # Ensure text is string
        df_clean = df_clean[df_clean['review_text'].apply(lambda x: isinstance(x, str))].copy()
        
        # Text normalization
        df_clean['review_text'] = (
            df_clean['review_text']
            .str.strip()
            .str.replace(r'\s+', ' ', regex=True)
        )
        
        # Filter short reviews
        df_clean = df_clean[df_clean['review_text'].str.len() >= 10]
        
        # Normalize restaurant names
        df_clean['business_name_normalized'] = (
            df_clean['business_name']
            .str.strip()
            .str.lower()
        )
        
        # Deduplicate
        before_dedup = len(df_clean)
        df_clean = df_clean.drop_duplicates(subset=['business_name', 'review_text'])
        logger.info(f"Removed {before_dedup - len(df_clean)} duplicate reviews")
        
        self.df_cleaned = df_clean.reset_index(drop=True)
        logger.info(f"✅ Cleaning complete - {len(self.df_cleaned)} clean rows")
        
        return self.df_cleaned
    
    def get_restaurant_index(self) -> Dict[str, int]:
        """Create restaurant name index for fast lookup"""
        if self.df_cleaned is None:
            raise ValueError("Data not cleaned yet")
        
        return (
            self.df_cleaned
            .groupby('business_name_normalized')
            .size()
            .to_dict()
        )

# Initialize and run data pipeline
data_pipeline = DataPipeline(config)
df = data_pipeline.load_and_validate(config.data_path)
df_cleaned = data_pipeline.clean_data(df)

print("\n📊 Data Summary:")
print(f"Total Reviews: {len(df_cleaned):,}")
print(f"Unique Restaurants: {df_cleaned['business_name'].nunique():,}")
print(f"Average Review Length: {df_cleaned['review_text'].str.len().mean():.0f} chars")
df_cleaned.head(3)


📊 Data Summary:
Total Reviews: 977
Unique Restaurants: 126
Average Review Length: 405 chars


Unnamed: 0,business_name,business_address,business_phone,business_website,business_rating,business_total_reviews,reviewer_name,review_rating,review_date,review_text,review_additional_info,business_name_normalized
0,Izumi Japanese Kitchen,"House 24 C, Rd 119, Dhaka 1212, Bangladesh",+880 1933-446677,https://m.facebook.com/izumiBD/,4.5,2233,"{'name': 'Raunak Maskay', 'thumbnail': 'https:...",5.0,a month ago,"Izumi Japanese Kitchen in Gulshan, Dhaka is on...",,izumi japanese kitchen
1,Izumi Japanese Kitchen,"House 24 C, Rd 119, Dhaka 1212, Bangladesh",+880 1933-446677,https://m.facebook.com/izumiBD/,4.5,2233,"{'name': 'Dewan Asif', 'thumbnail': 'https://l...",5.0,4 months ago,Izumi Japanese Kitchen is a great place for re...,,izumi japanese kitchen
2,Izumi Japanese Kitchen,"House 24 C, Rd 119, Dhaka 1212, Bangladesh",+880 1933-446677,https://m.facebook.com/izumiBD/,4.5,2233,"{'name': 'Dr. Mehruba Mona', 'thumbnail': 'htt...",5.0,Edited 8 months ago,One of the authentic Japanese restaurant in Dh...,,izumi japanese kitchen
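
The cleaning rules in `clean_data` can be traced on a handful of rows without pandas. The sketch below mirrors the same order of steps (drop non-string text, normalize whitespace, drop reviews under 10 characters, deduplicate on name and text). The function name `clean_reviews` and the sample rows are made up for illustration.

```python
import re

raw_reviews = [
    ("Izumi", "  Great   sushi,\nfriendly staff.  "),
    ("Izumi", "Great sushi, friendly staff."),   # duplicate once whitespace is normalized
    ("Izumi", "Too short"),                      # 9 characters, below the minimum
    ("Izumi", None),                             # missing text
]


def clean_reviews(rows, min_len=10):
    """Drop non-strings, normalize whitespace, filter short texts, deduplicate."""
    seen, cleaned = set(), []
    for name, text in rows:
        if not isinstance(text, str):
            continue
        text = re.sub(r"\s+", " ", text.strip())
        if len(text) < min_len:
            continue
        key = (name, text)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "business_name": name,
            "review_text": text,
            "business_name_normalized": name.strip().lower(),
        })
    return cleaned


cleaned = clean_reviews(raw_reviews)
print(cleaned)
```

Because deduplication runs after whitespace normalization, the first two rows collapse into one record.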


## 🎭 Sentiment Engine - Advanced Analysis with Fallback

In [5]:
class SentimentEngine:
    """Production sentiment analyzer with fallback mechanisms"""
    
    def __init__(self, config: SystemConfig):
        self.config = config
        self.model = None
        self.fallback_mode = False
        self._load_model()
    
    def _load_model(self):
        """Load the sentiment model; enable the rule-based fallback if loading fails"""
        try:
            logger.info(f"Loading sentiment model: {self.config.sentiment_model}")
            self.model = pipeline(
                "sentiment-analysis",
                model=self.config.sentiment_model,
                device=0 if self.config.device == "cuda" else -1
            )
            logger.info("✅ Sentiment model loaded")
        except Exception as e:
            logger.warning(f"Model loading failed: {e}. Enabling fallback mode.")
            self.fallback_mode = True
    
    def _normalize_label(self, label: str) -> str:
        """Normalize sentiment labels to standard format"""
        label_lower = label.lower()
        if 'pos' in label_lower:
            return 'positive'
        elif 'neg' in label_lower:
            return 'negative'
        else:
            return 'neutral'
    
    def _fallback_sentiment(self, text: str) -> Tuple[str, float]:
        """Rule-based fallback sentiment analysis"""
        text_lower = text.lower()
        
        positive_words = ['good', 'great', 'excellent', 'amazing', 'love', 'best', 'wonderful']
        negative_words = ['bad', 'terrible', 'awful', 'worst', 'hate', 'poor', 'disappointing']
        
        pos_count = sum(1 for word in positive_words if word in text_lower)
        neg_count = sum(1 for word in negative_words if word in text_lower)
        
        if pos_count > neg_count:
            return 'positive', 0.6
        elif neg_count > pos_count:
            return 'negative', 0.6
        else:
            return 'neutral', 0.5
    
    def analyze(self, text: str) -> Tuple[str, float]:
        """Analyze sentiment with error handling"""
        try:
            if self.fallback_mode or self.model is None:
                return self._fallback_sentiment(text)
            
            # Rough guard: keep the first 512 characters (the model's own limit is 512 tokens)
            result = self.model(text[:512])[0]
            label = self._normalize_label(result['label'])
            confidence = result['score']
            
            return label, confidence
            
        except Exception as e:
            logger.warning(f"Sentiment analysis failed for text, using fallback: {e}")
            return self._fallback_sentiment(text)
    
    def batch_analyze(self, texts: List[str]) -> pd.DataFrame:
        """Batch process with progress tracking"""
        logger.info(f"Analyzing {len(texts)} reviews")
        
        results = []
        for text in tqdm(texts, desc="Sentiment Analysis"):
            label, confidence = self.analyze(text)
            results.append({'sentiment': label, 'confidence': confidence})
        
        return pd.DataFrame(results)

# Run sentiment analysis
sentiment_engine = SentimentEngine(config)
sentiment_results = sentiment_engine.batch_analyze(df_cleaned['review_text'].tolist())

df_cleaned['overall_sentiment'] = sentiment_results['sentiment']
df_cleaned['sentiment_confidence'] = sentiment_results['confidence']

print("\n🎭 Sentiment Distribution:")
print(df_cleaned['overall_sentiment'].value_counts())
print(f"\nAverage Confidence: {df_cleaned['sentiment_confidence'].mean():.2%}")


🎭 Sentiment Distribution:
overall_sentiment
positive    726
negative    213
neutral      38
Name: count, dtype: int64

Average Confidence: 72.86%
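
The rule-based fallback counts keywords with substring checks, so "bad" also matches inside a word such as "badminton". A whole-word variant is sketched below with the same word lists and scores. The name `fallback_sentiment_tokens` is hypothetical, and this is an alternative to consider, not the notebook's method.

```python
import re
from typing import Tuple

POSITIVE = {"good", "great", "excellent", "amazing", "love", "best", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "worst", "hate", "poor", "disappointing"}


def fallback_sentiment_tokens(text: str) -> Tuple[str, float]:
    """Whole-word variant of the rule-based fallback."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    pos = len(tokens & POSITIVE)
    neg = len(tokens & NEGATIVE)
    if pos > neg:
        return "positive", 0.6
    if neg > pos:
        return "negative", 0.6
    return "neutral", 0.5


first = fallback_sentiment_tokens("Great food near the badminton court")
second = fallback_sentiment_tokens("Terrible service and poor hygiene")
print(first, second)
```

The trade-off is that inflected forms such as "loved" no longer match "love", and the ASCII-only pattern ignores Bengali text, so a production version would likely need stemming or a broader tokenizer.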





## 🔍 Aspect Extraction - Independent Analysis (NOT Copied from Overall)

In [6]:
# ==============================
# Aspect Extraction Module
# ==============================

class AspectAnalyzer:
    """Advanced aspect-based sentiment analyzer (Production Safe)"""

    def __init__(self, config: SystemConfig, sentiment_engine: SentimentEngine):
        self.config = config
        self.sentiment_engine = sentiment_engine
        self.aspects = config.aspects

    def extract_aspect_text(self, text: str, aspect: str) -> Optional[str]:
        if not isinstance(text, str) or not text.strip():
            return None

        text_lower = text.lower()
        keywords = self.aspects.get(aspect, [])

        # Aspect not mentioned at all
        if not any(keyword in text_lower for keyword in keywords):
            return None

        # Extract relevant sentences
        sentences = text.split(".")
        relevant = []

        for sentence in sentences:
            if any(keyword in sentence.lower() for keyword in keywords):
                relevant.append(sentence.strip())

        return " ".join(relevant) if relevant else text[:200]

    def analyze_aspect(self, text: str, aspect: str) -> Dict[str, Any]:
        aspect_text = self.extract_aspect_text(text, aspect)

        if aspect_text is None:
            return {
                "mentioned": False,
                "sentiment": None,
                "confidence": 0.0
            }

        sentiment, confidence = self.sentiment_engine.analyze(aspect_text)

        return {
            "mentioned": True,
            "sentiment": sentiment,
            "confidence": confidence
        }

    def batch_analyze(self, texts: List[str]) -> pd.DataFrame:
        results = []

        for text in tqdm(texts, desc="Aspect Analysis"):
            row = {}

            for aspect in self.aspects.keys():
                data = self.analyze_aspect(text, aspect)

                row[f"{aspect}_mentioned"] = bool(data["mentioned"])
                row[f"{aspect}_sentiment"] = data["sentiment"]
                row[f"{aspect}_confidence"] = float(data["confidence"])

            results.append(row)

        return pd.DataFrame(results)


# ==============================
# Run Aspect Analysis
# ==============================

aspect_analyzer = AspectAnalyzer(config, sentiment_engine)

aspect_results = aspect_analyzer.batch_analyze(
    df_cleaned["review_text"].fillna("").tolist()
)

# ------------------------------
# Remove previous aspect columns (if re-run)
# ------------------------------

aspect_prefixes = list(config.aspects.keys())

cols_to_drop = [
    col for col in df_cleaned.columns
    if any(col.startswith(prefix) for prefix in aspect_prefixes)
]

df_cleaned = df_cleaned.drop(columns=cols_to_drop, errors="ignore")

# ------------------------------
# Safe Merge
# ------------------------------

df_cleaned = pd.concat(
    [
        df_cleaned.reset_index(drop=True),
        aspect_results.reset_index(drop=True)
    ],
    axis=1
)

# ==============================
# Safe Aspect Summary
# ==============================

print("\n🔍 Aspect Analysis Summary:")

for aspect in config.aspects.keys():

    col_mentioned = f"{aspect}_mentioned"
    col_sentiment = f"{aspect}_sentiment"

    if col_mentioned not in df_cleaned.columns:
        continue

    mask = df_cleaned[col_mentioned].astype(bool)
    mentioned_count = int(mask.sum())

    if mentioned_count > 0:

        sentiment_dist = (
            df_cleaned.loc[mask, col_sentiment]
            .dropna()
            .value_counts()
        )

        print(f"\n{aspect.upper()} → {mentioned_count} mentions")
        print(sentiment_dist)


🔍 Aspect Analysis Summary:

FOOD → 837 mentions
food_sentiment
positive    595
negative    177
neutral      65
Name: count, dtype: int64

SERVICE → 508 mentions
service_sentiment
positive    400
negative     88
neutral      20
Name: count, dtype: int64

PRICE → 306 mentions
price_sentiment
positive    130
negative    113
neutral      63
Name: count, dtype: int64

AMBIENCE → 388 mentions
ambience_sentiment
positive    328
negative     41
neutral      19
Name: count, dtype: int64

CLEANLINESS → 74 mentions
cleanliness_sentiment
positive    55
negative    13
neutral      6
Name: count, dtype: int64
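
To see why aspect sentiment can differ from the overall label, it helps to look at what text each aspect actually receives. The condensed sketch below follows the same idea as `extract_aspect_text` (split on periods, keep sentences containing an aspect keyword) with a shortened, illustrative keyword table and a made-up review.

```python
from typing import Dict, List, Optional

# Shortened keyword table for illustration only.
ASPECTS: Dict[str, List[str]] = {
    "food": ["food", "taste", "dish"],
    "service": ["service", "staff", "waiter"],
}


def extract_aspect_text(text: str, aspect: str) -> Optional[str]:
    """Return only the sentences that mention a keyword for `aspect`, or None."""
    keywords = ASPECTS.get(aspect, [])
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    relevant = [s for s in sentences if any(k in s.lower() for k in keywords)]
    return " ".join(relevant) if relevant else None


review = "The food was delicious. The staff ignored us for an hour. Parking was easy."
food_text = extract_aspect_text(review, "food")
service_text = extract_aspect_text(review, "service")
print("food    ->", food_text)
print("service ->", service_text)
```

Each aspect is scored on its own sentences, so this review could come out positive for food and negative for service regardless of its overall label.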





## ⚠️ Conflict Detection - Multi-Level Analysis

In [7]:
class ConflictDetector:
    """Advanced conflict detection system"""
    
    def detect_rating_sentiment_conflict(self, row: pd.Series) -> str:
        rating = row.get('review_rating', 0)
        sentiment = row.get('overall_sentiment')
        
        if rating >= 4 and sentiment == 'negative':
            return 'Hidden Dissatisfaction'
        elif rating <= 2 and sentiment == 'positive':
            return 'Politeness Bias'
        elif rating == 3 and sentiment in ['positive', 'negative']:
            return 'Ambiguous Experience'
        else:
            return 'No Conflict'
    
    def detect_aspect_conflicts(self, row: pd.Series, aspects: List[str]) -> int:
        overall = row.get('overall_sentiment')
        conflicts = 0
        
        for aspect in aspects:
            if row.get(f'{aspect}_mentioned', False):
                aspect_sent = row.get(f'{aspect}_sentiment')
                if aspect_sent and aspect_sent != overall:
                    conflicts += 1
        
        return conflicts
    
    def analyze(self, df: pd.DataFrame, aspects: List[str]) -> pd.DataFrame:
        logger.info("Running conflict detection")
        
        df['rating_sentiment_conflict'] = df.apply(
            self.detect_rating_sentiment_conflict, axis=1
        )
        
        df['aspect_conflict_count'] = df.apply(
            lambda row: self.detect_aspect_conflicts(row, aspects), axis=1
        )
        
        df['has_conflict'] = (
            (df['rating_sentiment_conflict'] != 'No Conflict') |
            (df['aspect_conflict_count'] > 0)
        )
        
        logger.info("✅ Conflict detection complete")
        return df

conflict_detector = ConflictDetector()
df_cleaned = conflict_detector.analyze(df_cleaned, list(config.aspects.keys()))

print("\n⚠️ Conflict Analysis:")
print(df_cleaned['rating_sentiment_conflict'].value_counts())
print(f"\nTotal Conflicts: {df_cleaned['has_conflict'].sum():,} ({df_cleaned['has_conflict'].mean():.1%})")


⚠️ Conflict Analysis:
rating_sentiment_conflict
No Conflict               799
Ambiguous Experience       87
Hidden Dissatisfaction     82
Politeness Bias             9
Name: count, dtype: int64

Total Conflicts: 364 (37.3%)
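
In the output above, 178 reviews (87 + 82 + 9) carry a rating-level conflict, so the remaining 186 of the 364 flagged reviews are flagged only because an aspect label disagrees with the overall label. The rating-level rules themselves can be checked on plain values; the standalone function below restates them outside the class for illustration.

```python
def rating_sentiment_conflict(rating: float, sentiment: str) -> str:
    """Same decision rules as detect_rating_sentiment_conflict, on plain values."""
    if rating >= 4 and sentiment == "negative":
        return "Hidden Dissatisfaction"
    if rating <= 2 and sentiment == "positive":
        return "Politeness Bias"
    if rating == 3 and sentiment in ("positive", "negative"):
        return "Ambiguous Experience"
    return "No Conflict"


examples = [(5, "negative"), (1, "positive"), (3, "positive"), (3, "neutral"), (4, "positive")]
labels = [rating_sentiment_conflict(r, s) for r, s in examples]
for (rating, sentiment), label in zip(examples, labels):
    print(rating, sentiment, "->", label)
```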


## 🗄️ Vector Database - RAG Layer

In [8]:
import shutil

class VectorStoreManager:
    """Production vector database manager"""
    
    def __init__(self, config: SystemConfig):
        self.config = config
        self.embeddings = None
        self.vector_store = None
        self._initialize_embeddings()
    
    def _initialize_embeddings(self):
        try:
            logger.info(f"Loading embedding model: {self.config.embedding_model}")
            self.embeddings = HuggingFaceEmbeddings(
                model_name=self.config.embedding_model
            )
            logger.info("✅ Embedding model loaded")
        except Exception as e:
            logger.error(f"Embedding model loading failed: {e}")
            raise
    
    def create_documents(self, df: pd.DataFrame) -> List[Document]:
        logger.info(f"Creating {len(df)} documents")
        documents = []
        
        for _, row in df.iterrows():
            aspect_summary = {}
            for aspect in self.config.aspects.keys():
                if row.get(f'{aspect}_mentioned', False):
                    aspect_summary[aspect] = row.get(f'{aspect}_sentiment')
            
            doc = Document(
                page_content=row['review_text'],
                metadata={
                    'restaurant': row['business_name'],
                    'restaurant_normalized': row['business_name_normalized'],
                    'rating': float(row['review_rating']),
                    'sentiment': row['overall_sentiment'],
                    'confidence': float(row['sentiment_confidence']),
                    'aspects': json.dumps(aspect_summary),
                    'conflict': row['rating_sentiment_conflict'],
                    'has_conflict': bool(row['has_conflict'])
                }
            )
            documents.append(doc)
        
        return documents
    
    def build_vector_store(self, documents: List[Document]) -> Chroma:
        try:
            if os.path.exists(self.config.vector_db_path):
                shutil.rmtree(self.config.vector_db_path)
            
            logger.info("Building vector store...")
            self.vector_store = Chroma.from_documents(
                documents=documents,
                embedding=self.embeddings,
                persist_directory=self.config.vector_db_path,
                client_settings=Settings(anonymized_telemetry=False)
            )
            
            logger.info("✅ Vector store built")
            return self.vector_store
        except Exception as e:
            logger.error(f"Vector store creation failed: {e}")
            raise

vector_manager = VectorStoreManager(config)
documents = vector_manager.create_documents(df_cleaned)
vector_store = vector_manager.build_vector_store(documents)

print(f"\n🗄️ Vector Store Ready - {len(documents):,} documents")


🗄️ Vector Store Ready - 977 documents
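
Conceptually, retrieval ranks stored review embeddings by closeness to the query embedding and returns the top `k`. The pure-Python sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors. Chroma performs its own indexing and uses its own configured distance metric, so this is an illustration of the principle and not a description of its internals.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], docs: List[Tuple[str, List[float]]], k: int = 2):
    """Rank (text, vector) pairs by cosine similarity to the query vector."""
    scored = [(cosine(query, vec), text) for text, vec in docs]
    return sorted(scored, reverse=True)[:k]


# Made-up 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384 dimensions.
docs = [
    ("Sushi was fresh and tasty", [0.9, 0.1, 0.0]),
    ("Waiter was rude", [0.1, 0.9, 0.1]),
    ("Ramen had rich flavor", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "how is the food?"
results = top_k(query_vec, docs, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```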


## Mock LLM

In [None]:
class MockLLM(LLM):
    """Mock LLM for demonstration (replace with real LLM in production)"""

    @property
    def _llm_type(self) -> str:
        return "mock"

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Optional[Any] = None, **kwargs: Any) -> str:
        # Extract context from prompt
        if "Context:" in prompt:
            context_section = prompt.split("Context:")[1].split("Question:")[0]
            reviews = [r.strip() for r in context_section.strip().split("\n\n") if r.strip()]

            # Simple rule-based response
            if len(reviews) > 0:
                return f"Based on {len(reviews)} reviews analyzed, I can provide insights. The reviews show varying experiences across different aspects."
            else:
                return "Insufficient review evidence to provide a reliable recommendation."

        return "I can only answer based on the provided review context."

llm = MockLLM()
print("✅ Mock LLM initialized (replace with real LLM for production)")
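
The mock never reads the reviews; its reply depends only on how many blank-line-separated chunks sit between `Context:` and `Question:` in the prompt. The helper below, with the hypothetical name `count_context_reviews`, isolates that parsing step so it can be checked on a small prompt with made-up review text.

```python
def count_context_reviews(prompt: str) -> int:
    """Count blank-line-separated chunks between 'Context:' and 'Question:'."""
    if "Context:" not in prompt:
        return 0
    context = prompt.split("Context:")[1].split("Question:")[0]
    return len([chunk for chunk in context.strip().split("\n\n") if chunk.strip()])


prompt = (
    "Context:\n"
    "Great sushi and friendly staff.\n\n"
    "Service was slow on weekends.\n\n"
    "Question: Is the service good?\n\nAnswer:"
)
n_reviews = count_context_reviews(prompt)
print(n_reviews)
```

This makes the mock useful for exercising the retrieval and UI plumbing, but it says nothing about answer quality.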

## RAG Chatbot

In [None]:
class RAGChatbot:
    """Production RAG chatbot with grounding"""

    def __init__(self, vector_store: Chroma, llm: LLM, config: SystemConfig):
        self.vector_store = vector_store
        self.llm = llm
        self.config = config
        self.qa_chain = self._build_chain()

    def _build_chain(self):
        template = """You are a restaurant intelligence advisor. Answer ONLY using the provided reviews below.

CRITICAL RULES:
- Only use information from the Context section
- If insufficient evidence, respond: "Insufficient review evidence to provide a reliable recommendation."
- Never make assumptions or use external knowledge
- Always cite specific reviews when making claims

Context:
{context}

Question: {question}

Answer (grounded in reviews only):"""

        prompt = PromptTemplate(
            template=template,
            input_variables=["context", "question"]
        )

        chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.vector_store.as_retriever(
                search_kwargs={"k": self.config.retrieval_k}        
            ),
            chain_type_kwargs={"prompt": prompt},
            return_source_documents=True
        )

        return chain

    def query(self, question: str, restaurant_filter: Optional[str] = None) -> Dict[str, Any]:
        """Query with optional restaurant filtering"""
        try:
            if restaurant_filter:
                # Filter by restaurant
                filter_dict = {"restaurant_normalized": restaurant_filter.lower()}
                docs = self.vector_store.similarity_search(
                    question, k=self.config.retrieval_k, filter=filter_dict
                )

                if not docs:
                    return {
                        "answer": f"No reviews found for '{restaurant_filter}'",
                        "sources": [],
                        "confidence": 0.0
                    }

                # Build context manually
                context = "\n\n".join([doc.page_content for doc in docs])
                answer = self.llm.invoke(f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:")

                return {
                    "answer": answer,
                    "sources": docs,
                    "confidence": 0.8
                }
            else:
                result = self.qa_chain.invoke({"query": question})
                return {
                    "answer": result['result'],
                    "sources": result.get('source_documents', []),  
                    "confidence": 0.8
                }
        except Exception as e:
            logger.error(f"Query failed: {e}")
            return {
                "answer": "An error occurred processing your query.",
                "sources": [],
                "confidence": 0.0
            }

chatbot = RAGChatbot(vector_store, llm, config)
print("✅ RAG Chatbot initialized")
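
The prompt asks the model to cite specific reviews, but the context it receives is only the raw review text. One way to make citation possible is to number each review and prepend its metadata. The sketch below assumes plain dictionaries shaped like the documents built earlier (`page_content` plus `metadata`); `format_context` is a hypothetical helper and the review texts are invented.

```python
from typing import Dict, List


def format_context(docs: List[Dict]) -> str:
    """Number each review and prefix it with restaurant, rating, and sentiment."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        meta = doc["metadata"]
        header = (
            f"[Review {i}] {meta['restaurant']} | "
            f"rating {meta['rating']} | {meta['sentiment']}"
        )
        blocks.append(f"{header}\n{doc['page_content']}")
    return "\n\n".join(blocks)


docs = [
    {"page_content": "Fresh sashimi, attentive staff.",
     "metadata": {"restaurant": "Izumi Japanese Kitchen", "rating": 5.0, "sentiment": "positive"}},
    {"page_content": "Food arrived cold after a long wait.",
     "metadata": {"restaurant": "Izumi Japanese Kitchen", "rating": 2.0, "sentiment": "negative"}},
]
context = format_context(docs)
print(context)
```

With numbered headers in the context, an answer can refer to "Review 2" and a reader can check the claim against the source.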

## Recommendation Engine

In [None]:
class RecommendationEngine:
    """Advanced recommendation scoring system"""

    def __init__(self, df: pd.DataFrame, config: SystemConfig):     
        self.df = df
        self.config = config

    def calculate_score(self, restaurant_name: str) -> Dict[str, Any]:
        """Calculate comprehensive recommendation score"""
        restaurant_df = self.df[
            self.df['business_name_normalized'] == restaurant_name.lower()
        ]

        if len(restaurant_df) == 0:
            return {"error": "Restaurant not found"}

        if len(restaurant_df) < 3:
            return {"error": "Insufficient reviews (minimum 3 required)", "count": len(restaurant_df)}

        # Sentiment Distribution (40%)
        sentiment_score = (restaurant_df['overall_sentiment'] == 'positive').mean() * 40

        # Aspect Scores (30%)
        aspect_scores = {}
        aspect_total = 0
        for aspect in self.config.aspects.keys():
            mentioned = restaurant_df[f'{aspect}_mentioned']        
            if mentioned.sum() > 0:
                pos_rate = (
                    restaurant_df[mentioned][f'{aspect}_sentiment'] == 'positive'
                ).mean()
                aspect_scores[aspect] = pos_rate
                aspect_total += pos_rate

        aspect_score = (aspect_total / len(aspect_scores)) * 30 if aspect_scores else 0

        # Conflict Penalty (15%)
        conflict_rate = restaurant_df['has_conflict'].mean()        
        conflict_score = (1 - conflict_rate) * 15

        # Volume Bonus (10%)
        volume_score = min(len(restaurant_df) / 100, 1.0) * 10      

        # Confidence (5%)
        confidence_score = restaurant_df['sentiment_confidence'].mean() * 5

        total_score = sentiment_score + aspect_score + conflict_score + volume_score + confidence_score

        # Extract insights
        strengths = []
        weaknesses = []

        for aspect, score in aspect_scores.items():
            if score > 0.7:
                strengths.append(f"Excellent {aspect}")
            elif score < 0.4:
                weaknesses.append(f"Poor {aspect}")

        risk_factors = []
        if conflict_rate > 0.15:
            risk_factors.append(f"{conflict_rate:.0%} of reviews show rating or aspect conflicts")

        return {
            "score": round(total_score, 1),
            "review_count": len(restaurant_df),
            "sentiment_distribution": restaurant_df['overall_sentiment'].value_counts().to_dict(),
            "aspect_scores": {k: round(v*100, 1) for k, v in aspect_scores.items()},
            "strengths": strengths,
            "weaknesses": weaknesses,
            "risk_factors": risk_factors,
            "conflict_rate": round(conflict_rate * 100, 1)
        }

rec_engine = RecommendationEngine(df_cleaned, config)
print("✅ Recommendation Engine initialized")
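
The score is a weighted sum of five components with weights 40, 30, 15, 10, and 5, so it tops out at 100. A worked example with invented inputs makes the arithmetic concrete; `weighted_score` is a standalone restatement for illustration, not a replacement for `calculate_score`.

```python
def weighted_score(positive_rate, aspect_rates, conflict_rate, review_count, mean_confidence):
    """Combine the five components with the 40/30/15/10/5 weights."""
    sentiment = positive_rate * 40
    aspect = (sum(aspect_rates) / len(aspect_rates)) * 30 if aspect_rates else 0
    conflict = (1 - conflict_rate) * 15
    volume = min(review_count / 100, 1.0) * 10
    confidence = mean_confidence * 5
    return round(sentiment + aspect + conflict + volume + confidence, 1)


# Hypothetical restaurant: 75% positive reviews, two aspects mentioned
# (100% and 50% positive), half the reviews flagged as conflicts,
# 50 reviews, mean model confidence 0.5.
score = weighted_score(0.75, [1.0, 0.5], 0.5, 50, 0.5)
print(score)  # 30 + 22.5 + 7.5 + 5 + 2.5 = 67.5
```

Note that the volume term only saturates at 100 reviews, so a restaurant with few reviews cannot reach the maximum however positive they are.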

## Interactive Chat Interface

In [None]:
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets

# Chat history
chat_history = []

# Create UI components
output_area = widgets.Output()
question_input = widgets.Text(
    placeholder='Ask about restaurants (e.g., "Best food quality?")',
    description='Question:',
    layout=widgets.Layout(width='70%')
)
restaurant_input = widgets.Text(
    placeholder='Optional: Filter by restaurant name',
    description='Restaurant:',
    layout=widgets.Layout(width='70%')
)
send_button = widgets.Button(
    description='Send',
    button_style='primary',
    icon='paper-plane'
)
clear_button = widgets.Button(
    description='Clear Chat',
    button_style='warning',
    icon='trash'
)

def format_chat_message(role, message, sources=None):
    """Format chat message with styling"""
    if role == "user":
        return f'''
        <div style="background: #e3f2fd; padding: 10px; margin: 5px 0; border-radius: 10px; border-left: 4px solid #2196F3;">
            <strong>🧑 You:</strong> {message}
        </div>
        '''
    else:
        sources_html = ""
        if sources and len(sources) > 0:
            sources_html = f"<br><small>📚 Based on {len(sources)} reviews</small>"
        return f'''
        <div style="background: #f1f8e9; padding: 10px; margin: 5px 0; border-radius: 10px; border-left: 4px solid #8BC34A;">
            <strong>🤖 Assistant:</strong> {message}{sources_html}
        </div>
        '''

def send_message(b):
    """Handle send button click"""
    question = question_input.value.strip()
    restaurant = restaurant_input.value.strip() if restaurant_input.value.strip() else None

    if not question:
        with output_area:
            clear_output(wait=True)
            for msg in chat_history:
                display(HTML(msg))
            display(HTML('<p style="color: red;">⚠️ Please enter a question</p>'))
        return

    # Add user message to history
    user_msg = format_chat_message("user", question + (f" (Restaurant: {restaurant})" if restaurant else ""))
    chat_history.append(user_msg)

    # Get response
    result = chatbot.query(question, restaurant_filter=restaurant)  

    # Add bot response to history
    bot_msg = format_chat_message("assistant", result['answer'], result.get('sources'))
    chat_history.append(bot_msg)

    # Update display
    with output_area:
        clear_output(wait=True)
        for msg in chat_history:
            display(HTML(msg))

    # Clear inputs
    question_input.value = ""
    restaurant_input.value = ""

def clear_chat(b):
    """Clear chat history"""
    global chat_history
    chat_history = []
    with output_area:
        clear_output()
        display(HTML('<p style="color: #666;">Chat cleared. Start a new conversation!</p>'))

# Attach event handlers
send_button.on_click(send_message)
clear_button.on_click(clear_chat)
question_input.on_submit(lambda x: send_message(None))

# Display UI
print("\n" + "="*80)
print("🤖 INTERACTIVE RESTAURANT CHATBOT")
print("="*80)
print("\nAsk questions about restaurants or get recommendations!")  
print("\nExample questions:")
print("  - Which restaurant has the best food?")
print("  - Tell me about the service quality")
print("  - Is Izumi Japanese Kitchen good for couples?")
print("  - What do people say about prices?")
print("\n" + "="*80 + "\n")

display(widgets.VBox([
    widgets.HTML("<h3>🍽️ Restaurant Intelligence Chatbot</h3>"),
    question_input,
    restaurant_input,
    widgets.HBox([send_button, clear_button]),
    output_area
]))
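
`format_chat_message` interpolates the user's text and the model's answer directly into HTML, so any markup typed into the question box would be rendered. A small hardening step is to escape the text first with the standard library's `html.escape`. The function `safe_user_message` below is a hypothetical, simplified stand-in for the user branch of the formatter.

```python
import html


def safe_user_message(message: str) -> str:
    """Escape HTML-special characters before interpolating text into markup."""
    return f"<div><strong>You:</strong> {html.escape(message)}</div>"


rendered = safe_user_message('Is <b>Izumi</b> good? "5 stars" & cheap')
print(rendered)
```

The same escaping would apply to the assistant branch, since a real LLM's output is also untrusted text.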