## Part 1: Input Preprocessing

### 1.a Intent Classification

In [1]:
%pip install openai python-dotenv neo4j -q


In [2]:
from typing import Dict, List, Any, Optional
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from neo4j import GraphDatabase

load_dotenv()

True

#### 5 Intent Categories

| Intent | Purpose | Example |
|--------|---------|----------|
| LIST_HOTELS | Find multiple hotels | "Show hotels in Paris" |
| RECOMMEND_HOTEL | Get personalized suggestions | "Recommend a hotel for families" |
| DESCRIBE_HOTEL | Get details about one hotel | "Tell me about Hilton Cairo" |
| COMPARE_HOTELS | Compare multiple hotels | "Compare Hilton vs Marriott" |
| CHECK_VISA | Visa requirements | "Do I need a visa for Turkey?" |

In [3]:
class IntentClassifier:
    
    def __init__(self):
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.model = "gpt-4o-mini"
        self.intents = {
            "LIST_HOTELS": "Find multiple hotels matching filters",
            "RECOMMEND_HOTEL": "Get personalized suggestions",
            "DESCRIBE_HOTEL": "Get details about one hotel",
            "COMPARE_HOTELS": "Compare multiple hotels",
            "CHECK_VISA": "Check visa requirements"
        }
    
    def classify(self, user_query: str) -> Optional[str]:
        prompt = f"""Classify this query into ONE intent: {list(self.intents.keys())} or return NONE.
        
        Intent definitions:
        - LIST_HOTELS: neutral search (keywords: show, find, list)
        - RECOMMEND_HOTEL: opinions/advice (keywords: recommend, suggest, best, top)
        - DESCRIBE_HOTEL: one specific hotel (must mention hotel name)
        - COMPARE_HOTELS: multiple hotels (keywords: compare, vs, which is better)
        - CHECK_VISA: visa requirements (keywords: visa, entry requirement)
        
        Query: \"{user_query}\"
        
        Return ONLY the intent name or NONE."""
        
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
                max_tokens=20
            )
            # message.content can be None, so fall back to an empty string before normalizing.
            intent = (response.choices[0].message.content or "").strip().upper()
            return intent if intent in self.intents else None
        except Exception as e:
            print(f"Error: {e}")
            return None

classifier = IntentClassifier()
print("IntentClassifier initialized")

IntentClassifier initialized


In [4]:
test_intents = [
    ("Show me hotels in Paris", "LIST_HOTELS"),
    ("Recommend a hotel for families in Dubai", "RECOMMEND_HOTEL"),
    ("Tell me about Hilton Cairo", "DESCRIBE_HOTEL"),
    ("Compare Hilton and Marriott", "COMPARE_HOTELS"),
    ("Do I need a visa for Turkey?", "CHECK_VISA"),
    ("What's the weather?", None),
]

correct = 0
for query, expected in test_intents:
    result = classifier.classify(query)
    status = "PASS" if result == expected else "FAIL"
    if result == expected:
        correct += 1
    print(f"[{status}] '{query}' -> {result}")

print(f"\nIntent Classification: {correct}/{len(test_intents)} passed")

[PASS] 'Show me hotels in Paris' -> LIST_HOTELS
[PASS] 'Recommend a hotel for families in Dubai' -> RECOMMEND_HOTEL
[PASS] 'Tell me about Hilton Cairo' -> DESCRIBE_HOTEL
[PASS] 'Compare Hilton and Marriott' -> COMPARE_HOTELS
[PASS] 'Do I need a visa for Turkey?' -> CHECK_VISA
[PASS] 'What's the weather?' -> None

Intent Classification: 6/6 passed
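
The classifier accepts a reply only when it matches an intent name exactly, so a reply such as `"LIST_HOTELS".` would be rejected. A minimal sketch of a more tolerant normalization step follows. The helper name `normalize_intent` is hypothetical and not part of the notebook, and it only strips quotes, punctuation and whitespace; anything longer (for example a reply prefixed with `Intent:`) still maps to `None`.

```python
import re
from typing import Optional

# The five labels from the intent table above.
INTENTS = {"LIST_HOTELS", "RECOMMEND_HOTEL", "DESCRIBE_HOTEL", "COMPARE_HOTELS", "CHECK_VISA"}


def normalize_intent(raw: Optional[str]) -> Optional[str]:
    """Map a raw model reply to a known intent label, or None (hypothetical helper)."""
    if not raw:
        return None
    # Keep letters and underscores only, which drops quotes, periods and whitespace.
    cleaned = re.sub(r"[^A-Za-z_]", "", raw).upper()
    return cleaned if cleaned in INTENTS else None


print(normalize_intent('"list_hotels".'))
print(normalize_intent("NONE"))
```

Replies that are not one of the five labels, including `NONE`, come back as `None`, which matches how `classify` treats unrecognized output.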


### 1.b Entity Extraction

#### Entity Schemas by Intent

| Intent | Entities |
|--------|----------|
| LIST_HOTELS | city, country, star_rating |
| RECOMMEND_HOTEL | city, country, traveller_type, age_group, user_gender, star_rating, aspects |
| DESCRIBE_HOTEL | hotel_name, aspects |
| COMPARE_HOTELS | hotel1, hotel2, traveller_type, aspects |
| CHECK_VISA | from_country, to_country |

In [5]:
SCHEMAS: Dict[str, Dict[str, Any]] = {
    "LIST_HOTELS": {"city": None, "country": None, "star_rating": None},
    "RECOMMEND_HOTEL": {"city": None, "country": None, "traveller_type": None, 
                        "age_group": None, "user_gender": None, "star_rating": None, "aspects": None},
    "DESCRIBE_HOTEL": {"hotel_name": None, "aspects": None},
    "COMPARE_HOTELS": {"hotel1": None, "hotel2": None, "traveller_type": None, "aspects": None},
    "CHECK_VISA": {"from_country": None, "to_country": None}
}

ALLOWED_ASPECTS = ["cleanliness", "comfort", "facilities", "location", "staff", "value_for_money"]

print(f"Schemas: {list(SCHEMAS.keys())}")
print(f"Allowed aspects: {ALLOWED_ASPECTS}")

Schemas: ['LIST_HOTELS', 'RECOMMEND_HOTEL', 'DESCRIBE_HOTEL', 'COMPARE_HOTELS', 'CHECK_VISA']
Allowed aspects: ['cleanliness', 'comfort', 'facilities', 'location', 'staff', 'value_for_money']


In [6]:
def enforce_schema(intent: str, entities: Dict[str, Any]) -> Dict[str, Any]:
    schema = SCHEMAS[intent]
    result = {}
    
    for key in schema.keys():
        value = entities.get(key, None)
        
        if key == "star_rating" and value is not None:
            try:
                value = int(value)
                if not (1 <= value <= 5):
                    value = None
            except (ValueError, TypeError):
                value = None
        
        elif key == "aspects" and value is not None:
            if isinstance(value, str):
                value = [value]
            if isinstance(value, list):
                normalized = []
                for asp in value:
                    if isinstance(asp, str):
                        asp_clean = asp.lower().strip().replace(" ", "_").replace("-", "_")
                        if asp_clean in ALLOWED_ASPECTS:
                            normalized.append(asp_clean)
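                # dict.fromkeys removes duplicates while keeping first-seen order,
                # so the result is deterministic (a set does not guarantee order).
                value = list(dict.fromkeys(normalized)) if normalized else None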
            else:
                value = None
        
        elif key == "traveller_type" and value is not None:
            # Non-string values (e.g. a list) are rejected instead of passing through unvalidated.
            value = value.lower().strip() if isinstance(value, str) else None
            if value not in ["family", "solo", "couple", "business", "group"]:
                value = None
        
        elif key == "user_gender" and value is not None:
            # Same rule as traveller_type: only known string values survive.
            value = value.lower().strip() if isinstance(value, str) else None
            if value not in ["male", "female"]:
                value = None
        
        result[key] = value
    
    return result

print("Schema enforcement function defined")

Schema enforcement function defined
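
The aspect branch of `enforce_schema` can be read in isolation as: accept a string or a list, normalize each item to snake_case, and keep only allowed values. Below is a standalone sketch of that rule. The function name `normalize_aspects` is hypothetical, and this version keeps first-seen order when removing duplicates, which is an assumption about the desired behavior rather than something the notebook states.

```python
from typing import Any, List, Optional

ALLOWED_ASPECTS = ["cleanliness", "comfort", "facilities", "location", "staff", "value_for_money"]


def normalize_aspects(value: Any) -> Optional[List[str]]:
    """Return the allowed aspects found in value, or None if there are none (hypothetical helper)."""
    if isinstance(value, str):
        value = [value]
    if not isinstance(value, list):
        return None
    cleaned: List[str] = []
    for item in value:
        if isinstance(item, str):
            # "Value for Money" and "value-for-money" both become "value_for_money".
            key = item.lower().strip().replace(" ", "_").replace("-", "_")
            if key in ALLOWED_ASPECTS and key not in cleaned:
                cleaned.append(key)
    return cleaned or None


print(normalize_aspects(["Value for Money", "staff", "STAFF", "pool"]))
# -> ['value_for_money', 'staff']
```

Unknown aspects such as `pool` are dropped silently, so a downstream template never sees a name that has no matching review score.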


In [7]:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def extract_entities(text: str, intent: str) -> Dict[str, Any]:
    if intent not in SCHEMAS:
        return dict(SCHEMAS.get("LIST_HOTELS", {}))
    
    prompt = f"""Extract entities from this query and return ONLY valid JSON.
    
    Query: \"{text}\"
    Intent: {intent}
    Required keys: {list(SCHEMAS[intent].keys())}
    
    RULES:
    1. Extract ONLY explicitly mentioned entities
    2. Vague words (good, best, nice) do NOT extract aspects
    3. Aspects only if explicitly mentioned (e.g., "clean rooms" -> cleanliness)
    4. For possessive forms (e.g., "hotel's cleanliness"), extract the aspect after the possessive
    5. Preserve complete hotel names including articles (e.g., "The Azure Tower", not "Azure Tower")
    6. Allowed aspects: {ALLOWED_ASPECTS}
    7. traveller_type: family, solo, couple, business, group
    8. user_gender: male, female
    9. star_rating: 1-5 (numeric)
    
    Return ONLY JSON matching: {SCHEMAS[intent]}"""
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=200
        )
        
        raw = response.choices[0].message.content.strip()
        if raw.startswith("```"):
            raw = raw.split("```")[1]
            if raw.startswith("json"):
                raw = raw[4:]
            raw = raw.strip()
        
        entities = json.loads(raw)
        return enforce_schema(intent, entities)
    except Exception as e:
        print(f"Error: {e}")
        return dict(SCHEMAS[intent])

print("Entity extraction function defined")

Entity extraction function defined
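
Models sometimes wrap JSON in a Markdown code fence even when asked for raw JSON, which is why `extract_entities` strips a fence before calling `json.loads`. The sketch below isolates that parsing step so it can be tested without an API call. The name `parse_llm_json` is hypothetical, and the extra check that the result is a JSON object is an added assumption.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Parse a JSON object from a model reply, tolerating one surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Take the content between the opening and closing fence.
        text = text.split(FENCE)[1]
        # Drop an optional "json" language tag after the opening fence.
        if text.startswith("json"):
            text = text[4:]
    parsed = json.loads(text.strip())
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


fenced_reply = FENCE + 'json\n{"city": "Dubai", "country": null}\n' + FENCE
print(parse_llm_json(fenced_reply))
# -> {'city': 'Dubai', 'country': None}
```

A reply that is not valid JSON still raises `json.JSONDecodeError`, so the caller keeps its existing fallback to an empty schema.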


In [8]:
test_cases = [
    ("Recommend a hotel for families in Dubai with clean rooms", "RECOMMEND_HOTEL"),
    ("I want good hotels for family in Paris", "RECOMMEND_HOTEL"),
    ("Compare Hilton Cairo and Marriott Cairo", "COMPARE_HOTELS"),
    ("Describe Hilton Dubai", "DESCRIBE_HOTEL"),
    ("Do Egyptians need a visa for Turkey?", "CHECK_VISA"),
]

for query, intent in test_cases:
    result = extract_entities(query, intent)
    print(f"\nQuery: {query}")
    print(f"Intent: {intent}")
    print(f"Extracted: {json.dumps({k: v for k, v in result.items() if v is not None}, indent=2)}")


Query: Recommend a hotel for families in Dubai with clean rooms
Intent: RECOMMEND_HOTEL
Extracted: {
  "city": "Dubai",
  "traveller_type": "family",
  "aspects": [
    "cleanliness"
  ]
}

Query: I want good hotels for family in Paris
Intent: RECOMMEND_HOTEL
Extracted: {
  "city": "Paris",
  "traveller_type": "family"
}

Query: Compare Hilton Cairo and Marriott Cairo
Intent: COMPARE_HOTELS
Extracted: {
  "hotel1": "Hilton Cairo",
  "hotel2": "Marriott Cairo"
}

Query: Describe Hilton Dubai
Intent: DESCRIBE_HOTEL
Extracted: {
  "hotel_name": "Hilton Dubai"
}

Query: Do Egyptians need a visa for Turkey?
Intent: CHECK_VISA
Extracted: {
  "from_country": "Egypt",
  "to_country": "Turkey"
}


### 1.c Input Embedding

In [29]:
%pip install sentence-transformers -q



In [None]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print("Primary embedding model loaded: all-MiniLM-L6-v2")

In [None]:
def embed_query(query: str):
    embedding = embedder.encode([query], convert_to_numpy=True)
    return embedding[0]

print("Query embedding function defined")
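

`embed_query` returns one dense vector per query (384 dimensions for `all-MiniLM-L6-v2`). Such vectors are usually compared with cosine similarity. The pure-Python sketch below shows the computation on plain sequences of floats; it is illustrative only, and in practice a vectorized routine such as `sentence_transformers.util.cos_sim` or NumPy would likely be used instead.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (illustrative sketch)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen here: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
# -> 0.0
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions.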

## Part 2: Graph Retrieval

### 2.a Baseline (Cypher Queries)

In [9]:
class Neo4jConnection:
    def __init__(self, config_path=None):
        if config_path is None:
            possible_paths = [
                os.path.join('KnowledgeGraph', 'config.txt'),
                os.path.join('Milestone 3', 'KnowledgeGraph', 'config.txt'),
            ]
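            config_path = None
            for path in possible_paths:
                if os.path.exists(path):
                    config_path = path
                    break
            # Fail with a clear message instead of passing None to open() below.
            if config_path is None:
                raise FileNotFoundError(f"config.txt not found in any of: {possible_paths}")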
        
        config = {}
        with open(config_path, 'r') as f:
            for line in f:
                if '=' in line:
                    key, value = line.strip().split('=', 1)
                    config[key] = value
        
        self.driver = GraphDatabase.driver(config['URI'], auth=(config['USERNAME'], config['PASSWORD']))
    
    def execute_query(self, query, parameters=None):
        with self.driver.session() as session:
            result = session.run(query, parameters or {})
            return [dict(record) for record in result]
    
    def close(self):
        self.driver.close()

print("Neo4jConnection class defined")

Neo4jConnection class defined
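
`Neo4jConnection` reads its settings from a `KEY=VALUE` text file and expects the keys `URI`, `USERNAME` and `PASSWORD`. The parsing rule can be sketched on its own as shown below. The function name `parse_config` and the sample values are hypothetical, and skipping `#` comment lines is an added assumption that the notebook's parser does not make.

```python
from typing import Dict, Iterable


def parse_config(lines: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE lines into a dict (hypothetical helper)."""
    config: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments (an assumption) and lines without a separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


sample = [
    "# placeholder values, not real credentials",
    "URI=neo4j://localhost:7687",
    "USERNAME=neo4j",
    "PASSWORD=pa=ss",
    "",
]
print(parse_config(sample))
```

Splitting on the first `=` only matters for passwords or URIs that contain the separator character.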


#### 14 Query Templates by Intent

- **LIST_HOTELS**: L1-L5 (5 variants)
- **RECOMMEND_HOTEL**: R1, R3-R5 (4 variants, R2 removed)
- **DESCRIBE_HOTEL**: D1-D2 (2 variants)
- **COMPARE_HOTELS**: C1-C2 (2 variants)
- **CHECK_VISA**: V1 (1 variant)

In [10]:
class QueryLibrary:
    
    @staticmethod
    def template_L1_list_by_city(conn: Neo4jConnection, city: str):
        query = """MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE city.name = $city RETURN h.name, city.name AS city, country.name AS country, h.star_rating
        ORDER BY h.star_rating DESC LIMIT 50"""
        return conn.execute_query(query, {'city': city})
    
    @staticmethod
    def template_L2_list_by_country(conn: Neo4jConnection, country: str):
        query = """MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE country.name = $country RETURN h.name, city.name AS city, country.name AS country, h.star_rating
        ORDER BY h.star_rating DESC LIMIT 50"""
        return conn.execute_query(query, {'country': country})
    
    @staticmethod
    def template_L3_list_by_rating(conn: Neo4jConnection, star_rating: int):
        query = """MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE h.star_rating = $star_rating RETURN h.name, city.name AS city, country.name AS country, h.star_rating
        ORDER BY h.name LIMIT 50"""
        return conn.execute_query(query, {'star_rating': star_rating})
    
    @staticmethod
    def template_L4_list_by_city_and_rating(conn: Neo4jConnection, city: str, star_rating: int):
        query = """MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE city.name = $city AND h.star_rating = $star_rating 
        RETURN h.name, city.name AS city, country.name AS country, h.star_rating LIMIT 50"""
        return conn.execute_query(query, {'city': city, 'star_rating': star_rating})
    
    @staticmethod
    def template_L5_list_by_country_and_rating(conn: Neo4jConnection, country: str, star_rating: int):
        query = """MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE country.name = $country AND h.star_rating = $star_rating
        RETURN h.name, city.name AS city, country.name AS country, h.star_rating LIMIT 50"""
        return conn.execute_query(query, {'country': country, 'star_rating': star_rating})
    
    @staticmethod
    def template_R1_recommend_by_location(conn: Neo4jConnection, city: str, star_rating: int = None):
        where_parts = ["city.name = $city"]
        params = {'city': city}
        if star_rating:
            where_parts.append("h.star_rating = $star_rating")
            params['star_rating'] = star_rating
        where_clause = " AND ".join(where_parts)
        
        query = f"""MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE {where_clause} OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review)
        WITH h, city, country, collect(r) AS reviews WHERE size(reviews) > 0 UNWIND reviews AS r
        RETURN h.name, city.name AS city, country.name AS country, avg(r.score_overall) AS overall_score
        ORDER BY overall_score DESC LIMIT 10"""
        return conn.execute_query(query, params)
    
    @staticmethod
    def template_R3_recommend_by_aspects(conn: Neo4jConnection, city: str, aspects: List[str], 
                                        age_group=None, user_gender=None, star_rating: int = None):
        aspect_mapping = {'cleanliness': 'score_cleanliness', 'comfort': 'score_comfort', 'facilities': 'score_facilities',
                         'location': 'score_location', 'staff': 'score_staff', 'value_for_money': 'score_value_for_money'}
        
        valid_aspects = [a for a in (aspects or []) if a in aspect_mapping]
        if not valid_aspects:
            return []
        
        aspect_avg = " + ".join([f"coalesce(avg(r.{aspect_mapping[a]}), 0)" for a in valid_aspects])
        aspect_select = ", ".join([f"avg(r.{aspect_mapping[a]}) AS {a}_review" for a in valid_aspects])
        
        where_parts = ["city.name = $city"]
        params = {'city': city}
        if star_rating:
            where_parts.append("h.star_rating = $star_rating")
            params['star_rating'] = star_rating
        
        demo_conditions = []
        if age_group:
            demo_conditions.append("u.age_group = $age_group")
            params['age_group'] = age_group
        if user_gender:
            demo_conditions.append("u.gender = $user_gender")
            params['user_gender'] = user_gender
        
        where_clause = " AND ".join(where_parts)
        user_match = "<-[:WROTE]-(u:User)" if demo_conditions else ""
        demo_clause = " AND " + " AND ".join(demo_conditions) if demo_conditions else ""
        
        query = f"""MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE {where_clause} OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review){user_match} WHERE TRUE{demo_clause}
        WITH h, city, country, collect(r) AS reviews WHERE size(reviews) > 0 UNWIND reviews AS r
        RETURN h.name, city.name AS city, country.name AS country, {aspect_select},
        ({aspect_avg}) / {len(valid_aspects)} AS composite_aspect_score, count(r) AS review_count
        ORDER BY composite_aspect_score DESC LIMIT 10"""
        
        return conn.execute_query(query, params)
    
    @staticmethod
    def template_R4_recommend_by_traveller_and_aspects(conn: Neo4jConnection, city: str, traveller_type: str, 
                                                       aspects: List[str], age_group=None, user_gender=None, star_rating: int = None):
        aspect_mapping = {'cleanliness': 'score_cleanliness', 'comfort': 'score_comfort', 'facilities': 'score_facilities',
                         'location': 'score_location', 'staff': 'score_staff', 'value_for_money': 'score_value_for_money'}
        
        valid_aspects = [a for a in (aspects or []) if a in aspect_mapping]
        if not valid_aspects:
            return QueryLibrary.template_R3_recommend_by_aspects(conn, city, aspects, age_group, user_gender, star_rating)
        
        aspect_avg = " + ".join([f"coalesce(avg(r.{aspect_mapping[a]}), 0)" for a in valid_aspects])
        aspect_select = ", ".join([f"avg(r.{aspect_mapping[a]}) AS {a}_review" for a in valid_aspects])
        
        where_parts = ["city.name = $city"]
        params = {'city': city, 'traveller_type': traveller_type}
        if star_rating:
            where_parts.append("h.star_rating = $star_rating")
            params['star_rating'] = star_rating
        where_clause = " AND ".join(where_parts)
        
        conditions = ["t.type = $traveller_type"]
        if age_group:
            conditions.append("u.age_group = $age_group")
            params['age_group'] = age_group
        if user_gender:
            conditions.append("u.gender = $user_gender")
            params['user_gender'] = user_gender
        
        traveller_where = " AND ".join(conditions)
        user_match = "<-[:WROTE]-(u:User)" if (age_group or user_gender) else ""
        
        query = f"""MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE {where_clause} OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review)<-[:WROTE]-(t:Traveller){user_match}
        WHERE {traveller_where} WITH h, city, country, collect(r) AS reviews WHERE size(reviews) > 0 UNWIND reviews AS r
        RETURN h.name, city.name AS city, country.name AS country, {aspect_select},
        ({aspect_avg}) / {len(valid_aspects)} AS composite_aspect_score, count(r) AS review_count
        ORDER BY composite_aspect_score DESC LIMIT 10"""
        
        results = conn.execute_query(query, params)
        if not results:
            results = QueryLibrary.template_R3_recommend_by_aspects(conn, city, aspects, age_group, user_gender, star_rating)
        return results
    
    @staticmethod
    def template_R5_recommend_with_rating_filter(conn: Neo4jConnection, city: str, star_rating: int):
        query = """MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE city.name = $city AND h.star_rating = $star_rating OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review)
        WITH h, city, country, collect(r) AS reviews WHERE size(reviews) > 0 UNWIND reviews AS r
        RETURN h.name, city.name AS city, country.name AS country, avg(r.score_overall) AS overall_score
        ORDER BY overall_score DESC LIMIT 10"""
        return conn.execute_query(query, {'city': city, 'star_rating': star_rating})
    
    @staticmethod
    def template_D1_describe_all_aspects(conn: Neo4jConnection, hotel_name: str):
        query = """MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE toLower(h.name) = toLower($hotel_name) OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review)
        RETURN h.name, city.name AS city, country.name AS country,
        h.cleanliness_base, h.comfort_base, h.facilities_base, h.location_base, h.staff_base, h.value_for_money_base,
        avg(r.score_cleanliness) AS cleanliness_review, avg(r.score_comfort) AS comfort_review,
        avg(r.score_facilities) AS facilities_review, avg(r.score_location) AS location_review,
        avg(r.score_staff) AS staff_review, avg(r.score_value_for_money) AS value_for_money_review, count(r) AS review_count LIMIT 1"""
        return conn.execute_query(query, {'hotel_name': hotel_name})
    
    @staticmethod
    def template_D2_describe_specific_aspects(conn: Neo4jConnection, hotel_name: str, aspects: List[str]):
        aspect_mapping = {'cleanliness': ('cleanliness_base', 'score_cleanliness'), 'comfort': ('comfort_base', 'score_comfort'),
                         'facilities': ('facilities_base', 'score_facilities'), 'location': ('location_base', 'score_location'),
                         'staff': ('staff_base', 'score_staff'), 'value_for_money': ('value_for_money_base', 'score_value_for_money')}
        
        # Tolerate aspects=None, matching the R3/R4 templates.
        valid_aspects = [a for a in (aspects or []) if a in aspect_mapping]
        if not valid_aspects:
            return QueryLibrary.template_D1_describe_all_aspects(conn, hotel_name)
        
        aspect_fields = []
        for aspect in valid_aspects:
            base_field, review_field = aspect_mapping[aspect]
            aspect_fields.append(f"h.{base_field}")
            aspect_fields.append(f"avg(r.{review_field}) AS {aspect}_review")
        
        aspect_select = ", ".join(aspect_fields)
        query = f"""MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
        WHERE toLower(h.name) = toLower($hotel_name) OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review)
        RETURN h.name, city.name AS city, country.name AS country, {aspect_select}, count(r) AS review_count LIMIT 1"""
        return conn.execute_query(query, {'hotel_name': hotel_name})
    
    @staticmethod
    def template_C1_compare_all_aspects(conn: Neo4jConnection, hotel1: str, hotel2: str, aspects: List[str] = None):
        aspect_mapping = {'cleanliness': ('cleanliness_base', 'score_cleanliness'), 'comfort': ('comfort_base', 'score_comfort'),
                         'facilities': ('facilities_base', 'score_facilities'), 'location': ('location_base', 'score_location'),
                         'staff': ('staff_base', 'score_staff'), 'value_for_money': ('value_for_money_base', 'score_value_for_money')}
        
        if aspects:
            valid_aspects = [a for a in aspects if a in aspect_mapping]
            if not valid_aspects:
                return []
        else:
            valid_aspects = list(aspect_mapping.keys())
        
        aspect_fields = []
        for aspect in valid_aspects:
            base_field, review_field = aspect_mapping[aspect]
            aspect_fields.append(f"h1.{base_field}, h2.{base_field}, avg(r1.{review_field}), avg(r2.{review_field})")
        
        query = f"""MATCH (h1:Hotel)-[:LOCATED_IN]->(city1:City)-[:LOCATED_IN]->(country1:Country),
        (h2:Hotel)-[:LOCATED_IN]->(city2:City)-[:LOCATED_IN]->(country2:Country)
        WHERE toLower(h1.name) = toLower($hotel1) AND toLower(h2.name) = toLower($hotel2)
        OPTIONAL MATCH (h1)<-[:REVIEWED]-(r1:Review) OPTIONAL MATCH (h2)<-[:REVIEWED]-(r2:Review)
        RETURN h1.name, city1.name, country1.name, h2.name, city2.name, country2.name,
        {', '.join(aspect_fields)} LIMIT 1"""
        return conn.execute_query(query, {'hotel1': hotel1, 'hotel2': hotel2})
    
    @staticmethod
    def template_C2_compare_with_traveller_type(conn: Neo4jConnection, hotel1: str, hotel2: str, 
                                               traveller_type: str, aspects: List[str] = None):
        aspect_mapping = {'cleanliness': ('cleanliness_base', 'score_cleanliness'), 'comfort': ('comfort_base', 'score_comfort'),
                         'facilities': ('facilities_base', 'score_facilities'), 'location': ('location_base', 'score_location'),
                         'staff': ('staff_base', 'score_staff'), 'value_for_money': ('value_for_money_base', 'score_value_for_money')}
        
        if aspects:
            valid_aspects = [a for a in aspects if a in aspect_mapping]
            if not valid_aspects:
                return QueryLibrary.template_C1_compare_all_aspects(conn, hotel1, hotel2)
        else:
            valid_aspects = list(aspect_mapping.keys())
        
        query = f"""MATCH (h1:Hotel)-[:LOCATED_IN]->(city1:City)-[:LOCATED_IN]->(country1:Country),
        (h2:Hotel)-[:LOCATED_IN]->(city2:City)-[:LOCATED_IN]->(country2:Country)
        WHERE toLower(h1.name) = toLower($hotel1) AND toLower(h2.name) = toLower($hotel2)
        OPTIONAL MATCH (h1)<-[:REVIEWED]-(r1:Review)<-[:WROTE]-(t1:Traveller) WHERE t1.type = $traveller_type
        OPTIONAL MATCH (h2)<-[:REVIEWED]-(r2:Review)<-[:WROTE]-(t2:Traveller) WHERE t2.type = $traveller_type
        RETURN h1.name, city1.name, country1.name, h2.name, city2.name, country2.name,
        avg(r1.score_overall), avg(r2.score_overall), count(DISTINCT r1), count(DISTINCT r2) LIMIT 1"""
        
        results = conn.execute_query(query, {'hotel1': hotel1, 'hotel2': hotel2, 'traveller_type': traveller_type})
        if not results:
            results = QueryLibrary.template_C1_compare_all_aspects(conn, hotel1, hotel2, aspects)
        return results
    
    @staticmethod
    def template_V1_check_visa_requirement(conn: Neo4jConnection, from_country: str, to_country: str):
        query = """MATCH (from:Country {name: $from_country}), (to:Country {name: $to_country})
        OPTIONAL MATCH (from)-[v:NEEDS_VISA]->(to)
        RETURN from.name, to.name, v.visa_type, CASE WHEN v IS NOT NULL THEN true ELSE false END AS visa_required LIMIT 1"""
        return conn.execute_query(query, {'from_country': from_country, 'to_country': to_country})

print("QueryLibrary defined (14 templates)")

QueryLibrary defined (14 templates)
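
The recommendation templates build their `WHERE` clause from fixed fragments and pass user values as query parameters (`$city`, `$star_rating`); only names taken from a fixed mapping are interpolated into the Cypher text. The sketch below isolates that pattern. The function name `build_where` is hypothetical, and it checks `star_rating is not None` rather than truthiness, which is a small deviation from the templates.

```python
from typing import Any, Dict, Optional, Tuple


def build_where(city: str, star_rating: Optional[int] = None) -> Tuple[str, Dict[str, Any]]:
    """Return a WHERE clause body and its parameter dict (hypothetical helper)."""
    where_parts = ["city.name = $city"]
    params: Dict[str, Any] = {"city": city}
    if star_rating is not None:
        # The value travels as a parameter; only the fixed fragment enters the query text.
        where_parts.append("h.star_rating = $star_rating")
        params["star_rating"] = star_rating
    return " AND ".join(where_parts), params


clause, params = build_where("Cairo", 5)
print(clause)
# -> city.name = $city AND h.star_rating = $star_rating
print(params)
# -> {'city': 'Cairo', 'star_rating': 5}
```

Because user text never becomes part of the query string, a city name containing quotes or Cypher keywords cannot change the query's structure.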


In [11]:
def select_and_execute_query(conn: Neo4jConnection, intent: str, entities: Dict[str, Any]):
    if intent == "LIST_HOTELS":
        city, country, star_rating = entities.get('city'), entities.get('country'), entities.get('star_rating')
        if city and star_rating:
            return QueryLibrary.template_L4_list_by_city_and_rating(conn, city, star_rating)
        elif country and star_rating:
            return QueryLibrary.template_L5_list_by_country_and_rating(conn, country, star_rating)
        elif city:
            return QueryLibrary.template_L1_list_by_city(conn, city)
        elif country:
            return QueryLibrary.template_L2_list_by_country(conn, country)
        elif star_rating:
            return QueryLibrary.template_L3_list_by_rating(conn, star_rating)
    
    elif intent == "RECOMMEND_HOTEL":
        city = entities.get('city')
        traveller_type = entities.get('traveller_type')
        aspects = entities.get('aspects')
        star_rating = entities.get('star_rating')
        age_group = entities.get('age_group')
        user_gender = entities.get('user_gender')
        
        if city:
            if traveller_type and aspects:
                return QueryLibrary.template_R4_recommend_by_traveller_and_aspects(
                    conn, city, traveller_type, aspects, age_group, user_gender, star_rating)
            elif aspects:
                return QueryLibrary.template_R3_recommend_by_aspects(
                    conn, city, aspects, age_group, user_gender, star_rating)
            elif star_rating and traveller_type:
                return QueryLibrary.template_R1_recommend_by_location(conn, city, star_rating)
            elif star_rating:
                return QueryLibrary.template_R5_recommend_with_rating_filter(conn, city, star_rating)
            else:
                # Without aspects, traveller_type is not used for filtering;
                # R1 ranks hotels in the city by overall review score.
                return QueryLibrary.template_R1_recommend_by_location(conn, city)
    
    elif intent == "DESCRIBE_HOTEL":
        hotel_name, aspects = entities.get('hotel_name'), entities.get('aspects')
        if hotel_name:
            return QueryLibrary.template_D2_describe_specific_aspects(conn, hotel_name, aspects) if aspects else QueryLibrary.template_D1_describe_all_aspects(conn, hotel_name)
    
    elif intent == "COMPARE_HOTELS":
        hotel1, hotel2, traveller_type, aspects = entities.get('hotel1'), entities.get('hotel2'), entities.get('traveller_type'), entities.get('aspects')
        if hotel1 and hotel2:
            return QueryLibrary.template_C2_compare_with_traveller_type(conn, hotel1, hotel2, traveller_type, aspects) if traveller_type else QueryLibrary.template_C1_compare_all_aspects(conn, hotel1, hotel2, aspects)
    
    elif intent == "CHECK_VISA":
        from_country, to_country = entities.get('from_country'), entities.get('to_country')
        if from_country and to_country:
            return QueryLibrary.template_V1_check_visa_requirement(conn, from_country, to_country)
    
    return []

print("Query selector defined")

Query selector defined
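
The selector checks the most specific entity combination first and falls back to broader templates. For `LIST_HOTELS` that precedence can be sketched as a pure function that returns a template ID instead of running a query, which makes the routing testable without a database. The function name `pick_list_template` is hypothetical.

```python
from typing import Any, Dict, Optional


def pick_list_template(entities: Dict[str, Any]) -> Optional[str]:
    """Return the LIST_HOTELS template ID for the given entities (hypothetical helper)."""
    city = entities.get("city")
    country = entities.get("country")
    star_rating = entities.get("star_rating")
    # Most specific combinations first, mirroring the order used by the selector.
    if city and star_rating:
        return "L4"
    if country and star_rating:
        return "L5"
    if city:
        return "L1"
    if country:
        return "L2"
    if star_rating:
        return "L3"
    return None


print(pick_list_template({"city": "Cairo", "star_rating": 5}))
# -> L4
print(pick_list_template({"country": "Egypt"}))
# -> L2
```

When both a city and a country are present, the city wins, since it is the narrower filter.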


In [12]:
def full_pipeline(user_query: str, conn: Neo4jConnection):
    print(f"\n{'='*80}")
    print(f"User Query: {user_query}")
    print(f"{'='*80}")
    
    intent = classifier.classify(user_query)
    print(f"Intent: {intent}")
    
    if intent is None or intent == "NONE":
        print("No valid intent detected")
        return None
    
    entities = extract_entities(user_query, intent)
    entities_display = {k: v for k, v in entities.items() if v is not None}
    print(f"Extracted Entities: {json.dumps(entities_display, indent=2)}")
    
    results = select_and_execute_query(conn, intent, entities)
    print(f"Query Results: {len(results)} rows")
    if results:
        print(f"Sample: {json.dumps(results[0], indent=2)}")
    print()
    
    return results

try:
    conn = Neo4jConnection()
    
    test_queries = [
        ("L1", "Show hotels in Paris"),
        ("L2", "List hotels in France"),
        ("L3", "Show me 5-star hotels"),
        ("L4", "Show 5-star hotels in Cairo"),
        ("L5", "Show 5-star hotels in Egypt"),
        ("R1", "Recommend hotels in Cairo"),
        ("R3", "Recommend hotels in Cairo with good cleanliness"),
        ("R4", "Recommend hotels in Cairo for families with comfortable rooms"),
        ("R5", "Recommend 4-star hotels in Cairo"),
        ("D1", "Tell me about The Azure Tower"),
        ("D2", "Tell me about The Azure Tower's cleanliness"),
        ("C1", "Compare The Azure Tower and Nile Grandeur"),
        ("C2", "Compare The Azure Tower and Nile Grandeur for families"),
        ("V1", "Do Egyptians need a visa for Turkey?"),
    ]
    
    results_summary = []
    for template_id, query in test_queries:
        result = full_pipeline(query, conn)
        status = "PASS" if result else "EMPTY"
        results_summary.append((template_id, status))
    
    print("\n" + "="*80)
    print("SUMMARY")
    print("="*80)
    passed = sum(1 for _, status in results_summary if status == "PASS")
    for template_id, status in results_summary:
        print(f"{template_id}: {status}")
    print(f"\nResult: {passed}/{len(test_queries)} templates executed successfully")
    
    conn.close()
except Exception as e:
    print(f"Error: {e}")


User Query: Show hotels in Paris
Intent: LIST_HOTELS
Extracted Entities: {
  "city": "Paris"
}
Query Results: 1 rows
Sample: {
  "h.name": "L'\u00c9toile Palace",
  "city": "Paris",
  "country": "France",
  "h.star_rating": 5.0
}


User Query: List hotels in France
Intent: LIST_HOTELS
Extracted Entities: {
  "country": "France"
}
Query Results: 1 rows
Sample: {
  "h.name": "L'\u00c9toile Palace",
  "city": "Paris",
  "country": "France",
  "h.star_rating": 5.0
}


User Query: Show me 5-star hotels
Intent: LIST_HOTELS
Extracted Entities: {
  "star_rating": 5
}
Query Results: 25 rows
Sample: {
  "h.name": "Aztec Heights",
  "city": "Mexico City",
  "country": "Mexico",
  "h.star_rating": 5.0
}


User Query: Show 5-star hotels in Cairo
Intent: LIST_HOTELS
Extracted Entities: {
  "city": "Cairo",
  "star_rating": 5
}
Query Results: 1 rows
Sample: {
  "h.name": "Nile Grandeur",
  "city": "Cairo",
  "country": "Egypt",
  "h.star_rating": 5.0
}


User Query: Show 5-star hotels in Egypt
Inten

### 2.b Embeddings-Based Retrieval (RAG)

#### Step 1: Extract Hotel + Review Data from Neo4j

In [17]:
conn_rag = Neo4jConnection()

query = """
MATCH (h:Hotel)-[:LOCATED_IN]->(city:City)-[:LOCATED_IN]->(country:Country)
OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review)<-[:WROTE]-(t:Traveller)
WITH h, city, country, t.type AS traveller_type,
     avg(r.score_overall) AS avg_overall,
     avg(r.score_cleanliness) AS avg_cleanliness,
     avg(r.score_comfort) AS avg_comfort,
     avg(r.score_facilities) AS avg_facilities,
     avg(r.score_location) AS avg_location,
     avg(r.score_staff) AS avg_staff,
     avg(r.score_value_for_money) AS avg_value,
     count(r) AS review_count
WHERE review_count > 0
RETURN h.name AS hotel_name,
       h.star_rating AS star_rating,
       city.name AS city,
       country.name AS country,
       traveller_type,
       avg_overall, avg_cleanliness, avg_comfort, avg_facilities,
       avg_location, avg_staff, avg_value, review_count
ORDER BY h.name, traveller_type
LIMIT 1000
"""

hotel_data = conn_rag.execute_query(query)
print(f"Extracted {len(hotel_data)} hotel-traveller combinations")
if hotel_data:
    print(f"Sample: {json.dumps(hotel_data[0], indent=2)}")

Extracted 100 hotel-traveller combinations
Sample: {
  "hotel_name": "Aztec Heights",
  "star_rating": 5.0,
  "city": "Mexico City",
  "country": "Mexico",
  "traveller_type": "Business",
  "avg_overall": 8.624821002386632,
  "avg_cleanliness": 8.19832935560859,
  "avg_comfort": 8.611217183770881,
  "avg_facilities": 8.66825775656324,
  "avg_location": 9.26945107398567,
  "avg_staff": 8.811217183770879,
  "avg_value": 7.740334128878283,
  "review_count": 419
}


In [18]:
check_query = """
MATCH (h:Hotel)
OPTIONAL MATCH (h)<-[:REVIEWED]-(r:Review)<-[:WROTE]-(t:Traveller)
WITH h, count(DISTINCT t.type) as traveller_types_with_reviews, count(r) as total_reviews
RETURN count(h) as total_hotels,
       sum(CASE WHEN total_reviews > 0 THEN 1 ELSE 0 END) as hotels_with_reviews,
       avg(traveller_types_with_reviews) as avg_traveller_types_per_hotel
"""

stats = conn_rag.execute_query(check_query)
print("DATABASE STATISTICS:")
print(json.dumps(stats[0], indent=2))

print("\nSample of hotel_data extracted:")
print(f"Total combinations: {len(hotel_data)}")
print(f"Unique hotels: {len(set(h['hotel_name'] for h in hotel_data))}")
print(f"Unique cities: {len(set(h['city'] for h in hotel_data))}")

traveller_dist = {}
for h in hotel_data:
    t = h.get('traveller_type')
    traveller_dist[t] = traveller_dist.get(t, 0) + 1

print(f"\nTraveller type distribution in extracted data:")
for t, count in sorted(traveller_dist.items()):
    print(f"  {t}: {count}")

DATABASE STATISTICS:
{
  "total_hotels": 25,
  "hotels_with_reviews": 25,
  "avg_traveller_types_per_hotel": 4.0
}

Sample of hotel_data extracted:
Total combinations: 100
Unique hotels: 25
Unique cities: 25

Traveller type distribution in extracted data:
  Business: 25
  Couple: 25
  Family: 25
  Solo: 25


#### Step 2: Generate Natural Review Text

In [19]:
import random

def generate_review(hotel_info: Dict[str, Any]) -> str:
    traveller_type = hotel_info.get('traveller_type', 'solo')
    hotel_name = hotel_info['hotel_name']
    city = hotel_info['city']
    country = hotel_info['country']
    star = int(hotel_info.get('star_rating', 3))
    
    scores = {
        'overall': hotel_info.get('avg_overall', 7.0),
        'cleanliness': hotel_info.get('avg_cleanliness', 7.0),
        'comfort': hotel_info.get('avg_comfort', 7.0),
        'facilities': hotel_info.get('avg_facilities', 7.0),
        'location': hotel_info.get('avg_location', 7.0),
        'staff': hotel_info.get('avg_staff', 7.0),
        'value': hotel_info.get('avg_value', 7.0)
    }
    
    traveller_contexts = {
        'family': ['with my family', 'with the kids', 'as a family of four', 'family vacation'],
        'couple': ['with my partner', 'romantic getaway', 'anniversary trip', 'couples retreat'],
        'solo': ['solo trip', 'traveling alone', 'on my own', 'solo getaway'],
        'business': ['business trip', 'work travel', 'conference stay', 'corporate visit'],
        'group': ['with friends', 'group trip', 'with colleagues', 'friends vacation']
    }
    
    tone_templates = [
        'enthusiastic',
        'satisfied',
        'balanced',
        'critical_but_fair'
    ]
    
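    tone = random.choice(tone_templates)
    # Neo4j returns capitalized types ('Business', 'Family', ...), so normalize before the lookup;
    # without this every lookup misses and falls back to the generic 'trip' context.
    context = random.choice(traveller_contexts.get(str(traveller_type).lower(), ['trip']))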
    
    prompt = f"""Write a natural hotel review as if you're a real {traveller_type} traveler.

Hotel: {hotel_name} ({star}-star) in {city}, {country}
Trip context: {context}
Tone: {tone}

Aspect scores (0-10 scale):
- Overall: {scores['overall']:.1f}
- Cleanliness: {scores['cleanliness']:.1f}
- Comfort: {scores['comfort']:.1f}
- Facilities: {scores['facilities']:.1f}
- Location: {scores['location']:.1f}
- Staff: {scores['staff']:.1f}
- Value for money: {scores['value']:.1f}

RULES:
1. Write 80-150 words in first person
2. Sound human and conversational (not formal)
3. Mention 2-4 aspects naturally (don't list all)
4. Convert scores to natural expressions:
   - 9.0+: "amazing", "spotless", "fantastic", "excellent"
   - 8.0-8.9: "great", "very good", "really nice", "impressive"
   - 7.0-7.9: "good", "decent", "solid", "comfortable"
   - 6.0-6.9: "okay", "average", "acceptable", "could be better"
   - <6.0: "disappointing", "needs improvement", "not great"
5. Include traveller context naturally
6. Vary sentence structure and vocabulary
7. NO numeric ratings or formal language
8. Focus on experience, not data

Return ONLY the review text."""
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,
            max_tokens=250
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating review: {e}")
        return ""

test_review = generate_review(hotel_data[0])
print(f"Generated review ({len(test_review.split())} words):\n")
print(test_review)

Generated review (132 words):

I recently stayed at Aztec Heights during a business trip to Mexico City, and overall, it was a very good experience. The location is fantastic, right in the heart of the city, making it easy to reach meetings and explore a bit in between. The staff was particularly helpful, always ready with a smile and quick to assist when I needed anything.

The room was comfortable, offering a great spot to unwind after a long day. While the cleanliness was decent, I did notice a couple of spots that could use a bit more attention. The facilities were impressive too, with a well-equipped gym that allowed me to stay on track with my fitness routine. All in all, Aztec Heights provides a solid base for business travelers looking for convenience and comfort.
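
Rule 4 of the prompt leaves the score-to-adjective translation to the model. A minimal sketch of the same banding in plain Python, which could be used to spot-check whether a review's wording is consistent with its scores. `BANDS`, `LOWEST_BAND`, and `score_band` are hypothetical names that do not appear elsewhere in the notebook, and rounding to one decimal is an assumption made to mirror the `:.1f` formatting in the prompt.

```python
from typing import List, Tuple

# Lower bound of each band, highest first, copied from rule 4 of the prompt.
BANDS: List[Tuple[float, List[str]]] = [
    (9.0, ["amazing", "spotless", "fantastic", "excellent"]),
    (8.0, ["great", "very good", "really nice", "impressive"]),
    (7.0, ["good", "decent", "solid", "comfortable"]),
    (6.0, ["okay", "average", "acceptable", "could be better"]),
]
LOWEST_BAND = ["disappointing", "needs improvement", "not great"]


def score_band(score: float) -> List[str]:
    """Return the adjectives the prompt associates with a 0-10 score."""
    # Round first, because the model only ever sees the one-decimal value.
    shown = round(score, 1)
    for lower_bound, words in BANDS:
        if shown >= lower_bound:
            return words
    return LOWEST_BAND


# Location and value-for-money scores from the sample row in Step 1.
print(score_band(9.26945107398567))
print(score_band(7.740334128878283))
```

With the sample row, the location score lands in the top band and the value-for-money score in the "good / decent / solid" band.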


#### Step 3: Generate Reviews for RAG (100 in This Run)

In [24]:
import time

def generate_reviews_batch(hotel_data_list, target_count=1000, delay=0.3):
    reviews = []
    
    reviews_per_hotel = max(1, target_count // len(hotel_data_list))
    if reviews_per_hotel * len(hotel_data_list) < target_count:
        reviews_per_hotel += 1
    
    print(f"Generating {target_count} reviews from {len(hotel_data_list)} hotel-traveller combinations")
    print(f"Creating {reviews_per_hotel} reviews per combination\n")
    
    total_needed = target_count
    generated = 0
    
    while generated < total_needed:
        for hotel_info in hotel_data_list:
            if generated >= total_needed:
                break
            
            try:
                review_text = generate_review(hotel_info)
                
                if review_text:
                    review_entry = {
                        'hotel_name': hotel_info['hotel_name'],
                        'city': hotel_info['city'],
                        'country': hotel_info['country'],
                        'star_rating': hotel_info['star_rating'],
                        'traveller_type': hotel_info.get('traveller_type', 'solo'),
                        'review_text': review_text,
                        'metadata': {
                            'avg_overall': hotel_info.get('avg_overall'),
                            'avg_cleanliness': hotel_info.get('avg_cleanliness'),
                            'avg_comfort': hotel_info.get('avg_comfort'),
                            'avg_location': hotel_info.get('avg_location')
                        }
                    }
                    reviews.append(review_entry)
                    generated += 1
                
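                # Report only right after a successful append; otherwise a failed call
                # at generated == 0 would hit a division by zero in the quality check.
                if review_text and generated % 100 == 0: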
                    print(f"\n{'='*70}")
                    print(f"Progress: {generated}/{total_needed} reviews generated")
                    print(f"{'='*70}")
                    
                    if len(reviews) >= 2:
                        print(f"\nSAMPLE: Review #{generated-1}")
                        print(f"Hotel: {reviews[-1]['hotel_name']} ({reviews[-1]['traveller_type']})")
                        print(f"Text: {reviews[-1]['review_text'][:200]}...")
                        
                        print(f"\nSAMPLE: Review #{generated-2}")
                        print(f"Hotel: {reviews[-2]['hotel_name']} ({reviews[-2]['traveller_type']})")
                        print(f"Text: {reviews[-2]['review_text'][:200]}...")
                    
                    word_counts = [len(r['review_text'].split()) for r in reviews]
                    print(f"\nQuality Check:")
                    print(f"  Avg word count: {sum(word_counts)/len(word_counts):.1f}")
                    print(f"  Unique hotels: {len(set(r['hotel_name'] for r in reviews))}")
                    print(f"{'='*70}\n")
                
                time.sleep(delay)
                
            except Exception as e:
                print(f"Error at review {generated}: {e}")
                continue
    
    print(f"\nCompleted: {len(reviews)}/{total_needed} reviews generated successfully")
    return reviews

generated_reviews = generate_reviews_batch(hotel_data, target_count=100, delay=0.3)
print(f"\nTotal reviews: {len(generated_reviews)}")
print(f"\nSample review:\n{generated_reviews[0]['review_text']}")

Generating 100 reviews from 100 hotel-traveller combinations
Creating 1 reviews per combination


Progress: 100/100 reviews generated

SAMPLE: Review #99
Hotel: The Savannah House (Solo)
Text: I recently stayed at The Savannah House in Lagos during my solo trip, and I have to say, it was a wonderful experience! The location is fantastic—close to local attractions, making it easy for me to e...

SAMPLE: Review #98
Hotel: The Savannah House (Family)
Text: We just returned from a fantastic family trip to The Savannah House in Lagos, and I can’t rave enough about it! The cleanliness of the hotel was impressive—everything felt spotless, which is so import...

Quality Check:
  Avg word count: 119.3
  Unique hotels: 25


Completed: 100/100 reviews generated successfully

Total reviews: 100

Sample review:
I recently stayed at Aztec Heights during a business trip to Mexico City, and I have to say, I was quite impressed. The location is fantastic—right in the heart of the city and close to ever
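
The `while` loop in `generate_reviews_batch` only stops once `generated` reaches the target. Because `generate_review` returns an empty string on failure, a persistent API error (for example, an invalid key) would keep the loop cycling indefinitely. A minimal sketch of one way to bound it, assuming an attempt cap is acceptable; `collect_with_attempt_cap` and the stand-in generator are hypothetical and not part of the notebook.

```python
from itertools import cycle
from typing import Any, Callable, Dict, List


def collect_with_attempt_cap(
    items: List[Dict[str, Any]],
    generate: Callable[[Dict[str, Any]], str],
    target_count: int,
    max_attempts: int,
) -> List[str]:
    """Cycle over items until target_count texts exist or attempts run out."""
    if not items:
        return []
    texts: List[str] = []
    attempts = 0
    for item in cycle(items):
        if len(texts) >= target_count or attempts >= max_attempts:
            break
        attempts += 1
        text = generate(item)
        if text:  # an empty string signals a failed call, as in generate_review
            texts.append(text)
    return texts


def always_fails(item: Dict[str, Any]) -> str:
    """Stand-in generator that mimics a permanently failing API call."""
    return ""


# The loop gives up after 20 attempts instead of spinning forever.
print(len(collect_with_attempt_cap([{"hotel_name": "X"}], always_fails, 5, 20)))
```

In the notebook, `generate` would be `generate_review` and `max_attempts` might be set to a small multiple of `target_count`.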

#### Quality Check: Review Diversity

In [25]:
print(f"Total reviews generated: {len(generated_reviews)}\n")

test_hotel = generated_reviews[0]['hotel_name']
test_type = generated_reviews[0]['traveller_type']

same_combo = [r for r in generated_reviews 
              if r['hotel_name'] == test_hotel and r['traveller_type'] == test_type]

print(f"{'='*70}")
print(f"DIVERSITY TEST: {test_hotel} ({test_type})")
print(f"Found {len(same_combo)} reviews for this combination")
print(f"{'='*70}\n")

for i, r in enumerate(same_combo[:4], 1):
    print(f"Review {i}:")
    print(r['review_text'])
    print(f"\n{'-'*70}\n")

word_counts = [len(r['review_text'].split()) for r in generated_reviews]
unique_starts = set(r['review_text'][:30] for r in generated_reviews)

print(f"\n{'='*70}")
print("OVERALL QUALITY METRICS")
print(f"{'='*70}")
print(f"Total reviews: {len(generated_reviews)}")
print(f"Unique hotels: {len(set(r['hotel_name'] for r in generated_reviews))}")
print(f"Unique starting phrases: {len(unique_starts)}/{len(generated_reviews)}")
print(f"\nWord count:")
print(f"  Min: {min(word_counts)}")
print(f"  Avg: {sum(word_counts)/len(word_counts):.1f}")
print(f"  Max: {max(word_counts)}")
print(f"\nTraveller type distribution:")
for ttype in set(r['traveller_type'] for r in generated_reviews):
    count = sum(1 for r in generated_reviews if r['traveller_type'] == ttype)
    print(f"  {ttype}: {count}")
print(f"{'='*70}")

Total reviews generated: 100

DIVERSITY TEST: Aztec Heights (Business)
Found 1 reviews for this combination

Review 1:
I recently stayed at Aztec Heights during a business trip to Mexico City, and I have to say, I was quite impressed. The location is fantastic—right in the heart of the city and close to everything I needed for my meetings. The staff were really nice too; they went out of their way to help me with some last-minute arrangements. The room itself was comfortable, which was a relief after long days of work. I also appreciated the facilities; they had everything I needed to stay productive. Overall, it was a great experience, though I felt the value for money could be a touch better. But all in all, I’d definitely recommend it for anyone traveling on business!

----------------------------------------------------------------------


OVERALL QUALITY METRICS
Total reviews: 100
Unique hotels: 25
Unique starting phrases: 72/100

Word count:
  Min: 98
  Avg: 119.3
  Max: 148

Tra

#### Step 4: Save Reviews

In [26]:
reviews_file = 'synthetic_reviews.json'

with open(reviews_file, 'w', encoding='utf-8') as f:
    json.dump(generated_reviews, f, indent=2, ensure_ascii=False)

print(f"Saved {len(generated_reviews)} reviews to {reviews_file}")
print(f"File size: {os.path.getsize(reviews_file) / 1024:.1f} KB")

Saved 100 reviews to synthetic_reviews.json
File size: 104.4 KB
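
Saving the reviews makes it possible to skip the generation step in later sessions. A minimal sketch of reading the file back and checking that each entry has the fields written in Step 3; `load_reviews` and `REQUIRED_KEYS` are hypothetical names, and the round trip below uses a single made-up entry rather than data from the run above.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

REQUIRED_KEYS = {
    "hotel_name", "city", "country", "star_rating", "traveller_type", "review_text",
}


def load_reviews(path: str) -> List[Dict[str, Any]]:
    """Load reviews from JSON and fail early if an entry lacks an expected key."""
    with open(path, "r", encoding="utf-8") as f:
        reviews = json.load(f)
    for i, review in enumerate(reviews):
        missing = REQUIRED_KEYS - review.keys()
        if missing:
            raise ValueError(f"Review {i} is missing keys: {sorted(missing)}")
    return reviews


# Round trip with one illustrative entry, written the same way as in Step 4.
sample = [{
    "hotel_name": "L'Étoile Palace", "city": "Paris", "country": "France",
    "star_rating": 5.0, "traveller_type": "Couple", "review_text": "Lovely stay.",
}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthetic_reviews.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2, ensure_ascii=False)
    print(load_reviews(path) == sample)
```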


#### Step 5: Verify Review Quality & Diversity

In [27]:
from collections import Counter

word_counts = [len(r['review_text'].split()) for r in generated_reviews]
traveller_types = [r['traveller_type'] for r in generated_reviews]
unique_hotels = len(set(r['hotel_name'] for r in generated_reviews))
unique_cities = len(set(r['city'] for r in generated_reviews))

print("DIVERSITY ANALYSIS")
print("=" * 60)
print(f"Total reviews: {len(generated_reviews)}")
print(f"Unique hotels: {unique_hotels}")
print(f"Unique cities: {unique_cities}")
print(f"\nWord count: min={min(word_counts)}, max={max(word_counts)}, avg={sum(word_counts)/len(word_counts):.1f}")
print(f"\nTraveller type distribution:")
for ttype, count in Counter(traveller_types).most_common():
    print(f"  {ttype}: {count} ({count/len(generated_reviews)*100:.1f}%)")

print(f"\n{'=' * 60}")
print("SAMPLE REVIEWS")
print("=" * 60)

sample_types = list(set(traveller_types))[:3]
for ttype in sample_types:
    sample = next((r for r in generated_reviews if r['traveller_type'] == ttype), None)
    if sample:
        print(f"\n[{ttype.upper()}] {sample['hotel_name']}, {sample['city']}")
        print(f"{sample['review_text']}")
        print("-" * 60)

DIVERSITY ANALYSIS
Total reviews: 100
Unique hotels: 25
Unique cities: 25

Word count: min=98, max=148, avg=119.3

Traveller type distribution:
  Business: 25 (25.0%)
  Couple: 25 (25.0%)
  Family: 25 (25.0%)
  Solo: 25 (25.0%)

SAMPLE REVIEWS

[FAMILY] Aztec Heights, Mexico City
We just returned from a family trip to Mexico City, and staying at Aztec Heights was a real highlight. The facilities were fantastic, offering something for everyone—from a lovely pool where the kids could splash around to a cozy lounge area where we could unwind after a long day of exploring. We also appreciated how clean everything was; it felt spotless and well-maintained, which is always a plus when traveling with little ones. The staff was very good, always friendly and ready to help us with our questions about the area. The location was decent, making it easy to access popular spots without too much hassle. Overall, it was a great experience, and we’d definitely consider coming back!
--------------------

#### Step 6: Create Embeddings with TWO Models

In [32]:
import numpy as np
%pip install tf-keras
from sentence_transformers import SentenceTransformer

embedder_minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embedder_mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

review_texts = [r['review_text'] for r in generated_reviews]

print("Creating embeddings with Model 1: all-MiniLM-L6-v2...")
embeddings_minilm = embedder_minilm.encode(review_texts, convert_to_numpy=True, show_progress_bar=True)

print("\nCreating embeddings with Model 2: all-mpnet-base-v2...")
embeddings_mpnet = embedder_mpnet.encode(review_texts, convert_to_numpy=True, show_progress_bar=True)

print(f"\nModel 1 (MiniLM): shape={embeddings_minilm.shape}, dim={embeddings_minilm.shape[1]}")
print(f"Model 2 (MPNet): shape={embeddings_mpnet.shape}, dim={embeddings_mpnet.shape[1]}")


Creating embeddings with Model 1: all-MiniLM-L6-v2...

Creating embeddings with Model 2: all-mpnet-base-v2...


Model 1 (MiniLM): shape=(100, 384), dim=384
Model 2 (MPNet): shape=(100, 768), dim=768
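
Both vector indices in Step 8 are configured with the `cosine` similarity function. A minimal pure-Python sketch of what that measures, shown on toy 2-dimensional vectors rather than the real 384- or 768-dimensional embeddings. The `(1 + cos) / 2` rescaling in `normalized_score` is an assumption about how Neo4j reports cosine scores in the range 0 to 1 and is worth verifying against the Neo4j documentation; both function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def normalized_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Rescale cosine from [-1, 1] to [0, 1] (assumed to mirror the index score)."""
    return (1.0 + cosine_similarity(a, b)) / 2.0


# Orthogonal toy vectors: cosine is 0, so the rescaled score is 0.5.
print(normalized_score([1.0, 0.0], [0.0, 1.0]))
```

The dimension check matters in practice: a MiniLM query vector (384) cannot be compared against MPNet embeddings (768), which is why each model gets its own property and index.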


#### Step 7: Store in Neo4j Vector Index

In [33]:
for i, review in enumerate(generated_reviews):
    review['review_id'] = f"review_{i}"
    review['embedding_minilm'] = embeddings_minilm[i].tolist()
    review['embedding_mpnet'] = embeddings_mpnet[i].tolist()

print(f"Added embeddings to {len(generated_reviews)} review objects")
print(f"Sample review keys: {list(generated_reviews[0].keys())}")

Added embeddings to 100 review objects
Sample review keys: ['hotel_name', 'city', 'country', 'star_rating', 'traveller_type', 'review_text', 'metadata', 'review_id', 'embedding_minilm', 'embedding_mpnet']


In [34]:
create_nodes_query = """
UNWIND $reviews AS review
CREATE (sr:SyntheticReview {
    review_id: review.review_id,
    hotel_name: review.hotel_name,
    city: review.city,
    country: review.country,
    star_rating: review.star_rating,
    traveller_type: review.traveller_type,
    review_text: review.review_text,
    embedding_minilm: review.embedding_minilm,
    embedding_mpnet: review.embedding_mpnet
})
"""

print("Creating SyntheticReview nodes in Neo4j...")
conn_rag.execute_query(create_nodes_query, {'reviews': generated_reviews})

count_query = "MATCH (sr:SyntheticReview) RETURN count(sr) as total"
result = conn_rag.execute_query(count_query)
print(f"Created {result[0]['total']} SyntheticReview nodes")

Creating SyntheticReview nodes in Neo4j...
Created 100 SyntheticReview nodes
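
Each review now carries two embedding lists (384 and 768 floats), so a single `UNWIND` over a much larger set could produce a large request. A minimal sketch of splitting the list into batches before sending it, assuming a batch size of 50 is reasonable; `chunked` is a hypothetical helper, and the commented lines show how it might be wired to the cell above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Possible use with the notebook's objects (not executed here):
# for batch in chunked(generated_reviews, 50):
#     conn_rag.execute_query(create_nodes_query, {'reviews': batch})

print([len(batch) for batch in chunked(list(range(120)), 50)])
```

Note that `CREATE` does not check for existing nodes, so re-running the cell would add a second copy of every review. Using `MERGE (sr:SyntheticReview {review_id: review.review_id})` followed by `SET` for the remaining properties is one way to make the write idempotent.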


#### Step 8: Create Vector Indices

In [35]:
index_minilm_query = """
CREATE VECTOR INDEX review_minilm_index IF NOT EXISTS
FOR (sr:SyntheticReview)
ON sr.embedding_minilm
OPTIONS {indexConfig: {
 `vector.dimensions`: 384,
 `vector.similarity_function`: 'cosine'
}}
"""

index_mpnet_query = """
CREATE VECTOR INDEX review_mpnet_index IF NOT EXISTS
FOR (sr:SyntheticReview)
ON sr.embedding_mpnet
OPTIONS {indexConfig: {
 `vector.dimensions`: 768,
 `vector.similarity_function`: 'cosine'
}}
"""

print("Creating vector index for Model 1 (MiniLM - 384 dim)...")
conn_rag.execute_query(index_minilm_query)

print("Creating vector index for Model 2 (MPNet - 768 dim)...")
conn_rag.execute_query(index_mpnet_query)

print("\nVector indices created successfully!")

Creating vector index for Model 1 (MiniLM - 384 dim)...
Creating vector index for Model 2 (MPNet - 768 dim)...

Vector indices created successfully!
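
The two index statements differ only in name, property, and dimension. A minimal sketch of generating them from one template, so the dimension can be taken from the embedding array (for example `embeddings_minilm.shape[1]`) instead of being typed by hand. `vector_index_query` is a hypothetical helper; the identifier check is there because index names, labels, and properties are interpolated into the Cypher text here rather than passed as parameters.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def vector_index_query(index_name: str, label: str, prop: str, dimensions: int) -> str:
    """Build a CREATE VECTOR INDEX statement shaped like the ones in Step 8."""
    for identifier in (index_name, label, prop):
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (sr:{label})\n"
        f"ON sr.{prop}\n"
        "OPTIONS {indexConfig: {\n"
        f" `vector.dimensions`: {int(dimensions)},\n"
        " `vector.similarity_function`: 'cosine'\n"
        "}}"
    )


print(vector_index_query("review_minilm_index", "SyntheticReview", "embedding_minilm", 384))
```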


#### Step 9: Semantic Search Functions (Both Models)

In [36]:
def semantic_search_minilm(query: str, top_k: int = 5):
    query_embedding = embedder_minilm.encode([query], convert_to_numpy=True)[0].tolist()
    
    search_query = """
    CALL db.index.vector.queryNodes('review_minilm_index', $top_k, $query_embedding)
    YIELD node, score
    RETURN node.review_id AS review_id,
           node.hotel_name AS hotel_name,
           node.city AS city,
           node.country AS country,
           node.traveller_type AS traveller_type,
           node.review_text AS review_text,
           score
    """
    
    results = conn_rag.execute_query(search_query, {
        'query_embedding': query_embedding,
        'top_k': top_k
    })
    return results

def semantic_search_mpnet(query: str, top_k: int = 5):
    query_embedding = embedder_mpnet.encode([query], convert_to_numpy=True)[0].tolist()
    
    search_query = """
    CALL db.index.vector.queryNodes('review_mpnet_index', $top_k, $query_embedding)
    YIELD node, score
    RETURN node.review_id AS review_id,
           node.hotel_name AS hotel_name,
           node.city AS city,
           node.country AS country,
           node.traveller_type AS traveller_type,
           node.review_text AS review_text,
           score
    """
    
    results = conn_rag.execute_query(search_query, {
        'query_embedding': query_embedding,
        'top_k': top_k
    })
    return results

print("Semantic search functions defined for both models")

Semantic search functions defined for both models
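
The two search functions are identical apart from the embedder and the index name. A minimal sketch of a factory that produces either one, assuming the index name can be passed to `db.index.vector.queryNodes` as a query parameter. `make_semantic_search` and `SEARCH_TEMPLATE` are hypothetical names, and the encoder and query runner below are stand-ins so the sketch runs without Neo4j or a model.

```python
from typing import Any, Callable, Dict, List, Sequence

SEARCH_TEMPLATE = """
CALL db.index.vector.queryNodes($index_name, $top_k, $query_embedding)
YIELD node, score
RETURN node.review_id AS review_id,
       node.hotel_name AS hotel_name,
       node.city AS city,
       node.country AS country,
       node.traveller_type AS traveller_type,
       node.review_text AS review_text,
       score
"""

QueryRunner = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]


def make_semantic_search(
    index_name: str,
    encode: Callable[[str], Sequence[float]],
    run_query: QueryRunner,
) -> Callable[..., List[Dict[str, Any]]]:
    """Return a search function bound to one index and one encoder."""
    def search(query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        params = {
            "index_name": index_name,
            "top_k": top_k,
            "query_embedding": list(encode(query)),
        }
        return run_query(SEARCH_TEMPLATE, params)
    return search


# Stand-ins for the embedder and the Neo4j connection (illustrative only).
captured: Dict[str, Any] = {}

def fake_run(cypher: str, params: Dict[str, Any]) -> List[Dict[str, Any]]:
    captured.update(params)
    return [{"review_id": "review_0", "score": 0.9}]

search = make_semantic_search("review_minilm_index", lambda q: [0.1, 0.2], fake_run)
print(search("clean hotel", top_k=3), captured["index_name"], captured["top_k"])
```

In the notebook, `encode` could be `lambda q: embedder_minilm.encode([q], convert_to_numpy=True)[0].tolist()` and `run_query` could be `conn_rag.execute_query`.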


#### Step 10: Compare Both Embedding Models

In [37]:
import time

test_queries = [
    "Looking for a clean hotel with great staff for my family",
    "Romantic hotel with amazing views",
    "Budget-friendly hotel with good value for money",
]

print("EMBEDDING MODEL COMPARISON")
print("=" * 80)
print("Model 1: all-MiniLM-L6-v2 (384 dimensions)")
print("Model 2: all-mpnet-base-v2 (768 dimensions)")
print("=" * 80)

comparison_results = []

for test_query in test_queries:
    print(f"\n\nQuery: \"{test_query}\"")
    print("-" * 80)
    
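    # perf_counter is a monotonic, high-resolution clock; each timing covers
    # query encoding plus the index lookup, and the first call includes warm-up.
    start_time = time.perf_counter()
    results_minilm = semantic_search_minilm(test_query, top_k=3)
    time_minilm = time.perf_counter() - start_time
    
    start_time = time.perf_counter()
    results_mpnet = semantic_search_mpnet(test_query, top_k=3)
    time_mpnet = time.perf_counter() - start_time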
    
    print(f"\nModel 1 (MiniLM): {len(results_minilm)} results in {time_minilm:.3f}s")
    if results_minilm:
        top = results_minilm[0]
        print(f"  Top match: {top['hotel_name']} ({top['city']}) - Score: {top['score']:.4f}")
        print(f"  Review: {top['review_text'][:100]}...")
    
    print(f"\nModel 2 (MPNet): {len(results_mpnet)} results in {time_mpnet:.3f}s")
    if results_mpnet:
        top = results_mpnet[0]
        print(f"  Top match: {top['hotel_name']} ({top['city']}) - Score: {top['score']:.4f}")
        print(f"  Review: {top['review_text'][:100]}...")
    
    comparison_results.append({
        'query': test_query,
        'minilm_time': time_minilm,
        'mpnet_time': time_mpnet,
        'minilm_top_score': results_minilm[0]['score'] if results_minilm else 0,
        'mpnet_top_score': results_mpnet[0]['score'] if results_mpnet else 0
    })

print(f"\n\n{'=' * 80}")
print("SUMMARY")
print("=" * 80)
avg_time_minilm = sum(r['minilm_time'] for r in comparison_results) / len(comparison_results)
avg_time_mpnet = sum(r['mpnet_time'] for r in comparison_results) / len(comparison_results)
avg_score_minilm = sum(r['minilm_top_score'] for r in comparison_results) / len(comparison_results)
avg_score_mpnet = sum(r['mpnet_top_score'] for r in comparison_results) / len(comparison_results)

print(f"Average search time - MiniLM: {avg_time_minilm:.3f}s | MPNet: {avg_time_mpnet:.3f}s")
print(f"Average top score - MiniLM: {avg_score_minilm:.4f} | MPNet: {avg_score_mpnet:.4f}")
print(f"\nSpeed winner: {'MiniLM' if avg_time_minilm < avg_time_mpnet else 'MPNet'}")
print(f"Higher average top score: {'MiniLM' if avg_score_minilm > avg_score_mpnet else 'MPNet'} (scores come from different embedding spaces, so treat this as a rough signal only)")

EMBEDDING MODEL COMPARISON
Model 1: all-MiniLM-L6-v2 (384 dimensions)
Model 2: all-mpnet-base-v2 (768 dimensions)


Query: "Looking for a clean hotel with great staff for my family"
--------------------------------------------------------------------------------

Model 1 (MiniLM): 3 results in 0.228s
  Top match: The Golden Oasis (Dubai) - Score: 0.7637
  Review: I recently stayed at The Golden Oasis during a business trip to Dubai, and I was quite impressed. Th...

Model 2 (MPNet): 3 results in 0.069s
  Top match: The Kiwi Grand (Wellington) - Score: 0.7875
  Review: We just wrapped up an incredible stay at The Kiwi Grand in Wellington, and I can’t recommend it enou...


Query: "Romantic hotel with amazing views"
--------------------------------------------------------------------------------

Model 1 (MiniLM): 3 results in 0.032s
  Top match: Nile Grandeur (Cairo) - Score: 0.7862
  Review: We recently stayed at the Nile Grandeur in Cairo, and it was such a delightful experience! The 
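
The top score of each model lives in its own embedding space, so a higher number from one model does not by itself mean a better match. A minimal sketch of two complementary checks, assuming result rows shaped like those returned by the search functions above: how much the two top-k lists agree, and how often the returned reviews match the traveller type the query implies. `topk_overlap` and `type_hit_rate` are hypothetical helpers, and the rows below are made up for illustration.

```python
from typing import Any, Dict, List


def topk_overlap(results_a: List[Dict[str, Any]], results_b: List[Dict[str, Any]]) -> float:
    """Jaccard overlap of the review_ids returned by two searches."""
    ids_a = {r["review_id"] for r in results_a}
    ids_b = {r["review_id"] for r in results_b}
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


def type_hit_rate(results: List[Dict[str, Any]], expected_type: str) -> float:
    """Fraction of results whose traveller_type matches, ignoring case."""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results
        if str(r.get("traveller_type", "")).lower() == expected_type.lower()
    )
    return hits / len(results)


# Made-up rows standing in for results_minilm and results_mpnet.
rows_a = [
    {"review_id": "review_1", "traveller_type": "Business"},
    {"review_id": "review_2", "traveller_type": "Family"},
    {"review_id": "review_3", "traveller_type": "Family"},
]
rows_b = [
    {"review_id": "review_2", "traveller_type": "Family"},
    {"review_id": "review_3", "traveller_type": "Family"},
    {"review_id": "review_4", "traveller_type": "Family"},
]

print(f"overlap: {topk_overlap(rows_a, rows_b):.2f}")
print(f"family hit rate A: {type_hit_rate(rows_a, 'family'):.2f}")
print(f"family hit rate B: {type_hit_rate(rows_b, 'family'):.2f}")
```

For the first test query ("... for my family"), the MiniLM top match shown above is a business-trip review, which is the kind of mismatch a hit rate like this would surface.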