# Embeddings Module - Graph-RAG Experiment 2

This notebook walks through and runs the embedding-based retrieval approach for the Airline Flight Insights Graph-RAG system.

**Features:**
- Two embedding models: `all-MiniLM-L6-v2` and `all-mpnet-base-v2`
- Feature-vector embeddings for Journey and Flight nodes
- Neo4j vector index integration
- Semantic similarity search
- Model comparison and evaluation

In [1]:
# Import required libraries
from neo4j import GraphDatabase
from dotenv import load_dotenv, find_dotenv
import os

# Import our embeddings module
from embeddings import (
    EMBEDDING_MODELS,
    get_model,
    create_journey_text,
    create_flight_text,
    generate_embeddings,
    generate_single_embedding,
    fetch_journey_nodes,
    fetch_flight_nodes,
    create_vector_index,
    store_journey_embeddings,
    store_flight_embeddings,
    semantic_search_journeys,
    semantic_search_flights,
    get_embedding_context,
    generate_and_store_all_embeddings,
    compare_models
)

print("Imports successful!")
print(f"Available models: {list(EMBEDDING_MODELS.keys())}")

Imports successful!
Available models: ['minilm', 'mpnet']


In [2]:
# Load environment variables
load_dotenv(find_dotenv())

NEO4J_URI = os.getenv('NEO4J_URI') or os.getenv('URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME') or os.getenv('USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD') or os.getenv('PASSWORD')

print(f"Neo4j URI: {NEO4J_URI}")
print(f"Username: {NEO4J_USERNAME}")

Neo4j URI: neo4j+s://d9ac65c9.databases.neo4j.io
Username: neo4j
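
The `or` chains above silently yield `None` when neither variable is set, and a generic fallback name such as `USERNAME` is often already defined by the operating system (Windows sets it to the login name). A minimal sketch of a stricter lookup follows; `first_env` is a hypothetical helper, not part of the `embeddings` module, and the demo mapping stands in for a real `.env` file.

```python
import os


def first_env(*names, environ=None):
    """Return the first set, non-empty variable among `names`.

    Hypothetical helper: raises instead of returning None so a missing
    credential fails at load time rather than at connection time.
    """
    env = os.environ if environ is None else environ
    for name in names:
        value = env.get(name)
        if value:
            return value
    raise KeyError(f"None of these variables are set: {', '.join(names)}")


# In-memory mapping so the example runs without a .env file (assumed values).
demo_env = {"URI": "neo4j://localhost:7687", "NEO4J_USERNAME": "neo4j"}
uri = first_env("NEO4J_URI", "URI", environ=demo_env)
user = first_env("NEO4J_USERNAME", "USERNAME", environ=demo_env)
print(uri, user)
```

Passing `environ` explicitly keeps the helper testable; in the notebook it would read from `os.environ` after `load_dotenv`.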


In [3]:
# Connect to Neo4j
driver = GraphDatabase.driver(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD)
)

driver.verify_connectivity()
print("Successfully connected to Neo4j!")

Successfully connected to Neo4j!


## 1. Text Representation Examples

Before generating embeddings, we convert node properties into human-readable text.

In [4]:
# Example: Journey text representations
sample_journeys = [
    {
        "passenger_class": "Business",
        "food_satisfaction_score": 2,
        "arrival_delay_minutes": 45,
        "actual_flown_miles": 2500,
        "number_of_legs": 2
    },
    {
        "passenger_class": "Economy",
        "food_satisfaction_score": 5,
        "arrival_delay_minutes": -10,
        "actual_flown_miles": 500,
        "number_of_legs": 1
    }
]

print("Journey Text Representations:")
print("=" * 60)
for i, journey in enumerate(sample_journeys, 1):
    text = create_journey_text(journey)
    print(f"\nJourney {i}:")
    print(f"  Properties: {journey}")
    print(f"  Text: {text}")

Journey Text Representations:

Journey 1:
  Properties: {'passenger_class': 'Business', 'food_satisfaction_score': 2, 'arrival_delay_minutes': 45, 'actual_flown_miles': 2500, 'number_of_legs': 2}
  Text: A Business class journey covering 2500 miles over 2 flight segments. The flight delayed by 45 minutes and had poor food satisfaction (score: 2/5).

Journey 2:
  Properties: {'passenger_class': 'Economy', 'food_satisfaction_score': 5, 'arrival_delay_minutes': -10, 'actual_flown_miles': 500, 'number_of_legs': 1}
  Text: A Economy class journey covering 500 miles over 1 flight segment. The flight arrived 10 minutes early and had excellent food satisfaction (score: 5/5).
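
The module's `create_journey_text` is imported rather than shown, so the following is only a minimal sketch of how such a template could be written, inferred from the outputs in this notebook. `journey_to_text` and `FOOD_LABELS` are hypothetical names, and the labels for scores 3 and 4 and the on-time wording are assumptions. The sketch also deliberately differs from the real output by choosing "A" or "An" and by writing "was delayed".

```python
# Labels for 1, 2 and 5 match the notebook output; 3 and 4 are assumed.
FOOD_LABELS = {1: "very poor", 2: "poor", 3: "average", 4: "good", 5: "excellent"}


def journey_to_text(journey):
    """Render a Journey property dict as one descriptive passage (illustrative only)."""
    cls = journey["passenger_class"]
    article = "An" if cls[:1].upper() in ("A", "E", "I", "O", "U") else "A"

    legs = journey["number_of_legs"]
    segment = "flight segment" if legs == 1 else "flight segments"

    delay = journey["arrival_delay_minutes"]
    if delay > 0:
        timing = f"was delayed by {delay} minutes"
    elif delay < 0:
        timing = f"arrived {-delay} minutes early"
    else:
        timing = "arrived on time"  # assumed wording for a zero delay

    score = journey["food_satisfaction_score"]
    food = FOOD_LABELS.get(score, "unrated")

    return (
        f"{article} {cls} class journey covering {journey['actual_flown_miles']} miles "
        f"over {legs} {segment}. The flight {timing} and had {food} food satisfaction "
        f"(score: {score}/5)."
    )


example = journey_to_text({
    "passenger_class": "Economy",
    "food_satisfaction_score": 5,
    "arrival_delay_minutes": -10,
    "actual_flown_miles": 500,
    "number_of_legs": 1,
})
print(example)
```

Because the embedding model only sees this string, every property that should influence retrieval has to appear in it in words the query is likely to use.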


In [5]:
# Example: Flight text representations
sample_flight = {
    "flight_number": "UA123",
    "fleet_type_description": "B787-9"
}

print("Flight Text Representation:")
print("=" * 60)
text = create_flight_text(sample_flight, origin="ORD", destination="LAX")
print(f"Properties: {sample_flight}")
print(f"Text: {text}")

Flight Text Representation:
Properties: {'flight_number': 'UA123', 'fleet_type_description': 'B787-9'}
Text: Flight UA123 operated by B787-9 aircraft operating the route from ORD to LAX.


## 2. Load Embedding Models

We use two models as required:
- **all-MiniLM-L6-v2**: 384 dimensions, faster
- **all-mpnet-base-v2**: 768 dimensions, higher quality

In [6]:
# Load both models
print("Loading embedding models...\n")

model_minilm = get_model("minilm")
print(f"MiniLM dimensions: {model_minilm.get_sentence_embedding_dimension()}")

model_mpnet = get_model("mpnet")
print(f"MPNet dimensions: {model_mpnet.get_sentence_embedding_dimension()}")

Loading embedding models...

Loading model: all-MiniLM-L6-v2...
Model all-MiniLM-L6-v2 loaded successfully!
MiniLM dimensions: 384
Loading model: all-mpnet-base-v2...
Model all-mpnet-base-v2 loaded successfully!
MPNet dimensions: 768


In [7]:
# Test embedding generation
test_texts = [
    "A Business class journey with excellent food satisfaction",
    "An Economy flight that was delayed by 2 hours",
    "A short domestic flight with poor service"
]

print("Testing embedding generation...\n")

# MiniLM
embeddings_minilm = generate_embeddings(test_texts, "minilm")
print(f"MiniLM embeddings shape: {embeddings_minilm.shape}")

# MPNet
embeddings_mpnet = generate_embeddings(test_texts, "mpnet")
print(f"MPNet embeddings shape: {embeddings_mpnet.shape}")

Testing embedding generation...

Batches: 100%|██████████| 1/1 [00:00<00:00,  6.37it/s]
MiniLM embeddings shape: (3, 384)
Batches: 100%|██████████| 1/1 [00:00<00:00,  6.78it/s]
MPNet embeddings shape: (3, 768)


## 3. Generate and Store Embeddings

This section generates embeddings for all Journey and Flight nodes and stores them in Neo4j.

**⚠️ Note:** This process may take several minutes depending on the number of nodes.

In [8]:
# First, let's check how many nodes we have
journeys = fetch_journey_nodes(driver)
flights = fetch_flight_nodes(driver)

print(f"Journey nodes: {len(journeys)}")
print(f"Flight nodes: {len(flights)}")
print(f"\nTotal nodes to embed: {len(journeys) + len(flights)}")

Journey nodes: 3000
Flight nodes: 1838

Total nodes to embed: 4838
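
The progress bars in the runs that follow report 94 batches for Journey nodes and 58 for Flight nodes. Those counts are consistent with an encoding batch size of 32, which is the default of `SentenceTransformer.encode`. Whether the module relies on that default is an assumption, and the check is just arithmetic:

```python
import math

BATCH_SIZE = 32  # assumed: the sentence-transformers default for encode()


def num_batches(n_items, batch_size=BATCH_SIZE):
    """Number of encoding batches needed to cover n_items texts."""
    return math.ceil(n_items / batch_size)


journey_batches = num_batches(3000)
flight_batches = num_batches(1838)
print(journey_batches, flight_batches)  # prints: 94 58
```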


In [9]:
# Generate and store embeddings with MiniLM model
# This creates vector indexes and stores embeddings in Neo4j

generate_and_store_all_embeddings(driver, "minilm")


Generating embeddings with all-MiniLM-L6-v2

Fetching Journey nodes...
Found 3000 Journey nodes
Creating text representations...
Generating Journey embeddings...


Batches: 100%|██████████| 94/94 [00:13<00:00,  7.15it/s]


Creating vector index for Journeys...
Created vector index: journey_embedding_minilm (384 dimensions)
Storing Journey embeddings...
Stored 100/3000 Journey embeddings...
Stored 200/3000 Journey embeddings...
Stored 300/3000 Journey embeddings...
Stored 400/3000 Journey embeddings...
Stored 500/3000 Journey embeddings...
Stored 600/3000 Journey embeddings...
Stored 700/3000 Journey embeddings...
Stored 800/3000 Journey embeddings...
Stored 900/3000 Journey embeddings...
Stored 1000/3000 Journey embeddings...
Stored 1100/3000 Journey embeddings...
Stored 1200/3000 Journey embeddings...
Stored 1300/3000 Journey embeddings...
Stored 1400/3000 Journey embeddings...
Stored 1500/3000 Journey embeddings...
Stored 1600/3000 Journey embeddings...
Stored 1700/3000 Journey embeddings...
Stored 1800/3000 Journey embeddings...
Stored 1900/3000 Journey embeddings...
Stored 2000/3000 Journey embeddings...
Stored 2100/3000 Journey embeddings...
Stored 2200/3000 Journey embeddings...
Stored 2300/3000 Jo

Batches: 100%|██████████| 58/58 [00:06<00:00,  9.10it/s]


Creating vector index for Flights...
Created vector index: flight_embedding_minilm (384 dimensions)
Storing Flight embeddings...
Stored 100/1838 Flight embeddings...
Stored 200/1838 Flight embeddings...
Stored 300/1838 Flight embeddings...
Stored 400/1838 Flight embeddings...
Stored 500/1838 Flight embeddings...
Stored 600/1838 Flight embeddings...
Stored 700/1838 Flight embeddings...
Stored 800/1838 Flight embeddings...
Stored 900/1838 Flight embeddings...
Stored 1000/1838 Flight embeddings...
Stored 1100/1838 Flight embeddings...
Stored 1200/1838 Flight embeddings...
Stored 1300/1838 Flight embeddings...
Stored 1400/1838 Flight embeddings...
Stored 1500/1838 Flight embeddings...
Stored 1600/1838 Flight embeddings...
Stored 1700/1838 Flight embeddings...
Stored 1800/1838 Flight embeddings...
Stored 1838/1838 Flight embeddings...
Successfully stored 1838 Flight embeddings with minilm model

Embedding generation complete for minilm!
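
The log above shows one vector index per node label and model, for example `journey_embedding_minilm` with 384 dimensions. The `create_vector_index` implementation is not shown, so this is only a sketch of the Cypher statement such a helper might issue on a recent Neo4j 5 release. The property name `embedding_minilm` and the cosine similarity function are assumptions, and `vector_index_cypher` is a hypothetical name.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def vector_index_cypher(index_name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement (illustrative, not the module's code).

    Index names, labels and property keys cannot be passed as query
    parameters, so they are validated before being interpolated.
    """
    for name in (index_name, label, prop):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError(f"Unsupported similarity function: {similarity!r}")
    if not isinstance(dimensions, int) or dimensions < 1:
        raise ValueError("dimensions must be a positive integer")

    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dimensions}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


cypher = vector_index_cypher("journey_embedding_minilm", "Journey", "embedding_minilm", 384)
print(cypher)
```

The index dimension has to match the model's output size, which is why each model gets its own index rather than sharing one.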



In [10]:
# Generate and store embeddings with MPNet model

generate_and_store_all_embeddings(driver, "mpnet")


Generating embeddings with all-mpnet-base-v2

Fetching Journey nodes...
Found 3000 Journey nodes
Creating text representations...
Generating Journey embeddings...


Batches: 100%|██████████| 94/94 [01:29<00:00,  1.05it/s]


Creating vector index for Journeys...
Created vector index: journey_embedding_mpnet (768 dimensions)
Storing Journey embeddings...
Stored 100/3000 Journey embeddings...
Stored 200/3000 Journey embeddings...
Stored 300/3000 Journey embeddings...
Stored 400/3000 Journey embeddings...
Stored 500/3000 Journey embeddings...
Stored 600/3000 Journey embeddings...
Stored 700/3000 Journey embeddings...
Stored 800/3000 Journey embeddings...
Stored 900/3000 Journey embeddings...
Stored 1000/3000 Journey embeddings...
Stored 1100/3000 Journey embeddings...
Stored 1200/3000 Journey embeddings...
Stored 1300/3000 Journey embeddings...
Stored 1400/3000 Journey embeddings...
Stored 1500/3000 Journey embeddings...
Stored 1600/3000 Journey embeddings...
Stored 1700/3000 Journey embeddings...
Stored 1800/3000 Journey embeddings...
Stored 1900/3000 Journey embeddings...
Stored 2000/3000 Journey embeddings...
Stored 2100/3000 Journey embeddings...
Stored 2200/3000 Journey embeddings...
Stored 2300/3000 Jou

Batches: 100%|██████████| 58/58 [00:46<00:00,  1.25it/s]


Creating vector index for Flights...
Created vector index: flight_embedding_mpnet (768 dimensions)
Storing Flight embeddings...
Stored 100/1838 Flight embeddings...
Stored 200/1838 Flight embeddings...
Stored 300/1838 Flight embeddings...
Stored 400/1838 Flight embeddings...
Stored 500/1838 Flight embeddings...
Stored 600/1838 Flight embeddings...
Stored 700/1838 Flight embeddings...
Stored 800/1838 Flight embeddings...
Stored 900/1838 Flight embeddings...
Stored 1000/1838 Flight embeddings...
Stored 1100/1838 Flight embeddings...
Stored 1200/1838 Flight embeddings...
Stored 1300/1838 Flight embeddings...
Stored 1400/1838 Flight embeddings...
Stored 1500/1838 Flight embeddings...
Stored 1600/1838 Flight embeddings...
Stored 1700/1838 Flight embeddings...
Stored 1800/1838 Flight embeddings...
Stored 1838/1838 Flight embeddings...
Successfully stored 1838 Flight embeddings with mpnet model

Embedding generation complete for mpnet!
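
The two runs above give a rough speed comparison. The sketch below only adds up the elapsed encoding times printed by the progress bars (13 s and 6 s for MiniLM, 1:29 and 0:46 for MPNet). The figures are rounded to whole seconds and exclude the time spent writing to Neo4j, so the ratio is approximate.

```python
# Elapsed encoding times in seconds, read from the progress bars above.
encode_seconds = {
    "minilm": {"journeys": 13, "flights": 6},
    "mpnet": {"journeys": 89, "flights": 46},
}

totals = {model: sum(times.values()) for model, times in encode_seconds.items()}
slowdown = totals["mpnet"] / totals["minilm"]

print(totals)              # {'minilm': 19, 'mpnet': 135}
print(round(slowdown, 1))  # 7.1
```

On this hardware and dataset, MPNet encoding took roughly seven times as long as MiniLM for the same 4838 texts.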



## 4. Semantic Search Examples

Now we can search for similar nodes using natural language queries.
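
The search helpers are imported from the module and their Cypher is not shown. As a hedged sketch, a vector search against a Neo4j index usually embeds the query text and then calls the built-in `db.index.vector.queryNodes` procedure. `build_vector_query` and the parameter names below are hypothetical; only the procedure name and its `node` and `score` outputs come from Neo4j itself.

```python
VECTOR_QUERY = (
    "CALL db.index.vector.queryNodes($index_name, $top_k, $query_vector) "
    "YIELD node, score "
    "RETURN node, score AS similarity_score "
    "ORDER BY similarity_score DESC"
)


def build_vector_query(index_name, query_vector, top_k=5):
    """Return the Cypher text and parameters for a top-k vector search (illustrative)."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    params = {
        "index_name": index_name,
        "top_k": top_k,
        # Plain Python floats, since the driver cannot serialize NumPy scalars.
        "query_vector": [float(x) for x in query_vector],
    }
    return VECTOR_QUERY, params


# A 3-dimensional toy vector stands in for a real 384-dimensional embedding.
cypher, params = build_vector_query("journey_embedding_minilm", [0.1, 0.2, 0.3], top_k=5)
print(cypher)
print(params["top_k"], len(params["query_vector"]))
```

The query vector must come from the same model that produced the stored embeddings, otherwise the dimensions (and the meaning of the scores) will not match the index.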

In [11]:
# Example: Search for journeys with delays
query = "Business class flights with long delays and poor food quality"

print(f"Query: '{query}'\n")
print("=" * 60)
print("MiniLM Results:")
print("=" * 60)

results = semantic_search_journeys(driver, query, "minilm", top_k=5)
for r in results:
    print(f"\n  Score: {r['similarity_score']:.3f}")
    print(f"  Class: {r['passenger_class']}, Food Score: {r['food_satisfaction_score']}")
    print(f"  Delay: {r['arrival_delay_minutes']} min, Miles: {r['actual_flown_miles']}")

Query: 'Business class flights with long delays and poor food quality'

MiniLM Results:

  Score: 0.888
  Class: Economy, Food Score: 1
  Delay: 177 min, Miles: 1400

  Score: 0.886
  Class: Economy, Food Score: 1
  Delay: 228 min, Miles: 758

  Score: 0.886
  Class: Economy, Food Score: 1
  Delay: 134 min, Miles: 2562

  Score: 0.885
  Class: Economy, Food Score: 1
  Delay: 38 min, Miles: 733

  Score: 0.885
  Class: Economy, Food Score: 2
  Delay: 104 min, Miles: 1400
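
All five matches are Economy journeys even though the query asks for Business class, and their scores span only 0.885 to 0.888. When reading such scores it helps to know how they are scaled. Assuming the index uses cosine similarity (the source does not say), Neo4j reports a normalized score of (1 + cosine) / 2, so values are compressed into the upper part of the range. A small sketch of the mapping, with hypothetical helper names:

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def neo4j_cosine_score(u, v):
    """Score in [0, 1] as Neo4j reports it for a cosine index: (1 + cos) / 2."""
    return (1.0 + cosine(u, v)) / 2.0


def raw_cosine_from_score(score):
    """Invert the normalization to recover the raw cosine similarity."""
    return 2.0 * score - 1.0


print(neo4j_cosine_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors score 0.5
print(round(raw_cosine_from_score(0.888), 3))      # top score above as raw cosine: 0.776
```

Under that assumption, the gap between the first and fifth result is about 0.006 in raw cosine terms, which is too small to treat the ranking as a strong signal.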


In [12]:
# Example: Search for economy flights with good satisfaction
query = "Economy class short trips with excellent food and on-time arrival"

print(f"Query: '{query}'\n")
print("=" * 60)
print("MPNet Results:")
print("=" * 60)

results = semantic_search_journeys(driver, query, "mpnet", top_k=5)
for r in results:
    print(f"\n  Score: {r['similarity_score']:.3f}")
    print(f"  Class: {r['passenger_class']}, Food Score: {r['food_satisfaction_score']}")
    print(f"  Delay: {r['arrival_delay_minutes']} min, Miles: {r['actual_flown_miles']}")

Query: 'Economy class short trips with excellent food and on-time arrival'

MPNet Results:

  Score: 0.800
  Class: Economy, Food Score: 5
  Delay: -12 min, Miles: 2007

  Score: 0.798
  Class: Economy, Food Score: 3
  Delay: -31 min, Miles: 160

  Score: 0.796
  Class: Economy, Food Score: 3
  Delay: -8 min, Miles: 200

  Score: 0.796
  Class: Economy, Food Score: 3
  Delay: -15 min, Miles: 160

  Score: 0.795
  Class: Economy, Food Score: 3
  Delay: -12 min, Miles: 605


In [13]:
# Example: Search for flights
query = "Boeing 787 flights from Chicago"

print(f"Query: '{query}'\n")
print("=" * 60)

results = semantic_search_flights(driver, query, "minilm", top_k=5)
for r in results:
    print(f"\n  Score: {r['similarity_score']:.3f}")
    print(f"  Flight: {r['flight_number']}, Aircraft: {r['fleet_type_description']}")
    print(f"  Route: {r['origin']} -> {r['destination']}")

Query: 'Boeing 787 flights from Chicago'


  Score: 0.834
  Flight: 781, Aircraft: B737-900
  Route: IAX -> LAX

  Score: 0.828
  Flight: 784, Aircraft: B737-MAX9
  Route: LAX -> IAX

  Score: 0.821
  Flight: 787, Aircraft: B737-900
  Route: DEX -> SJX

  Score: 0.815
  Flight: 789, Aircraft: B777-200
  Route: EWX -> SFX

  Score: 0.813
  Flight: 787, Aircraft: A319-100
  Route: DEX -> SBX


## 5. Model Comparison

Compare the top results returned by the MiniLM and MPNet models for the same query.

In [14]:
# Compare models on the same query
test_query = "Delayed flights with unhappy passengers in business class"

print(f"Comparing models for: '{test_query}'\n")
print("=" * 80)

comparison = compare_models(driver, test_query, top_k=3)

for model_name, results in comparison.items():
    print(f"\n{model_name}:")
    print("-" * 40)
    
    if "error" in results:
        print(f"  Error: {results['error']}")
        continue
    
    print("  Top Journey matches:")
    for r in results.get("journeys", [])[:3]:
        print(f"    - Score: {r['similarity_score']:.3f}, "
              f"Class: {r['passenger_class']}, "
              f"Food: {r['food_satisfaction_score']}, "
              f"Delay: {r['arrival_delay_minutes']}min")

Comparing models for: 'Delayed flights with unhappy passengers in business class'


all-MiniLM-L6-v2:
----------------------------------------
  Top Journey matches:
    - Score: 0.811, Class: Economy, Food: 1, Delay: 177min
    - Score: 0.810, Class: Economy, Food: 2, Delay: 104min
    - Score: 0.809, Class: Economy, Food: 1, Delay: 228min

all-mpnet-base-v2:
----------------------------------------
  Top Journey matches:
    - Score: 0.761, Class: Economy, Food: 1, Delay: 10min
    - Score: 0.758, Class: Economy, Food: 1, Delay: 85min
    - Score: 0.758, Class: Economy, Food: 1, Delay: 97min
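
One simple way to quantify how much the two models agree is the Jaccard overlap of their top-k result sets. The output above does not show node identifiers, so this sketch uses `(class, food score, delay)` tuples as stand-in keys; with real data the node's ID would be the better key. `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard overlap of two result collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Top-3 journeys per model, copied from the comparison output above.
minilm_top = [("Economy", 1, 177), ("Economy", 2, 104), ("Economy", 1, 228)]
mpnet_top = [("Economy", 1, 10), ("Economy", 1, 85), ("Economy", 1, 97)]

overlap = jaccard(minilm_top, mpnet_top)
business_hits = sum(1 for cls, _, _ in minilm_top + mpnet_top if cls == "Business")

print(overlap)        # 0.0
print(business_hits)  # 0
```

For this query the two models return disjoint top-3 sets, and neither returns a Business class journey.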


## 6. Get Context for RAG

The `get_embedding_context` function returns formatted text suitable for LLM context.

In [15]:
# Get formatted context for a query
query = "Find journeys with poor food satisfaction on long flights"

print(f"Query: '{query}'\n")
print("=" * 80)
print("EMBEDDING CONTEXT (MiniLM):")
print("=" * 80)

context = get_embedding_context(driver, query, "minilm", top_k=3)
print(context)

Query: 'Find journeys with poor food satisfaction on long flights'

EMBEDDING CONTEXT (MiniLM):
Found 3 semantically similar Journey records:
  1. (similarity: 0.806) A Economy class journey covering 157 miles over 1 flight segment. The flight delayed by 58 minutes and had very poor food satisfaction (score: 1/5).
  2. (similarity: 0.805) A Economy class journey covering 1400 miles over 1 flight segment. The flight delayed by 104 minutes and had poor food satisfaction (score: 2/5).
  3. (similarity: 0.803) A Economy class journey covering 199 miles over 1 flight segment. The flight delayed by 162 minutes and had very poor food satisfaction (score: 1/5).

Found 3 semantically similar Flight records:
  1. (similarity: 0.669) Flight 1500 operated by B787-10 aircraft operating the route from EWX to LAX.
  2. (similarity: 0.665) Flight 1232 operated by B757-300 aircraft operating the route from HNX to LAX.
  3. (similarity: 0.663) Flight 374 operated by B737-900 aircraft operating the route

In [16]:
# Compare with MPNet context
print("=" * 80)
print("EMBEDDING CONTEXT (MPNet):")
print("=" * 80)

context_mpnet = get_embedding_context(driver, query, "mpnet", top_k=3)
print(context_mpnet)

EMBEDDING CONTEXT (MPNet):
Found 3 semantically similar Journey records:
  1. (similarity: 0.854) A Economy class journey covering 1400 miles over 1 flight segment. The flight arrived 20 minutes early and had very poor food satisfaction (score: 1/5).
  2. (similarity: 0.853) A Economy class journey covering 2200 miles over 1 flight segment. The flight arrived 16 minutes early and had very poor food satisfaction (score: 1/5).
  3. (similarity: 0.853) A Economy class journey covering 1437 miles over 1 flight segment. The flight arrived 12 minutes early and had very poor food satisfaction (score: 1/5).

Found 3 semantically similar Flight records:
  1. (similarity: 0.644) Flight 1544 operated by B737-MAX9 aircraft operating the route from SEX to DEX.
  2. (similarity: 0.640) Flight 2255 operated by B737-900 aircraft operating the route from DEX to TUX.
  3. (similarity: 0.640) Flight 2477 operated by B737-900 aircraft operating the route from SAX to DEX.
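
The context strings above follow a regular layout: a header line per node type, then one ranked line per match with its score and text representation. The module's formatter is not shown; a minimal sketch that reproduces the same layout could look like this, where `format_context` is a hypothetical name.

```python
def format_context(label, records):
    """Format (score, text) pairs as a ranked context block for an LLM prompt."""
    lines = [f"Found {len(records)} semantically similar {label} records:"]
    for rank, (score, text) in enumerate(records, start=1):
        lines.append(f"  {rank}. (similarity: {score:.3f}) {text}")
    return "\n".join(lines)


# Records copied from the MPNet flight results above.
flight_records = [
    (0.644, "Flight 1544 operated by B737-MAX9 aircraft operating the route from SEX to DEX."),
    (0.640, "Flight 2255 operated by B737-900 aircraft operating the route from DEX to TUX."),
    (0.640, "Flight 2477 operated by B737-900 aircraft operating the route from SAX to DEX."),
]
context_block = format_context("Flight", flight_records)
print(context_block)
```

Keeping the similarity score in the text lets the downstream LLM, or a person debugging the prompt, see how weak a match is.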


## 7. Cleanup

Close the Neo4j driver when done.

In [17]:
# Uncomment to close the driver
# driver.close()
# print("Driver closed.")
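
Because the cleanup cell is commented out, the connection stays open until the kernel stops. In a script, one option is to scope the driver with a context manager so it is closed even if a step fails. The sketch below uses `contextlib.closing` with a stand-in object so that it runs without a database; `StubDriver` is hypothetical. The real Neo4j driver exposes `close()` as well and can also be used directly in a `with` statement.

```python
from contextlib import closing


class StubDriver:
    """Stand-in for a Neo4j driver: only tracks whether close() was called."""

    def __init__(self):
        self.closed = False

    def verify_connectivity(self):
        pass  # the real driver would contact the server here

    def close(self):
        self.closed = True


stub = StubDriver()
with closing(stub) as driver_like:
    driver_like.verify_connectivity()
    # ... embedding generation and searches would run here ...

print(stub.closed)  # True
```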