# Sprint 2: RAG Development - Semantic Search Tool

This notebook prototypes the RAG (Retrieval-Augmented Generation) tool for the Airbnb Market Intelligence Agent.

## Goals:
1. Load and prepare Airbnb data for semantic search
2. Create text embeddings using OpenAI's embedding models
3. Build a vector store with Chroma
4. Implement RetrievalQA chain for semantic search
5. Test with qualitative questions about property features

## Features:
- Text document creation from property descriptions
- Chunking with RecursiveCharacterTextSplitter
- OpenAI embeddings for semantic understanding
- Local Chroma vector store
- RetrievalQA chain for intelligent search


In [1]:
# Import required libraries
import pandas as pd
import os
from dotenv import load_dotenv

# LangChain imports for RAG
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.schema import Document

# Load environment variables
load_dotenv()

print("✅ All libraries imported successfully")
print(f"OpenAI API Key loaded: {'Yes' if os.getenv('OPENAI_API_KEY') else 'No'}")


✅ All libraries imported successfully
OpenAI API Key loaded: Yes


In [2]:
# Load the processed Airbnb data
print("Loading Airbnb data...")

# Define path to processed data
data_path = os.path.join("..", "data", "processed", "airbnb_unified_data.csv")

# Load the DataFrame
df = pd.read_csv(data_path)

print(f"✅ Data loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display basic info about the dataset
print(f"\nDataset Overview:")
print(f"Total listings: {len(df):,}")
print(f"Cities: {df['city'].unique()}")
print(f"Sample of text columns:")
print(f"- Name: {df['name'].iloc[0][:100]}...")
print(f"- Description: {df['description'].iloc[0][:100]}...")
print(f"- Neighborhood: {df['neighborhood_overview'].iloc[0][:100]}...")


Loading Airbnb data...
✅ Data loaded successfully!
Shape: (49766, 81)
Columns: ['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_ava

  df = pd.read_csv(data_path)


In [3]:
# Prepare text documents for RAG
print("Creating text documents from property data...")

# Select and combine text columns
text_columns = ['name', 'description', 'neighborhood_overview']

# Create documents by combining text fields
documents = []
for idx, row in df.iterrows():
    # Combine text fields into a single document
    text_parts = []
    
    # Add property name
    if pd.notna(row['name']):
        text_parts.append(f"Property Name: {row['name']}")
    
    # Add description
    if pd.notna(row['description']):
        text_parts.append(f"Description: {row['description']}")
    
    # Add neighborhood overview
    if pd.notna(row['neighborhood_overview']):
        text_parts.append(f"Neighborhood: {row['neighborhood_overview']}")
    
    # Add metadata
    metadata = {
        'listing_id': row['id'],
        'city': row['city'],
        'price': row['price'],
        'property_type': row['property_type'],
        'room_type': row['room_type'],
        'bedrooms': row['bedrooms'],
        'beds': row['beds']
    }
    
    # Create document if we have text content
    if text_parts:
        combined_text = "\n\n".join(text_parts)
        doc = Document(
            page_content=combined_text,
            metadata=metadata
        )
        documents.append(doc)

print(f"✅ Created {len(documents):,} documents")
print(f"Sample document:")
print(f"Content: {documents[0].page_content[:200]}...")
print(f"Metadata: {documents[0].metadata}")


Creating text documents from property data...
✅ Created 49,766 documents
Sample document:
Content: Property Name: An Oasis in the City

Description: Very central to the city which can be reached by an easy walk or by bus, with transport at the door, if required, and all amenities within easy reach....
Metadata: {'listing_id': 11156, 'city': 'Sydney', 'price': 65.0, 'property_type': 'Private room in rental unit', 'room_type': 'Private room', 'bedrooms': 1.0, 'beds': 1.0}


In [4]:
# Split documents into chunks using RecursiveCharacterTextSplitter
print("Splitting documents into chunks...")

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Maximum size of each chunk
    chunk_overlap=200,      # Overlap between chunks for context
    length_function=len,    # Function to measure text length
    separators=["\n\n", "\n", " ", ""]  # Splitting separators in order of preference
)

# Split documents into chunks
text_chunks = text_splitter.split_documents(documents)

print(f"✅ Split {len(documents):,} documents into {len(text_chunks):,} chunks")
print(f"Average chunk size: {sum(len(chunk.page_content) for chunk in text_chunks) / len(text_chunks):.0f} characters")

# Display sample chunks
print(f"\nSample chunks:")
for i, chunk in enumerate(text_chunks[:3]):
    print(f"Chunk {i+1}:")
    print(f"Size: {len(chunk.page_content)} characters")
    print(f"Content: {chunk.page_content[:150]}...")
    print(f"Metadata: {chunk.metadata}")
    print()


Splitting documents into chunks...
✅ Split 49,766 documents into 63,437 chunks
Average chunk size: 546 characters

Sample chunks:
Chunk 1:
Size: 335 characters
Content: Property Name: An Oasis in the City

Description: Very central to the city which can be reached by an easy walk or by bus, with transport at the door,...
Metadata: {'listing_id': 11156, 'city': 'Sydney', 'price': 65.0, 'property_type': 'Private room in rental unit', 'room_type': 'Private room', 'bedrooms': 1.0, 'beds': 1.0}

Chunk 2:
Size: 586 characters
Content: Property Name: Unique Designer Rooftop Apartment in City Location

Description: You will be staying in a unique apartment on the top floor of a centra...
Metadata: {'listing_id': 15253, 'city': 'Sydney', 'price': 99.0, 'property_type': 'Private room in condo', 'room_type': 'Private room', 'bedrooms': 1.0, 'beds': 1.0}

Chunk 3:
Size: 999 characters
Content: Neighborhood: The location is really central and there is number of things to do and see all within a few

In [5]:
# Create OpenAI embeddings for text chunks
print("Creating embeddings using OpenAI...")

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # Cost-effective embedding model
    api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ OpenAI embeddings initialized")
print(f"Model: text-embedding-3-small")
print(f"Embedding dimension: 1536")

# Test embedding creation with a sample chunk
sample_chunk = text_chunks[0].page_content
sample_embedding = embeddings.embed_query(sample_chunk)

print(f"\nSample embedding created:")
print(f"Text: {sample_chunk[:100]}...")
print(f"Embedding dimension: {len(sample_embedding)}")
print(f"First 5 values: {sample_embedding[:5]}")


Creating embeddings using OpenAI...
✅ OpenAI embeddings initialized
Model: text-embedding-3-small
Embedding dimension: 1536

Sample embedding created:
Text: Property Name: An Oasis in the City

Description: Very central to the city which can be reached by a...
Embedding dimension: 1536
First 5 values: [-0.014609303325414658, -0.0228570569306612, 0.03488893806934357, 0.04079357907176018, -0.00550338439643383]


In [6]:
# Initialize Chroma vector store and add embeddings
print("Creating Chroma vector store...")

# Create vector store directory
vector_store_path = os.path.join("..", "data", "vector_store")

# Initialize Chroma vector store
vector_store = Chroma.from_documents(
    documents=text_chunks,
    embedding=embeddings,
    persist_directory=vector_store_path
)

print(f"✅ Chroma vector store created!")
print(f"Vector store path: {vector_store_path}")
print(f"Number of documents in vector store: {vector_store._collection.count()}")

# Test similarity search
print(f"\nTesting similarity search...")
test_query = "dedicated workspace"
similar_docs = vector_store.similarity_search(test_query, k=3)

print(f"Query: '{test_query}'")
print(f"Found {len(similar_docs)} similar documents:")
for i, doc in enumerate(similar_docs):
    print(f"\nResult {i+1}:")
    print(f"Score: Similarity match")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata: {doc.metadata}")


Creating Chroma vector store...
✅ Chroma vector store created!
Vector store path: ../data/vector_store
Number of documents in vector store: 63437

Testing similarity search...
Query: 'dedicated workspace'
Found 3 similar documents:

Result 1:
Score: Similarity match
Content: Property Name: Stylish Stay with Workspace

Description: Welcome to your comfortable and stylish home! This space features a dedicated workspace that caters to both work and relaxation. Nestled in a q...
Metadata: {'property_type': 'Private room in townhouse', 'room_type': 'Private room', 'listing_id': 1.2733935215287355e+18, 'beds': 1.0, 'bedrooms': 0.0, 'city': 'Melbourne', 'price': 60.0}

Result 2:
Score: Similarity match
Content: Property Name: Private Room wth facilities including office table...
Metadata: {'bedrooms': 2.0, 'beds': 2.0, 'property_type': 'Private room in guest suite', 'room_type': 'Private room', 'city': 'Melbourne', 'listing_id': '47750309'}

Result 3:
Score: Similarity match
Content: Property

In [7]:
# Create RetrievalQA chain
print("Building RetrievalQA chain...")

# Initialize ChatOpenAI for the language model
llm = ChatOpenAI(
    model="gpt-4o-mini",  # Cost-effective model for RAG
    temperature=0,
    api_key=os.getenv("OPENAI_API_KEY")
)

# Create RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Simple chain type for RAG
    retriever=vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}  # Retrieve top 4 most similar documents
    ),
    return_source_documents=True  # Return source documents for transparency
)

print("✅ RetrievalQA chain created!")
print(f"LLM: gpt-4o-mini")
print(f"Retriever: Similarity search with k=4")
print(f"Chain type: stuff")


Building RetrievalQA chain...
✅ RetrievalQA chain created!
LLM: gpt-4o-mini
Retriever: Similarity search with k=4
Chain type: stuff


In [8]:
# Test the RetrievalQA chain with qualitative questions
print("Testing RetrievalQA chain with semantic search queries...")

# Test questions
test_questions = [
    "find a place that mentions a 'dedicated workspace'",
    "show me properties with ocean views or beach access",
    "find listings that mention pet-friendly or allow pets",
    "show me properties with a pool or swimming facilities",
    "find places near public transportation or metro stations"
]

print("=" * 80)

for i, question in enumerate(test_questions, 1):
    print(f"\n🔍 Test {i}: {question}")
    print("-" * 60)
    
    try:
        # Query the RAG system
        result = qa_chain.invoke({"query": question})
        
        print(f"Answer: {result['result']}")
        
        print(f"\nSource documents used:")
        for j, doc in enumerate(result['source_documents'][:2]):  # Show top 2 sources
            print(f"  Source {j+1}:")
            print(f"    Listing ID: {doc.metadata.get('listing_id', 'N/A')}")
            print(f"    City: {doc.metadata.get('city', 'N/A')}")
            print(f"    Price: ${doc.metadata.get('price', 'N/A')}")
            print(f"    Content: {doc.page_content[:150]}...")
            print()
            
    except Exception as e:
        print(f"❌ Error: {e}")
    
    print("=" * 80)


Testing RetrievalQA chain with semantic search queries...

🔍 Test 1: find a place that mentions a 'dedicated workspace'
------------------------------------------------------------
Answer: Here are the properties that mention a 'dedicated workspace':

1. **Stylish Stay with Workspace**: This space features a dedicated workspace that caters to both work and relaxation.

2. **Inner-city apartment with dedicated office**: This apartment includes a dedicated workspace/study.

3. **Renovated 1-bed apt w parking & study workstation**: This apartment has a dedicated study space with a sit/stand desk.

4. **Clean Furnished Room (with lock) and work area**: This room includes a dedicated desk to work with 50MBPS Wi-Fi.

Source documents used:
  Source 1:
    Listing ID: 1.2733935215287355e+18
    City: Melbourne
    Price: $60.0
    Content: Property Name: Stylish Stay with Workspace

Description: Welcome to your comfortable and stylish home! This space features a dedicated workspace that ...



In [9]:
# Save the RAG components for later use
print("Saving RAG components...")

# Persist the vector store
vector_store.persist()

# Save the QA chain configuration
import pickle

rag_config = {
    'vector_store_path': vector_store_path,
    'embedding_model': 'text-embedding-3-small',
    'llm_model': 'gpt-4o-mini',
    'chunk_size': 1000,
    'chunk_overlap': 200,
    'retrieval_k': 4
}

# Save configuration
config_path = os.path.join("..", "data", "rag_config.pkl")
with open(config_path, 'wb') as f:
    pickle.dump(rag_config, f)

print("✅ RAG components saved!")
print(f"Vector store persisted to: {vector_store_path}")
print(f"Configuration saved to: {config_path}")

# Summary
print(f"\n📊 RAG System Summary:")
print(f"- Documents processed: {len(documents):,}")
print(f"- Text chunks created: {len(text_chunks):,}")
print(f"- Vector embeddings: OpenAI text-embedding-3-small")
print(f"- Vector store: Chroma (local)")
print(f"- LLM: GPT-4o-mini")
print(f"- Retrieval: Top 4 similar documents")

print(f"\n🎉 RAG prototype completed successfully!")
print(f"You can now perform semantic search on Airbnb property descriptions!")


Saving RAG components...
✅ RAG components saved!
Vector store persisted to: ../data/vector_store
Configuration saved to: ../data/rag_config.pkl

📊 RAG System Summary:
- Documents processed: 49,766
- Text chunks created: 63,437
- Vector embeddings: OpenAI text-embedding-3-small
- Vector store: Chroma (local)
- LLM: GPT-4o-mini
- Retrieval: Top 4 similar documents

🎉 RAG prototype completed successfully!
You can now perform semantic search on Airbnb property descriptions!


  vector_store.persist()
