# Project 4: Personalized Real Estate AI Agent <a class="jp-toc-ignore"></a>
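The goal of this notebook is to build a personalized real estate agent: it generates synthetic property listings with an LLM, stores them with embeddings in a vector database, and uses semantic search to match listings to a buyer's preferences.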

# Project Config
The following cell contains the API keys and parameters used in the project:

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

# API Keys
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_API_BASE = "https://openai.vocareum.com/v1"

# LLM Configuration
LLM_MODEL = "gpt-3.5-turbo"
EMBEDDING_MODEL = "text-embedding-ada-002"

# Vector DB Configuration
VECTOR_DB_PATH = "./vector_db"

# Application Settings
LISTINGS_FILE_PATH='listings.json'
NUM_LISTINGS_TO_GENERATE = 100
NUM_LISTINGS_TO_RETURN = 5
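
Because `os.getenv` silently returns `None` when a variable is missing, a misconfigured `.env` file only surfaces later as an authentication error. The following is a minimal sketch of a fail-fast check; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: an empty or missing variable is treated as unset.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value
```

With this helper, the configuration cell could read `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`, so a missing key is reported before any API call is made.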

# Testing OpenAI API key
To make sure our OpenAI API key is working, we ask the model for a list of São Paulo neighborhoods, which are used in the next section to generate the listings.

In [2]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
llm = ChatOpenAI(
    model=LLM_MODEL,
    openai_api_key=OPENAI_API_KEY,
    openai_api_base=OPENAI_API_BASE
)

prompt = 'Give me a list of 10 neighborhoods in São Paulo'

# Use invoke(); calling the model object directly is deprecated
response = llm.invoke([HumanMessage(content=prompt)])
listing_text = response.content



In [3]:
print(listing_text)

1. Pinheiros
2. Itaim Bibi
3. Vila Madalena
4. Moema
5. Jardins
6. Vila Mariana
7. Brooklin
8. Consolação
9. Santana
10. Morumbi


# Data generation
In this section, we use the OpenAI API to generate synthetic data about houses for sale in different São Paulo neighborhoods, which our real estate agent will search.

In [4]:
import json
import random
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from tqdm.notebook import tqdm

def generate_listings(num_listings=NUM_LISTINGS_TO_GENERATE, file_path=LISTINGS_FILE_PATH):
    """Generate synthetic real estate listings using LLM"""
    
    llm = ChatOpenAI(
        model=LLM_MODEL,
        openai_api_key=OPENAI_API_KEY,
        openai_api_base=OPENAI_API_BASE
    )
    
    neighborhoods = ["Jardins", "Pinheiros", "Vila Madalena", "Moema", 
                     "Itaim Bibi", "Morumbi", "Liberdade", "Bela Vista", 
                     "Brooklin", "Paraiso"
                    ]
    
    listings = []
    
    for i in tqdm(range(num_listings)):
        # Create a prompt for the LLM to generate a diverse listing
        neighborhood = random.choice(neighborhoods)
        bedrooms = random.randint(1, 5)
        bathrooms = random.randint(1, 4)
        price = random.randint(200, 1500) * 1000
        size = random.randint(800, 4000)
        
        prompt = f"""
        Generate a detailed real estate listing with the following specifications:
        - Neighborhood: {neighborhood}
        - Price: ${price:,}
        - Bedrooms: {bedrooms}
        - Bathrooms: {bathrooms}
        - House Size: {size} sqft
        
        Include a property description highlighting unique features and a separate neighborhood description.
        Format the output exactly as follows:
        
        Neighborhood: [neighborhood name]
        Price: [price]
        Bedrooms: [number]
        Bathrooms: [number]
        House Size: [size] sqft

        Description: [detailed property description]

        Neighborhood Description: [neighborhood description]
        """
        
        response = llm.invoke([HumanMessage(content=prompt)])
        listing_text = response.content
        
        # Parse the generated listing into structured format
        listing_data = {}
        sections = listing_text.split("\n\n")
        
        # Parse basic info
        basic_info = sections[0].strip().split("\n")
        for line in basic_info:
            if ":" in line:
                key, value = line.split(":", 1)
                listing_data[key.strip()] = value.strip()
        
        # Parse description and neighborhood
        for section in sections[1:]:
            if section.startswith("Description:"):
                listing_data["Description"] = section.replace("Description:", "", 1).strip()
            elif section.startswith("Neighborhood Description:"):
                listing_data["Neighborhood Description"] = section.replace("Neighborhood Description:", "", 1).strip()
        
        listings.append(listing_data)
    
    # Save listings to file
    with open(file_path, "w") as f:
        json.dump(listings, f, indent=2)
    
    return listings
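
The parser inside `generate_listings` assumes the model separates sections with blank lines exactly as requested. A more tolerant alternative is sketched below, assuming only that each field starts a line with its label followed by a colon; the name `parse_listing` and the `FIELDS` list are illustrative and not part of the original notebook.

```python
import re

# Labels requested in the prompt. The longer label is listed first so that
# "Neighborhood Description" is never cut short to "Neighborhood".
FIELDS = [
    "Neighborhood Description",
    "Neighborhood",
    "Price",
    "Bedrooms",
    "Bathrooms",
    "House Size",
    "Description",
]

_LABEL_RE = re.compile(
    r"^\s*(" + "|".join(re.escape(f) for f in FIELDS) + r")\s*:\s*(.*)$"
)


def parse_listing(text: str) -> dict:
    """Parse 'Label: value' lines; unlabeled lines continue the previous field."""
    data = {}
    current = None
    for line in text.splitlines():
        match = _LABEL_RE.match(line)
        if match:
            current = match.group(1)
            data[current] = match.group(2).strip()
        elif current and line.strip():
            # Continuation of a multi-line value (e.g. a wrapped description)
            data[current] = (data[current] + " " + line.strip()).strip()
    return data
```

This version does not depend on blank lines between sections, so a response that puts every field on consecutive lines is still parsed. A description line that happens to start with one of the labels would be misread as a new field, which is a limitation of this sketch.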

## Generating listings
In this step, we check whether the listings JSON file already exists and, if so, load it as a list of dictionaries.
Otherwise, we generate it with the function `generate_listings`.

In [5]:
import json
from pathlib import Path

file_path = Path(LISTINGS_FILE_PATH)

if file_path.exists() and file_path.is_file():
    with open(file_path, "r") as file:
        listings = json.load(file)
        print(f'Loaded listings from local file with {len(listings)} records')
else:
    listings = generate_listings(NUM_LISTINGS_TO_GENERATE)
    print(f"Generated {NUM_LISTINGS_TO_GENERATE} new listings.")

Generated 100 new listings.


In [7]:
len(listings)

100

In [6]:
listings[0]

{'Neighborhood': 'Liberdade',
 'Price': '$1,229,000',
 'Bedrooms': '5',
 'Bathrooms': '2',
 'House Size': '3362 sqft',
 'Description': 'This stunning 5-bedroom, 2-bathroom home in the vibrant neighborhood of Liberdade is a rare find. The spacious house offers 3362 sqft of living space, perfect for a growing family or those who love to entertain. The open floor plan allows for seamless flow between the living room, dining area, and kitchen. The master bedroom features an en-suite bathroom and a walk-in closet, providing a luxurious retreat. The backyard is a private oasis with a well-maintained garden and a patio, ideal for outdoor gatherings. With modern amenities and stylish finishes throughout, this home is truly a gem in the heart of Liberdade.',
 'Neighborhood Description': 'Liberdade is a diverse and culturally rich neighborhood known for its vibrant street markets, colorful festivals, and excellent dining options. The area is home to a large Japanese community, reflected in its m

# Vector Store

## Creating the vector table schema and embeddings

In [8]:
import json
import os
import lancedb
import pyarrow as pa
from langchain_openai import OpenAIEmbeddings

# Connect to LanceDB
db = lancedb.connect("~/real_estate_db")

# Create embeddings object
embeddings = OpenAIEmbeddings(
    openai_api_base=OPENAI_API_BASE,
    model=EMBEDDING_MODEL
)

# Prepare data with embeddings
formatted_listings = []
for listing in tqdm(listings):
    full_text = f"""
    Neighborhood: {listing.get('Neighborhood', '')}
    Price: {listing.get('Price', '')}
    Bedrooms: {listing.get('Bedrooms', '')}
    Bathrooms: {listing.get('Bathrooms', '')}
    House Size: {listing.get('House Size', '')}
    
    {listing.get('Description', '')}
    
    {listing.get('Neighborhood Description', '')}
    """
    
    # Generate embedding
    vector = embeddings.embed_query(full_text)
    
    formatted_listing = {
        "neighborhood": listing.get('Neighborhood', ''),
        "price": listing.get('Price', ''),
        "bedrooms": listing.get('Bedrooms', ''),
        "bathrooms": listing.get('Bathrooms', ''),
        "house_size": listing.get('House Size', ''),
        "description": listing.get('Description', ''),
        "neighborhood_description": listing.get('Neighborhood Description', ''),
        "full_text": full_text,
        "vector": vector
    }
    
    formatted_listings.append(formatted_listing)

# Create proper PyArrow schema
table_schema = pa.schema([
    pa.field("neighborhood", pa.string()),
    pa.field("price", pa.string()),
    pa.field("bedrooms", pa.string()),
    pa.field("bathrooms", pa.string()),
    pa.field("house_size", pa.string()),
    pa.field("description", pa.string()),
    pa.field("neighborhood_description", pa.string()),
    pa.field("full_text", pa.string()),
    pa.field("vector", pa.list_(pa.float32(), 1536))
])

# Create table with proper schema
try:
    table = db.create_table("real_estate_listings", schema=table_schema, mode="overwrite")
    table.add(formatted_listings)
    print(f"Successfully added {len(formatted_listings)} listings with embeddings to LanceDB")
except Exception as e:
    print(f"Error: {e}")


Successfully added 100 listings with embeddings to LanceDB
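
The schema above stores `price`, `bedrooms`, `bathrooms`, and `house_size` as strings, so values such as `"$1,229,000"` cannot be compared numerically. The sketch below shows one way to extract integers from those strings before building the table; `parse_int` is a hypothetical helper, and switching the corresponding schema fields to an integer type would be a separate change not made in this notebook.

```python
import re
from typing import Optional


def parse_int(text: str) -> Optional[int]:
    """Extract the first integer from strings like '$1,229,000' or '3362 sqft'.

    Returns None when the text contains no digits.
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))
```

For example, `parse_int(listing.get("Price", ""))` could populate a numeric price column, which would make budget filters possible.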


## Testing vector search

In [9]:
table.head().to_pandas()

Unnamed: 0,neighborhood,price,bedrooms,bathrooms,house_size,description,neighborhood_description,full_text,vector
0,Liberdade,"$1,229,000",5,2,3362 sqft,"This stunning 5-bedroom, 2-bathroom home in th...",Liberdade is a diverse and culturally rich nei...,"\n Neighborhood: Liberdade\n Price: $1,2...","[-0.0005255764, 0.034925453, -0.02344918, -0.0..."
1,Moema,"$341,000",5,3,1469 sqft,"This stunning 5-bedroom, 3-bathroom home in th...",Moema is known for its upscale residential are...,"\n Neighborhood: Moema\n Price: $341,000...","[0.0040002028, 0.024501242, -0.027659297, -0.0..."
2,Bela Vista,"$1,484,000",2,4,1896 sqft,This stunning property in Bela Vista offers a ...,Bela Vista is a sought-after neighborhood know...,"\n Neighborhood: Bela Vista\n Price: $1,...","[0.013551887, 0.025079373, 0.004996533, 5.9686..."
3,Jardins,"$1,333,000",1,4,3123 sqft,This luxurious property in the prestigious nei...,"Jardins is known for its upscale vibe, with tr...","\n Neighborhood: Jardins\n Price: $1,333...","[0.0057548485, 0.037057538, -0.015223352, 0.01..."
4,Paraiso,"$928,000",3,2,1952 sqft,"This charming 3-bedroom, 2-bathroom home in Pa...",Paraiso is a highly sought-after neighborhood ...,"\n Neighborhood: Paraiso\n Price: $928,0...","[0.0061713457, 0.031144377, -0.0227503, 0.0028..."


In [13]:
print(db.table_names())

['real_estate_listings']


## Performing a traditional query based on filter for `"neighborhood = 'Brooklin'"`

In [10]:
brooklin_properties = table.search().where("neighborhood = 'Brooklin'").to_pandas()
brooklin_properties

Unnamed: 0,neighborhood,price,bedrooms,bathrooms,house_size,description,neighborhood_description,full_text,vector
0,Brooklin,"$1,480,000",3,4,3277 sqft,"This stunning 3-bedroom, 4-bathroom home in Br...",Brooklin is a charming and sought-after neighb...,"\n Neighborhood: Brooklin\n Price: $1,48...","[0.021954942, 0.020284735, -0.014713126, -0.00..."
1,Brooklin,"$826,000",2,3,3704 sqft,This stunning property in the desirable neighb...,Brooklin is a charming and family-friendly nei...,"\n Neighborhood: Brooklin\n Price: $826,...","[0.010267774, 0.022764903, -0.01431883, -0.003..."
2,Brooklin,"$1,252,000",1,3,1464 sqft,This stunning property in Brooklin features a ...,Brooklin is a highly sought-after neighborhood...,"\n Neighborhood: Brooklin\n Price: $1,25...","[0.010742205, 0.018312903, -0.018632611, -0.00..."
3,Brooklin,"$1,194,000",3,4,1381 sqft,"This stunning 3 bedroom, 4 bathroom home in Br...",Brooklin is a charming and family-friendly nei...,"\n Neighborhood: Brooklin\n Price: $1,19...","[0.015906336, 0.011426305, -0.019385846, -0.01..."
4,Brooklin,"$739,000",3,2,3284 sqft,This stunning property in Brooklin features a ...,Brooklin is a charming neighborhood known for ...,"\n Neighborhood: Brooklin\n Price: $739,...","[0.013718702, 0.018449288, -0.011263911, -0.00..."
5,Brooklin,"$962,000",4,1,2953 sqft,"This stunning 4-bedroom, 1-bathroom home in th...",Brooklin is a charming and family-friendly nei...,"\n Neighborhood: Brooklin\n Price: $962,...","[0.013344246, 0.020806236, -0.01607988, -0.010..."
6,Brooklin,"$592,000",1,3,1143 sqft,"This charming 1-bedroom, 3-bathroom home in th...",Brooklin is a picturesque neighborhood known f...,"\n Neighborhood: Brooklin\n Price: $592,...","[0.016753623, 0.01999085, -0.00809307, -0.0144..."
7,Brooklin,"$634,000",1,2,921 sqft,"This charming 1-bedroom, 2-bathroom home in Br...",Brooklin is a highly sought-after neighborhood...,"\n Neighborhood: Brooklin\n Price: $634,...","[0.013864158, 0.01864176, -0.013486637, 0.0045..."
8,Brooklin,"$1,016,000",3,1,1075 sqft,"This charming 3-bedroom, 1-bathroom home in Br...",Brooklin is a highly sought-after neighborhood...,"\n Neighborhood: Brooklin\n Price: $1,01...","[0.011117681, 0.012637502, -0.012682202, -0.00..."
9,Brooklin,"$230,000",3,1,2944 sqft,"This charming 3-bedroom, 1-bathroom home in th...",Brooklin is a peaceful and family-friendly nei...,"\n Neighborhood: Brooklin\n Price: $230,...","[0.017872278, 0.01602076, -0.012889897, 0.0045..."
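
The filter string passed to `where` is written by hand above. When the value comes from user input, it helps to build the clause programmatically and escape quotes. The following is a minimal sketch, assuming the filter syntax follows standard SQL string quoting (single quotes doubled) and that all filtered columns are strings, as in this table's schema; `build_where_clause` is a hypothetical helper.

```python
def build_where_clause(filters: dict) -> str:
    """Build an equality filter such as "neighborhood = 'Brooklin'".

    Column names must be plain identifiers; values are quoted as SQL strings.
    """
    parts = []
    for column, value in filters.items():
        if not column.isidentifier():
            raise ValueError(f"Invalid column name: {column!r}")
        escaped = str(value).replace("'", "''")  # standard SQL quote escaping
        parts.append(f"{column} = '{escaped}'")
    return " AND ".join(parts)
```

The result can be passed to the same call used above, for example `table.search().where(build_where_clause({"neighborhood": "Brooklin"}))`.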


In [11]:
# Connect to your database
db = lancedb.connect("~/real_estate_db")
table = db.open_table("real_estate_listings")

# Perform a vector search with a natural language query
query = "spacious family home with modern design"
query_vector = embeddings.embed_query(query)

In [12]:
results = table.search(
    query_vector,  # Pass the vector directly instead of text
    vector_column_name='vector'
).limit(3).to_pandas()

In [13]:
# Display results (`_distance` is a distance, so lower values mean closer matches)
for i, result in results.iterrows():
    print(f"\nMatch #{i+1} - Distance: {result['_distance']:.4f}")
    print(f"Neighborhood: {result['neighborhood']}")
    print(f"Price: {result['price']}")
    print(f"Description: {result['description'][:100]}...")


Match #1 - Distance: 0.3722
Neighborhood: Jardins
Price: $848,000
Description: This stunning 5 bedroom, 4 bathroom home in the desirable Jardins neighborhood is a rare find. The h...

Match #2 - Distance: 0.3741
Neighborhood: Jardins
Price: $529,000
Description: This stunning 4-bedroom, 2-bathroom home in the prestigious Jardins neighborhood is a rare find. The...

Match #3 - Distance: 0.3753
Neighborhood: Moema
Price: $1,210,000
Description: This stunning property in Moema features 3 spacious bedrooms, each with its own en-suite bathroom, p...
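
The `_distance` value printed for each match is a distance, so smaller numbers indicate closer matches. The sketch below illustrates how a distance could be turned into a similarity, under two assumptions that are not verified in this notebook: that the reported value is a squared L2 distance, and that the embedding vectors have unit length. Under those assumptions, cosine similarity equals `1 - d / 2`. All function names here are illustrative.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(distance: float) -> float:
    """For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - distance / 2.0
```

If the table was created with a different metric, or the vectors are not normalized, this conversion does not apply and the raw distance should be reported as is.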


# Creating Real Estate Chat

In [14]:
from langchain_core.chat_history import BaseChatMessageHistory, InMemoryChatMessageHistory
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import ChatPromptTemplate, MessagesPlaceholder, HumanMessagePromptTemplate
from langchain_core.runnables.history import RunnableWithMessageHistory


# In-memory store for chat histories, keyed by (user_id, conversation_id).
# It lives at module level so that histories persist across calls.
store = {}

# Function to manage session-based message history
def get_session_history(user_id: str, conversation_id: str) -> BaseChatMessageHistory:
    if (user_id, conversation_id) not in store:
        store[(user_id, conversation_id)] = InMemoryChatMessageHistory()
    return store[(user_id, conversation_id)]

# Prompt templates
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),  # Placeholder for chat history
    ("human", "{question}"),  # User's input question or request
])

# Wrap the prompt with RunnableWithMessageHistory to manage conversation state
conversation_chain = RunnableWithMessageHistory(
    runnable=prompt_template | llm,
    get_session_history=lambda: get_session_history("user_123", "conversation_1"),
    input_messages_key="question",  
    history_messages_key="history",  
)

# Step 1: Ask the user for their preferences interactively
def ask_preferences_interactively():
    print("Please describe your ideal property in detail as a paragraph.")
    print("Focus on the following aspects:")
    print("- Property size and layout (e.g., number of bedrooms, bathrooms, house size)")
    print("- Important features and amenities (e.g., backyard, garage, modern heating system)")
    print("- Neighborhood characteristics (e.g., quiet, urban, family-friendly)")
    print("- Location requirements (e.g., proximity to transportation or schools)")
    print("- Price range (if applicable)")
    print("- Any extra information or specific requirements.")
    
    user_input = input("\nEnter your preferences as a paragraph:\n> ")
    
    return user_input.strip()

# Step 2: Process the user's response into a structured format using LangChain's LLM
def process_preferences(response):
    question = f"""
    Based on the following description provided by the customer:
    
    "{response}"
    
    Extract the preferences into the following structured format:
    
    Neighborhood: [neighborhood name]
    Price: [price]
    Bedrooms: [number]
    Bathrooms: [number]
    House Size: [size] sqft
    Extra Information: [detailed property description]
    
    Provide only the structured output.
    """
    
    structured_output = conversation_chain.invoke({"question": question})
    
    return structured_output.content

# Step 3: Confirm preferences with the user interactively
def confirm_preferences(structured_preferences):
    print("\nHere are the details you provided:\n")
    print(structured_preferences)
    
    while True:
        user_confirmation = input("\nIs this what you're searching for? Please answer 'yes' or 'no':\n> ").strip().lower()
        
        if user_confirmation == "yes":
            return True
        elif user_confirmation == "no":
            return False
        else:
            print("Invalid response. Please answer 'yes' or 'no'.")
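
The confirmation loop accepts only the exact words "yes" and "no". A small normalizing helper could also accept common short forms. This is a sketch only; `parse_yes_no` and the accepted word sets are assumptions, not part of the original notebook.

```python
from typing import Optional

YES_ANSWERS = {"yes", "y"}
NO_ANSWERS = {"no", "n"}


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map a free-text answer to True, False, or None when unrecognized."""
    normalized = answer.strip().lower()
    if normalized in YES_ANSWERS:
        return True
    if normalized in NO_ANSWERS:
        return False
    return None
```

Inside `confirm_preferences`, the loop could call `parse_yes_no(input(...))` and re-prompt whenever the result is `None`.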

In [16]:
def augment_listings(preferences, listings):
    """
    Augment the descriptions of the retrieved listings using LLM.
    Tailor each description to resonate with the buyer’s preferences.
    """
    augmented_listings = []
    
    for i, listing in listings.iterrows():
        # Prepare prompt for LLM
        prompt = f"""
        A buyer is looking for a property with the following preferences:
        
        {preferences}
        
        You have a property listing with these details:
        - Neighborhood: {listing['neighborhood']}
        - Price: {listing['price']}
        - Bedrooms: {listing['bedrooms']}
        - Bathrooms: {listing['bathrooms']}
        - House Size: {listing['house_size']}
        - Description: {listing['description']}
        
        Your task is to enhance the description of this property to emphasize aspects that align with the buyer's preferences. 
        Ensure factual integrity and do not invent any details. Tailor the description to make it more appealing to this specific buyer.
        
        Provide only the augmented description.
        """
        
        # Use LLM to generate augmented description (invoke() replaces the deprecated predict())
        response = llm.invoke(prompt).content
        
        # Add augmented description to listing
        augmented_listing = {
            "neighborhood": listing["neighborhood"],
            "price": listing["price"],
            "bedrooms": listing["bedrooms"],
            "bathrooms": listing["bathrooms"],
            "house_size": listing["house_size"],
            "original_description": listing["description"],
            "augmented_description": response.strip()
        }
        
        augmented_listings.append(augmented_listing)
    
    return augmented_listings

In [17]:
import lancedb
from langchain_openai import OpenAIEmbeddings


def search_listings(preferences):
    # Generate an embedding for the user's preferences
    preference_vector = embeddings.embed_query(preferences)
    
    # Connect to the "real_estate_listings" table
    table = db.open_table("real_estate_listings")
    
    # Perform vector search for the top matches and load them as a Pandas DataFrame
    results_df = table.search(preference_vector).limit(NUM_LISTINGS_TO_RETURN).to_pandas()
    
    # Augment listings using LLM
    augmented_listings = augment_listings(preferences, results_df)
    
    # Display results
    print(f"\nHere are the top {len(augmented_listings)} matching properties based on your preferences:\n")
    for i, listing in enumerate(augmented_listings):
        print(f"Match {i + 1}:")
        print(f"Neighborhood: {listing['neighborhood']}")
        print(f"Price: {listing['price']}")
        print(f"Bedrooms: {listing['bedrooms']}")
        print(f"Bathrooms: {listing['bathrooms']}")
        # house_size already includes the "sqft" unit
        print(f"House Size: {listing['house_size']}")
        print("\nOriginal Description:")
        print(listing["original_description"])
        print("\nAugmented Description:")
        print(listing["augmented_description"])
        print("-" * 50)


In [18]:
# Step 4: Refine preferences interactively if necessary
def refine_preferences(initial_response):
    while True:
        structured_preferences = process_preferences(initial_response)
        confirmed = confirm_preferences(structured_preferences)
        
        if confirmed:
            print("\nPreferences confirmed. Proceeding with these details.")
            
            # Perform vector search after confirmation
            search_listings(structured_preferences)
            break
        else:
            print("\nPlease rephrase your requirements or add more details about your ideal property.")
            initial_response = input("\nEnter your updated preferences as a paragraph:\n> ").strip()

# Main function to execute the workflow interactively
def main_chat():
    print("Welcome! Let's find your ideal property.")
    
    # Step 1: Collect initial preferences from the user interactively
    initial_response = ask_preferences_interactively()
    
    # Step 2-4: Process and refine preferences iteratively until confirmed
    refine_preferences(initial_response)

In [19]:
main_chat()

Welcome! Let's find your ideal property.
Please describe your ideal property in detail as a paragraph.
Focus on the following aspects:
- Property size and layout (e.g., number of bedrooms, bathrooms, house size)
- Important features and amenities (e.g., backyard, garage, modern heating system)
- Neighborhood characteristics (e.g., quiet, urban, family-friendly)
- Location requirements (e.g., proximity to transportation or schools)
- Price range (if applicable)
- Any extra information or specific requirements.



Enter your preferences as a paragraph:
>  Apartment in Moema, 4 bedroons, 3 bathrooms, 2000 sqrt, close to school, under 500.000. accepts pets



Here are the details you provided:

Neighborhood: Moema
Price: Under 500,000
Bedrooms: 4
Bathrooms: 3
House Size: 2000 sqft
Extra Information: Close to school, accepts pets



Is this what you're searching for? Please answer 'yes' or 'no':
>  yes



Preferences confirmed. Proceeding with these details.





Here are the top 5 matching properties based on your preferences:

Match 1:
Neighborhood: Moema
Price: $516,000
Bedrooms: 4
Bathrooms: 3
House Size: 2863 sqft sqft

Original Description:
This stunning 4-bedroom, 3-bathroom home in Moema is the epitome of luxury living. With a spacious house size of 2863 sqft, this property offers ample room for comfortable living and entertaining. The interior features high-end finishes, modern appliances, and large windows that fill the space with natural light. The master bedroom boasts a walk-in closet and an en-suite bathroom with a luxurious soaking tub. The backyard is perfect for outdoor gatherings, with a patio area and lush landscaping creating a serene oasis.

Augmented Description:
This luxurious 4-bedroom, 3-bathroom home in Moema is a perfect fit for those seeking a spacious property close to schools. With a generous house size of 2863 sqft, there is plenty of room for comfortable living and entertaining. The interior is adorned with high