# Personalized Real-Estate Agent

In this notebook, we will build a personalized real-estate agent. 

#### Description
1. After configuration, we first generate and save some fictitious real-estate listings to choose from. (You can skip this step if you want to.)
2. Next, we store them in a Vector Database (`chromadb`)
3. We then generate user preferences. You can use the hardcoded question-answer pairs or give your own answers by setting `COLLECT_USER_PREFERENCES = True` in the configuration section
4. Next, we recommend a suitable listing via a similarity search on the vector database
5. We then pass the recommended listing to an LLM in order to re-write it based on the given user preferences (from step 3)
6. Finally, we compare the old and new listings to make sure no factual information has been altered and the human preferences have indeed been taken into account. 

#### Preliminaries
To run this notebook, make sure that
- the packages from `requirements.txt` are installed
- a `.env` file exists that defines at least the `OPENAI_API_KEY` and `OPENAI_API_BASE` variables. Valid OpenAI credentials are required to query the OpenAI models in the code below.

#### Some hints
- If you just want to see the personalization part, you can skip the first two cells in the "Generating Real Estate Listings" section.
- Please also find some ideas for improvements at the very end of the notebook.


## Import statements and configuration

In [None]:
import os
from dotenv import load_dotenv # To keep private keys private
load_dotenv()
# To store the generated listings in a csv file
import csv
import io
import datetime

openai_api_key = os.getenv("OPENAI_API_KEY")
openai_api_base = os.getenv("OPENAI_API_BASE")

if openai_api_key is not None:
    print("Using API key from the `.env` file.")
else:
    print("OPENAI_API_KEY not found in environment variable. - Please set it up in the `.env` file.")


if openai_api_base is not None:
    print("Using API base URL from the `.env` file. - You're all set.")
else:
    print("OPENAI_API_BASE not found in environment variable. - Please set it up in the `.env` file.")

from openai import OpenAI
client = OpenAI(
    base_url = openai_api_base,
    api_key = openai_api_key
)

MODEL_NAME = "gpt-3.5-turbo"
version = "v2"      # To make file names etc. unique
LISTINGS_FILE = f"listings_{version}.csv"
COLLECTION_NAME = "listings" # Name of the collection in the vector database
PERSIST_DIRECTORY = "chroma_db" # Directory where the vector database is stored

# LangChain components we are going to use
# from langchain.llms import OpenAI
from langchain_community.chat_models import ChatOpenAI      # Replaces the deprecated langchain.llms.OpenAI (this import path is also deprecated)
from langchain.document_loaders.csv_loader import CSVLoader # To load the CSV file
from langchain.vectorstores import Chroma                   # For vector database
import chromadb                                             # For metadata-based retrieval
import tiktoken                                             # For token counting, required by Chroma
from langchain.embeddings.openai import OpenAIEmbeddings    # (Deprecated)
EMBEDDINGS_MODEL_NAME = "text-embedding-ada-002"            # OpenAI's embedding model
from langchain.text_splitter import CharacterTextSplitter   # To make embeddings more efficient
from langchain.chains import RetrievalQA                    # To perform Retrieval-Augmented Generation (RAG) (deprecated)

import pandas as pd
import numpy as np

COLLECT_USER_PREFERENCES = False        # Set to True to collect user preferences interactively

COMPARISONS_FILE = "comparisons.txt"    # File to store the old and new listings for future reference.


Using API key from the `.env` file.
Using API base URL from the `.env` file. - You're all set.
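
The checks above only print a message and then continue. As a minimal sketch, assuming you would rather detect missing credentials up front, a hypothetical helper `missing_env_vars` (not part of this notebook) could collect the names of all unset or empty variables in one go:

```python
import os

# Variable names taken from the notebook's `.env` requirements
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_API_BASE")

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing in `.env`: " + ", ".join(missing))
else:
    print("All required variables are set.")
```

The optional `environ` argument exists only so the check can be tried with a plain dictionary instead of the real environment.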


## Generating Real Estate Listings

First, we generate some fictitious listings using an LLM. The prompt includes a single example listing that defines the expected format:

In [None]:
listing_elements = ["Unique ID", "Neighborhood", "Price", "Bedrooms", "Bathrooms", "House Size (in sqft)", "Description", "Neighborhood Description"]

listing_elements_text = ",".join(listing_elements)

example_listing = """
1,"Green Oaks",800000,3,2,2000,"Welcome to this eco-friendly oasis nestled in the heart of Green Oaks. This charming 3-bedroom, 2-bathroom home boasts energy-efficient features such as solar panels and a well-insulated structure. Natural light floods the living spaces, highlighting the beautiful hardwood floors and eco-conscious finishes. The open-concept kitchen and dining area lead to a spacious backyard with a vegetable garden, perfect for the eco-conscious family. Embrace sustainable living without compromising on style in this Green Oaks gem.","Green Oaks is a close-knit, environmentally-conscious community with access to organic grocery stores, community gardens, and bike paths. Take a stroll through the nearby Green Oaks Park or grab a cup of coffee at the cozy Green Bean Cafe. With easy access to public transportation and bike lanes, commuting is a breeze."
"""

num_listings = 20
system_prompt = f"""
You are a real-estate listing generator. Your task is to create realistic and diverse real estate listings for the area of Munich (Germany) and surroundings, based on the provided example. Each listing should include the following fields: Unique ID (an integer counter), Neighborhood, Price, Bedrooms, Bathrooms, House Size (in sqft), Description, and Neighborhood Description. The listings should be varied in terms of price, size, and neighborhood features.
"""

user_prompt = f"""
Please generate {num_listings} real estate listings in the same format as the example below. The listings should be diverse and include various neighborhoods, prices, and features. Each listing should have a unique neighborhood description that highlights local amenities and attractions. The format should be csv-compatible, i.e., numbers should not contain commas, and text should be enclosed in double quotes if it contains commas. The listings should be realistic and reflect current market trends. Dollar values should not contain the $ sign. The listings should be in the following format: {listing_elements_text}
Example Listing:
{example_listing}
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

try:
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages
    )

    # To access the actual response text:
    raw_listings = response.choices[0].message.content
    print(raw_listings)
except Exception as e:
    print(f"An error occurred: {e}")





1,"Maxvorstadt",950000,2,2,1600,"Beautiful 2-bedroom, 2-bathroom apartment in the vibrant neighborhood of Maxvorstadt. This modern unit features high ceilings, large windows, and a sleek kitchen with top-of-the-line appliances. The spacious living room is perfect for entertaining guests or relaxing after a day exploring the local art galleries and museums. Enjoy urban living at its finest in this stylish Maxvorstadt residence.","Maxvorstadt is known for its artistic flair, with numerous museums and galleries such as the iconic Alte Pinakothek and Museum Brandhorst. The neighborhood is dotted with trendy cafes, boutique shops, and cozy beer gardens, offering a unique blend of culture and entertainment."

2,"Schwabing",1200000,4,3,2400,"Luxurious 4-bedroom, 3-bathroom townhouse in the upscale neighborhood of Schwabing. This elegant home features a grand entrance foyer, a gourmet kitchen with granite countertops, and a private patio for al fresco dining. The master suite boasts a spa-like

Next, we store the listings as a `.csv` file in order to retrieve them later.

In [3]:
# Use io.StringIO to treat the string like a file (header line first, then the generated listings)
string_io = io.StringIO(listing_elements_text + "\n" + raw_listings)

# Open the output file for writing
with open(LISTINGS_FILE, 'w', newline='', encoding='utf-8') as outfile:
    # Create a CSV reader to read the string data
    reader = csv.reader(string_io)

    # Create a CSV writer to write to the file
    writer = csv.writer(outfile)

    # Read each row from the string data and write it to the file, skipping blank lines
    for row in reader:
        if row:
            writer.writerow(row)
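
Because the listings come from an LLM, individual rows may be malformed or cut off. The following is a minimal sketch, not part of the original pipeline, of how data rows could be checked before saving. The helper name `split_valid_rows` and the sample text are made up for illustration. The expected field count of eight matches `listing_elements`, and the header line is assumed to be handled separately, since it would fail the numeric ID check.

```python
import csv
import io

EXPECTED_FIELDS = 8  # the notebook's listing_elements has eight entries

def split_valid_rows(raw_text, expected_fields=EXPECTED_FIELDS):
    """Parse CSV text and return (valid_rows, rejected_rows); blank lines are skipped."""
    valid, rejected = [], []
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:
            continue  # csv.reader yields an empty list for a blank line
        if len(row) == expected_fields and row[0].strip().isdigit():
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected

# Made-up sample: one complete row, one blank line, one truncated row
sample = (
    '1,"Maxvorstadt",950000,2,2,1600,"Bright, modern flat.","Museums, cafes."\n'
    '\n'
    '2,"Schwabing",1200000,4,3\n'
)
valid, rejected = split_valid_rows(sample)
print(len(valid), "valid,", len(rejected), "rejected")
```

Rejected rows could be logged or regenerated instead of being written to the listings file.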

If the listings have already been generated, we can simply read them from the CSV file (code adapted from the course exercises on LangChain).

In [4]:
# If using a pandas DataFrame, this would do it.
# df = pd.read_csv(LISTINGS_FILE)

# We are going to use LangChain, so we do this:
loader = CSVLoader(file_path=LISTINGS_FILE, encoding="utf-8", csv_args={"delimiter": ","})
docs = loader.load()
print(docs)


[Document(metadata={'source': 'listings_v3.csv', 'row': 0}, page_content='Unique ID: 1\nNeighborhood: Maxvorstadt\nPrice: 950000\nBedrooms: 2\nBathrooms: 2\nHouse Size (in sqft): 1600\nDescription: Beautiful 2-bedroom, 2-bathroom apartment in the vibrant neighborhood of Maxvorstadt. This modern unit features high ceilings, large windows, and a sleek kitchen with top-of-the-line appliances. The spacious living room is perfect for entertaining guests or relaxing after a day exploring the local art galleries and museums. Enjoy urban living at its finest in this stylish Maxvorstadt residence.\nNeighborhood Description: Maxvorstadt is known for its artistic flair, with numerous museums and galleries such as the iconic Alte Pinakothek and Museum Brandhorst. The neighborhood is dotted with trendy cafes, boutique shops, and cozy beer gardens, offering a unique blend of culture and entertainment.'), Document(metadata={'source': 'listings_v3.csv', 'row': 1}, page_content="Unique ID: 2\nNeighborh

## Storing Listings in a Vector Database
We have now loaded the generated listings and want to store them in a vector database.

In [3]:
embeddings = OpenAIEmbeddings(
    openai_api_key=openai_api_key,
    openai_api_base=openai_api_base,
    model=EMBEDDINGS_MODEL_NAME,
    #chunk_size=1,  # This is important for Chroma
    max_retries=3, # Number of retries for embedding requests
    request_timeout=60, # Timeout for embedding requests
)
# Splitting the data to make embeddings more efficient
splitter = CharacterTextSplitter(
                chunk_size=1000,
                chunk_overlap=0
            )
split_docs = splitter.split_documents(docs)
db = Chroma.from_documents(split_docs, embeddings, 
                           collection_name=COLLECTION_NAME, 
                           persist_directory=PERSIST_DIRECTORY
                           )
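
To illustrate what a similarity search over stored embeddings does conceptually, here is a toy sketch in plain Python. It assumes cosine similarity as the metric, which is one common choice; the metric Chroma actually uses depends on how the collection is configured. The three-dimensional vectors and the names `cosine_similarity`, `top_k`, `toy_docs`, and `toy_query` are made up for illustration, whereas real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k document keys whose vectors are most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (made up for illustration)
toy_docs = {
    "quiet family townhouse": [0.9, 0.1, 0.2],
    "downtown loft": [0.1, 0.9, 0.3],
    "suburban house with garden": [0.8, 0.2, 0.1],
}
toy_query = [1.0, 0.0, 0.1]
print(top_k(toy_query, toy_docs))
```

In the notebook itself, the embedding model produces the vectors and the vector database performs the ranking.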




## Building the User Preference Interface
We now collect user preferences. We can either use hard-coded question-answer pairs or collect the answers interactively at runtime.

In [36]:
questions = [   
    "How big do you want your house to be?",
    "What are 3 most important things for you in choosing this property?", 
    "Which amenities would you like?", 
    "Which transportation options are important to you?",
    "How urban do you want your neighborhood to be?",   
    "Do you have an upper price target?",   
]

answers = []
if COLLECT_USER_PREFERENCES:
    # Collect user preferences interactively
    for question in questions:
        answer = input(question + " ")
        answers.append(answer)
else:
    answers = [
        "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
        "A quiet neighborhood, good local schools, and convenient shopping options.",
        "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
        "Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.",
        "A balance between suburban tranquility and access to urban amenities like restaurants and theaters.",
        "It should be under $1,000,000."
    ]

history = "" # "You are an AI real estate assistant. Following is a conversation between you and a human. You are helping the human find a property that meets their needs.\n\n"
for i in range(len(questions)):
    history = history + "AI: " + questions[i] + "\n" + "Human: " + answers[i] + "\n"


## Searching based on preferences
Now that we have the user preferences, we need to find the listings that most closely match them. We do this using `RetrievalQA`, just as in the course.

In [37]:
temperature = 0.0
llm = ChatOpenAI(model_name=MODEL_NAME, temperature=temperature, max_tokens = 500)
rag = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())

query = f"""
Based on the following conversation history, please find the best-matching real estate listing from the database. The conversation history is as follows:
{history}

SELECTION INSTRUCTIONS THAT MUST BE STRICTLY FOLLOWED:
AI will provide a highly personalized recommendation based only on the conversation history and the existing listings in the database, as included in the context.
AI should be very sensitive to human personal preferences captured in the answers to personal questions, and should not be influenced by anything else.
AI will also build a persona for human based on human answers to questions, and use this persona to recommend a listing.
OUTPUT FORMAT:
First, include the persona you came up with in the explanation for the listing choice. Describe the persona in a few sentences.
Explain how human preferences captured in the answers to personal questions influenced creation of this persona.
Next, add some data of the selected listing.
YOUR RECOMMENDATION MUST END WITH TEXT: "I recommend listing number " FOLLOWED BY THE Unique ID of the selected listing in the database and no further punctuation marks.
FOLLOW THE INSTRUCTIONS STRICTLY, OTHERWISE HUMAN WILL NOT BE ABLE TO UNDERSTAND YOUR REVIEW.
"""
response = rag.run(query)
print(response)

The persona based on the human's answers to personal questions is someone who values a comfortable and spacious home in a quiet neighborhood with good schools and convenient shopping options. They prioritize amenities like a backyard for gardening, a two-car garage, and a modern, energy-efficient heating system. Transportation options such as easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads are important. They seek a balance between suburban tranquility and urban amenities like restaurants and theaters, all within a budget under $1,000,000.

Based on these preferences, the best-matching real estate listing from the database is:
Neighborhood: Neuhausen
Price: $550,000
Bedrooms: 3
Bathrooms: 2
House Size: 1500 sqft
Description: Charming 3-bedroom, 2-bathroom townhouse in the cozy Neuhausen neighborhood. This well-maintained property features a sunlit living room, updated kitchen, and a private backyard oasis. The bedrooms offer ample space and com

## Personalize Listing Descriptions
Lastly, we tailor the listing description to the buyer preferences as revealed via the conversation history, using our LLM.

In [42]:
recommended_entry_id=response[-8:] 
recommended_entry_id=recommended_entry_id[recommended_entry_id.find(" ") + 1:]
successful = False
try:
    recommended_entry_id = int(recommended_entry_id)
    # print(f"Recommended entry ID: {recommended_entry_id}")
    successful = True
except ValueError:
    print(f"Could not parse recommended entry ID from LLM response. Expected an integer, but got: '{recommended_entry_id}'")

if not successful:   
    try:
        # Sometimes, the LLM generates a trailing punctuation mark even if it is instructed not to do this.
        new_entry_id = recommended_entry_id[:-1].strip() 
        recommended_entry_id = int(new_entry_id)
        successful = True
        print(f"Was successful on second try. Found ID: '{recommended_entry_id}'")
    except ValueError:
        print(f"Could not parse recommended entry ID from LLM response on the second try. Expected an integer, but got: '{recommended_entry_id}'")

if not successful:
    # We failed twice, so something else seems to be wrong.
    print("Tried twice to parse recommended entry ID from LLM response, but failed. Please check manually.")
    # Convert to int so that the comparison with the 'Unique ID' column below can match
    recommended_entry_id = int(input("Please enter the Unique ID of the recommended listing, as given by the LLM (integers only): "))


try:
    # This should work, but doesn't:
    """
    chroma_client = chromadb.PersistentClient(
        path=PERSIST_DIRECTORY,
        )
    collection = chroma_client.get_collection(name=COLLECTION_NAME)
    recommended_listing = collection.get(where={"Unique ID": recommended_entry_id}, include=['metadatas', 'documents'])
    
    if recommended_listing and recommended_listing['ids']:
        print(f"Recommended listing found: {recommended_listing}")
    else:
        print(f"Entry with 'Unique ID' = {recommended_entry_id} not found in the database.")
    """

    # So, instead, we use a pandas DataFrame to get the recommended listing
    df = pd.read_csv(LISTINGS_FILE, encoding="utf-8")
    recommended_listing = df[df['Unique ID'] == recommended_entry_id]

except Exception as e:
    print(f"An error occurred while querying the database: {e}")

# Assuming recommended_listing is a DataFrame, not a result from the collection query, see comment in try block above.
if not recommended_listing.empty:
    print(recommended_listing.to_string(index=False)) 
else:
    print(f"Entry with 'Unique ID' = {recommended_entry_id} not found in the database.")


Could not parse recommended entry ID from LLM response. Expected an integer, but got: '4.'
Was successful on second try. Found ID: '4'
 Unique ID Neighborhood  Price  Bedrooms  Bathrooms  House Size (in sqft)                                                                                                                                                                                                                                                                                                                                       Description                                                                                                                                                                                                                                                                                                                                                                                                Neighborhood Description
         4    Neuhausen 550000         3     
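
As the output above shows, slicing the last eight characters is sensitive to trailing punctuation. A possible alternative, sketched here under the assumption that the LLM still ends with the sentence requested in the prompt, is to search for that sentence with a regular expression. The helper name `parse_recommended_id` and the example string are hypothetical.

```python
import re

# Pattern for the closing sentence requested in the prompt; tolerant of case,
# an optional '#', and any trailing punctuation after the number
ID_PATTERN = re.compile(r"I recommend listing number\s*#?\s*(\d+)", re.IGNORECASE)

def parse_recommended_id(text):
    """Return the last recommended listing ID found in text as an int, or None."""
    matches = ID_PATTERN.findall(text)
    return int(matches[-1]) if matches else None

print(parse_recommended_id("... a private backyard. I recommend listing number 4."))
```

A return value of `None` signals that the sentence was not found, so the manual fallback via `input` could still be used in that case.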

In [43]:
final_prompt = f"""
Based on the following conversation history, please rewrite the recommended listing to more closely match the human's preferences as revealed in the conversation history. 
Conversation history: 
{history}

Recommended listing:
{recommended_listing.to_string(index=False)}

OUTPUT INSTRUCTIONS THAT MUST BE STRICTLY FOLLOWED:
- AI will build a persona for a human based on the human answers in the conversation history, and then use this persona to rewrite the recommended listing. 
- AI will rewrite the recommended listing to make it more appealing to the human, based on the human's preferences as revealed in the conversation history. 
- AI will also report how the human's preferences influenced the rewriting of the listing.
- AI will not change any of the factual information of the listing (like price, number of bedrooms, Unique ID, etc.), or invent any new information. 
- AI can, however, freely rewrite or restructure the given listing description or neighborhood description from the recommended listing. 
- The AI should respond with the whole listing, including all the fields as the original listing, like Unique ID, Neighborhood, Description, etc. The output of the listing should be in CSV format. None of the factual information (like price, number of bedrooms, Unique ID, etc.) should be changed. All field values that contain strings (especially 'Neighborhood', 'Description', and 'Neighborhood Description') should be enclosed in double quotes. Numbers and prices should be formatted as numbers, so no commas or dollar signs should be included. 
- Neither the field headings (first line) nor the listing data shall contain any leading whitespaces (but should contain double quotes for string fields).
- YOUR ANSWER MUST END WITH the listing in CSV format, surrounded by '```csv' in the line before the CSV data, and '```' in the line after the CSV data (i.e., on the last line of the output).
FOLLOW THE INSTRUCTIONS STRICTLY.
"""

system_prompt = f"""
You are a real-estate agent. Your task is to rewrite a real estate listing based on a given generic listing that caters to human preferences revealed via a conversation history. You should not make up any new information or alter any factual information as given in the original listing. 
"""


messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": final_prompt}
]

try:
    new_response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages
    )

    # To access the actual response text:
    new_listing = new_response.choices[0].message.content
    print(new_listing)
except Exception as e:
    print(f"An error occurred: {e}")



The human's preferences influenced the rewriting of the recommended listing to focus more on the features that align with their needs and desires. The listing now highlights the property's comfortable three bedrooms, spacious kitchen, cozy living room, quiet neighborhood, good local schools, convenient shopping options, backyard for gardening, two-car garage, energy-efficient heating system, easy access to a reliable bus line, proximity to a major highway, bike-friendly roads, and a balance between suburban tranquility and urban amenities.

```csv
Unique ID,Neighborhood,Price,Bedrooms,Bathrooms,House Size (in sqft),Description,Neighborhood Description
4,"Neuhausen",550000,3,2,1500,"Charming 3-bedroom, 2-bathroom townhouse in the cozy Neuhausen neighborhood. This well-maintained property features a sunlit living room, updated kitchen, and a private backyard oasis. The bedrooms offer ample space and comfort, making it an ideal home for a growing family or those looking for a peaceful ret

We now have our rewritten listing. Next, we
- extract the new listing from the response string,
- compare all facts to ensure they have not changed, and
- print the old and new descriptions to see whether the LLM has taken the user preferences into account.

In [44]:
try:
    csv_data = new_listing.split('```csv')[1].strip().split('```')[0].strip()
    print(repr(csv_data))
    df_new = pd.read_csv(io.StringIO(csv_data), encoding="utf-8")
except Exception as e:
    print(f"An error occurred while parsing the CSV data from the LLM response: {e}")

comparison1 = "Field name               - Old Value       - New value\n"
for column_name, new_value in df_new.iloc[0].items():
    try: 
        # Look up the original value of the same column for a side-by-side comparison
        old_value = recommended_listing.iloc[0][column_name]
        comparison1 = comparison1 + f"{column_name:<24} - {old_value:<15} - {new_value:<10}\n"
    except Exception as e:
        comparison1 = comparison1 + f"An error occurred while accessing the original listing: {e}\n"
print(comparison1)

'Unique ID,Neighborhood,Price,Bedrooms,Bathrooms,House Size (in sqft),Description,Neighborhood Description\n4,"Neuhausen",550000,3,2,1500,"Charming 3-bedroom, 2-bathroom townhouse in the cozy Neuhausen neighborhood. This well-maintained property features a sunlit living room, updated kitchen, and a private backyard oasis. The bedrooms offer ample space and comfort, making it an ideal home for a growing family or those looking for a peaceful retreat within the city limits. Neuhausen is a picturesque district known for its relaxed vibe, leafy streets, and array of local bakeries and cafes. Residents can enjoy leisurely walks in the Nymphenburg Palace gardens or discover hidden gems in the vibrant quarter of Rotkreuzplatz. With a strong sense of community and proximity to schools and parks, Neuhausen provides a welcoming environment for families and nature enthusiasts alike."'
Field name               - Old Value       - New value
Unique ID                - 4               - 4         
Ne
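
The table above is meant for visual inspection. As a complementary sketch, the factual fields could also be compared programmatically. The helper names `first_record` and `changed_facts`, the `FACT_FIELDS` list, and the sample rows are assumptions made for illustration. Values are compared as stripped strings so that stray leading whitespace in the LLM output does not count as a change.

```python
import csv
import io

# Columns treated as facts; the two free-text description columns are excluded on purpose
FACT_FIELDS = ["Unique ID", "Neighborhood", "Price", "Bedrooms", "Bathrooms", "House Size (in sqft)"]

def first_record(csv_text):
    """Parse CSV text with a header line and return the first data row as a dict."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return next(reader)

def changed_facts(old, new, fields=FACT_FIELDS):
    """Return {field: (old_value, new_value)} for every fact field whose value differs."""
    diffs = {}
    for field in fields:
        old_value = str(old.get(field, "")).strip()
        new_value = str(new.get(field, "")).strip()
        if old_value != new_value:
            diffs[field] = (old_value, new_value)
    return diffs

# Made-up rows: only the free-text fields differ, so no fact should be reported as changed
header = ",".join(FACT_FIELDS) + ",Description,Neighborhood Description\n"
old = first_record(header + '4,"Neuhausen",550000,3,2,1500,"Old text.","Old area text."\n')
new = first_record(header + '4, "Neuhausen",550000,3,2,1500,"New text.","New area text."\n')
print(changed_facts(old, new))
```

With the DataFrames from this notebook, the two dictionaries could presumably be obtained via `recommended_listing.iloc[0].to_dict()` and `df_new.iloc[0].to_dict()`.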

The factual information looks good. Let's now print the old and new descriptions side by side to see whether the rewritten descriptions reflect the user preferences.

In [45]:
comparison_full = f"""-----------------------
Conversation history: 
{history}

-----------------------
Direct comparison: 
{comparison1}

-----------------------
Old description: 
{recommended_listing.iloc[0]['Description']}

-----------------------
New description: 
{df_new.iloc[0]['Description']}

-----------------------
Old neighborhood description: 
{recommended_listing.iloc[0]['Neighborhood Description']}

-----------------------
New neighborhood description: 
{df_new.iloc[0]['Neighborhood Description']}

"""
print(comparison_full)

-----------------------
Conversation history: 
AI: How big do you want your house to be?
Human: A comfortable three-bedroom house with a spacious kitchen and a cozy living room.
AI: What are 3 most important things for you in choosing this property?
Human: A quiet neighborhood, good local schools, and convenient shopping options.
AI: Which amenities would you like?
Human: A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.
AI: Which transportation options are important to you?
Human: Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.
AI: How urban do you want your neighborhood to be?
Human: A balance between suburban tranquility and access to urban amenities like restaurants and theaters.
AI: Do you have an upper price target?
Human: It should be under $1,000,000.


-----------------------
Direct comparison: 
Field name               - Old Value       - New value
Unique ID                - 4               - 

For future reference, we provide the possibility to store and retrieve all generated listings, including the respective conversation history and the comparisons:

In [None]:
try:
    with open(COMPARISONS_FILE, 'a', encoding='utf-8') as file:
        file.write(f"-----------------------\n")
        file.write(f"{datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S %Z')}\n")
        file.write(f"-----------------------\n")
        file.write(comparison_full)
        file.write(f"-----------------------\n")
except FileNotFoundError:
    print(f"Error: The file {COMPARISONS_FILE} was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

Execute this cell if you just want to retrieve past examples.

In [47]:
try:
    with open(COMPARISONS_FILE, 'r', encoding='utf-8') as file:
        content = file.read()
        print(content)
except FileNotFoundError:
    print(f"Error: The file {COMPARISONS_FILE} was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

-----------------------
2025-04-23 13:58:48 UTC
-----------------------
-----------------------
Conversation history: 
AI: How big do you want your house to be?
Human: Roughly enough for a family of 5 plus guests
AI: What are 3 most important things for you in choosing this property?
Human: the garden, good public transportation nearby, and a nice view above the city center
AI: Which amenities would you like?
Human: they are not important
AI: Which transportation options are important to you?
Human: public transportation and bike lanes
AI: How urban do you want your neighborhood to be?
Human: Rather urban
AI: Do you have an upper price target?
Human: 1 million dollars


-----------------------
Direct comparison: 
Field name               - Old Value       - New value
Unique ID                - 5               - 5         
Neighborhood             - Sendling        -  Sendling 
Price                    - 720000          - 720000    
Bedrooms                 - 3               - 3        

## Ideas for improvement

#### Update LangChain packages and code
Several LangChain components used here are already deprecated. For now, I have stayed close to the course content. A future version of this application might
- use `LangGraph`,
- replace `RetrievalQA`,
- update the OpenAI-related components for embeddings and the LLM itself.


#### Improve formatting of the LLM output. 
LLMs are probabilistic in nature. When constructing the query for the recommended listing, the hardest (and possibly still not fully resolved) part was making the LLM adhere **strictly** to my guidelines (e.g., placing the CSV-formatted listing between ```` ```csv```` and ```` ``` ````).
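
One way to make the parsing side more tolerant of small formatting deviations is sketched below. It assumes the LLM still wraps the listing in a fenced block and only varies in details such as letter case or trailing spaces after the `csv` tag. The helper name `extract_csv_block` is hypothetical, and the fence string is built with `"`" * 3` only to keep this example readable inside a fenced block.

```python
import re

FENCE = "`" * 3  # three backticks

# Matches a fenced block opened by the fence plus "csv" (any case) and captures
# everything up to the next closing fence
CSV_BLOCK = re.compile(FENCE + r"csv[ \t]*\r?\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)

def extract_csv_block(text):
    """Return the content of the last csv-fenced block, or None if there is none."""
    blocks = CSV_BLOCK.findall(text)
    return blocks[-1].strip() if blocks else None

# Made-up reply for illustration
reply = "Some explanation.\n" + FENCE + "csv\nUnique ID,Price\n4,550000\n" + FENCE + "\n"
print(extract_csv_block(reply))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or ask for manual input.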
