This is a starter notebook for the project. Import the libraries you need; the ones available in this workspace are listed in its requirements.txt file.

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "xxx"
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

from langchain.llms import OpenAI


In [3]:
MODEL_NAME = 'gpt-3.5-turbo'
OPENAI_API_KEY = "xxx"

In [4]:
from langchain.chat_models import ChatOpenAI
llm = OpenAI(model_name=MODEL_NAME, temperature=0, api_key=OPENAI_API_KEY)



In [54]:
instruction = "Generate a CSV file with at least 10 real estate listings."
sample_listing= \
"""
Neighborhood: Downtown San Mateo
Price: $1,080,000
Bedrooms: 2
Bathrooms: 2
House Size: 1,500 sqft

Description:  Rare & amazing opportunity to own this luxury & updated house that is perfectly situated between Downtown San Mateo & Downtown Burlingame. Stunning 2 bed, 2 bath, home that offers an open & spacious floor plan with a cozy fireplace & living room that extends out to a private patio. 
Neighborhood Description: complex is secure, well-appointed & within strolling distance of Downtown San Mateo shops, restaurants, & the Japanese Tea Garden. Residents will benefit from quick access to major freeways, Caltrain, Bart, & SFO.
"""

In [49]:
from pydantic import BaseModel, Field, NonNegativeInt
from typing import List

class RealEstateListing(BaseModel):
    """
    A real estate listing.
    
    Attributes:
    - neighborhood: str
    - price: NonNegativeInt
    - bedrooms: NonNegativeInt
    - bathrooms: NonNegativeInt
    - house_size: NonNegativeInt
    - description: str
    - neighborhood_description: str
    """
    neighborhood: str = Field(description="The neighborhood where the property is located")
    price: NonNegativeInt = Field(description="The price of the property in USD")
    bedrooms: NonNegativeInt = Field(description="The number of bedrooms in the property")
    bathrooms: NonNegativeInt = Field(description="The number of bathrooms in the property")
    house_size: NonNegativeInt = Field(description="The size of the house in square feet")
    description: str = Field(description="A description of the property")
    neighborhood_description: str = Field(description="A description of the neighborhood.")  

class ListingCollection(BaseModel):
    """
    A collection of real estate listings.
    
    Attributes:
    - listings: List[RealEstateListing]
    """
    listings: List[RealEstateListing] = Field(description="A list of real estate listings")
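
The sample listing writes price and size as formatted strings ("$1,080,000", "1,500 sqft"), while the schema above expects non-negative integers. In this notebook the LLM does that conversion. If raw strings ever had to be normalized by hand, a helper along these lines might do it; `parse_count` is a hypothetical name and not part of the notebook.

```python
import re


def parse_count(text: str) -> int:
    """Return the first integer found in a formatted string such as '$1,080,000'."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    # Drop thousands separators before converting.
    return int(match.group().replace(",", ""))


print(parse_count("$1,080,000"))
print(parse_count("1,500 sqft"))
```

The results are plain integers, so they would pass the `NonNegativeInt` validation on `price` and `house_size`.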

In [50]:
from langchain.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=ListingCollection)

In [55]:
from langchain.prompts import PromptTemplate

# printing the prompt
prompt = PromptTemplate(
    template="{instruction}\n{sample}\n{format_instructions}\n",
    input_variables=["instruction", "sample"],
    partial_variables={"format_instructions": parser.get_format_instructions},
)

query = prompt.format(
    instruction=instruction,
    sample=sample_listing,
)
print(query)

Generate a CSV file with at least 10 real estate listings.

Neighborhood: Downtown San Mateo
Price: $1,080,000
Bedrooms: 2
Bathrooms: 2
House Size: 1,500 sqft

Description:  Rare & amazing opportunity to own this luxury & updated house that is perfectly situated between Downtown San Mateo & Downtown Burlingame. Stunning 2 bed, 2 bath, home that offers an open & spacious floor plan with a cozy fireplace & living room that extends out to a private patio. 
Neighborhood Description: complex is secure, well-appointed & within strolling distance of Downtown San Mateo shops, restaurants, & the Japanese Tea Garden. Residents will benefit from quick access to major freeways, Caltrain, Bart, & SFO.

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} 

In [56]:
response = llm(query)

In [6]:
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Collecting tzdata>=2022.7
  Downloading tzdata-2025.1-py2.py3-none-any.whl (346 kB)
Installing collected packages: tzdata, pandas
Successfully installed pandas-2.2.3 tzdata-2025.1


In [7]:
import sys
sys.path.append("/home/student/.local/lib/python3.10/site-packages")


In [57]:
import pandas as pd
from fastapi.encoders import jsonable_encoder

print(pd.__version__)  # confirm that pandas is importable

# Parse the LLM response into a ListingCollection, then build a DataFrame from it
result = parser.parse(response)
print(result)
df = pd.DataFrame(jsonable_encoder(result.listings))
df

2.2.3
listings=[RealEstateListing(neighborhood='Downtown San Mateo', price=1080000, bedrooms=2, bathrooms=2, house_size=1500, description='Rare & amazing opportunity to own this luxury & updated house that is perfectly situated between Downtown San Mateo & Downtown Burlingame. Stunning 2 bed, 2 bath, home that offers an open & spacious floor plan with a cozy fireplace & living room that extends out to a private patio.', neighborhood_description='Complex is secure, well-appointed & within strolling distance of Downtown San Mateo shops, restaurants, & the Japanese Tea Garden. Residents will benefit from quick access to major freeways, Caltrain, Bart, & SFO.'), RealEstateListing(neighborhood='Sunnyvale', price=950000, bedrooms=3, bathrooms=2, house_size=1800, description='Beautiful single-family home located in the heart of Sunnyvale. This 3 bed, 2 bath property features a spacious backyard, updated kitchen, and a cozy living room with a fireplace.', neighborhood_description='Quiet neighb

Unnamed: 0,neighborhood,price,bedrooms,bathrooms,house_size,description,neighborhood_description
0,Downtown San Mateo,1080000,2,2,1500,Rare & amazing opportunity to own this luxury ...,"Complex is secure, well-appointed & within str..."
1,Sunnyvale,950000,3,2,1800,Beautiful single-family home located in the he...,"Quiet neighborhood with easy access to parks, ..."
2,Palo Alto,2200000,4,3,2500,Luxurious modern home in the prestigious Palo ...,"Prime location near Stanford University, top-r..."
3,Mountain View,1200000,3,2,1600,Charming ranch-style home in the desirable Mou...,"Close to tech campuses, parks, and downtown Mo..."
4,Redwood City,1350000,4,3,2200,Spacious family home in a quiet Redwood City n...,"Close to schools, parks, and shopping centers...."
5,Menlo Park,1800000,5,4,3000,Elegant estate in the prestigious Menlo Park n...,"Located near top-rated schools, parks, and ups..."
6,San Francisco,2500000,3,3,2000,Modern luxury condo in the heart of San Franci...,"Prime location near Union Square, restaurants,..."
7,Oakland,900000,2,1,1200,Cozy bungalow in the vibrant Oakland neighborh...,"Close to local cafes, shops, and parks. Easy a..."
8,Berkeley,1100000,3,2,1600,Classic craftsman home in the sought-after Ber...,"Located near UC Berkeley, parks, and gourmet d..."
9,San Jose,800000,4,2,1800,Spacious family home in a quiet San Jose neigh...,"Close to schools, parks, and shopping centers...."


In [58]:
# save the dataframe to a csv file
df.to_csv('listings.csv', index_label = 'id')

## Step 3: Storing Listings in a Vector Database

`Vector Database Setup:` Initialize and configure ChromaDB or a similar vector database to store real estate listings.

`Generating and Storing Embeddings:` Convert the LLM-generated listings into suitable embeddings that capture the semantic content of each listing, and store these embeddings in the vector database.

In [9]:
import shutil
import pandas as pd

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize and configure ChromaDB or a similar vector database to store real estate listings
CHROMA_PATH = "chroma"
CSV_PATH = "listings.csv" 

embedding_function = OpenAIEmbeddings()

df = pd.read_csv(CSV_PATH)
documents = []
for index, row in df.iterrows():
    documents.append(Document(page_content=row['description'], metadata={'id': str(index)}))


# Split text. chunk_size is large enough that each listing description stays in one chunk.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    length_function=len,
    add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

# Inspect the last chunk; chunks[-1] avoids an IndexError when fewer than 10 chunks exist
if chunks:
    document = chunks[-1]
    print(document.page_content)
    print(document.metadata)

# Save to Chroma
if os.path.exists(CHROMA_PATH):
    shutil.rmtree(CHROMA_PATH)

db = Chroma.from_documents(
    chunks, embedding_function, persist_directory=CHROMA_PATH
)
db.persist()
print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

Split 10 documents into 10 chunks.
Spacious family home in a quiet San Jose neighborhood. This 4 bed, 2 bath property features a large living room, updated kitchen, and a backyard with fruit trees.
{'id': '9', 'start_index': 0}
Saved 10 chunks to chroma.
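
The cell above embeds only the `description` column, so the neighborhood, price, and room counts are not part of the stored text. One possible alternative, sketched here as an assumption rather than as what the notebook does, is to join every field into a single string before embedding. `listing_to_text` is a hypothetical helper; the sample row reuses the first generated listing.

```python
def listing_to_text(row: dict) -> str:
    """Join every field of one listing into a single string for embedding."""
    return "\n".join([
        f"Neighborhood: {row['neighborhood']}",
        f"Price: ${int(row['price']):,}",
        f"Bedrooms: {row['bedrooms']}",
        f"Bathrooms: {row['bathrooms']}",
        f"House Size: {int(row['house_size']):,} sqft",
        f"Description: {row['description']}",
        f"Neighborhood Description: {row['neighborhood_description']}",
    ])


# First listing from the generated data, written out as a plain dict.
sample_row = {
    "neighborhood": "Downtown San Mateo",
    "price": 1080000,
    "bedrooms": 2,
    "bathrooms": 2,
    "house_size": 1500,
    "description": (
        "Rare & amazing opportunity to own this luxury & updated house that is "
        "perfectly situated between Downtown San Mateo & Downtown Burlingame."
    ),
    "neighborhood_description": (
        "Complex is secure, well-appointed & within strolling distance of "
        "Downtown San Mateo shops, restaurants, & the Japanese Tea Garden."
    ),
}

text = listing_to_text(sample_row)
print(text)
```

A string built this way could be used as `page_content` in place of `row['description']`. Whether that improves retrieval here has not been tested.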


## Step 4: Building the User Preference Interface

- Collect buyer preferences, such as the number of bedrooms, bathrooms, location, and other specific requirements, either by asking a set of questions or by letting the buyer describe their preferences in natural language.
Example:

```python
questions = [
    "How big do you want your house to be?",
    "What are the 3 most important things for you in choosing this property?",
    "Which amenities would you like?",
    "Which transportation options are important to you?",
    "How urban do you want your neighborhood to be?",
]
answers = [
    "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
    "A quiet neighborhood, good local schools, and convenient shopping options.",
    "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
    "Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.",
    "A balance between suburban tranquility and access to urban amenities like restaurants and theaters."
]
```
- Buyer Preference Parsing: Implement logic to interpret and structure these preferences for querying the vector database.
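
One simple way to structure the preferences for querying is to pair each question with its answer and join the pairs into a single query string. The sketch below assumes that approach; `build_preference_query` is a hypothetical helper, and only the first two questions and answers from the example are repeated to keep it short.

```python
from typing import List


def build_preference_query(questions: List[str], answers: List[str]) -> str:
    """Pair each question with its answer and join them into one query string."""
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} answers; "
            "each question needs exactly one answer"
        )
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))


sample_questions = [
    "How big do you want your house to be?",
    "Which amenities would you like?",
]
sample_answers = [
    "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
    "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
]

preference_query = build_preference_query(sample_questions, sample_answers)
print(preference_query)
```

The length check guards against the question and answer lists drifting out of sync. The resulting string could be passed as the query text of a similarity search.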

In [10]:
PROMPT_TEMPLATE =\
"""
Based on the following context:

{context}

---

Answer the question : {question}
"""

## Step 5: Searching Based on Preferences

- Semantic Search Implementation: Use the structured buyer preferences to perform a semantic search on the vector database, retrieving listings that most closely match the user's requirements.
- Listing Retrieval Logic: Fine-tune the retrieval algorithm to ensure that the most relevant listings are selected based on the semantic closeness to the buyer’s preferences.
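
To make "semantic closeness" concrete: cosine similarity between embedding vectors is one common measure of it. The sketch below ranks listings by that measure using tiny made-up 3-dimensional vectors. Real embeddings have far more dimensions, and the relevance score reported by Chroma depends on the distance function it is configured with, so treat this as an illustration and not as the vector database's exact computation.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float],
          listing_vecs: Dict[str, List[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k listing ids whose vectors are closest to the query vector."""
    scored = [(lid, cosine_similarity(query_vec, vec)) for lid, vec in listing_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors, not real embeddings.
toy_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [1.0, 1.0, 0.0],
}
ranked = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(ranked)
```

Listing "a" points in the same direction as the query and ranks first; "c" shares one component and ranks second; "b" is orthogonal and is dropped by `k=2`.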

In [11]:
from langchain.prompts import ChatPromptTemplate


In [21]:
# Define the Prompt Template
PROMPT_TEMPLATE = """
You are an AI assistant that helps generate real estate listings based on provided data. 

### Context:
{context}

### Task:
Using the information above, answer the following question:

{question}

### Guidelines:
- Use clear, concise language.
- Provide structured responses with details like **Neighborhood, Price, Bedrooms, Bathrooms, House Size, Description, and Neighborhood Description**.
- Maintain a professional yet engaging tone.

Respond accurately based on the given context.
"""


In [22]:
def predict_response(query_text, prompt_template_text):
    """Retrieve the closest listings and ask the chat model to answer from them."""
    embedding_function = OpenAIEmbeddings()
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

    # Search the DB for the 3 most relevant listings.
    results = db.similarity_search_with_relevance_scores(query_text, k=3)
    # Give up when nothing is returned or the best match scores below 0.75.
    if len(results) == 0 or results[0][1] < 0.75:
        print("Unable to find matching results.")
        return

    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    sources = [doc.metadata.get("id", None) for doc, _score in results]
    prompt_template = ChatPromptTemplate.from_template(prompt_template_text)
    prompt = prompt_template.format(context=context_text, question=query_text)
    print(f"Generated Prompt:\n{prompt}")

    model = ChatOpenAI()
    response_text = model.predict(prompt)
    print(f"Response: {response_text}\nSources: {sources}")

In [23]:
predict_response("A house with gourmet kitchen" , PROMPT_TEMPLATE)

Generated Prompt:
Human: 
You are an AI assistant that helps generate real estate listings based on provided data. 

### Context:
Luxurious modern home in the prestigious Palo Alto neighborhood. This 4 bed, 3 bath property boasts high-end finishes, a gourmet kitchen, and a spacious master suite.

---

Spacious family home in a quiet San Jose neighborhood. This 4 bed, 2 bath property features a large living room, updated kitchen, and a backyard with fruit trees.

---

Spacious family home in a quiet Redwood City neighborhood. This 4 bed, 3 bath property offers a large kitchen, formal dining room, and a backyard perfect for entertaining.

### Task:
Using the information above, answer the following question:

A house with gourmet kitchen

### Guidelines:
- Use clear, concise language.
- Provide structured responses with details like **Neighborhood, Price, Bedrooms, Bathrooms, House Size, Description, and Neighborhood Description**.
- Maintain a professional yet engaging tone.

Respond acc

In [24]:
predict_response('A house price below $300,000', PROMPT_TEMPLATE)

Unable to find matching results.
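
The text stored in the vector database is only each listing's description, so a budget query like this one has no price information to match against. One possible complement, shown here as an assumption and not as part of the notebook, is to apply hard numeric constraints to the structured fields separately from the semantic search. `filter_listings` is a hypothetical helper; the rows reuse the neighborhood, price, and bedroom values from the generated listings.

```python
from typing import Dict, List, Optional

# Neighborhood, price, and bedrooms copied from the generated listings.
listings = [
    {"neighborhood": "Downtown San Mateo", "price": 1080000, "bedrooms": 2},
    {"neighborhood": "Sunnyvale", "price": 950000, "bedrooms": 3},
    {"neighborhood": "Palo Alto", "price": 2200000, "bedrooms": 4},
    {"neighborhood": "Mountain View", "price": 1200000, "bedrooms": 3},
    {"neighborhood": "Redwood City", "price": 1350000, "bedrooms": 4},
    {"neighborhood": "Menlo Park", "price": 1800000, "bedrooms": 5},
    {"neighborhood": "San Francisco", "price": 2500000, "bedrooms": 3},
    {"neighborhood": "Oakland", "price": 900000, "bedrooms": 2},
    {"neighborhood": "Berkeley", "price": 1100000, "bedrooms": 3},
    {"neighborhood": "San Jose", "price": 800000, "bedrooms": 4},
]


def filter_listings(rows: List[Dict],
                    max_price: Optional[int] = None,
                    min_bedrooms: Optional[int] = None) -> List[Dict]:
    """Keep only rows that satisfy the given numeric constraints."""
    kept = []
    for row in rows:
        if max_price is not None and row["price"] > max_price:
            continue
        if min_bedrooms is not None and row["bedrooms"] < min_bedrooms:
            continue
        kept.append(row)
    return kept


under_one_million = filter_listings(listings, max_price=1_000_000)
print([row["neighborhood"] for row in under_one_million])
print(filter_listings(listings, max_price=300_000))
```

With these ten listings, a $300,000 cap leaves nothing, which agrees with the "Unable to find matching results." outcome above, although here the reason is explicit: every listing is priced higher.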


## Step 6: Personalizing Listing Descriptions

- LLM Augmentation: For each retrieved listing, use the LLM to augment the description, tailoring it to resonate with the buyer’s specific preferences. This involves subtly emphasizing aspects of the property that align with what the buyer is looking for.
- Maintaining Factual Integrity: Ensure that the augmentation process enhances the appeal of the listing without altering factual information.
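
A lightweight check for factual integrity is to compare the numbers in an augmented description with those in the original listing and flag any that appear only in the rewrite. The sketch below assumes that approach. `unsupported_numbers` is a hypothetical helper, the original text is the Palo Alto description from the generated listings, and the two rewrites are invented for illustration.

```python
import re
from typing import Set


def extract_numbers(text: str) -> Set[str]:
    """Collect every integer in the text, with thousands separators removed."""
    return {match.replace(",", "") for match in re.findall(r"\d[\d,]*", text)}


def unsupported_numbers(original: str, augmented: str) -> Set[str]:
    """Numbers that appear in the augmented text but not in the original."""
    return extract_numbers(augmented) - extract_numbers(original)


original = (
    "Luxurious modern home in the prestigious Palo Alto neighborhood. "
    "This 4 bed, 3 bath property boasts high-end finishes, a gourmet kitchen, "
    "and a spacious master suite."
)
# Hypothetical rewrites, invented for this example.
faithful = "This 4 bed, 3 bath Palo Alto home centers on a gourmet kitchen."
altered = "This 5 bed, 3 bath Palo Alto home centers on a gourmet kitchen."

print(unsupported_numbers(original, faithful))
print(unsupported_numbers(original, altered))
```

This catches only numeric drift such as a changed bedroom count or price. Invented amenities or locations would need a different check, for example asking the LLM to verify each claim against the original listing.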

In [29]:
# Define the Prompt Template
buyer_preferences = """
- Budget: $1,900,000 max
- Neighborhood: Downtown San Mateo or close areas
- Bedrooms: At least 2
- Bathrooms: At least 2
- Would like easy access to public transportation
"""

AUGMENT_PROMPT_TEMPLATE = """
You are an AI assistant that helps generate real estate listings based on provided data. 

### Context:
{context}

### Buyer Preferences:
{buyer_preferences}


### Task:
Using the information above, answer the following question:

{question}

### Guidelines:
- Use clear, concise language.
- Provide structured responses with details like **Neighborhood, Price, Bedrooms, Bathrooms, House Size, Description, and Neighborhood Description**.
- Maintain a professional yet engaging tone.

Craft a response that not only answers the question {question}, but also ensures that your explanation is distinct, captivating, and customized to align with the specified preferences. This involves subtly emphasizing aspects of the property that align with what the buyer is looking for.
"""



In [32]:
# Search the DB.
query_text = "A house with gourmet kitchen"
results = db.similarity_search_with_relevance_scores(query_text, k=3)
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
sources = [doc.metadata.get("id", None) for doc, _score in results]

prompt_template = ChatPromptTemplate.from_template(AUGMENT_PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, 
                                buyer_preferences=buyer_preferences,
                                question=query_text)

# Print and Generate Response
print(f"Generated Prompt:\n{prompt}")

model = ChatOpenAI()
response_text = model.predict(prompt)
print (f"Response: {response_text}\nSources: {sources}")


Generated Prompt:
Human: 
You are an AI assistant that helps generate real estate listings based on provided data. 

### Context:
Luxurious modern home in the prestigious Palo Alto neighborhood. This 4 bed, 3 bath property boasts high-end finishes, a gourmet kitchen, and a spacious master suite.

---

Spacious family home in a quiet San Jose neighborhood. This 4 bed, 2 bath property features a large living room, updated kitchen, and a backyard with fruit trees.

---

Spacious family home in a quiet Redwood City neighborhood. This 4 bed, 3 bath property offers a large kitchen, formal dining room, and a backyard perfect for entertaining.

### Buyer Preferences:

- Budget: $1,900,000 max
- Neighborhood: Downtown San Mateo or close areas
- Bedrooms: At least 2
- Bathrooms: At least 2
- Would like easy access to public transportation



### Task:
Using the information above, answer the following question:

A house with gourmet kitchen

### Guidelines:
- Use clear, concise language.
- Provid

## Step 7: Deliverables and Testing

- Test your "HomeMatch" application and make sure it meets all of the requirements in the rubric.

### The application was tested with several queries:
- unrealistic query: returned 0 matches
- realistic query: returned 3 matches with details about the property
- same realistic query with buyer preferences: returned a listing enhanced to reflect those preferences