# Setting up environment

In [59]:
!pip install pandas
!pip install openai
!pip install chromadb
!pip install langchain
!pip install numpy
!pip install -U langchain-openai
!pip install pydantic
# shutil ships with the Python standard library, so it needs no pip install.

In [60]:
import os
import pandas as pd
import shutil
from openai import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.evaluation import load_evaluator
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.vectorstores.chroma import Chroma
from dataclasses import dataclass
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Generating Real Estate Listings

### We use GPT-4 (`gpt-4-turbo-preview`) here for higher-quality output.

In [34]:
client = OpenAI(api_key='YOUR_OPENAI_KEY')

prompt='''
Craft a comprehensive CSV file encapsulating the essence of 20 distinct real estate listings. Each listing should be meticulously organized into columns, showcasing the following attributes:

- Neighborhood: Identify the neighborhood where the property resides, such as "Green Oaks."
- Price: State the property's market price in USD, formatted (e.g., "$800,000").
- Bedrooms: Enumerate the bedrooms within the property (e.g., 3).
- Bathrooms: Count the property's bathrooms (e.g., 2).
- House Size: Detail the property's square footage (e.g., "2,000 sqft").

Elevate each property's presentation with a comprehensive paragraph that accentuates its unique features, amenities, and sustainable qualities. Highlight elements like energy-efficient appliances, solar panel integration, use of sustainable materials, and the presence of gardens.

Example Entry Format:
Neighborhood,Price,Bedrooms,Bathrooms,House Size,Description
Green Oaks,"$800,000",3,2,"2,000 sqft","Nestled in Green Oaks, this eco-friendly haven features a 3-bedroom, 2-bathroom layout with solar panels and efficient insulation. Highlights include abundant natural light, hardwood floors, and an open-concept kitchen that leads to a lush backyard, embodying a sanctuary for eco-conscious living. The neighborhood of Green Oaks is celebrated for its vibrant and environmentally-aware community, boasting organic stores, community gardens, and convenient transit options, rendering it perfect for those prioritizing sustainability and community engagement."


Ensure the description not only reflects the property's allure, such as its eco-friendly design and comfortable living spaces but also paints a vivid picture of the neighborhood's character. Emphasize community elements like organic grocery stores, parks, cafés, accessibility to public transportation, and commitment to environmental initiatives.

Structure the CSV with clear headers for each column. Follow the example provided to format each subsequent row with information specific to a different property listing.
Make sure you generate 20 unique listings.
'''

messages = [{"role": "system", "content": prompt}]
response = client.chat.completions.create(model="gpt-4-turbo-preview", messages=messages)
bot_response = response.choices[0].message.content
messages.append({"role": "assistant", "content": bot_response})
print(bot_response)

```csv
Neighborhood,Price,Bedrooms,Bathrooms,House Size,Description
Green Oaks,"$800,000",3,2,"2,000 sqft","Nestled in Green Oaks, this eco-friendly haven features a 3-bedroom, 2-bathroom layout with solar panels and efficient insulation. Highlights include abundant natural light, hardwood floors, and an open-concept kitchen that leads to a lush backyard, embodying a sanctuary for eco-conscious living. The neighborhood of Green Oaks is celebrated for its vibrant and environmentally-aware community, boasting organic stores, community gardens, and convenient transit options, rendering it perfect for those prioritizing sustainability and community engagement."
Sunset Valley,"$750,000",4,3,"2,500 sqft","Located in the heart of Sunset Valley, this stunning 4-bedroom, 3-bathroom home offers a hint of luxury with sustainable living. Its high-efficiency appliances, LED lighting throughout, and large windows for passive solar heating make it an exemplary model of energy efficiency. The home's m
```

### Let's load the dataset
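The load step assumes the model's reply was first saved as `Home.csv`, which the notebook does not show. Here is a minimal sketch of that step, assuming the reply arrives wrapped in a Markdown code fence as in the printed output. The names `extract_csv_rows` and `sample_reply` are hypothetical, and the sample reply is shortened for illustration.

```python
import csv
import io


def extract_csv_rows(reply: str) -> list[dict]:
    """Strip optional Markdown fence lines and parse the rest as CSV."""
    lines = reply.strip().splitlines()
    # Drop fence lines such as ```csv and ``` if the model added them.
    lines = [line for line in lines if not line.strip().startswith("```")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    return list(reader)


# Hypothetical, shortened reply used only to illustrate the parsing step.
sample_reply = '''```csv
Neighborhood,Price,Bedrooms,Bathrooms,House Size,Description
Green Oaks,"$800,000",3,2,"2,000 sqft","Eco-friendly home with solar panels."
```'''

rows = extract_csv_rows(sample_reply)
print(rows[0]["Neighborhood"], rows[0]["Price"])
```

Something like `pd.DataFrame(rows).to_csv("Home.csv", index=False)` would then write the file. Without `index=False`, pandas also writes the index, which shows up as an `Unnamed: 0` column when the file is read back.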

In [38]:
df=pd.read_csv('Home.csv')
df.head()

Unnamed: 0,Neighborhood,Price,Bedrooms,Bathrooms,House Size,Description
0,Green Oaks,"$800,000",3,2.0,"2,000 sqft","Nestled in Green Oaks, this eco-friendly haven..."
1,Sunset Valley,"$750,000",4,3.0,"2,500 sqft","Located in the heart of Sunset Valley, this st..."
2,Riverbend,"$950,000",5,4.0,"3,200 sqft",Experience grand living in the prestigious nei...
3,Maple Grove,"$650,000",3,2.5,"2,100 sqft",This charming property in Maple Grove presents...
4,Lakewood,"$1,200,000",4,3.5,"4,000 sqft",Nestle into luxury with this elegant 4-bedroom...


#### The data looks good, so we can now proceed to the vector database embeddings.

# Storing Listings in a Vector Database

### Let's understand how this vector embedding works.

In [58]:
os.environ["OPENAI_API_KEY"] ="YOUR_OPENAI_KEY"

# Get embedding for a word.
embedding_function = OpenAIEmbeddings()
vector = embedding_function.embed_query("new york")
print(f"Vector for 'new york': {vector}")
print(f"Vector length: {len(vector)}")

# Compare vector of two words
evaluator = load_evaluator('pairwise_embedding_distance')
words = ("new york", "nyc")
x = evaluator.evaluate_string_pairs(prediction=words[0], prediction_b=words[1])
print(f"Comparing ({words[0]}, {words[1]}): {x}")


Vector for 'new york': [-0.010643694203542546, -0.014876828044973634, 0.0088584746146283, -0.04187150908392437, -0.02592625384627089, -0.000525337976077255, -0.026237313713428034, 0.013605535519077703, -0.02281564399882891, -0.036245363516667704, 0.021923034670033058, 0.0011994442512306973, 0.0036651092786116053, 0.00593720683139763, -0.010907419602384725, -0.012483010593766051, 0.020313632945002807, -0.020124291346452768, 0.0172435952328815, -0.011110285866923352, -0.008351309884604276, 0.028265972633524122, -0.010143291806512817, -0.014295278583334927, 0.0009086698696572982, 0.001627153081970906, -0.0018883428387708612, -0.010718078003834685, 0.0030937040149562473, -0.015728864307126268, 0.013774590118644863, -0.017067778300320045, -0.028888094230483503, -0.02426275345728629, 0.006731764660082587, -0.006965060026111717, -0.010778938069460783, -0.012408626793473912, 0.0073031701565685805, -0.002006681105034, 0.010650456536536838, -0.031430679282275366, -0.0035805819788280254, -0.00652

#### This gives us some knowledge of how we can use embeddings in our ChromaDB store. Let's proceed and implement it.
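To make the comparison step more concrete, here is a small sketch of cosine distance, which is (as far as I know) the default metric behind `pairwise_embedding_distance`. The three-dimensional vectors are made up for illustration; real embeddings have many more dimensions, and `cosine_distance` is a hypothetical helper, not a LangChain function.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, invented for this example only.
v_new_york = [0.9, 0.1, 0.3]
v_nyc = [0.8, 0.2, 0.3]
v_banana = [0.1, 0.9, 0.0]

print(cosine_distance(v_new_york, v_nyc))     # small distance: similar meaning
print(cosine_distance(v_new_york, v_banana))  # larger distance: unrelated
```

A vector store such as Chroma runs the same kind of comparison between the query vector and every stored chunk vector, then returns the closest chunks.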

In [63]:
# Configuration
CHROMA_PATH = "chroma"
CSV_PATH = "Home.csv" 

df = pd.read_csv(CSV_PATH)
documents = []
for index, row in df.iterrows():
    documents.append(Document(page_content=row['Description'], metadata={'id': str(index)}))


# Split Text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

if len(chunks) > 10:  # chunks[10] only exists when there are at least 11 chunks
    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

# Save to Chroma
if os.path.exists(CHROMA_PATH):
    shutil.rmtree(CHROMA_PATH)

db = Chroma.from_documents(
    chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH
)
db.persist()
print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

Split 20 documents into 51 chunks.
healthy living environment. Solar panels and a geothermal heating system reduce energy consumption. The vibrant Maple Grove community is family-friendly, with numerous parks, great schools, and a tight-knit community feel, making it a desirable location for those wishing to live harmoniously with
{'id': '3', 'start_index': 199}
Saved 51 chunks to chroma.
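The effect of `chunk_size` and `chunk_overlap` can be seen in a simplified sketch. This is a plain sliding window, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraphs, newlines, and spaces, which is why its `start_index` values (199 above) are not exact multiples of the step. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.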


# Implementing Semantic Search and Augmented Response Generation

In [65]:
CHROMA_PATH = "chroma"

PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

query_text = "Would like to buy home with 3 bedrooms" 

# Prepare the DB.
embedding_function = OpenAIEmbeddings()
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

# Search the DB.
results = db.similarity_search_with_relevance_scores(query_text, k=3)
if len(results) == 0 or results[0][1] < 0.7:
    print(f"Unable to find matching results.")
else:
    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    print(f"Generated Prompt:\n{prompt}")

    model = ChatOpenAI()
    response_text = model.predict(prompt)

    sources = [doc.metadata.get("id", None) for doc, _score in results]
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    print(formatted_response)

Generated Prompt:
Human: 
Answer the question based only on the following context:

This charming property in Maple Grove presents a 3-bedroom, 2.5-bathroom home enveloped in greenery. It's built using sustainable materials, featuring bamboo flooring and non-toxic paint, ensuring a healthy living environment. Solar panels and a geothermal heating system reduce energy consumption.

---

Just moments from the shore, this 4-bedroom, 3-bathroom home in Ocean Breeze offers the dream coastal lifestyle with a green twist. It features a modern, open layout with bamboo floors, a solar power system, and decks made from recycled materials, providing breathtaking views while being mindful of

---

In the Garden District, this delightful 3-bedroom, 2-bathroom home features a traditional design with modern, sustainable updates like rain gardens and a vegetable patch in its spacious backyard. With classic charm and solar-powered lighting, it presents a perfect blend of the past and the present.

---




Response: Yes, based on the context provided, there are three options available for purchasing a home with 3 bedrooms.
Sources: ['3', '16', '7']
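The retrieval logic in the cell above can be isolated into a small function. This is a sketch with made-up chunks and scores; `build_context` and its `filter_each` flag are hypothetical names. It assumes, as the cell does, that results arrive sorted by descending relevance.

```python
from typing import Optional


def build_context(results: list[tuple[str, float]],
                  min_score: float = 0.7,
                  filter_each: bool = False) -> Optional[str]:
    """Join chunk texts into one context string, or return None if the best hit is weak.

    With filter_each=False only the top hit is checked, as in the notebook cell.
    With filter_each=True every hit below min_score is dropped.
    """
    if not results or results[0][1] < min_score:
        return None
    if filter_each:
        results = [(text, score) for text, score in results if score >= min_score]
    return "\n\n---\n\n".join(text for text, _score in results)


# Invented chunks and scores, for illustration only.
hits = [
    ("3-bedroom home in Maple Grove", 0.82),
    ("4-bedroom home in Ocean Breeze", 0.65),
]
print(build_context(hits))
print(build_context(hits, filter_each=True))
print(build_context([("unrelated text", 0.41)]))  # prints None
```

Because the notebook checks only the first score, weaker chunks further down the list still reach the prompt. Filtering each hit is one possible way to tighten that.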


### Let's make the responses more distinctive.

In [69]:
CHROMA_PATH = "chroma"

PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Given the context provided above, craft a response that not only answers the question {question}, but also ensures that your explanation is distinct, captivating, and customized to align with the specified preferences. Strive to present your insights in a manner that resonates with the audience's interests and requirements
"""

query_text = "Would like to buy home in calm neighbourhood" 

# Prepare the DB.
embedding_function = OpenAIEmbeddings()
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

# Search the DB.
results = db.similarity_search_with_relevance_scores(query_text, k=3)
if len(results) == 0 or results[0][1] < 0.7:
    print(f"Unable to find matching results.")
else:
    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    print(f"Generated Prompt:\n{prompt}")

    model = ChatOpenAI()
    response_text = model.predict(prompt)

    sources = [doc.metadata.get("id", None) for doc, _score in results]
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    print(formatted_response)

Generated Prompt:
Human: 
Answer the question based only on the following context:

In the peaceful neighborhood of Cedar Hills, discover a 4-bedroom, 3-bathroom home that focuses on energy efficiency and a low carbon footprint with its solar panel array and drought-resistant landscaping. Inside, enjoy spacious, light-filled rooms with sustainable materials and a design that

---

a warm and welcoming community for those who desire suburban tranquility with an eco-conscious mindset.

---

community feel, making it a desirable location for those wishing to live harmoniously with nature.

---

Given the context provided above, craft a response that not only answers the question Would like to buy home in calm neighbourhood, but also ensures that your explanation is distinct, captivating, and customized to align with the specified preferences. Strive to present your insights in a manner that resonates with the audience's interests and requirements

Response: Based on your desire for a calm

# Conclusion:

- I believe RecursiveCharacterTextSplitter suits this scenario better than CharacterTextSplitter, because it falls back through several separators (paragraphs, lines, words) to keep chunks within the size limit.
- We could have used one of the newer embedding models, such as text-embedding-3-large.
- Overall, the pipeline performed as well as we expected.

# Approach to Multimodal Search:

Personally, I can think of two ways to implement it:

1. Add an image column containing image links. Instead of embedding the images themselves, we can embed the links directly and display them to the user, who then has to click a link to see the site.
2. Use CLIP. This is the more complete approach: we embed the images along with the text and store them as vectors, so that when the user asks something, the answer comes with images, which makes it easier to choose.
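One reading of the first option is to keep the image link next to each chunk and hand it back with the search results. The sketch below assumes that reading. The `ListingChunk` class, the `image_url` field, and the example.com URLs are all hypothetical placeholders; in the notebook the link could instead sit in the `Document` metadata next to `'id'`.

```python
from dataclasses import dataclass


@dataclass
class ListingChunk:
    text: str
    listing_id: str
    image_url: str  # hypothetical extra column in the CSV


# Made-up records; the URLs are placeholders, not real images.
chunks = [
    ListingChunk("Solar-powered 3-bedroom home in Maple Grove", "3",
                 "https://example.com/images/3.jpg"),
    ListingChunk("Coastal 4-bedroom home in Ocean Breeze", "16",
                 "https://example.com/images/16.jpg"),
]


def attach_images(retrieved_ids: list[str], chunks: list[ListingChunk]) -> dict[str, str]:
    """Map each retrieved listing id to its image link, skipping unknown ids."""
    by_id = {c.listing_id: c.image_url for c in chunks}
    return {i: by_id[i] for i in retrieved_ids if i in by_id}


print(attach_images(["16", "7"], chunks))
```

The ids passed in would be the `Sources` list from the semantic search step, so the response can show a link per matched listing.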

LanceDB has published a guide on how to implement and use multimodal search.