This is a starter notebook for the project, you'll have to import the libraries you'll need, you can find a list of the ones available in this workspace in the requirements.txt file in this workspace. 

In [1]:
import os

os.environ["OPENAI_API_KEY"] = "voc-11951781731266773652301680fbcbc47b594.20487951"
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

from langchain.document_loaders.csv_loader import CSVLoader
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationSummaryMemory, ConversationBufferMemory, CombinedMemory, ChatMessageHistory
from langchain import LLMChain
# from langchain.chains.question_answering import load_qa_chain
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from typing import Any, Dict, Optional, Tuple


### Generating Real Estate Listings

In [2]:
#Generate real estate listings
model_name="gpt-4o"
temperature = 0.0
llm = ChatOpenAI(model_name=model_name, temperature=temperature)

instruction = """
Generate at least 10 real estate listings to introduce various properties. An example of a listing is:
Neighborhood: Green Oaks
Price: $800,000
Bedrooms: 3
Bathrooms: 2
House Size: 2,000 sqft

Description: Welcome to this eco-friendly oasis nestled in the heart of Green Oaks. This charming 3-bedroom, 2-bathroom home boasts energy-efficient features such as solar panels and a well-insulated structure. Natural light floods the living spaces, highlighting the beautiful hardwood floors and eco-conscious finishes. The open-concept kitchen and dining area lead to a spacious backyard with a vegetable garden, perfect for the eco-conscious family. Embrace sustainable living without compromising on style in this Green Oaks gem.

Neighborhood Description: Green Oaks is a close-knit, environmentally-consci
"""

print("Generated real estate listings: ")
real_estate_listings = llm.predict(instruction)
print(real_estate_listings)

  llm = ChatOpenAI(model_name=model_name, temperature=temperature)


Generated real estate listings: 


  real_estate_listings = llm.predict(instruction)


### Listing 1
**Neighborhood:** Maplewood  
**Price:** $950,000  
**Bedrooms:** 4  
**Bathrooms:** 3  
**House Size:** 2,500 sqft  

**Description:** Discover this stunning colonial-style home in the heart of Maplewood. With 4 spacious bedrooms and 3 modern bathrooms, this property offers ample space for a growing family. The gourmet kitchen features granite countertops and stainless steel appliances, perfect for culinary enthusiasts. Enjoy cozy evenings by the fireplace in the expansive living room or entertain guests in the formal dining area. The beautifully landscaped backyard is ideal for summer barbecues and outdoor activities.

**Neighborhood Description:** Maplewood is known for its tree-lined streets and family-friendly atmosphere. With excellent schools and a vibrant community center, it's a perfect place for families to thrive.

### Listing 2
**Neighborhood:** Oceanview  
**Price:** $1,200,000  
**Bedrooms:** 5  
**Bathrooms:** 4  
**House Size:** 3,200 sqft  

**Description

In [None]:
# Save listings info to a .txt file
with open("listings.txt", "a", encoding="utf-8") as file:
    file.write(real_estate_listings.strip())


### Storing Listings in a Vector Database

#### 1. Split into chunks

In [3]:
# Load the saved text file
with open("listings.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Initialize the text splitter
# Do recursive splitting
"""
Recursive Splitting:

First, try splitting on "###" (listing-level) to get chunks;

Then, if there are some chunks are over the size limit (e.g. chunk_size=1000), 
then try "\n" (paragraph-level) to further split the chunks oversize;

Then, if there are some chunks are still oversize, then use "." (sentence) to further split;
"""
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # size of each chunk
    chunk_overlap=0,        # overlap between chunks
    separators=["###", "\n", "."]
)

# Split the text
chunks = text_splitter.split_text(raw_text)

# Clean up empty chunks (chunk.strip():Removes any leading or trailing whitespace (spaces, tabs, newlines) from each chunk.)
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

# Get character (not token!) count for each chunk
doc_lengths = [len(chunk) for chunk in chunks]
print(f"Character count of each chunk: {doc_lengths}")

# Example: print the first 3 chunks
for i, chunk in enumerate(chunks[-3:]):
    print(f"--- Chunk {i+1} ---\n")
    print(f"Character count of chunk {i+1}: {len(chunk)}\n")
    print(f"{chunk}\n")

Character count of each chunk: [753, 753, 663, 736, 696, 723, 687, 715, 690, 716, 851, 738, 724, 617, 685, 646, 645, 587, 600, 788]
--- Chunk 1 ---

Character count of chunk 1: 587

### Listing 18
**Neighborhood:** Pine Hill  
**Price:** $550,000  
**Bedrooms:** 3  
**Bathrooms:** 2  
**House Size:** 1,600 sqft  

**Description:** This charming ranch-style home in Pine Hill is perfect for first-time buyers. With 3 bedrooms and 2 bathrooms, the property features a cozy living room with a fireplace, an updated kitchen, and a sunroom. The large backyard is perfect for pets and outdoor activities.

**Neighborhood Description:** Pine Hill is a friendly neighborhood with excellent schools and community parks. It's a great place for families and young professionals.

--- Chunk 2 ---

Character count of chunk 2: 600

### Listing 19
**Neighborhood:** Silver Valley  
**Price:** $900,000  
**Bedrooms:** 4  
**Bathrooms:** 3  
**House Size:** 2,700 sqft  

**Description:** This elegant home in Sil

#### 2. Convert to embeddings and save in vector database

In [4]:
#LLM chatbot model
model_name="gpt-4o"
temperature = 0.0
llm_chat = ChatOpenAI(model_name=model_name, temperature=temperature)

# Initialize OpenAI embeddings (use embedding model: text-embedding-3-large)
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

# Convert chunk text to LangChain Document objects
split_docs = [Document(page_content=chunk) for chunk in chunks]

  llm_chat = ChatOpenAI(model_name=model_name, temperature=temperature)
  embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")


In [5]:
# Save chunks as embeddings to ChromaDB (persisted to ./chroma_store/)
# Path to Chroma DB
persist_directory = "./chroma_db"

# Check if the vector DB exists, if so, load the vector database; otherwise create one and save embedding to it
if os.path.exists(persist_directory) and os.listdir(persist_directory):
    #Load the saved vector database
    #By default, Chroma use cosine similarity as metric
    vectorstore = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embedding_model, 
    )
    vectorstore = Chroma.from_documents(
        split_docs,
        embedding_model,
        persist_directory="./chroma_db" ##Chroma DB is persisted at ./chroma_db; If not specify persist_directory:Chroma will use an in-memory vector store (non-persistent), the vector store will disappear once your script ends(suits for tesing).
    )
    print("Chunks saved to ChromaDB!")
else:
    vectorstore = Chroma.from_documents(
        split_docs,
        embedding_model,
        persist_directory="./chroma_db" ##Chroma DB is persisted at ./chroma_db; If not specify persist_directory:Chroma will use an in-memory vector store (non-persistent), the vector store will disappear once your script ends(suits for tesing).
    )
    print("Chunks saved to ChromaDB!")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


Chunks saved to ChromaDB!


#### 3.Collect User Preference Information

In [6]:
"""
Collect buyer preferences, such as the number of bedrooms, bathrooms, location, 
and other specific requirements from a set of questions or telling the buyer to 
enter their preferences in natural language. You can hard-code the buyer preferences 
in questions and answers, or collect them interactively however you'd like, example:
"""

# Collect buyer preferences information
personal_questions = [   
                "How big do you want your house to be?" 
                "What are 3 most important things for you in choosing this property?", 
                "Which amenities would you like?", 
                "Which transportation options are important to you?",
                "How urban do you want your neighborhood to be?",   
            ]
personal_answers = [
    "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
    "A quiet neighborhood, good local schools, and convenient shopping options.",
    "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
    "Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.",
    "A balance between suburban tranquility and access to urban amenities like restaurants and theaters."]


#### 4. Define a LLMChain with customized memory and RAG 

In [None]:
# Store previous Q&A information as chat history in the LLM memory
#ChatMessageHistory() is a chat history buffer where you manually store back-and-forth messages, which can be fed into LangChain memory objects later as conversational context.
history = ChatMessageHistory()
# add questions and answers to the history
for i in range(len(personal_questions)):
    history.add_ai_message(personal_questions[i])
    history.add_user_message(personal_answers[i])


# Set up a customized memory obejct for llm:
# 1.Set up a customized conversationBufferMemory memory object: preserve all previous Q&A chat history in memory, while for the following chat only keep the AI response in memory
#When save_context() is called, only the AI response (output_str) is added to chat history, ignoring the new user input (input_str).
class MementoBufferMemory(ConversationBufferMemory):
    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        input_str, output_str = self._get_input_output(inputs, outputs)
        self.chat_memory.add_ai_message(output_str)

"""
This creates a custom memory object that stores only the 
AI’s outputs (not the user's inputs), using the previously 
filled ChatMessageHistory.
"""
conversational_memory = MementoBufferMemory(
    chat_memory=history, #set previous Q&A information as history in the chat memory
    memory_key="questions_and_answers",  #can be used as the key to refer the memory to fill in the prompt template for LLM
    input_key="input" #the key in the inputs dict which is the user message that is passed to .predict() or .run()
)

# 2.Set up a summary memory object: uses the stored chat history and an LLM to produce a summary memory representation of the conversation.
summary_memory = ConversationSummaryMemory(
    llm=llm_chat,
    memory_key="recommendation_summary", #can be used as the key to refer the memory to fill in the prompt template for LLM
    input_key="input", #the key in the inputs dict which is the user message that is passed to .predict() or .run()
    buffer=f"The user answered {len(personal_questions)} personal questions about his or her preference on property. Use them to give recommendation of properties that the user will like.",
    return_messages=True)

# 3.Combine two memory objects to get a customized memory object:
memory = CombinedMemory(memories=[conversational_memory, summary_memory])

  summary_memory = ConversationSummaryMemory(


In [22]:
#prompt template
custom_prompt = PromptTemplate(
    input_variables=["recommendation_summary", "questions_and_answers","context", "input"],
    template="""
You are a helpful property recommendation assistant that will recommend user some property listings based 
on their personal preferences. Ask user questions to collect information about their preference if you haven't asked.

For each listing you recommend, please augment the description by tailoring it to resonate with the 
buyer’s specific preferences. This involves subtly emphasizing aspects of the property that align 
with what the buyer is looking for.

Always be honest. If unsure, say "I don't know".

The user previously answered a series of personal questions about their preferences. Here is a summary of those preferences:
-------------------
{recommendation_summary}
-------------------

So far, these personal questions and answers from the user, and what you (the AI) have responded with:
-------------------
{questions_and_answers}
-------------------

Relevant Information:
-------------------
{context}
-------------------

Now the user says:
"{input}"

Based on everything above, provide a relevant and helpful response.
"""
)


In [23]:
# Defines a LLMChain: this LCEL chain defines the entire data flow

#Define a helper function to print in the input prompt while inferencing LLM
def print_prompt(prompt):
    """
    Prints the formatted prompt and returns it.
    """
    print("----------- PROMPT TO LLM -----------")
    print(prompt.to_string())
    print("-------------------------------------")
    return prompt

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

chain = (
    RunnablePassthrough.assign(
        # Load memory variables using the .load_memory_variables() method
        memory_variables=lambda inputs: memory.load_memory_variables(inputs),
    )
    | RunnablePassthrough.assign(
        # Extract the specific memory keys and retrieve context in parallel
        recommendation_summary=lambda x: x["memory_variables"]["recommendation_summary"],
        questions_and_answers=lambda x: x["memory_variables"]["questions_and_answers"],
        #Combine previous chat history and current query as combined query to do similarity retrieval in vector database
        context=lambda x: retriever.get_relevant_documents(memory.load_memory_variables({"input": x["input"]})["questions_and_answers"] + "\nUser: " + x["input"]),
    )
    | custom_prompt
    | print_prompt
    | llm_chat
    | StrOutputParser()
)

#### 5. Semantic Search Implementation

In [None]:
def get_recommendation_and_update_memory(user_input):
    print(f"🤔 User Input: \"{user_input}\"")

    # Invoke the chain to get the AI's response
    response = chain.invoke({"input": user_input})

    print(f"\n✅ AI Recommendation:\n{response}")

    # # CRITICAL STEP: Manually save the context to your memory object
    # # This ensures the memory is updated for the next turn in the conversation.
    # memory.save_context({"input": user_input}, {"output": response})

    print("\n" + "="*60 + "\n")


# Run the recommender with a user query
get_recommendation_and_update_memory(
    "What are the top 3 properties in your listings that could be a good fit to me, which is not necessary to be a perfect match?"
)

🤔 User Input: "What are the top 3 properties in your listings that could be a good fit to me, 
    which is not necessary to be a perfect match?
    "
----------- PROMPT TO LLM -----------

You are a helpful property recommendation assistant that will recommend user some property listings based 
on their personal preferences. Ask user questions to collect information about their preference if you haven't asked.

For each listing you recommend, please augment the description by tailoring it to resonate with the 
buyer’s specific preferences. This involves subtly emphasizing aspects of the property that align 
with what the buyer is looking for.

Always be honest. If unsure, say "I don't know".

The user previously answered a series of personal questions about their preferences. Here is a summary of those preferences:
-------------------
[SystemMessage(content='The user answered 4 personal questions about his or her preference on property. Use them to give recommendation of properties th