## Step 1: Setting Up the Python Application
Created a pip environment with the necessary packages, which are listed in requirements.txt.

## Step 2: Generating Real Estate Listings
Generate real estate listings using a Large Language Model. Generate at least 10 listings


The code for generating the listings is provided in the file [generate_data.py](./generate_data.py). I have generated 20 listings using GPT-4, as well as images for them using DALL-E 3. I created the listings in batches of 5 to stay within the token limit of a single API call.
The separate listings are in the [listings](./listings/) directory, and the images are in the [images](./images/) directory.
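
The batching described above can be sketched as follows. This is a minimal illustration rather than the code in generate_data.py; `plan_batches` is a hypothetical helper that only decides how many listings to request per API call.

```python
def plan_batches(total_listings, batch_size):
    """Return how many listings to request in each API call (hypothetical helper)."""
    if total_listings < 0 or batch_size <= 0:
        raise ValueError("total_listings must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(total_listings, batch_size)
    # Full batches first, then one smaller batch for whatever is left over
    return [batch_size] * full_batches + ([remainder] if remainder else [])


print(plan_batches(20, 5))  # [5, 5, 5, 5]
```

Each entry would correspond to one request to the model, so a smaller batch size trades more API calls for shorter responses.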

All the listings were combined into a single file, with links to the generated images, using the script [create_full_datasets.py](./create_full_datasets.py). The full file is called [listings.csv](./listings.csv) and is in the root folder of the project.
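
As a rough sketch of this combining step (not the actual contents of create_full_datasets.py), the snippet below attaches an image path to each listing and serializes the rows as CSV. The `image_path` column name and the `listing_<index>.png` naming scheme are assumptions made for illustration only.

```python
import csv
import io


def attach_image_paths(listings, image_dir="./images"):
    """Return copies of the listings with an added 'image_path' column.

    The file naming scheme (listing_<index>.png) is an assumption.
    """
    combined = []
    for index, listing in enumerate(listings):
        row = dict(listing)  # copy so the input is not mutated
        row["image_path"] = f"{image_dir}/listing_{index}.png"
        combined.append(row)
    return combined


def to_csv_text(rows):
    """Serialize rows (dicts sharing the same keys) to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = attach_image_paths([{"Neighborhood": "A"}, {"Neighborhood": "B"}])
print(to_csv_text(rows))
```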



## Step 3: Storing Listings in a Vector Database

I have used LanceDB to create two vector tables, one for text and one for images. While the application will have multimodal capabilities, I have chosen not to use CLIP embeddings for both modalities. Instead, I process the two modalities separately.

For images, as mentioned above, I have used CLIP embeddings from HuggingFace, while for text I have used ADA-3 embeddings from OpenAI.

The LanceDB folder is [sample-lancedb](./sample-lancedb/). The table for images is called [table_from_df_images.lance](./sample-lancedb/table_from_df_images.lance). The table for text is called [table_from_df_text.lance](./sample-lancedb/table_from_df_text.lance).

The code for embedding generation as well as vector DB creation and embedding storage is in the file [create_embeddings.py](./create_embeddings.py).

The embeddings are also stored in a CSV file, [listings_with_embeddings.csv](./listings_with_embeddings.csv).


## Step 4: Building the User Preference Interface

I hardcoded a set of 5 user instructions, which I generated using the script [generate_testdata.py](./generate_testdata.py). The instructions are stored in [test_data.json](./test_data.json).

In [2]:
import json
from generate_data import *
from create_embeddings import *



In [3]:

with open('./test_data.json') as f:
    test_user_scripts = json.load(f)

In [4]:
customers = list(test_user_scripts.keys())

In [5]:
customers

['customer1', 'customer2', 'customer3', 'customer4', 'customer5']

In [6]:
customer_id = 0 # change this to test for different customers

In [7]:
# Format user chat history
def build_customer_chat_history(customer_id):
    outstring=""
    for q,a in test_user_scripts[customers[customer_id]].items():
        outstring+=f"Question- {q} : User Answer: {a}"
    return outstring

print(build_customer_chat_history(customer_id))
        
        
chat_history =  build_customer_chat_history(customer_id) 

Question- How big do you want your house to be? : User Answer: A comfortable three-bedroom house with a spacious kitchen and a cozy living room.Question- What are 3 most important things for you in choosing this property? : User Answer: A quiet neighborhood, good local schools, and convenient shopping options.Question- Which amenities would you like? : User Answer: A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.Question- Which transportation options are important to you? : User Answer: Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.Question- How urban do you want your neighborhood to be? : User Answer: A balance between suburban tranquility and access to urban amenities like restaurants and theaters.


In [8]:
user_prompt= f"""
    Please only use the customer chat history given below to create a desired listing for them. 
    Use the example given below and  format the results in json format.
    All the results should be saved inside a key called listings.
    Each result should have the following keys: Neighborhood, Price, Bedrooms, Bathrooms, House Size, Description, Neighborhood Description.
    Use only information from the chat history. If any of the fields are unavailable,list them as None.
    Customer Chat History: {chat_history}
    Example:{example_listing}
    """

In [9]:
# this function reformats the user inputs to a json structured output very similar to the one used to create embeddings

def get_reformatted_output(user_prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant with deep expertise in real estate."},
            {"role": "user", "content": user_prompt},
        ],
    )

    # The prompt asks the model to place all results under a "listings" key
    listings = json.loads(response.choices[0].message.content)["listings"]
    return listings
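
The function above assumes the model always returns valid JSON with a `listings` list. A more defensive variant of the parsing step might look like the sketch below. `extract_listings` is a hypothetical helper, not part of the project code, and it works on the raw response string so that it can be tried without an API call.

```python
import json


def extract_listings(raw_content):
    """Return the list stored under 'listings', or [] if it is missing or malformed."""
    try:
        payload = json.loads(raw_content)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    listings = payload.get("listings", [])
    if isinstance(listings, dict):
        # Wrap a single listing object so callers can always index into a list
        listings = [listings]
    return listings if isinstance(listings, list) else []


print(extract_listings('{"listings": [{"Bedrooms": 3}]}'))
```

Returning an empty list lets the caller decide whether to retry the request or fall back to a default, instead of failing on a `KeyError` or on the `[0]` index.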

In [10]:
response = get_reformatted_output(user_prompt)[0]

In [11]:
# Converts json output of user query to string
def format_response(response):
    out_string=""
    for item,val in response.items():
        # Skip unavailable fields, whether returned as the string 'None' or as JSON null
        if val not in (None, 'None'):
            out_string+=f"{item}: {val}  "
    return out_string

print(format_response(response))
            

Bedrooms: 3  Description: This captivating three-bedroom home is a perfect blend of comfort and sustainability, featuring a spacious kitchen ideal for family gatherings, a cozy living room for relaxing evenings, and a modern, energy-efficient heating system. The property boasts a beautiful backyard suitable for gardening enthusiasts and comes with a convenient two-car garage. Ideal for those valuing both their quiet moments at home and the energy efficiency of their living space.  Neighborhood Description: Looking for a serene yet connected living experience? This neighborhood merges the quietness required for a peaceful family life with the convenience of being close to excellent local schools, multiple shopping options, and easy access to urban amenities. Commuters will appreciate the easy access to a reliable bus line, proximity to a major highway for quick getaways, and bike-friendly roads for leisurely weekend rides. A perfect balance for those who appreciate suburban tranquility 

In [12]:
def get_user_preference(customer_id, img_path=None):
    # If the user provides a reference image, pass its path through as well. The assumption is
    # that the image has already been saved to a path the application can access.
    image = img_path
    # if img_path:
    #     try: 
    #         image = Image.open(img_path)
    #     except:
    #         pass

    chat_history =  build_customer_chat_history(customer_id) 
    user_prompt= f"""
    Please only use the customer chat history given below to create a desired listing for them. 
    Use the example given below and  format the results in json format.
    All the results should be saved inside a key called listings.
    Each result should have the following keys: Neighborhood, Price, Bedrooms, Bathrooms, House Size, Description, Neighborhood Description.
    Use only information from the chat history. If any of the fields are unavailable,list them as None.
    Customer Chat History: {chat_history}
    Example:{example_listing}
    """
    response = get_reformatted_output(user_prompt)[0]
    formatted_response = format_response(response)
    return formatted_response, image
    

    
    

## Step 5: Searching Based on Preferences

In [13]:
import lancedb
uri = "./sample-lancedb"
db = lancedb.connect(uri)

In [14]:
text_table = "table_from_df_text"
img_table = "table_from_df_images"

In [15]:
tbl_txt = db.open_table(text_table)
tbl_img = db.open_table(img_table)

In [16]:
def get_embeddings_user_prefs(resp):
    text_resp, img_resp = resp[0],resp[1]
    text_embs = get_embedding(text_resp)
    img_embs = None
    if img_resp:
        try:
            img_embs = create_clip_image_embeddings(img_resp, model_name)
        except Exception:
            # Fall back to text-only search if the image cannot be embedded
            pass

    return text_embs,img_embs
        

In [17]:
resp = get_user_preference(customer_id)
embeddings = get_embeddings_user_prefs(resp)

In [17]:
def search_tables(embeddings,num_responses=5):
    text_embeddings = embeddings[0]
    img_embeddings = embeddings[1]  # not used yet: only the text table is searched here
    df = (
        tbl_txt.search(text_embeddings)
        .metric("cosine")
        .limit(num_responses)
        .to_pandas()
    )
    return df
    

In [None]:
print(search_tables(embeddings))
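
To illustrate what the cosine metric ranks by, here is a small pure-Python sketch that does not use LanceDB. It assumes cosine distance is defined as 1 minus cosine similarity, so smaller values mean a closer match. The vectors and listing names are toy values, not real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_cosine(query, candidates, limit=5):
    """Return up to `limit` (name, distance) pairs, closest first."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:limit]


toy_listings = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
for name, distance in rank_by_cosine([1.0, 0.0], toy_listings):
    print(name, round(distance, 3))
```

Because the metric depends only on the angle between vectors, a listing whose embedding points in the same direction as the query ranks first regardless of vector length.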