# Predicting a price with Llama
In the first part of this project, we used a linear regression model to suggest a price for a new listing.

This notebook attempts the same task, but lets AI suggest the price instead. The model we use is Llama 3 (`llama3`), and we try two different approaches:

- No data set
- Feeding a data set through RAG (Retrieval-Augmented Generation) 

In [None]:
!pip install requests
!pip install sentence-transformers faiss-cpu

In [1]:
import requests
import json
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

In [2]:
df = pd.read_csv('./data/supervised_cleaned_airbnb_data.csv')

## No data set

We wanted to see how well the model could handle the task without any context, such as the data set. The query is sent to the local Llama API, and the responses varied between runs. The suggested price was usually in the range $120-170, with occasional large outliers.

In [None]:
# Straight query without feeding the model data through RAG

query = """
A 1-bedroom apartment in Amsterdam with room for 2 people, a max distance to the city center 5km and metro distance of 2,5km. Outside of weekends
Estimate a fair nightly price in USD for this property as if it were listed on Airbnb.
Only return the price as a number.
"""

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": query,
        "stream": False
    }
)

price = response.json()["response"]
print(f"Ollama estimated price: {price}")
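
Because the responses vary between runs, it can help to pin the sampling settings and to fail loudly on HTTP errors. The sketch below is one possible way to do that, not what this notebook ran: `build_payload` and `ask_llama` are hypothetical helper names, and the `temperature` and `seed` values are assumptions. They are passed through the `options` field of the generate request.

```python
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3", temperature=0.0, seed=42):
    """Build the JSON body for a non-streaming generate request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Assumed settings: low temperature and a fixed seed to reduce run-to-run variation
        "options": {"temperature": temperature, "seed": seed},
    }


def ask_llama(prompt, timeout=120):
    """Send the prompt to the local server and return the raw text response."""
    import requests  # imported here so build_payload works without the dependency

    response = requests.post(OLLAMA_URL, json=build_payload(prompt), timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return response.json()["response"]
```

Calling `ask_llama(query)` would then replace the inline `requests.post` call. Whether fixed settings remove the outliers is untested here.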


## Preparing RAG

We cannot feed the raw CSV file to Llama through the API, so each row is first converted into a text description. An embedding model then converts each description into a vector, and the vectors are collected in a NumPy array.

The NumPy array is then used to build a FAISS index, which finds similar listings efficiently by comparing the Euclidean distance between vectors. The data set is now ready to be used.
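
To make the distance comparison concrete, here is a minimal pure-Python sketch of what a flat L2 index does conceptually: it compares the query against every stored vector and returns the `k` closest ones. The function names and the 2-dimensional toy vectors are made up for illustration; the real embeddings have 384 dimensions, and FAISS does the same work far faster. Like `faiss.IndexFlatL2`, the sketch reports squared Euclidean distances.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-dimensional "embeddings" (hypothetical values)
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = search(vectors, [2.0, 2.0], k=2)
print(indices)    # [1, 2]
print(distances)  # [2.0, 5.0]
```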

In [17]:
# Prepare data using RAG

# Return a text description of a row, using only the features we want to include
def row_to_description(row):
    return (
        f"A {row['bedrooms']}-bedroom apartment in {row['City']} with room for {row['person_capacity']} people, "
        f"max distance to the city center is {row['dist']} km and metro distance is {row['metro_dist']} km. "
        f"Nightly price: ${row['realSum']}"
    )

# Generate descriptions for all listings
descriptions = df.apply(row_to_description, axis=1).tolist()

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')  # this model produces 384-dimensional vectors


# Generate embeddings
embeddings = embedder.encode(descriptions, convert_to_numpy=True)

# example: embedder.encode("Cozy apartment in Amsterdam")
# Could return a vector -> [0.01, -0.12, 0.38, ..., -0.02]  with a total of 384 values
# These vectors can then be compared for similarities

# Create FAISS index
# IndexFlatL2 is a searchable vector "database" that does exact (brute-force) search using L2 (Euclidean) distance.
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension) 
index.add(embeddings)
# The index is now ready to answer questions like:
# "What are the 10 most similar listings to this new description?"

print("Embedding done!")

Embedding done!


## Creating queries

Writing a good query matters when working with AI. We tested four approaches, with varying results:
- First prompt with no extra descriptions or 4 T's, only the data we want to look for.
- Using the 4 T's to create a more descriptive query.
- A modified version of the first prompt.
- A copy of the `row_to_description` return value.

We also tested with a `top_k` of 5 and 10 to see whether the number of retrieved listings affected the results.

All queries used the same values, and our target price was $194.03 (matching the first entry in the data set).

In [27]:
# Queries and results
# "query" is the active variable used for the model. Rename variables to test out different queries.
# We are using the first entry in our data set as a description. The realSum is 194.03.

# Original prompt with no focus on the 4 T's.
# Result = $137.5 || top_k = 10
# Result = $108.1 || top_k = 5
query1 = """
A 1-bedroom apartment in City 1 with room for 2 people, a distance to the city center 5km and metro distance of 2.5km. Outside of weekends
"""

# Using the 4 T's in our prompt.
# Result = $123 || top_k = 10
# Result = $304.65 || top_k = 5
query2 = """
Calculate a recommended price for a listing with these parameters.

A 1-bedroom apartment in Amsterdam with room for 2 people, a distance to the city center 5km and metro distance of 2.5km. Outside of weekends

The city values in the data set are transformed to numeric values. These are the original values: 1. Amsterdam, 2. Athen, 3. Barcelona, 4. Berlin, 5. Budapest, 6. Lisbon, 7. London, 8. Paris, 9. Rome, 10. Vienna

Only return the price as a number.
"""

# Original prompt with City changed to Amsterdam instead of city 1 and less focus on the 4 T's.
# Result = $112.45 || top_k = 10
# Result = $122.25 || top_k = 5
query3 = """
A 1-bedroom apartment in Amsterdam with room for 2 people, a distance to the city center 5km and metro distance of 2.5km. Outside of weekends.

The city values in the data set are transformed to numeric values. These are the original values: 1. Amsterdam, 2. Athen, 3. Barcelona, 4. Berlin, 5. Budapest, 6. Lisbon, 7. London, 8. Paris, 9. Rome, 10. Vienna
"""

# Prompt created by copy pasting the row_to_description method string output with the values of the first row in the data set
# Result = $394.39 || top_k = 10
# Result = $264.1 || top_k = 5
query4 = """
A 1-bedroom apartment in 1 with room for 2 people,
max distance to the city center is 5 km and metro distance is 2.5 km.
"""

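# Select the prompt variant to run (query1 to query4); without this, "query" still holds the prompt from the earlier cell.
query = query1

# Embed the query so the index can be searched for the listings with the smallest Euclidean distance
query_embedding = embedder.encode([query])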

# Search top 5/10
top_k = 5
distances, indices = index.search(np.array(query_embedding), top_k)

# Look up the original description text for each returned index.
# FAISS returns row positions, not text, so we index into the descriptions list:
# position 0 -> "A 1-bedroom apartment in ..." (the string that was embedded earlier)
retrieved_examples = [descriptions[i] for i in indices[0]]


# Combine the retrieved examples to a string that we can pass to the API request
context = "\n".join([f"{i+1}. {desc}" for i, desc in enumerate(retrieved_examples)])

final_prompt = f"""
Here are some similar Airbnb listings and their prices:

{context}

Now estimate the price for this listing:
{query}

Only return the estimated nightly price in USD as a number.
"""

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": final_prompt,
        "stream": False
    }
)

price = response.json()["response"]
print(f"Ollama (RAG-based) estimated price: {price}")


Ollama (RAG-based) estimated price: Based on the provided listings, I estimate the price for this listing to be:

$342.11
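
As the output above shows, the model does not always return only a number, even when asked to. A small parsing step can pull the figure out of the reply. The sketch below is one assumed way to do it: the `parse_price` helper is hypothetical and simply takes the first number in the text, so replies with several numbers or thousands separators would need more care.

```python
import re


def parse_price(text):
    """Return the first number found in the model's reply as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group())


reply = "Based on the provided listings, I estimate the price for this listing to be:\n\n$342.11"
print(parse_price(reply))  # 342.11
```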


## Conclusion

We expected the first prompt to yield poor results and the later queries to improve on it. However, this was not the case: adding the 4 T's gave some of the worst results, which surprised us.

Because our data set stores City as a numerical value, we tried using the actual city name instead and added an explanation of the numeric values to the query. Our hope was that this would give a good result while letting a query contain a real city name, making it better suited for humans to use.
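
One alternative, not tested in this notebook, would be to translate the numeric city codes back to names before building the descriptions, so that the embedded text and the query share the same vocabulary. The sketch below assumes the code-to-name mapping listed in the prompts; `CITY_NAMES` and `describe_listing` are hypothetical names, and whether this improves retrieval is an open question.

```python
# Mapping taken from the explanation used in the prompts (numeric code -> city name)
CITY_NAMES = {
    1: "Amsterdam", 2: "Athens", 3: "Barcelona", 4: "Berlin", 5: "Budapest",
    6: "Lisbon", 7: "London", 8: "Paris", 9: "Rome", 10: "Vienna",
}


def describe_listing(row):
    """Like row_to_description, but with the city name instead of its numeric code."""
    # Unknown codes fall back to a generic label rather than raising an error
    city = CITY_NAMES.get(int(row["City"]), f"city {row['City']}")
    return (
        f"A {row['bedrooms']}-bedroom apartment in {city} with room for {row['person_capacity']} people, "
        f"max distance to the city center is {row['dist']} km and metro distance is {row['metro_dist']} km. "
        f"Nightly price: ${row['realSum']}"
    )


# Works on a plain dict or a pandas row; illustrative values based on the query used above
example = {"bedrooms": 1, "City": 1, "person_capacity": 2, "dist": 5.0, "metro_dist": 2.5, "realSum": 194.03}
print(describe_listing(example))
```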

As we dug deeper into how the top_k selection and the vector comparison work, we tried copying what the first row of the data set looks like after `row_to_description`. This should minimize the Euclidean distance between the query vector and that row's vector, so we expected a very good result. Instead, this gave the worst result we got.

When we compare the AI's price estimates with those of our machine learning model, the machine learning model wins. No matter how much we tweaked our queries, feeding the AI data from our data set did not improve the results. The responses we got from sending just a query and letting the AI do its thing were simply better.
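
To put the RAG results side by side, the sketch below computes the absolute error of each recorded estimate against the $194.03 target. The numbers are the ones noted in the comments of the query cell; the `results` structure and variable names are just one way to organize them. Since responses vary between runs, the ranking is indicative at best.

```python
TARGET = 194.03  # realSum of the first entry in the data set

# (query name, top_k) -> estimated price, as recorded in the query cell
results = {
    ("query1", 10): 137.5,  ("query1", 5): 108.1,
    ("query2", 10): 123.0,  ("query2", 5): 304.65,
    ("query3", 10): 112.45, ("query3", 5): 122.25,
    ("query4", 10): 394.39, ("query4", 5): 264.1,
}

# Absolute error of each estimate, rounded to cents
errors = {key: round(abs(price - TARGET), 2) for key, price in results.items()}

best = min(errors, key=errors.get)
worst = max(errors, key=errors.get)

# Print the estimates from closest to furthest from the target
for (name, top_k), err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name} (top_k={top_k}): off by ${err:.2f}")
```

With these numbers, query1 with top_k = 10 is closest to the target (off by $56.53) and query4 with top_k = 10 is furthest (off by $200.36).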