# Scenario
**Chicago Airbnb**

You and a group of friends are considering purchasing a property in Chicago that you can use as an investment. You have heard from other people that they have made a lot of money by renting out either a room or an entire unit (apartment or house). Your friends ask you to analyze data so that they can understand how much you would charge per night based on the type of dwelling you were to purchase.

**Dataset:**
https://www.kaggle.com/datasets/jinbonnie/chicago-airbnb-open-data

This notebook tests the model and the price optimization on the 100 samples set aside at the beginning of the project.

# Imports

In [1]:
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from tqdm import tqdm
tqdm.pandas()

import utils

# Load the data

In [2]:
raw_test = pd.read_csv('live_test_data.csv')
raw_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              100 non-null    int64  
 1   name                            100 non-null    object 
 2   host_id                         100 non-null    int64  
 3   host_name                       100 non-null    object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   100 non-null    object 
 6   latitude                        100 non-null    float64
 7   longitude                       100 non-null    float64
 8   room_type                       100 non-null    object 
 9   price                           100 non-null    int64  
 10  minimum_nights                  100 non-null    int64  
 11  number_of_reviews               100 non-null    int64  
 12  last_review                     82 no

# Data Preparation

In [3]:
df = utils.prepare_data(raw_test)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 99 entries, 0 to 99
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   neighbourhood                 99 non-null     category
 1   latitude                      99 non-null     float64 
 2   longitude                     99 non-null     float64 
 3   room_type                     99 non-null     category
 4   price                         99 non-null     int64   
 5   log_days_since_last_review    99 non-null     float64 
 6   log_reviews_per_month         99 non-null     float64 
 7   log_number_of_reviews         99 non-null     float64 
 8   log_price                     99 non-null     float64 
 9   log_minimum_nights            99 non-null     float64 
 10  log_host_listings_count       99 non-null     float64 
 11  log_nights_booked             99 non-null     float64 
 12  host_listings_minimum_nights  99 non-null     float64 
 1

# Target and Features

In [4]:
X, y = df.drop(columns='log_nights_booked'), df['log_nights_booked']

# Test the model

In [5]:
y_pred = utils.pipe.predict(X)  # predict once and reuse for every metric
mse = mean_squared_error(y, y_pred)
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f'MSE: {mse}, MAE: {mae}, R2: {r2}')

MSE: 0.5266585059277162, MAE: 0.5011382765269756, R2: 0.4539547469053261
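
Because the target is `log_nights_booked`, these errors are in log units. As a rough aid to interpretation, the sketch below converts the MAE into a multiplicative factor. It assumes the target is a plain natural log of nights booked; if `utils.prepare_data` uses a different transform (for example `log1p`), the factor is only approximate.

```python
import math

# MAE reported above, measured in log units of the target.
mae_log = 0.5011382765269756

# Assumption: the target is a natural log, so an absolute error of `mae_log`
# in log space corresponds to being off by a factor of exp(mae_log).
factor = math.exp(mae_log)
print(f"typical multiplicative error: x{factor:.2f}")
```

Under that assumption, a typical prediction of nights booked is off by a factor of roughly 1.65 in either direction.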


# Optimize the price

In [6]:
dropped = raw_test[~raw_test.index.isin(df.index)]
dropped

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
77,1755737,"Lake View 2BR/2Bath w/ Pool, Gym. Annual lease.",9233414,Ravi,,Lake View,41.94036,-87.64068,Entire home/apt,155,365,0,,,1,363


This listing requires a minimum stay of 365 nights (an annual lease), so its price will have to be determined outside the model.

In [7]:
raw_test = raw_test[raw_test.index.isin(df.index)].copy()  # copy so the column assignments below do not trigger SettingWithCopyWarning
raw_test['best_revenue'], raw_test['best_price'], raw_test['best_bookings'] = zip(*X.progress_apply(utils.optimize_income, axis=1))
raw_test[['name', 'best_revenue', 'best_price', 'best_bookings']]

100%|██████████| 99/99 [04:52<00:00,  2.95s/it]


Unnamed: 0,name,best_revenue,best_price,best_bookings
0,3 bdrm; free internet in pilsen/southloop,[40829.20867986918],891.111111,[45.81831397990307]
1,English Lavender Room,[33081.8582491561],396.161616,[83.50596549379026]
2,ROWULA HOUSE - WARM AFRICAN HOSPITALITY IN CHI...,[13486.656693609548],891.111111,[15.1346521499359]
3,Spacious Sedgwick Condo - Steps to Old Town,[40913.910315142806],901.010101,[45.408936336313204]
4,Quaint Serenity in Bronzeville,[40208.71041016412],891.111111,[45.12199422586996]
...,...,...,...,...
95,1BR LUX in Loop. Excellent spot!,[17566.592522791158],990.101010,[17.74222260514512]
96,"CLEAN DOWNTOWN APARTMENT, SAFE AREA + FREE PAR...",[42837.132038953016],891.111111,[48.07159455742857]
97,Skylit Boho Retreat - Wicker Park NO PARTIES,[39062.58372505547],871.313131,[44.83185472734166]
98,"Bright Gold Coast 1BR w/ Gym, Lounge, nr. Oak ...",[74138.83002862404],990.101010,[74.88006705604754]


These prices are probably not what one should actually charge for these listings. Most of the rows shown put the best price near or above 870 dollars per night, which suggests the model wasn't able to capture the full effect of raising the price on the demand for a listing.
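
To illustrate how this can happen, here is a minimal sketch of a grid search that maximizes price times predicted bookings. It is not the implementation of `utils.optimize_income`, which is not shown in this notebook; the function name `optimize_revenue`, the two toy demand curves, and the price grid are all hypothetical.

```python
import numpy as np

def optimize_revenue(predict_bookings, prices):
    """Return (best_revenue, best_price, best_bookings) over a grid of prices."""
    bookings = np.array([predict_bookings(p) for p in prices])
    revenue = prices * bookings
    i = int(np.argmax(revenue))  # index of the revenue-maximizing price
    return float(revenue[i]), float(prices[i]), float(bookings[i])

# Hypothetical demand curve: bookings fall linearly and reach zero at $400.
def toy_demand(price):
    return max(0.0, 200.0 - 0.5 * price)

# Hypothetical demand curve that ignores price entirely.
def flat_demand(price):
    return 45.0

prices = np.linspace(20, 1000, 99)  # hypothetical grid, $10 steps

best_revenue, best_price, best_bookings = optimize_revenue(toy_demand, prices)
flat_revenue, flat_price, flat_bookings = optimize_revenue(flat_demand, prices)

print(best_price, best_revenue)  # interior optimum
print(flat_price, flat_revenue)  # optimum pinned to the top of the grid
```

When predicted bookings respond to price, the search finds an interior optimum. When they barely respond, revenue keeps growing with price and the search simply returns the highest price it is allowed to try.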

Although the model was not trained to predict the price, the mean absolute difference between the optimized price and the actual listed price gives a sense of just how far off these prices are.

In [8]:
mean_absolute_error(raw_test.price, raw_test.best_price)

732.2425262728293

# Summary/Reflection

I certainly learned a lot by doing this, and I'd like to think that even though the price optimization technique I tried didn't work out, it is still a method that could work given the right data and some tweaking. One of the issues could be that, for a given room type, the model never sees an example with a price that high, so it doesn't know to predict a yearly booking rate of 0 for a price of 1000 dollars to stay in a shared room.

Separating the data by room type and training a model for each type could be a way to get around this. That would provide a more accurate price range for the optimization function to search over. Adding some synthetic data could also help: samples priced above what anyone would pay for such a listing, with the target set to 0, could help the model learn that high prices result in few or no bookings.
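
The synthetic-data idea could be sketched as follows. This was not tried in the project. The helper `add_synthetic_high_price_rows`, the `multiplier` value, and the untransformed `nights_booked` target column are hypothetical; only `room_type` and `price` are column names taken from the data above.

```python
import pandas as pd

def add_synthetic_high_price_rows(df, multiplier=3.0):
    """Append one zero-booking row per room type, priced far above the observed maximum."""
    # Take the most expensive listing of each room type as a template,
    # so the other feature columns keep realistic values.
    idx = df.groupby("room_type")["price"].idxmax()
    synthetic = df.loc[idx].copy()
    synthetic["price"] = synthetic["price"] * multiplier  # hypothetical "nobody would pay this" price
    synthetic["nights_booked"] = 0.0                      # hypothetical target: no bookings at that price
    return pd.concat([df, synthetic], ignore_index=True)

# Toy data for illustration only, not taken from the Chicago dataset.
toy = pd.DataFrame({
    "room_type": ["Shared room", "Shared room", "Entire home/apt"],
    "price": [30, 50, 200],
    "nights_booked": [120, 90, 150],
})

augmented = add_synthetic_high_price_rows(toy)
print(augmented)
```

In the real pipeline, any derived columns such as `log_price` would need to be recomputed for the synthetic rows, and the multiplier would need to be chosen per room type with some care so the added rows do not distort predictions in the realistic price range.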

I tried a different approach than simply building a model to predict the price from the data. Such a model would probably give better predictions than my method, but it only tells you what everyone else would set the price to, not what price will maximize your profit.

Cleaning the data was fairly difficult. Deciding how to deal with missing values and how to categorize the neighbourhood column took a lot of thought. I enjoyed trying to make the optimization function, and I still don't think this method for setting the price should be dismissed.

Now that I've done this project, I think I have a better understanding of creating and deploying a machine learning system. Tracking all the changes I made to the data along the way, and keeping the functions around for use with the final test set, added complexity. I think I did a lot of things the hard way, but I expect that doing it the hard way this time will make the next time a lot easier.