3.1 RQ1: Can the availability of an Airbnb rental
property in the next 30 days be predicted using the scores
and reviews that the property has received?

Motivation: Airbnb allows renters to score the
cleanliness, location, value, and check-in experience
of the property they rented, as well as the owner’s
communication. This research question aims to determine
how strongly each review score correlates with the
projected vacancy of a rental property, and how much each
score contributes to predicting it. Airbnb property owners
could use the results to adjust their rental strategies,
including pricing, check-in methods, and property
cleanliness, to increase the likelihood of rentals in
the upcoming month.

Proposed Methodology: The hypothesis to be evaluated
is that properties with higher scores for cleanliness,
location, value, and check-in experience have fewer
vacant days in the next month. The review scores of each
rental property will be extracted, along with the number
of days that the property is vacant in the next month.
Properties that have received no reviews will be excluded
from testing. Property owners should examine the factors
that correlate strongly with projected vacancy in order
to improve their profits.
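
As a rough illustration of the hypothesis, the sketch below computes the Pearson correlation between each review score and the number of vacant days. The column names and the toy values are invented for this example and are not taken from the Airbnb dataset. Under the hypothesis, every correlation would be negative.

```python
import pandas as pd

# Hypothetical column names; the real listings file may label these differently.
SCORE_COLUMNS = ["cleanliness", "location", "value", "checkin"]
VACANCY_COLUMN = "vacant_days_next_30"


def score_vacancy_correlations(listings: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each review score with vacant days, most negative first."""
    correlations = listings[SCORE_COLUMNS].corrwith(listings[VACANCY_COLUMN])
    return correlations.sort_values()


# Toy data invented for illustration only, not taken from the Airbnb dataset.
toy_listings = pd.DataFrame({
    "cleanliness": [4.9, 4.5, 4.0, 3.5, 3.0],
    "location": [4.8, 4.6, 4.1, 3.9, 3.2],
    "value": [4.7, 4.4, 4.2, 3.6, 3.1],
    "checkin": [4.9, 4.7, 4.3, 3.8, 3.3],
    "vacant_days_next_30": [2, 6, 11, 19, 27],
})

correlations = score_vacancy_correlations(toy_listings)
print(correlations)
```

A score whose correlation is close to zero would carry little information about vacancy, while a strongly negative value would be consistent with the hypothesis.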

In [None]:
# ----- Import Statements ----- #
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


In [None]:
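# ----- Read CSV File Into Dataframe ----- #
filepath = './../../data/Airbnb_London/listings.csv'
# load the listings file (the dataframe name `listings` is an assumed placeholder)
listings = pd.read_csv(filepath)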



In [None]:
# ----- Perform Feature Extraction Here ----- #

# extract dataframe for regression model
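
One possible shape for the extraction step is sketched below. It assumes the listings file uses the column names commonly seen in Inside Airbnb exports (`review_scores_*`, `number_of_reviews`, `availability_30`); these names, the helper `extract_review_features`, and the result name `df_rq1` are assumptions and should be checked against the actual CSV header. The sketch keeps only listings with at least one review, as the methodology requires, and drops rows with missing scores.

```python
import pandas as pd

# Assumed column names, following the layout commonly seen in Inside Airbnb
# listings exports; check them against the actual CSV header before use.
REVIEW_SCORE_COLUMNS = [
    "review_scores_cleanliness",
    "review_scores_location",
    "review_scores_value",
    "review_scores_checkin",
    "review_scores_communication",
]
REVIEW_COUNT_COLUMN = "number_of_reviews"
TARGET_COLUMN = "availability_30"


def extract_review_features(listings: pd.DataFrame) -> pd.DataFrame:
    """Return review scores plus the 30-day vacancy target for reviewed listings."""
    # Exclude properties that have received no reviews.
    reviewed = listings[listings[REVIEW_COUNT_COLUMN] > 0]
    features = reviewed[REVIEW_SCORE_COLUMNS + [TARGET_COLUMN]]
    # Drop rows where any score or the target is missing.
    return features.dropna().reset_index(drop=True)


# Tiny made-up frame standing in for the listings file.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "number_of_reviews": [12, 0, 3, 7],
    "review_scores_cleanliness": [4.8, None, 4.1, 3.9],
    "review_scores_location": [4.9, None, 4.5, None],
    "review_scores_value": [4.7, None, 4.0, 4.2],
    "review_scores_checkin": [5.0, None, 4.6, 4.4],
    "review_scores_communication": [4.9, None, 4.3, 4.1],
    "availability_30": [3, 30, 14, 9],
})

df_rq1 = extract_review_features(sample)
print(df_rq1)
```

In the made-up frame, listing 2 is removed because it has no reviews and listing 4 is removed because one of its scores is missing.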

In [None]:
# ----- Linear Regression Functions ----- #
def rmse(y, y_pred):
    """Root mean squared error between true and predicted values."""
    return np.sqrt(mean_squared_error(y, y_pred))

def train_test(model, data, is_log=False):
    """Fit `model` on an 80/20 split of `data`; return both partitions with predictions."""
    # dropna() without inplace leaves the caller's dataframe unmodified
    data = data.dropna()
    train_data, test_data = train_test_split(data, test_size=0.2, shuffle=True)
    # copy the partitions so the column edits below do not act on slices of `data`
    train_data, test_data = train_data.copy(), test_data.copy()

    target = 'SALE PRICE LOG' if is_log else 'SALE PRICE'
    y_train = train_data[target]
    y_test = test_data[target]

    x_train = train_data.drop(['SALE PRICE', 'SALE PRICE LOG'], axis=1)
    x_test = test_data.drop(['SALE PRICE', 'SALE PRICE LOG'], axis=1)

    model.fit(x_train, y_train)
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)
    scale = " log-scale" if is_log else ""
    print("RMSE train{}: {}".format(scale, rmse(y_train, y_train_pred)))
    print("RMSE test{}: {}".format(scale, rmse(y_test, y_test_pred)))

    # map log-scale predictions back with expm1, the inverse of log(1 + x),
    # so 'Predicted Price' is in the same units as 'Real Price'
    if is_log:
        y_train_pred, y_test_pred = np.expm1(y_train_pred), np.expm1(y_test_pred)

    train_data.rename(columns={'SALE PRICE': 'Real Price'}, inplace=True)
    train_data['Predicted Price'] = y_train_pred
    train_data['Data Partition'] = 'train data'

    test_data.rename(columns={'SALE PRICE': 'Real Price'}, inplace=True)
    test_data['Predicted Price'] = y_test_pred
    test_data['Data Partition'] = 'test data'

    all_data = pd.concat([train_data, test_data])
    return all_data
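
The same split, fit, and score pattern could be pointed at the RQ1 target instead of a price column. The sketch below is one way this might look, assuming the target column is named `availability_30`; the function `fit_vacancy_model`, the `seed` parameter, and the synthetic data are illustrative and not part of the original notebook. It also returns the fitted coefficients, which give a first indication of how each score relates to vacancy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_vacancy_model(data, target="availability_30", seed=0):
    """Fit a linear model on review scores; return model, RMSE scores, coefficients."""
    clean = data.dropna()  # returns a new frame, so the caller's data is untouched
    x = clean.drop(columns=[target])
    y = clean[target]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, shuffle=True, random_state=seed
    )
    model = LinearRegression().fit(x_train, y_train)
    scores = {
        "rmse_train": float(np.sqrt(mean_squared_error(y_train, model.predict(x_train)))),
        "rmse_test": float(np.sqrt(mean_squared_error(y_test, model.predict(x_test)))),
    }
    coefficients = pd.Series(model.coef_, index=x.columns)
    return model, scores, coefficients


# Synthetic, noise-free data (assumed column names) so the fit is easy to check:
# vacancy falls by 4 days per cleanliness point and 3 days per value point.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "review_scores_cleanliness": rng.uniform(3, 5, 50),
    "review_scores_value": rng.uniform(3, 5, 50),
})
demo["availability_30"] = (
    40 - 4 * demo["review_scores_cleanliness"] - 3 * demo["review_scores_value"]
)

model, scores, coefficients = fit_vacancy_model(demo)
print(scores)
print(coefficients)
```

Because the synthetic target is an exact linear function of the two scores, the model should recover the coefficients and report an RMSE near zero; real review data would not fit this cleanly.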

In [None]:
# ----- Split Data & Run Regression ----- #

# make sure to run Q5 and Q4 code to initialize the data
eng_data = df_q6_eng.copy() # from Q5; copy so the Q5 dataframe is not modified
eng_data['SALE PRICE LOG'] = np.log1p(eng_data['SALE PRICE'])
raw_data = df_q6_raw.copy() # from Q4; copy so the Q4 dataframe is not modified
raw_data['SALE PRICE LOG'] = np.log1p(raw_data['SALE PRICE'])

##### Model A - Feature Engineered Model #####
print('##### Results for Model A: #####')
model_A = LinearRegression()
data_A = train_test(model_A, eng_data)
fig_q6_1 = px.scatter(data_A, x='Real Price', y='Predicted Price', trendline='ols', color='Data Partition', title='MODEL A: Real vs Predicted Brooklyn Rental Elevator Apartment Prices')
fig_q6_1.show()

##### Model B - Raw Feature Model #####
print('##### Results for Model B: #####')
model_B = LinearRegression()
data_B = train_test(model_B, raw_data)
fig_q6_2 = px.scatter(data_B, x='Real Price', y='Predicted Price', trendline='ols', color='Data Partition', title='MODEL B: Real vs Predicted Brooklyn Rental Elevator Apartment Prices')
fig_q6_2.show()

##### Compare Test Results from BOTH Models #####
data_A['Model'] = 'Model A'
data_B['Model'] = 'Model B'
part = 'test data'
data_AB = pd.concat([data_A.query('`Data Partition` == @part'), data_B.query('`Data Partition` == @part')])
data_AB = data_AB.filter(items=['Predicted Price', 'Real Price', 'Model'])
fig_q6_3 = px.scatter(data_AB, x='Real Price', y='Predicted Price', trendline='ols', color='Model', title='TEST RESULTS: Real vs Predicted Brooklyn Rental Elevator Apartment Prices')
fig_q6_3.show()

##### ##### ##### ##### #####