# Become a data driven Airbnb host 2

This post continues from part 1: https://www.kaggle.com/tianyiwang/become-a-data-driven-airbnb-host-part-1. Here we try to solve a business problem with machine learning: predicting future earnings from a listing's current information. Imagine you are an Airbnb host and would like to know the expected earnings of one of your listings over the next year. Once you have the model, you can vary the information in your listing to see how changes might affect your future earnings.

![](https://kylekleinphotography.com/wp-content/uploads/KKP12002-1.jpg)

# 1. Load data and packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import datetime
from datetime import date
import sklearn
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
pd.set_option("display.max_columns", 100)
pd.set_option('max_colwidth',200)

In [None]:
calendar = pd.read_csv("../input/boston/calendar.csv")
listings = pd.read_csv("../input/boston/listings.csv")

# 2. Investigate rows with high missing rates

First, let's check whether there is a special group of listings with many N/A columns. They might be very new or bad listings that we want to leave out when building our model.

In [None]:
# How much data is missing in each row of the dataset?
missing_values_in_rows = listings.isna().mean(axis=1).values
plt.rcParams['figure.figsize'] = (5, 3)
ax = plt.hist(missing_values_in_rows)
plt.title('Missing rates (rows)')

A bad scenario would be a group of rows with significantly higher missing rates. In our case, most of the listings have at least 70% of the fields filled, and the distribution shows no such separate group.
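As a minimal sketch of the check above, on a made-up frame (`toy` and the 0.5 cutoff are assumptions for illustration, not values from the dataset), the per-row missing rate and the rows above a cutoff could be pulled out like this:

```python
import numpy as np
import pandas as pd

# Toy data standing in for `listings`; values are invented for illustration.
toy = pd.DataFrame({
    "summary": ["cozy room", None, "loft", None],
    "beds": [1.0, np.nan, 2.0, 1.0],
    "price": ["$80.00", None, "$150.00", "$60.00"],
})

# Fraction of missing fields in each row (same idea as isna().mean(axis=1)).
row_missing_rate = toy.isna().mean(axis=1)

# Hypothetical cutoff: flag rows where more than half of the fields are empty.
sparse_rows = toy[row_missing_rate > 0.5]
print(row_missing_rate.round(2).tolist())
print(sparse_rows.index.tolist())
```

If a distinct group of sparse rows did show up, a filter like this would be one way to inspect it before deciding whether to drop it.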

# 3. Feature engineering

We then create features from the existing columns. Once this is done, we add some raw columns that can be used directly without much engineering. The features we generate:
1. downtown_bos, east_bos: whether the listing's summary mentions Downtown Boston or East Boston. From the exploration, many of the listings are from these areas; they may be convenient, go-to areas in Boston.
2. host_age: generated from host_since (months)
3. reponse_rate, acceptance_rate: percentages converted to numbers
4. one_or_two_listings, three_to_ten_listings, more_than_ten_listings: how many listings the host has
5. response_time_ordinal: generated from host_response_time
6. real_bed: whether the bed is a real bed
7. neighborhoods_one_hot: one-hot encoding of the listing's neighborhood (`neighbourhood_cleansed`)
8. amenities_len: length of the amenities field, to roughly indicate how many amenities the listing offers
9. cancellation_policy_ordinal: the larger the number, the stricter the cancellation policy
10. price_per_bed_compared_to_nbh: the difference between the price per bed for that listing and the average price per bed for that neighborhood (**this is a slightly complicated feature to build! We will do it at the end**)

## 3.1 Generated features

In [None]:
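def lower(s):
    '''Lower-case a string; return non-strings (e.g. NaN) unchanged.'''
    try:
        return s.lower()
    except AttributeError:
        return s

def perc_to_numbers(p):
    '''
    "30%" --> 0.3; missing or unparseable values --> NaN
    '''
    try:
        return float(p.split("%")[0])/100
    except (AttributeError, ValueError):
        return float("nan")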

In [None]:
summary = listings['summary'].apply(lower)
downtown_bos = summary.str.contains("downtown boston").astype("float32")
east_bos = summary.str.contains("east boston").astype("float32")
host_age = listings['host_since'].apply(lambda x: (datetime.datetime.today() - pd.to_datetime(x)).days/30)
reponse_rate = listings['host_response_rate'].apply(perc_to_numbers)
acceptance_rate = listings['host_acceptance_rate'].apply(perc_to_numbers)

A quick look at the number of listings per host:

In [None]:
listings[['host_total_listings_count']].describe()

In [None]:
print("{:.2%} of the hosts have no more than 2 listings, {:.2%} of the hosts have more than 10 listings"\
.format(sum(listings['host_total_listings_count']<=2)/len(listings), sum(listings['host_total_listings_count']>10)/len(listings)))

Hosts with one or two listings might just have two rooms for rent in one apartment or house, so we group them together. About 40% of the hosts have multiple properties, and some of them even have hundreds of listings. We thus create the groups:
- hosts with 1~2 listings
- hosts with 3~10 listings
- hosts with more than 10 listings
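The same three groups can also be sketched with `pd.cut`, which makes the bin edges explicit. The counts and the label names below are invented for illustration; this is an alternative to the boolean comparisons used in the notebook, not the notebook's own method.

```python
import pandas as pd

# Invented listing counts per host, for illustration only.
counts = pd.Series([1, 2, 3, 10, 11, 150])

# Right-closed bins: (-inf, 2], (2, 10], (10, inf)
groups = pd.cut(
    counts,
    bins=[-float("inf"), 2, 10, float("inf")],
    labels=["one_or_two", "three_to_ten", "more_than_ten"],
)
print(groups.tolist())

# One indicator column per group, comparable to the three boolean features.
indicators = pd.get_dummies(groups)
```

Because the bins are right-closed, a host with exactly 10 listings lands in the middle group and every host falls into exactly one group.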

In [None]:
one_or_two_listings = listings['host_total_listings_count'] <= 2 
three_to_ten_listings = (listings['host_total_listings_count'] > 2) & (listings['host_total_listings_count'] <= 10)  # <= 10 so hosts with exactly 10 listings are not left out
more_than_ten_listings = listings['host_total_listings_count'] > 10

In [None]:
response_time_dict = {
    'a few days or more':1,
    'within a day': 2,
    'within a few hours': 3,
    'within an hour': 4
}

Some records have `N/A` response time. Are they very new listings?

In [None]:
plt.hist(host_age[listings['host_response_time'].isna()])

The distribution of host_age for these listings is consistent with the distribution for all listings. So we fill the NaN values with `"within a few hours"`, the second most common value for this column. We chose not to fill with `within an hour` to avoid overestimating the field, and we believe that most hosts can reply within a few hours.
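A small sketch of this fill-then-rank step on invented values (the series below is made up; only the four category labels come from the notebook): count how often each value occurs, fill the gaps with the chosen default, then map the labels to their rank.

```python
import numpy as np
import pandas as pd

# Invented response times; NaN marks a missing value.
response_time = pd.Series([
    "within an hour", "within an hour", "within a few hours",
    np.nan, "within a day", np.nan,
])

# Frequency of each value, including the missing ones.
print(response_time.value_counts(dropna=False))

order = {
    "a few days or more": 1,
    "within a day": 2,
    "within a few hours": 3,
    "within an hour": 4,
}

# Fill the gaps with the chosen default, then map labels to their rank.
ordinal = response_time.fillna("within a few hours").map(order)
print(ordinal.tolist())
```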

In [None]:
response_time_ordinal = listings['host_response_time']\
                        .fillna("within a few hours").apply(lambda x: response_time_dict[x])

In [None]:
real_bed = listings['bed_type'] == "Real Bed"
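# Ten most common neighbourhoods (kept for reference; not used below)
popular_neighborhoods = list(listings['neighbourhood_cleansed']
                             .value_counts()[:10].index)
# One indicator column per neighbourhood (all of them, not only the top ten)
neighborhoods_one_hot = pd.get_dummies(listings[['neighbourhood_cleansed']])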
amenities_len = listings['amenities'].apply(len)
cancellation_policy_dict = {
    'flexible': 1,
    'moderate': 2,
    'strict': 3,
    'super_strict_30': 4
}
cancellation_policy_ordinal = listings['cancellation_policy']\
                              .fillna("super_strict_30").apply(lambda x: cancellation_policy_dict[x])

In [None]:
final_df = pd.DataFrame({
    'id': listings['id'],
    'downtown_bos': downtown_bos,
    'east_bos': east_bos,
    'host_age': host_age,
    'reponse_rate': reponse_rate,
    'acceptance_rate': acceptance_rate,
    'one_or_two_listings': one_or_two_listings,
    'three_to_ten_listings': three_to_ten_listings,
    'more_than_ten_listings': more_than_ten_listings,
    'response_time_ordinal': response_time_ordinal,
    'real_bed': real_bed,
    'amenities_len': amenities_len,
    'cancellation_policy_ordinal': cancellation_policy_ordinal
}).join(neighborhoods_one_hot)

## 3.2. Binary columns

In [None]:
def convert_binary(x):
    try:
        return 1 if x == "t" else 0
    except:
        return 0

In [None]:
binary_pd = listings[['host_is_superhost',
              'host_has_profile_pic',
              'host_identity_verified',
              'is_location_exact',
              'requires_license',
              'instant_bookable',
              'require_guest_profile_picture',
              'require_guest_phone_verification']].applymap(convert_binary)

In [None]:
final_df = final_df.join(binary_pd)

## 3.3 Convert money value strings to numbers
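As a vectorised sketch of this conversion (an alternative to a row-by-row helper; the price strings are invented), pandas string methods can strip the currency formatting before parsing. Note the thousands separator in the second value, which is the case most likely to trip up a naive parser.

```python
import pandas as pd

# Invented price strings, including a thousands separator and a missing value.
prices = pd.Series(["$250.00", "$1,250.00", None, "$0.00"])

# Strip "$" and "," as literal characters, then let pandas parse the numbers;
# anything unparseable becomes NaN instead of raising.
cleaned = prices.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
amounts = pd.to_numeric(cleaned, errors="coerce")
print(amounts.tolist())
```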

In [None]:
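def convert_money(s):
    '''
    "$1,250.00" --> 1250.0; missing or unparseable values --> NaN
    '''
    try:
        # Strip the dollar sign and thousands separators before parsing
        # (eval("1,250.00") would return a tuple and the price would be lost)
        return float(s.replace("$", "").replace(",", ""))
    except (AttributeError, ValueError):
        return float("nan")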

In [None]:
money_pd = listings[['price','extra_people']].applymap(convert_money)

In [None]:
final_df = final_df.join(money_pd)

## 3.4 Add `price_per_bed_compared_to_nbh`
-- The difference between the price per bed for that listing and the average price per bed for that neighborhood
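One way to sketch this feature on invented data is with `groupby(...).transform("mean")`. The frame `toy` and its column `nbh` are assumptions for illustration; the notebook itself builds the neighborhood average with a `groupby` followed by a merge.

```python
import pandas as pd

# Invented listings; `nbh` stands in for the neighborhood column.
toy = pd.DataFrame({
    "nbh": ["Back Bay", "Back Bay", "Allston", "Allston"],
    "price": [300.0, 100.0, 80.0, 120.0],
    "beds": [2, 1, 1, 0],
})

# Treat 0 beds as 1 so the division is defined.
toy["price_per_bed"] = toy["price"] / toy["beds"].replace(0, 1)

# transform("mean") returns one value per row, aligned with the original index,
# so no merge is needed and the row order cannot change.
toy["price_per_bed_nbh"] = toy.groupby("nbh")["price_per_bed"].transform("mean")
toy["price_per_bed_compared_to_nbh"] = toy["price_per_bed"] - toy["price_per_bed_nbh"]
print(toy)
```

A positive value means the listing is priced above its neighborhood's average per bed, a negative value below it.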

In [None]:
listings['price_num'] = final_df['price']
listings['price_per_bed'] = listings['price_num'] / listings['beds'].replace(0,1)
avg_price_per_bed = listings.groupby('host_neighbourhood')['price_per_bed'].mean().reset_index()\
                              .rename({'price_per_bed':'price_per_bed_nbh'}, axis=1)
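# how="left" keeps every listing in its original order; the default inner merge drops
# listings without a host_neighbourhood, which misaligns the index-based joins below
listings = listings.merge(avg_price_per_bed, on="host_neighbourhood", how="left")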
listings['price_per_bed_compared_to_nbh'] = listings['price_per_bed'] - listings['price_per_bed_nbh']

In [None]:
x = plt.hist(listings['price_per_bed_compared_to_nbh'])
plt.title('Price per bed minus neighborhood average')

## 3.5 Add other numeric columns and id column (`id`)

In [None]:
final_df = final_df.join(listings[[
                                   'accommodates',
                                   'bathrooms',
                                   'bedrooms',
                                   'beds',
                                   'number_of_reviews',
                                   'review_scores_rating',
                                   'review_scores_accuracy',
                                   'review_scores_cleanliness',
                                   'review_scores_checkin',
                                   'review_scores_communication',
                                   'review_scores_location',
                                   'review_scores_value',
                                   'price_per_bed_compared_to_nbh'
                                   ]])

## 3.6 Deal with missing values

Columns with missing values:

In [None]:
m = final_df.isna().sum()
m[m>0]

### 3.6.1 Columns where NaN values are set to 0

In [None]:
final_df[['downtown_bos',
            'east_bos',
            'price_per_bed_compared_to_nbh',
            'number_of_reviews']] = final_df[['downtown_bos',
                                                            'east_bos',
                                                            'price_per_bed_compared_to_nbh',
                                                            'number_of_reviews']].fillna(0)

### 3.6.2 Columns that need to be filled with the mean

In [None]:
final_df_sub1 = final_df[['reponse_rate',
                  'review_scores_rating',
                  'review_scores_accuracy',
                  'review_scores_cleanliness',
                  'review_scores_checkin',
                  'review_scores_communication',
                  'review_scores_location',
                  'review_scores_value']].copy()

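from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22

# Fill NaN with the column mean; keep the original index so the assignment aligns row by row
final_df[['reponse_rate',
          'review_scores_rating',
          'review_scores_accuracy',
          'review_scores_cleanliness',
          'review_scores_checkin',
          'review_scores_communication',
          'review_scores_location',
          'review_scores_value']] = pd.DataFrame(SimpleImputer(missing_values=np.nan,
                                                               strategy="mean")\
                                                 .fit_transform(final_df_sub1),
                                                 columns = final_df_sub1.columns,
                                                 index = final_df_sub1.index)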

### 3.6.3 Columns that need to be filled with the most common value

In [None]:
final_df_sub2 = final_df[['bathrooms','bedrooms','beds','acceptance_rate']].copy()

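from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22

# Fill NaN with the most frequent value of each column
final_df[['bathrooms','bedrooms','beds','acceptance_rate']] = pd.DataFrame(SimpleImputer(missing_values=np.nan,
                                                                                         strategy="most_frequent")\
                                                         .fit_transform(final_df_sub2),
                                                         columns = final_df_sub2.columns,
                                                         index = final_df_sub2.index)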

### 3.6.4 Other missing values filling

In [None]:
final_df['accommodates'] = final_df['accommodates'].fillna(1)

Check the table again: 

In [None]:
m = final_df.isna().sum()
m[m>0]

`price` is a very important piece of information and we don't want to impute it. We will just drop the listings with a `NaN` price.

In [None]:
final_df = final_df[final_df['price'] > 0].copy()  # also drops NaN prices; copy() avoids chained-assignment warnings later

## 3.7 Make sure all the columns except the id column are numeric

In [None]:
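# Cast every column except `id` to float32 (astype returns a new frame, so the dtypes really change)
final_df = final_df.astype({col: "float32" for col in final_df.columns[1:]})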

In [None]:
final_df.describe()

# 4. Calculate future earnings

We first check that all listings were scraped on the same day:

In [None]:
print(listings['calendar_last_scraped'].min())
print(listings['calendar_last_scraped'].max())

Another concern is that the bookings recorded in the data probably understate what they will eventually be, especially for dates far in the future.

In [None]:
calendar['available'] = calendar['available'].apply(convert_binary)
calendar['year_month'] = calendar['date'].apply(lambda x: x[:7])
calendar.groupby('year_month')['available'].mean().reset_index().plot()
plt.title('Vacancy rates with months')

The data was scraped in September 2016. In that month about 27% of the listings are available, while in the following month 41% of the rooms are available. For the rest of the following year, until December 2017, about half of the rooms are already booked each month. We therefore believe that the future earnings we calculate from the data are meaningful. With more historical data we could go further: if we found, for example, that the booking rate usually rises by 20% by the time the month actually arrives, we could adjust our predictions to approximate the real values.

As far as I understand, the `calendar` data shows the availability of each listing for the following year. When a room is not available, there is no price information. Prices are very seasonal and usually increase a lot during busy seasons. Since we don't know the prices for the dates that are already booked, we use the price in the `listings` data as an approximation.
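The approximation described above (booked nights multiplied by the listed price) can be sketched on invented data. The frames `toy_calendar` and `toy_listings` are assumptions for illustration, with `'f'` in `available` taken to mean the night is already booked.

```python
import pandas as pd

# Invented data: 'f' in `available` means the night is already booked.
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "available": ["f", "f", "t", "t", "f"],
})
toy_listings = pd.DataFrame({"id": [1, 2, 3], "price": [100.0, 80.0, 50.0]})

# Count booked nights per listing.
booked_nights = (
    toy_calendar[toy_calendar["available"] == "f"]
    .groupby("listing_id")
    .size()
    .rename("booked_nights")
)

# Earnings proxy = booked nights x listed price; listings with no booking get 0.
toy_earnings = toy_listings.set_index("id").join(booked_nights)
toy_earnings["booked_nights"] = toy_earnings["booked_nights"].fillna(0)
toy_earnings["future_earnings"] = toy_earnings["booked_nights"] * toy_earnings["price"]
print(toy_earnings["future_earnings"].to_dict())
```

Listing 3 never appears in the toy calendar, so it ends up with zero earnings, which mirrors the listings without any booked nights in the real data.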

In [None]:
calendar = calendar.merge(listings[['id','price']].rename({'id':'listing_id'}, axis=1), on='listing_id')
calendar['price_x'] = calendar['price_x'].apply(convert_money)
calendar['price_y'] = calendar['price_y'].apply(convert_money)
calendar_available = calendar[calendar['available']==1]
calendar_booked = calendar[calendar['available']==0]

Remember that in the `calendar` data, if a room is available on a certain date, we can see its price for that day. We also have the price in the `listings` data. We therefore compare the prices of available dates in `calendar` with those in `listings` to see whether the listed prices differ much from the actual daily prices.

In [None]:
a = plt.hist(calendar_available['price_x'] - calendar_available['price_y'], bins=100)
plt.title("Difference between actual (calendar) prices and listing prices")

The differences are mostly very small.

In [None]:
earnings = calendar_booked.groupby('listing_id')['price_y'].sum().reset_index().rename({'price_y':'future_earnings'}, axis=1)
a = plt.hist(earnings['future_earnings'])
plt.title("Future Earnings")

How many records have future earnings as 0?

In [None]:
sum(earnings['future_earnings'] == 0)

That's not a lot. The earnings data are heavily right-skewed, so we apply a log transformation:
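A minimal sketch of the transformation and its inverse, on invented earnings. It uses NumPy's `log1p` and `expm1`, which compute log(1 + x) and exp(y) - 1, the same pair of operations used in this notebook.

```python
import numpy as np

# Invented earnings, including a zero.
toy_earnings = np.array([0.0, 500.0, 20000.0, 150000.0])

# log1p(x) = log(1 + x): defined at 0 and compresses the long right tail.
log_earnings = np.log1p(toy_earnings)

# expm1 is the inverse, so predictions can be mapped back to dollars.
recovered = np.expm1(log_earnings)
print(log_earnings.round(2))
print(np.allclose(recovered, toy_earnings))
```

Adding 1 before taking the log keeps zero-earning listings at exactly 0 on the log scale.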

In [None]:
earnings['log_future_earnings'] = np.log(earnings['future_earnings']+1)

In [None]:
def y_to_earnings(y):
    '''
    Convert (predicted) log earnings back to earnings in dollars; inverse of log(earnings + 1)
    '''
    return np.exp(y)-1

In [None]:
a = plt.hist(earnings['log_future_earnings'])
plt.title("Log Future Earnings")

# 5. Align the features table   

In [None]:
features = final_df.copy()
id_earnings = features[['id']].merge(earnings.rename({'listing_id':'id'}, axis=1).drop('future_earnings', axis=1), on="id", how="left").fillna(0)
print("{:.2%} of the listings don't have future earnings".format(sum(id_earnings['log_future_earnings']==0)/len(id_earnings)))

# 6. Split the dataset into testing and training sets

In [None]:
X_all = features.drop('id', axis=1)
y_all = id_earnings['log_future_earnings']
X_train, X_test, y_train, y_test = train_test_split(X_all,
                                                                  y_all,
                                                                  test_size = 0.2,
                                                                  random_state = 0)
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

# 7. Build the gradient boosting regressor 

In [None]:
def regressor(X, y, params, random_state=1):
    '''
    Adapted from the scikit-learn example: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py
    Takes X, y and the model parameters, then plots how the deviance changes with more iterations and which variables are most important
    '''
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)
    print("Training set has {} samples.".format(X_train.shape[0]))
    print("Testing set has {} samples.".format(X_test.shape[0]))
    
    # Fit regression model
    clf = GradientBoostingRegressor(**params)
    clf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, clf.predict(X_test))
    print("MSE: %.4f" % mse)
    
    # Test-set deviance after each boosting stage; for the least-squares loss this is the MSE
    # (clf.loss_ is no longer available in recent scikit-learn releases)
    test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
    for i, y_pred in enumerate(clf.staged_predict(X_test)):
        test_score[i] = mean_squared_error(y_test, y_pred)

    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    plt.title('Deviance')
    plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
             label='Training Set Deviance')
    plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
             label='Test Set Deviance')
    plt.legend(loc='upper right')
    plt.xlabel('Boosting Iterations')
    plt.ylabel('Deviance')
    
    # Feature importance relative to the most important feature (= 100); keep the top 20
    feature_importance = clf.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    # Use the columns of X itself rather than the global `features` table
    im = pd.DataFrame({'feature': X.columns, 'relative importance': feature_importance}).sort_values('relative importance')[-20:]
    plt.subplot(1, 2, 2)
    plt.barh(im['feature'], im['relative importance'])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    
    # Adjust the spacing before showing; after plt.show() it no longer affects this figure
    plt.subplots_adjust(wspace=0.3)
    plt.show()
    return clf, im

In [None]:
#the first trial
params = {
    'n_estimators': 200,
    'max_depth': 6,
    'min_samples_split': 2,
    'learning_rate': 0.01,
    # 'loss' is left at its default (least squares); the name 'ls' was removed in scikit-learn 1.2
}

In [None]:
model, im = regressor(X_train, y_train, params, random_state=1)

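Our first model has an MSE of about 5 on the log scale, which corresponds to an RMSE of about 2.2 in log earnings. Since the target is log(earnings + 1), this is a multiplicative error rather than a dollar amount: exp(2.2) is roughly 9, so a typical prediction is off by about a factor of 9. The most important variable is `is_location_exact`, although we don't quite know what it means. The 9 most important features are listed and plotted further down.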
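To read the MSE figure quoted above, here is a small worked sketch. The MSE of 5 comes from the text; the $10,000 listing is a hypothetical example. An error measured on log(earnings + 1) translates into a multiplicative band around the actual value rather than a fixed dollar amount.

```python
import math

mse_log = 5.0                     # MSE reported on the log scale
rmse_log = math.sqrt(mse_log)     # about 2.24 log units
factor = math.exp(rmse_log)       # about 9.4

# An error of rmse_log in log(earnings + 1) scales (earnings + 1) by `factor`.
actual = 10_000.0                 # hypothetical actual earnings in dollars
high = (actual + 1) * factor - 1
low = (actual + 1) / factor - 1
print(round(rmse_log, 2), round(factor, 1))
print(round(low), round(high))
```

In other words, one RMSE above or below a $10,000 listing spans roughly a ninth of that amount to roughly nine times it.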

In [None]:
np.exp(5**0.5)-1

In [None]:
im.sort_values('relative importance', ascending=False)

In [None]:
plt.rcParams['figure.figsize'] = (20,20)
for i in range(9):
    plt.subplot(3,3,i+1)
    sns.regplot(x=X_train[im.sort_values('relative importance', ascending=False)['feature'].values[i]], y=y_train)

Of course, future earnings are not determined by a single feature. From the scatter plots we can get an idea of whether a variable affects future earnings positively or negatively.

We also see that 11% of the listings don't have future earnings. Are they very new listings? Are they of a certain kind?

In [None]:
no_earnings = id_earnings[id_earnings['log_future_earnings']==0].merge(features, on='id')
have_earnings = id_earnings[id_earnings['log_future_earnings']>0].merge(features, on='id')

In [None]:
plt.rcParams['figure.figsize'] = (20,10)

plt.subplot(2,4,1)
a = plt.hist(no_earnings['host_age'])
plt.title("host_age_no_earnings")
plt.subplot(2,4,2)
a = plt.hist(no_earnings['is_location_exact'])
plt.title("is_location_exact_no_earnings")
plt.subplot(2,4,3)
a = plt.hist(no_earnings['one_or_two_listings'])
plt.title("one_or_two_listings_no_earnings")
plt.subplot(2,4,4)
a = plt.hist(no_earnings['price'])
plt.title("price_no_earnings")

plt.subplot(2,4,5)
a = plt.hist(have_earnings['host_age'])
plt.title("host_age_have_earnings")
plt.subplot(2,4,6)
a = plt.hist(have_earnings['is_location_exact'])
plt.title("is_location_exact_have_earnings")
plt.subplot(2,4,7)
a = plt.hist(have_earnings['one_or_two_listings'])
plt.title("one_or_two_listings_have_earnings")
plt.subplot(2,4,8)
a = plt.hist(have_earnings['price'])
plt.title("price_have_earnings")

We can see that listings with no earnings mostly have `is_location_exact` equal to 0. Are the no-earning listings with `is_location_exact` equal to 1 very new listings?

In [None]:
plt.rcParams['figure.figsize'] = (6,4)

a = plt.hist(no_earnings[no_earnings['is_location_exact']==1]['host_age'])
plt.title('host_age of no earning listings with True is_location_exact')

We don't see that pattern here.

# 8. Tune the model

In [None]:
rg = GradientBoostingRegressor()
parameters = {'learning_rate': [0.01, 0.05, 0.08, 0.1],
                    'max_depth': [4, 6, 8, 10],
                    'max_features': [None, 50, 40, 30, 20, 15, 10]}
scorer = "neg_mean_squared_error"

In [None]:
grid_obj = GridSearchCV(rg, parameters, scoring=scorer)

In [None]:
grid_fit = grid_obj.fit(X_train, y_train)

This is the best model!

In [None]:
best_model = grid_fit.best_estimator_
best_model

In [None]:
predicted_log_earnings = best_model.predict(X_test)
print("MSE of the tuned model: {:.3}".format(np.mean(np.square(predicted_log_earnings - y_test))))
difference = abs(y_to_earnings(predicted_log_earnings) - y_to_earnings(y_test))
print("{:.2%} have less than $1000 absolute error".format(sum(difference<1000)/len(y_test)))
print("{:.2%} have less than $5000 absolute error".format(sum(difference<5000)/len(y_test)))
print("{:.2%} have less than $10000 absolute error".format(sum(difference<10000)/len(y_test)))
print("{:.2%} have less than $20000 absolute error".format(sum(difference<20000)/len(y_test)))

In [None]:
plt.rcParams['figure.figsize'] = (20,5)
a = plt.hist(difference, bins=1000)
plt.title('Absolute difference')

In [None]:
plt.rcParams['figure.figsize'] = (20,5)

plt.subplot(1,2,1)
plt.hist(y_to_earnings(y_test)[difference<1000], bins=100)
plt.title('Earnings of listings with less than $1000 predicting error')

In [None]:
plt.rcParams['figure.figsize'] = (15,3)

plt.subplot(1,2,1)
a = plt.hist(y_to_earnings(predicted_log_earnings))
plt.title("predicted earnings")

plt.subplot(1,2,2)
a = plt.hist(y_to_earnings(y_test))
plt.title("actual earnings")

On the test set, half of the listings were predicted with less than $5000 absolute error. From the histograms of predicted and actual earnings, we can see that our model is conservative: it produces smaller numbers for the listings that might have very high earnings.
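The "share of listings within $X of the truth" figures can be computed with a small helper. `share_within` is a hypothetical name introduced here, and the dollar values are invented; this is a sketch of the metric, not the notebook's evaluation code.

```python
import numpy as np

def share_within(actual, predicted, thresholds):
    """Return {threshold: share of rows with absolute error below it}."""
    abs_error = np.abs(np.asarray(predicted) - np.asarray(actual))
    return {t: float(np.mean(abs_error < t)) for t in thresholds}

# Invented dollar values for illustration.
actual = np.array([1_000.0, 8_000.0, 30_000.0, 90_000.0])
predicted = np.array([1_500.0, 12_000.0, 24_000.0, 40_000.0])

shares = share_within(actual, predicted, [1_000, 5_000, 10_000, 20_000])
for t, s in shares.items():
    print("{:.0%} have less than ${} absolute error".format(s, t))
```

In this toy example the high-earning listing is the one with the largest miss, which is the same pattern the histograms show for the real model.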

In [None]:
plt.rcParams['figure.figsize'] = (8, 8)
x = np.linspace(0,300000)
sns.regplot(x=y_to_earnings(y_test), y=y_to_earnings(predicted_log_earnings))
plt.plot(x, x, linewidth=2)
plt.xlabel("actual earnings")
plt.ylabel("predicted earnings")

# 9. What if we leave out listings with no future earnings when training the model? 

11% of the listings don't have future earnings: they were not booked at all for the following year when the data was scraped. Most of them have `is_location_exact` equal to 0. They might be outliers, so let's train the model again using only the listings that have future earnings.

In [None]:
def prepare_train_test_data(X_all, y_all, test_size=0.2):
    '''
    Split the data into training set and testing set
    '''
    X_train, X_test, y_train, y_test = train_test_split(X_all,
                                                                      y_all,
                                                                      test_size = test_size,
                                                                      random_state = 0)
    print("Training set has {} samples.".format(X_train.shape[0]))
    print("Testing set has {} samples.".format(X_test.shape[0]))
    return X_train, X_test, y_train, y_test

def tune_train_model(rg, parameters, scorer, X_train, y_train):
    '''
    Grid search on parameters to find the best model
    '''
    grid_obj = GridSearchCV(rg, parameters, scoring=scorer)
    grid_fit = grid_obj.fit(X_train, y_train)
    best_model = grid_fit.best_estimator_
    return best_model

def evaluate_model(best_model, X_test, y_test):
    '''
    Predict on the testing set; calculate the MSE; calculate the percentage of records with less than X amount of absolute error
    '''
    predicted_log_earnings = best_model.predict(X_test)
    print("MSE of the tuned model: {:.3}".format(np.mean(np.square(predicted_log_earnings - y_test))))
    difference = abs(y_to_earnings(predicted_log_earnings) - y_to_earnings(y_test))
    print("{:.2%} have less than $1000 absolute error".format(sum(difference<1000)/len(y_test)))
    print("{:.2%} have less than $5000 absolute error".format(sum(difference<5000)/len(y_test)))
    print("{:.2%} have less than $10000 absolute error".format(sum(difference<10000)/len(y_test)))
    print("{:.2%} have less than $20000 absolute error".format(sum(difference<20000)/len(y_test)))
    return predicted_log_earnings, difference

In [None]:
# A smaller grid; note that the cells below still pass the full `parameters` grid
parameters1 = {'learning_rate': [0.01, 0.05, 0.08, 0.1],
                    'max_depth': [4, 6, 8, 10],
                    'max_features': [None, 50]}

In [None]:
have_future_earnings_index = id_earnings['log_future_earnings'] > 0
X_train_new, X_test_new, y_train_new, y_test_new = prepare_train_test_data(X_all[have_future_earnings_index.values], y_all[have_future_earnings_index.values])
best_model = tune_train_model(rg, parameters, scorer, X_train_new, y_train_new)

In [None]:
best_model

In [None]:
predicted_log_earnings, difference = evaluate_model(best_model, X_test_new, y_test_new)

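The MSE is much smaller, so this new model appears to perform much better. Note that the test set has changed as well (it no longer contains zero-earning listings), so the two MSE values are not measured on the same data.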

In [None]:
plt.rcParams['figure.figsize'] = (20,5)
a = plt.hist(difference, bins=1000)
plt.title('Absolute difference')

In [None]:
plt.rcParams['figure.figsize'] = (8, 8)
x = np.linspace(0,y_to_earnings(max(y_test_new)))
sns.regplot(x=y_to_earnings(y_test_new), y=y_to_earnings(predicted_log_earnings))
plt.plot(x, x, linewidth=2)
plt.xlabel("actual earnings")
plt.ylabel("predicted earnings")

We can see that quite a few listings with little or no actual future earnings were predicted by the model to have earnings. This makes us curious: what if we train and test the model only on listings that have `is_location_exact` equal to 1 and have future earnings?

In [None]:
X_all_reset = X_all.reset_index(drop=True)  # positional index, aligned with id_earnings
new_index = (id_earnings['log_future_earnings'] > 0) & (X_all_reset['is_location_exact'] == 1)
X_train_new, X_test_new, y_train_new, y_test_new = prepare_train_test_data(X_all_reset[new_index.values], y_all[new_index.values])
best_model = tune_train_model(rg, parameters, scorer, X_train_new, y_train_new)

In [None]:
best_model

In [None]:
predicted_log_earnings, difference = evaluate_model(best_model, X_test_new, y_test_new)

In [None]:
plt.rcParams['figure.figsize'] = (8, 8)
x = np.linspace(0,y_to_earnings(max(y_test_new)))
sns.regplot(x=y_to_earnings(y_test_new), y=y_to_earnings(predicted_log_earnings))
plt.plot(x, x, linewidth=2)
plt.xlabel("actual earnings")
plt.ylabel("predicted earnings")

The model performs slightly better.

# 10. Interesting take-aways 

From the model we can see that it's important to have `is_location_exact` equal to 1 (although we are not quite sure what it means). `price` and `price_per_bed_compared_to_nbh` have a positive effect on future earnings, but I think that is mostly because listings with higher prices have better amenities, locations, etc. So don't worry if your listing is more expensive than most other listings in the neighborhood. If your room is truly in better condition and offers guests more, you will likely have better future earnings. The length of the amenities field also has a positive effect. Other important factors are acceptance rate and response rate.

To improve the model, I think it would help to enrich the address data. For example, we could look up house prices using the addresses. We could also look at how many hotels there are in the neighborhood of each listing.