# Predicting Rent Prices

AirBnB is a marketplace for short-term rentals that allows you to list part or all of your living space for others to rent. You can rent out everything from a room in an apartment to your entire house. As hosts, if we charge above the market price for a living space, renters will choose similar but more affordable alternatives. If we set our nightly price too low, we'll miss out on potential revenue. We therefore want to use data on local listings to predict the optimal price to set.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

Let's start by giving a brief description of the columns of the data we are going to work with:

- `host_response_rate`: the response rate of the host
- `host_acceptance_rate`: share of requests to the host that convert to rentals
- `host_listings_count`: number of other listings the host has
- `latitude`: latitude of the geographic coordinates
- `longitude`: longitude of the geographic coordinates
- `city`: the city where the living space is located
- `zipcode`: the zip code where the living space is located
- `state`: the state where the living space is located
- `accommodates`: the number of guests the rental can accommodate
- `room_type`: the type of living space (Private room, Shared room or Entire home/apt)
- `bedrooms`: number of bedrooms included in the rental
- `bathrooms`: number of bathrooms included in the rental
- `beds`: number of beds included in the rental
- `price`: nightly price for the rental
- `cleaning_fee`: additional fee used for cleaning the living space after the guest leaves
- `security_deposit`: refundable security deposit, in case of damages
- `minimum_nights`: minimum number of nights a guest can stay for the rental
- `maximum_nights`: maximum number of nights a guest can stay for the rental
- `number_of_reviews`: number of reviews that previous guests have left

Now, let's read the dataframe.

In [2]:
df = pd.read_csv('dc_airbnb.csv')

Let's have a look at the first few rows of the data.

In [3]:
df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


Now, let's get more information about the data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null object
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(5), int64(5), object(9)

We can clearly see that there are some missing values. We also need to inspect the data to decide which columns are useful for our purpose.

# Data Cleaning

Before building the machine learning model, we will start by cleaning the DataFrame.

The following columns contain non-numerical values:
- `room_type`: e.g. Private room
- `city`: e.g. Washington
- `state`: e.g. DC

while these columns contain numerical but non-ordinal values:
- `latitude`: e.g. 38.913458
- `longitude`: e.g. -77.031
- `zipcode`: e.g. 20009
    
Since a host could have many living spaces and we don't have enough information to uniquely group living spaces to the hosts themselves, let's avoid using any columns that don't directly describe the living space or the listing itself:
- `host_response_rate`
- `host_acceptance_rate`
- `host_listings_count`

Let's remove these 9 columns from the DataFrame.

In [5]:
cols = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count']
df.drop(columns = cols, inplace = True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 10 columns):
accommodates         3723 non-null int64
bedrooms             3702 non-null float64
bathrooms            3696 non-null float64
beds                 3712 non-null float64
price                3723 non-null object
cleaning_fee         2335 non-null object
security_deposit     1426 non-null object
minimum_nights       3723 non-null int64
maximum_nights       3723 non-null int64
number_of_reviews    3723 non-null int64
dtypes: float64(3), int64(4), object(3)
memory usage: 290.9+ KB


Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):
- `bedrooms`
- `bathrooms`
- `beds`

Since the number of rows containing missing values for one of these 3 columns is low, we can select and remove those rows without losing much information.

There are also 2 columns that have a large number of missing values:
- `cleaning_fee`: 37.3% of the rows
- `security_deposit`: 61.7% of the rows

We can't handle these easily. Removing the rows with missing values in these 2 columns would discard the majority of the observations in the dataset. Instead, let's remove these 2 columns entirely from consideration.
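Missing-value percentages like the ones quoted above can be computed directly from a DataFrame. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are illustrative, not taken from `dc_airbnb.csv`); `isnull().mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real listings (values are made up).
toy = pd.DataFrame({
    'bedrooms': [1.0, 2.0, np.nan, 1.0],
    'cleaning_fee': ['$15.00', np.nan, np.nan, '$100.00'],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```

Applied to the real DataFrame, the same expression would list every column with its share of missing rows.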

In [7]:
df = df.dropna(subset=['bedrooms','bathrooms','beds'])

In [8]:
df.drop(columns = ['cleaning_fee', 'security_deposit'], inplace = True)

In [9]:
df.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
0,4,1.0,1.0,2.0,$160.00,1,1125,0
1,6,3.0,3.0,3.0,$350.00,2,30,65
2,1,1.0,2.0,1.0,$50.00,2,1125,1
3,2,1.0,1.0,1.0,$95.00,1,1125,0
4,4,1.0,1.0,1.0,$50.00,7,1125,0


The `price` column is stored as `object` because its values contain "$" signs and commas. Let's remove those characters and convert the column to `float`.

In [10]:
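# regex=False treats ',' and '$' as literal characters ('$' is otherwise a regex anchor)
stripped_commas = df['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
df['price'] = stripped_dollars.astype('float')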
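As an alternative sketch, both characters can be removed in a single pass with a regular-expression character class, inside which "$" is a literal character. The series `prices` below holds made-up strings in the same format as the `price` column; it is not the notebook's data.

```python
import pandas as pd

# Made-up price strings in the same format as the `price` column.
prices = pd.Series(['$160.00', '$1,250.00', '$95.00'])

# Remove every "$" and "," in a single pass, then convert to float.
cleaned = prices.str.replace(r'[$,]', '', regex=True).astype('float')
print(cleaned.tolist())
```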

Let's confirm that there are no more missing values before proceeding.

In [11]:
df.isnull().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64

Before building the model, let's split the dataframe into training and test sets.

In [12]:
cols = list(df.columns)
cols.remove('price')
X = df[cols].values
y = df['price'].values  # 1-D target array, the shape scikit-learn estimators expect
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Now, we will scale all feature columns.

In [13]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
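To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on made-up one-column arrays (`X_train_toy` and `X_test_toy` are hypothetical names, not the notebook's data). The scaler learns its mean and standard deviation from the training data and reuses them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrices (one column) for illustration.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data only.
train_scaled = scaler.fit_transform(X_train_toy)
# transform reuses those training statistics; it does not refit on the test data.
test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

Because the test point is scaled with the training statistics, it is not centered on zero, which is the intended behavior.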

Now, we will build a 10-Fold cross-validator.

In [14]:
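# random_state has no effect unless shuffle=True (recent scikit-learn versions
# raise an error for that combination), so it is omitted here
cv = KFold(n_splits=10)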
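As a rough illustration of what a K-fold cross-validator does, the sketch below splits ten made-up samples into 5 folds (fewer than the 10 used in this notebook, only to keep the output short). The array `X_toy` is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples, split into 5 folds to keep the output short.
X_toy = np.arange(10).reshape(-1, 1)
kf = KFold(n_splits=5)

# Each sample lands in the validation fold exactly once.
folds = list(kf.split(X_toy))
for train_idx, val_idx in folds:
    print('train:', train_idx, 'validation:', val_idx)
```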

# k-nearest neighbors

Now, we will fit a k-nearest neighbors regressor with different numbers of neighbors *k*.

In [19]:
knn = KNeighborsRegressor()

In [20]:
parameters = {'n_neighbors': np.arange(1,10)}
knn_grid = GridSearchCV(estimator = knn,
                       param_grid = parameters,
                       scoring = 'neg_mean_squared_error',
                       cv = cv)
knn_grid = knn_grid.fit(X_train, y_train)

# best_score_ is the highest mean negative MSE across the folds
best_neg_mse = knn_grid.best_score_
best_parameters = knn_grid.best_params_
# Printing the smallest RMSE
print(np.sqrt(-best_neg_mse))
# Printing the best parameters
print(best_parameters)

117.2736762245927
{'n_neighbors': 9}


In [21]:
y_pred = knn_grid.predict(X_test)
print('RMSE:',np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 97.68870530603249


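Among the values tried (*k* = 1 to 9), the k-nearest neighbors model with *k=9* gives the lowest cross-validated RMSE (about 117.3). Its RMSE on the test set is about 97.7. Since *k=9* is the largest value in the grid, larger values of *k* may do even better.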
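To see how the error changes with *k* rather than only the best value, the fitted grid search exposes the score of every candidate in `cv_results_`. The sketch below runs on synthetic regression data, not the Airbnb listings, and the names `X_toy`, `y_toy` and `grid` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data; not the Airbnb listings.
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=200)

grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid={'n_neighbors': np.arange(1, 10)},
                    scoring='neg_mean_squared_error',
                    cv=KFold(n_splits=10))
grid.fit(X_toy, y_toy)

# mean_test_score holds the mean negative MSE per candidate; negate it and
# take the square root to get an RMSE for each k.
rmse_per_k = np.sqrt(-grid.cv_results_['mean_test_score'])
for k, rmse in zip(grid.cv_results_['param_n_neighbors'], rmse_per_k):
    print(k, round(rmse, 4))
```

If the RMSE is still decreasing at the largest *k* in the grid, that is a hint to widen the search range.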

# Random forest

Let's try a random forest model to see if we can improve the results.

In [22]:
rf = RandomForestRegressor(random_state=0)

In [23]:
n_estimators = [10, 50, 100]
max_depth_values = [5, 6, 7, 8, 9]
max_features_values = [4, 5, 6, 7]
tree_params = {'n_estimators' : n_estimators,
               'max_depth': max_depth_values,
               'max_features': max_features_values}
rf_grid = GridSearchCV(estimator=rf, 
                       param_grid=tree_params,
                       scoring='neg_mean_squared_error', 
                       n_jobs=-1, 
                       cv=cv, 
                       verbose=1)
rf_grid.fit(X_train, y_train)

# best_score_ is the highest mean negative MSE across the folds
best_neg_mse = rf_grid.best_score_
best_parameters = rf_grid.best_params_
# Printing the smallest RMSE
print(np.sqrt(-best_neg_mse))
# Printing the best parameters
print(best_parameters)

Fitting 10 folds for each of 60 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   28.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   36.9s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   53.4s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.1min finished


113.95766150959807
{'max_depth': 5, 'max_features': 4, 'n_estimators': 100}


The best parameters for the random forest are
- Maximum depth = 5
- Maximum features = 4
- Number of estimators = 100

In [24]:
y_pred = rf_grid.predict(X_test)
print('RMSE:',np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 121.87839340238614


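We can see that the cross-validated RMSE has slightly improved, from about 117.3 for the k-nearest neighbors model to about 114.0 for the random forest. However, on the test set the k-nearest neighbors model does better (an RMSE of about 97.7 against about 121.9). Besides, using *100* estimators to construct the random forest is quite computationally expensive.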
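The comparison can be spelled out with a few lines of arithmetic on the RMSE values printed in the cells above. The dictionary layout and variable names are only illustrative.

```python
# RMSE values copied from the notebook outputs above.
results = {
    'knn': {'cv_rmse': 117.2736762245927, 'test_rmse': 97.68870530603249},
    'random_forest': {'cv_rmse': 113.95766150959807, 'test_rmse': 121.87839340238614},
}

# How much the random forest lowers the cross-validated RMSE.
cv_gain = results['knn']['cv_rmse'] - results['random_forest']['cv_rmse']
# How much worse the random forest is on the held-out test set.
test_gap = results['random_forest']['test_rmse'] - results['knn']['test_rmse']

print(round(cv_gain, 2), round(test_gap, 2))
```

The small gain in cross-validation is outweighed by the much larger gap on the test set.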