# Rental Accommodation Booking Predictor

For the last 3 months I have been running web scraping jobs to extract publicly available data for short-term holiday rentals in NSW from an Airbnb-style website. Extracting the property details together with the public calendar gives us both the property features and the current state of bookings: which dates are booked and for how long.

The image below shows the location of all the properties, with Sydney shown as the red dot:

![Property Locations](Airbnb_Property_Locations_NSW.PNG)


Features that we are able to extract include:
* Number of bedrooms
* Number of guests allowed
* Price per night
* Overall rating and the number of reviews given
* Distance and direction from Sydney
* Number of photos of the property on the listing
* Word count of the main description
* The type of property (cottage, apartment, farmstay, house etc)
* Other features extracted from the listing begin with 'f_' such as ```f_clothes_dryer```
* Calendar showing current forward bookings
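
The "distance and direction from Sydney" feature above could be derived from a listing's coordinates roughly as follows. This is only a sketch: the source does not say how the feature was computed, so the haversine formula, the Sydney reference coordinates, and the function name `distance_and_bearing` are all assumptions.

```python
import math

# Assumed reference point for Sydney as (latitude, longitude) in degrees
SYDNEY = (-33.8688, 151.2093)
EARTH_RADIUS_KM = 6371.0


def distance_and_bearing(lat, lon, origin=SYDNEY):
    """Return (great-circle distance in km, initial bearing in degrees) from origin."""
    lat1, lon1 = math.radians(origin[0]), math.radians(origin[1])
    lat2, lon2 = math.radians(lat), math.radians(lon)
    dlat, dlon = lat2 - lat1, lon2 - lon1

    # Haversine distance
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    distance_km = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

    # Initial compass bearing: 0 = north, 90 = east
    x = math.sin(dlon) * math.cos(lat2)
    y = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    bearing_deg = (math.degrees(math.atan2(x, y)) + 360) % 360
    return distance_km, bearing_deg


# Illustrative coordinates, roughly Newcastle, north-east of Sydney
dist_km, bearing_deg = distance_and_bearing(-32.9283, 151.7817)
print(round(dist_km, 1), round(bearing_deg, 1))
```

The bearing could then be bucketed into compass directions (N, NE, E and so on) if a categorical direction feature is preferred.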

The calendar information exposes arrival and departure dates in the page's HTML, which can be parsed with Python and collated into the number of days booked. Future improvements will include breaking this down into weekend, public holiday, and school holiday bookings.

![Calendar](Airbnb_Calendar.PNG)
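
The collation step described above can be sketched as below. It assumes the arrival and departure dates have already been parsed out of the HTML into pairs of `date` objects, that bookings do not overlap, and that the departure day itself is not counted as booked. The function name `count_booked_days` and the sample dates are hypothetical.

```python
from datetime import date


def count_booked_days(bookings):
    """Sum the booked nights across (arrival, departure) pairs.

    The departure day is not counted, so a stay from the 4th to the 7th is 3 nights.
    """
    return sum((departure - arrival).days for arrival, departure in bookings)


# Hypothetical bookings for one property
sample_bookings = [
    (date(2019, 1, 4), date(2019, 1, 7)),    # 3 nights
    (date(2019, 1, 18), date(2019, 1, 20)),  # 2 nights
]

total_booked_days = count_booked_days(sample_bookings)
print(total_booked_days)  # 5
```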

From manual inspection, a small number of property listings receive many bookings, while a large number of listings have very few or zero bookings.

The hypothesis is that there is an identifiable feature list which is desirable for bookings, and therefore that bookings can be predicted accurately for a new property when given the full set of features of that property.



Based on the tutorial at https://mitrai.com/tech-guide/using-aws-sagemaker-linear-regression-to-predict-store-transactions/

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import time
import json
import boto3
import re

from sklearn.model_selection import train_test_split

import sagemaker
import sagemaker.amazon.common as smac
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker import get_execution_role


In [None]:
bucket = 'deloitte-sagemaker'
prefix = 'sagemaker/linear_time_series_forecast'

# Define IAM role (needed by the Estimator below; get_execution_role() only works when run inside SageMaker)
role = get_execution_role()
role

## 2. Data Import and Validation

In [None]:
hd_1 = pd.read_csv('housing_sample.csv')

# Fill any 'NaN' values with zero
hd_1.fillna(0,inplace=True)

# Drop the 'ext_at' column
hd_1.drop('ext_at',axis=1, inplace=True)

# Exclude any rows where the 'init_price' is unknown.
hd_2 = hd_1[hd_1['init_price'] > 0 ]

# Drop the 'property_id' column
hd_3 = hd_2.drop('property_id',axis=1)

h_data = hd_3


display(h_data.head())

In [None]:
# Check for any missing postcodes (NaN values were already replaced with 0 above, so this should be empty)
mp = h_data[h_data['postcode'].isnull()]
mp.postcode

In [None]:
# Make sure that all fields are non null
h_data.info()

## 3. Training and Test Data Sets

We will use the ```total_booked_days``` column as the y-value we want to predict, then create training and test data sets using an 80/20 split.

In [None]:
# Define which column is the y variable
y = h_data.total_booked_days

# Split into a training set and a testing set.
# Note: h_data still contains 'total_booked_days', so the target is also passed in as a feature.
X_train, X_test, y_train, y_test = train_test_split(h_data, y, test_size=0.20, random_state=42)

X_train.head()
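
Because the split above is taken from the full `h_data` frame, the target column stays in the feature matrix. A minimal sketch of a split that keeps the target out of the features is shown below, using a small made-up frame in place of `h_data`. The `bedrooms` column name and all values are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for h_data; values are invented for illustration
toy = pd.DataFrame({
    'bedrooms': [1, 2, 3, 2, 4, 1, 3, 2, 5, 1],
    'init_price': [90, 150, 220, 140, 300, 80, 210, 160, 400, 95],
    'total_booked_days': [0, 12, 17, 3, 0, 1, 9, 0, 22, 4],
})

target = 'total_booked_days'
features = toy.drop(columns=[target])  # everything except the target
labels = toy[target]

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.20, random_state=42
)
print(X_tr.shape, X_te.shape)
```

If the notebook were changed this way, the `feature_dim` hyperparameter would need to match the reduced column count, and later cells that read `X_test['total_booked_days']` would need to read from `y_test` instead.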

## 4. Convert Data and Upload to S3

In [None]:
# Convert training dataset to RecordIO format as required by Amazon SageMaker

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(X_train).astype('float32'), np.array(y_train).astype('float32'))
buf.seek(0)

In [None]:
# Upload training data to S3

key = 'housing_data.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('Uploaded training data location: {}'.format(s3_train_data))

In [None]:
# Convert validation dataset to RecordIO format as required by Amazon SageMaker

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(X_test).astype('float32'), np.array(y_test).astype('float32'))
buf.seek(0)

In [None]:
# Upload validation data to S3

key = 'housing_validation.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', key)).upload_fileobj(buf)
s3_validation_data = 's3://{}/{}/validation/{}'.format(bucket, prefix, key)
print('Uploaded validation data location: {}'.format(s3_validation_data))

## 5. Run LinearLearner Predictor

In [None]:
# Specify the containers

containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest',
             'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest',
             'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest',
             'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest'}

In [None]:
# Start a SageMaker session

sess = sagemaker.Session()

# Configure the training job for the LinearLearner container in the current region
linear = sagemaker.estimator.Estimator(
    containers[boto3.Session().region_name],
    role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sess,
)

# feature_dim must equal the number of columns in the training data
linear.set_hyperparameters(
    feature_dim=77,
    mini_batch_size=100,
    predictor_type='regressor',
    epochs=10,
    num_models=32,
    loss='absolute_loss',
)

linear.fit({'train': s3_train_data, 'validation': s3_validation_data})

## 6. Deploy the Model

In [None]:
linear_predictor = linear.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Configure the predictor to convert our numpy arrays into a format that can be sent in the HTTP POST request to the inference container. In this case it’s a simple CSV string, and the results come back as JSON. For these common formats we can use the Amazon SageMaker Python SDK’s built-in csv_serializer and json_deserializer functions.

In [None]:
linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer
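
To make the request and response formats concrete, the sketch below builds a CSV payload and parses a JSON reply using only the standard library. It is not the SDK's implementation: `rows_to_csv` is a hypothetical helper, and the sample response is made up, following the `predictions` / `score` layout that the next section reads.

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Serialise a list of numeric rows to a CSV string, one row per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerows(rows)
    return buf.getvalue().rstrip('\n')


# Two made-up feature rows
payload = rows_to_csv([[1.0, 2.0], [3.0, 4.0]])
print(payload)

# Made-up response body in the shape the notebook expects
sample_response = '{"predictions": [{"score": 16.818}, {"score": 3.2}]}'
scores = [p['score'] for p in json.loads(sample_response)['predictions']]
print(scores)
```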

## 7. Run Model with Test Dataset

Now that the model has been deployed to an endpoint, we can call it with the test dataset and extract the predictions into the ```one_step``` variable.

In [None]:
# DataFrame.as_matrix() was removed in pandas 1.0; .values works across versions
test_X = X_test.values

result = linear_predictor.predict(test_X)

one_step = np.array([r['score'] for r in result['predictions']])

In [None]:
# Find some values from the test data to validate where the number of booked days is non zero

X_test['total_booked_days'][0:10]

In [None]:
# Since the first item (index position 0) has some bookings we can compare the actual and predicted values
test_index = 0

pred_days = one_step[test_index]
print("Prediction: " + str(pred_days))

act_days = X_test.iloc[test_index]['total_booked_days']
print("Actual: " + str(act_days))

# Percentage difference relative to the actual value (assumes act_days is non-zero)
print("Difference: {0:.2f}%".format(((act_days - pred_days) / act_days) * 100))

Work out the relative difference between the predictions and the actual values across the whole test set.

In [None]:
# Absolute error as a fraction of the actual value
res1 = np.abs(y_test - one_step) / y_test

# Where the actual value is zero the result is 'inf' (or NaN for 0/0), so default those to zero
res1[~np.isfinite(res1)] = 0

print("Median difference: " + str(np.median(res1)))

# Check the first few records
res1[0:5]
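
The same calculation can be written as a small standalone function that avoids dividing by zero in the first place. This is a sketch on made-up numbers: `median_relative_error` is a hypothetical name, and it follows the notebook's choice of counting rows with zero actual bookings as zero error, which understates the error on those rows.

```python
import numpy as np


def median_relative_error(actual, predicted):
    """Median of |actual - predicted| / actual, with zero-actual rows counted as 0."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    rel = np.zeros_like(actual)
    nonzero = actual != 0
    # Only divide where the actual value is non-zero
    rel[nonzero] = np.abs(actual[nonzero] - predicted[nonzero]) / actual[nonzero]
    return float(np.median(rel))


# Made-up actual and predicted booked days
sample_actual = [17.0, 0.0, 10.0, 4.0]
sample_predicted = [16.818, 0.5, 9.0, 5.0]

sample_error = median_relative_error(sample_actual, sample_predicted)
print(sample_error)
```

An alternative design is to drop the zero-actual rows before taking the median, or to report an absolute error in days, which stays well defined when the actual value is zero.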

## 8. Graph Results
Graph the actual values vs predicted values for a slice of the dataset.

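**Note:** The predicted values are all within a fraction of the actual result, i.e. 17.0 vs 16.818 for the first element, so the two lines are hard to tell apart visually. Bear in mind that the feature matrix passed to the model still contains the ```total_booked_days``` column, so this closeness partly reflects the model seeing the target as an input.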

In [None]:
print('One-step = ', np.median(res1))

plt.figure(figsize=(20,10))

plt.plot(one_step[200:250], label='forecast')
plt.plot(np.array(y_test[200:250]), label='actual')
plt.legend()
plt.show()

In [None]:
# Run some housing details manually to see what changes

#linear_predictor.predict() #expected label to be 1

### (Optional) Clean-up

If you're finished with this notebook, uncomment and run the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
#sess.delete_endpoint(linear_predictor.endpoint)