# Capstone Project: AirBNB Price Point 

## Overview:

When you visit a city for the first time or rent out your home, it can be hard to tell whether an AirBNB rental is fairly priced, or whether you have priced your own home appropriately to make a profit. The purpose of this project is to use machine learning to estimate a fair approximate price for a home, to assist both renters and hosts using AirBNB.

The data used for this project was pulled from AirBNB’s website by Tom Slee (http://tomslee.net/airbnb-data-collection-get-the-data). It covers 26 randomly selected days over the past five years. I downloaded the separate CSV files for each date, collected them using glob, and joined them together to form one large dataframe. The combined data totals 159,181 rows for the city of Chicago.
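As a minimal sketch of the glob-and-join step, assuming pandas is used and that every survey file shares the same columns, the loading could look like the following. The function name `load_surveys` and the file pattern are hypothetical and not taken from the project.

```python
import glob

import pandas as pd


def load_surveys(pattern):
    """Read every CSV matching `pattern` and stack them into one dataframe."""
    # sorted() keeps the load order stable across runs
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_surveys("data/*.csv")` (hypothetical path) would then return one dataframe containing the rows of every survey file.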

## Data Wrangling:

I began by performing exploratory data analysis to catch any discrepancies that would impact the model. I removed all columns with more than 90% missing data, including 'bathrooms', 'borough', 'country', 'location', and 'survey_id'. I dropped the original 'city' column since it was missing large amounts of data. I also removed columns that were specific to AirBNB’s platform and did not serve as helpful indicators, including 'host_id', 'latitude', 'longitude', 'last_modified', and 'room_id'. Finally, I dropped rows with null values in 'price', 'bedrooms', 'accommodates', and 'overall_satisfaction', which accounted for less than 10% of each column.

There were two categorical columns, 'room_type' (with values Entire home/apt, Private room, and Shared room) and 'neighborhood', that were stored as strings. For model comparison I converted these columns to integer dummy variables.
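A small sketch of the dummy-variable conversion, assuming pandas and using a three-row stand-in table (the rows and neighborhood labels are illustrative, not taken from the data set):

```python
import pandas as pd

# Illustrative stand-in rows; the real data has many more neighborhoods
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "neighborhood": ["Near North Side", "Loop", "Loop"],
    "price": [150, 80, 40],
})

# Each category becomes its own 0/1 integer column; the string columns are dropped
encoded = pd.get_dummies(listings, columns=["room_type", "neighborhood"], dtype=int)
print(encoded.columns.tolist())
```

With many distinct neighborhoods this produces one column per neighborhood, which is how a handful of raw columns can expand into a much larger number of model coefficients.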

The 'overall_satisfaction' column rates customer satisfaction on a scale of 0 to 5.0. It needed to be adjusted to accurately reflect the data. There were 19,831 rows with an 'overall_satisfaction' value of zero, of which 19,795 were from homes that had fewer than 3 reviews.

Of the 26,536 homes with fewer than 3 reviews, only 6,741 had an overall satisfaction rating above 0. Upon further investigation of AirBNB’s platform, it appeared that for many cities the 'overall_satisfaction' rating is not available until after 3 reviews are listed. I believe the 25% of these listings that had a score above 0 based on only 1 or 2 customers would also not accurately reflect the quality of the listing with so few visitors. All listings with fewer than 3 reviews were removed from the data set.
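The review-count filter can be sketched as below. This assumes pandas and a review-count column named `reviews`; the column name and the five stand-in rows are assumptions for illustration.

```python
import pandas as pd

# Stand-in rows; `reviews` is an assumed name for the review-count column
listings = pd.DataFrame({
    "reviews": [0, 1, 2, 3, 10],
    "overall_satisfaction": [0.0, 4.5, 0.0, 4.0, 5.0],
})

few_reviews = listings["reviews"] < 3

# Share of low-review listings that still report a non-zero rating
share_rated = (listings.loc[few_reviews, "overall_satisfaction"] > 0).mean()

# Keep only listings with at least 3 reviews
listings = listings[~few_reviews].reset_index(drop=True)
```

On the real data the same share works out to 6,741 / 26,536, or roughly 25%.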

The next part of the dataframe to adjust was the null values in the minimum stay column. Since 45% of the column was missing (well short of the 90% threshold used for dropping a column), I replaced the null values with the median value of 2.
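Median imputation of this kind might look like the following sketch, assuming pandas and a minimum-stay column named `minstay` (an assumed name; the values are made up so that the median is 2):

```python
import pandas as pd

# Stand-in column with gaps; `minstay` is an assumed column name
listings = pd.DataFrame({"minstay": [1, 2, None, 3, None, 2]})

# The median ignores missing values by default
median_stay = listings["minstay"].median()

# Fill the gaps with the median of the observed values
listings["minstay"] = listings["minstay"].fillna(median_stay)
```

The median is less sensitive than the mean to a few very long minimum stays, which matters for a column with outliers like this one.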

I noticed that in some of the earlier surveys some listing prices were calculated by month. To remedy this and remove outliers, I cut all listings priced at 500 or more a night (including two extreme outliers at 9,999 a night), which accounted for less than 1% of the dataframe.

I also noticed that a small number of minimum stays exceeded a year, when the mean stay is around 2 days. I removed the extreme outlier of a 500-day stay. Finally, to avoid multicollinearity, I dropped the 'accommodates' column since it is closely related to 'bedrooms'. This left me with a clean data set of 90,223 rows.
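The price and minimum-stay filters, together with dropping 'accommodates', can be sketched as follows. This assumes pandas; the `minstay` column name and the 365-day cutoff are assumptions, since the text only states that the 500-day stay was removed.

```python
import pandas as pd

listings = pd.DataFrame({
    "price": [95, 500, 9999, 120, 60],
    "minstay": [2, 1, 1, 500, 3],
    "bedrooms": [1, 2, 3, 1, 1],
    "accommodates": [2, 4, 6, 2, 2],
})

MAX_PRICE = 500    # listings at or above this nightly price are dropped
MAX_MINSTAY = 365  # assumed cutoff, chosen only so the 500-day stay is removed

keep = (listings["price"] < MAX_PRICE) & (listings["minstay"] <= MAX_MINSTAY)
clean = listings[keep]

# Drop 'accommodates' because it largely duplicates the information in 'bedrooms'
clean = clean.drop(columns=["accommodates"]).reset_index(drop=True)
```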

<img src="/Images/Clean_Chicago_Price_Data.png">

## Exploratory Data Analysis:

After the data was cleaned, I created a scatterplot matrix and a heatmap of the non-categorical columns to get an initial overview of the data set.
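A heatmap of this kind is commonly drawn from the pairwise correlation matrix of the numeric columns. The sketch below, assuming pandas and stand-in values, computes only that matrix; whether the project's heatmap showed correlations is an assumption, and the plotting itself is left out.

```python
import pandas as pd

listings = pd.DataFrame({
    "price": [60, 95, 120, 150, 240],
    "bedrooms": [1, 1, 2, 2, 4],
    "reviews": [12, 40, 7, 25, 3],
    "room_type": ["Shared room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

# Leave out the categorical data, then compute pairwise Pearson correlations
numeric = listings.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```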

## Linear Regression:

After the data was cleaned, I split it into a 70% training set and a 30% test set and fit a linear regression model to the training data to predict price. The R^2 value for the test set was 0.5283, meaning 52.83% of the variability in Y (the price) can be explained by X (the features, with 198 coefficients). Further analysis showed that only 40 percent of the test set predictions were within 20 dollars of the actual price, which is not that exciting. For reference, the mean price was 113 with a standard deviation of 74.
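The split, fit, and two test-set measures can be sketched with NumPy alone. The text does not say which library was used, so the least-squares call here is a stand-in, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 listings, 3 numeric features, noisy linear price
X = rng.normal(size=(200, 3))
y = 113 + X @ np.array([40.0, 15.0, -5.0]) + rng.normal(scale=20.0, size=200)

# 70/30 split on shuffled row indices
order = rng.permutation(len(y))
n_train = int(0.7 * len(y))
train, test = order[:n_train], order[n_train:]


def add_intercept(features):
    """Prepend a column of ones so the model can learn a constant term."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares fit on the training rows only
coef, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)
pred = add_intercept(X[test]) @ coef

# R^2 = 1 - SS_res / SS_tot, computed on the held-out rows
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Share of test predictions within 20 dollars of the actual price
within_20 = np.mean(np.abs(pred - y[test]) <= 20)
print(f"R^2 = {r2:.3f}, within $20 = {within_20:.1%}")
```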

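The mean absolute error (MAE) was 34.8245, so on average the model's predictions are off by around 35 dollars. MAE is the average absolute difference between the predicted and actual values, that is, the average vertical (or equivalently horizontal) distance between each point and the Y=X line on a plot of predicted against actual price.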

The root mean squared error (RMSE) was 51.4111. RMSE is the square root of the average squared difference between the values predicted by the model and the values observed, so it penalizes large misses more heavily than MAE does. It indicates a typical prediction error of about $51.41 on the test set, not a guarantee that every prediction falls within that range.
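To make the difference between the two error measures concrete, here is a tiny worked example with four made-up prices, assuming NumPy:

```python
import numpy as np

actual = np.array([100.0, 150.0, 80.0, 200.0])
predicted = np.array([110.0, 140.0, 80.0, 160.0])

errors = predicted - actual            # 10, -10, 0, -40

mae = np.mean(np.abs(errors))          # (10 + 10 + 0 + 40) / 4 = 15
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 100 + 0 + 1600) / 4) = sqrt(450)

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

The single 40-dollar miss pulls RMSE well above MAE, which is the same pattern seen in the reported values of 51.41 and 34.82.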

Next I calculated and plotted the residuals of both the train and test sets, obtained by subtracting the observed responses from the predicted responses, to examine the model's errors. The advantage of a residual vs. fitted plot is that it shows how the model performs on each individual case, which makes it a good way to visualize the errors. If the model is accurate, the residuals should be randomly scattered around the zero line. Here the residuals showed a linear trend, with a few points far from the rest. This means the residuals still contain structure, so the model is not capturing something. There may be an interaction between two variables that was not considered, or a time-dependent effect in the data.
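The idea of "structure" in residuals can be illustrated with a deliberately misspecified model. In this sketch, assuming NumPy and synthetic data, a straight line is fit to a relationship that has a squared term, and the residuals use the same predicted-minus-observed convention as the text.

```python
import numpy as np

x = np.linspace(0, 10, 50)
observed = 5 + 2 * x + 0.5 * x ** 2   # true relationship has a squared term

# Straight-line fit that ignores the squared term
slope, intercept = np.polyfit(x, observed, deg=1)
predicted = intercept + slope * x

# Same sign convention as the text: predicted minus observed
residuals = predicted - observed

# The residuals average to zero, yet they are far from random:
# they follow the curvature the straight line could not capture
curvature = (x - x.mean()) ** 2
structure = np.corrcoef(residuals, curvature)[0, 1]
print(f"mean residual = {residuals.mean():.2e}, correlation with curvature = {structure:.3f}")
```

A near-zero mean residual is therefore not enough on its own; the residual plot is what reveals the missing term.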

I then created a histogram for the residual error of the test set. It was mostly normally distributed and centered around zero.

I next created a probability plot to check the normality of the residual errors. The values are ordered and compared to an idealized normal (Gaussian) distribution. It shows that the residuals are not normal, with extreme values at both ends, seen as outliers in the upper and lower tails of the quantile plot. These values should be removed or another model should be used.
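The ordering-and-comparison step behind a probability plot can be sketched as below, assuming NumPy and the standard library's `NormalDist`. The helper name `normal_quantile_pairs` and the plotting positions `(i - 0.5) / n` are choices made for this sketch; plotting libraries may use slightly different positions.

```python
from statistics import NormalDist

import numpy as np


def normal_quantile_pairs(residuals):
    """Return (theoretical, ordered) quantile pairs for a normal probability plot."""
    ordered = np.sort(np.asarray(residuals, dtype=float))
    n = len(ordered)
    # Plotting positions (i - 0.5) / n, one common convention among several
    probs = (np.arange(1, n + 1) - 0.5) / n
    standard_normal = NormalDist()
    theoretical = np.array([standard_normal.inv_cdf(float(p)) for p in probs])
    return theoretical, ordered


rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(scale=50, size=500),
    "heavy-tailed": rng.standard_t(df=2, size=500) * 50,  # extreme values at both ends
}

quality = {}
for name, sample in samples.items():
    theo, ordered = normal_quantile_pairs(sample)
    # A correlation close to 1 means the points hug a straight line
    quality[name] = np.corrcoef(theo, ordered)[0, 1]
    print(f"{name}: straight-line correlation = {quality[name]:.3f}")
```

Plotting `ordered` against `theo` gives the probability plot itself; residuals with extreme tails bend away from the straight line at both ends.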

## Supervised Learning:

Since linear regression did not perform especially well, I decided to apply a variety of supervised regression models, including decision tree, random forest, k-nearest neighbors, multi-layer perceptron, gradient boosting, lasso regression, ridge regression, and elastic net. This allowed me to quickly determine which models were worth looking into further.

The best performing model was random forest with an R^2 value of 0.685, with the similar decision tree model trailing in second with an R^2 value of 0.64. The worst performers were lasso and elastic net. The random forest's RMSE was 42, roughly an 18% improvement over the linear regression model's 51.41.
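A comparison loop of this kind might be sketched as follows, assuming scikit-learn (the text does not name the library) and synthetic data. Only four of the models are shown, with default or arbitrary settings, so the ranking printed here says nothing about the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with one interaction a linear model cannot express
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 113 + 40 * X[:, 0] + 25 * X[:, 1] * (X[:, 2] > 0) + rng.normal(scale=15, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit every model on the same split and score it on the same held-out rows
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = {"r2": float(r2_score(y_test, pred)), "rmse": rmse}

for name, scores in sorted(results.items(), key=lambda item: item[1]["rmse"]):
    print(f"{name:18s} R^2 = {scores['r2']:.3f}  RMSE = {scores['rmse']:.2f}")
```

Scoring every model on one shared split keeps the comparison fair, at the cost of depending on that single split.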

## Random Forest Feature Analysis:

A feature importance analysis on the random forest showed that the top weighted features for determining price were the room type being an entire house or apartment at 66%, the number of bedrooms at 26%, the neighborhood being Near North at 3%, the neighborhood being the Loop at 1%, overall satisfaction at almost 1%, and reviews at half a percent.

A few things to keep in mind with impurity-based importance rankings: they are biased towards preferring variables with more categories, and they behave differently when ranking two or more correlated variables. If variables are correlated, the model has no preference for one over the other. Once it weighs the importance of one, the importance of the other correlated variables is significantly reduced. This is fine when we want to use feature selection to reduce overfitting, but when interpreting the data it can lead to incorrectly labeling variables as strong vs. weak predictors.
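The effect of correlated variables on impurity-based importances can be explored with a sketch like the one below. It assumes scikit-learn and synthetic, standardized stand-in features: `bedrooms` and `accommodates` are built here as two noisy views of one hidden size variable, which is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
size = rng.normal(size=n)                            # hidden "how big is the home"
bedrooms = size + rng.normal(scale=0.3, size=n)      # two noisy, correlated views of size
accommodates = size + rng.normal(scale=0.3, size=n)
rating = rng.normal(size=n)
price = 113 + 50 * size + 10 * rating + rng.normal(scale=10, size=n)


def importances(columns, names):
    """Fit a forest on the given columns and return impurity-based importances by name."""
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(np.column_stack(columns), price)
    return dict(zip(names, forest.feature_importances_))


alone = importances([bedrooms, rating], ["bedrooms", "rating"])
together = importances(
    [bedrooms, accommodates, rating], ["bedrooms", "accommodates", "rating"]
)

# Importances always sum to 1, so a correlated newcomer tends to take
# part of the share that 'bedrooms' held when it stood alone
print("without accommodates:", {k: round(v, 3) for k, v in alone.items()})
print("with accommodates:   ", {k: round(v, 3) for k, v in together.items()})
```

Comparing the two printed dictionaries shows how much of the share attributed to one variable depends on which of its correlated neighbors are present in the model.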

Next I calculated and plotted the residuals of both the train and test sets, again by subtracting the observed responses from the predicted responses. The residuals were fairly evenly distributed around the zero line, indicating a good fit, but there is still a bit of structure at the ends.

The histogram of the residuals is skewed to the right, suggesting the random forest model may be predicting prices that are higher than the actual values.

## Further Analysis:

To further improve the model, hyperparameter tuning should be used.