# Capstone Project: AirBNB Price Point 

## Overview:

When visiting a city for the first time or renting out your home, it can be hard to tell whether you are paying a fair price for an AirBNB rental, or whether you have priced your home appropriately to make a profit. The purpose of this project is to use machine learning to estimate a fair price for a home, to assist both renters and hosts using AirBNB.

The data used for this project was pulled from AirBNB’s website by Tom Slee (http://tomslee.net/airbnb-data-collection-get-the-data). It covers 26 randomly selected days over the past five years. I gathered the separate CSV files for each date using glob and concatenated them into one large dataframe. The data totals 159,181 rows for the city of Chicago.
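The glob-and-join step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual loading code: the helper name `load_surveys`, the file names, and the two tiny demo files are all made up for the example.

```python
import glob
import os
import tempfile

import pandas as pd


def load_surveys(pattern):
    """Read every CSV matching `pattern` and stack them into one dataframe."""
    paths = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demonstration with two made-up survey files (not the real data).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"room_id": [1, 2], "price": [80, 120]}).to_csv(
        os.path.join(tmp, "survey_a.csv"), index=False
    )
    pd.DataFrame({"room_id": [3], "price": [95]}).to_csv(
        os.path.join(tmp, "survey_b.csv"), index=False
    )
    combined = load_surveys(os.path.join(tmp, "*.csv"))

print(combined.shape)
```

Passing `ignore_index=True` gives the combined dataframe a fresh 0..n-1 index instead of repeating each file's own row numbers.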

![Chicago Lat & Long Price Map Image](/Images/Chicago Lat & Long Price Map.png)

## Data Wrangling:

I began by performing exploratory data analysis to catch any discrepancies that would impact the model. I removed all columns with more than 90% missing data: 'bathrooms', 'borough', 'country', 'location', and 'survey_id'. I dropped the original 'city' column since it was also missing large amounts of data. I then removed columns that were specific to AirBNB’s platform and did not serve as helpful indicators, including 'host_id', 'latitude', 'longitude', 'last_modified', and 'room_id'. Finally, I dropped the rows with null values in 'price', 'bedrooms', 'accommodates', and 'overall_satisfaction', which accounted for less than 10% of each column.

![Unfiltered Chicago Data Frame Image](/Images/Unfiltered Chicago Dataframe.png)

There were two categorical columns, ‘room_type’ (Entire home/apt, Private room, and Shared room) and ‘neighborhood’, that were stored as strings. For model comparison I converted these columns to integer dummy variables.

![Chicago Dataframe Dummy Variables for Categories Image](/Images/Chicago Dataframe Dummy Variables for Categories.png)
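One common way to build dummy variables like these is `pandas.get_dummies`. The sketch below assumes that approach and uses three invented listings; the write-up does not state which function was actually used.

```python
import pandas as pd

# Made-up listings for illustration only.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "neighborhood": ["Near North", "Loop", "Near North"],
    "price": [150, 70, 35],
})

# Each category becomes its own 0/1 integer column.
encoded = pd.get_dummies(
    listings, columns=["room_type", "neighborhood"], dtype=int
)
print(encoded.columns.tolist())
```

With many neighborhoods this produces one column per neighborhood, which is why the feature count grows so large after encoding.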

The ‘overall_satisfaction’ column rates customer satisfaction on a scale of 0 to 5.0, and it needed to be adjusted to reflect the data accurately. There were 19,831 rows with an ‘overall_satisfaction’ value of zero, of which 19,795 came from homes with fewer than 3 reviews.

Out of the 26,536 homes with fewer than 3 reviews, only 6,741 had an overall satisfaction rating above 0. Upon further investigation of AirBNB’s platform, it appeared that for many cities the ‘overall_satisfaction’ rating is not available until 3 reviews are listed. The remaining 25% of these listings had a score above 0 based on only 1 or 2 customers, which I believe is too few visitors to reflect the quality of the listing accurately. All listings with fewer than 3 reviews were therefore removed from the data set.

The next piece of the dataframe to adjust was the null values under minimum stay. Since 45% of the column was missing (well short of the 90% threshold for dropping it), I replaced the null values with the median value of 2.

![Chicago Dataframe End Null Results](/Images/Chicago Dataframe End Null Results.png)
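Median imputation of this kind takes only a couple of lines in pandas. In the sketch below the column name `minstay` and the five sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up minimum-stay values; `minstay` is an assumed column name.
listings = pd.DataFrame({"minstay": [1.0, np.nan, 2.0, np.nan, 3.0]})

# Series.median() skips NaN values by default.
median_stay = listings["minstay"].median()
listings["minstay"] = listings["minstay"].fillna(median_stay)

print(median_stay, listings["minstay"].tolist())
```

The median is less sensitive than the mean to a handful of very long minimum stays, which matters for a column with extreme outliers.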

I noticed that in some of the earlier surveys, some listing prices were calculated by month. To remedy this and remove outliers, I cut all listings priced above 500 a night (including two extreme outliers at 9,999 a night), which accounted for less than 1% of the dataframe.

I also found that a small number of minimum stays exceeded a year, when the mean stay is around 2 days, so I removed the extreme outlier with a minimum stay of 500 days. Finally, to avoid multicollinearity, I dropped the ‘accommodates’ column since it is closely related to ‘bedrooms’. This left a clean data set of 90,355 rows.
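The row filters and column drop described in this section could be expressed along these lines. It is only a sketch on invented rows: the column name `reviews` is an assumption, and the minimum-stay outlier removal is left out because the exact cutoff is not stated.

```python
import pandas as pd

# Made-up rows for illustration only.
listings = pd.DataFrame({
    "reviews": [0, 2, 5, 12, 40],
    "price": [60, 90, 110, 9999, 200],
    "accommodates": [2, 2, 4, 6, 3],
    "bedrooms": [1, 1, 2, 3, 1],
})

cleaned = (
    listings[(listings["reviews"] >= 3) & (listings["price"] <= 500)]
    .drop(columns=["accommodates"])  # closely related to bedrooms
    .reset_index(drop=True)
)
print(cleaned)
```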

## Exploratory Data Analysis:

After the data was cleaned, I created a scatterplot matrix and heatmap without the neighborhood categorical data to get an initial overview of the data set (the neighborhood data would crowd the overview). The scatterplot matrix does not show any strong, distinct trends on which to base assumptions, so further investigation is needed.

![Scatterplot Matrix Image](/Images/Scatterplot Matrix.png)

The correlation heatmap shows how strongly each pair of variables is correlated. Focusing on our variable of interest, price, the heatmap shows a slight positive correlation with the entire home/apt room type. The shared room type, private room type, and reviews are negatively correlated with price.

![Correlation Heatmap Image](/Images/Correlation Heatmap.png)

## Linear Regression:

After the data was cleaned, I separated the data into an 80% training set and a 20% test set and fit a linear regression model to the training data to predict price. The $R^{2}$ value for the test set was 0.5218. This means 52.18% of the variability in Y (the price) can be explained using X (the variables/198 coefficients). For context, the mean price was 113 with a standard deviation of 76. Further analysis showed that only 40 percent of the test set predictions were within 20 dollars of the actual price, which leaves considerable room for improvement.
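The split, fit, and scoring steps can be sketched with scikit-learn as follows. The data here is synthetic, and the `random_state` and coefficients are arbitrary choices for the example, so the printed numbers will not match the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and price column.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 100 + X @ np.array([30.0, -10.0, 5.0]) + rng.normal(scale=20.0, size=500)

# 80% training, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on the held-out test set

# Share of test predictions that land within 20 of the actual value.
within_20 = np.mean(np.abs(model.predict(X_test) - y_test) <= 20)
print(f"R^2 = {r2:.4f}, within 20: {within_20:.0%}")
```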

The mean absolute error (MAE) was 35.5648, meaning the model's predictions were off by about 36 dollars on average. On a plot of predicted against actual prices, MAE is the average vertical (or, equivalently, horizontal) distance between each point and the Y=X line.

The root mean squared error (RMSE) was 53.2029. RMSE is the square root of the mean squared difference between the values predicted by the model and the values observed, so it penalizes large misses more heavily than MAE does. It indicates that the model's typical prediction error on the test set was roughly $53.20; individual predictions can be off by more or less than this.
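To make the difference between the two metrics concrete, here is a small worked example with four invented prices. One large miss pulls the RMSE well above the MAE, and that miss is itself larger than the RMSE.

```python
import numpy as np

# Four made-up listings: three small misses and one large one.
actual = np.array([100.0, 150.0, 80.0, 200.0])
predicted = np.array([110.0, 140.0, 90.0, 150.0])

errors = predicted - actual            # [10, -10, 10, -50]
mae = np.mean(np.abs(errors))          # (10 + 10 + 10 + 50) / 4 = 20
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt(2800 / 4) = sqrt(700)

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```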

Next I calculated and plotted the residuals of both the train and test sets, obtained by subtracting the observed responses from the predicted responses. The advantage of a fitted vs. residual plot is that it shows the model's performance on each individual case, which makes it a good way to visualize the errors. If the model is accurate, the residuals should be randomly scattered around the zero line. Here the residuals showed a linear trend, with a few points well outside the main cluster. This means the residuals contain structure that the model is not capturing. There may be an interaction between two variables that was not considered, or a time dependence in the data.

![Linear Regression Residual Plot Image](/Images/Linear Regression Residual Plot.png)

I then created a histogram of the residual errors for the test set. It was approximately normal and centered around zero, with a slight skew to the right. This suggests the linear model tends to predict prices slightly higher than the actual values.

![Linear Regression Residual Error Histogram Image](/Images/Linear Regression Risidual Histogram of Test Set.png)

I next created a probability plot to check the normality of the residual errors. It shows that the residuals are not normal, with extreme values at both ends, visible in the upper and lower tails of the quantile plot. These values should be removed or another model should be used.

![Linear Regression Probability Plot Image](/Images/Linear Regression Probability Plot.png)
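A probability plot like this can be computed with `scipy.stats.probplot`, which pairs the sorted residuals with the quantiles expected under a normal distribution. The sketch below assumes that function and uses heavy-tailed synthetic residuals as a stand-in for the real ones.

```python
import numpy as np
from scipy import stats

# Heavy-tailed synthetic residuals standing in for the real ones.
rng = np.random.default_rng(0)
residuals = rng.standard_t(df=3, size=1000) * 30

# Returns the plotted points plus a least-squares line through them.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
print(f"correlation with the normal line: r = {r:.3f}")
```

Points that bend away from the fitted line at either end correspond to the extreme residuals described above. Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) draws the figure directly.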

## Supervised Learning:

Since linear regression performed only moderately well, I decided to apply a variety of supervised regression models. I used the same train/test split as for the linear regression model (80/20) and fit the training data with decision tree, random forest, k-nearest neighbors, multi-layer perceptron, gradient boosting, lasso regression, ridge regression, and elastic net models to predict prices. This strategy allowed me to immediately determine which model was worth looking into further.
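A comparison like this is usually written as a loop over a dictionary of estimators. The following sketch uses synthetic data and default settings, and leaves out the multi-layer perceptron to keep the example quick; the scores it prints are unrelated to the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with one interaction term that linear models cannot capture.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 100 + 30 * X[:, 0] + 15 * X[:, 1] * X[:, 2] + rng.normal(scale=10.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "k-nearest neighbors": KNeighborsRegressor(),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
    "lasso": Lasso(),
    "ridge": Ridge(),
    "elastic net": ElasticNet(),
}

# Fit each model on the training set and record its test-set R^2.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: R^2 = {score:.3f}")
```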

![Chicago Data RF R2 Image](/Images/Chicago Data RF R2.png)

The best performing model was random forest, with an $R^{2}$ value of 69% on the test set; decision tree (a similar model to random forest) came second with an $R^{2}$ value of 64%. The worst performers were lasso and elastic net. The random forest's RMSE was 42.5, a 20% improvement over the linear regression model. This means the random forest's typical prediction error on the test set was about $42.50.

![Chicago DF Supervised Learning Image](/Images/Chicago DF Supervised Learning.png)

## Random Forest Feature Analysis:

When performing a feature importance analysis on the random forest, the top weighted features for determining price were the room type being an entire house or apartment at 33%, reviews at 19%, number of bedrooms at 14%, minimum stay at 5%, overall satisfaction at 4%, and the neighborhood Near North at 2%, with subsequent neighborhoods trailing down to 0%. Those listed below are the top 30 out of 196 features.

![Chicago DF Random Forest Features Image](/Images/Chicago DF Random Forest Features.png)

A few things to consider with impurity-based feature rankings: they are biased towards preferring variables with more categories, and they behave differently when ranking two or more correlated variables. If variables are correlated, the model has no preference for one over the other. Once it uses one of them, the importance of the other correlated variables is significantly reduced. This is fine when we want to use feature selection to reduce overfitting, but when interpreting the data it can lead to incorrectly labeling variables as strong vs. weak predictors.
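Impurity-based importances of this kind are exposed by scikit-learn as the `feature_importances_` attribute of a fitted forest. The sketch below shows how such a ranking might be produced, on synthetic data where only one made-up feature actually drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends only on `bedrooms`.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=300),
    "reviews": rng.integers(3, 200, size=300),
    "noise": rng.normal(size=300),
})
y = 50 + 40 * X["bedrooms"] + rng.normal(scale=5.0, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```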

![Chicago DF Random Forest Features Bar Plot Image](/Images/Chicago DF Random Forest Features Bar Plot.png)

Next I calculated and plotted the residuals of both the train and test sets, obtained by subtracting the observed responses from the predicted responses. The residuals were fairly evenly distributed around the zero line, indicating a good fit, but there is still some structure at the ends.

![Chicago DF RF Residual Plot Image](/Images/Chicago DF RF Residual Plot.png)

I also plotted the histogram of the residuals, shown below. The distribution skews slightly to the left, indicating that the random forest model tends to underpredict prices.

![Chicago DF RF Residual Histogram Image](/Images/Chicago DF RF Residual Histogram.png)

Next I tuned the random forest hyperparameters to find the best model for this set of data. A model may perform well on the training set, but if it is overfit it will be useless in a real application. Hyperparameter optimization guards against overfitting by scoring each candidate through cross validation.

I set tuning values for the number of estimators/trees in the random forest (start = 200, stop = 500, num = 10), the number of features to consider at every split ('auto', 'sqrt'), the maximum number of levels in a tree (10, 110, num = 11), the minimum number of samples required to split a node (2, 5, 10), the minimum number of samples required at each leaf node (1, 2, 4), and the method of selecting samples for training each tree (bootstrap = [True, False]). I used a random search over these parameters with 3-fold cross validation, fitting each of 100 candidates for a total of 300 fits. The best parameter fit for the model is shown below.
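A search over this grid can be sketched with scikit-learn's `RandomizedSearchCV`. The example below is an approximation, not the original notebook code: it runs on synthetic data, samples only 3 candidates instead of 100 so that it finishes quickly, and replaces `'auto'` with `1.0` because recent scikit-learn releases no longer accept `'auto'` for `max_features` (for a random forest regressor it meant "use all features").

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = 100 + 30 * X_train[:, 0] + rng.normal(scale=10.0, size=200)

param_distributions = {
    "n_estimators": np.linspace(200, 500, num=10).round().astype(int).tolist(),
    "max_features": [1.0, "sqrt"],  # 1.0 stands in for the older 'auto'
    "max_depth": np.linspace(10, 110, num=11).round().astype(int).tolist(),
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=3,  # the write-up used 100 candidates; 3 keeps this demo fast
    cv=3,      # 3-fold cross validation
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a forest refit on the full training set with the best parameters, which can then be scored against the test set.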

With the tuned parameters in hand, I compared the tuned model against the default parameters. The tuned model performed about 3 percentage points better, with 72% of the variability explained by the features.

![Random Forest Best Parameters Image](/Images/Random Forest Best Parameters.png)

![Random Forest Hyperparameter Comparison Image](/Images/Random Forest Hyperparameter Comparison.png)

## Further Analysis:

Using the random forest model to predict prices on the test set gave an $R^{2}$ value of 0.6952, indicating that 69.52% of the variability in price can be explained by this model. With hyperparameters tuned, the model explains 72% of the variability in the test set. This is a pretty good indicator of price considering the vast number of variables that influence Chicago's rentals. However, there is still much room for improvement.

To further improve the model, I would recommend scraping AirBNB's website to gather more data on a larger set of random dates spread throughout the year. A time series analysis on that data could show whether time, season, or weather drives fluctuations in price, which would tell renters when prices drop and hosts when they can raise their rates.

Once these improvements have been made, the model can be used to generate an even more accurate prediction of prices for AirBNB home rentals in Chicago. This model can then be placed in an application to generate key price ranges depending on the type of home you are looking to rent or list.