# CAPSTONE PROJECT: AIRBNB PRICE POINT

## 1. DESCRIPTIVE ABSTRACT

When you visit a city for the first time or rent out your home, it can be hard to tell whether you are paying a fair price for an Airbnb rental or whether you have priced your home appropriately to make a profit. The purpose of this project is to use machine learning to estimate a fair approximate price for a listing, to assist renters and hosts using Airbnb.

The data used for this project was pulled from Airbnb's website by Tom Slee (http://tomslee.net/airbnb-data-collection-get-the-data) for the city of Chicago. The data covers 26 randomly selected days over the past five years. It was downloaded as a separate csv file for each date, and the files were joined together to form one large dataframe totaling 159,181 rows. Below is a visual of the data used for this project.

The data samples for Chicago will first be explored through exploratory data analysis and data wrangling. After the data is cleaned, a linear regression model will be applied to predict listing prices. The data will then be fit with further supervised regression models: decision tree, random forest, k-nearest neighbors, multi-layer perceptron, gradient boosting, lasso regression, ridge regression and elastic net. Next, feature importance analysis and hyperparameter tuning will be applied to the best model. These results will be tied together to show the impact each factor may have on overall price. The deliverables for this project are the code, this report detailing the work completed, and a blog post telling the story.

![Chicago Lat & Long Price Map Image](/Images/Chicago Lat & Long Price Map.png)

The code for this map can be found here: https://github.com/CFrensko/Capstone-Project-1/blob/master/Data%20Exploration%20AirBNB%20Chicago%20Lat%20%26%20Long.ipynb

## 2. INTRODUCTION

Airbnb is an online platform that allows people to lease or rent short-term lodging in over 81,000 cities around the world. It is a popular service in the city of Chicago for visitors to use as an alternative to staying at hotels. If you are a homeowner, or your apartment complex allows Airbnb listings, it can be a fun way to make extra money. In some cases it is also an opportunity to host visitors and share in the experience of welcoming them to your city.

One of the problems with using the website for the first time in a new city is figuring out what a reasonable price is for a unit when you are unfamiliar with the neighborhoods. As a customer looking for a rental, how do you determine a fair budget if the listings in one neighborhood are sparse and there is no range of equivalent prices to compare against? Maybe the neighborhood is a really beautiful niche, but transportation to all the local entertainment will really cost you. What is your break-even price point for an authentic experience? Is it priceless? Maybe you should avoid the hassle and disappointment by booking a hotel instead.

It can also be frustrating for a new host to structure the pricing of their unit without knowing the fair market value. You could be losing out on money by not following the market trends. Look at how much your friend just made last weekend with the music festival Lollapalooza in town. How do you make sure you are not undervaluing your home?

The solution to this problem would be a dynamic application that gives you a reasonable price range for a unit with your specifications in different neighborhoods. Instead of doing all that extra research, you could just open the app and receive an instant, unbiased, transparent price estimate for your ideal home. This application could also be useful for estimating monthly rental rates if you are planning on moving.

The application would rely on supervised machine learning to quickly search historical listings and compare their variables. The ideal supervised learning model would make the task of pricing your rental, or finding a reasonably priced home, seamless.


### 2.1 APPROACH

The approach for solving this problem is to take the random data samples for Chicago and perform exploratory data analysis along with data wrangling. After the data is cleaned, a linear regression model will be applied to predict listing prices. The data will then be fit with further supervised regression models: decision tree, random forest, k-nearest neighbors, multi-layer perceptron, gradient boosting, lasso regression, ridge regression and elastic net. Next, feature importance analysis and hyperparameter tuning will be applied to the best model. These results will be tied together to show the impact each factor may have on overall price. The deliverables for this project are the code, this report detailing the work completed, and a blog post telling the story.

## 3. DATA SET

### 3.1 DATA ACQUISITION

The data used for this project was pulled from Airbnb's website by Tom Slee (http://tomslee.net/airbnb-data-collection-get-the-data) for the city of Chicago. The data was downloaded as a separate csv file for each date, and the files were joined together to form one large dataframe totaling 159,181 rows. The data covers 26 randomly selected days over the past five years.
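
The join of the per-date files can be sketched with pandas. This is a minimal illustration, not the project's notebook code: the `combine_surveys` helper and the tiny in-memory CSV strings are hypothetical stand-ins, and in practice each source would be the path of one downloaded survey file.

```python
import io

import pandas as pd


def combine_surveys(sources):
    """Read each survey CSV and stack the results into one dataframe."""
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True renumbers the rows 0..n-1 across all surveys
    return pd.concat(frames, ignore_index=True, sort=False)


# Two tiny stand-in surveys (made-up values, not real listings)
survey_a = io.StringIO("room_id,price,bedrooms\n1,100,1\n2,250,3\n")
survey_b = io.StringIO("room_id,price,bedrooms\n1,110,1\n3,80,1\n")

combined = combine_surveys([survey_a, survey_b])
print(combined.shape)
```

The same listing can appear in several surveys, which is why the row count of the combined frame is larger than the number of distinct homes.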

### 3.2 DATA CLEANING

I began by performing exploratory data analysis to catch any discrepancies that would impact the model. I removed all columns with more than 90% missing data: 'bathrooms', 'borough', 'country', 'location', and 'survey_id'. I dropped the original 'city' column since it was missing large amounts of data. I also removed columns that were specific to Airbnb's platform and did not serve as helpful indicators: 'host_id', 'latitude', 'longitude', 'last_modified', and 'room_id'. Finally, I dropped the rows with null values in 'price', 'bedrooms', 'accommodates', and 'overall_satisfaction', which accounted for less than 10% of each column.
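
The 90% rule and the null-row removal can be expressed in a few lines of pandas. The sketch below runs on a made-up toy frame; the `drop_sparse_columns` helper is a hypothetical name, not a function from the project notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.90):
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()
    sparse = missing_share[missing_share > threshold].index
    return df.drop(columns=sparse)


# Toy frame: 'borough' is entirely missing, 'price' has one null
toy = pd.DataFrame({
    "borough": [np.nan] * 10,
    "price": [100, 80, np.nan, 120, 95, 60, 150, 70, 85, 110],
    "bedrooms": [1, 1, 2, 3, 1, 1, 2, 1, 1, 2],
})

cleaned = drop_sparse_columns(toy)
# Remove the remaining rows that have a null in a key column
cleaned = cleaned.dropna(subset=["price", "bedrooms"])
print(cleaned.columns.tolist(), len(cleaned))
```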

![Unfiltered Chicago Data Frame Image](/Images/Unfiltered Chicago Dataframe.png)

There were two categorical columns, 'room_type' (Entire home/apt, Private room and Shared room) and 'neighborhood', that were stored as strings. For model comparison I converted these columns to integer dummy variables.
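
As an illustration of the dummy-variable step, the sketch below uses `pandas.get_dummies` on a three-row toy frame. The rows and the neighborhood labels are placeholders chosen for the example, and the notebook may build the indicators differently.

```python
import pandas as pd

# Placeholder rows; only the column names follow the report
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "neighborhood": ["Near North", "Lake View", "Near North"],
    "price": [150, 70, 35],
})

# Each category value becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["room_type", "neighborhood"], dtype=int)
print(encoded.columns.tolist())
```

Because every distinct neighborhood becomes a separate column, this encoding is what drives the feature count up to the roughly 200 columns used by the models.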

![Chicago Dataframe Dummy Variables for Categories Image](/Images/Chicago Dataframe Dummy Variables for Categories.png)

The 'overall_satisfaction' column rates customer satisfaction on a scale of 0 to 5.0. It needed to be adjusted to portray the data accurately. There were 19,831 rows with an 'overall_satisfaction' value of zero, of which 19,795 came from homes with fewer than 3 reviews.

Out of the 26,536 homes with fewer than 3 reviews, only 6,741 had an overall satisfaction rating above 0. Upon further investigation of Airbnb's platform, it appeared that for many cities the 'overall_satisfaction' rating is not available until 3 reviews have been posted. I believe the 25% of these listings that had an overall satisfaction score above 0 based on only 1 or 2 customers would also not accurately reflect the quality of the listing, given so few visitors. All listings with fewer than 3 reviews were removed from the data set.
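
The counts quoted above can be checked with a little arithmetic, and the filter itself is a one-line boolean mask. The toy frame below is invented for the example; only the two counts come from the text.

```python
import pandas as pd

# Counts quoted in the text for homes with fewer than 3 reviews
zero_rated_few_reviews = 19_795
rated_few_reviews = 6_741
few_reviews_total = zero_rated_few_reviews + rated_few_reviews
share_rated = rated_few_reviews / few_reviews_total
print(few_reviews_total, round(share_rated, 3))

# Toy listings (made-up values) to show the filter
toy = pd.DataFrame({
    "reviews": [0, 1, 2, 3, 12],
    "overall_satisfaction": [0.0, 0.0, 4.5, 5.0, 4.5],
})
# Keep only listings with at least 3 reviews
kept = toy[toy["reviews"] >= 3]
print(len(kept))
```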

The next piece of the dataframe to adjust was the null values under minimum stay. Since 45% of the column was missing (below the 90% threshold for dropping a column), I replaced the null values with the median value of 2.
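
A minimal sketch of median imputation with pandas follows. The column name `minstay` is an assumption about how the minimum-stay field is labeled, and the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up minimum stays with two missing entries and one long stay
toy = pd.DataFrame({"minstay": [1.0, 2.0, np.nan, 3.0, np.nan, 2.0, 30.0]})

# median() skips nulls by default, so it is computed from the known values only
median_stay = toy["minstay"].median()
toy["minstay"] = toy["minstay"].fillna(median_stay)
print(median_stay, toy["minstay"].tolist())
```

The median is a reasonable fill value here because a handful of very long minimum stays would pull the mean well above what a typical listing requires.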

![Chicago Dataframe End Null Results](/Images/Chicago Dataframe End Null Results.png)

I noticed that in some of the earlier surveys some listing prices were quoted per month. To remedy this and remove outliers, I cut all listings priced above 500 a night (including two extreme outliers of 9,999 a night), which accounted for less than 1% of the dataframe.

I also found that a small number of minimum stays exceeded a year, while the typical minimum stay is around 2 days. I removed the extreme outlier with a minimum stay of 500 days. Finally, to avoid multicollinearity, I dropped the 'accommodates' column since it is closely related to the 'bedrooms' column. This left a clean data set of 90,355 rows.
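
The outlier filters and the column drop might look like the sketch below. The toy values, the `minstay` column name and the 365-day cutoff are assumptions made for illustration; the text only states the 500-a-night cap and the removal of the 500-day stay.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [95, 120, 9999, 450, 60],
    "minstay": [2, 1, 2, 500, 3],
    "accommodates": [2, 4, 6, 2, 1],
    "bedrooms": [1, 2, 3, 1, 1],
})

MAX_NIGHTLY_PRICE = 500  # cap stated in the text
MAX_MIN_STAY = 365       # assumed cutoff, chosen so the 500-day stay is excluded

within_limits = (toy["price"] <= MAX_NIGHTLY_PRICE) & (toy["minstay"] <= MAX_MIN_STAY)
filtered = toy[within_limits]
# 'accommodates' is dropped because it overlaps strongly with 'bedrooms'
filtered = filtered.drop(columns=["accommodates"])
print(filtered)
```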

The data wrangling is summarized here without the complete Python code. The full notebook can be found on my GitHub: https://github.com/CFrensko/Capstone-Project-1/blob/master/Data%20Wrangling%20WHOLE%20AirBNB%20Large%20City%20Chicago%20PRICE%20FINAL.ipynb

### 3.3 EXPLORATORY DATA ANALYSIS

After the data was cleaned, I created a scatterplot matrix and a heatmap, leaving out the neighborhood categorical data, to get an initial overview of the data set (the neighborhood columns would crowd the overview). The scatterplot matrix does not show any strong, distinct trends that we can draw conclusions from, so further investigation is needed.

![Scatterplot Matrix Image](/Images/Scatterplot Matrix.png)

The correlation heatmap shows how strongly each pair of variables is correlated. We can focus on our variable of interest, the price. Price shows a slight positive correlation with the entire home/apt room type. The shared room type, the private room type and the number of reviews are negatively correlated with price.
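
The numbers behind a heatmap like this come from a correlation matrix. Here is a minimal sketch on invented data, reading off the column for price; a plotting function such as seaborn's `heatmap` could then render the full matrix.

```python
import pandas as pd

# Invented listings; the column names mirror the cleaned data set
toy = pd.DataFrame({
    "price": [150, 70, 35, 200, 90, 60],
    "bedrooms": [2, 1, 1, 3, 1, 1],
    "reviews": [5, 40, 60, 3, 25, 30],
    "room_type_Entire home/apt": [1, 0, 0, 1, 1, 0],
})

corr = toy.corr()  # pairwise Pearson correlation
# Correlation of every other variable with price, strongest positive first
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Correlation only describes how two variables move together, so a negative value for reviews does not by itself mean that reviews lower the price.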

![Correlation Heatmap Image](/Images/Correlation Heatmap.png)

The exploratory data analysis (EDA) is summarized here without the complete Python code. The full notebook can be found on my GitHub: https://github.com/CFrensko/Capstone-Project-1/blob/master/Data%20Exploration%20AirBNB%20Chicago%20PRICE.ipynb

## 4. MODELING

With the dataset cleaned and excess noise removed from the variables, we use supervised machine learning algorithms to build a predictive model. Since the EDA shows that price is correlated with the other features, we first use multiple linear regression to measure the linear relationship and analyze the residuals.

### 4.1 LINEAR REGRESSION

The data was separated into an 80% training set and a 20% test set, and the training data was fit to the linear regression model to predict price. Next, the residuals of both the train and test sets were calculated and plotted to observe the estimates of experimental error, obtained by subtracting the observed responses from the predicted responses. The advantage of the fitted vs. residual plot is that it shows the model's performance on each individual case. Residual plots are a good way to visualize the errors in the data. If the model fits well, the residuals should be randomly scattered around the zero line. I then created a histogram of the residual error of the test set and a probability plot to check the normality of the distribution of the residual errors.
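
A minimal sketch of the split, fit and residual calculation with scikit-learn follows. The data is synthetic and the `random_state` values are arbitrary choices, so this illustrates the procedure rather than reproducing the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the price column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 100 + X @ np.array([30.0, -10.0, 5.0]) + rng.normal(scale=20.0, size=200)

# 80% training set, 20% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Residuals follow the convention used in this report: predicted minus observed
train_residuals = model.predict(X_train) - y_train
test_residuals = model.predict(X_test) - y_test
print(train_residuals.mean(), test_residuals.mean())
```

Plotting the residuals against the fitted values (for example with a matplotlib scatter plot) gives the fitted vs. residual view described above.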

The $R^{2}$ value for the test set was 0.5218. This means 52.18% of the variability in Y (the price) can be explained using X (the variables/198 coefficients). Further analysis showed that only 40 percent of the test set predictions were within 20 dollars of the actual price, which is a modest result. For reference, the mean price was 113 with a standard deviation of 76.

The mean absolute error (MAE) was 35.5648. This shows the model's predictions are off by around 36 dollars on average. On a plot of predicted vs. observed price, MAE is the average vertical (or horizontal) distance between each point and the Y=X line.

The root mean squared error (RMSE) was 53.2029. RMSE measures the differences between the values predicted by the model and the values observed, and it weights large errors more heavily than MAE does. A typical prediction for an Airbnb home in the test set was therefore off by roughly 53.20 dollars. This is an average measure, not a bound on every individual prediction.
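
To make the three metrics and the "within 20 dollars" share concrete, here is a small worked example in plain Python. The four observed and predicted prices are invented, and `regression_metrics` is a hypothetical helper written for this illustration.

```python
import math


def regression_metrics(observed, predicted, tolerance=20.0):
    """Return R^2, MAE, RMSE and the share of predictions within `tolerance`."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "within_tolerance": sum(abs(e) <= tolerance for e in errors) / n,
    }


observed = [100.0, 150.0, 80.0, 200.0]
predicted = [110.0, 140.0, 120.0, 170.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

In this toy case the errors are 10, -10, 40 and -30, so the MAE is 22.5 while the RMSE is about 26: the two large misses raise the RMSE more than the MAE, the same pattern seen in the report's 53.2 vs. 35.6.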

The residuals plotted below showed a linear trend, with a few points falling outside the main cloud of data. This means the residuals contain structure that the model is not capturing. Maybe there is an interaction between two variables that was not considered, or maybe the data has a time-dependent component.

![Linear Regression Residual Plot Image](/Images/Linear Regression Residual Plot.png)

The histogram of the residual error of the test set was mostly normally distributed, close to a Gaussian distribution centered around zero, with a slight skew to the right. This shows the linear model predicts prices slightly higher than expected.

![Linear Regression Residual Error Histogram Image](/Images/Linear Regression Risidual Histogram of Test Set.png)

The probability plot, used to check the normality of the residual errors, shows that the residuals are not normally distributed: there are extreme values at both ends, visible as outliers on the upper and lower sides of the quantile plot. These values should be removed or another model should be used.
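
One way to compute the points of such a probability plot is `scipy.stats.probplot`. The sketch below uses synthetic residuals with a few extreme values added by hand, so it only illustrates the mechanics and assumes SciPy is available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic residuals: mostly normal noise plus three extreme values
residuals = np.concatenate([rng.normal(0, 50, size=500), [400.0, 450.0, -350.0]])

# Ordered residuals paired with the quantiles a normal distribution would give
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
print(round(r, 4))
```

Points that bend away from the fitted straight line at either end are the heavy tails described above. Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) draws the figure directly.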

![Linear Regression Probability Plot Image](/Images/Linear Regression Probability Plot.png)

The linear regression analysis is summarized here without the complete Python code. The full notebook can be found on my GitHub: https://github.com/CFrensko/Capstone-Project-1/blob/master/Prediction%20%26%20Evaluation%20Chicago%20AirBNB%20Price%20Linear%20Model%20FINAL.ipynb

### 4.2 SUPERVISED LEARNING MODEL COMPARISON

Since the linear regression model's performance was modest, I decided to apply a variety of supervised regression models. I used the same train/test split as for the linear regression model (80/20) and fit the training data with decision tree, random forest, k-nearest neighbors, multi-layer perceptron, gradient boosting, lasso regression, ridge regression and elastic net to predict prices. This strategy allowed me to immediately determine which model was worth looking into further.
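
A comparison like this can be written as a loop over scikit-learn estimators. The sketch below runs on synthetic data with mostly default settings; the estimator settings shown (`n_estimators=50`, `max_iter=500`, the seeds) are arbitrary choices for a quick run, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with one interaction term that linear models cannot capture
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 100 + 30 * X[:, 0] + 15 * X[:, 1] * X[:, 2] + rng.normal(scale=10, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "k-nearest neighbors": KNeighborsRegressor(),
    "multi-layer perceptron": MLPRegressor(max_iter=500, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
    "lasso": Lasso(),
    "ridge": Ridge(),
    "elastic net": ElasticNet(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out test set

for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>24}: {r2:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair, since each one sees identical training rows and is judged on identical test rows.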

![Chicago Data RF R2 Image](/Images/Chicago Data RF R2.png)

The best performing model was random forest with an $R^{2}$ value of 69% on the test set, with decision tree (the building block of random forest) in second place with an $R^{2}$ value of 64%. The worst performers were lasso and elastic net. The random forest's RMSE was 42.5, a 20% improvement over the linear regression model. This means the random forest's typical prediction error on the test set is about 42.50 dollars.
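
The 20% figure follows directly from the two RMSE values reported in this document, as this short calculation shows:

```python
linear_rmse = 53.2029  # linear regression RMSE from section 4.1
forest_rmse = 42.5     # random forest RMSE from this section

# Relative reduction in RMSE compared with the linear model
improvement = (linear_rmse - forest_rmse) / linear_rmse
print(f"{improvement:.1%}")  # about 20%
```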

![Chicago DF Supervised Learning Image](/Images/Chicago DF Supervised Learning.png)

The supervised learning model comparison is summarized here without the complete Python code. The full notebook can be found on my GitHub: https://github.com/CFrensko/Capstone-Project-1/blob/master/Prediction%20%26%20Evaluation%20WHOLE%20AirBNB%20Chicago%20PRICE%20Supervised%20Regression%20Models%20FINAL.ipynb

### 4.3 RANDOM FOREST

When performing a feature importance analysis on the random forest, the top weighted features for determining price were the room type being an entire house or apartment at 33%, reviews at 19%, number of bedrooms at 14%, the minimum stay at 5%, the overall satisfaction at 4% and the neighborhood Near North at 2%, with subsequent neighborhoods trailing lower to 0%. Shown below are the top 30 out of 196 features.
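
In scikit-learn, a ranking of this kind can be read from the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic listings and an invented price formula, so the resulting weights are illustrative only; `minstay` is an assumed column name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
# Synthetic features named after columns in the cleaned data set
features = pd.DataFrame({
    "room_type_Entire home/apt": rng.integers(0, 2, size=n),
    "reviews": rng.integers(3, 200, size=n),
    "bedrooms": rng.integers(1, 5, size=n),
    "minstay": rng.integers(1, 8, size=n),
})
# Invented price formula, used only to give the forest something to learn
price = (
    60
    + 70 * features["room_type_Entire home/apt"]
    + 25 * features["bedrooms"]
    - 0.1 * features["reviews"]
    + rng.normal(scale=15, size=n)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(features, price)

# Impurity-based importances sum to 1 across all features
importances = pd.Series(forest.feature_importances_, index=features.columns)
top_features = importances.sort_values(ascending=False).head(30)
print(top_features)
```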

![Chicago DF Random Forest Features Image](/Images/Chicago DF Random Forest Features.png)

A few things to consider when using an impurity-based importance ranking: it is biased towards variables with more categories, and it behaves differently when ranking two or more correlated variables. If variables are correlated, the model has no preference for one over the other when choosing a split. Once one of them has been used, the importance of the other correlated variables is significantly reduced. This is fine when we want to use feature selection to reduce overfitting, but when interpreting the data it can lead to incorrectly labeling variables as strong vs. weak predictors.

![Chicago DF Random Forest Features Bar Plot Image](/Images/Chicago DF Random Forest Features Bar Plot.png)

Next I calculated and plotted the residuals of both the train and test sets to observe the estimates of experimental error, obtained by subtracting the observed responses from the predicted responses. The residuals were fairly evenly distributed around the zero line, indicating a good fit, but there is still a bit of structure at the ends.

![Chicago DF RF Residual Plot Image](/Images/Chicago DF RF Residual Plot.png)

I also plotted the histogram of the residuals, shown below. It shows the residuals skewing slightly to the left, indicating that the random forest model is underpredicting prices.

![Chicago DF RF Residual Histogram Image](/Images/Chicago DF RF Residual Histogram.png)

The random forest analysis is summarized here without the complete Python code. The full notebook can be found on my GitHub: https://github.com/CFrensko/Capstone-Project-1/blob/master/Prediction%20%26%20Evaluation%20Chicago%20AirBNB%20Price%20Random%20Forest%20FINAL.ipynb

### 4.4 RANDOM FOREST HYPERPARAMETER TUNING

Next I tuned the random forest hyperparameters to find the best model for this set of data. A model may perform well on the training set, but if it is overfit it will be useless in a real application. Hyperparameter optimization guards against overfitting by scoring each candidate with cross validation.

I set tuning values for the number of estimators/trees in the random forest (start = 200, stop = 500, num = 10), the number of features to consider at every split ('auto', 'sqrt'), the maximum number of levels in each tree (start = 10, stop = 110, num = 11), the minimum number of samples required to split a node (2, 5, 10), the minimum number of samples required at each leaf node (1, 2, 4), and the method of selecting samples for training each tree (bootstrap = [True, False]). I used a random search over these parameters with 3-fold cross validation, fitting each of 100 candidates for a total of 300 fits. The best parameters found for the model are shown below.
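
The search described above maps naturally onto scikit-learn's `RandomizedSearchCV`. The sketch below only builds the search object; the `random_state` and `n_jobs` values are arbitrary choices. One substitution is made and labeled in the code: the value `'auto'` is replaced by `1.0`, because recent scikit-learn releases no longer accept `'auto'` for regressors and `1.0` (all features) is the equivalent setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    # 10 evenly spaced tree counts between 200 and 500
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=500, num=10)],
    # Assumption: 1.0 stands in for the 'auto' option listed in the text
    "max_features": [1.0, "sqrt"],
    # 11 evenly spaced depth limits between 10 and 110
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=100,      # 100 randomly sampled candidates
    cv=3,            # 3-fold cross validation
    random_state=42,
    n_jobs=-1,
)

total_fits = search.n_iter * search.cv
print(total_fits)
# Calling search.fit(X_train, y_train) would run the search; the winning
# combination is then available as search.best_params_.
```

A random search samples 100 combinations instead of trying all 3,960 in the grid, which keeps the run time manageable with forests of several hundred trees.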

![Random Forest Best Parameters Image](/Images/Random Forest Best Parameters.png)

With the tuned parameters for the random forest model in hand, I compared the tuned model against the default parameters. The tuned model performed 3 percentage points better, with 72% of the variability explained by the features.

![Random Forest Hyperparameter Comparison Image](/Images/Random Forest Hyperparameter Comparison.png)

The hyperparameter tuning is summarized here without the complete Python code. The full notebook can be found on my GitHub: https://github.com/CFrensko/Capstone-Project-1/blob/master/Prediction%20%26%20Evaluation%20Chicago%20AirBNB%20Price%20Hyperparameter%20Tuning.ipynb

## 5. CONCLUSIONS

Our results may be summarized as follows:

* Using the random forest model to predict prices on the test set gave us an $R^{2}$ value of 0.6952.
* This indicates 69.52% of the variability in price can be explained using this model. 
* With hyperparameters tuned the model gives us a 72% explanation of the variability of the test set. 
* This is a pretty good indicator of price considering the vast number of variables that influence Chicago's rentals. 
* However there is still much room for improvement! 

### 5.1 FUTURE WORK

Since the focus of this work is not only to predict the price of Airbnb rentals but also to understand the fluctuation in the variables, more research is needed. In particular:

* To further improve the model, I would recommend scraping Airbnb's website to gather a larger, more robust set of data.
* Specifically, more data should be gathered for a continuous set of dates throughout the year to use for a time series analysis. This would determine if events, weather or season have an influence on the fluctuation of price.
* It would also be helpful to include amenities such as kitchen/laundry and facilities such as parking in the next analysis.
* Adding data on the monthly or annual rental cost of homes in Chicago could also improve the model's accuracy.
* This improved model can then be placed in an application to generate key price ranges dependent on the type of home you are looking to rent or list!
* The model can be scaled to create a price model for apartments on longer term leases for those interested in monthly or yearly leases.

### 5.2 CLIENT RECOMMENDATIONS

Once these improvements have been made, the model can be used to generate an even more accurate prediction of prices for Airbnb home rentals in Chicago.

* Before you book that trip to a new city check out the price ranges for select neighborhoods you plan on visiting. Is it a fair price for that neighborhood and amenities? If so, is it within your budget for an authentic experience? Maybe look at different weeks or times of year that typically offer lower rates.

* If you are hosting you will easily structure the pricing of your unit with fair market values listed. You will anticipate market trends and have the option to make a bit more during events in your area.

* Also if you are looking at long term leases in unfamiliar neighborhoods you can pull out the application to help you navigate the negotiation for a more affordable price based on similar listings!

## 6. ACKNOWLEDGEMENTS

Special thanks to Springboard & my mentor AJ Sanchez! 