# CS3315 Final Project
# Authors: Cameron Woods and Micky Hall 

## Introduction 

Throughout this notebook we attempt to build a model, using supervised learning techniques, that can accurately predict the nightly price of an Airbnb listing. We use a dataset from Kaggle that contains 226,029 rows. Each row represents an individual Airbnb listing and includes 15 features and a label, which are discussed in depth later in this notebook.

### General Process 

We began this project by lightly processing our data and fitting it to a bare-bones linear regression model to see how well it would perform. We then analyzed the results and looked for the reasons behind the large skew in our predictions. Next, we went back to the data as a whole and munged it into a more palatable set for future models, checking for any increase in performance along the way. From there we looked into feature engineering and hyperparameter tuning for a better fit, slowly working our way to a better model. Finally, we ran our data through a neural network to see how well it would predict our label. All in all, our models ended up predicting with an MSE of %%%.

In [1]:
'''
Data loading and initial scrub. 
These features have been dropped because they can either be accurately captured 
in another feature, we felt they were irrelevant to our prediction, or they were too
sparse to be able to munge
'''

import pandas as pd 
df = pd.read_csv("AB_US_2020.csv")
# For referencing later
unclean_df = df 

df = df.drop(["name","host_name", "neighbourhood_group","city","neighbourhood",
                "last_review","id"],axis=1)

df["reviews_per_month"] = df["reviews_per_month"].fillna(0)


In [2]:
'''
One-hot encode the room_type feature to represent the different types
of property you can rent
'''

def oneHot(category, hot):
    if category == hot:
        return 1
    else:
        return 0

# Add one 0/1 indicator column per distinct room type.
# unique() avoids building a throwaway dictionary that shadows the built-in dict.
for room in df['room_type'].unique():
    df[room] = df['room_type'].apply(oneHot, hot=room)

df = df.drop(['room_type'],axis=1)


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop zero and negative prices, if any
df = df[df.price > 0]
# Drop the highest price, likely an outlier
df = df[df.price < 24999]

# Split data into features and label 
X = df.drop(["price"],axis=1)
y = df["price"]

# Create a training and a validation set of data. 80/20 split 
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

# Scale the data for better predictions 
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_val = sc.transform(X_val)



In [4]:
below50 = df[df.price <= 50]
above500 = df[df.price >= 500]
print(len(below50))
print(len(above500))
print(225940/4)
y.describe()


25639
0
56485.0


count    210164.000000
mean        142.301136
std          95.469315
min          21.000000
25%          75.000000
50%         115.000000
75%         185.000000
max         499.000000
Name: price, dtype: float64

## Initial Cleaning of Data

We began with a dataset that contained the attributes:

id, name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count, availability_365, and city. 

After reviewing the dataset we decided that it would be best to drop name, host_name, neighbourhood_group, city, neighbourhood, last_review, and id. We dropped name and host_name because they are similar across many dissimilar listings and the value they represent is better captured by host_id. We dropped neighbourhood_group, city, and neighbourhood because many of these values were missing, and the location they describe is already represented by the latitude and longitude values. We dropped last_review because it was just the last review of the property, and converting it into a meaningful attribute would require some form of semantic analysis. Finally, we dropped id because it is just a unique identifier for each listing and holds no real value for the models.
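
One way to back up the "too sparse" argument is to measure the share of missing values per column before dropping anything. The following is a minimal sketch on a small made-up frame; the rows and the name `missing_share` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy frame standing in for a few columns of the listings data
# (all values are made up for illustration)
toy = pd.DataFrame({
    "neighbourhood_group": [None, None, "Manhattan", None],
    "reviews_per_month": [1.0, None, None, 0.3],
    "latitude": [40.7, 34.1, 40.8, 47.6],
})

# isna() marks missing cells; the column-wise mean is the missing fraction
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on the full `df` would show which columns are sparse enough to justify dropping.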

After dropping these attributes we one-hot encoded the room_type attribute so that every property type is represented by its own column, and we filled all empty values in reviews_per_month with 0, since an empty value most likely means the listing has no reviews.
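
The encoding and fill steps described above can also be written with pandas built-ins. This is a minimal sketch on a made-up frame, using `pd.get_dummies` as a stand-in for the hand-written `oneHot` helper; the rows and the names `toy`, `dummies`, and `encoded` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are made up)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "reviews_per_month": [1.5, None, 0.2, None],
})

# A missing reviews_per_month is treated as "no reviews", as in the notebook
toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)

# One indicator column per room type; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(toy["room_type"], dtype=int)
encoded = pd.concat([toy.drop(columns=["room_type"]), dummies], axis=1)

print(encoded)
```

Each row ends up with exactly one indicator set to 1, which is the same result the loop over room types produces.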

After dropping and correcting our values we separated the label from the features, split both into training and validation sets, and then scaled the features for our models.
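
The order of the split and scale steps matters: the scaler should learn its mean and variance from the training split only, so that no information from the validation set leaks into training. Below is a minimal sketch of that pattern on synthetic data; `X_demo`, `y_demo`, and the other names are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and price label (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y_demo = rng.uniform(20.0, 500.0, size=1000)

# 80/20 split, mirroring the notebook's settings
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Fit the scaler on the training split only, then reuse its statistics
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_va_scaled = scaler.transform(X_va)

# Training columns are centred at roughly 0; validation columns are only close to 0
print(X_tr_scaled.mean(axis=0))
print(X_va_scaled.mean(axis=0))
```

The label is left unscaled here, as in the notebook, so the RMSE stays in price units.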

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
from math import sqrt

# Ordinary least-squares linear regression (not SGD, despite the variable name)
sgd_reg = LinearRegression()
sgd_reg.fit(X_train, y_train) 
y_val_predict = sgd_reg.predict(X_val)
val_error = sqrt(mean_squared_error(y_val, y_val_predict))
print("The RMSE for our Linear Regression Model is {}".format(val_error))

The RMSE for our Linear Regression Model is 504.1372989338935


In [None]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()

forest.fit(X_train, y_train)
y_val_predict_forest = forest.predict(X_val)
val_error = sqrt(mean_squared_error(y_val, y_val_predict_forest))
print("The RMSE for our Random Forest Regressor is {}".format(val_error))

## Initial running of models and light analysis  

After lightly cleaning our data we ran it through a Linear Regressor and a Random Forest Regressor without any hyperparameter tuning and ended up with:

RMSE of Linear Regressor = 504.1366
RMSE of Random Forest Regressor = 379.4682 

When we initially look at this, it seems that we are making decent predictions, considering we are predicting within 1 standard deviation of error. However, when you look at the data, our 75th percentile starts at a price of around 200, so errors of this size mean we are grossly over-predicting for most of our data. From here we can start to look at the data and munge it some more to try to get better estimates.
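
A useful reference point for the "within 1 standard deviation" reading is a baseline that always predicts the mean price: its RMSE equals the standard deviation of the labels, so a model scoring around one standard deviation is doing little better than that baseline. The sketch below illustrates this with made-up prices; `y_true` and `baseline_pred` are hypothetical, and for simplicity the mean is taken from the same values it is scored on.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical validation prices, made up for illustration
y_true = np.array([60.0, 75.0, 115.0, 185.0, 450.0])

# Baseline: predict the mean price for every listing
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_true, baseline_pred))

# For a mean predictor, RMSE matches the population standard deviation
print(baseline_rmse, y_true.std())
```

Comparing each model's RMSE against this baseline gives a clearer sense of how much the features actually contribute.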