# CS3315 Final Project
# Authors: Cameron Woods and Micky Hall 

## Introduction 

Throughout this notebook we will be attempting to create a model using supervised learning techniques that can accurately predict the price of an airbnb for a night. We will be using a dataset from kaggle that includes 226,029 rows. Each row represents an individual airbnb listing and includes 15 features and a label which will be discussed in depth at a later point in this notebook. 

### General Process 

We began this project by taking all of our unprocessed data and attempting to fit it into a bare bones model using linear regression to see how well it would perform. We then analysed the results and looked for reasons in the large skew in our predictions. We then went back and looked at the data as a whole and began to munge our data into a more palatable set for future models. We then checked for any increase in performance from our model. We then began to look into feature engineering and hyper parameter tuning for a greater fit. We slowly worked our way to a better model. We then decided to attempt to run our data in a neural network and see how well that would predict our label. All in all our models ended up predicting with a mse of %%%.  

In [84]:
'''
Data loading and initial scrub. 
These features have been dropped because they can either be accurately captured 
in another feature, we felt they were irrelevant to our prediction, or they were too
sparse to be able to munge
'''

import pandas as pd 
df = pd.read_csv("AB_US_2020.csv")
# For refrencing later 
unclean_df = df 

df = df.drop(["name","host_name", "neighbourhood_group","city","neighbourhood",
                "last_review","id"],axis=1)

df["reviews_per_month"] = df["reviews_per_month"].fillna(0)


In [85]:
'''
One hot encode the room type feature to be able to represent the differnt type
of property you can rent
'''

def oneHot(category, hot):
    if category == hot:
        return 1
    else:
        return 0

dict={}
for room in df['room_type'].tolist():
    dict[room]=1
    
for room in dict.keys():
    df[room] = df['room_type'].apply(oneHot, hot=room)

df = df.drop(['room_type'],axis=1)


In [86]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer

#drop zeros and negative prices, if any
df = df[df.price > 75]
#drop highest price, likely an outlier
df = df[df.price < 200]

X = df.drop(["price"],axis=1)
y = df["price"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

sc = StandardScaler()
#X_train = sc.fit_transform(X_train)
#X_val = sc.transform(X_val)



In [87]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
from math import sqrt

sgd_reg = LinearRegression()
sgd_reg.fit(X_train, y_train) 
y_val_predict = sgd_reg.predict(X_val)
val_error = sqrt(mean_squared_error(y_val, y_val_predict))
print(val_error)

32.913017358916726


In [19]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()

forest.fit(X_train, y_train)
y_val_predict_forest = forest.predict(X_val)
val_error = sqrt(mean_squared_error(y_val, y_val_predict))
print("The RMSE for our Random Forest Regressor is {}".format(val_error))

The RMSE for our Random Forest Regressor is 150130.1425196851


In [None]:
df2 = pd.read_csv("AB_US_2020.csv")
del df2["name"]
del df2["host_name"]
del df2["neighbourhood_group"]
del df2["city"]
del df2["neighbourhood"]
del df2["last_review"]

dict2={}
for room in df2['room_type'].tolist():
    dict2[room]=1
    
for room in dict2.keys():
    df2[room] = df2['room_type'].apply(oneHot, hot=room)

df2 = df2.drop(['room_type'],axis=1)

In [None]:
#drop zeros and negative prices, if any
df2 = df2[df2.price > 0]
#drop highest price, likely an outlier
df2 = df2[df2.price < 24999]

In [None]:
df2.info()

In [None]:
# Redo basic linear regression

df2["reviews_per_month"]=df2["reviews_per_month"].fillna(0)

df2.to_csv("munge_plus_reviewspermonth.csv")

X2 = df2.drop(["price"],axis=1)
y2 = df2["price"]
X2_train, X2_val, y2_train, y2_val = train_test_split(X2, y2, test_size=0.20, random_state=42)


sc = StandardScaler()
X2_train = sc.fit_transform(X2_train)
X2_val = sc.transform(X2_val)


In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
from math import sqrt

sgd_reg = LinearRegression()
sgd_reg.fit(X2_train, y2_train) 
y2_val_predict = sgd_reg.predict(X2_val)
val_error = sqrt(mean_squared_error(y2_val, y2_val_predict))
print(val_error)

NameError: name 'X2_train' is not defined

In [12]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=80, min_samples_leaf=2, max_features="log2", bootstrap=False)

forest.fit(X_train, y_train)
y_val_predict_forest = forest.predict(X_val)
print(np.sqrt(mean_squared_error(y_val,y_val_predict_forest)))

371.9635192644459


In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
import time
n_estimators = [int(x) for x in np.arange(start=10, stop = 100, step = 10)]
max_features = ["log2"]
min_samples_leaf = [2]
bootstrap = [False]
max_depth = [100,150,200,250,300,None]
random_grid = {"n_estimators": n_estimators,
               "max_features": max_features,
               "min_samples_leaf": min_samples_leaf,
               "bootstrap": bootstrap,
               "max_depth":max_depth}

rand = RandomForestRegressor()
start = time.time()
rand_search = RandomizedSearchCV(estimator = rand, param_distributions=random_grid, n_iter = 15, cv=5, verbose =2, random_state=42, n_jobs=-1)
rand_search.fit(X_train,y_train)
rand_search.best_params_
end = time.time()
elapsed = end - start
print(elapsed)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:  6.5min finished
424.6918635368347


In [11]:
print(rand_search.best_params_)

{'n_estimators': 80, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': None, 'bootstrap': False}
