# CS3315 Final Project
# Authors: Cameron Woods and Micky Hall 

## Introduction 

Throughout this notebook we attempt to build a model, using supervised learning techniques, that can accurately predict the nightly price of an Airbnb listing. We use a dataset from Kaggle that includes 226,029 rows. Each row represents an individual Airbnb listing and includes 16 attributes and a label (price), all of which are discussed in depth later in this notebook.

### General Process 


We began this project by lightly processing our data and fitting it to a bare-bones linear regression model to see how well it would perform. We then analysed the results and looked for the causes of the large skew in our predictions. Next we went back to the data as a whole and munged it into a more palatable set for future models, checking for any increase in performance along the way. From there we looked into feature engineering and hyperparameter tuning for a better fit, slowly working our way to a better model. Finally, we ran our data through a neural network to see how well that would predict our label. All in all, our models ended up predicting with an MSE of %%%.

In [1]:
import time
import pandas as pd 
import numpy as np
from math import sqrt
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from keras.wrappers.scikit_learn import KerasRegressor
import tensorflow as tf
import tensorflow.keras as keras


In [2]:
'''
Data loading and initial scrub. 
These features have been dropped because they can either be accurately captured 
in another feature, we felt they were irrelevant to our prediction, or they were too
sparse to be able to munge
'''
df = pd.read_csv("AB_US_2020.csv")
# For referencing later: keep an untouched copy of the raw data
unclean_df = df.copy()

df = df.drop(["name","host_name","city","neighbourhood",
                "last_review","id","neighbourhood_group"],axis=1)

df["reviews_per_month"] = df["reviews_per_month"].fillna(0)


In [3]:
'''
One-hot encode the room_type feature to represent the different types
of property you can rent
'''
def oneHot(category, hot):
    # 1 if this row's category matches the column being built, else 0
    return 1 if category == hot else 0

# Unique room types in order of first appearance (avoids shadowing the built-in dict)
room_types = df['room_type'].unique()

for room in room_types:
    df[room] = df['room_type'].apply(oneHot, hot=room)

df = df.drop(['room_type'],axis=1)


In [4]:

#drop zeros and negative prices, if any
df = df[df.price > 0]
#drop highest price, likely an outlier
df = df[df.price < 24999]

def data_dump(df):
    # Split features from the label, hold out 20% for validation, and standardize the features
    X = df.drop(["price"],axis=1)
    y = df["price"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)
    # Fit the scaler on the training split only so no validation statistics leak in
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_val = sc.transform(X_val)
    return X,y,X_train,y_train,X_val,y_val

def data_loss(original_df, current_df):
    # Report how many rows have been dropped relative to the raw dataset
    orig = original_df.shape[0]
    clean = current_df.shape[0]
    dropped = orig-clean
    perc = (dropped/orig) *100
    print("Shape of our current dataset = {}".format(clean))
    print("Shape of the original dataset = {}".format(orig))
    print("So we have only dropped {} rows, which is {} percent of our data".format(dropped,perc))

X,y,X_train,y_train,X_val,y_val = data_dump(df)
data_loss(unclean_df,df)

Shape of our current dataset = 225940
Shape of the original dataset = 226029
So we have only dropped 89 rows, which is 0.03937547836782006 percent of our data


## Initial Cleaning of Data

We began with a dataset that contained the attributes:

id, name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count, availability_365, and city. 

After reviewing the dataset we decided that it would be best to drop name, host_name, neighbourhood_group, city, neighbourhood, last_review, and id. We dropped name and host_name because they are similar across many dissimilar listings and the value they represent is better captured by host_id. We dropped neighbourhood_group, city, and neighbourhood because many of these values were missing, and the location they describe can also be represented by the latitude and longitude values given. We dropped last_review because it was just the last review of the property, which would require some form of semantic analysis to convert into a meaningful attribute. Finally, we dropped id because it is just a unique identifier for each listing and holds no real value for the models.

After dropping these attributes we one-hot encoded the room_type attribute so that all property types could be represented, and we filled in all empty values in reviews_per_month with 0, since an empty value is likely to represent no reviews.
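The encoding and fill steps described above can also be sketched with pandas built-ins. This is a minimal illustration, not the notebook's own cell: the `toy` frame and its values are made up, and `pd.get_dummies` is swapped in for the hand-written `oneHot` helper.

```python
import pandas as pd

# Toy listings; the values are invented for illustration only
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "reviews_per_month": [1.2, None, 0.4, None],
    "price": [150, 60, 35, 80],
})

# A missing reviews_per_month is treated as "no reviews yet"
toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)

# One indicator column per room type, named after the category itself
dummies = pd.get_dummies(toy["room_type"], dtype=int)
encoded = pd.concat([toy.drop(columns=["room_type"]), dummies], axis=1)
print(encoded)
```

Each row ends up with exactly one indicator column set to 1, which is the same representation the loop over room types produces.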

After dropping and correcting our values we split the label from the features, split both into training and validation sets, and then scaled the features for our models.
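The split-then-scale order matters: the scaler should learn its mean and variance from the training rows only. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and the coefficients are assumptions made up for the example, not values from the Airbnb dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and label, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Hold out 20% of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# Fit the scaler on the training rows, then reuse it unchanged on the validation rows
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_va_s = scaler.transform(X_va)

print(X_tr_s.mean(axis=0).round(6))  # training columns are centred on zero
print(X_va_s.mean(axis=0).round(3))  # validation columns are only approximately centred
```

Keeping the fitted `scaler` object around is a useful habit, since any other data passed to a model trained this way needs the same transform applied first.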

In [5]:
def regressor_tester(reg, X_train, y_train, X_val,y_val,y, name,fit=True):
    # Optionally fit, then report validation RMSE, MAE, and r2 next to the label's mean and std
    if fit:
        reg.fit(X_train, y_train)
    y_val_predict = reg.predict(X_val)
    RMSE = sqrt(mean_squared_error(y_val, y_val_predict))
    MAE = mean_absolute_error(y_val,y_val_predict)
    r2 = r2_score(y_val,y_val_predict)
    mean = y.mean()
    std_dev = y.std()
    print("For our {} Regressor the RMSE is {}, MAE is {}, and r2 is {}".format(name,RMSE,MAE,r2))
    print("The current mean value for our label is {} and a single standard deviation is {}".format(mean,std_dev))
    return y_val_predict

In [None]:
lin = LinearRegression()
lin_predict = regressor_tester(lin,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

## Initial running of models and light analysis  


After lightly cleaning our data we ran it through a Linear Regressor and a Random Forest Regressor without any hyperparameter tuning and ended up with:

RMSE of Linear Regressor = 504.1366
RMSE of Random Forest Regressor = 379.4682 

At first glance it seems that we are making decent predictions, considering our error is within 1 standard deviation of the label. However, when you look at the data, our 75th percentile sits at a price of around 200, so an RMSE in the hundreds means our predictions are grossly off for most of our data. From here we can start to look at the data and munge it some more to try to get better estimates.
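One way to see why "within 1 standard deviation" is a low bar: a model that always predicts the mean price has an RMSE equal to the (population) standard deviation of the label. The sketch below shows this on a made-up, right-skewed set of prices; the numbers are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Invented prices: most are modest, one is extreme
prices = np.array([40, 55, 60, 75, 80, 95, 110, 150, 200, 5000], dtype=float)

# Baseline "model" that always predicts the mean price
baseline = np.full_like(prices, prices.mean())

rmse = np.sqrt(mean_squared_error(prices, baseline))
mae = mean_absolute_error(prices, baseline)

print("mean = {:.1f}, population std = {:.1f}".format(prices.mean(), prices.std()))
print("baseline RMSE = {:.1f}, MAE = {:.1f}".format(rmse, mae))
print("75th percentile = {:.1f}".format(np.percentile(prices, 75)))
```

Because a single extreme price inflates both the mean and the standard deviation, comparing the error against the percentiles of the label gives a more honest picture than comparing it against the standard deviation alone.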

In [None]:
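#pred_df = pd.DataFrame({"Actual":y_val,"Predict":lin_predict})
#pred_df["Prediction Error"] = pred_df["Predict"] - pred_df["Actual"]
#sns.lmplot(x="Actual",y="Prediction Error",data=pred_df)
#pred_df = pd.DataFrame({"Actual":y_val,"Predict":forest_predict})
#pred_df["Prediction Error"] = pred_df["Predict"] - pred_df["Actual"]
#sns.lmplot(x="Actual",y="Prediction Error",data=pred_df)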

In [None]:
plt.rcParams["figure.figsize"] = 12,12
sns.heatmap(df.corr(),annot=True)

In [None]:
#sns.histplot(data=df, x="price",kde=True)

In [None]:
#drop higher priced listings 
df = df[df.price < 2500]
#sns.histplot(data=df, x="price",kde=True)


In [None]:
df = df[df.price < 1000]
#sns.histplot(data=df, x="price",kde=True)

In [None]:
#plt.rcParams["figure.figsize"] = 12,12
#sns.heatmap(df.corr(),annot=True)

In [None]:
# after having dropped prices > 1000
X,y,X_train,y_train,X_val,y_val = data_dump(df)
data_loss(unclean_df,df)

In [None]:
lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

In [None]:
#pred_df = pd.DataFrame({"Actual":y_val,"Predict":lin_predict})
#pred_df["Prediction Error"] = pred_df["Predict"] - pred_df["Actual"]
#sns.lmplot(x="Actual",y="Prediction Error",data=pred_df)
#pred_df = pd.DataFrame({"Actual":y_val,"Predict":forest_predict})
#pred_df["Prediction Error"] = pred_df["Predict"] - pred_df["Actual"]
#sns.lmplot(x="Actual",y="Prediction Error",data=pred_df)

In [None]:
df = df.drop(["minimum_nights"],axis=1)
X,y,X_train,y_train,X_val,y_val = data_dump(df)
data_loss(unclean_df,df)

In [None]:
lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

In [None]:
df = df.drop(["host_id"],axis=1)
X,y,X_train,y_train,X_val,y_val = data_dump(df)
data_loss(unclean_df,df)

In [None]:
lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

In [6]:
df = df[df.price < 600]
df = df.drop(["Hotel room"],axis=1)
X,y,X_train,y_train,X_val,y_val = data_dump(df)
data_loss(unclean_df,df)

Shape of our current dataset = 214702
Shape of the original dataset = 226029
So we have only dropped 11327 rows, which is 5.011303859239301 percent of our data


In [10]:
cols = list(df.columns)
cols.sort() 
df = df[cols]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214702 entries, 0 to 226028
Data columns (total 12 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Entire home/apt                 214702 non-null  int64  
 1   Private room                    214702 non-null  int64  
 2   Shared room                     214702 non-null  int64  
 3   availability_365                214702 non-null  int64  
 4   calculated_host_listings_count  214702 non-null  int64  
 5   host_id                         214702 non-null  int64  
 6   latitude                        214702 non-null  float64
 7   longitude                       214702 non-null  float64
 8   minimum_nights                  214702 non-null  int64  
 9   number_of_reviews               214702 non-null  int64  
 10  price                           214702 non-null  int64  
 11  reviews_per_month               214702 non-null  float64
dtypes: float64(3), int64(9)

In [11]:
X,y,X_train,y_train,X_val,y_val = data_dump(df)
data_loss(unclean_df,df)

lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

Shape of our current dataset = 214702
Shape of the original dataset = 226029
So we have only dropped 11327 rows, which is 5.011303859239301 percent of our data
For our Linear Regressor the RMSE is 95.44174486620469, MAE is 68.10286878885323, and r2 is 0.21046828039987608
The current mean value for our label is 148.66011029240528 and a single standard deviation is 107.94989078801439
For our Random Forest Regressor the RMSE is 78.87424300602446, MAE is 52.9542321146689, and r2 is 0.4607834047603928
The current mean value for our label is 148.66011029240528 and a single standard deviation is 107.94989078801439


In [None]:
#Convert neighbourhood into a label encoded feature
labelencoder = LabelEncoder()
df["neighbourhood"] = unclean_df["neighbourhood"]
df["neighbourhood"] = labelencoder.fit_transform(df["neighbourhood"])
X,y,X_train,y_train,X_val,y_val = data_dump(df)
data_loss(unclean_df,df)

In [None]:
lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

## After Data Analysis 

In [None]:

n_estimators = list(range(10,101,10))
# "auto" was removed in newer scikit-learn; for a regressor it meant all features, i.e. 1.0
max_features = ["log2",1.0,"sqrt"]
min_samples_leaf = [1,2,4,6]
bootstrap = [False,True]
max_depth = [100,150,200,250,300,None]
random_grid = {"n_estimators": n_estimators,
               "max_features": max_features,
               "min_samples_leaf": min_samples_leaf,
               "bootstrap": bootstrap,
               "max_depth":max_depth}

forest = RandomForestRegressor()
forest_rs = RandomizedSearchCV(estimator = forest, param_distributions=random_grid, n_iter = 50, cv=5, verbose =2, random_state=42, n_jobs=-1)
forest_rs.fit(X_train,y_train)
print(forest_rs.best_params_)

In [None]:
forest_rs_predict = regressor_tester(forest_rs,X_train,y_train,X_val,y_val,y,"Random Search Random Forest",False)


In [None]:

# colsample_bytree is a fraction of columns and must lie in (0, 1], so values above 1 are dropped
params = {"min_child_weight": [1,5,10,20,50],
          "gamma":[0.001,0.01,0.1,1,10,100],
          "subsample":[0.5,0.75,1],
          "max_depth":[1,2,3,10,20,30,100,200,300],
          "colsample_bytree":[0.5,1]}

xgb = XGBRegressor()
xgb_rs = RandomizedSearchCV(xgb,param_distributions=params,n_iter=100,cv=5,n_jobs=-1,verbose=3,random_state=42)
xgb_rs.fit(X_train,y_train)
print(xgb_rs.best_params_)


In [None]:
xgb_rs_predict = regressor_tester(xgb_rs,X_train,y_train,X_val,y_val,y,"Random Search XGB",False)


In [None]:
# tol=None disables the tolerance-based stopping check (np.infty no longer exists in NumPy 2.0)
sgd = SGDRegressor(max_iter=5000,tol=None, warm_start=True,penalty=None,
                             learning_rate="constant",eta0=0.5,early_stopping=True)
sgd_predict = regressor_tester(sgd,X_train,y_train,X_val,y_val,y,"Stochastic Gradient Descent")


In [None]:
#pg128 polynomial regression 

poly_features = PolynomialFeatures(degree=5,include_bias=False)
X_poly = poly_features.fit_transform(X)

X_train, X_val, y_train, y_val = train_test_split(X_poly, y, test_size=0.20, random_state=42)

lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Polynomial Regression")

## After Hyperparameter Tuning 

In [None]:
X,y,X_train,y_train,X_val,y_val = data_dump(df)

def baseline_adam_model():
    nn = keras.models.Sequential()
    nn.add(keras.layers.Dense(30,activation="relu"))
    nn.add(keras.layers.Dense(20,activation="relu"))
    nn.add(keras.layers.Dense(1))
    nn.compile(loss="mean_squared_error",optimizer="adam")
    return nn
def deeper_adam_model():
    nn = keras.models.Sequential()
    nn.add(keras.layers.Dense(30,activation="relu"))
    nn.add(keras.layers.Dense(30,activation="relu"))
    nn.add(keras.layers.Dense(30,activation="relu"))
    nn.add(keras.layers.Dense(30,activation="relu"))
    nn.add(keras.layers.Dense(30,activation="relu"))
    nn.add(keras.layers.Dense(1))
    nn.compile(loss="mean_squared_error",optimizer="adam")
    return nn
early_stop = keras.callbacks.EarlyStopping(patience=10,restore_best_weights=True)

In [None]:
bl_adam_nn = KerasRegressor(build_fn=baseline_adam_model, epochs=100, batch_size=5,verbose=1)
bl_adam_nn.fit(X_train,y_train)
y_val_predict_nn = bl_adam_nn.predict(X_val)
print(np.sqrt(mean_squared_error(y_val,y_val_predict_nn)))



In [None]:
bl_adam_model = baseline_adam_model()
bl_adam_model.fit(X_train,y_train,epochs=100,
            validation_data=(X_val,y_val),
            callbacks=[early_stop],verbose=1)
y_val_predict_nn = bl_adam_model.predict(X_val)
print(np.sqrt(mean_squared_error(y_val,y_val_predict_nn)))


In [None]:
deep_adam_nn = KerasRegressor(build_fn=deeper_adam_model, epochs=100, batch_size=5,verbose=1)
deep_adam_nn.fit(X_train,y_train)
y_val_predict_nn = deep_adam_nn.predict(X_val)
print(np.sqrt(mean_squared_error(y_val,y_val_predict_nn)))

In [None]:
deep_adam_model = deeper_adam_model()
deep_adam_model.fit(X_train,y_train,epochs=100,
            validation_data=(X_val,y_val),
            callbacks=[early_stop],verbose=1)
y_val_predict_nn = deep_adam_model.predict(X_val)
print(np.sqrt(mean_squared_error(y_val,y_val_predict_nn)))

## After Neural Network 

In [None]:
plt.rcParams["figure.figsize"] = 12,12
sns.heatmap(df.corr(),annot=True)

In [None]:
data_loss(unclean_df,df)
X,y,X_train, y_train,X_val,y_val = data_dump(df)

In [None]:
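df = df[df.price<350]
data_loss(unclean_df,df)
# Rebuild the splits so the models below are trained on the filtered data
X,y,X_train,y_train,X_val,y_val = data_dump(df)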

In [None]:
plt.rcParams["figure.figsize"] = 12,12
sns.heatmap(df.corr(),annot=True)

In [None]:
lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

In [None]:
X.info()
df = df.drop(["number_of_reviews","reviews_per_month","calculated_host_listings_count",
              "availability_365","neighbourhood"],axis=1)


In [None]:
X,y,X_train, y_train,X_val,y_val = data_dump(df)
lin_reg = LinearRegression()
lin_predict = regressor_tester(lin_reg,X_train,y_train,X_val,y_val,y,"Linear")

forest = RandomForestRegressor()
forest_predict = regressor_tester(forest,X_train,y_train,X_val,y_val,y,"Random Forest")

## Our attempt at overfitting 

In [12]:
import pandas as pd 
# Seattle listings, loaded separately from the US data in df
seatle_df = pd.read_csv("listings.csv")

#df["reviews_per_month"] = df["reviews_per_month"].fillna(0)

In [13]:
seatle_col = list(seatle_df.columns)
df_col = list(df.columns)
to_del = []
for col in seatle_col:
    if col not in df_col:
        to_del.append(col)
to_del.remove("room_type")
seatle_df = seatle_df.drop(to_del,axis=1)

In [14]:
seatle_df["reviews_per_month"] = seatle_df["reviews_per_month"].fillna(0)

seatle_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   host_id                         3818 non-null   int64  
 1   latitude                        3818 non-null   float64
 2   longitude                       3818 non-null   float64
 3   room_type                       3818 non-null   object 
 4   price                           3818 non-null   object 
 5   minimum_nights                  3818 non-null   int64  
 6   availability_365                3818 non-null   int64  
 7   number_of_reviews               3818 non-null   int64  
 8   calculated_host_listings_count  3818 non-null   int64  
 9   reviews_per_month               3818 non-null   float64
dtypes: float64(3), int64(5), object(2)
memory usage: 298.4+ KB


In [None]:
df.info()

In [None]:
seatle_df.info()

In [15]:
dict={}
for room in seatle_df['room_type'].tolist():
    dict[room]=1
    
for room in dict.keys():
    seatle_df[room] = seatle_df['room_type'].apply(oneHot, hot=room)

seatle_df = seatle_df.drop(['room_type'],axis=1)
cols = list(seatle_df.columns)
cols.sort()
seatle_df = seatle_df[cols]

In [23]:
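# Prices arrive as strings such as "$85.00"; strip "$" and "," before converting to float
seatle_df["price"] = seatle_df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
seatle_df["price"].head()

In [16]:
# Rebuild the scaler used for training (assumes forest was last fit on the current df)
X_us_train, _ = train_test_split(df.drop(["price"],axis=1), test_size=0.20, random_state=42)
sc = StandardScaler().fit(X_us_train)
# Match the training column order, then scale the Seattle features the same way
X = seatle_df.drop(["price"],axis=1)[X_us_train.columns]
y = seatle_df["price"]
seatle_predict = forest.predict(sc.transform(X))
RMSE = sqrt(mean_squared_error(y,seatle_predict))
MAE = mean_absolute_error(y,seatle_predict)
print("RMSE = {} MAE = {}".format(RMSE,MAE))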

## Seattle