# Airbnb Price Forecasting

Data Source: [NYC Airbnb Price Dataset on Kaggle Platform](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data)

This dataset has around 49,000 observations with 16 columns.

Main Steps:

- [Data Preprocessing](#Data-Preprocessing)
- [Multiple Linear Regression Model](#Multiple-Linear-Regression-Model)
- [Random Forest Model](#Random-Forest-Model)
- [Gradient Boosting Model](#Gradient-Boosting-Model)
- [Conclusion](#Conclusion)

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn import ensemble
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score  # R^2, the metric that .score() returns for regressors
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv("AB_NYC_2019.csv")
df

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 36484665 | Charming one bedroom - newly renovated rowhouse | 8232441 | Sabrina | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | NaN | NaN | 2 | 9 |
| 48891 | 36485057 | Affordable room in Bushwick/East Williamsburg | 6570630 | Marisol | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | NaN | NaN | 2 | 36 |
| 48892 | 36485431 | Sunny Studio at Historical Neighborhood | 23492952 | Ilgar & Aysel | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | NaN | NaN | 1 | 27 |
| 48893 | 36485609 | 43rd St. Time Square-cozy single bed | 30985759 | Taz | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | NaN | NaN | 6 | 2 |


In [3]:
df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

# Data Preprocessing

In [4]:
df.fillna({'reviews_per_month': 0}, inplace=True) # listings with no reviews have a missing reviews_per_month; treat it as 0

In [5]:
df.drop(['last_review', 'id', 'host_name'], axis = 1, inplace = True) # drop columns that are not used as features

In [6]:
df = df.loc[df["price"]>0] # keep only listings with a positive price; log10(price) is taken later and is undefined at 0

In [7]:
df.isnull().sum() # Examine missing values

name                              16
host_id                            0
neighbourhood_group                0
neighbourhood                      0
latitude                           0
longitude                          0
room_type                          0
price                              0
minimum_nights                     0
number_of_reviews                  0
reviews_per_month                  0
calculated_host_listings_count     0
availability_365                   0
dtype: int64

## Feature Engineering: Bag-of-Words (BoW)

In [8]:
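# CountVectorizer normally counts how often each token appears in a document.
# With binary = True it records only presence (1) or absence (0) of the token.
# min_df = 0.05 keeps only tokens that appear in at least 5% of the listing names.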
vectorizer = CountVectorizer(analyzer = 'word', 
                             stop_words = 'english', 
                             min_df = 0.05, 
                             binary = True)

df_vectorizer = vectorizer.fit(df.name.fillna(''))

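# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
df_tdm = pd.DataFrame(df_vectorizer.transform(df.name.fillna('')).toarray(),
                      columns = df_vectorizer.get_feature_names(),
                      index = df.index)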

df_tdm

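| | apartment | apt | beautiful | bedroom | brooklyn | cozy | east | manhattan | park | private | room | spacious | studio | sunny | williamsburg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48891 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 48892 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 48893 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |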
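
To make the effect of `binary` and `min_df` concrete, here is a minimal hand-rolled sketch of a binary bag-of-words encoding. It is not scikit-learn's implementation: the listing names, the tiny stop-word list, and the helper names `tokenize` and `binary_bow` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy listing names and a tiny stop-word list (illustrative only).
names = [
    "Cozy room in Brooklyn",
    "Sunny private room",
    "Cozy studio near the park",
    "Spacious Brooklyn apartment",
]
STOP_WORDS = {"in", "the", "near"}


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def binary_bow(docs, min_df):
    """Return (vocabulary, rows) for a presence/absence bag-of-words encoding."""
    token_sets = [set(tokenize(d)) for d in docs]

    # Document frequency: the number of documents that contain each token.
    doc_freq = {}
    for tokens in token_sets:
        for tok in tokens:
            doc_freq[tok] = doc_freq.get(tok, 0) + 1

    # Keep tokens that occur in at least a fraction min_df of the documents.
    vocab = sorted(t for t, n in doc_freq.items() if n >= min_df * len(docs))

    # Binary encoding: 1 if the token is present in the document, else 0.
    rows = [[int(t in tokens) for t in vocab] for tokens in token_sets]
    return vocab, rows


vocab, rows = binary_bow(names, min_df=0.5)
print(vocab)
for row in rows:
    print(row)
```

In this toy data only tokens found in at least half of the names survive the threshold, which is the same pruning that leaves just 15 word columns in the real table.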


In [9]:
df2 = pd.concat([df, df_tdm], axis = 1)

df2.drop(columns = 'name', inplace = True)

df2

| | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | ... | cozy | east | manhattan | park | private | room | spacious | studio | sunny | williamsburg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2787 | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2845 | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4632 | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | 0.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4869 | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 4.64 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 7192 | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 0.10 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 8232441 | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | 0.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48891 | 6570630 | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | 0.00 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 48892 | 23492952 | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | 0.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 48893 | 30985759 | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | 0.00 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## One-Hot Encode Categorical Data

In [10]:
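# handle_unknown = 'ignore' encodes categories unseen during fit as all zeros.
# Note: the 'sparse' argument was renamed to 'sparse_output' in scikit-learn 1.2.
enc = OneHotEncoder(handle_unknown = 'ignore', 
                    sparse = False)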

enc_df = df2.select_dtypes(['object'])

enc = enc.fit(enc_df)

encoded_df = pd.DataFrame(enc.transform(enc_df))

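# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
encoded_df.columns = enc.get_feature_names(enc_df.columns)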
encoded_df.index = enc_df.index

df_Enc = pd.concat([df2.drop(enc_df.columns, axis = 1), encoded_df], axis = 1)

df_Enc.head()

| | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | apartment | ... | neighbourhood_Williamsburg | neighbourhood_Willowbrook | neighbourhood_Windsor Terrace | neighbourhood_Woodhaven | neighbourhood_Woodlawn | neighbourhood_Woodrow | neighbourhood_Woodside | room_type_Entire home/apt | room_type_Private room | room_type_Shared room |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2787 | 40.64749 | -73.97237 | 149 | 1 | 9 | 0.21 | 6 | 365 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 2845 | 40.75362 | -73.98377 | 225 | 1 | 45 | 0.38 | 2 | 355 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 4632 | 40.80902 | -73.9419 | 150 | 3 | 0 | 0.0 | 1 | 365 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 4869 | 40.68514 | -73.95976 | 89 | 1 | 270 | 4.64 | 1 | 194 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 7192 | 40.79851 | -73.94399 | 80 | 10 | 9 | 0.1 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
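
As a rough illustration of what the encoder produces for one column, the sketch below one-hot encodes `room_type` by hand. The helpers `fit_categories` and `one_hot` are hypothetical stand-ins, not scikit-learn functions, and the "Hotel room" value is an invented category used only to show the all-zeros behaviour that `handle_unknown = 'ignore'` asks for.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories, one output column each."""
    return sorted(set(values))


def one_hot(value, categories):
    """Return a 0.0/1.0 indicator per known category; unknown values give all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]


# room_type values taken from the rows displayed earlier in the notebook.
room_types = ["Private room", "Entire home/apt", "Private room", "Shared room"]

categories = fit_categories(room_types)
encoded = [one_hot(v, categories) for v in room_types]

# A category never seen while fitting (hypothetical) maps to a row of zeros.
unknown = one_hot("Hotel room", categories)

print(categories)
print(encoded[0])
print(unknown)
```

Each listing ends up with exactly one 1.0 among the `room_type_*` columns, which matches the last three columns of the encoded table.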


In [11]:
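labels = df_Enc['price'].values          # target variable: nightly price
labels = np.log10(labels)                # model log10(price) instead of the raw price
train1 = df_Enc.drop(['price'], axis = 1) # feature matrix: every column except the target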
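
The target is modelled on a log10 scale, and later cells undo the transform with `10 ** y`. The sketch below shows that round trip on the first five prices from the dataset preview, and why zero-priced listings had to be removed first: `np.log10(0)` evaluates to `-inf`, while Python's `math.log10(0)` raises an error. The helper `safe_log10` is hypothetical and exists only for this illustration.

```python
import math

# First five prices from the dataset preview at the top of the notebook.
prices = [149, 225, 150, 89, 80]

# Forward transform: the regression target is log10(price).
log_prices = [math.log10(p) for p in prices]

# Inverse transform: 10 ** y maps a log-scale value back to dollars.
recovered = [round(10 ** y) for y in log_prices]


def safe_log10(price):
    """Return log10(price), or None when the price is not positive."""
    return math.log10(price) if price > 0 else None


print(recovered)
print(safe_log10(0))
```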

## Split Data

In [12]:
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state = 2)
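
Conceptually, the split shuffles the rows with a fixed seed and holds out 10% of them for testing. The sketch below illustrates that idea in plain Python. It is not how `train_test_split` is implemented and will not reproduce its exact partition; the function name `simple_train_test_split` and the 20-row toy input are assumptions made for the example.

```python
import math
import random


def simple_train_test_split(rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    # Number of held-out rows, rounded up so the test set is never empty.
    n_test = math.ceil(test_size * len(rows))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(20))  # stand-in for 20 listings
train_rows, test_rows = simple_train_test_split(rows, test_size=0.10, seed=2)
print(len(train_rows), len(test_rows))
```

Fixing the seed is what makes the three models comparable: each one is scored on the same held-out listings.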

# Multiple Linear Regression Model

In [13]:
lr = LinearRegression()
lr.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [14]:
y_pred = lr.predict(x_test)
eva = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
eva.head(10)

| | Actual | Predicted |
|---|---|---|
| 0 | 250.0 | 147.0 |
| 1 | 85.0 | 76.0 |
| 2 | 178.0 | 226.0 |
| 3 | 99.0 | 116.0 |
| 4 | 150.0 | 57.0 |
| 5 | 99.0 | 157.0 |
| 6 | 39.0 | 64.0 |
| 7 | 80.0 | 77.0 |
| 8 | 100.0 | 122.0 |
| 9 | 100.0 | 89.0 |
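
One way to read the comparison above is to express the error in dollars. The sketch below computes the mean and median absolute error over the ten rows displayed for the linear regression model. Ten rows are far too few to characterise the model, so this only illustrates the calculation; the variable names are hypothetical.

```python
import statistics

# First ten (actual, predicted) price pairs from the linear regression output above.
pairs = [(250, 147), (85, 76), (178, 226), (99, 116), (150, 57),
         (99, 157), (39, 64), (80, 77), (100, 122), (100, 89)]

# Absolute error in dollars for each listing.
abs_errors = [abs(actual - predicted) for actual, predicted in pairs]

mean_abs_error = statistics.mean(abs_errors)
median_abs_error = statistics.median(abs_errors)

print(abs_errors)
print(mean_abs_error, median_abs_error)
```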


In [15]:
lr.score(x_test, y_test)

0.5599734667839513
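
For regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, here computed on the log10 prices rather than on dollars. A minimal sketch of that formula follows; the function name `r2_manual` and the toy values are assumptions for illustration.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical log-scale targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_close = [1.5, 2.0, 3.0, 3.5]
y_mean = [2.5, 2.5, 2.5, 2.5]  # always predicting the mean of y_true

print(r2_manual(y_true, y_true))   # perfect predictions
print(r2_manual(y_true, y_close))  # small errors
print(r2_manual(y_true, y_mean))   # no better than the mean
```

A score of about 0.56 therefore means the model explains a little over half of the variance in log10(price) on the test set.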

# Random Forest Model

In [16]:
rf = RandomForestRegressor(max_depth=8, n_estimators = 100, random_state = 0)
rf.fit(x_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=8, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

In [17]:
# Predicting the Test set results
y_pred = rf.predict(x_test)
eva = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
eva.head(10)

| | Actual | Predicted |
|---|---|---|
| 0 | 250.0 | 129.0 |
| 1 | 85.0 | 78.0 |
| 2 | 178.0 | 195.0 |
| 3 | 99.0 | 120.0 |
| 4 | 150.0 | 61.0 |
| 5 | 99.0 | 135.0 |
| 6 | 39.0 | 50.0 |
| 7 | 80.0 | 74.0 |
| 8 | 100.0 | 121.0 |
| 9 | 100.0 | 97.0 |


In [18]:
rf.score(x_test, y_test)

0.5913505821362515

# Gradient Boosting Model

In [19]:
clf = ensemble.GradientBoostingRegressor(n_estimators = 400, max_depth = 5, min_samples_split = 2, learning_rate = 0.1, loss = 'ls')
clf.fit(x_train, y_train)
# n_estimators: the number of boosting stages (trees) to fit
# max_depth: the maximum depth of each individual regression tree
# min_samples_split: the minimum number of samples required to split an internal node
# learning_rate: shrinks the contribution of each tree; lower values usually need more trees
# loss: the loss function to optimize; 'ls' is least squares (named 'squared_error' from scikit-learn 1.0 onward)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=5,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=400,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
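
To show how `n_estimators` and `learning_rate` interact, here is a minimal sketch of least-squares gradient boosting on one feature, using depth-1 trees (stumps). It is a simplified illustration under those assumptions, not scikit-learn's `GradientBoostingRegressor`; the data and the names `fit_stump`, `gradient_boost`, and `mse` are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit the best single-split regression stump on 1-D inputs by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the last value so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_estimators, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_estimators):
        # For squared-error loss the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Each stump contributes only a learning_rate-sized step.
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(model, x, y):
    """Mean squared error of a fitted model on (x, y)."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical 1-D training data.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 1.9, 3.1, 3.9, 4.2]

few = gradient_boost(x, y, n_estimators=5, learning_rate=0.1)
many = gradient_boost(x, y, n_estimators=200, learning_rate=0.1)
print(mse(few, x, y), mse(many, x, y))
```

With a small learning rate each stage corrects only a fraction of the remaining error, so the training error keeps falling as stages are added. That trade-off is why a rate of 0.1 is paired with several hundred estimators.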

In [20]:
# Compare actual and predicted values
y_pred = clf.predict(x_test)
eva = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
eva.head(10)

| | Actual | Predicted |
|---|---|---|
| 0 | 250.0 | 142.0 |
| 1 | 85.0 | 71.0 |
| 2 | 178.0 | 247.0 |
| 3 | 99.0 | 104.0 |
| 4 | 150.0 | 58.0 |
| 5 | 99.0 | 107.0 |
| 6 | 39.0 | 52.0 |
| 7 | 80.0 | 68.0 |
| 8 | 100.0 | 115.0 |
| 9 | 100.0 | 89.0 |


In [21]:
clf.score(x_test, y_test)

0.6254349283048387

# Conclusion

None of the price prediction models performs well. The best is the Gradient Boosting model, with a test-set R² of 0.6254 on log10(price), ahead of Random Forest (0.5914) and Multiple Linear Regression (0.5600).<br>
This dataset leaves little room for further feature engineering. Results would likely improve with more information about the properties, such as the number of bedrooms, floor area in square feet, floor number, and year built.