# Business Understanding: Airbnb

## **An Overview**

Airbnb is a platform for individuals to rent out their primary residences as lodgings for travelers. Typically, renters seek accommodations with a homey feel that hotels cannot provide, while most hosts are willing to rent out their homes to supplement their income. The majority of Airbnb’s revenue comes from service fees from bookings charged to both guests and hosts [1], and it currently covers more than 81,000 cities and 191 countries worldwide [2].

## **The Advantages of Airbnb**

* Wide Selection: 
Airbnb hosts list many different kinds of properties—single rooms, a suite of rooms, apartments, moored yachts, houseboats, entire houses, even a castle—on the Airbnb website.
* Free Listings: 
Hosts do not have to pay to list their properties.
* Hosts Can Set Their Own Price: 
It is up to each host to decide how much to charge per night, per week, or per month.
* Customizable Searches: 
Guests can search the Airbnb database—not only by date and location, but by price, type of property, amenities, and the language of the host. They can also add keywords (such as “close to the XXX”) to further narrow their search.
* Protections for Guests and Hosts: 
As a protection for guests, Airbnb holds the guest’s payment for 24 hours after check-in before releasing the funds to the host. For hosts, Airbnb’s Host Guarantee program “provides protection for up to $1,000,000 in damages to covered property in the rare event of guest damage, in eligible countries.” [2]

## **The Disadvantages of Airbnb**

* What You See May Not Be What You Get:
Individual hosts create their own listings, and some may be more honest than others. However, previous guests often post comments about their experiences, which can provide a more objective view.
* Potential Damages:
Probably the biggest risk for hosts is that their property will be damaged.
* Added Fees:
Airbnb imposes a number of additional fees (as, of course, do hotels and other lodging providers). Guests pay a guest service fee of 0% to 20% on top of the reservation fee, to cover Airbnb’s customer support and other services, and while listings are free, Airbnb charges hosts a service fee of at least 3% for each reservation to cover the cost of processing the transaction.
* It Is Not Legal Everywhere. [2]

To sum up, for hosts, participating in Airbnb is a way to earn extra income from their property, with the main risk being that a guest might do serious damage to it. For guests, the advantage can be accommodations that are less expensive and homier than a hotel room, with the risk that the property will not be as appealing as the listing made it seem.

Source:
[1] Nath, Trevir. “How Airbnb Makes Money.” Investopedia.com. N.p., 16 Nov. 2020. Web. 29 Nov. 2020.
[2] Folger, Jean. “Airbnb: Advantages and Disadvantages.” Investopedia.com. N.p., 28 Aug. 2020. Web. 29 Nov. 2020.


# Data Understanding

## **About Data Understanding**

In this section, we give an overview of what the dataset looks like. We introduce its source and size, present the number of instances and attributes (equivalent to the number of rows and columns), and discuss the meaning of each attribute and its data type.

## **Data Understanding of the Dataset**

In this project, we investigate the [Airbnb ratings dataset](http://www.kaggle.com/samyukthamurali/airbnb-ratings-dataset), which we found on Kaggle. Our goal is to discover which attributes most strongly influence Airbnb prices in the U.S. The original dataset contains four sub-datasets: LA_Listings, NY_Listings, airbnb_ratings_new, and airbnb-reviews. The first three datasets contain 59.9k instances and 35 attributes, including customer ID, host ID, locations, layouts, furnishings, prices of the residences, review scores, etc. The last dataset contains 1325 instances and 6 attributes: customer ID, host ID, review ID, reviewer name, date, and comments.


For our analysis, we picked the attributes that are meaningful from the four datasets and formed our own dataset by filtering and combining the data (discussed in the Data Preparation section). This new dataset has 295,452 instances and 19 attributes. Each attribute is described below [3]. Summary statistics for each attribute (count, mean, standard deviation, min, 25\% quantile, 50\% quantile, 75\% quantile, max) are shown in the last table of the Data Preparation section.

**Listing ID**: the ID number of an Airbnb 

**Host ID**: the ID of the host

**Host total listings count**: the total number of host listings 

**Longitude**: the longitude of the Airbnb 

**Accommodates**: the number of people an Airbnb can accommodate

**Bathrooms**: number of bathrooms 

**Bedrooms**: number of bedrooms 

**Price**: price of an Airbnb per day

**Minimum nights**: the minimum number of nights a guest must stay

**Maximum nights**: the maximum number of nights a guest can stay

**Availability 365**: the number of days available in a year 

**Number of reviews**: the total number of reviews

**Review Scores Accuracy**: how accurately did the listing page represent an Airbnb? 

**Review Scores Cleanliness**: how clean and tidy did the guests feel about an Airbnb? 

**Review Scores Checkin**: how smoothly did check-in go?

**Review Scores Communication**: how well did the guests communicate with the hosts before and during the stay?

**Review Scores Location**: how did guests feel about the neighborhood? (Whether there's an accurate description for proximity and access to transportation, shopping centers, city center, etc., and a description that includes special considerations, like noise, and family safety.) 

**Review Scores Value**: did the guest feel that the listing provided good value for the price?

**Reviews per month**: the number of reviews a host receives per month 


## **Data Type:**

Aside from knowing the meanings of the attributes, it is also necessary to know the data type of each attribute, since the data type affects the methods we can use to analyze and understand the data.

In general, an attribute can be classified as one of the four data types[4]: 

**Nominal**: data that can be categorized by labelling them in exclusive groups (names for categories, classes and states of things, etc)

**Ordinal**: data that can be categorized and ranked, but for which the differences between values are not known or meaningful

**Interval**: a numerical scale where the order of the data is known as well as the difference between the data, but there is no true zero point

**Ratio**: data that can be measured on a scale that not only produces the order of the data but also the difference between data. This type of data has a true zero.

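**In our case**, Listing ID and Host ID are nominal. Strictly speaking, the review scores are ordinal ratings and Longitude is an interval quantity (its zero point is a convention), while the remaining attributes, such as Price, Bedrooms, and Number of reviews, are ratio data.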


Sources:

[3] “How Do Star Ratings Work for Stays? - Airbnb Help Center.” Airbnb, www.airbnb.com/help/article/1257/how-do-star-ratings-work-for-stays. 

[4] https://www.questionpro.com/blog/nominal-ordinal-interval-ratio/


# Data Preparation 

To begin the exploratory data analysis and train a machine learning model, we first need to prepare the data.

We will import, combine, and filter the data we need and output a CSV file for further use.

### Import Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Then, import the libraries used for plotting and statistics (matplotlib, seaborn, and scipy).

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 
import pandas as pd
import scipy.stats

Now, let's read in the data and prepare to combine the csvs into one.


In [None]:
# Read LA_Listings.csv
df = pd.read_csv('/kaggle/input/airbnb-ratings-dataset/LA_Listings.csv', encoding='ISO-8859-1')
df.head()

In [None]:
# Read the NY_Listings.csv
df2 = pd.read_csv('/kaggle/input/airbnb-ratings-dataset/NY_Listings.csv', encoding='ISO-8859-1')
df2.head()

Finally, we need to read 'airbnb_ratings_new.csv'.

In [None]:
# Read the airbnb_ratings_new.csv
df3 = pd.read_csv('/kaggle/input/airbnb-ratings-dataset/airbnb_ratings_new.csv', encoding='ISO-8859-1')
pd.set_option('display.max_columns', None)
df3.head()

In 'airbnb_ratings_new.csv', the listings come from many different countries, such as Italy, Hong Kong (China), and Austria. We only want to analyze listings inside the U.S., because prices vary widely between countries, so we filter on the 'Country' column.

In [None]:
df_filtered = df3[df3['Country'] == 'United States']

df_filtered.head()


Now, let's get more information about our datasets.

In [None]:
df.describe()


In [None]:
df2.describe()


In [None]:
df_filtered.describe()

Now, let's combine those three datasets into one:

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three frames the same way
df_final = pd.concat([df, df2, df_filtered])

df_final.describe()

**Now, 'df_final' has 295,452 rows of data and is ready to use.**

# Exploratory Data Analysis 

In this part, we do exploratory data analysis by examining the correlation between Price and the number of bedrooms, the number of bathrooms, and the review scores.

## Distribution plots of variables

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 
import scipy.stats

In [None]:
# Density Plot and Histogram of variable "Price"
sns.distplot(df_final['Price'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

From the graph, we can see that 'Price' has a very large range, so we need to filter out the extreme values that could distort our analysis. After inspecting the plot, we found that restricting the range to 0 to 500 is appropriate.

In [None]:
# Filter the Price to below 500
PriceFilteredData = df_final[df_final['Price'] < 500]

# Density Plot and Histogram of variable "Price"
sns.distplot(PriceFilteredData['Price'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})
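
One way to sanity-check a cutoff such as 500 is to measure how much data it discards and where the right tail begins. The following is only a sketch on a small made-up price series (the values are hypothetical and not taken from the Airbnb data); the same two checks could be run on `df_final['Price']`.

```python
import pandas as pd

# Hypothetical prices for illustration only; not taken from the dataset
prices = pd.Series([45, 80, 95, 120, 150, 210, 320, 480, 750, 9999], name="Price")

cutoff = 500
kept = prices[prices < cutoff]

# Share of rows that survive the filter
share_kept = len(kept) / len(prices)
print(f"rows kept: {len(kept)} of {len(prices)} ({share_kept:.0%})")

# Upper quantiles show where the long right tail begins
print(prices.quantile([0.5, 0.9, 0.99]))
```

If the cutoff removes only a small share of rows while sitting well above the median, it is acting as an outlier filter rather than reshaping the bulk of the distribution.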

Now, let's see the distribution of numbers of Bedrooms:

In [None]:
# Density Plot and Histogram of variable "Bedrooms"
sns.distplot(df_final['Bedrooms'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

We can see that most houses have between 0 and 6 bedrooms, so let's filter the data.

In [None]:
# Filter the Bedrooms to below 6
BedroomsFilteredData = df_final[df_final['Bedrooms'] < 6]

# Histogram of variable "Bedrooms" (plots the filtered frame defined above)
sns.distplot(BedroomsFilteredData['Bedrooms'], hist=True, kde=False, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

From this graph, we can see that the shapes of the distributions of 'Bedrooms' and 'Price' are very similar, which suggests a possible relationship between them; we will investigate this further later. Before that, let's plot the distributions of other variables.

In [None]:
# Filter the Bathrooms to below 6
BathroomsFilteredData = df_final[df_final['Bathrooms'] < 6]

# Histogram of variable "Bathrooms"
sns.distplot(BathroomsFilteredData['Bathrooms'], hist=True, kde=False, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

Let's see the distributions of more variables:

In [None]:
# Density Plot and Histogram of variable "Availability 365"
sns.distplot(df_final['Availability 365'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

In [None]:
# Density Plot and Histogram of variable "Review Scores Value"
sns.distplot(df_final['Review Scores Value'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

In [None]:
# Density Plot and Histogram of variable "Reviews per month"
sns.distplot(df_final['Reviews per month'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

Let's filter the variable:

In [None]:
# Filter Reviews per month to below 10
ReviewsFilteredData = df_final[df_final['Reviews per month'] < 10]


# Density Plot and Histogram of variable "Reviews per month"
sns.distplot(ReviewsFilteredData['Reviews per month'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})


In [None]:
# Density Plot and Histogram of variable "Number of reviews"
sns.distplot(df_final['Number of reviews'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

Filter the data:

In [None]:
# Filter Number of reviews to below 60
ReviewsFilteredData = df_final[df_final['Number of reviews'] < 60]


# Density Plot and Histogram of variable "Number of reviews"
sns.distplot(ReviewsFilteredData['Number of reviews'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

We can see that most listings have 0 - 10 reviews.

Next, let's run a correlation test to find potential correlations.

# Correlation Test

First, let's draw a pair plot using the Seaborn library to see whether any two variables appear correlated:

In [None]:
# Apply the Price, Bedrooms, Bathrooms and Reviews-per-month filters in sequence
BedroomsFilteredData = PriceFilteredData[PriceFilteredData['Bedrooms'] < 6]
BathroomsFilteredData = BedroomsFilteredData[BedroomsFilteredData['Bathrooms'] < 6]
filteredData = BathroomsFilteredData[BathroomsFilteredData['Reviews per month'] < 10]

In [None]:
cols=['Price','Bedrooms','Bathrooms','Review Scores Value','Reviews per month','Review Scores Accuracy']
sns.pairplot(filteredData[cols])
plt.show()

From the graph above, we cannot find a linear pattern between 'Price' and the other variables, so we run a correlation test to check this observation.

A correlation coefficient measures the extent to which two variables tend to change together. The coefficient describes both the strength and the direction of the relationship.

The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable. The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables.
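The difference between the two coefficients can be seen on a small synthetic example. The sketch below uses made-up data (a cubic relationship, not the Airbnb data) that is strictly increasing but not linear, so the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = x ** 3  # strictly increasing, but not linear in x

# Pearson measures linear association; Spearman measures monotonic association
pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```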

In our case, we want to find the correlation between 'Price', which is continuous, and 'Bedrooms', 'Bathrooms', 'Reviews per month', and the review scores, which we treat as ordinal, so we use the Spearman method.


Now, let's do spearman correlation test on our data: 

In [None]:
# numeric_only=True (available in pandas 1.5+) skips the text columns
filteredData.corr(method='spearman', numeric_only=True)

From the result table, 'Price' and 'Accommodates' have a correlation coefficient of 0.55, which indicates a moderate correlation. 'Bedrooms' has a correlation coefficient of 0.46 with 'Price', the second-highest value among all variables. This is understandable: the more bedrooms a house has and the more people it can accommodate, the higher its price tends to be.
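Reading one column out of a large correlation table is easier when that column is sorted. The following sketch shows the idea on synthetic data; the frame `toy` and the column `Noise` are hypothetical stand-ins, and on the real data the same two lines could be applied to the Spearman table computed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
bedrooms = rng.integers(0, 6, size=n)
toy = pd.DataFrame({
    "Bedrooms": bedrooms,
    "Noise": rng.normal(size=n),
    # Price rises with bedrooms plus random scatter
    "Price": 50 + 40 * bedrooms + rng.normal(scale=30, size=n),
})

corr = toy.corr(method="spearman")

# Keep only the correlations with Price, strongest first
price_corr = corr["Price"].drop("Price").sort_values(ascending=False)
print(price_corr)
```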

# Machine Learning: Using Linear Regression for Predicting the Prices of Airbnb Residences 

## **Setup**

In [None]:
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
#Scikit-learn for implementing LinearRegression from an existing algorithm.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

# Common imports
import numpy as np
import pandas as pd

from IPython.display import clear_output

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

np.random.seed(42)

import warnings
warnings.filterwarnings('ignore')

def computeCost(X, y, theta):
    return 1/(2*y.size)*np.sum(np.square(X.dot(theta)-y))

## **Partition the dataset**

Since we would like to test how well our predictive model performs on the df_final data, we need to evaluate it on data it has not seen before. To achieve this, we first partition our dataset into a training set and a test set. We use a 70/30 partition with train_test_split from sklearn.model_selection. Note that after the split, df_final refers to the training set only.

In [None]:
from sklearn.model_selection import train_test_split
df_final, test = train_test_split(df_final, test_size=0.3, random_state=43)
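
As a quick illustration of what a 70/30 split produces, here is a minimal sketch on a made-up ten-row frame. The names `toy`, `train_part`, and `test_part` are hypothetical; giving the two parts their own names is one way to keep the full frame available after the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"Bedrooms": np.arange(10), "Price": np.arange(10) * 50.0})

# Keep the full frame under its own name and give the two parts distinct names
train_part, test_part = train_test_split(toy, test_size=0.3, random_state=43)

print(len(train_part), len(test_part))

# No row appears in both parts
overlap = set(train_part.index) & set(test_part.index)
print(overlap)
```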

## **Remove rows with NA values**

To make sure the dataset we used to train the model does not contain any NA values, we remove the rows with NA values from our training and test sets.

In [None]:
# Drop any rows with null values
df_final.dropna(axis=0, how='any', inplace=True)
test.dropna(axis=0, how='any', inplace=True)

column_names = ['Host total listings count', 'longitude', 'Accommodates', 'Bathrooms', 'Bedrooms', 'Minimum nights', 'Maximum nights', 'Availability 365', 'Number of reviews',
                'Review Scores Accuracy', 'Review Scores Cleanliness', 'Review Scores Checkin', 'Review Scores Communication', 'Review Scores Location', 'Review Scores Value', 'Reviews per month']
X = df_final[column_names]
y = df_final['Price']

X_test = test[column_names]
y_test = test['Price']
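
Dropping rows with a missing value in any column also discards rows that are complete in every column the model actually uses. The sketch below, on a made-up frame with a hypothetical unused `Notes` column, compares that with restricting the check to the modeling columns through the `subset` argument of `dropna`. Whether this matters for the Airbnb data depends on how many missing values sit in the unused columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Bedrooms": [1, 2, np.nan, 3],
    "Price": [80.0, 120.0, 95.0, 200.0],
    "Notes": [None, "sunny", "quiet", None],  # hypothetical unused text column
})

model_columns = ["Bedrooms", "Price"]

# Dropping on every column also discards rows that are complete for the model
all_cols = toy.dropna(axis=0, how="any")

# Restricting the check to the modeling columns keeps more usable rows
subset_only = toy.dropna(axis=0, how="any", subset=model_columns)

print(len(all_cols), len(subset_only))
```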

## **Fit Multivariate Linear Regression Model**

We fit the linear regression model to the training set to decide which parameters are useful for our analysis (those whose coefficients are not zero).

In [None]:
#Fit Multivariate Linear Regression Model
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

#Store the coefficients of the model fitted on the full training set
params = pd.Series(model.coef_, index=X.columns)

#Calculate the model's parameters error by bootstrap (this refits model on every resample)
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
              for i in range(1000)], 0)

print(pd.DataFrame({'effect': params.round(0),
                    'error': err.round(0)}))

## **Remove unnecessary parameters**

Now we remove the variables that have coefficients of zero from the training and test set. 

In [None]:
# Drop any rows with null values
df_final.dropna(axis=0, how='any', inplace=True)
test.dropna(axis=0, how='any', inplace=True)

column_names = ['Accommodates', 'Bathrooms', 'Bedrooms', 'Review Scores Accuracy', 'Review Scores Cleanliness', 'Review Scores Checkin', 'Review Scores Communication', 'Review Scores Location', 'Review Scores Value', 'Reviews per month']
X = df_final[column_names]
y = df_final['Price']

X_test = test[column_names]
y_test = test['Price']


We fit the multivariate linear regression model again on the new training set to make sure none of the remaining variables has a meaningless (zero) coefficient.

In [None]:
#Fit Multivariate Linear Regression Model
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

#Store the coefficients of the model fitted on the full training set
params = pd.Series(model.coef_, index=X.columns)

#Calculate the model's parameters error by bootstrap (this refits model on every resample)
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
              for i in range(1000)], 0)

print(pd.DataFrame({'effect': params.round(0),
                    'error': err.round(0)}))

# Ridge Linear Regression 

In our project, we decided to use the Ridge linear regression model because it can be computed very efficiently and adds hardly any computational cost compared with ordinary linear regression [5]. Ridge regression, also known as $L_2$ regularization, puts a constraint on the model coefficients. The penalty added to the model fit is:
 
$P = \lambda\sum_{j=1}^{N}\theta^2_j$ [5]

where $\lambda$ is the penalty term that regularizes the coefficients such that if the coefficients take large values, the optimization function will be penalized [6].
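
To make the role of the penalty concrete, the following sketch compares the closed-form ridge solution with scikit-learn's `Ridge` on synthetic data. It assumes the objective $\|y - X\theta\|^2 + \lambda\sum_j\theta_j^2$ with no intercept, which is the form scikit-learn documents for `Ridge` (its `alpha` plays the role of $\lambda$); the data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 10.0

# Closed-form minimizer of ||y - X theta||^2 + lam * ||theta||^2 (no intercept)
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's Ridge uses the same objective, with alpha playing the role of lambda
theta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(theta_closed)
print(theta_sklearn)
```

The extra term `lam * np.eye(...)` is the only difference from the ordinary least-squares normal equations, which is why the added computational cost is small.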

Ridge regression can reduce model complexity and prevent over-fitting which may result from simple linear regression [6]. When $\lambda \to 0$, we recover the standard linear regression result; when $\lambda \to \infty$, all model responses will be suppressed [5].
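
The two limits can be checked numerically. This is a sketch on synthetic data (not the Airbnb data): with a very small penalty the ridge coefficients match ordinary least squares, and as the penalty grows the size of the coefficient vector shrinks toward zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ols_norm = np.linalg.norm(ols.coef_)

# Norm of the ridge coefficient vector for increasing penalties
norms = {}
for lam in [1e-8, 1.0, 100.0, 1e6]:
    ridge = Ridge(alpha=lam).fit(X, y)
    norms[lam] = np.linalg.norm(ridge.coef_)
    print(f"lambda={lam:g}  ||theta||={norms[lam]:.4f}")

print(f"OLS           ||theta||={ols_norm:.4f}")
```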

Source:

[5] Day_26_Regularized_Linear_Regression.ipynb

[6] https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b

## **Setup**

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

#import warnings
#warnings.filterwarnings('ignore')

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
#Scikit-learn for implementing LinearRegression from an existing algorithm.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Common imports
import numpy as np
import os

from IPython.display import clear_output

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

np.random.seed(42)


def computeCost(X, y, theta):
    return 1/(2*y.size)*np.sum(np.square(X.dot(theta)-y))

## **Fit data to Ridge Regression Model**

Since we do not know which polynomial degree to use, we tried several to find the most suitable degree for our model.

### **Degree = 1**

### **Import Ridge Linear Regression**

In [None]:
#Import Ridge Linear Regression
from sklearn.linear_model import Ridge

lambda_term=10
Degree_of_the_Polynomial_Model=1
polybig_features = PolynomialFeatures(degree=Degree_of_the_Polynomial_Model, include_bias=False)
std_scaler = StandardScaler()

Ridge_lin_reg = Ridge(alpha=lambda_term)

ridge_regression_pipeline = Pipeline([
        ("poly_features", polybig_features),
        ("std_scaler", std_scaler),
        ("Ridge_lin_reg", Ridge_lin_reg),])

#Fit Ridge Linear Regression Model
ridge_regression_pipeline.fit(X, y)

## **Calculate the model's parameters error**

Here we use bootstrap resamplings of the data to compute the uncertainties of the model's parameters.

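In the output, **effect** represents the coefficient estimate of each parameter and **error** is the standard deviation of that estimate across the bootstrap resamples.

Take the parameter Bedrooms as an example. Because the pipeline standardizes every feature before fitting, each coefficient is expressed per one standard deviation of its feature: a one-standard-deviation increase in the number of bedrooms is associated with a price increase of 34$\pm$1 dollars.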
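
When a `StandardScaler` sits in front of the regression, a coefficient can be converted to a per-unit effect by dividing it by the feature's standard deviation, which the fitted scaler stores in `scale_`. The sketch below shows this on synthetic data with a single made-up feature standing in for Bedrooms; the numbers it prints are not results from the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical single feature standing in for 'Bedrooms'
bedrooms = rng.integers(0, 6, size=300).astype(float).reshape(-1, 1)
price = 60 + 35 * bedrooms[:, 0] + rng.normal(scale=20, size=300)

pipe = Pipeline([
    ("std_scaler", StandardScaler()),
    ("ridge", Ridge(alpha=10)),
]).fit(bedrooms, price)

coef_per_sd = pipe.named_steps["ridge"].coef_   # price change per 1 SD of the feature
sd = pipe.named_steps["std_scaler"].scale_      # SD of the feature in the training data
coef_per_unit = coef_per_sd / sd                # price change per additional bedroom

print(coef_per_sd, sd, coef_per_unit)
```

For a degree-1 model this conversion is exact: predicting for one extra unit of the feature changes the output by `coef_per_unit`.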

In [None]:
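from sklearn.base import clone

#Coefficients of the pipeline fitted on the full training set
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
steps = ridge_regression_pipeline.named_steps
params = pd.Series(steps.Ridge_lin_reg.coef_,
                   index=steps.poly_features.get_feature_names_out(X.columns))

#Bootstrap on clones so the fitted pipeline above is not overwritten by the refits
np.random.seed(1)
err = np.std([clone(ridge_regression_pipeline).fit(*resample(X, y)).named_steps.Ridge_lin_reg.coef_
              for i in range(1000)], 0)
pd.DataFrame({'effect': params.round(0),'error': err.round(0)})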

## **Calculate model RMSE**

Next, we calculate a performance metric to see how well our regression model did on the training set and the test set.

RMSE stands for Root Mean Squared Error. It is the square root of the Mean Squared Error:

$$ RMSE = \sqrt{MSE}= \sqrt{ \frac{1}{m}\sum_{i=1}^m\left( h_{\theta}(x^{(i)}) - y^{(i)}\right)^2}$$
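
The formula translates directly into NumPy. The sketch below uses four made-up actual and predicted prices to show that the hand-written version agrees with taking the square root of scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices, for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 270.0])

# Direct translation of the formula: square the residuals, average, take the root
rmse_manual = np.sqrt(np.mean((y_pred - y_true) ** 2))

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the target, a value of about 90 in the results that follow means the predictions are off by roughly 90 dollars in the root-mean-square sense.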

In [None]:
#Training Set

RMSE_training_Ridge=np.sqrt(mean_squared_error(y, ridge_regression_pipeline.predict(X)))
print("RMSE Ridge LR from Training=" + str(round(RMSE_training_Ridge,4)))

#Plot predicted values 
df_final['predicted'] = ridge_regression_pipeline.predict(X)
df_final[['Price', 'predicted']].plot(alpha=0.5);

In [None]:
#Testing Set

RMSE_test_Ridge=np.sqrt(mean_squared_error(y_test, ridge_regression_pipeline.predict(X_test)))
print("RMSE Ridge LR from Test=" + str(round(RMSE_test_Ridge,4)))

test['predicted'] = ridge_regression_pipeline.predict(X_test)
test[['Price', 'predicted']].plot(alpha=0.5);

## **Degree = 2**

In [None]:
#Import Ridge Linear Regression
from sklearn.linear_model import Ridge

lambda_term=10
Degree_of_the_Polynomial_Model=2
polybig_features = PolynomialFeatures(degree=Degree_of_the_Polynomial_Model, include_bias=False)
std_scaler = StandardScaler()

Ridge_lin_reg = Ridge(alpha=lambda_term)

ridge_regression_pipeline = Pipeline([
        ("poly_features", polybig_features),
        ("std_scaler", std_scaler),
        ("Ridge_lin_reg", Ridge_lin_reg),])

ridge_regression_pipeline.fit(X, y)

In [None]:
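#Calculate the model's parameters error
from sklearn.base import clone

#Coefficients of the pipeline fitted on the full training set
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
steps = ridge_regression_pipeline.named_steps
params = pd.Series(steps.Ridge_lin_reg.coef_,
                   index=steps.poly_features.get_feature_names_out(X.columns))

#Bootstrap on clones so the fitted pipeline above is not overwritten by the refits
np.random.seed(1)
err = np.std([clone(ridge_regression_pipeline).fit(*resample(X, y)).named_steps.Ridge_lin_reg.coef_
              for i in range(1000)], 0)
pd.DataFrame({'effect': params.round(0),'error': err.round(0)})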

In [None]:
#Calculate model RMSE for Training Set

RMSE_training_Ridge=np.sqrt(mean_squared_error(y, ridge_regression_pipeline.predict(X)))
print("RMSE Ridge LR from Training=" + str(round(RMSE_training_Ridge,4)))

#Plot predicted values 
df_final['predicted'] = ridge_regression_pipeline.predict(X)
df_final[['Price', 'predicted']].plot(alpha=0.5);

In [None]:
#Calculate model RMSE for Test Set

RMSE_test_Ridge=np.sqrt(mean_squared_error(y_test, ridge_regression_pipeline.predict(X_test)))
print("RMSE Ridge LR from Test=" + str(round(RMSE_test_Ridge,4)))

test['predicted'] = ridge_regression_pipeline.predict(X_test)
test[['Price', 'predicted']].plot(alpha=0.5);

## **Degree = 3**


In [None]:
#Import Ridge Linear Regression
from sklearn.linear_model import Ridge

lambda_term=10
Degree_of_the_Polynomial_Model=3
polybig_features = PolynomialFeatures(degree=Degree_of_the_Polynomial_Model, include_bias=False)
std_scaler = StandardScaler()

Ridge_lin_reg = Ridge(alpha=lambda_term)

ridge_regression_pipeline = Pipeline([
        ("poly_features", polybig_features),
        ("std_scaler", std_scaler),
        ("Ridge_lin_reg", Ridge_lin_reg),])

ridge_regression_pipeline.fit(X, y)


In [None]:
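#Calculate the model's parameters error
from sklearn.base import clone

#Coefficients of the pipeline fitted on the full training set
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
steps = ridge_regression_pipeline.named_steps
params = pd.Series(steps.Ridge_lin_reg.coef_,
                   index=steps.poly_features.get_feature_names_out(X.columns))

#Bootstrap on clones so the fitted pipeline above is not overwritten by the refits
np.random.seed(1)
err = np.std([clone(ridge_regression_pipeline).fit(*resample(X, y)).named_steps.Ridge_lin_reg.coef_
              for i in range(1000)], 0)
pd.DataFrame({'effect': params.round(0),'error': err.round(0)})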

In [None]:
#Calculate model RMSE for Training Set

RMSE_training_Ridge=np.sqrt(mean_squared_error(y, ridge_regression_pipeline.predict(X)))
print("RMSE Ridge LR from Training=" + str(round(RMSE_training_Ridge,4)))

#Plot predicted values 
df_final['predicted'] = ridge_regression_pipeline.predict(X)
df_final[['Price', 'predicted']].plot(alpha=0.5);


In [None]:
#Calculate model RMSE for Test Set

RMSE_test_Ridge=np.sqrt(mean_squared_error(y_test, ridge_regression_pipeline.predict(X_test)))
print("RMSE Ridge LR from Test=" + str(round(RMSE_test_Ridge,4)))

test['predicted'] = ridge_regression_pipeline.predict(X_test)
test[['Price', 'predicted']].plot(alpha=0.5);

# **Conclusion**

* Degree of the polynomial model equals 1:

RMSE Ridge LR from Training = 89.8186

RMSE Ridge LR from Testing = 91.3785

* Degree of the polynomial model equals 2:

RMSE Ridge LR from Training = 87.2148

RMSE Ridge LR from Testing = 88.8126

* Degree of the polynomial model equals 3:

RMSE Ridge LR from Training = 85.6238 

RMSE Ridge LR from Testing = 87.9751

After comparing the performance metrics (training and test RMSE) for three different degrees (1, 2, 3) and visually inspecting how well each model did on the training and test sets, we decided to use the Ridge Linear Regression model with degree 1. Using a higher-degree polynomial did not improve the performance of our model to a significant extent.
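
The size of the improvement can be put in relative terms using the RMSE values listed above. The short sketch below only does arithmetic on those reported numbers; the dictionary layout and variable names are illustrative.

```python
# RMSE values reported above, keyed by polynomial degree
rmse = {
    1: {"train": 89.8186, "test": 91.3785},
    2: {"train": 87.2148, "test": 88.8126},
    3: {"train": 85.6238, "test": 87.9751},
}

baseline = rmse[1]["test"]
summary = {}
for degree, scores in rmse.items():
    gain = (baseline - scores["test"]) / baseline  # relative test-RMSE reduction vs degree 1
    gap = scores["test"] - scores["train"]         # train/test gap, a rough overfitting signal
    summary[degree] = (gain, gap)
    print(f"degree {degree}: test RMSE reduction = {gain:.2%}, train/test gap = {gap:.4f}")
```

By these numbers, moving from degree 1 to degree 3 lowers the test RMSE by less than 4 percent while the gap between training and test error widens, which is consistent with preferring the simpler model.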

However, in the plots of actual and predicted Airbnb prices, the two series do not overlap well, which indicates that we have missed some key features. Either our features are incomplete (i.e., Airbnb prices depend on more than just these variables) or there are nonlinear relationships that we have failed to take into account. Nevertheless, our rough approximation is enough to give us some insight, and we can look at the coefficients of the Ridge Linear Model (with degree 1) to estimate how much each feature contributes to the price of an Airbnb.

Here's a recap of the parameter estimates we got from fitting the training set to the Ridge Linear Regression model with degree 1: 

                                   effect  error
    Accommodates                     40.0    1.0
    Bathrooms                         7.0    1.0
    Bedrooms                         34.0    1.0
    Review Scores Accuracy            2.0    1.0
    Review Scores Cleanliness        12.0    1.0
    Review Scores Checkin            -9.0    1.0
    Review Scores Communication      -4.0    1.0
    Review Scores Location           27.0    1.0
    Review Scores Value             -20.0    1.0
    Reviews per month               -11.0    0.0
    
  
From the result above, we see that the price of an Airbnb has a negative relationship with the review scores for check-in, communication, and value, as well as with reviews per month, while accommodates, bathrooms, bedrooms, and the review scores for accuracy, cleanliness, and location are positively related to price. Among these, accommodates and bedrooms are particularly influential. Because the pipeline standardizes the features, each coefficient is the price change per one standard deviation of the feature: a one-standard-deviation increase in accommodates is associated with a price increase of 40$\pm$1 dollars, and a one-standard-deviation increase in bedrooms with an increase of 34$\pm$1 dollars. This result agrees with our correlation analysis in the EDA section, where we found that price is most strongly correlated with accommodates and bedrooms.

