# Analyzing Hotel Ratings on Tripadvisor

In this homework, we analyze the data scraped in Part 1 by fitting regression models to it.

**Task 1 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating.

For example, the average rating of a hotel is calculated as follows:

![Information to be scraped](traveler_ratings.png)

$$ \text{AVG_SCORE} = \frac{1*15 + 2*21 + 3*55 + 4*228 + 5*1271}{1590}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
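
As a quick check of the formula above, the following minimal sketch recomputes the weighted average from the five review counts in the example. The variable names are illustrative and are not part of the notebook.

```python
# Star value -> number of reviews, taken from the worked example above.
counts = {1: 15, 2: 21, 3: 55, 4: 228, 5: 1271}

total_reviews = sum(counts.values())                        # 1590
weighted_sum = sum(star * n for star, n in counts.items())  # 7489
avg_score = weighted_sum / total_reviews

print(round(avg_score, 3))  # 4.71
```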

In [1]:
import csv
import numpy as np 
import pandas as pd

In [4]:
# Read in traveler_ratings.csv
df_rating = pd.read_csv('traveler_ratings.csv')
df_rating.head()

Unnamed: 0,hotel_name,rating,count
0,"Marriott Vacation Club Pulse at Custom House, ...",Excellent,507
1,"Marriott Vacation Club Pulse at Custom House, ...",Very good,175
2,"Marriott Vacation Club Pulse at Custom House, ...",Average,13
3,"Marriott Vacation Club Pulse at Custom House, ...",Poor,13
4,"Marriott Vacation Club Pulse at Custom House, ...",Terrible,12


In [5]:
# Read in attribute_ratings.csv, keeping only the columns we need
df_attr = pd.read_csv('attribute_ratings.csv', usecols = ['hotel_name', 'attribute', 'star_value'])
df_attr.head()

Unnamed: 0,hotel_name,attribute,star_value
0,"Marriott Vacation Club Pulse at Custom House, ...",Location,5
1,"Marriott Vacation Club Pulse at Custom House, ...",Rooms,5
2,"Marriott Vacation Club Pulse at Custom House, ...",Service,5
3,"Marriott Vacation Club Pulse at Custom House, ...",Value,5
4,"Marriott Vacation Club Pulse at Custom House, ...",Sleep Quality,5


In [4]:
print ('number of hotels in traveler_ratings.csv: %d'%(df_rating.hotel_name.unique().shape[0]))
print ('number of hotels in attribute_ratings.csv: %d'%(df_attr.hotel_name.unique().shape[0]))

number of hotels in traveler_ratings.csv: 82
number of hotels in attribute_ratings.csv: 82


In [107]:
# Compute the average score for each hotel and whether the hotel is excellent, in one pass
avg_score = {}
tmp = {}
excellence = {}
for row in df_rating.itertuples():
    if row.hotel_name not in avg_score:
        avg_score[row.hotel_name] = 0
        excellence[row.hotel_name] = 0
    tmp[row.rating] = int(row.count.replace(',', ''))
    
    if len(tmp)==5:
        avg_score[row.hotel_name] = (1*tmp['Terrible'] + 2*tmp['Poor'] + 3*tmp['Average'] + 4*tmp['Very good'] + 5*tmp['Excellent'])/sum(tmp.values())
        if (tmp['Excellent']/sum(tmp.values()) >= 0.6):
            excellence[row.hotel_name] = 1
        tmp = {}

In [7]:
# store avg_score in a Series
df_avg_score = pd.Series(avg_score)
df_avg_score.head()

Aloft Boston Seaport                             3.949686
Americas Best Value Inn                          2.850000
Ames Boston Hotel, Curio Collection by Hilton    4.283757
BEST WESTERN PLUS Roundhouse Suites              3.686801
Battery Wharf Hotel, Boston Waterfront           4.420423
dtype: float64
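
The per-hotel averages could also be computed without an explicit loop. The sketch below is one possible vectorized version, run on a small hypothetical DataFrame (`toy`) that mimics the long format of `traveler_ratings.csv`, including counts stored as strings with thousands separators. It is an alternative illustration, not the code used to produce the output above.

```python
import pandas as pd

# Hypothetical toy data in the same long format as traveler_ratings.csv.
toy = pd.DataFrame({
    'hotel_name': ['A'] * 5 + ['B'] * 5,
    'rating': ['Excellent', 'Very good', 'Average', 'Poor', 'Terrible'] * 2,
    'count': ['1,271', '228', '55', '21', '15', '10', '20', '30', '20', '20'],
})

# Map each rating label to its star value.
stars = {'Terrible': 1, 'Poor': 2, 'Average': 3, 'Very good': 4, 'Excellent': 5}

# Strip thousands separators, then weight each count by its star value.
toy['n'] = toy['count'].str.replace(',', '', regex=False).astype(int)
toy['weighted'] = toy['rating'].map(stars) * toy['n']

# Sum per hotel and divide to get the weighted average.
totals = toy.groupby('hotel_name')[['weighted', 'n']].sum()
avg_score_toy = totals['weighted'] / totals['n']
print(avg_score_toy)
```

Unlike the loop, this version does not depend on each hotel having exactly five consecutive rows.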

In [8]:
attribute = ['Location', 'Rooms', 'Service', 'Cleanliness', 'Value', 'Sleep Quality']
grouped = df_attr.groupby('hotel_name')

In [9]:
# Compute average attribute rating for each hotel
attr_rating = {}
for hotel_name, group in grouped:
    attr_rating[hotel_name] = {}
    attrs = group.attribute.unique()
    for attr in attrs:
        attr_rating[hotel_name][attr] = group[group.attribute==attr].star_value.mean()
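
The nested loop above amounts to a mean per (hotel, attribute) pair. As a hedged sketch, the same table can be obtained with a single `groupby` followed by `unstack`; the data below (`toy_attr`) is hypothetical and only mirrors the column names of `attribute_ratings.csv`.

```python
import pandas as pd

# Hypothetical toy data with the same columns as df_attr.
toy_attr = pd.DataFrame({
    'hotel_name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'attribute': ['Rooms', 'Rooms', 'Value', 'Rooms', 'Value', 'Value'],
    'star_value': [5, 4, 3, 2, 4, 5],
})

# Mean star value per (hotel, attribute), then pivot attributes into columns.
toy_attr_rating = (
    toy_attr.groupby(['hotel_name', 'attribute'])['star_value']
    .mean()
    .unstack('attribute')
)
print(toy_attr_rating)
```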

In [11]:
# store average attribute ratings in a DataFrame
df_attr_rating = pd.DataFrame.from_dict(attr_rating, orient='index')
df_attr_rating.head()

Unnamed: 0,Cleanliness,Location,Service,Sleep Quality,Value,Rooms
Aloft Boston Seaport,4.52381,3.970588,4.258824,4.518519,3.705882,3.967742
Americas Best Value Inn,2.818182,2.666667,2.846154,3.0,2.818182,2.692308
"Ames Boston Hotel, Curio Collection by Hilton",4.636228,4.76555,4.376513,4.273171,4.017417,4.40412
BEST WESTERN PLUS Roundhouse Suites,4.186186,3.236593,4.100785,3.880131,3.923994,4.05296
"Battery Wharf Hotel, Boston Waterfront",4.73755,4.586835,4.463929,4.597561,4.020661,4.555079


In [12]:
# Add the target column (avg_score) and a constant column for the intercept
df_attr_rating['avg_score'] = df_avg_score
df_attr_rating['intercept'] = 1.0
df_attr_rating.head()

Unnamed: 0,Cleanliness,Location,Service,Sleep Quality,Value,Rooms,avg_score,intercept
Aloft Boston Seaport,4.52381,3.970588,4.258824,4.518519,3.705882,3.967742,3.949686,1.0
Americas Best Value Inn,2.818182,2.666667,2.846154,3.0,2.818182,2.692308,2.85,1.0
"Ames Boston Hotel, Curio Collection by Hilton",4.636228,4.76555,4.376513,4.273171,4.017417,4.40412,4.283757,1.0
BEST WESTERN PLUS Roundhouse Suites,4.186186,3.236593,4.100785,3.880131,3.923994,4.05296,3.686801,1.0
"Battery Wharf Hotel, Boston Waterfront",4.73755,4.586835,4.463929,4.597561,4.020661,4.555079,4.420423,1.0


In [13]:
import statsmodels.api as sm

In [15]:
# Get dependent and independent variables
X = df_attr_rating[attribute+['intercept']]
y = df_attr_rating['avg_score']

In [65]:
# Split the data into training and test sets with a 7:3 ratio
# (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
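
To make the 7:3 split concrete, here is a small NumPy-only sketch of a shuffled split over 82 row positions. It assumes the test size is rounded up, which is consistent with the 57 training observations reported in the regression summaries. It does not reproduce the exact rows that `train_test_split` selects.

```python
import math
import numpy as np

n_hotels = 82          # number of hotels reported above
test_fraction = 0.3

rng = np.random.RandomState(0)        # fixed seed for a repeatable shuffle
shuffled = rng.permutation(n_hotels)  # shuffled row positions 0..81

# Assumption: the test set size is rounded up to the next integer.
n_test = math.ceil(test_fraction * n_hotels)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(len(train_idx), len(test_idx))  # 57 25
```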

In [68]:
model = sm.OLS(y_train, X_train)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              avg_score   R-squared:                       0.973
Model:                            OLS   Adj. R-squared:                  0.970
Method:                 Least Squares   F-statistic:                     304.2
Date:                Thu, 17 Nov 2016   Prob (F-statistic):           1.49e-37
Time:                        18:48:56   Log-Likelihood:                 66.777
No. Observations:                  57   AIC:                            -119.6
Df Residuals:                      50   BIC:                            -105.3
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Location          0.1347      0.036      3.787

The R-squared is 0.973, meaning the model explains about 97% of the variance in avg_score on the training set, so OLS fits the data well.
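
For reference, R-squared compares the residual sum of squares with the total sum of squares around the mean. The helper below (`r_squared`, a hypothetical name) is a minimal sketch of that definition on toy values; it is not how the summary above was produced.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for the given targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_toy = [3.0, 4.0, 5.0]
print(r_squared(y_toy, [3.0, 4.0, 5.0]))  # 1.0 (perfect fit)
print(r_squared(y_toy, [4.0, 4.0, 4.0]))  # 0.0 (no better than the mean)
```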

If a coefficient's 95% confidence interval includes 0, the corresponding variable may have no predictive value. In this model, Cleanliness, Value and Sleep Quality are such attributes. Therefore, the important factors determining the average rating of a hotel are **Location, Rooms and Service**.

In [71]:
from sklearn.metrics import mean_squared_error
y_pred = results.predict(X_test)
MSE_test = mean_squared_error(y_test, y_pred)
print ('The MSE of OLS on the test set is %f.'%(MSE_test))

The MSE of OLS on the test set is 0.002738.
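
The MSE is the mean of the squared prediction errors, so its square root is expressed in rating points. The sketch below uses a hypothetical helper `mse` on toy values and then takes the square root of the reported 0.002738 for interpretation.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values: two predictions are off by 0.1, one is exact.
mse_toy = mse([4.0, 3.5, 4.5], [4.1, 3.4, 4.5])

# Root of the reported test MSE: the typical error in rating points.
print(round(0.002738 ** 0.5, 3))  # 0.052
```

A typical error of about 0.05 on a 1 to 5 scale is small relative to the spread of the average scores.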


In [77]:
# train again with only significant attributes
Xsignif = X[['Location','Rooms', 'Service', 'intercept']]
Xsignif_train,Xsignif_test,ysig_train,ysig_test = train_test_split(Xsignif, y, test_size=0.3, random_state=0)
model1 = sm.OLS(ysig_train, Xsignif_train)
results1 = model1.fit()
print (results1.summary())
ysig_pred = results1.predict(Xsignif_test)
MSEsig = mean_squared_error(ysig_test, ysig_pred)
print ('Training with significant attributes only, MSE of OLS on test set is %f.'%(MSEsig))

                            OLS Regression Results                            
Dep. Variable:              avg_score   R-squared:                       0.971
Model:                            OLS   Adj. R-squared:                  0.969
Method:                 Least Squares   F-statistic:                     583.4
Date:                Thu, 17 Nov 2016   Prob (F-statistic):           1.49e-40
Time:                        19:00:07   Log-Likelihood:                 64.003
No. Observations:                  57   AIC:                            -120.0
Df Residuals:                      53   BIC:                            -111.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Location       0.1436      0.036      4.033      0.0

After retraining the model with only the significant attributes, the R-squared is 0.971, a drop of 0.002, and still very close to 1. Thus, with only Location, Rooms and Service, we can predict the average rating of a hotel quite accurately. The MSE on the test set is 0.003735, slightly higher than the 0.002738 of the full model.
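
The adjusted R-squared in both summaries penalizes the number of predictors, which makes the two models easier to compare. As a sketch, the standard formula can be applied to the rounded values printed in the summaries (57 observations; 6 and 3 predictors excluding the intercept). The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

full = adjusted_r2(0.973, n_obs=57, n_predictors=6)
reduced = adjusted_r2(0.971, n_obs=57, n_predictors=3)

print(round(full, 3), round(reduced, 3))  # 0.97 0.969
```

Both values match the `Adj. R-squared` entries in the summaries, up to rounding of the inputs.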

-------

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
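
The labeling rule can be sketched as a small function. The strict `>` below follows the wording "more than 60%"; the loop earlier in the notebook uses `>= 0.6`, which differs only when the share is exactly 0.6. The function name `is_excellent` and the second hotel's counts are hypothetical, while the first set of counts is taken from the `df_rating.head()` output.

```python
def is_excellent(counts, threshold=0.6):
    """Return 1 if the share of 'Excellent' ratings exceeds the threshold."""
    share = counts['Excellent'] / sum(counts.values())
    return int(share > threshold)

# Counts shown in df_rating.head() for the first hotel.
marriott = {'Excellent': 507, 'Very good': 175, 'Average': 13,
            'Poor': 13, 'Terrible': 12}

# Hypothetical hotel where only half of the ratings are 5 stars.
other = {'Excellent': 50, 'Very good': 30, 'Average': 10,
         'Poor': 5, 'Terrible': 5}

print(is_excellent(marriott), is_excellent(other))  # 1 0
```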

In [111]:
df_excellent = pd.Series(excellence)
df_excellent.head()

Aloft Boston Seaport                             0
Americas Best Value Inn                          0
Ames Boston Hotel, Curio Collection by Hilton    0
BEST WESTERN PLUS Roundhouse Suites              0
Battery Wharf Hotel, Boston Waterfront           1
dtype: int64

In [118]:
# get training and test data
Xtrain,Xtest,ytrain,ytest = train_test_split(X, df_excellent, test_size=0.3, random_state=0)

In [127]:
# train logistic regression model on training data
logit = sm.Logit(ytrain, Xtrain.drop('intercept', axis=1))
result = logit.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.394224
         Iterations 8


Dep. Variable:,y,No. Observations:,57.0
Model:,Logit,Df Residuals:,51.0
Method:,MLE,Df Model:,5.0
Date:,"Thu, 17 Nov 2016",Pseudo R-squ.:,0.3807
Time:,19:50:25,Log-Likelihood:,-22.471
converged:,True,LL-Null:,-36.281
,,LLR p-value:,4.316e-05

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Location,0.0673,1.382,0.049,0.961,-2.641 2.775
Rooms,19.9198,6.215,3.205,0.001,7.738 32.101
Service,2.1946,5.424,0.405,0.686,-8.437 12.826
Cleanliness,-17.0128,6.275,-2.711,0.007,-29.312 -4.714
Value,-2.2063,3.491,-0.632,0.527,-9.049 4.636
Sleep Quality,-2.3748,3.666,-0.648,0.517,-9.560 4.811
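
The intervals in this table appear to be normal-approximation (Wald) intervals, coef ± z × std err with z ≈ 1.96. The sketch below recomputes them from the printed coefficients and standard errors and lists the attributes whose interval excludes 0. Because the inputs are rounded, the bounds can differ from the table in the last digit.

```python
from statistics import NormalDist

# (coef, std err) copied from the Logit summary table above.
table = {
    'Location': (0.0673, 1.382),
    'Rooms': (19.9198, 6.215),
    'Service': (2.1946, 5.424),
    'Cleanliness': (-17.0128, 6.275),
    'Value': (-2.2063, 3.491),
    'Sleep Quality': (-2.3748, 3.666),
}

z = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile, about 1.96

excludes_zero = []
for name, (coef, se) in table.items():
    low, high = coef - z * se, coef + z * se
    if low > 0 or high < 0:  # interval lies entirely on one side of 0
        excludes_zero.append(name)

print(excludes_zero)  # ['Rooms', 'Cleanliness']
```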


In [150]:
# test the model on test data
ypred = result.predict(Xtest.drop('intercept', axis=1))
ypred = [1 if y >= 0.1 else 0 for y in ypred] #0.1 is the threshold for determining whether a sample is 0 or 1

truePositives = falsePositives = trueNegatives = falseNegatives = 0
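# ytest is indexed by hotel name, so pair values by position instead of using ytest[i]
for actual, predicted in zip(ytest, ypred):
    if actual == 1 and predicted == 1:
        truePositives += 1
    elif actual == 1:
        falseNegatives += 1
    elif predicted == 1:
        falsePositives += 1
    else:
        trueNegatives += 1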

if (truePositives + falsePositives) == 0:
    precision = 0
else:
    precision = truePositives / (truePositives + falsePositives)

if (truePositives + falseNegatives) == 0:
    recall = 0
else:
    recall = truePositives / (truePositives + falseNegatives)
print ('Logistic regression predicting whether a hotel is excellent:\nPrecision is %f, recall is %f.'%(precision, recall))

Logistic regression predicting whether a hotel is excellent:
Precision is 0.461538, recall is 0.857143.
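
The reported values are consistent with 6 true positives, 7 false positives and 1 false negative (6/13 ≈ 0.4615 and 6/7 ≈ 0.8571). A low threshold such as 0.1 tends to favor recall over precision. The sketch below illustrates that trade-off on hypothetical probabilities; `precision_recall` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def precision_recall(y_true, y_prob, threshold):
    """Threshold probabilities, then compute precision and recall."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities.
y_true_toy = [1, 0, 1, 0, 0]
y_prob_toy = [0.9, 0.4, 0.3, 0.2, 0.05]

low = precision_recall(y_true_toy, y_prob_toy, 0.1)   # (0.5, 1.0)
high = precision_recall(y_true_toy, y_prob_toy, 0.5)  # (1.0, 0.5)
print(low, high)
```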


Using logistic regression with all attributes, the optimization converges after 8 iterations. On the test data, with a classification threshold of 0.1, we get precision = 0.461538 and recall = 0.857143.
From the 95% confidence intervals of the logistic model's parameters, the interval for 'Rooms' excludes 0 and 'Rooms' has the largest positive coefficient; the interval for 'Cleanliness' also excludes 0, but with a negative coefficient. Here we retrain the model using only the attribute 'Rooms'.

In [162]:
logit1 = sm.Logit(ytrain, Xtrain[['Rooms']])
result1 = logit1.fit()
result1.summary()

Optimization terminated successfully.
         Current function value: 0.659920
         Iterations 4


Dep. Variable:,y,No. Observations:,57.0
Model:,Logit,Df Residuals:,56.0
Method:,MLE,Df Model:,0.0
Date:,"Thu, 17 Nov 2016",Pseudo R-squ.:,-0.03677
Time:,20:31:44,Log-Likelihood:,-37.615
converged:,True,LL-Null:,-36.281
,,LLR p-value:,

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Rooms,-0.1263,0.066,-1.912,0.056,-0.256 0.003


In [173]:
ypred = result1.predict(Xtest['Rooms'])
ypred = [1 if y >= 0.36 else 0 for y in ypred]

truePositives = falsePositives = trueNegatives = falseNegatives = 0
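# ytest is indexed by hotel name, so pair values by position instead of using ytest[i]
for actual, predicted in zip(ytest, ypred):
    if actual == 1 and predicted == 1:
        truePositives += 1
    elif actual == 1:
        falseNegatives += 1
    elif predicted == 1:
        falsePositives += 1
    else:
        trueNegatives += 1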

if (truePositives + falsePositives) == 0:
    precision = 0
else:
    precision = truePositives / (truePositives + falsePositives)

if (truePositives + falseNegatives) == 0:
    recall = 0
else:
    recall = truePositives / (truePositives + falseNegatives)
print ('Logistic regression predicting whether a hotel is excellent using only \'Rooms\':\nPrecision is %f, recall is %f.'%(precision, recall))

Logistic regression predicting whether a hotel is excellent using only 'Rooms':
Precision is 0.181818, recall is 0.571429.


-------