(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Analyzing Hotel Ratings on Tripadvisor

In this homework we will focus on practicing two techniques: web scraping and regression. For the first part, we will get some basic information for each hotel in Boston. Then, we will fit a regression model on this information and try to analyze it.

**Task 1 (30 pts)**

We will scrape the data using Beautiful Soup. For each hotel that our search returns, we will get the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 
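
The scraping code in this task keeps following "next page" links until none is left. A minimal sketch of that loop is shown here, assuming a hypothetical in-memory `FAKE_PAGES` dictionary in place of real HTTP requests and HTML parsing. The names `FAKE_PAGES`, `fetch_page`, and `crawl` are illustrative and not part of the assignment.

```python
# Hypothetical stand-in for the website: each "page" lists its items and the
# relative URL of the next page (None on the last page).
FAKE_PAGES = {
    '/reviews-page1': {'items': ['review_1', 'review_2'], 'next': '/reviews-page2'},
    '/reviews-page2': {'items': ['review_3'], 'next': '/reviews-page3'},
    '/reviews-page3': {'items': ['review_4', 'review_5'], 'next': None},
}


def fetch_page(path):
    """Illustrative replacement for requests.get plus parsing; returns one fake page."""
    return FAKE_PAGES[path]


def crawl(start_path, max_pages=100):
    """Collect the items of every page reachable by following 'next' links."""
    collected = []
    visited = set()
    path = start_path
    while path is not None and len(visited) < max_pages:
        if path in visited:  # guard against a cycle of links
            break
        visited.add(path)
        page = fetch_page(path)
        collected.extend(page['items'])
        path = page['next']
    return collected


all_reviews = crawl('/reviews-page1')
print(all_reviews)
```

The `max_pages` cap and the `visited` set are defensive additions for the sketch; they keep the loop from running forever if a site ever links pages in a cycle.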

In [1]:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import re
import sys
import time
import os
import logging
import argparse
import requests
import codecs
import json

base_url = "https://www.tripadvisor.com"
query_url = "/Hotel_Review-g60745-d89599-Reviews-Omni_Parker_House-Boston_Massachusetts.html"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36"
reviews = []

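def get_page_review_boxes(soup):
    '''
        Gets the review boxes on the current page.
    '''
    review_boxes = []
    # find() returns None when no matching box exists; avoid storing None in the list.
    first_box = soup.find('div', {'class' : 'reviewSelector   track_back'})
    if first_box is not None:
        review_boxes.append(first_box)
    review_boxes.extend(soup.findAll('div', {'class': 'reviewSelector  '}))
    return review_boxes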

def get_next_reviews_page(soup):
    '''
         Returns the href of the next page of reviews, or None once all reviews have been grabbed.
    '''
    nextPageLink = soup.find('a', {'class' : 'nav next rndBtn ui_button primary taLnk'})
    if nextPageLink is None:
        return None
    return nextPageLink['href']

def get_all_review_boxes():
    '''
        Populates global variable reviews with all the reviews starting from base_url + query_url
    '''
    query = query_url
    while(query is not None):
        url = base_url + query
        headers = { 'User-Agent' : user_agent }
        response = requests.get(url, headers=headers)
        html = response.text.encode('utf-8')
        soup = BeautifulSoup(html)
        allReviewsOnPage = get_page_review_boxes(soup)
        for entry in allReviewsOnPage:
            reviews.append(entry)
        query = get_next_reviews_page(soup)

def getReviewData(review, OmniParkerHouseReviews):
    '''
        Writes one line of the form review_id:attribute:rating to the open file
        OmniParkerHouseReviews for each rated attribute of the review.
        Writes nothing if the review has no attribute ratings.
    '''
    try:
        review_id = review['id']
        end_url = review.find('a', {'href': True})['href']
        reviewUrl = base_url + end_url
        headers = { 'User-Agent' : user_agent }
        response = requests.get(reviewUrl, headers=headers)
        html = response.text.encode('utf-8')
        soup = BeautifulSoup(html)
        
        ratings = soup.find('div', {'class' : 'rating-list'})
        allRatings = ratings.findAll('li', {'class' : 'recommend-answer'})
        for entry in allRatings:
            attribute = entry.find('div', {'class' : 'recommend-description'}).text
            rating = entry.find('img', {'alt' : True})['alt'][0]
            string = review_id + ':' + attribute + ':' + rating +'\n'
            OmniParkerHouseReviews.write(string)
    except Exception:
        # Best effort: skip reviews with no rating list or an unexpected page layout.
        pass
    
get_all_review_boxes()

In [2]:
# Opening in 'w' mode truncates the file, so no separate clearing step is needed.
with open('OmniParkerHouseReviews.txt', 'w') as OmniParkerHouseReviews:
    for review in reviews:
        getReviewData(review, OmniParkerHouseReviews)
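
Each line written to `OmniParkerHouseReviews.txt` has the form `review_id:attribute:rating`. As a small sketch of how such lines can be tallied into per-attribute star counts, the example below uses made-up sample lines; the review ids and the helper name `tally_ratings` are hypothetical.

```python
# Made-up lines in the review_id:attribute:rating format, for illustration only.
SAMPLE_LINES = [
    'review_101:Service:5\n',
    'review_101:Value:4\n',
    'review_102:Service:3\n',
    'review_102:Rooms:5\n',
    'review_103:Service:5\n',
]


def tally_ratings(lines):
    """Map each attribute to a list of five counts; index 0 holds the 1-star votes."""
    totals = {}
    for line in lines:
        review_id, attribute, rating = line.strip().split(':')
        counts = totals.setdefault(attribute, [0, 0, 0, 0, 0])
        counts[int(rating) - 1] += 1
    return totals


totals = tally_ratings(SAMPLE_LINES)
print(totals['Service'])  # [0, 0, 1, 0, 2]
```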

**Task 2 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
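
The formula above is a weighted mean of the star values. As a quick check of the arithmetic, here is a short sketch that evaluates it for the example counts; the function name `average_score` is illustrative.

```python
# Number of 1-, 2-, 3-, 4- and 5-star ratings for the example hotel above.
counts = [31, 33, 98, 504, 1861]


def average_score(star_counts):
    """Weighted mean of the star values 1..5, weighted by how many ratings each received."""
    total = sum(star_counts)
    weighted_sum = sum(stars * n for stars, n in enumerate(star_counts, start=1))
    return weighted_sum / float(total)


avg = average_score(counts)
print(round(avg, 2))  # 4.63
```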

In [3]:
'''
    Builds hotelAverageRatings, a dictionary that maps each hotel name to a dictionary of
    {attribute: average rating} for all hotels.
'''

def calculate_average_rating(values, total):
    '''
        Helper method, calculates average rating for a list of 5 values where the first number is # of 1 stars,
        second number is # of 2 stars... and total is the total of all the values.
    '''
    return (1.0 * values[0] + 2 * values[1]
            + 3 * values[2] + 4 * values[3] + 5 * values[4])/(total)

'''
    Builds the dictionary for everything in the file given to us by Katherine.
    NOTE: I edited the file slightly, the hotels that have a '/' character in them on Tripadvisor did not have
    the '/' character in the file so I added them in.
'''
rating_summary = open('rating-summary.txt', 'r')
line = rating_summary.readline()
hotelAverageRatings = {}
validAttributes = ['Service', 'Cleanliness', 'Value', 'Sleep Quality', 'Rooms', 'Location']
while line != '':
    lines = [line.split(':')]
    for i in range(0,4):
        lines.append(rating_summary.readline().split(':'))
    if lines[0][1] in validAttributes:
        counts = [int(fields[3]) for fields in lines]
        averageRating = calculate_average_rating(counts, sum(counts))
        # setdefault creates the per-hotel dictionary the first time a hotel is seen.
        hotelAverageRatings.setdefault(lines[0][0], {})[lines[0][1]] = averageRating
    line = rating_summary.readline()
rating_summary.close()

'''
    Builds the dictionary for the values of Omni Parker House based on the file made earlier in the code.
'''

OmniParkerAttributeTotals = dict((attribute, [0, 0, 0, 0, 0]) for attribute in validAttributes)
with open('OmniParkerHouseReviews.txt', 'r') as OmniParker:
    for line in OmniParker:
        Line = line.split(':')
        if Line[1] in validAttributes:
            # Line[2] is the star rating (1-5); index 0 counts the 1-star votes.
            OmniParkerAttributeTotals[Line[1]][int(Line[2]) - 1] += 1

hotelAverageRatings['Omni Parker House'] = {}
for attribute in validAttributes:
    counts = OmniParkerAttributeTotals[attribute]
    hotelAverageRatings['Omni Parker House'][attribute] = calculate_average_rating(counts, sum(counts))

In [4]:
'''
    Adds to each hotel's entry in hotelAverageRatings the number of votes for each traveler rating.
    After this cell each entry is a dictionary of the form
        {'Service': avgServiceScore, 'Cleanliness': avgCleanlinessScore, ... ,
         'HotelRatings': hotelRatings, 'AverageScore': average hotel score}
    hotelRatings is a list [# excellent, # very good, # average, # poor, # terrible]
'''
def getHotelRating(url):
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')
    soup = BeautifulSoup(html)
    div = soup.find('div', {'id': 'ratingFilter'})
    lis = div.findAll('li')
    ratings = [0,0,0,0,0]
    for i in range(0,5):
        spans = lis[i].findAll('span')
        ratings[i] = int(str(spans[3].text).replace(',',''))
    return ratings

def HotelAverageRating(hotelRatings):
    return (5.0 * hotelRatings[0] + 4 * hotelRatings[1] + 3 * hotelRatings[2] + 
                2 * hotelRatings[3] + hotelRatings[4])/(sum(hotelRatings))

allHotelPages = {}
nextPage_url = '/Hotels-g60745-Boston_Massachusetts-Hotels.html'
while (nextPage_url is not None):
    url = base_url + nextPage_url
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')
    soup = BeautifulSoup(html)
    for entry in soup.findAll('div', {'class': 'listing_title'}):
        allHotelPages[entry.find('a').text] = entry.find('a')['href']
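    try:
        nextPage_url = soup.find('div', {'class': 'unified pagination standard_pagination'}).find('a',
                                                {'class': 'nav next ui_button primary taLnk'})['href']
    except (AttributeError, TypeError, KeyError):
        # No pagination block or no "next" link: this was the last page of listings.
        nextPage_url = None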
        
keys = sorted(allHotelPages.keys())
for key in keys:
    if key in hotelAverageRatings:
        hotelAverageRatings[key]['HotelRatings'] = getHotelRating(base_url + allHotelPages[key])
        hotelAverageRatings[key]['AverageScore'] = HotelAverageRating(hotelAverageRatings[key]['HotelRatings'])

In [44]:
import numpy as np
Y = []
Y2 = []
X = []

def excellentCheck(hotelRatings):
    excellents = hotelRatings[0] * 1.0
    total = sum(hotelRatings)
    percent = excellents/total
    if percent > 0.6:
        return 1
    return 0

for key in keys:
    data = hotelAverageRatings.get(key)
    # Skip scraped hotels that have no entry or no scraped scores in hotelAverageRatings.
    if data is None or 'AverageScore' not in data:
        continue
    Y.append(data['AverageScore'])
    Y2.append(excellentCheck(data['HotelRatings']))
    # Feature order: Service, Cleanliness, Value, Sleep Quality, Rooms, Location.
    # A missing attribute is recorded as 0.
    X.append(np.array([data.get(attribute, 0) for attribute in validAttributes]))

In [45]:
'''
    The X and Y variables are made in the code block above. Both these lists are made from a sorted list of the keys
        i.e. the first entry will be for Aloft Boston Seaport, the second Americas Best Value Inn ...
    Y is a list of the 82 values denoted by the 'AverageScore' field in the hotelAverageRatings dictionary.
    X is a list of 82 lists. Each list in X is the average rating for Service, Cleanliness, Value, Sleep Quality,
        Rooms, Location respectively. 
'''

import statsmodels.api as sm

X = np.array(X)
Y = np.array(Y)

model = sm.OLS(Y, X)
results = model.fit()
print results.summary()

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 2.684e+04
Date:                Mon, 28 Mar 2016   Prob (F-statistic):          3.10e-124
Time:                        21:09:56   Log-Likelihood:                 81.256
No. Observations:                  82   AIC:                            -150.5
Df Residuals:                      76   BIC:                            -136.1
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.1926      0.089      2.165      0.0

Analysis of the Regression:

x1 = Service

x2 = Cleanliness

x3 = Value

x4 = Sleep Quality

x5 = Rooms 

x6 = Location

From the table above, Location has the smallest coefficient and so appears to have the least impact on a hotel's overall rating. Rooms has the largest coefficient, meaning it has the most impact. Cleanliness also has a sizable coefficient, but it has the highest standard error, so that estimate is less reliable. Judging by the confidence intervals, an attribute's rating is unlikely to move against the hotel's overall rating (i.e. a hotel whose attributes all have low average scores is unlikely to have a high overall score), so negative relationships aren't a concern with this data set.
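
One caveat when reading the R-squared of 1.000: `sm.OLS(Y, X)` is fit without an intercept column (6 regressors, 76 residual degrees of freedom from 82 observations), and for models without a constant statsmodels reports the uncentered R-squared. That version compares the residuals against the sum of squared `y` values rather than against the variation around the mean. The sketch below uses made-up scores and predictions, not the hotel data, to show how far apart the two definitions can be. Adding a constant with `sm.add_constant(X)` would be one way to obtain the centered version.

```python
# Made-up observed scores and model predictions, for illustration only.
y = [4.1, 4.5, 3.9, 4.4, 4.6]
y_hat = [4.3, 4.3, 4.1, 4.3, 4.5]

ssr = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))  # residual sum of squares
mean_y = sum(y) / len(y)
centered_tss = sum((obs - mean_y) ** 2 for obs in y)  # variation around the mean
uncentered_tss = sum(obs ** 2 for obs in y)           # variation around zero

r2_centered = 1 - ssr / centered_tss
r2_uncentered = 1 - ssr / uncentered_tss
print(round(r2_centered, 3))
print(round(r2_uncentered, 3))
```

Because hotel scores all sit far from zero, the uncentered total is dominated by the size of the scores themselves, which pushes the uncentered R-squared toward 1 even when the fit around the mean is only moderate.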

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
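
As a worked example of the labeling rule, the sketch below applies the 60% threshold to the vote counts of the Task 2 example hotel and to a made-up borderline case. The function name `is_excellent` and the handling of hotels with no votes are assumptions made for illustration.

```python
def is_excellent(hotel_ratings, threshold=0.6):
    """Return 1 if the share of 5-star ("excellent") votes exceeds the threshold, else 0.

    hotel_ratings is ordered [# excellent, # very good, # average, # poor, # terrible].
    """
    total = sum(hotel_ratings)
    if total == 0:  # assumption: a hotel with no votes is treated as not excellent
        return 0
    share = hotel_ratings[0] / float(total)
    return 1 if share > threshold else 0


example = [1861, 504, 98, 33, 31]  # counts from the Task 2 example hotel (about 74% excellent)
borderline = [60, 20, 10, 5, 5]    # made-up: exactly 60% excellent, which is not "more than" 60%
print(is_excellent(example))
print(is_excellent(borderline))
```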

In [42]:
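from sklearn import linear_model  # import the submodule explicitly; plain "import sklearn" is not guaranteed to load it
X = np.array(X)
Y2 = np.array(Y2)
# Note: this fits ordinary linear regression to the 0/1 excellent label
# (a linear probability model), not logistic regression.
linearRegression = linear_model.LinearRegression()
# First fit: excellent label as a function of the average score alone.
linearRegression.fit(Y.reshape(-1,1),Y2)
print linearRegression.coef_
print linearRegression.score(Y.reshape(-1,1),Y2)
# Second fit: excellent label as a function of the six attribute averages.
linearRegression.fit(X,Y2)
print linearRegression.coef_
print linearRegression.score(X,Y2)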

[ 0.62132444]
0.410327195043
[ 0.57064449 -1.25443955  0.38406119  0.31952701  0.71841793  0.00974222]
0.505445945267


The first value above is the coefficient from regressing the excellent label on the average hotel score, and the number below it is the R^2 of that fit. I'm surprised that it's so low, actually. I expect that poorly rated hotels will all be not excellent, so those should all be matched correctly, which means there must be a lot of highly rated hotels close to the 60% excellent border. Some hotels also don't have many reviews, so each vote there carries a lot more weight in the percentage, which might explain the result.

The second list holds the coefficients for the attributes from before, in the same order. Cleanliness having a negative coefficient was surprising. Rooms still had a high coefficient, which makes sense because the room is the most important/rated part of a hotel.

I don't have standard errors for either fit, but I imagine they would be high given the values I have. The number below each list is the coefficient of determination, R^2, which measures how well the model fits. Both are low: the average score explains about 41% of the variance in the excellent label, and the six attributes explain about 51%. Note that an R^2 near 0.5 is not the same as 50% classification accuracy, so it does not by itself mean the model is a coin flip. Having more information might help, or once again maybe some hotels just don't have enough reviews and are skewed by a couple of bad votes when the majority still votes well. I'm not 100% sure, but that is my best guess as to why the values come out how they do.
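
Task 3 asks for logistic regression, while the cell above fits `LinearRegression` to the 0/1 label. For comparison, here is a minimal sketch of a one-feature logistic regression trained by plain gradient descent on made-up data. The scores, labels, learning rate, step count, and the names `sigmoid`, `fit_logistic`, and `predict_proba` are all illustrative assumptions; in practice a library class such as `sklearn.linear_model.LogisticRegression` could be used instead.

```python
import math

# Made-up average scores and 0/1 "excellent" labels, for illustration only.
scores = [3.2, 3.6, 3.9, 4.1, 4.3, 4.5, 4.6, 4.8]
labels = [0, 0, 0, 0, 1, 0, 1, 1]


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=20000):
    """Fit P(y=1|x) = sigmoid(w*(x - mean_x) + b) by gradient descent on the mean log-loss."""
    mean_x = sum(xs) / float(len(xs))
    centered = [x - mean_x for x in xs]  # centering helps gradient descent converge
    w, b = 0.0, 0.0
    n = float(len(xs))
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(centered, ys):
            error = sigmoid(w * x + b) - y  # derivative of the log-loss w.r.t. the logit
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b, mean_x


w, b, mean_x = fit_logistic(scores, labels)


def predict_proba(score):
    """Estimated probability that a hotel with this average score is excellent."""
    return sigmoid(w * (score - mean_x) + b)


print(round(predict_proba(3.5), 3))
print(round(predict_proba(4.7), 3))
```

Unlike the linear fit, the predicted values here always stay between 0 and 1, so they can be read as probabilities and thresholded (for example at 0.5) to classify a hotel.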

-------

In [43]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../custom.css", "r").read()
    return HTML(styles)
css_styling()