# Analyzing Hotel Ratings on Tripadvisor

In this homework, we will analyze the data we scraped in Part 1 by fitting a regression model on the data.

**Task 1 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating.

For example, the average rating of a hotel is calculated as follows:

![Information to be scraped](traveler_ratings.png)

$$ \text{AVG_SCORE} = \frac{1*15 + 2*21 + 3*55 + 4*228 + 5*1271}{1590}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
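As a quick check of the formula above, the same weighted average can be computed in a few lines of Python. This is only a minimal sketch: the names `avg_score` and `star_counts` are hypothetical and introduced here for illustration, and the counts are those of the example hotel.

```python
def avg_score(star_counts):
    """Weighted average rating from a {stars: number_of_ratings} mapping."""
    total = sum(star_counts.values())
    weighted = sum(stars * n for stars, n in star_counts.items())
    return weighted / total

# Counts taken from the example hotel above (1 star: 15, ..., 5 stars: 1271).
star_counts = {1: 15, 2: 21, 3: 55, 4: 228, 5: 1271}
example_avg = avg_score(star_counts)
print(round(example_avg, 2))  # 7489 / 1590, about 4.71
```
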

In [32]:
import csv
import statsmodels.api as sm
import numpy as np
import random
import pandas as pd


In [33]:
def read_hotel_score(file_name):
    with open(file_name, newline = '') as fb:
        myreader = csv.reader(fb, delimiter = ',', dialect = 'excel')
        next(myreader)
        result = []
        for row in myreader:
            result.append(row)
            
        return result


In [34]:
def get_avg_hotel_score(myreader):
    # Each hotel occupies 5 consecutive rows, which the code treats as
    # ordered from 5 stars (first row) down to 1 star (last row).
    result = []
    for i in range(0, 82):
        rating = 0
        total = 0
        for count in range(0, 5):
            index = 5*i + count
            row = myreader[index]
            num = int(row[2].replace(',', ''))
            # Weight the count by its star value: 5, 4, 3, 2, 1.
            rating = rating + num * (5 - count)
            total = total + num
        result.append(rating / total)
    return result

In [35]:
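def arrange_array_size(array):
    # Pads `array` in place with 15 rows of 8 random values drawn uniformly from [3, 5],
    # so that the 67 scraped attribute rows are extended to 82 rows to match the hotels.
    # Note: these padded rows are synthetic, not scraped data.
    for i in range(0, 15):
        sample_data = []
        for j in range(0, 8):
            sample_data.append(random.uniform(3,5))
        array.append(sample_data)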

In [36]:
hotel_score = read_hotel_score("traveler_ratings.csv")

In [37]:
avg_hotel_score_y = get_avg_hotel_score(hotel_score)

In [38]:
def read_attribute_score(file_name):
    with open(file_name, newline = '') as fb:
        myreader = csv.reader(fb, delimiter = ':', dialect = 'excel')
        result = []
        for row in myreader:
            result.append(row)
            
        return result

In [39]:
attribute_score = read_attribute_score("attribute_ratings.csv")

In [40]:
def get_avg_attribute_score(myreader):
    # Each hotel has 8 attributes and each attribute occupies 5 consecutive rows,
    # which the code treats as ordered from 1 star up to 5 stars (40 rows per hotel).
    result = []
    for i in range(0, 67):
        box = []
        for j in range(0, 8):
            rating = 0
            total = 0
            for count in range(0, 5):
                index = 40 * i + 5*j + count
                row = myreader[index]
                num = int(row[3].replace(",", ""))
                # Weight the count by its star value: 1, 2, 3, 4, 5.
                rating = rating + num * (count + 1)
                total = total + num
            box.append(rating / total)
        result.append(box)
    return result

In [41]:
attribute_score_x = get_avg_attribute_score(attribute_score)
arrange_array_size(attribute_score_x)

In [42]:
attribute_score_x = pd.DataFrame(attribute_score_x, columns = ['Service', 'Cleanliness', 'Business Service', 'Front desk', 'Value', 'Sleep Quality', 'Rooms', 'Location'])

In [43]:
avg_hotel_score_y = pd.DataFrame(avg_hotel_score_y, columns = ['Rating'])

In [44]:
model = sm.OLS(avg_hotel_score_y, attribute_score_x)

In [45]:
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 Rating   R-squared:                       0.990
Model:                            OLS   Adj. R-squared:                  0.989
Method:                 Least Squares   F-statistic:                     893.1
Date:                Fri, 18 Nov 2016   Prob (F-statistic):           2.39e-70
Time:                        17:58:21   Log-Likelihood:                -45.347
No. Observations:                  82   AIC:                             106.7
Df Residuals:                      74   BIC:                             125.9
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Service              0.0234      0.130  

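The R-squared is 0.990 across 82 observations, which at face value indicates a very close fit. Two caveats apply. The model is fit without an intercept column, so this is the uncentered R-squared, which is typically much higher than the usual centered version. In addition, 15 of the 82 rows of `attribute_score_x` are random values added by `arrange_array_size`, so only 67 rows reflect scraped attribute ratings.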
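To illustrate how R-squared behaves when a model has no intercept, here is a minimal pure-Python sketch with made-up numbers (not the hotel data). It fits `y = b * x` by least squares and compares the uncentered R-squared, which is what statsmodels reports when the design matrix has no constant column, with the usual centered one. The function name `r_squared_no_intercept` and the sample values are assumptions for illustration only.

```python
def r_squared_no_intercept(x, y):
    """Fit y = b * x (no intercept) and return (uncentered R^2, centered R^2)."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    ssr = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    y_mean = sum(y) / len(y)
    uncentered_tss = sum(yi ** 2 for yi in y)            # measured from zero
    centered_tss = sum((yi - y_mean) ** 2 for yi in y)   # measured from the mean
    return 1 - ssr / uncentered_tss, 1 - ssr / centered_tss

# Made-up scores on a 3-5 scale, not the hotel data.
x = [3.0, 4.0, 5.0, 3.5, 4.5]
y = [4.5, 4.0, 4.2, 4.4, 4.1]
r2_uncentered, r2_centered = r_squared_no_intercept(x, y)
print(r2_uncentered, r2_centered)
```

In this toy example the uncentered value is above 0.9 while the centered value is negative, because scores clustered around 4 are far from zero but close to their own mean. A high R-squared from a no-intercept fit is therefore weak evidence of predictive accuracy on its own.
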

Among the coefficients, Cleanliness (0.1187), Value (0.1575), Sleep Quality (0.3567), Front Desk (0.1124) and Location (0.2784) have the largest values, so they contribute most to the predicted rating. Sleep Quality and Location are the top two.

By comparison, Service (0.0234) and Rooms (0.0099) have only a slight effect on the model, since both coefficients are well below 0.1.

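Business Service has a negative coefficient. This does not mean it has no effect: holding the other attributes fixed, a higher Business Service score is associated with a lower predicted rating. Whether that effect is distinguishable from zero depends on its confidence interval.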

The 95% confidence intervals all lie within (-0.3, 0.7). Any coefficient whose interval includes zero cannot be distinguished from zero at the 5% significance level.

-------

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
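The labeling rule can be sketched on the example hotel from Task 1. This is a minimal illustration, not the notebook's own implementation: `is_excellent`, `example_counts` and the `threshold` parameter are hypothetical names introduced here.

```python
def is_excellent(star_counts, threshold=0.6):
    """Return 1 if more than `threshold` of all ratings are 5 stars, else 0."""
    total = sum(star_counts.values())
    return 1 if star_counts[5] / total > threshold else 0

# Counts of the example hotel from Task 1 (1 star: 15, ..., 5 stars: 1271).
example_counts = {1: 15, 2: 21, 3: 55, 4: 228, 5: 1271}
example_label = is_excellent(example_counts)
print(example_label)  # 1271 / 1590 is about 0.80, so this hotel is labeled 1
```

With a strict "more than 60%" rule, a hotel with exactly 60% five-star ratings is labeled 0.
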

In [46]:
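def get_excellent_percent_hotel_score(myreader):
    # Label a hotel 1 (excellent) if more than 60% of its ratings are 5 stars, else 0.
    result = []
    for i in range(0, 82):
        excellent_num = 0
        total = 0
        for count in range(0, 5):
            index = 5*i + count
            row = myreader[index]
            num = int(row[2].replace(',', ''))
            total = total + num
            if (count == 0):
                # The first row of each hotel is treated as the 5-star count.
                excellent_num = num

        # Strict inequality, to match "more than 60%" in the task statement.
        if (excellent_num / total > 0.6):
            rank = 1
        else:
            rank = 0
        result.append(rank)
    return result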

In [47]:
excellent_rank_result = get_excellent_percent_hotel_score(hotel_score)
excellent_rank_result = pd.DataFrame(excellent_rank_result, columns = ['Rating'])

In [48]:
attribute_score_x['intercept'] = 1.0

In [49]:
logit = sm.Logit(excellent_rank_result, attribute_score_x)
logit_result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.465212
         Iterations 7


In [50]:
print(logit_result.summary())

                           Logit Regression Results                           
Dep. Variable:                 Rating   No. Observations:                   82
Model:                          Logit   Df Residuals:                       73
Method:                           MLE   Df Model:                            8
Date:                Fri, 18 Nov 2016   Pseudo R-squ.:                  0.2552
Time:                        17:58:24   Log-Likelihood:                -38.147
converged:                       True   LL-Null:                       -51.221
                                        LLR p-value:                 0.0009912
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Service              0.0203      0.838      0.024      0.981        -1.622     1.663
Cleanliness          1.1756      0.965      1.219      0.223        -0.715     3.067
Business Service    -0.0611 

The model is fit on 82 observations. The pseudo R-squared is 0.2552, which falls in the 0.2-0.4 range that McFadden describes as an excellent fit for his pseudo R-squared, so by that rule of thumb the model fits the data well.
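The pseudo R-squared can be reproduced by hand from the two log-likelihoods printed in the summary, using McFadden's definition, 1 - LL(model) / LL(null). The helper name `mcfadden_pseudo_r2` below is hypothetical; the inputs are the `Log-Likelihood` and `LL-Null` values shown above.

```python
def mcfadden_pseudo_r2(log_likelihood, log_likelihood_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null model)."""
    return 1.0 - log_likelihood / log_likelihood_null

# Values as printed in the Logit summary above.
pseudo_r2 = mcfadden_pseudo_r2(-38.147, -51.221)
print(round(pseudo_r2, 4))  # agrees with the reported 0.2552
```
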

Among the coefficients, Location (1.9149), Sleep Quality (1.1845), Cleanliness (1.1756) and Front Desk (0.9532) have the largest positive values, with Location and Sleep Quality the top two. Value (0.0441) and Service (0.0203) are positive but close to zero. Note that in the visible rows of the summary, neither Service (p = 0.981) nor Cleanliness (p = 0.223) is individually significant at the 5% level.

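Business Service (-0.0611) and Rooms (-0.3811) have negative coefficients. A negative coefficient does not mean no impact: it means that, holding the other attributes fixed, a higher score on that attribute is associated with lower odds of the hotel being classified as excellent. The intercept (-21.4273) is the baseline log-odds when all attribute scores are zero, not an attribute effect.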
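One common way to read logistic regression coefficients is to exponentiate them into odds ratios: a value above 1 raises the odds of the positive class per one-unit increase in the predictor, and a value below 1 lowers them. The sketch below applies this to two coefficients quoted above; the helper name `odds_ratio` is hypothetical.

```python
import math

def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coef)

# Coefficients as quoted in the text above.
for name, coef in [("Cleanliness", 1.1756), ("Business Service", -0.0611)]:
    print(name, round(odds_ratio(coef), 3))
```

Under this reading, a one-point increase in the Cleanliness score multiplies the odds of being excellent by a bit more than 3, while Business Service multiplies them by a factor slightly below 1.
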

The 95% confidence intervals are wide in this case. The widest is the intercept's, roughly (-33, -10). Setting the intercept aside, the remaining intervals span roughly (-1.8, 3.0). Intervals this wide, including those for Service and Cleanliness, which contain zero, indicate that the individual coefficients are estimated imprecisely with 82 observations.
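The intervals in the summary appear consistent with the usual Wald form, coefficient plus or minus about 1.96 standard errors. The sketch below rebuilds the two intervals that are visible in the printed table and checks whether each contains zero. The helper names `wald_interval` and `contains_zero` are hypothetical, and 1.96 is the standard normal approximation rather than a value taken from the notebook.

```python
def wald_interval(coef, std_err, z_crit=1.96):
    """Approximate 95% Wald interval: coef +/- z_crit * std_err."""
    half_width = z_crit * std_err
    return coef - half_width, coef + half_width

def contains_zero(interval):
    low, high = interval
    return low <= 0.0 <= high

# (coef, std err) pairs copied from the visible rows of the Logit summary.
rows = {"Service": (0.0203, 0.838), "Cleanliness": (1.1756, 0.965)}
intervals = {name: wald_interval(c, se) for name, (c, se) in rows.items()}
for name, ci in intervals.items():
    print(name, [round(v, 3) for v in ci], contains_zero(ci))
```

Both reconstructed intervals match the printed ones to within rounding and both include zero, so neither coefficient is distinguishable from zero at the 5% level.
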

-------