
### Introduction:   

TripAdvisor is a well-established website providing travel information, hotel reviews, and other trip-related content. Most of its content is user-generated, and datasets scraped from it can be used for inferential analysis of the factors that affect a rating.

### Dataset:

Las Vegas Strip. (2017). UCI Machine Learning Repository.
https://archive.ics.uci.edu/dataset/397/las+vegas+strip


 
This dataset is a sample of TripAdvisor reviews of 21 hotels located in Las Vegas. It contains 504 records with 20 attributes each: 24 records per hotel (two per month, randomly selected) for the year 2015.

We will probe the basic dataset attributes with four hypothesis tests: a t-test, a Z-test, a one-way ANOVA, and a chi-squared test.



In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import t

# The CSV is semicolon-delimited; adjust the path to your local copy
data = pd.read_csv('/Users/isha/Downloads/LasVegasTripAdvisorReviews-Dataset.csv', sep=';')


# Hypothesis 1
## T-test

**Problem statement:**<br>
Excalibur Hotel & Casino has a rating of 3 stars as per TripAdvisor. We have a sample of 24 ratings given to this hotel by TripAdvisor users. Does the average rating given by users differ from the rating given by TripAdvisor?

**Test and Hypothesis:**<br>
We perform a two-sided, one-sample t-test of the following hypotheses:<br>
`H0: average hotel rating provided by the user is equal to 3 stars`<br>
`H1: average hotel rating provided by the user is not equal to 3 stars`<br>
significance level alpha = 0.05

**Dataset variables:**<br>
Score: hotel rating (stars) given by the user<br>
Hotel stars: hotel rating (stars) as per TripAdvisor<br>




In [3]:
# Ratings given by users to the Excalibur Hotel & Casino
excalibur_scores = data[data['Hotel name'] == 'Excalibur Hotel & Casino']['Score']
# Sample mean of the user ratings
avg_rating = excalibur_scores.mean()
# Sample standard deviation of the ratings (pandas uses ddof=1 by default)
std = excalibur_scores.std()
# Number of records for the Excalibur Hotel & Casino
sample_data = len(excalibur_scores)
# Rating of the Excalibur Hotel & Casino given by TripAdvisor (mean under H0)
mu_not = 3

population_mean = mu_not
sample_mean = avg_rating

## significance level 
alpha = 0.05
  
# calculating t-value    
t_score = (sample_mean - population_mean) / (std / np.sqrt(sample_data))

# Degrees of freedom
df = sample_data - 1

# Calculate p-value
p_value = 2 * (1 - t.cdf(np.abs(t_score), df))


if (p_value <= alpha):
    print("p value is less than or equal to alpha so we reject the null hypothesis")
else:
    print("p value greater than alpha, we fail to reject the null hypothesis")



p value is less than or equal to alpha so we reject the null hypothesis


We therefore reject H0 and conclude that <br>
`H1: average hotel rating provided by the user is not equal to 3 stars`
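
The manual calculation above can be cross-checked against `scipy.stats.ttest_1samp`, which runs the same two-sided one-sample t-test. The sketch below is illustrative only: the helper name `one_sample_t` and the `example_scores` list are assumptions made for the example, not the Excalibur sample from the dataset.

```python
import numpy as np
from scipy import stats


def one_sample_t(scores, mu_0):
    """Return (t, p) computed by hand and (t, p) from scipy for a two-sided test."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    # Sample standard deviation with ddof=1, matching pandas' default .std()
    std_err = scores.std(ddof=1) / np.sqrt(n)
    t_manual = (scores.mean() - mu_0) / std_err
    # sf(x) = 1 - cdf(x); doubled for the two-sided alternative
    p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)
    t_scipy, p_scipy = stats.ttest_1samp(scores, popmean=mu_0)
    return (t_manual, p_manual), (t_scipy, p_scipy)


# Hypothetical ratings, NOT taken from the TripAdvisor dataset
example_scores = [5, 4, 4, 3, 5, 4, 2, 5, 4, 3, 4, 5]
manual, library = one_sample_t(example_scores, mu_0=3)
print("manual :", manual)
print("library:", library)
```

With the real data, the same check would pass the `Score` column of the Excalibur rows in place of `example_scores`.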

# Hypothesis 2 
## Z-test

**Problem statement:**<br>
From the analysis of the data we could see that couples are the most common traveler type. There is an estimate that almost one third of the people travelling to Las Vegas are couples, with the rest divided among business, family, friend and solo travelers. We use a one-sided Z-test to check whether the proportion of couples in the sample exceeds that estimate.


**Test and Hypothesis:** <br>
We perform a one-sided, one-proportion Z-test of the following hypotheses, where p is the proportion of travelers who are couples:<br>
`H0: p <= 0.33` <br>
`H1: p > 0.33`  <br>
significance level alpha = 0.05


In [4]:
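# We are testing a proportion here, not a mean
couples = len(data[data['Traveler type'] == 'Couples'])

# Total number of records
n = len(data)

x = couples
alpha = 0.05
# Hypothesized proportion under H0
p_0 = 0.33

# Sample proportion of couples
p_cap = x / n

# Z statistic, using the standard error under H0
z = (p_cap - p_0) / np.sqrt(p_0 * (1 - p_0) / n)
# Right-tailed p-value for H1: p > p_0
p_value = 1 - stats.norm.cdf(z)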
  
if (p_value <= alpha):
    print ("p_value is less than or equal to alpha so we reject the null hypothesis")
else:
    print ("p value is greater than alpha so we fail to reject the null hypothesis")

p_value is less than or equal to alpha so we reject the null hypothesis


We therefore reject H0 and conclude that <br>
`H1: p > 0.33`  <br>
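
The same right-tailed proportion test can be wrapped in a small reusable function. This is a minimal sketch under stated assumptions: the function name `proportion_ztest_greater` and the counts passed to it are hypothetical, chosen only to show the mechanics, and are not the counts from the dataset.

```python
import numpy as np
from scipy import stats


def proportion_ztest_greater(successes, n, p_0):
    """Right-tailed one-proportion Z-test of H0: p <= p_0 against H1: p > p_0."""
    p_hat = successes / n
    # Standard error evaluated at the null proportion p_0
    std_err = np.sqrt(p_0 * (1 - p_0) / n)
    z = (p_hat - p_0) / std_err
    # sf(z) = 1 - cdf(z), the upper-tail probability
    p_value = stats.norm.sf(z)
    return z, p_value


# Hypothetical counts, NOT the couples count from the dataset
z_demo, p_demo = proportion_ztest_greater(successes=45, n=100, p_0=0.33)
print("z =", z_demo, "p-value =", p_demo)

# Sanity check: a sample proportion equal to p_0 gives z = 0 and p-value = 0.5
z_null, p_null = proportion_ztest_greater(successes=33, n=100, p_0=0.33)
```

The normal approximation behind this test is usually considered reasonable when both `n * p_0` and `n * (1 - p_0)` are comfortably large (a common rule of thumb is at least 10).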

# Hypothesis 3 
## ANOVA Test 

**Problem statement:**<br>
The dataset contains reviews of various Las Vegas hotels submitted by users, who are grouped into several traveler types. We want to find out whether the average score given by users travelling as Couples, with Friends or with Families is the same.<br>

**Test and Hypothesis:**<br>
We perform a one-way ANOVA test on the scores given by users from 3 categories of Traveler type (couples, friends and families).<br>
`H0: The average score given by users from all 3 categories is the same.`<br>
`H1: The average score given by users from all 3 categories is not the same.`<br>
significance level alpha = 0.05


**Dataset variables:**<br>
Traveler type: Couples, Friends and Families<br>
Score: hotel rating (stars) given by the user<br>



In [5]:
# Scores given by each traveler type
group_a = data[data['Traveler type'] == 'Couples']['Score'].tolist()
group_b = data[data['Traveler type'] == 'Friends']['Score'].tolist()
group_c = data[data['Traveler type'] == 'Families']['Score'].tolist()
# One-way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print('the p-value is {}'.format(p_value))

if (p_value <= alpha):
    print("p value is less than or equal to alpha so we reject the null hypothesis")
else:
    print("p value greater than alpha, we fail to reject the null hypothesis")



the p-value is 0.12979067950534684
p value greater than alpha, we fail to reject the null hypothesis


We therefore fail to reject <br>
`H0: The average score given by users from all 3 categories is the same.`<br>
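
For reference, the F statistic that `stats.f_oneway` reports is the ratio of between-group to within-group mean squares. The sketch below computes it by hand and compares it with SciPy; the helper `one_way_anova` and the three small groups are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np
from scipy import stats


def one_way_anova(*groups):
    """Compute the one-way ANOVA F statistic and p-value from raw groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                       # number of groups
    n_total = sum(g.size for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_value = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    p = stats.f.sf(f_value, k - 1, n_total - k)
    return f_value, p


# Hypothetical groups, NOT taken from the TripAdvisor dataset
demo_a = [4, 5, 3, 4, 5, 4]
demo_b = [3, 4, 4, 5, 3, 4]
demo_c = [5, 4, 4, 3, 5, 5]

f_manual, p_manual = one_way_anova(demo_a, demo_b, demo_c)
f_scipy, p_scipy = stats.f_oneway(demo_a, demo_b, demo_c)
print("manual:", f_manual, p_manual)
print("scipy :", f_scipy, p_scipy)
```

Note that one-way ANOVA assumes roughly normal residuals and similar variances across groups, so those assumptions are worth checking before relying on the p-value.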


# Hypothesis 4
## Chi-squared test (categorical data analysis) on bivariate relationships:

**Problem statement:**<br>
We examine whether the presence of a tennis court as an amenity is associated with the type of travelers (Business, Friends or Families) staying at a hotel.<br>

**Test and Hypothesis:**<br>
We perform a chi-squared test of independence between the presence of a tennis court and 3 categories of Traveler type (families, friends and business).<br>
`H0: Presence of tennis court as an amenity is independent of the traveler type`<br>
`H1: There is an association between the presence of tennis court as an amenity and the traveler type.`<br>
significance level alpha = 0.05


**Dataset variables:**<br>
Traveler type: Business, Friends and Families<br>
Tennis court: whether the hotel has a tennis court (YES/NO)<br>

In [6]:
from scipy.stats import chi2_contingency

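## preparing values for contingency table
# rows: Families, Friends, Business; columns: tennis court YES, NO
families_tennis_yes = len(data[(data['Traveler type'] == 'Families') & (data['Tennis court'] == 'YES')])
families_tennis_no = len(data[(data['Traveler type'] == 'Families') & (data['Tennis court'] == 'NO')])
friends_tennis_yes = len(data[(data['Traveler type'] == 'Friends') & (data['Tennis court'] == 'YES')])
friends_tennis_no = len(data[(data['Traveler type'] == 'Friends') & (data['Tennis court'] == 'NO')])
business_tennis_yes = len(data[(data['Traveler type'] == 'Business') & (data['Tennis court'] == 'YES')])
business_tennis_no = len(data[(data['Traveler type'] == 'Business') & (data['Tennis court'] == 'NO')])
# Build the table from the computed counts instead of hard-coding them
# (the values used in the run shown below are [[19, 91], [23, 59], [19, 55]])
observed = np.array([[families_tennis_yes, families_tennis_no],
                     [friends_tennis_yes, friends_tennis_no],
                     [business_tennis_yes, business_tennis_no]])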

# Perform chi-square test
chi2, p_value, dof, expected = chi2_contingency(observed)

# Print the chi-square test results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:", expected)

if (p_value <= alpha):
    print("p value is less than or equal to alpha so we reject the null hypothesis")
else:
    print("p value greater than alpha, we fail to reject the null hypothesis")



Chi-square statistic: 3.523339113806147
P-value: 0.17175786478029847
Degrees of freedom: 2
Expected frequencies: [[25.22556391 84.77443609]
 [18.80451128 63.19548872]
 [16.96992481 57.03007519]]
p value greater than alpha, we fail to reject the null hypothesis


We therefore fail to reject <br>
`H0: Presence of tennis court as an amenity is independent of the traveler type`<br>
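
A contingency table like the one above can also be built in one step with `pandas.crosstab`, which avoids counting each cell separately. The sketch below uses a tiny hypothetical DataFrame (`toy`) with the same two column names as the notebook; the rows are invented, and the sample is far too small for the chi-squared approximation to be meaningful, so it only illustrates the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical rows, NOT taken from the TripAdvisor dataset
toy = pd.DataFrame({
    "Traveler type": ["Families", "Families", "Families",
                      "Friends", "Friends", "Friends",
                      "Business", "Business", "Business",
                      "Couples"],
    "Tennis court": ["YES", "NO", "NO",
                     "YES", "YES", "NO",
                     "NO", "NO", "NO",
                     "YES"],
})

# Keep only the three traveler types of interest
subset = toy[toy["Traveler type"].isin(["Families", "Friends", "Business"])]

# Rows: traveler type, columns: tennis court (both sorted alphabetically)
observed_toy = pd.crosstab(subset["Traveler type"], subset["Tennis court"])
print(observed_toy)

chi2_toy, p_toy, dof_toy, expected_toy = chi2_contingency(observed_toy.to_numpy())
print("dof:", dof_toy)
```

On the real data, replacing `toy` with the loaded DataFrame should yield the same 3 x 2 table of counts, though the row and column order follows alphabetical sorting rather than the order used in the cell above.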
