Build a hotel recommendation bot using sentiment analysis and guest review scores.
The dataset includes reviews of 1493 different hotels in 6 cities.

In [2]:
import pandas as pd
import time

### Load the data

In [4]:
print("Loading data file now, this could take a while depending on file size")
start = time.time()
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'D:\Code\ML\Machine_Learning\Data\Hotel_Reviews.csv')
end = time.time()
df.info()
print("Loading took " + str(round(end - start, 2)) + " seconds")


Loading data file now, this could take a while depending on file size
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   Hotel_Address                               515738 non-null  object 
 1   Additional_Number_of_Scoring                515738 non-null  int64  
 2   Review_Date                                 515738 non-null  object 
 3   Average_Score                               515738 non-null  float64
 4   Hotel_Name                                  515738 non-null  object 
 5   Reviewer_Nationality                        515738 non-null  object 
 6   Negative_Review                             515738 non-null  object 
 7   Review_Total_Negative_Word_Counts           515738 non-null  int64  
 8   Total_Number_of_Reviews                     515738 non-null  int64  
 9   
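
Loading time depends largely on how much of the file has to be parsed. As a minimal sketch, assuming a given step only needs a few columns, the `usecols` parameter of `read_csv` can restrict what is read. The helper name `load_columns` and the tiny in-memory CSV are illustrative assumptions, not part of the notebook above.

```python
import io
import time

import pandas as pd


def load_columns(source, columns):
    """Read only the requested columns and report how long parsing took.

    `load_columns` is a hypothetical helper for illustration; `source` can be
    a file path or any file-like object accepted by pandas.read_csv.
    """
    start = time.perf_counter()
    frame = pd.read_csv(source, usecols=columns)
    elapsed = time.perf_counter() - start
    return frame, elapsed


# Tiny stand-in for the real CSV so the sketch runs anywhere
sample_csv = io.StringIO(
    "Hotel_Name,Reviewer_Nationality,Reviewer_Score,Tags\n"
    "Hotel A,United Kingdom,8.0,x\n"
    "Hotel B,Australia,6.5,y\n"
)

sample_df, seconds = load_columns(sample_csv, ["Hotel_Name", "Reviewer_Score"])
print(sample_df.shape)
print("Loading took " + str(round(seconds, 2)) + " seconds")
```

With the real file, the same call would take the CSV path in place of `sample_csv`.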

### Dataframe operations

The shape of the data (rows, columns)

In [5]:
print("The shape of the data is: " + str(df.shape))

The shape of the data is: (515738, 17)


Calculate the frequency count for reviewer nationalities

In [7]:
print("Distinct values in Reviewer_Nationality: " + str(df['Reviewer_Nationality'].nunique()))

Distinct values in Reviewer_Nationality: 227


In [8]:
nationalities_freq = df['Reviewer_Nationality'].value_counts().reset_index()
print("Top 10 nationalities of reviewers:")
nationalities_freq.head(10)

Top 10 nationalities of reviewers:


Unnamed: 0,Reviewer_Nationality,count
0,United Kingdom,245246
1,United States of America,35437
2,Australia,21686
3,Ireland,14827
4,United Arab Emirates,10235
5,Saudi Arabia,8951
6,Netherlands,8772
7,Switzerland,8678
8,Germany,7941
9,Canada,7894
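
The column names produced by `value_counts().reset_index()` depend on the pandas version: releases before 2.0 name them `index` and the original column name rather than the original column name and `count`. A minimal sketch, assuming you want `Reviewer_Nationality` and `count` regardless of version; `toy_df` is made-up data for illustration only.

```python
import pandas as pd

# Made-up sample data, not taken from the real dataset
toy_df = pd.DataFrame(
    {
        "Reviewer_Nationality": [
            "United Kingdom",
            "Ireland",
            "United Kingdom",
            "Canada",
            "United Kingdom",
            "Ireland",
        ]
    }
)

nationality_counts = (
    toy_df["Reviewer_Nationality"]
    .value_counts()                       # sorted by frequency, descending
    .rename_axis("Reviewer_Nationality")  # name the index explicitly
    .reset_index(name="count")            # name the value column explicitly
)
print(nationality_counts)
```

Naming both columns explicitly keeps later lookups such as `['Reviewer_Nationality']` working across versions.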


Find the most frequently reviewed hotel for each of the top 10 reviewer nationalities

In [20]:
# The most frequently reviewed hotel for each of the top 10 reviewer nationalities
top_10_nationalities = nationalities_freq.head(10)['Reviewer_Nationality']

print("The most reviewed hotel for top 10 nationalities:")
for nat in top_10_nationalities:
    freq = df[df['Reviewer_Nationality'] == nat]['Hotel_Name'].value_counts()
    # .iloc[0] is positional; freq[0] on a label-indexed Series is deprecated
    print(str(nat).strip() + ": " + str(freq.index[0]) +
          " with " + str(freq.iloc[0]) + " reviews.")

The most reviewed hotel for top 10 nationalities:
United Kingdom: Britannia International Hotel Canary Wharf with 3833 reviews.
United States of America: Hotel Esther a with 423 reviews.
Australia: Park Plaza Westminster Bridge London with 167 reviews.
Ireland: Copthorne Tara Hotel London Kensington with 239 reviews.
United Arab Emirates: Millennium Hotel London Knightsbridge with 129 reviews.
Saudi Arabia: The Cumberland A Guoman Hotel with 142 reviews.
Netherlands: Jaz Amsterdam with 97 reviews.
Switzerland: Hotel Da Vinci with 97 reviews.
Germany: Hotel Da Vinci with 86 reviews.
Canada: St James Court A Taj Hotel London with 61 reviews.
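
The loop above filters the full dataframe once per nationality. As an alternative sketch, a single `groupby` over both columns can count every (nationality, hotel) pair and then keep the largest count per nationality. The data in `toy_df` and the column name `n_reviews` are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample data, not taken from the real dataset
toy_df = pd.DataFrame(
    {
        "Reviewer_Nationality": ["UK", "UK", "UK", "Ireland", "Ireland", "Ireland"],
        "Hotel_Name": ["Hotel A", "Hotel A", "Hotel B", "Hotel B", "Hotel B", "Hotel A"],
    }
)

# Count reviews for every (nationality, hotel) pair
pair_counts = (
    toy_df.groupby(["Reviewer_Nationality", "Hotel_Name"])
    .size()
    .reset_index(name="n_reviews")
)

# Sort by count and keep the first (largest) row for each nationality
top_hotels = (
    pair_counts.sort_values("n_reviews", ascending=False)
    .drop_duplicates(subset="Reviewer_Nationality")
    .set_index("Reviewer_Nationality")
)
print(top_hotels)
```

If two hotels tie for a nationality, which one is kept depends on the sort order, so ties may need explicit handling.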


Frequency count of reviews per hotel

In [21]:
hotel_freq = df['Hotel_Name'].value_counts().reset_index()
hotel_freq.head(10)

Unnamed: 0,Hotel_Name,count
0,Britannia International Hotel Canary Wharf,4789
1,Strand Palace Hotel,4256
2,Park Plaza Westminster Bridge London,4169
3,Copthorne Tara Hotel London Kensington,3578
4,DoubleTree by Hilton Hotel London Tower of London,3212
5,Grand Royale London Hyde Park,2958
6,Holiday Inn London Kensington,2768
7,Hilton London Metropole,2628
8,Millennium Gloucester Hotel London,2565
9,Intercontinental London The O2,2551


Calculate an average score for each hotel (the mean of all reviewer scores for that hotel in the dataset)

In [28]:
# Add a Calc_Average_Score column holding each hotel's mean Reviewer_Score, rounded to 1 decimal
df['Calc_Average_Score'] = df.groupby('Hotel_Name')['Reviewer_Score'].transform('mean').round(1)

# Add a new column with the difference between the two average scores
df['Score_Difference'] = df['Calc_Average_Score'] - df['Average_Score']

# Create a df without all the duplicates of Hotel_Name (so only 1 row per hotel)
review_scores_df = df.drop_duplicates(subset='Hotel_Name')

# Sort the dataframe to find the lowest and highest average score difference
review_scores_df = review_scores_df.sort_values(by='Score_Difference')
print(review_scores_df[["Score_Difference", "Average_Score", "Calc_Average_Score", "Hotel_Name"]])





        Score_Difference  Average_Score  Calc_Average_Score  \
3813                -1.3            7.2                 5.9   
250308              -0.9            8.6                 7.7   
68936               -0.9            6.8                 5.9   
22189               -0.8            7.1                 6.3   
201776              -0.7            7.5                 6.8   
...                  ...            ...                 ...   
54745                0.5            8.6                 9.1   
43688                0.7            7.5                 8.2   
178253               0.7            7.9                 8.6   
111027               0.7            8.8                 9.5   
495945               0.8            7.7                 8.5   

                                               Hotel_Name  
3813                                   Kube Hotel Ice Bar  
250308          MARQUIS Faubourg St Honor Relais Ch teaux  
68936                                       Villa Eugenie  
221
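
To make the calculation above concrete: `transform('mean')` writes each hotel's mean score back onto every row of that hotel, which is why `drop_duplicates` is needed afterwards to get one row per hotel. A small worked sketch on made-up data follows; rounding the difference to one decimal is an addition here to avoid floating-point noise and is not in the cell above.

```python
import pandas as pd

# Made-up sample data: two reviews for Hotel A, one for Hotel B
toy_df = pd.DataFrame(
    {
        "Hotel_Name": ["Hotel A", "Hotel A", "Hotel B"],
        "Reviewer_Score": [8.0, 9.0, 6.0],
        "Average_Score": [8.4, 8.4, 6.3],
    }
)

# Per-hotel mean, broadcast back to every row of that hotel
toy_df["Calc_Average_Score"] = (
    toy_df.groupby("Hotel_Name")["Reviewer_Score"].transform("mean").round(1)
)

# Difference between the calculated and the published average
toy_df["Score_Difference"] = (
    toy_df["Calc_Average_Score"] - toy_df["Average_Score"]
).round(1)

# One row per hotel, lowest difference first
per_hotel = toy_df.drop_duplicates(subset="Hotel_Name").sort_values(by="Score_Difference")
print(per_hotel[["Hotel_Name", "Calc_Average_Score", "Score_Difference"]])
```

Here Hotel A's calculated average is (8.0 + 9.0) / 2 = 8.5, which is 0.1 above its published 8.4, while Hotel B's 6.0 is 0.3 below its published 6.3.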

Calculate and print out how many rows have column Positive_Review values of "No Positive" and Negative_Review values of "No Negative"



In [32]:
start = time.time()
No_positive_reviews = df['Positive_Review'].str.contains("No Positive", case=False, na=False)
print("Positive reviews with 'No Positive': " + str(No_positive_reviews.sum()))

No_negative_reviews = df['Negative_Review'].str.contains("No Negative", case=False, na=False)
print("Negative reviews with 'No Negative': " + str(No_negative_reviews.sum()))

Both_no_reviews = df[No_positive_reviews & No_negative_reviews]
print("Reviews with both 'No Positive' and 'No Negative': " + str(Both_no_reviews.shape[0]))

end = time.time()
print("Time taken: " + str(round(end - start, 2)) + " seconds")


Positive reviews with 'No Positive': 35948
Negative reviews with 'No Negative': 128227
Reviews with both 'No Positive' and 'No Negative': 128
Time taken: 0.75 seconds
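
The cell above uses a case-insensitive substring match, so it also counts any review that merely contains the phrase. As a hedged sketch of a stricter alternative, an exact comparison (after stripping whitespace) counts only rows whose whole value is the placeholder. The rows in `toy_df` are made up, and whether the two approaches differ on the real data is not shown here.

```python
import pandas as pd

# Made-up sample data; the last positive review only contains the phrase
toy_df = pd.DataFrame(
    {
        "Positive_Review": [
            "No Positive",
            "Great staff",
            "No Positive",
            "no positive points at all",
        ],
        "Negative_Review": ["No Negative", "No Negative", "Small room", "No Negative"],
    }
)

# Exact match on the whole value
no_positive = toy_df["Positive_Review"].str.strip() == "No Positive"
no_negative = toy_df["Negative_Review"].str.strip() == "No Negative"

# Substring match, as used in the cell above
substring_hits = toy_df["Positive_Review"].str.contains(
    "No Positive", case=False, na=False
)

print("Exact 'No Positive' rows: " + str(no_positive.sum()))
print("Substring 'No Positive' rows: " + str(substring_hits.sum()))
print("Exact rows with both placeholders: " + str((no_positive & no_negative).sum()))
```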
