<font color='Orange'> <h2> NLP Project </h2> </font>
## Project Title: Sentiment Analysis and Topic Modelling on Hotel Reviews
#### Author: 
- Hemanth Mydugolam

<a id="0"></a> <br>
 # Table of Contents  
[Part 1: Sentiment Analysis](#1)     
1. [Import the required libraries such as Pandas, NLTK, and Scikit-learn.](#11)
1. [Load the reviews data file and examine](#12)
1. [Cleaning the data](#13)
1. [Data Analysis](#14)
1. [EDA - Data Visualization](#15)
1. [Data Preparation for Sentiment Analysis (stop-word removal, special-character removal, and tokenization)](#16)
1. [Sentiment Analysis: NLTK](#17)

[Part 2: Topic Modelling](#2)
1. [Load the required libraries](#21) 
1. [Use the same reviews dataset as the input file](#22)
1. [Preprocess the reviews data (removing stop words, tokenization, stemming, and lemmatization)](#23) 
1. [Latent Dirichlet Allocation - LDA Approach](#24)
    1. [Positive Reviews - LDA](#241)
    1. [Negative Reviews - LDA](#242)    

[Part 3: Deep dive into a particular hotel](#3)
1. [Based on EDA Results](#31) 
1. [Data preparation on chosen hotel](#32) 
1. [Topic Modelling on subset data](#33) 
    1. [Positive Reviews Data preparation for LDA](#331)
    1. [Negative Reviews Data preparation for LDA](#332)
1. [Topics visualization using pyLDAvis](#34) 

[Part 4: Insights](#4)

<a id="1"></a> 
## Part 1: Sentiment Analysis

<a id="11"></a> 
### 1.1 Import the required libraries such as Pandas, NLTK, and Scikit-learn.

In [None]:
# 1.1 Import the required libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import plotly.express as px

from numpy import newaxis
from wordcloud import WordCloud, STOPWORDS

from tqdm import tqdm

from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

import xgboost as xgb
import tensorflow as tf
import tensorflow_hub as hub
#import tensorflow_text

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, Bidirectional, Activation, GRU, BatchNormalization
from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

from tensorflow.keras.optimizers import Adam

%matplotlib inline
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

plt.rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42

nltk.download('stopwords')
stop_words = stopwords.words('english')

<a id="12"></a> 
### 1.2 Load the reviews data file and examine

In [None]:
# 1.2 Load the reviews data file
df_h = pd.read_csv("Hotel_Reviews.csv")

In [None]:
print(df_h.info())

In [None]:
df = df_h

In [None]:
# Display basic information about the dataset
print(df.info())

In [None]:
# Display the first few rows of the dataset
df.head()

<a id="13"></a> 
### 1.3 Cleaning the data

In [None]:
import string

In [None]:
# 1.3 Cleaning the data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    return text
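
The cleaning step above lowercases the text and drops punctuation one character at a time. A minimal sketch of an equivalent approach using `str.translate` follows; the name `clean_text_translate` and the choice to map non-string values to an empty string are assumptions for illustration, not part of the original notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_translate(text):
    """Lowercase `text` and strip ASCII punctuation.

    Non-string inputs (for example missing values) become an empty
    string; this fallback is an assumption made for the sketch.
    """
    if not isinstance(text, str):
        return ""
    return text.lower().translate(_PUNCT_TABLE)


cleaned_example = clean_text_translate("Great location, friendly staff!")
print(cleaned_example)
```

If it suits the data, this function could be passed to `.apply` in place of `clean_text`; on large columns a single translation pass is usually faster than a per-character list comprehension.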

In [None]:
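# Combine the positive and negative reviews, separated by a space so the
# last word of one review does not fuse with the first word of the other
df['Combined_Review'] = df['Positive_Review'] + ' ' + df['Negative_Review']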

In [None]:
# Apply cleaning to reviews
df['Cleaned_Review'] = df['Combined_Review'].apply(clean_text)

<a id="14"></a> 
### 1.4 Data Analysis

In [None]:
# Display summary statistics
df.describe().T

In [None]:
df.describe(include='object').T

In [None]:
# Reviewer Score distribution
print(df['Reviewer_Score'].value_counts())

In [None]:
print(f"Total Positive word count {df.Review_Total_Positive_Word_Counts.sum()}, Total negative word count {df.Review_Total_Negative_Word_Counts.sum()}")

In [None]:
# Top Hotels and Bottom Hotels
top_hotels = df.groupby('Hotel_Name')['Average_Score'].mean().nlargest(10)
top_hotels

In [None]:
bottom_hotels = df.groupby('Hotel_Name')['Average_Score'].mean().nsmallest(10)
bottom_hotels

In [None]:
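# Temporal Analysis
df['Review_Date'] = pd.to_datetime(df['Review_Date'])
# Derive the review year used by the year-wise plots, unless the file already provides it
if 'Review Year' not in df.columns:
    df['Review Year'] = df['Review_Date'].dt.year
df.set_index('Review_Date', inplace=True)
# 'M' resamples to month-end frequency (pandas 2.2+ prefers the alias 'ME')
monthly_average_scores = df.resample('M')['Average_Score'].mean()
monthly_average_scores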

In [None]:
# Nationality-based Analysis
nationality_scores = df.groupby('Reviewer_Nationality')['Reviewer_Score'].mean().sort_values(ascending=False)
nationality_scores

In [None]:
# Tags Analysis
tags = df['Tags'].str.split(',').explode().str.strip()
tag_counts = tags.value_counts()
tag_counts
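
The tag count above splits each `Tags` value on commas. If the column stores each value as the text of a Python list (for example `"[' Leisure trip ', ' Couple ']"`), a plain split leaves brackets and quotes attached to the tags. The sketch below assumes that format; the sample strings and the helper name `parse_tags` are hypothetical.

```python
import ast
from collections import Counter


def parse_tags(raw):
    """Turn one raw Tags value into a list of clean tag strings."""
    try:
        items = ast.literal_eval(raw)  # parse the list-literal text
    except (ValueError, SyntaxError):
        items = raw.split(",")         # fall back to a plain comma split
    if not isinstance(items, (list, tuple)):
        items = raw.split(",")
    return [str(item).strip() for item in items if str(item).strip()]


# Hypothetical sample values in the assumed list-literal format.
sample_tags = [
    "[' Leisure trip ', ' Couple ', ' Stayed 2 nights ']",
    "[' Business trip ', ' Solo traveler ']",
    "[' Leisure trip ', ' Solo traveler ']",
]

tag_counts_sketch = Counter(
    tag for raw in sample_tags for tag in parse_tags(raw)
)
print(tag_counts_sketch.most_common())
```

With a DataFrame, the same helper could be applied as `df['Tags'].map(parse_tags).explode().value_counts()`.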

<a id="15"></a> 
### 1.5 EDA - Data Visualization

In [None]:
fig = px.histogram(df, x="Reviewer_Score", title="Review Score Distribution", nbins=20, text_auto=True)

# Change the color of the distribution bars (replace with your preferred color)
fig.update_traces(marker=dict(color='#255d84'))

# Rename x-axis and y-axis titles
fig.update_xaxes(title_text="Reviewer Score")
fig.update_yaxes(title_text="Count of Reviews")

# Center the title
fig.update_layout(
    title=dict(text="Review Score Distribution", font=dict(size=20, color='black')),
    title_x=0.5
)

fig.show()

In [None]:
fig = px.histogram(df, x="Average_Score", title='Review Average Score Distribution',nbins=24,text_auto=True)

# Change the color of the distribution bars (replace with your preferred color)
fig.update_traces(marker=dict(color='#1f77b4'))

# Rename x-axis and y-axis titles
fig.update_xaxes(title_text="Average Score",range=[6.0, 10])
fig.update_yaxes(title_text="Count of Reviews")

# Center the title
fig.update_layout(
    title=dict(text="Review Average Score Distribution", font=dict(size=20, color='black')),
    title_x=0.5,
)

fig.show()

In [None]:
fig = px.histogram(df, x="Country", title='Reviews distribution in each Country',text_auto=True)

# Sort bars based on values
sorted_countries = df['Country'].value_counts().index
fig.update_xaxes(categoryorder='array', categoryarray=sorted_countries)

# Change the color of each bar
color_sequence = px.colors.qualitative.Set1  # You can choose a different color sequence
fig.update_traces(marker=dict(color=color_sequence))

# Rename x-axis and y-axis titles
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Count of Reviews")

# Center the title
fig.update_layout(
    title=dict(text="Reviews distribution in each Country", font=dict(size=20, color='black')),
    title_x=0.5,
)

fig.show()

In [None]:
fig = px.histogram(df, x="Review Year", title='Year wise Reviews distribution',text_auto=True)

# Sort bars based on values
sorted_years = df['Review Year'].value_counts().index
fig.update_xaxes(categoryorder='array', categoryarray=sorted_years)

# Change the color of each bar
color_sequence = px.colors.qualitative.Set1  # You can choose a different color sequence
fig.update_traces(marker=dict(color=color_sequence))

# Rename x-axis and y-axis titles
fig.update_xaxes(title_text="Review Year")
fig.update_yaxes(title_text="Count of Reviews")

# Center the title
fig.update_layout(
    title=dict(text="Year wise Reviews distribution", font=dict(size=20, color='black')),
    title_x=0.5,
)

fig.show()

In [None]:
df1 = df
hotels_avgscore = df1.groupby('Hotel_Address')['Average_Score'].mean().reset_index(name="Avg Score")
hotels_avgscore.head()

In [None]:
df2 = pd.merge(df1, hotels_avgscore, on='Hotel_Address')
df2.head()

In [None]:
df3 = df2.loc[:, ['Hotel_Name','Average_Score','City','lat','lng','City_Latitude','City_Longitude','Avg Score']]
df3.head()

In [None]:
df_h = df3.groupby(by=["Hotel_Name", "lat","lng","Avg Score"]).size().reset_index(name="Hotel Count")
df_h

In [None]:
ht_name = df_h['Hotel_Name']
lat = df_h['lat']
lng = df_h['lng']
ag_score = df_h['Avg Score']
ht_count = df_h['Hotel Count']

In [None]:
# Install below library if you haven't done
#!pip install folium

In [None]:
import folium
from folium.plugins import MarkerCluster, MiniMap, Fullscreen 
 
city_data = {
    'Hotel_Name': ht_name,
    'Latitude': lat,
    'Longitude': lng,
    'Avg_Rating': ag_score,
    'HotelCount': ht_count,
}

fixed_radius = 10
# Create a Folium map centered on Stuttgart, chosen so that all the cities are in view
map_center = [48.7784485, 9.1800132]
my_map = folium.Map(location=map_center, zoom_start=5)

# Create a MarkerCluster layer
marker_cluster = MarkerCluster().add_to(my_map)

# Add markers for each hotel with a fixed radius
for i in range(len(city_data['Hotel_Name'])):
    folium.CircleMarker(
        location=[city_data['Latitude'][i], city_data['Longitude'][i]],
        radius=fixed_radius,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        popup=f"{city_data['Hotel_Name'][i]} - {city_data['Avg_Rating'][i]} Avg rating - {city_data['HotelCount'][i]} Reviews"
    ).add_to(marker_cluster)
    
# Add layer control for better interactivity
folium.LayerControl().add_to(my_map)
 
# Add a minimap for better navigation
minimap = MiniMap(toggle_display=True)
my_map.add_child(minimap)
 
# Add fullscreen button for full-screen mode
Fullscreen().add_to(my_map)
 
# Save the map as an HTML file
my_map.save("hotels_data_map.html")
 
# Display the map directly in the notebook
my_map

<a id="16"></a> 
### 1.6 Data Preparation for Sentiment Analysis (stop-word removal, special-character removal, and tokenization)

In [None]:
# 1.6 Data Preparation for Sentiment Analysis
# Tokenization and stop words removal
stop_words = set(stopwords.words('english'))

In [None]:
def tokenize_and_remove_stopwords(text):
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

In [None]:
df['Tokenized_Review'] = df['Cleaned_Review'].apply(tokenize_and_remove_stopwords)
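
One side effect of stop-word removal is worth keeping in mind: NLTK's English stop-word list includes negations such as "not" and "no", so removing them can flip the apparent meaning of a phrase. The sketch below illustrates this with a small hand-written stop-word set (an assumption used here to avoid downloading the NLTK corpus) and a hypothetical `keep` parameter for words that should survive.

```python
# Hypothetical subset of an English stop-word list, for illustration only.
STOP_WORDS_SUBSET = {"the", "was", "a", "not", "no", "and"}
NEGATIONS = {"not", "no", "nor"}


def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop stop words from whitespace-separated text, except those in `keep`."""
    return " ".join(
        word for word in text.split()
        if word not in stop_words or word in keep
    )


text = "the room was not clean"
without_negation = remove_stopwords(text, STOP_WORDS_SUBSET)
with_negation = remove_stopwords(text, STOP_WORDS_SUBSET, keep=NEGATIONS)
print(without_negation)
print(with_negation)
```

Whether to keep negations depends on how the tokenized text is used afterwards; for bag-of-words topic models they often matter less than for sentiment scoring.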

<a id="17"></a> 
### 1.7 Sentiment Analysis: NLTK

In [None]:
#!pip install vaderSentiment

In [None]:
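from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Initialize the VADER sentiment analyzer. This uses the standalone vaderSentiment
# package; NLTK provides the same analyzer in nltk.sentiment.vader.
sid = SentimentIntensityAnalyzer()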

In [None]:
# Apply the sentiment analyzer to each review
df['NLTK_Sentiment_Score'] = df['Cleaned_Review'].apply(lambda x: sid.polarity_scores(x)['compound'])

In [None]:
# Categorize the sentiment scores into positive, neutral, and negative
df['NLTK_Sentiment_Label'] = df['NLTK_Sentiment_Score'].apply(lambda x: 'Positive' if x > 0 else ('Neutral' if x == 0 else 'Negative'))
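
The labeling above treats any nonzero compound score as positive or negative. The VADER documentation suggests a small neutral band instead, typically with cutoffs at +0.05 and -0.05. The sketch below shows that variant; the function name `label_compound` and its `threshold` parameter are illustrative and not part of the original notebook.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    Scores inside (-threshold, threshold) are treated as neutral.
    """
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


example_labels = [label_compound(s) for s in (0.62, 0.01, -0.30)]
print(example_labels)
```

It could be applied with `df['NLTK_Sentiment_Score'].apply(label_compound)`; with the band in place, weakly scored reviews move into the neutral class.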

In [None]:
df.head()

<a id="2"></a> 
## Part 2: Topic Modelling

<a id="21"></a> 
### 2.1 Load the required libraries

In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
import gensim
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

<a id="22"></a> 
### 2.2 Use the same reviews dataset as the input file

In [None]:
pos_documents = df['Positive_Review'].tolist()
neg_documents = df['Negative_Review'].tolist()

<a id="23"></a> 
### 2.3 Preprocess the reviews data (removing stop words, tokenization, stemming, and lemmatization)

In [None]:
# Preprocess the data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

In [None]:
processed_documents = [preprocess_text(doc) for doc in pos_documents]

In [None]:
# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(processed_documents)
corpus = [dictionary.doc2bow(doc) for doc in processed_documents]
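
To make the dictionary and corpus step more concrete, here is a small pure-Python sketch of the bag-of-words idea: each token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. It is an illustration only, not gensim's implementation, and the ids gensim assigns may differ. The names `token2id` and `to_bow` are hypothetical.

```python
from collections import Counter

# Two tiny tokenized documents standing in for preprocessed reviews.
docs = [["room", "clean", "room"], ["staff", "friendly", "clean"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs.

    Tokens missing from the mapping are skipped.
    """
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

Because unknown tokens are silently dropped, a corpus is only faithful when it is built with the dictionary created from the same documents.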

<a id="24"></a> 
### 2.4 Latent Dirichlet Allocation - LDA Approach

<a id="241"></a> 
#### A. Positive Reviews - LDA

In [None]:
# Build the LDA model (the dataset is large, so training takes more than 17 minutes)
lda_model = gensim.models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)

In [None]:
# Print the topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from gensim.models import LdaModel
from gensim import corpora

In [None]:
# Wordcloud for all top 5 Positive topics
num_topics = 5
def get_all_topic_words(lda_model, dictionary, num_topics):
    all_topic_words = []
    for i in range(num_topics):
        topic_terms = lda_model.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        all_topic_words.extend(topic_terms)
    return all_topic_words

def generate_wordcloud(all_topic_words):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_topic_words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Top 5 Positive Topics')
    plt.show()

# Example usage
all_topic_words = get_all_topic_words(lda_model, dictionary, num_topics)
generate_wordcloud(all_topic_words)
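
The helper above extracts terms from the strings returned by `print_topics`, which look like `0.045*"room" + 0.030*"staff"`, and discards the weights. A minimal sketch that keeps the weights is shown below; the sample string and the name `parse_topic_string` are made up for illustration.

```python
import re

# Matches one weight*"term" pair in a print_topics-style string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic_string(topic_string):
    """Return a {term: weight} dict parsed from a topic string."""
    return {
        word: float(weight)
        for weight, word in TERM_PATTERN.findall(topic_string)
    }


# Hypothetical topic string in the print_topics format.
sample_topic = '0.045*"room" + 0.030*"staff" + 0.012*"breakfast"'
topic_weights = parse_topic_string(sample_topic)
print(topic_weights)
```

A dictionary like this can be passed to `WordCloud.generate_from_frequencies`, so word sizes reflect topic weights rather than equal counts. gensim's `show_topic` also returns `(word, probability)` pairs directly, which avoids string parsing altogether.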

In [None]:
topic_name_GP = ["Hotel Location and Accessibility","Room Amenities and Services","Room Comfort and Cleanliness","Staff and Service Excellence","Overall Hotel Experience"]

In [None]:
# Wordcloud for each Positive topic
num_topics = 5
def get_topic_words(lda_model, dictionary, num_topics):
    topic_words = {}
    for i in range(num_topics):
        topic_terms = lda_model.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        topic_words[f'Topic {i + 1}'] = topic_terms
    return topic_words

def generate_wordcloud(topic_words):
    i=0
    for topic, terms in topic_words.items():
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(terms))
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'Word Cloud for topic: {topic_name_GP[i]}')
        plt.show()
        i = i+1

# Example usage
topic_words = get_topic_words(lda_model, dictionary, num_topics)
generate_wordcloud(topic_words)

In [None]:
##### Positive Reviews from the data for the above topics


<a id="242"></a> 
#### B. Negative Reviews - LDA

###### Pre processing the data for Negative Reviews Model

In [None]:
processed_documents1 = [preprocess_text(doc) for doc in neg_documents]
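# Create a dictionary and corpus for LDA
dictionary1 = corpora.Dictionary(processed_documents1)
# Build the corpus with dictionary1 so the token ids match the id2word mapping passed to the model
corpus1 = [dictionary1.doc2bow(doc) for doc in processed_documents1]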

In [None]:
# Build the LDA model (the dataset is large, so training takes more than 30 minutes)
lda_model_GN = gensim.models.LdaModel(corpus1, num_topics=5, id2word=dictionary1, passes=10)

In [None]:
# Print the topics
topics = lda_model_GN.print_topics(num_words=10)
for topic in topics:
    print(topic)

In [None]:
# Wordcloud for all top 5 Negative topics
num_topics = 5
def get_all_topic_words(lda_model_GN, dictionary1, num_topics):
    all_topic_words = []
    for i in range(num_topics):
        topic_terms = lda_model_GN.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        all_topic_words.extend(topic_terms)
    return all_topic_words

def generate_wordcloud(all_topic_words):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_topic_words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Top 5 Negative Topics')
    plt.show()

# Example usage
all_topic_words = get_all_topic_words(lda_model_GN, dictionary1, num_topics)
generate_wordcloud(all_topic_words)

In [None]:
topic_name_GN = ["Reservation and Booking Concerns","Room Quality and Maintenance Issues","Guest Feedback: Cost and Value","Property Issues","Service and Communication"]

In [None]:
# Wordcloud for each Negative topic
num_topics = 5
def get_topic_words(lda_model_GN, dictionary1, num_topics):
    topic_words = {}
    for i in range(num_topics):
        topic_terms = lda_model_GN.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        topic_words[f'Topic {i + 1}'] = topic_terms
    return topic_words

def generate_wordcloud(topic_words):
    i=0
    for topic, terms in topic_words.items():
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(terms))
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'Word Cloud for topic: {topic_name_GN[i]}')
        plt.show()
        i=i+1

# Example usage
topic_words = get_topic_words(lda_model_GN, dictionary1, num_topics)
generate_wordcloud(topic_words)

<a id="3"></a> 
## Part 3: Deep dive into a particular hotel

<a id="31"></a> 
### 3.1 Based on EDA Results

Based on the EDA results, the "Park Plaza Westminster Bridge London" hotel was chosen for a deep dive into the sentiment and topics of both its positive and negative reviews.

In [None]:
df_ch = df.loc[df['Hotel_Name'] == "Park Plaza Westminster Bridge London"]

In [None]:
df_ch.head()

In [None]:
fig = px.histogram(df_ch, x="Review Year", title='Year wise Reviews distribution for Park Plaza Westminster Bridge London',text_auto=True)

# Sort bars based on values
sorted_years = df_ch['Review Year'].value_counts().index
fig.update_xaxes(categoryorder='array', categoryarray=sorted_years)

# Change the color of each bar
color_sequence = px.colors.qualitative.Set1  # You can choose a different color sequence
fig.update_traces(marker=dict(color=color_sequence))

# Rename x-axis and y-axis titles
fig.update_xaxes(title_text="Review Year")
fig.update_yaxes(title_text="Count of Reviews")

# Center the title
fig.update_layout(
    title=dict(text="Year wise Reviews distribution", font=dict(size=20, color='black')),
    title_x=0.5,
)

fig.show()

In [None]:
df_ch.info()

In [None]:
# Keep the first 20 columns; copy so later column assignments do not act on a view of df
df_ch = df_ch.iloc[:, 0:20].copy()

<a id="32"></a> 
### 3.2 Data preparation on "Park Plaza Westminster Bridge London"

#### Data Cleaning and preparation

In [None]:
import string

In [None]:
# 3.2 Cleaning the data (same helper as in section 1.3)
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    return text

In [None]:
# Apply cleaning to reviews
df_ch['Cleaned_Positive_Review'] = df_ch['Positive_Review'].apply(clean_text)
df_ch['Cleaned_Negative_Review'] = df_ch['Negative_Review'].apply(clean_text)

In [None]:
df_ch.head()

<a id="33"></a> 
### 3.3 Topic Modelling on subset data

<a id="331"></a> 
#### A. Positive Reviews Data preparation for LDA

In [None]:
# remove reviews with "No Positive" as the review (around 290)
print(df_ch.shape)
TM_Pos = df_ch.loc[df_ch['Positive_Review'] != "No Positive"]
print(TM_Pos.shape)

In [None]:
documents_hp = TM_Pos['Positive_Review'].tolist()

# Preprocess the data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

In [None]:
processed_documents = [preprocess_text(doc) for doc in documents_hp]

In [None]:
# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(processed_documents)
corpus = [dictionary.doc2bow(doc) for doc in processed_documents]

In [None]:
# Build the LDA model (training can take more than 30 minutes on a large corpus)
lda_model_HP = gensim.models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

In [None]:
# Print the topics
topics = lda_model_HP.print_topics(num_words=10)
for topic in topics:
    print(topic)

In [None]:
# Wordcloud for all top 5 Positive topics
num_topics = 5
def get_all_topic_words(lda_model_HP, dictionary, num_topics):
    all_topic_words = []
    for i in range(num_topics):
        topic_terms = lda_model_HP.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        all_topic_words.extend(topic_terms)
    return all_topic_words

def generate_wordcloud(all_topic_words):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_topic_words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Top 5 Positive Topics')
    plt.show()

# Example usage
all_topic_words = get_all_topic_words(lda_model_HP, dictionary, num_topics)
generate_wordcloud(all_topic_words)

In [None]:
topic_name_HP = ["Hotel Service & Management","Room Ambience","Room Features","Delicious Food","Exceptional Amenities"]

In [None]:
# Wordcloud for each Positive topic
num_topics = 5
def get_topic_words(lda_model_HP, dictionary, num_topics):
    topic_words = {}
    for i in range(num_topics):
        topic_terms = lda_model_HP.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        topic_words[f'Topic {i + 1}'] = topic_terms
    return topic_words

def generate_wordcloud(topic_words):
    i=0
    for topic, terms in topic_words.items():
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(terms))
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'Word Cloud for topic: {topic_name_HP[i]}')
        plt.show()
        i=i+1

# Example usage
topic_words = get_topic_words(lda_model_HP, dictionary, num_topics)
generate_wordcloud(topic_words)

<a id="332"></a> 
#### B. Negative Reviews Data preparation for LDA

In [None]:
# Remove reviews whose negative text is only a "No Negative"-style placeholder
print(df_ch.shape)
TM_Neg1 = df_ch.loc[df_ch['Negative_Review'] != "No Negative"]
print(TM_Neg1.shape)
TM_Neg2 = TM_Neg1.loc[TM_Neg1['Negative_Review'] != " No negatives just a little confused as to where lifts were located when leaving room maybe a marker on carpet to indicate direction "]
print(TM_Neg2.shape)
TM_Neg3 = TM_Neg2.loc[TM_Neg2['Negative_Review'] != " No negatives "]
print(TM_Neg3.shape)
TM_Neg4 = TM_Neg3.loc[TM_Neg3['Negative_Review'] != " All good no negatives"]
print(TM_Neg4.shape)
TM_Neg5 = TM_Neg4.loc[TM_Neg4['Negative_Review'] != " No negatives at all of any note"]
print(TM_Neg5.shape)
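
The filters above remove placeholder reviews by matching exact strings, including their leading and trailing spaces. A more compact sketch is to normalize case and whitespace first and compare against a set of known placeholder phrases. The set below is a hypothetical starting point drawn from the shorter strings filtered above, and `is_placeholder` is an illustrative name.

```python
# Hypothetical placeholder phrases, stored lowercase with single spaces.
PLACEHOLDERS = {
    "no negative",
    "no negatives",
    "all good no negatives",
    "no negatives at all of any note",
}


def is_placeholder(review):
    """True if the review is only a 'no negatives'-style placeholder."""
    normalized = " ".join(review.lower().split())
    return normalized in PLACEHOLDERS


sample_reviews = [
    "No Negative",
    " No negatives ",
    " All good no negatives",
    " Room was too small ",
]
kept_reviews = [r for r in sample_reviews if not is_placeholder(r)]
print(kept_reviews)
```

On the DataFrame this could be written as `df_ch[~df_ch['Negative_Review'].map(is_placeholder)]`. Reviews that start with "No negatives" but go on to describe a real issue are kept by this sketch, since only exact placeholder phrases match.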

In [None]:
documents_neg = TM_Neg5['Negative_Review'].tolist()

# Preprocess the data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

In [None]:
processed_documents2 = [preprocess_text(doc) for doc in documents_neg]

In [None]:
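# Create a dictionary and corpus for LDA
dictionary2 = corpora.Dictionary(processed_documents2)
# Build the corpus with dictionary2 so the token ids match the id2word mapping passed to the model
corpus2 = [dictionary2.doc2bow(doc) for doc in processed_documents2]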

In [None]:
# Build the LDA model (training can take more than 30 minutes on a large corpus)
lda_model_hn = gensim.models.LdaModel(corpus2, num_topics=5, id2word=dictionary2, passes=15)

In [None]:
# Print the topics
topics = lda_model_hn.print_topics(num_words=10)
for topic in topics:
    print(topic)

In [None]:
topic_terms = lda_model_hn.print_topics(5)

In [None]:
# Wordcloud for top 5 negative topics
num_topics = 5
def get_all_topic_words(lda_model_hn, dictionary2, num_topics):
    all_topic_words = []
    for i in range(num_topics):
        topic_terms = lda_model_hn.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        all_topic_words.extend(topic_terms)
    return all_topic_words

def generate_wordcloud(all_topic_words):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_topic_words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Top 5 Negative Topics')
    plt.show()

# Example usage
all_topic_words = get_all_topic_words(lda_model_hn, dictionary2, num_topics)
generate_wordcloud(all_topic_words)

In [None]:
topic_name_HN = ["Check-in Challenges and Service Shortcomings","Room Quality and Maintenance Issues","Guest Feedback: Cost and Value","Property Issues","Service and Communication"]

In [None]:
# Wordcloud for each negative topic
num_topics = 5
def get_topic_words(lda_model_hn, dictionary2, num_topics):
    topic_words = {}
    for i in range(num_topics):
        topic_terms = lda_model_hn.print_topics(num_topics)[i][1]
        topic_terms = topic_terms.split('"')[1::2]  # Extracting terms between double quotes
        topic_words[f'Topic {i + 1}'] = topic_terms
    return topic_words

def generate_wordcloud(topic_words):
    i=0
    for topic, terms in topic_words.items():
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(terms))
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'Word Cloud for topic: {topic_name_HN[i]}')
        plt.show()
        i=i+1

# Example usage
topic_words = get_topic_words(lda_model_hn, dictionary2, num_topics)
generate_wordcloud(topic_words)

<a id="4"></a> 
## Part 4: Insights

**Summary:**
A detailed analysis of the complete 'Europe hotels reviews data' reveals a predominance of perfect scores, suggesting excellent guest experiences overall. Reviewers used significantly more words in negative reviews than in positive ones. The Ritz Paris leads the top-rated hotels, showcasing exemplary service. Over time, hotel scores have consistently averaged between 8.3 and 8.4.

**LDA Topics:**

*Positive topics:*
1. Hotel Location and Accessibility
2. Room Amenities and Services
3. Room Comfort and Cleanliness
4. Staff and Service Excellence
5. Overall Hotel Experience

*Negative Topics:*
1. Reservation and Booking Concerns
2. Room Quality and Maintenance
3. Guest Experiences with Staff and Service
4. Issues Related to Hotel Facilities and Services
5. Specific Complaints and Negative Incidents

**Suggestions based on the complete analysis:**
- *Service Improvement:* Address booking concerns and room quality issues while enhancing staff training for more personalized service based on guest feedback.
- *Marketing and Branding:* Leverage positive feedback on "Staff and Service Excellence" and "Overall Hotel Experience," and highlight "Hotel Location and Accessibility" in the marketing strategy.
- *Customer Experience Design:* Improve room amenities and services to boost guest satisfaction, and address specific issues mentioned in negative reviews to prevent future problems.
- *Competitive Analysis:* Compare the top-performing hotels' scores and reviews with the average to identify best practices that can be adopted or adapted.
- *Operational Adjustments:* Use the temporal analysis to prepare for peak times with higher guest expectations and manage off-peak times more efficiently.
- *Reputation Management:* Address negative reviews proactively by reaching out to dissatisfied guests and offering resolutions, which can also improve online ratings.