In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
import nltk
from nltk.corpus import stopwords
from nltk.classify import SklearnClassifier
from wordcloud import WordCloud,STOPWORDS

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Objective of the Analysis**

This end-to-end sentiment analysis and prediction routine is based on TripAdvisor hotel reviews. The data set has two columns, Review and Rating. The objective is to identify the sentiment of the reviews posted by customers using the NLTK VADER sentiment analyzer and, later, to create a model that predicts the sentiment scores from the input text.

We will also use a low-code library named PyCaret to perform topic modeling and better understand the leading topics within this corpus.

So let's get started!

# **What Is Sentiment Analysis?**

Sentiment analysis is a method of identifying the sentiment expressed in a piece of text. It classifies the text as positive, negative or neutral using various analytical techniques.

# **Why Is Sentiment Analysis Important for Academics and Organizations?**

Sentiment analysis helps businesses and academia alike to better understand how their customers feel.

**Importance of SA from a business perspective:** Suppose you have just launched a new range of products and want to identify opportunities to improve them. Sentiment analysis can surface the granular details that link specific product improvements to customer sentiment.

**From the perspective of academia:** Analysing students' feedback with sentiment analysis techniques can identify the positive or negative feelings, or even more refined emotions, that students have towards the current teaching.

There are other ways to slice and dice the data to get finer insights, for example by using demographic, geographic and timestamp data, and so make valuable business decisions.

**SENTIMENT ANALYSIS PROCESS FLOW** *(SOURCE: DATACAMP)*

![](https://cdn-images-1.medium.com/max/361/0*ga5rNPmVYBsCm-lz.)

**Step 1: Let's read the data**

In [None]:
import pandas as pd
df = pd.read_csv('/kaggle/input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv')
df.head()

**Step 2: EDA**

Now we will look at the variable “Rating” to see whether the majority of customer ratings are positive or negative.

In [None]:
sns.countplot(x= df['Rating'])



From the above graph, we can see that most customer ratings fall in the positive zone (high = 4-5). This leads us to expect that most reviews will be positive too, which we will check shortly. Later we will also create some word clouds to see the most frequently used words in the reviews.

# **Sentiment Analysis**

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

**Generating Sentiment Scores using Vader Sentiment Analyzer**

In [None]:
df['scores'] = df['Review'].apply(lambda hotel_overview: sid.polarity_scores(str(hotel_overview)))
df.head()

**Creating Compound Scores along with the Sentiment Description for further analysis**

Let's take a step back to better understand how the compound scores are calculated in VADER.

What does **VADER** stand for?

**V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner

**What is a compound score and how is it calculated?**

The compound score is computed by summing the valence scores of the words in the text, adjusted by VADER's rules, and then normalizing that sum to lie between **-1 (most extreme negative)** and **+1 (most extreme positive)**. It is not the sum of the separate positive, negative and neutral scores, which are the proportions of the text that fall into each category.
**The closer the compound score is to +1**, the more positive the text.

**What is a Valence Score?**

It is a score assigned to a word by human raters, based on observation and experience rather than pure logic.

Consider the words 'terrible', 'hopeless' and 'miserable'. Any human reader would easily gauge the sentiment of these words as negative.

On the other hand, words like 'marvellous', 'worthy' and 'adequate' signify positive sentiment.

According to the academic paper on VADER, the valence score is measured on a scale from -4 to +4, where -4 stands for the most negative sentiment and +4 for the most positive. Intuitively, the midpoint 0 represents neutral sentiment, and that is indeed how it is defined.
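
To make the normalization step concrete, here is a minimal sketch. It assumes the formula used in VADER's reference implementation, where a rule-adjusted valence sum x is mapped to x / sqrt(x² + α) with α = 15. The helper names and the sample sums are hypothetical, and the labelling helper simply mirrors the 0 cut-off used in this notebook.

```python
import math

def normalize_valence_sum(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

def label_sentiment(compound):
    """Map a compound score to a label using the 0 cut-off applied in this notebook."""
    if compound > 0:
        return "POSITIVE"
    if compound < 0:
        return "NEGATIVE"
    return "NEUTRAL"

# Hypothetical rule-adjusted valence sums, for illustration only
for valence_sum in (-6.0, -1.5, 0.0, 1.5, 6.0):
    compound = normalize_valence_sum(valence_sum)
    print(f"{valence_sum:+5.1f} -> {compound:+.4f} {label_sentiment(compound)}")
```

Because the denominator grows with the sum, long and strongly worded reviews approach -1 or +1 without ever reaching them. The VADER authors suggest ±0.05 as typical cut-offs for the neutral band; this notebook uses 0 instead.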

In [None]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['sentiment_type']=''
df.loc[df.compound>0,'sentiment_type']='POSITIVE'
df.loc[df.compound==0,'sentiment_type']='NEUTRAL'
df.loc[df.compound<0,'sentiment_type']='NEGATIVE'
df.head()

In [None]:
df.sentiment_type.value_counts().plot(kind='bar',title="sentiment analysis")

From the above graph we can clearly see that most of the hotel reviews have positive sentiment, while negative and especially neutral reviews are few. Hence, it is a good idea to focus only on the positive and negative reviews for further analysis.

I will also reduce the data set to the review, sentiment type, compound score and rating columns.

In [None]:
data = df[['Review','sentiment_type','compound', 'Rating']]
data


Now let's split the data into training and testing sets. The test set is 10% of the original data set. For this particular analysis I dropped the neutral reviews from the training set, as there are almost none and they are unlikely to have much impact on the overall data set.

In [None]:
from sklearn.model_selection import train_test_split # function for splitting data to train and test sets
train, test = train_test_split(data,test_size = 0.1)

In [None]:
# Removing neutral sentiments
train = train[train.sentiment_type != "NEUTRAL"]

As a next step I separated the positive and negative reviews of the training set in order to easily visualize the words they contain. The drawing function applies a light clean-up of the text and then produces a WordCloud visualization, which shows only the most emphatic words of the positive and negative reviews.

In [None]:
train_pos = train[ train['sentiment_type'] == 'POSITIVE']
train_pos = train_pos['Review']
train_neg = train[ train['sentiment_type'] == 'NEGATIVE']
train_neg = train_neg['Review']

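def wordcloud_draw(data, color = 'black'):
    """Draw a word cloud from an iterable of review texts on the given background color."""
    words = ' '.join(data)
    # Drop links, mentions, hashtags and retweet markers; this filter targets
    # social-media text and rarely matches anything in hotel reviews
    cleaned_word = " ".join([word for word in words.split()
                            if 'http' not in word
                                and not word.startswith('@')
                                and not word.startswith('#')
                                and word != 'RT'
                            ])
    # STOPWORDS removes common English words such as "the" and "for"
    wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color=color,
                      width=2500,
                      height=2000
                     ).generate(cleaned_word)
    plt.figure(1,figsize=(13, 13))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()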
    
print("Positive words")
wordcloud_draw(train_pos,'white')
print("Negative words")
wordcloud_draw(train_neg)

It is interesting to notice the following words and expressions in the positive word set: good, room, people, lovely, nice, restaurant.

My interpretation of these words is that, in general, customers had a positive experience because their rooms were well appointed and they were taken care of by good people.

At the same time, negative reviews contain words like: bad, clean, toilets, bathroom, reservation, paid, disappointed, problem.

My interpretation of these words is that the overall experience was good (hence the high positive sentiment). However, the cleanliness of toilets and bathrooms may have caused a bad experience, and problematic reservations may have disappointed customers.

Stop words: Stop words are words that carry little meaning on their own (the, for, this, etc.). They are usually filtered out of search queries and text analyses because they match a vast amount of unnecessary information.

Now that we have performed some basic EDA on the data, we are ready for the next step: topic modelling. From this point onwards I will be using the original data set for the remainder of the predictive analysis cycle.

# **Get the Data Ready**

In [None]:
data.head()

# **PyCaret the ML Workhorse 🐴**

For this phase of the analysis I will be using PyCaret's capabilities to perform the NLP routine!

In [None]:
# Note: the pycaret.nlp module used below ships with PyCaret 2.x; it was removed in PyCaret 3.0
%pip install --upgrade pycaret-nightly

# Let's Setup the Environment

Setting up the NLP environment entails the following actions, all performed automagically!

**Removing Numeric Characters:** All numeric characters are removed from the text. They are replaced with blanks.

**Removing Special Characters:** All non-alphanumeric special characters are removed from the text. They are also replaced with blanks.

**Word Tokenization:** Word tokenization is the process of splitting a large sample of text into words.

**Stopword Removal:** A stop word (or stopword) is a word that is often removed from text because it is common and provides little value for information retrieval, even though it might be linguistically meaningful. Examples of such words in the English language are "the", "a", "an" and "in".

**Bigram Extraction:** A bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words.

**Trigram Extraction:** Similar to bigram extraction, trigram is a sequence of three adjacent elements from a string of tokens.

**Lemmatizing:** Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single word, identified by the word's lemma, or dictionary form. In the English language, a word can appear in several inflected forms. For example, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma of the word.

**Custom Stopwords:** We didn't use it, but this option lets the user define specific words that they want to exclude from the text.
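
To get a feel for what these steps do to a single review, here is a rough standard-library sketch. It is not PyCaret's internal implementation: the tiny stop-word set, the sample sentence and the helper names `preprocess` and `ngrams` are all assumptions made for illustration. Lemmatization is left out because it needs a language model, and the n-grams shown are plain adjacent runs, following the definitions above.

```python
import re

# Tiny illustrative stop-word list (an assumption; real pipelines use a much larger one)
STOP_WORDS = {"the", "a", "an", "in", "was", "and", "of", "to"}

def preprocess(text, stop_words=STOP_WORDS):
    """Return lower-cased word tokens with digits, special characters and stop words removed."""
    text = re.sub(r"\d", " ", text)               # numeric characters -> blanks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # special characters -> blanks
    tokens = text.lower().split()                 # simple whitespace tokenization
    return [tok for tok in tokens if tok not in stop_words]

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = preprocess("The room was clean, and the 2 pools were lovely!")
print(tokens)             # ['room', 'clean', 'pools', 'were', 'lovely']
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

Passing a different `stop_words` set to `preprocess` plays the role of the custom stopwords option.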

In [None]:
from pycaret.nlp import *
exp_name = setup(data = data, target = 'Review')

In [None]:
lda = create_model('lda', num_topics = 6, multi_core = True)

In [None]:
lda_top = assign_model(lda)
lda_top.head()

In [None]:
evaluate_model(lda)

In [None]:
plot_model(lda,'topic_model')

**LDAvis** = **L**atent **D**irichlet **A**llocation **Vis**ualization 👆

The **LDAvis** tool helps create an interactive web-based visualization of a topic model that has been fit to a corpus of text data using Latent Dirichlet Allocation (LDA).

Given the estimated parameters of the topic model, it computes various summary statistics as input to an interactive visualization built with D3.js that is accessed via a browser. The goal is to help users interpret the topics in their LDA topic model.

# **Saving the Model 💾**

In [None]:
save_model(lda,'Final LDA Model 07212020')