# Sentiment Analysis Operations
Apply sentiment analysis to the review columns and save the results in a dataset

In [5]:
import time
import pandas as pd
import nltk as nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
nltk.download('stopwords')

# Load the filtered hotel reviews from csv
df = pd.read_csv('./Hotel_Reviews_Filtered.csv')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/amanda.vieira/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amanda.vieira/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


If you were to run Sentiment Analysis on the Negative and Positive review columns, it could take a long time. Tested on a powerful test laptop with fast CPU,it took 12 - 14 minutes depending on which sentiment library was used. That's a (relatively) long time, so worth investigating if that can be speeded up.

Removing stop words, or common English words that do not change the sentiment of a sentence, is the first step. By removing them, the sentiment analysis should run faster, but not be less accurate (as the stop words do not affect sentiment, but they do slow down the analysis).

The longest negative review was 395 words, but after removing the stop words, it is 195 words.

Removing the stop words is also a fast operation, removing the stop words from 2 review columns over 515,000 rows took 3.3 seconds on the test device. The relative shortness of the operation means that if it improves the sentiment analysis time, then it is worth doing.

In [7]:
start = time.time()
cache = set(stopwords.words("english"))

def remove_stopwords(review):
    text = " ".join([word for word in review.split() if word in cache])
    return text

df.Negative_Review = df.Negative_Review.apply(remove_stopwords)
df.Positive_Review = df.Positive_Review.apply(remove_stopwords)

## Performing sentiment analysis

Calculate the sentiment analysis for both negative and positive review columns, and store the result in 2 new columns.

The test of the sentiment will be to compare it to the reviewer's score for the same review. For instance, if the sentiment thinks the negative review had a sentiment of 1 (extremely positive sentiment) and a positive review sentiment of 1, but the reviewer gave the hotel the lowest score possible, then either the review text doesn't match the score, or the sentiment analyser could not recognize the sentiment correctly. You should expect some sentiment scores to be completely wrong, and often that will be explainable, e.g. the review could be extremely sarcastic "Of course I LOVED sleeping in a room with no heating" and the sentiment analyser thinks that's positive sentiment, even though a human reading it would know it was sarcasm.

NLTK supplies different sentiment analyzers to learn with, and you can substitute them and see if the sentiment is more or less accurate. The VADER sentiment analysis is used here.

In [8]:
# Create the vader sentiment analyser (there are others in NLTK you can try too)
vader_sentiment = SentimentIntensityAnalyzer()

# There are 3 possibilities of input for a review:
# It could be "No Negative", in which case return 0
# It could be "No Positive", in which case, return 0
# It could be a review, in which case calculate the sentiment

def calc_sentiment(review):
    if review == "No Negative" or review == "No Positive":
        return 0
    return vader_sentiment.polarity_scores(review)["compound"]

In [9]:
# Add a negative sentiment and positive sentiment column
print('Calculating sentiment columns for both positive and negative reviews')

start = time.time()

df['Negative_Sentiment'] = df.Negative_Review.apply(calc_sentiment)
df['Positive_Sentiment'] = df.Positive_Review.apply(calc_sentiment)

end = time.time()

print('Calculating sentiment took ' + str(round(end - start, 2)) + ' seconds')

Calculating sentiment columns for both positive and negative reviews
Calculating sentiment took 23.67 seconds


In [10]:
# Print the results and see it the sentiment matches the review
df = df.sort_values(by=["Negative_Sentiment"], ascending=True)
print(df[["Negative_Review", "Negative_Sentiment"]])

df = df.sort_values(by=["Positive_Sentiment"], ascending=True)
print(df[["Positive_Review", "Positive_Sentiment"]])

                                          Negative_Review  Negative_Sentiment
512114  it s not at all the don t how they them the a ...             -0.9776
155157  the was not as the was and the were it most wa...             -0.9559
226751  is and there was on the this should be but and...             -0.9559
252171  itself with some of the more and up when you i...             -0.9468
267388  this for the and the is but which an to with t...             -0.9430
...                                                   ...                 ...
176261  didn t the on my to the when hadn t out the fo...              0.9077
5883    was but the are the of so the won t have to th...              0.9175
172285  was not in no in no no and it an to didn t the...              0.9235
63987   in and you have to for you once was not the ha...              0.9726
161675  we in this we them on it but after few it was ...              0.9743

[515738 rows x 2 columns]
                                     

## Save the file

In [11]:
# Reorder the columns (This is cosmetic, but to make it easier to explore the data later)
df = df.reindex(["Hotel_Name", "Hotel_Address", "Total_Number_of_Reviews", "Average_Score", "Reviewer_Score", "Negative_Sentiment", "Positive_Sentiment", "Reviewer_Nationality", "Leisure_trip", "Couple", "Solo_traveler", "Business_trip", "Group", "Family_with_young_children", "Family_with_older_children", "With_a_pet", "Negative_Review", "Positive_Review"], axis=1)

print('Saving results to Hotel_Reviews_NLP.csv')
df.to_csv('Hotel_Reviews_NLP.csv', index=False)

Saving results to Hotel_Reviews_NLP.csv
