<a href="https://colab.research.google.com/github/ITU-Business-Analytics-Team/Business_Analytics_for_Professionals/blob/main/Part%20I%20%3A%20Methods%20%26%20Technologies%20for%20Business%20Analytics/Chapter%207%3A%20Text%20Analytics/7_6_1_Rule_Based_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Analytics**
## Sentiment Analysis

### Rule-based Sentiment Analysis

In this notebook, rule-based sentiment analysis is demonstrated on a hotel reviews dataset using the NLTK Vader sentiment analysis tool. First, the dataset is downloaded so that it can be inspected.

In [None]:
import gdown
url = 'https://drive.google.com/uc?id=1pQpUru4YLIxOZ452wJVvKp2aWXXc15aI'
output = '7.6.1. Rule Based Sentiment Analysis.zip'
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1pQpUru4YLIxOZ452wJVvKp2aWXXc15aI
To: /content/7.6.1. Rule Based Sentiment Analysis.zip
100%|██████████| 47.3M/47.3M [00:00<00:00, 72.6MB/s]


'7.6.1. Rule Based Sentiment Analysis.zip'

In [None]:
!unzip '7.6.1. Rule Based Sentiment Analysis.zip'

Archive:  7.6.1. Rule Based Sentiment Analysis.zip
replace Hotel_Reviews.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [None]:
import pandas as pd
import numpy as np

In [None]:
reviews_df = pd.read_csv("Hotel_Reviews.csv")
# append the positive and negative text reviews
reviews_df["review"] = reviews_df["Negative_Review"] + reviews_df["Positive_Review"]
# create the label
reviews_df["is_bad_review"] = reviews_df["Reviewer_Score"].apply(lambda x: 1 if x < 5 else 0)
# select only relevant columns
reviews_df = reviews_df[["review", "is_bad_review"]]
reviews_df.head()

index,review,is_bad_review
0,I am so angry that i made this post available...,1
1,No Negative No real complaints the hotel was g...,0
2,Rooms are nice but for elderly a bit difficul...,0
3,My room was dirty and I was afraid to walk ba...,1
4,You When I booked with your company on line y...,0


In [None]:
reviews_df.shape

(515738, 2)

The dataset consists of two columns, review (the customer review) and is_bad_review (its label), and includes 515,738 reviews. Since rule-based sentiment analysis does not require training, the entire dataset is not needed. To reduce the computational load of the following cells, 10% of the dataset is sampled.

In [None]:
reviews_df = reviews_df.sample(frac = 0.1, replace = False, random_state=42)

Rule-based methods rely on individual words. So even though the intent behind the phrase "No Positive" is negative, there is a high chance that a simple rule-based model would miss the difference and count the word "Positive" with positive polarity. To prevent this, these phrases can be removed from the dataset.

In [None]:
# remove 'No Negative' or 'No Positive' from text
reviews_df["review"] = reviews_df["review"].apply(lambda x: x.replace("No Negative", "").replace("No Positive", ""))
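To see why these phrases are a problem, here is a minimal sketch of a purely word-level scorer. The two-word `TOY_LEXICON` and the `word_level_score` function are hypothetical and exist only for this illustration; they are not Vader's lexicon or its scoring rules.

```python
# Hypothetical two-word lexicon, for illustration only (not Vader's lexicon).
TOY_LEXICON = {"positive": 1.0, "negative": -1.0}

def word_level_score(text):
    """Sum the polarity of each known word, ignoring context such as negation."""
    return sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())

# The word "No" is unknown to the toy lexicon, so the negation is lost.
print(word_level_score("No Positive"))   # 1.0
print(word_level_score("No Negative"))   # -1.0

# After removing the phrases, only the actual review text is scored.
cleaned = "No Positive Room was small".replace("No Negative", "").replace("No Positive", "")
print(word_level_score(cleaned))         # 0.0
```

A scorer of this kind gives "No Positive" a positive score, which is the opposite of what the reviewer meant. Removing the phrase avoids that effect.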

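Vader is a lexicon-based sentiment analysis method. We need to download vader_lexicon to be able to use it. The averaged_perceptron_tagger package is downloaded as well, since it is needed for part-of-speech tagging in the cleaning step.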

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

Lexicon-based methods can benefit from preprocessing steps such as tokenization, stop-word removal and lemmatization. The dataset is cleaned in the next cell, and the result is stored in a separate review_clean column.

In [None]:
# download the stop-word list and the WordNet data used by the cleaning step
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import wordnet

# return the wordnet object value corresponding to the POS tag
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

# build the stop-word set and the lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # lowercase the text
    text = text.lower()
    # tokenize on spaces and strip punctuation from both ends of each token
    tokens = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain digits
    tokens = [word for word in tokens if not any(c.isdigit() for c in word)]
    # remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # remove empty tokens
    tokens = [word for word in tokens if len(word) > 0]
    # tag each token with its part of speech
    pos_tags = pos_tag(tokens)
    # lemmatize each token using its POS tag
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    # remove words with only one letter
    tokens = [word for word in tokens if len(word) > 1]
    # join the tokens back into a single string
    return " ".join(tokens)

# clean text data
reviews_df["review_clean"] = reviews_df["review"].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
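The effect of the first cleaning steps can be checked on a single sentence without any NLTK data. The sketch below reproduces only the lowercasing, punctuation stripping, digit filtering, stop-word removal and short-token removal. The `basic_clean` function, the example sentence and the tiny `TOY_STOP_WORDS` set are assumptions made for illustration; the notebook itself uses NLTK's full English stop-word list and also lemmatizes the tokens.

```python
import string

# Hypothetical miniature stop-word list, for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"the", "was", "and", "a", "in", "at"}

def basic_clean(text):
    # lowercase, split on spaces and strip punctuation from both ends of each token
    tokens = [w.strip(string.punctuation) for w in text.lower().split(" ")]
    # drop tokens that contain a digit
    tokens = [w for w in tokens if not any(c.isdigit() for c in w)]
    # drop stop words
    tokens = [w for w in tokens if w not in TOY_STOP_WORDS]
    # drop empty and one-letter tokens
    return [w for w in tokens if len(w) > 1]

print(basic_clean("The room was dirty, and the A/C broke at 3am!"))
# ['room', 'dirty', 'a/c', 'broke']
```

Note that `strip` only removes punctuation at the ends of a token, so the slash inside "a/c" survives, while "3am" is dropped because it contains a digit.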


After cleaning the dataset, the Vader SentimentIntensityAnalyzer can be applied to the reviews without any training phase. Note that the cell below scores the original review column rather than review_clean.

In [None]:
# add sentiment analysis columns
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
reviews_df["sentiments"] = reviews_df["review"].apply(lambda x: sid.polarity_scores(x))
reviews_df = pd.concat([reviews_df.drop(['sentiments'], axis=1), reviews_df['sentiments'].apply(pd.Series)], axis=1)
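The column of score dictionaries can also be expanded by building a DataFrame directly from the list of dictionaries, which is usually faster on large frames than calling `apply(pd.Series)` row by row. The sketch below is one possible alternative, not the notebook's method. The `demo_df` frame is a hypothetical two-row stand-in whose texts and scores are copied from the result table of this notebook, so it runs without the Vader lexicon.

```python
import pandas as pd

# Hand-written dicts shaped like Vader's output (keys: neg, neu, pos, compound);
# the numbers are copied from two rows of the notebook's result table.
scores = [
    {"neg": 0.216, "neu": 0.784, "pos": 0.000, "compound": -0.2960},
    {"neg": 0.000, "neu": 0.345, "pos": 0.655, "compound": 0.6908},
]
demo_df = pd.DataFrame({
    "review": ["No tissue paper box was present at the room",
               "Pillows Nice welcoming and service"],
    "sentiments": scores,
})

# Build the score columns in one step, keeping the original index for alignment.
expanded = pd.DataFrame(demo_df["sentiments"].tolist(), index=demo_df.index)
demo_df = pd.concat([demo_df.drop(columns="sentiments"), expanded], axis=1)

print(demo_df.columns.tolist())
# ['review', 'neg', 'neu', 'pos', 'compound']
```

Passing `index=demo_df.index` matters for the sampled dataset, because its index is no longer a simple 0..n-1 range after `sample`.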



The final dataset is shown below. The neg, neu and pos scores are the proportions of the text that fall into each category. The compound score is computed by summing the valence scores of the individual words, adjusted by Vader's rules, and then normalizing the sum to lie between -1 (most extreme negative) and +1 (most extreme positive). If we labeled each review by its highest-scoring polarity, we would obtain predicted labels that could be compared with is_bad_review to measure performance. Judging from the rows below, this method would probably not score high in terms of accuracy: for example, the only displayed review labeled as bad (index 9732) is scored as entirely neutral.

In [None]:
reviews_df

index,review,is_bad_review,review_clean,neg,neu,pos,compound
488440,Would have appreciated a shop in the hotel th...,0,would appreciate shop hotel sell drinking wate...,0.049,0.617,0.334,0.9924
274649,No tissue paper box was present at the room,0,tissue paper box present room,0.216,0.784,0.000,-0.2960
374688,Pillows Nice welcoming and service,0,pillow nice welcome service,0.000,0.345,0.655,0.6908
404352,Everything including the nice upgrade The Hot...,0,everything include nice upgrade hotel revamp s...,0.000,0.621,0.379,0.9153
451596,Lovely hotel v welcoming staff,0,lovely hotel welcome staff,0.000,0.230,0.770,0.7717
...,...,...,...,...,...,...,...
274862,Bathroom water easy made the bathroom wet whe...,0,bathroom water easy make bathroom wet bath wal...,0.000,0.614,0.386,0.8834
9732,Room very small chair tatty in the room,1,room small chair tatty room,0.000,1.000,0.000,0.0000
424201,Expensive rates and mini bar prices Roof top ...,0,expensive rate mini bar price roof top pool vi...,0.000,0.886,0.114,0.2023
72380,There was a very loud AC machine right outsid...,0,loud ac machine right outside window affect sl...,0.047,0.845,0.108,0.4767
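
The compound score and the comparison with is_bad_review can be sketched in a few lines. Vader normalizes the summed valence $s$ as $s / \sqrt{s^2 + \alpha}$ with $\alpha = 15$. The valence sum of 4.7 used below is an illustrative input, and the cut-off of -0.05 for calling a review bad is an assumption borrowed from the commonly used Vader convention, not something defined in this notebook. The nine (compound, is_bad_review) pairs are copied from the rows displayed above, so the result says nothing about the full sample.

```python
import math

def normalize(valence_sum, alpha=15):
    """Map an unbounded valence sum into the interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

print(round(normalize(4.7), 4))   # 0.7717

# (compound, is_bad_review) pairs copied from the nine rows displayed above
rows = [
    (0.9924, 0), (-0.2960, 0), (0.6908, 0), (0.9153, 0), (0.7717, 0),
    (0.8834, 0), (0.0000, 1), (0.2023, 0), (0.4767, 0),
]

THRESHOLD = -0.05  # assumption: at or below this value a review is predicted bad

predicted = [1 if compound <= THRESHOLD else 0 for compound, _ in rows]
actual = [label for _, label in rows]
matches = sum(p == a for p, a in zip(predicted, actual))

print(f"{matches} of {len(rows)} displayed rows match")   # 7 of 9 displayed rows match
```

On these rows the rule flags one good review as bad and misses the only bad one. A proper evaluation would apply the same comparison to the whole sampled dataset.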
