# Sentiment Analysis

Sentiment analysis is a popular application of text classification. It categorizes the opinion expressed in a piece of text as positive, negative, or neutral. It has many business applications, such as product analytics, brand monitoring, market research, and customer support. Sentiment analysis uses natural language processing (NLP) and machine learning models such as Naive Bayes.

## Naive Bayes

Naive Bayes is a classification algorithm based on Bayes’ theorem. It is called naive because it assumes that words are conditionally independent: given a class (category), the presence of a particular word is unrelated to the presence of any other word. This assumption simplifies the computations. In sentiment analysis, Naive Bayes is typically combined with the bag of words model to calculate the probability that a piece of text is positive or negative given the words it contains. Although Naive Bayes is simple, it can perform competitively with more sophisticated classification methods on some tasks.

Bayes’ rule is used in probability theory to calculate conditional probability. Bayes’ rule says that the probability of $b$ given $a$ is equal to the probability of $a$ given $b$, times the probability of $b$, divided by the probability of $a$.

<br/>

$P(b \; | \; a) = \large{\frac{P(b) \; P(a \; | \; b)}{P(a)}}$

<br/>

Bayes’ rule is a powerful rule of probability and statistics. Knowing $P(a \; | \; b)$, $P(a)$ and $P(b)$, it calculates $P(b \; | \; a)$.
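As a quick numerical illustration of the rule, the following sketch applies it to made-up numbers. The function name `bayes_rule` and all three input probabilities are hypothetical and were chosen only for this example.

```python
def bayes_rule(p_a_given_b, p_b, p_a):
    """Return P(b | a) computed from P(a | b), P(b) and P(a)."""
    return p_a_given_b * p_b / p_a


# Hypothetical events:
#   a = "the review contains the word 'amazing'"
#   b = "the review is positive"
p_a_given_b = 0.04   # P(a | b), assumed value
p_b = 0.5            # P(b), assumed value
p_a = 0.025          # P(a), assumed value

p_b_given_a = bayes_rule(p_a_given_b, p_b, p_a)
print(round(p_b_given_a, 4))
```

With these assumed values, $P(b \; | \; a) = 0.04 \times 0.5 \; / \; 0.025 = 0.8$.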

## Bag of words model

The bag of words model represents text as an unordered collection of words, ignoring syntax and word order. This approach is used in text classification tasks such as sentiment analysis or email classification, where the frequency of each word is used as a feature by the classifier.
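A minimal sketch of this representation, assuming a simplified tokenizer that only lowercases the text and splits it on whitespace. The helper name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, discarding word order."""
    # Simplification: lowercase and split on whitespace instead of real tokenization
    return Counter(text.lower().split())


bag = bag_of_words("great movie great story")
print(bag["great"], bag["movie"], bag["story"])

# The same words in a different order produce the same bag
print(bag == bag_of_words("story great movie great"))
```

Because only the counts are kept, two texts that contain the same words in a different order are indistinguishable to the classifier.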

## The Naive Bayes classifier

Naive Bayes assumes that every word in a sentence is independent of the other words given the class. That is, it assumes that context does not matter. Naive Bayes calculates $P(positive \; | \; ``amazing \; movie")$ as $P(positive \; | \; ``amazing", \; ``movie")$. By Bayes’ rule, this is equal to

<br/>

$\large{\frac{P(positive) \; P(``amazing", \; ``movie" \; | \; positive)}{P(``amazing", \; ``movie")}}$

<br/>

The numerator $P(positive) \; P(``amazing", \; ``movie" \; | \; positive)$ is equal to the joint probability $P(positive, \; ``amazing", \; ``movie")$. Under the naive assumption that context does not matter, it is approximated by $P(positive) \; P(``amazing" \; | \; positive) \; P(``movie" \; | \; positive)$. The denominator is the same for every class, so it is ignored to reduce computations, and the resulting values are normalized afterwards to sum up to $1$.

Therefore, for a document containing the words $w_1$, $w_2$, …, $w_n$, Naive Bayes calculates two values that are proportional to the probability of positive sentiment and negative sentiment:

$pos = P(positive) \prod_{i=1}^{n} P(w_i \; | \; positive) \; \propto \; P(positive \; | \; w_1, w_2, ... , w_n)$

$neg = P(negative) \prod_{i=1}^{n} P(w_i \; | \; negative) \; \propto \; P(negative \; | \; w_1, w_2, ... , w_n)$

<br/>

Where $P(positive) = \large{\frac{positive \; reviews}{total \; reviews}}$, and $P(negative) = \large{\frac{negative \; reviews}{total \; reviews}}$
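The two products can be sketched as follows. The helper `naive_bayes_score` is a hypothetical name, and the review counts and per-word probabilities are made-up values for the text "amazing movie", not values taken from a real corpus.

```python
import math


def naive_bayes_score(prior, word_probabilities):
    """Unnormalized score: the class prior times the product of P(w_i | class)."""
    return prior * math.prod(word_probabilities)


# Assumed priors: 10 positive and 10 negative reviews
p_positive = 10 / 20
p_negative = 10 / 20

# Assumed values of P("amazing" | class) and P("movie" | class)
pos = naive_bayes_score(p_positive, [0.02, 0.05])
neg = naive_bayes_score(p_negative, [0.01, 0.03])

print(pos > neg)
```

With these assumed values the positive score is larger than the negative one, so the text would lean positive once the scores are normalized.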

<br/>

Word probabilities are calculated using additive smoothing so that no word has probability zero. Laplace smoothing adds $1$ to each word count, pretending that every word $w$ in the vocabulary has been observed at least once in each class.

$P(w_i \; | \; positive) = \large{\frac{frequency \; of \; w_i \; in \; positive \; reviews \; + \; 1}{total \; words \; in \; positive \; reviews \; + \; total \; unique \; words}}$

$P(w_i \; | \; negative) = \large{\frac{frequency \; of \; w_i \; in \; negative \; reviews \; + \; 1}{total \; words \; in \; negative \; reviews \; + \; total \; unique \; words}}$
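The smoothing formula can be sketched as a small function. The name `laplace_probability` and the `alpha` parameter (the amount added to each count, $1$ for Laplace smoothing) are assumptions made for this example. The counts are those of the word 'great' in the sample data printed further down in this notebook: 2 occurrences among 39 words in positive reviews, 0 occurrences among 48 words in negative reviews, and 43 unique words overall.

```python
def laplace_probability(word_count, total_words_in_class, vocabulary_size, alpha=1):
    """Smoothed estimate of P(word | class) using additive smoothing."""
    return (word_count + alpha) / (total_words_in_class + alpha * vocabulary_size)


# 'great' appears 2 times in positive reviews and never in negative reviews
p_great_positive = laplace_probability(2, 39, 43)   # (2 + 1) / (39 + 43)
p_great_negative = laplace_probability(0, 48, 43)   # (0 + 1) / (48 + 43)

print(p_great_positive, p_great_negative)
```

Without smoothing, the second probability would be zero, and a single unseen word would force the whole product for that class to zero.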

<br/>

The normalized probability is:

$positive \; sentiment = \frac{pos}{pos \; + \; neg}$

$negative \; sentiment = \frac{neg}{pos \; + \; neg}$
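A short sketch of the normalization step, using made-up unnormalized scores. The function name `normalize` and the input values are hypothetical.

```python
def normalize(pos, neg):
    """Scale two non-negative scores so that they sum to 1."""
    total = pos + neg
    return pos / total, neg / total


# Assumed unnormalized scores for one review
positive_sentiment, negative_sentiment = normalize(0.0006, 0.0002)

print(round(positive_sentiment, 2), round(negative_sentiment, 2))
```

The raw scores are tiny and hard to interpret on their own. After normalization they read as 0.75 positive and 0.25 negative.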



In [22]:
from nltk import word_tokenize

In [23]:
# extract_words returns a list with the words in a text

def extract_words(text):
    return list(
        word.lower() for word in word_tokenize(text)
        if any(c.isalpha() for c in word)
    )

# load_file returns a tuple with the list of words in the input file and the total lines read
# the function extend() adds multiple items to a list

def load_file(file_name):
    total_lines = 0

    words = []

    with open(file_name, "r") as f:
        for line in f:
            words.extend(extract_words(line))

            total_lines += 1

    return (words, total_lines)

In [24]:
#words in positive reviews

words_in_positive_reviews, positive_reviews = load_file("positive_reviews.txt")

print(words_in_positive_reviews, positive_reviews)

['great', 'best', 'movie', 'ever', 'wonderful', 'indie', 'production', 'great', 'movie', 'i', 'recommend', 'it', 'i', 'loved', 'it', 'my', 'children', 'loved', 'it', 'i', 'really', 'enjoyed', 'it', 'amazing', 'movie', 'i', 'had', 'a', 'good', 'time', 'i', 'enjoyed', 'it', 'so', 'much', 'beautiful', 'and', 'tender', 'story'] 10


In [25]:
#words in negative reviews

words_in_negative_reviews, negative_reviews = load_file("negative_reviews.txt")

print(words_in_negative_reviews, negative_reviews)

['really', 'boring', 'worst', 'movie', 'ever', 'a', 'waste', 'of', 'time', 'not', 'worth', 'it', 'pretentious', 'and', 'boring', 'really', 'bad', 'i', 'want', 'my', 'money', 'back', 'terrible', 'a', 'waste', 'of', 'time', 'mediocre', 'production', 'and', 'direction', 'boring', 'and', 'mediocre', 'movie', 'very', 'bad', 'not', 'worth', 'it', 'tedious', 'story', 'and', 'characters', 'very', 'bad', 'and', 'boring'] 10


In [26]:
def calculate_word_frequency(words_in_positive_reviews, words_in_negative_reviews):
    # each entry corresponds with a word, and its value with the word's frequency in positive and negative reviews
    # for example, 'enjoyed': [2, 0] indicates that 'enjoyed' appears 2 times in positive reviews and zero times in negative reviews

    frequency = {}

    # for each word in a positive review, if the word is not in the dictionary it adds a new entry with frequency [1, 0]
    # otherwise, it increments the first value of the list, which corresponds to positive reviews

    for word in words_in_positive_reviews:
        if word in frequency:
            frequency[word][0] += 1
        else:
            frequency[word] = [1, 0]

    # for each word in a negative review, if the word is not in the dictionary it adds a new entry with frequency [0, 1]
    # otherwise, it increments the second value of the list, which corresponds to negative reviews

    for word in words_in_negative_reviews:
        if word in frequency:
            frequency[word][1] += 1
        else:
            frequency[word] = [0, 1]

    return frequency

In [27]:
# calculate_word_probability returns a dictionary with word's probability in positive and negative reviews

def calculate_word_probability(frequency, words_in_positive_reviews, words_in_negative_reviews):
    # each entry corresponds with a word, and its value with the word's probability in positive and negative reviews
    # for example, 'enjoyed': [0.036585365853658534, 0.01098901098901099] indicates that 'enjoyed' appears in positive
    # reviews with probability 0.036585365853658534, and in negative reviews with probability 0.01098901098901099

    # for each word in the dictionary frequency, it adds a new word in the dictionary probability, the probability of a
    # word is calculated using additive smoothing to avoid probability zero. Laplace Smoothing adds 1 to each value in
    # the probability distribution, pretending that all words have been observed at least once

    probability = {}

    for word in frequency:
        positive = (frequency[word][0] + 1)/(len(words_in_positive_reviews) + len(frequency))
        negative = (frequency[word][1] + 1)/(len(words_in_negative_reviews) + len(frequency))

        probability[word] = [positive, negative]

    return probability

In [28]:
#word probability in positive and negative reviews

word_frequency = calculate_word_frequency(words_in_positive_reviews, words_in_negative_reviews)
print(word_frequency)

word_probability = calculate_word_probability(word_frequency, words_in_positive_reviews, words_in_negative_reviews)
print(word_probability)

{'great': [2, 0], 'best': [1, 0], 'movie': [3, 2], 'ever': [1, 1], 'wonderful': [1, 0], 'indie': [1, 0], 'production': [1, 1], 'i': [5, 1], 'recommend': [1, 0], 'it': [5, 2], 'loved': [2, 0], 'my': [1, 1], 'children': [1, 0], 'really': [1, 2], 'enjoyed': [2, 0], 'amazing': [1, 0], 'had': [1, 0], 'a': [1, 2], 'good': [1, 0], 'time': [1, 2], 'so': [1, 0], 'much': [1, 0], 'beautiful': [1, 0], 'and': [1, 5], 'tender': [1, 0], 'story': [1, 1], 'boring': [0, 4], 'worst': [0, 1], 'waste': [0, 2], 'of': [0, 2], 'not': [0, 2], 'worth': [0, 2], 'pretentious': [0, 1], 'bad': [0, 3], 'want': [0, 1], 'money': [0, 1], 'back': [0, 1], 'terrible': [0, 1], 'mediocre': [0, 2], 'direction': [0, 1], 'very': [0, 2], 'tedious': [0, 1], 'characters': [0, 1]}
{'great': [0.036585365853658534, 0.01098901098901099], 'best': [0.024390243902439025, 0.01098901098901099], 'movie': [0.04878048780487805, 0.03296703296703297], 'ever': [0.024390243902439025, 0.02197802197802198], 'wonderful': [0.024390243902439025, 0.01

In [29]:
def classify(text, positive_reviews, negative_reviews, probability):
    # extract words from text
    words = extract_words(text)

    # ratio of positive and negative reviews (prior probabilities)
    total_reviews = positive_reviews + negative_reviews

    positive = positive_reviews / total_reviews
    negative = negative_reviews / total_reviews

    # multiply by the probability of each known word for positive and negative reviews,
    # words that are not in the dictionary are skipped
    for word in words:
        if word in probability:
            positive = positive * probability[word][0]
            negative = negative * probability[word][1]

    # normalize so that both values sum up to 1
    summation = positive + negative

    return (positive/summation, negative/summation)
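Multiplying many probabilities smaller than $1$ can underflow to zero for long texts, in which case `positive/summation` would divide by zero. A common workaround, which is not part of the original notebook, is to add logarithms instead of multiplying. The sketch below is a hypothetical variant: the name `classify_log` is an assumption, it takes a list of already extracted words so that it is self-contained, and the small `probability` dictionary is made up for the example.

```python
import math


def classify_log(words, positive_reviews, negative_reviews, probability):
    """Same result as multiplying probabilities, but computed in log space."""
    total_reviews = positive_reviews + negative_reviews

    log_positive = math.log(positive_reviews / total_reviews)
    log_negative = math.log(negative_reviews / total_reviews)

    # Sum log-probabilities of known words; unknown words are skipped
    for word in words:
        if word in probability:
            log_positive += math.log(probability[word][0])
            log_negative += math.log(probability[word][1])

    # Subtract the larger value before exponentiating so that at least one term is 1
    largest = max(log_positive, log_negative)
    positive = math.exp(log_positive - largest)
    negative = math.exp(log_negative - largest)
    summation = positive + negative

    return (positive / summation, negative / summation)


# Made-up smoothed probabilities: word -> [P(word | positive), P(word | negative)]
toy_probability = {"boring": [0.01, 0.05], "amazing": [0.02, 0.01]}

print(classify_log(["amazing"], 10, 10, toy_probability))
print(classify_log(["boring"] * 1000, 10, 10, toy_probability))
```

For short reviews both approaches give the same normalized values. The log-space version only matters when a text is long enough for the plain product to reach zero.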

In [30]:
# test the classifier with unseen movie reviews

unseen_reviews = ["Beautiful, amazing movie, I loved it!", "Boring, not worth it", "Amazing, I liked it so much", "Annoying and boring", "No idea"]

print("Words in unseen reviews \n")

for review in unseen_reviews:
    print(extract_words(review))

print("\nClassification \n")

for review in unseen_reviews:
    result = classify(review, positive_reviews, negative_reviews, word_probability)

    positive = result[0]
    negative = result[1]

    if abs(positive - negative) > 0.25:
        if positive > negative:
            print(review, "is", float("%.4f" % positive), "positive")
        else:
            print(review, "is", float("%.4f" % negative), "negative")
    else:
        print(review, "is neutral", float("%.4f" % positive), "positive", float("%.4f" % negative), "negative")

Words in unseen reviews 

['beautiful', 'amazing', 'movie', 'i', 'loved', 'it']
['boring', 'not', 'worth', 'it']
['amazing', 'i', 'liked', 'it', 'so', 'much']
['annoying', 'and', 'boring']
['no', 'idea']

Classification 

Beautiful, amazing movie, I loved it! is 0.9945 positive
Boring, not worth it is 0.9368 negative
Amazing, I liked it so much is 0.9878 positive
Annoying and boring is 0.9241 negative
No idea is neutral 0.5 positive 0.5 negative


In [33]:
from nltk.corpus import stopwords

#set of english stop words

stop_words = set(stopwords.words('english'))

print(stop_words)

{"he'd", 'an', 'theirs', 'being', 'down', 'll', 'mustn', "should've", 'when', 'their', 'or', "we're", 'these', 'y', 'am', 'needn', 'up', "don't", 'only', 'such', 'him', 'did', 'below', 'doing', "it'd", 'of', 'your', 'but', 'nor', 'over', 'under', 'any', "i've", "wasn't", 'them', 'from', 'while', 'so', 'who', 'shouldn', "i'd", 'does', 'have', "i'm", "they'd", 'as', "it'll", 'after', 'its', 'just', 'ours', 'this', 'ain', "we'd", 'if', 'same', "they'll", "you've", 'our', 'how', 'other', 'he', 'his', "doesn't", 'most', 'against', 'all', 'you', 'through', 'out', 'that', 't', 'wasn', "we've", 'having', 'yours', "won't", 'here', 'once', 'both', "aren't", 'at', 'there', 'whom', 'before', "mightn't", 'above', 's', "needn't", 'aren', 'why', 'to', 'had', 'more', 've', 'until', "hadn't", 'is', 'she', "she'll", 'which', 'those', 'be', 'few', 'between', 'during', "mustn't", 'too', 'ma', 'what', "didn't", 'each', "hasn't", 'd', "wouldn't", 'now', "you'd", "she's", 'some', 'themselves', 'shan', 'can',

In [None]:
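# the function extract_words is overridden to exclude stop words
# assumed completion: the same filter as the first version, plus a check against stop_words

def extract_words(text):
    return list(
        word.lower() for word in word_tokenize(text)
        if any(c.isalpha() for c in word) and word.lower() not in stop_words
    )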