# Sentiment Analysis 
In this lab, we will build a sentiment analysis system that detects the attitude expressed in a text. We base the system on the Naive Bayes model. Naive Bayes is a probabilistic classifier, meaning that for a document $d$, out of all classes $c \in C$, the classifier returns the class $\hat c$ that has the maximum posterior probability given the document. We express that mathematically as follows: $$ \hat c = \underset{c \in C}{\arg\max}{P(c | d)} \tag {1}$$ 

We use Bayes' rule to rewrite $Eq.1$ in terms of other probabilities that have some useful properties:

$$ P(c|d) = \frac {P(d|c) P(c)} {P(d)} \tag{2}$$

Bayes' rule gives us a way to break down any conditional probability, $P(c|d)$, into three other probabilities. We can derive a new formula for $\hat c$ using Bayes' rule as follows: 
$$ \hat c = \underset{c \in C}{\arg\max}{P(c | d)} = \underset{c \in C}{\arg\max} {\frac {P(d|c) P(c)} {P(d)}} \tag {3} $$

We can simplify $Eq.3$ by dropping the denominator $P(d)$. This is possible because we compute $\frac {P(d|c) P(c)} {P(d)}$ for each possible class, but $P(d)$ doesn't change from class to class: we are always asking about the most likely class for the same document $d$, which must have the same probability $P(d)$. Thus, we can choose the class that maximizes the simpler formula: 
$$ \hat c = \underset{c \in C}{\arg\max}{P(c | d)} = \underset{c \in C}{\arg\max}{P(d|c) P(c)} \tag{4}$$

As a result of this simplification, we can now compute the most probable class $\hat c$ given some document $d$ by choosing the class that has the highest product of two probabilities: 

 - <b>Prior Probability $P(c)$: </b> represents what is originally believed about $c$ before new evidence is introduced or before collecting the document $d$ 

 - <b> Likelihood $P(d|c)$: </b> the probability of observing the document $d$ given that it belongs to the class $c$ 
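
As a quick numeric check of the step from $Eq.3$ to $Eq.4$, the sketch below uses made-up values for $P(c)$ and $P(d|c)$ (illustrative assumptions, not estimates from the lab data) and shows that dividing by $P(d)$ does not change which class wins.

```python
# Toy numbers, invented for illustration only
prior = {"pos": 0.6, "neg": 0.4}            # P(c)
likelihood = {"pos": 0.002, "neg": 0.005}   # P(d|c) for one fixed document d

# Unnormalized scores P(d|c) * P(c)
scores = {c: likelihood[c] * prior[c] for c in prior}

# P(d) is the same for every class: the sum of the unnormalized scores
p_d = sum(scores.values())
posterior = {c: scores[c] / p_d for c in scores}

# The winning class is the same with or without the division by P(d)
best_unnormalized = max(scores, key=scores.get)
best_posterior = max(posterior, key=posterior.get)
print(best_unnormalized, best_posterior)
```

Both lookups return the same class because dividing every score by the same positive constant preserves their ordering.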
 
 
We can rewrite $Eq.4$ by viewing the document as a set of features $f_1, \ldots , f_n$:
$$ \hat c = \underset{c \in C}{\arg\max}{P(c | d)} = \underset{c \in C}{\arg\max}{P(f_1, \ldots , f_n|c) P(c)} \tag{5}$$

However, it is still hard to compute $\hat c$ using this formula because of the huge number of parameters in the joint probability $P(f_1, \ldots , f_n|c)$. We therefore need to make some assumptions that simplify the equation so we can compute $\hat c$ easily. 

### Naive Bayes Assumption 
 - <b>The Bag of Words Assumption</b>: we assume that the document $d$ is just a bag of features (words), $f_1, \ldots , f_n$, regardless of their order. In other words, the features $f_1, \ldots , f_n$ only encode the word identity and not the position. 
 - <b> Conditional Independence Assumption:</b> this is commonly known as the Naive Bayes assumption: we assume that the features $f_i$ are independent given the class $c$. Based on that assumption, we can break the likelihood into a product of simple probabilities as follows: $$P(f_1, \dots , f_n|c)  = P(f_1|c) \cdot P(f_2|c) \cdots P(f_n|c) \tag{6}$$
 

Based on these two assumptions, we can rewrite $Eq.5$ as follows: 
$$\hat c = \underset{c \in C}{\arg\max}{P(c | d)} = \underset{c \in C}{\arg\max} P(f_1, \ldots , f_n|c) P(c)  =  \underset{c \in C}{\arg\max} P(f_1|c) \cdot P(f_2|c) \cdots P(f_n|c) P(c) = \underset{c \in C}{\arg\max} P(c) \prod_{f \in F} {P(f|c)} \tag{7}$$

<b>Finally,</b> since the features that describe the document are just words, we can replace the features $f_1, \dots , f_n$ with the words $w_1, \dots , w_n$. The final equation used by a Naive Bayes classifier is then: $$\hat c = \underset{c \in C}{\arg\max} P(c) \prod_{i} {P(w_i|c)} \tag{8} $$
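
To make $Eq.8$ concrete, here is a minimal sketch with invented per-class word probabilities (the `prior` and `word_prob` tables and the `classify` helper are hypothetical, not part of the lab code). It multiplies the prior by the per-word likelihoods and returns the class with the larger product.

```python
import math

# Hypothetical parameters, invented for illustration
prior = {"pos": 0.5, "neg": 0.5}
word_prob = {
    "pos": {"happy": 0.30, "sad": 0.05, "day": 0.20},
    "neg": {"happy": 0.05, "sad": 0.30, "day": 0.20},
}

def classify(words):
    """Return the class maximizing P(c) * prod_i P(w_i|c), plus all scores."""
    scores = {}
    for c in prior:
        scores[c] = prior[c] * math.prod(word_prob[c][w] for w in words)
    return max(scores, key=scores.get), scores

label, scores = classify(["happy", "day"])
print(label, scores)
```

Because the score is a plain product, reordering the words (for example `["day", "happy"]`) gives the same result, which is the bag-of-words assumption at work.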

<b>However,</b> for computational reasons, Naive Bayes calculations are done in log space to avoid underflow and to speed up the computation. Thus, $Eq.8$ is expressed as: 

$$ \hat c = \underset{c \in C} {\arg \max} \log{\Big( P(c) \prod_{i} P(w_i|c) \Big)}  = \underset{c \in C} {\arg \max} \Big[ \log{P(c)} \hspace{1mm} + \hspace{1mm} \sum_{i}{\log{P(w_i|c)}} \Big] \tag{9}
$$
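
The underflow problem is easy to reproduce. The sketch below assumes a document of 400 words, each with an invented likelihood of 0.001: the direct product collapses to zero in double precision, while the sum of logs stays finite.

```python
import math

# 400 words, each with an invented likelihood of 0.001
probs = [1e-3] * 400

# Direct product: the true value 1e-1200 is far below the smallest double
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays a perfectly ordinary floating-point number
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Since the logarithm is monotonically increasing, comparing classes by their log scores picks the same winner as comparing the raw products would.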

#### Note: 
Classifiers that use a linear combination of the inputs to make a classification decision, such as Naive Bayes and logistic regression, are called <b>linear classifiers</b>.
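
One way to see the linear form: in log space the score is a bias term plus a dot product between per-word weights and the word counts of the document. The vocabulary, weights, and counts below are invented for illustration.

```python
import numpy as np

# Hypothetical vocabulary and per-word log weights (invented values)
vocab = ["happi", "sad", "great"]
weights = np.array([1.2, -1.5, 0.9])
bias = 0.0  # plays the role of the log prior term

# Word-count vector for a document containing "sad" once and "great" twice
x = np.array([0, 1, 2])

# The decision score is a linear combination of the inputs
score = bias + weights @ x
print(score)
```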

# Outline 
* [Import Functions, Libraries, and Data](#1)
* [Pre-process The Data](#2)
 * [Remove Noise](#3)
 * [Building a Frequency Table](#4)
 * [Computing the Positive and Negative Frequencies for a Word in a Certain Class](#44)
 * [Pre-processing Step](#5)
* [Training the Model](#6)
* [Testing the Model](#7)
 * [ Make Predictions](#71)
 * [Compute the Accuracy](#72)
* [Error Analysis](#8)



# Import Functions, Libraries, and Data <a anchor = "anchor" id = "1" > </a>

In [None]:
#import the necessary libraries and functions
import numpy as np 
import math 
import nltk
import string
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import re
from nltk.corpus import stopwords

In [None]:
#download the data needed for this lab (uncomment on the first run)
#nltk.download("twitter_samples")
#nltk.download("stopwords")

In [None]:
#import the data 
from nltk.corpus import twitter_samples

#  Pre-process The Data <a anchor = "anchor" id = "2"></a>



## Remove Noise  <a anchor = "anchor" id = "3"></a>
For any machine learning project, once you've gathered the data, the first step is to process it to make useful inputs to your model.
- **Remove noise**: You will first want to remove noise from your data -- that is, remove words that don't tell you much about the content. These include all common words like 'I, you, are, is, etc...' that would not give us enough information on the sentiment.
- We'll also remove stock market tickers, retweet symbols, hyperlinks, and the hash sign of hashtags because they carry little information about the sentiment.
- You also want to remove all the punctuation from a tweet. The reason for doing this is because we want to treat words with or without the punctuation as the same word, instead of treating "happy", "happy?", "happy!", "happy," and "happy." as different words.
- Finally you want to use stemming to only keep track of one variation of each word. In other words, we'll treat "motivation", "motivated", and "motivate" similarly by grouping them within the same stem of "motiv-".
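
The cleaning steps listed above can be sketched with the standard library alone. This is a simplified stand-in, not the lab's `process_tweet`: the tiny `STOP_WORDS` set and the regex tokenizer are assumptions, and stemming is left out.

```python
import re

# A tiny stand-in stop-word list (assumption; the lab uses NLTK's English list)
STOP_WORDS = {"i", "you", "are", "is", "a", "the", "have"}

def simple_clean(tweet):
    """Strip tickers, RT, links and '#', then drop stop words and punctuation."""
    tweet = re.sub(r'\$\w*', '', tweet)         # stock tickers such as $GE
    tweet = re.sub(r'^RT\s+', '', tweet)        # old-style retweet marker
    tweet = re.sub(r'https?://\S+', '', tweet)  # hyperlinks
    tweet = tweet.replace('#', '')              # keep the hashtag word, drop '#'
    # crude tokenizer: lowercase runs of letters, so punctuation is discarded
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(simple_clean("RT Have a great day! #happy $GE http://example.com"))
print(simple_clean("happy? happy! happy."))
```

With punctuation stripped, "happy?", "happy!" and "happy." all collapse to the same token, which is the point of the punctuation step.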

In [None]:
def process_tweet(tweet):
    '''
    Usage:
      #process_tweet --> used to clean the text, tokenize it into separate words, remove stopwords, 
                         and convert words to stems.
      
    Arguments:
      #tweet --> a string containing  a tweet 
    
    Returns:
      #tweets_clean --> a list of words containing the processed tweet
    '''
    
    #create a new Porter stemmer
    stemmer = PorterStemmer()
    
    #get the list of all the English stop words 
    stopwords_english = stopwords.words('english')
    
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    
    # remove hyperlinks (match up to the next whitespace so the rest of the tweet is kept)
    tweet = re.sub(r'https?://\S+', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    
    #define an empty list which will hold the cleaned tokens
    tweets_clean = []
    
    #loop over every token in the list of tokens, tweet_tokens
    for token in tweet_tokens:
        if (token not in stopwords_english and token not in string.punctuation):
            
            #stemming that token 
            stem_token = stemmer.stem(token)
            
            #append the stem of the token in the tweets_clean list
            tweets_clean.append(stem_token)
            
    
    return tweets_clean

In [None]:
#Test the code
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(process_tweet(custom_tweet))

In [None]:
' '.join(process_tweet(custom_tweet))

In [None]:
' '.join(process_tweet(custom_tweet)).encode('ascii', 'ignore')

## Building a Frequency Table <a anchor = "anchor" id ="4" ></a>
To help train your naive bayes model, you will need to build a dictionary where the keys are a (word, label) tuple and the values are the corresponding frequency.  Note that the labels we'll use here are 1 for positive and 0 for negative.
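
As a compact illustration of the data structure, the sketch below builds the same kind of `(word, label)` table with `collections.Counter`. It assumes the tweets are already tokenized and stemmed; `count_pairs` and the toy tokens are hypothetical and are not the lab's `build_freqs`.

```python
from collections import Counter

def count_pairs(token_lists, labels):
    """Count (word, label) pairs from already-tokenized tweets."""
    counts = Counter()
    for tokens, label in zip(token_lists, labels):
        counts.update((word, label) for word in tokens)
    return counts

# Tokens are assumed to be pre-processed already
toy_tokens = [["happi"], ["trick"], ["sad"], ["tire"], ["tire"]]
toy_labels = [1, 0, 0, 0, 0]

toy_freqs = count_pairs(toy_tokens, toy_labels)
print(toy_freqs[("tire", 0)], toy_freqs[("happi", 0)])
```

A `Counter` returns 0 for a pair it has never seen, so looking up a missing `(word, label)` key needs no special case.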

In [None]:
def build_freqs(tweets, ys):
    '''
    Usage:
      #build_freqs --> used to count how often a word in the 'corpus' (the entire set of tweets) was 
                       associated with a positive label '1' or a negative label '0', then builds 
                       the freqs dictionary, where each key is a (word,label) tuple, 
                       and the value is the count of its frequency within the corpus of tweets.
      
    Arguments:
      #tweets --> a list of tweets 
      #ys --> an m x 1 array that holds the sentiment label or class (0 or 1) corresponding to every tweet
    
    Returns:
      #freqs --> a dictionary whose key is (word, sentiment label (class)) and whose value is the frequency
                     of the word, i.e. the number of times that word shows up in that class
    '''
    
    #reduce the rank of the ys array to be rank one array, then, convert it to list 
    yslist = np.squeeze(ys).tolist()
    
    #initialize freqs dic as empty dic which will be populated by looping over every tweet in tweets
    freqs = {}
    
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            
            #if the key exist in the dic, increment the value by one 
            if pair in freqs:
                freqs[pair] += 1
            #if not, initialize its value to one 
            else:
                freqs[pair] = 1
                
    return freqs

In [None]:
#Test the code

tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]

build_freqs(tweets,ys)

## Computing the Positive and Negative Frequencies for a Word in a Certain Class<a anchor = "anchor" id ="44" ></a>

In [None]:
def look_up(freqs, word, label):
    '''
    Usage:
      #look_up --> used to get the frequency of a (word, label) pair from the frequency table
  
    
    Arguments:
      #freqs --> a dic which map each (word, label) to its corresponding frequency
      #word --> the word to look up
      #label: the label corresponding to the word
    
    Returns:
      #n --> the number of times the word of interest appears in all the documents of topic c (c= label)
    
    Notes:
      #We return zero if the tuple doesn't exist in the frequency table 
    '''
    
    #initialize n 
    n = 0
    
    #get the given 2-tuple
    pair = (word, label) 
    
    #check if that tuple exists in the frequency table 
    if (pair in freqs):
        
        n = freqs[pair] #get the corresponding frequency to this tuple
        
    return n

## Pre-processing Step  <a anchor = "anchor" id = "5"></a>
* The `twitter_samples` contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.  
    * If you used all three datasets, we would introduce duplicates of the positive tweets and negative tweets.  
    * You will select just the five thousand positive tweets and five thousand negative tweets.


In [None]:
# get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

In [None]:
#explore the training tweets 
train_x

In [None]:
#Explore the test tweets
test_x

In [None]:
#combine the positive and negative labels into m x 1 column vectors
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [None]:
#explore train_y 
train_y.shape

In [None]:
#explore test_y 
test_y.shape

In [None]:
#create frequency dictionary 
freqs = build_freqs(train_x,train_y)

In [None]:
#explore the freq dictionary 
freqs

In [None]:
#explore the length of the frequency dic 
len(freqs.keys())

# Training the Model <a anchor = "anchor" id = "6"></a>

The goal of training the model is to learn the probabilities $P(c)$ and $P(w_i|c)$. How? We simply use the frequencies in the data.


<b>For</b> $P(c)$:
We ask what percentage of the documents in our training set fall in each class $c$. Thus, the prior probability is given by: 

$$P(c) = \frac {N_c} {N_{doc}}$$  where, 

$N_c:$ the number of the documents in our training set falling under the class $c$

$N_{doc}:$ the total number of documents 
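
For example, with a hypothetical label vector (1 for positive, 0 for negative), the prior of each class is just its share of the documents:

```python
import numpy as np

# Hypothetical labels: 1 = positive, 0 = negative
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])

n_doc = len(labels)                      # N_doc
prior_pos = np.sum(labels == 1) / n_doc  # N_pos / N_doc
prior_neg = np.sum(labels == 0) / n_doc  # N_neg / N_doc
print(prior_pos, prior_neg)
```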

<b>For</b> $P(w_i|c)$:
We compute the likelihood as the fraction of times the word $w_i$ appears among all the words in all documents of topic $c$. Thus, the likelihood is given by: 
$$ P(w_i|c) = \frac {count(w_i, c)} { \sum_{w \in V}{count(w,c)}}$$

where, 

$count(w_i, c):$ the number of times the word $w_i$ appears in all documents of topic $c$ 

$\sum_{w \in V}{count(w,c)}:$ the sum of the frequency of each word in the documents of topic $c$
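
A small worked example of this estimate, using invented word counts for the positive class:

```python
# Hypothetical word counts in the positive class
counts_pos = {"happi": 3, "great": 2, "sad": 0, "day": 5}

# Denominator: sum over the vocabulary of count(w, +)
total_pos = sum(counts_pos.values())

# P(w|+) = count(w, +) / total
likelihood_pos = {w: n / total_pos for w, n in counts_pos.items()}
print(likelihood_pos)
```

Note that a word with a count of zero in a class gets a likelihood of exactly zero under this estimate.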


#### Add-One Smoothing
There's a problem with estimating the likelihood this way. Imagine that we are trying to estimate the likelihood of the word "great" given the positive class, but our training set is too small to cover every word, so "great" never appears in a positive document. Its count is then zero, and so $P(w_i|c)$ equals zero because the numerator is zero. That affects the classification as a whole: the Naive Bayes model multiplies all the feature likelihoods together, so a single zero drives the whole product to zero. One solution is <b>add-one smoothing</b>: we pretend that the frequency of every word in the vocabulary is incremented by one, and thus the likelihood is given by:
$$ P(w_i|c) = \frac {count(w_i, c) + 1} { \sum_{w \in V}{(count(w,c) + 1 )}} =  \frac {count(w_i, c) + 1} {\big ( \sum_{w \in V}{count(w,c)}\big) + |V|}$$
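
The smoothed estimate can be checked on a toy example. The counts below are invented; the word "sad" has a count of zero yet still receives a small non-zero probability, and the smoothed values still sum to one.

```python
# Hypothetical word counts in the positive class
counts_pos = {"happi": 3, "great": 2, "sad": 0, "day": 5}

V = len(counts_pos)                   # vocabulary size |V|
total_pos = sum(counts_pos.values())  # sum over w in V of count(w, +)

# Add-one smoothing: (count + 1) / (total + |V|)
smoothed = {w: (n + 1) / (total_pos + V) for w, n in counts_pos.items()}
print(smoothed["sad"], sum(smoothed.values()))
```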



#### The Pseudo-code of The Training Algorithm
<img src = "https://i.imgur.com/26TEwqH.png" width = "50%" >



#### We will make some modifications in the algorithm: 

 - Instead of computing $P(+)P(d|+)$ and $P(-)P(d|-)$ separately and comparing them to find which is more probable, we take the logarithm of their ratio.
 - By convention, the positive class goes in the numerator and the negative class in the denominator.
 - If the log ratio is > 0, the positive class is more probable than the negative one. 
 - If the log ratio is < 0, the negative class is more probable than the positive one. 
 - The formula of the log ratio is as follows:
 $$ \log\frac{P(+)\,P(d|+)}{P(-)\,P(d|-)} = \log\bigg (\frac{P(+)} {P(-)} \cdot \prod_{i}\big( \frac{P(w_i|+)}{P(w_i|-)} \big)\bigg)  = \log\big (\frac{P(+)} {P(-)}\big) + \sum_{i}{\log\bigg({\frac{P(w_i|+)}{P(w_i|-)}}\bigg)} = log\hspace{1mm}prior + log\hspace{1mm}likelihood
 $$
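
The sketch below checks, on invented numbers, that the sign of the log score agrees with a direct comparison of the two class products. The parameter tables and the helpers `log_ratio` and `direct_products` are hypothetical.

```python
import math

# Invented parameters for a two-class toy problem
p_pos, p_neg = 0.5, 0.5
p_w_pos = {"great": 0.20, "bad": 0.02}
p_w_neg = {"great": 0.05, "bad": 0.25}

def log_ratio(words):
    """Log prior ratio plus the sum of per-word log likelihood ratios."""
    score = math.log(p_pos / p_neg)
    score += sum(math.log(p_w_pos[w] / p_w_neg[w]) for w in words)
    return score

def direct_products(words):
    """P(c) * prod_i P(w_i|c) for the positive and the negative class."""
    pos = p_pos * math.prod(p_w_pos[w] for w in words)
    neg = p_neg * math.prod(p_w_neg[w] for w in words)
    return pos, neg

doc = ["great", "great", "bad"]
s = log_ratio(doc)
pos, neg = direct_products(doc)
print(s, pos > neg)
```

A positive score and `pos > neg` always go together, so the classifier only needs to look at the sign of the sum.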

In [None]:
def train_naive_bayes(freqs, train_x, train_y):
    
    '''
    Usage:
      #train_naive_bayes --> used for training the model
  
    
    Arguments:
      #freqs --> a dic which maps each (word, label) to its corresponding frequency
      #train_x --> a list of tweets 
      #train_y --> a list of labels corresponding to the tweets (0,1)
    
    Returns:
      #logprior --> log(P(+)/P(-)), the log of the ratio of the class priors
      #loglikelihood --> a dictionary with the log likelihood ratio of each word --> {w_1: log(P(w_1|+)/P(w_1|-)),..., w_n: log(P(w_n|+)/P(w_n|-))}
    '''
    
    #initialize the loglikelihood dic as well as logprior
    loglikelihood = {}
    logprior = 0
    
    #get the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    
    #compute the vocabulary size
    V = len(vocab)
    
    #initialize Npos and Nneg 
    N_pos = N_neg = 0
    
    #compute  Npos and Nneg -->  ∑𝑐𝑜𝑢𝑛𝑡(𝑤,𝑐)
    #Loop over every (word, label) in the frequency table
    for pair in freqs.keys():
        
        #if the label is positive (greater than zero) 
        if pair[1] > 0:
            
            #increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[pair]
            
        else:
            
            N_neg += freqs[pair]
            
    
    #########################
    # compute the log prior #
    ########################
    
    #calculate the total number of documents 
    D = len(train_y)
    
    #compute the number of positive documents (labels greater than zero)
    D_pos = np.sum(np.asarray(train_y) > 0)
    
    #compute the number of negative documents 
    D_neg = D - D_pos
    
    
    #Calculate the log prior 
    logprior = np.log(D_pos) - np.log(D_neg)
    
    ##############################
    # compute the log likelihood #
    ##############################
    
    
    #Loop over each word in the vocabulary
    for word in vocab:
        
        #get the positive and negative frequencies of the word 
        freqs_pos = look_up(freqs, word, 1) # 𝑐𝑜𝑢𝑛𝑡(𝑤_i,+)
        freqs_neg = look_up(freqs, word, 0) # 𝑐𝑜𝑢𝑛𝑡(𝑤_i,-)
        
        #calculate 𝑃(𝑤ord|+) 
        p_w_pos = (freqs_pos + 1) / (N_pos + V)
        
        #calculate 𝑃(𝑤ord|-) 
        p_w_neg = (freqs_neg + 1) / (N_neg + V)
        
        #calculate the log likelihood of the word 
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
        
    
    
    return logprior, loglikelihood

In [None]:
#Test the code
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

# Testing the Model <a anchor = "anchor" id = "7"></a>

#### The Pseudo-code of The Testing Algorithm

<img src = "https://i.imgur.com/c0Z36Yh.png" width = "800px">


### Note:
- Because we modified the training algorithm, the testing algorithm changes as well: instead of outputting the class with the highest probability given the document, we output the log ratio that tells us whether the document is more likely to fall in the positive class or the negative one.
- If the log ratio is > 0, the positive class is more probable than the negative one. 
- If the log ratio is < 0, the negative class is more probable than the positive one. 
- The log prior can be a negative value: if the ratio $P(+)/P(-)$ is between 0 and 1, its logarithm is negative, e.g. $\ln(0.5) \approx -0.69$.
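
The decision rule in this note amounts to a threshold at zero. Below is a minimal sketch; `to_label` is a hypothetical helper, and sending a score of exactly zero to the negative class is an assumption chosen to match a strict `> 0` test.

```python
import math

def to_label(score):
    """Map a log-ratio score to a class label: 1 = positive, 0 = negative."""
    return 1 if score > 0 else 0

# A ratio between 0 and 1 has a negative natural logarithm
log_prior = math.log(0.5)

print(to_label(log_prior), to_label(1.3), to_label(0.0))
```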

<br>

## Make Predictions <a anchor = "anchor" id = "71"></a>

In [None]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Usage:
      #naive_bayes_predict --> used to score a single tweet with the trained model
  
    
    Arguments:
      #tweet --> the tweet (document) to classify
      #logprior --> log(P(+)/P(-))
      #loglikelihood --> a dictionary with the log likelihood ratio of each word --> {w_1: log(P(w_1|+)/P(w_1|-)),..., w_n: log(P(w_n|+)/P(w_n|-))}
    
    Returns:
      #p --> logprior plus the sum of the log likelihood ratios of the words in the tweet that are found in the dictionary (a number)
    '''
    
    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # initialize probability to zero
    p = 0

    # add the logprior
    p += logprior

    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]


    return p

In [None]:
#Test the code

my_tweet = 'She smiled.'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', p)

In [None]:
#Extra test

my_tweet = 'She is sad.'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', p)

## Compute the Accuracy <a anchor = "anchor" id = "72"></a>

In [None]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    '''
    Usage:
      #test_naive_bayes --> used to compute the accuracy of the trained model on a test set
  
    
    Arguments:
      #test_x --> list of tweets
      #test_y --> the corresponding labels for the list of tweets
      #logprior --> log𝑃(𝑐)
      #loglikelihood --> a dictionary with the loglikelihoods for each word --> {W_1: log𝑃(𝑤1|𝑐),..., W_n: log𝑃(𝑤n|𝑐)}
    
    Returns:
      #accuracy --> (# of tweets classified correctly)/(total # of tweets)
    '''
    
    #initialize the accuracy 
    accuracy = 0
    
    #define an empty list, y_hats, which will hold the predicted label for every tweet in the test_x list 
    y_hats = []
    
    
    #Loop over every tweet in the list of tweets, test_x
    for tweet in test_x:
        
        #if prediction > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            y_hat_i = 1
        else :
            y_hat_i = 0
        
        #append the predicted label corresponding to this tweet 
        y_hats.append(y_hat_i)
        
    #reshape y_hats to match the shape of test_y so the element-wise difference is computed correctly
    y_hats = np.array(y_hats).reshape(test_y.shape)
        
    
    #compute the error which is (# of tweets classified incorrectly)/(total # of tweets)
    error = np.mean(np.absolute(y_hats - test_y))
    
    #compute the accuracy which is 1-error 
    accuracy = 1 - error
    
        
    return accuracy 

In [None]:
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

In [None]:
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    # print( '%s -> %f' % (tweet, naive_bayes_predict(tweet, logprior, loglikelihood)))
    p = naive_bayes_predict(tweet, logprior, loglikelihood)
#     print(f'{tweet} -> {p:.2f} ({p_category})')
    print(f'{tweet} -> {p:.2f}')

In [None]:
# Feel free to check the sentiment of your own tweet below
my_tweet = 'you are bad :('
naive_bayes_predict(my_tweet, logprior, loglikelihood)

# Error Analysis <a anchor = "anchor" id = "8"></a>

In [None]:
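# Print the misclassified test tweets: true label, predicted label, processed tweet
print('Truth\tPredicted\tTweet')
for x, y in zip(test_x, test_y):
    y_true = int(np.squeeze(y))  # works for both 1-D and column-vector labels
    y_pred = int(naive_bayes_predict(x, logprior, loglikelihood) > 0)
    if y_true != y_pred:
        # drop non-ASCII characters (e.g. emoji) so the line prints cleanly
        text = ' '.join(process_tweet(x)).encode('ascii', 'ignore').decode('ascii')
        print('%d\t%d\t%s' % (y_true, y_pred, text))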

For more information about <b>encode() function</b> visit [encode()](https://www.w3schools.com/python/ref_string_encode.asp)
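
As a brief illustration of what `encode('ascii', 'ignore')` does, the sketch below uses an invented string containing an emoji and an accented letter. Characters outside ASCII are silently dropped and the result is a `bytes` object, which can be decoded back to `str` for printing.

```python
# Invented sample text with an emoji and an accented character
text = "great day \U0001F600 caf\u00e9"

ascii_bytes = text.encode("ascii", "ignore")  # non-ASCII characters are dropped
ascii_text = ascii_bytes.decode("ascii")      # back to str for clean printing

print(ascii_bytes)
print(ascii_text)
```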

# Congratulations!