# Sentiment Analysis
In this lab, we will build a sentiment analysis system that detects the attitude (positive or negative) of a piece of text. The system is based on a logistic regression model.


# Outline 
* [Import Functions, Libraries, and Data](#1)
* [Helper Functions](#2)
* [Pre-process the data](#3)
* [Logistic Regression](#4)
 * [Sigmoid Function](#4.1)
 * [Compute the Cost Function](#4.2)
 * [Compute the Gradient](#4.3)
* [Extracting the Features](#5)
 * [Feature Extraction for a Single Tweet](#5.1)
 * [Feature Extraction for all the Tweets in the Training Set](#5.2)
* [Training the Model](#6)
* [Test your logistic regression](#7)
* [Check Performance Using the Test Set](#8)
* [Error Analysis](#9)
* [Predict with your Own Tweet](#10)

# Import Functions, Libraries, and Data <a anchor = "anchor" id = "1" > </a>

In [None]:
#import the necessary libraries and functions
import numpy as np 
import math 
import nltk
import string
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import re
from nltk.corpus import stopwords

In [None]:
#download the data needed for this lab
nltk.download("twitter_samples")
nltk.download("stopwords")

In [None]:
#import the data 
from nltk.corpus import twitter_samples 

# Helper Functions <a anchor = "anchor" id = "2"></a>

We will implement some helper functions that help us build the system.

In [None]:
def process_tweet(tweet):
    '''
    Usage:
      #process_tweet --> used to clean the text, tokenize it into separate words, remove stopwords, 
                         and convert words to stems.
      
    Arguments:
      #tweet --> a string containing  a tweet 
    
    Returns:
      #tweet_clean --> a list of words containing  the processed tweet
    '''
    
    #create a new Porter stemmer
    stemmer = PorterStemmer()
    
    #get the list of all the English stop words 
    stopwords_english = stopwords.words('english')
    
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    
    #define an empty list which will hold the cleaned tokens
    tweets_clean = []
    
    #loop over every token in the list of tokens, tweet_tokens
    for token in tweet_tokens:
        if (token not in stopwords_english and token not in string.punctuation):
            
            #stemming that token 
            stem_token = stemmer.stem(token)
            
            #append the stem of the token in the tweets_clean list
            tweets_clean.append(stem_token)
            
    
    return tweets_clean

In [None]:
def build_freqs(tweets, ys):
    '''
    Usage:
      #build_freqs --> used to count how often a word in the 'corpus' (the entire set of tweets) was 
                       associated with a positive label '1' or a negative label '0', then builds 
                       the freqs dictionary, where each key is a (word, label) tuple, 
                       and the value is the count of its frequency within the corpus of tweets.
      
    Arguments:
      #tweets --> a list of tweets 
      #ys --> an m x 1 array holding the sentiment label (0 or 1) of every tweet
    
    Returns:
      #freqs --> a dictionary whose key is (word, sentiment label (class)) and whose value is the
                 frequency of the word, i.e. the number of times that word shows up in that class
    '''
    
    #reduce the rank of the ys array to be rank one array, then, convert it to list 
    yslist = np.squeeze(ys).tolist()
    
    #initialize freqs dic as empty dic which will be populated by looping over every tweet in tweets
    freqs = {}
    
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            
            #if the key exists in the dictionary, increment the value by one 
            if pair in freqs:
                freqs[pair] += 1
            #if not, initialize its value to one 
            else:
                freqs[pair] = 1
                
    return freqs
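
The counting logic of `build_freqs` can be seen in isolation on a toy corpus. The sketch below is only an illustration: `toy_build_freqs` is a hypothetical helper that takes already-tokenized tweets (so it does not call `process_tweet` or need NLTK), and the tokens and labels are made up.

```python
from collections import Counter

def toy_build_freqs(tokenized_tweets, labels):
    """Count (word, label) pairs over a list of already-tokenized tweets."""
    freqs = Counter()
    for tokens, label in zip(tokenized_tweets, labels):
        for word in tokens:
            # Counter returns 0 for a missing key, so no explicit "if pair in freqs" is needed
            freqs[(word, label)] += 1
    return dict(freqs)

# made-up, pre-stemmed tokens: two positive tweets and one negative tweet
tokenized = [["happi", "day"], ["sad", "day"], ["happi", "happi"]]
labels = [1, 0, 1]
toy_freqs = toy_build_freqs(tokenized, labels)
print(toy_freqs)
```

Note that a word repeated inside one tweet is counted once per occurrence, which matches the loop in `build_freqs`.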

# Pre-process the data <a anchor = "anchor" id = "3"></a>
* The `twitter_samples` corpus contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.  
    * If we used all three datasets, we would introduce duplicates of the positive and negative tweets.  
    * We will select just the 5,000 positive tweets and the 5,000 negative tweets.

In [None]:
#select the set of positive and negative tweets 
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
#split the data into train and test set 
train_pos = all_positive_tweets[ :4000]
test_pos = all_positive_tweets[4000:]

train_neg = all_negative_tweets[:4000]
test_neg = all_negative_tweets[4000:]

In [None]:
#concatenating the training tweets, positive and negative 
train_x  = train_pos + train_neg 

#concatenating the testing tweets, positive and negative
test_x = test_pos + test_neg

In [None]:
#explore the training tweets 
train_x

In [None]:
#Explore the test tweets
test_x

In [None]:
#combine the positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [None]:
#explore train_y 
train_y.shape

In [None]:
#explore test_y 
test_y.shape

In [None]:
#create frequency dictionary 
freqs = build_freqs(train_x,train_y)

In [None]:
#explore the freq dictionary 
freqs

In [None]:
#explore the length of the frequency dic 
len(freqs.keys())

### Test the process_tweet function

The given function `process_tweet()` tokenizes the tweet into individual words, removes stop words and applies stemming.
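
To see what the regular-expression cleaning step does before tokenization, here is a minimal sketch that applies only the four substitutions from `process_tweet` to a made-up tweet. The helper name `strip_tweet_markup` and the sample text are illustrative assumptions; tokenizing, stop-word removal, and stemming are left out so that the sketch runs without NLTK.

```python
import re

def strip_tweet_markup(tweet):
    """Apply the same four regex substitutions used in process_tweet."""
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock tickers such as $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading old-style "RT "
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # hyperlink and the rest of that line
    tweet = re.sub(r'#', '', tweet)                     # drop '#', keep the hashtag word
    return tweet

sample = "RT #happy day for $GE holders https://example.com/x"  # made-up tweet
cleaned = strip_tweet_markup(sample)
print(repr(cleaned))
```

One thing worth noticing: the hyperlink pattern ends in `.*`, so any words that follow a link on the same line are removed together with the link.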

In [None]:
# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

<br>

# Logistic Regression  <a anchor = "anchor" id = "4"></a>

### Sigmoid Function <a anchor = "anchor" id = "4.1"></a>
The function is called the sigmoid because its graph looks like an $S$.

To obtain a probability, we pass $z$ through the sigmoid function, $h(z)$.

The sigmoid has the following equation, which is shown graphically in the figure below:

$$ h(z) = \frac{1}{1+e^{-z}} = \frac{1}{1 + e^{-\theta^{T}X}} \tag{1}$$

<img  src = "https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg" width = "50%" >
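
A quick numeric check of equation (1) can help build intuition. The sketch below uses a hypothetical scalar helper, `sigmoid_scalar`, written with the standard `math` module rather than the NumPy-based function of this lab.

```python
import math

def sigmoid_scalar(z):
    """Logistic function h(z) = 1 / (1 + e^(-z)) for a single float."""
    return 1.0 / (1.0 + math.exp(-z))

# h(0) is exactly 0.5; large negative z tends to 0, large positive z tends to 1
for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid_scalar(z), 4))
```

The outputs are symmetric around 0.5, that is, $h(-z) = 1 - h(z)$, which is why 0.5 is the natural threshold between the two classes.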

In [None]:
def sigmoid(X,theta):
    '''
    Usage:
      #sigmoid --> Computes sigmoid of z = X𝜃 element-wise.

    Arguments:
      #X --> The Design Matrix
      #theta --> The parameters to be learned

    Returns:
      #sigmoid ---> The computed value of sigmoid(z)
    '''
    #Compute  Our linear-hypothesis function 
    z = np.matmul(X,theta)

    #computes the sigmoid of z
    sigmoid = np.divide(1, (np.add(1, np.exp(-z))))
        
    return sigmoid

### Compute the Cost Function  <a anchor = "anchor" id = "4.2"></a>
\begin{equation}
  J(\theta) = CE = \frac{1}{m}\sum_{i=1}^{m} Loss(y_{pred}^{(i)},y^{(i)}) = \frac{1}{m}\sum_{i=1}^{m} \left[ -y^{(i)}\log(y_{pred}^{(i)}) - (1-y^{(i)})\log(1-y_{pred}^{(i)}) \right]
\end{equation}
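
As a small worked example of the cross-entropy cost, the sketch below evaluates it for two sets of made-up predictions on three labels. `cross_entropy` is a hypothetical plain-Python helper, not the `computeCost` function of this lab.

```python
import math

def cross_entropy(y_true, y_pred):
    """Average of -y*log(p) - (1-y)*log(1-p) over all examples."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / m

y_true = [1, 0, 1]
confident = [0.9, 0.1, 0.8]   # made-up predictions close to the labels
uncertain = [0.5, 0.5, 0.5]   # made-up predictions carrying no information

print(cross_entropy(y_true, confident))
print(cross_entropy(y_true, uncertain))  # every term is -log(0.5), so the mean is log(2)
```

Predicting 0.5 for everything gives a cost of $\log 2 \approx 0.693$, which is also the cost at the start of training when all parameters are zero.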


In [None]:
def computeCost(X,y,theta):
    '''
    Usage:
      #computeCost --> computes the cost for logistic regression
  
    
    Arguments:
      #X --> The Design Matrix
      #y --> The Ground Truth
      #theta --> The parameters to be learned
    
    Returns:
      #J --> The cost value
    '''
    #Compute m --> the number of training feature vectors
    m = X.shape[0]
    
    #Compute  Our non-linear hypothesis function 
    h = sigmoid(X,theta)
    
    #Compute the losses
    losses = np.subtract(np.multiply(-y,np.log(h)), np.multiply((1-y),np.log(1-h)))
     
    #Compute the Cross Entropy Cost function
    J = (1/m)*(np.sum(losses))
    
    return J

### Compute the Gradient <a anchor = "anchor" id = "4.3"></a>

The Gradient is defined as follows: 
\begin{equation}
\frac{\partial J}{\partial \theta_{j}} = \frac{1}{m} \sum_{i=1}^{m} \big( y_{pred}^{(i)} - y^{(i)} \big) x^{(i)}_{j}
\end{equation}
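
One way to gain confidence in this formula is to compare it with a finite-difference approximation of the cost. The sketch below does that on a tiny made-up dataset; all function names and numbers are illustrative assumptions, and it uses plain Python lists instead of NumPy arrays.

```python
import math

def predict(x_row, theta):
    """Sigmoid of the dot product theta . x for one feature row."""
    z = sum(t * x for t, x in zip(theta, x_row))
    return 1.0 / (1.0 + math.exp(-z))

def cost(X, y, theta):
    """Cross-entropy cost averaged over the m rows of X."""
    m = len(X)
    return sum(-yi * math.log(predict(xi, theta))
               - (1 - yi) * math.log(1 - predict(xi, theta))
               for xi, yi in zip(X, y)) / m

def analytic_gradient(X, y, theta):
    """dJ/dtheta_j = (1/m) * sum_i (y_pred_i - y_i) * x_ij."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict(xi, theta) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    return grad

def numeric_gradient(X, y, theta, eps=1e-6):
    """Central-difference approximation of the same partial derivatives."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(X, y, up) - cost(X, y, down)) / (2 * eps))
    return grad

X = [[1.0, 2.0, 0.5], [1.0, 0.3, 1.5], [1.0, 1.2, 1.1]]  # made-up rows, bias in column 0
y = [1, 0, 1]
theta = [0.1, -0.2, 0.3]
grad_a = analytic_gradient(X, y, theta)
grad_n = numeric_gradient(X, y, theta)
print(grad_a)
print(grad_n)
```

If the two printed vectors agree to several decimal places, the analytic expression is consistent with the cost function it was derived from.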

In [None]:
def gradientDescent(X,y,theta,alpha,num_iters):
    '''
    Usage:
      #gradientDescent --> runs batch gradient descent for logistic regression
  
    
    Arguments:
      #X --> The Design Matrix
      #y --> The Ground Truth
      #theta --> The initial parameters, which are updated at every iteration
      #alpha --> the learning rate, i.e. the size of each step taken down the cost surface 
      #num_iters --> the number of gradient descent iterations to run
    
    Returns:
      #The updated parameters, theta 
      #cost_history: a list containing the value of the cost function, J, at every iteration
    '''
    #Compute m --> the number of training feature vectors
    m = X.shape[0]
    
    #Define the cost history as empty list
    cost_history = []
    
    #Preallocating gradient for faster computations 
    #The size of the gradient equals: (number of features (including x_0),)
    dtheta = np.zeros((X.shape[1],))

    
    #Run exactly num_iters update steps
    for i in range(num_iters):
        
        #Compute sigmoid of X𝜃 element-wise with the parameters, theta
        h = sigmoid(X,theta)
        
        #dtheta holds the partial derivatives of the cost function with respect to the parameters, theta
        dtheta = (1/m)*(np.matmul(X.T, (np.subtract(h, y))))
        
        #Update theta
        theta = theta - alpha*dtheta
        
        #While debugging, it can be useful to print out the values of the cost function (computeCost) 
        cost = computeCost(X,y,theta)
        
        #Append the value of the cost at a specific value for theta to cost_history
        cost_history.append(cost)
        
        #print the cost function for every iteration to track its new value step-by-step
        print("Reached iteration: {0}, the cost = {1}".format(i, cost))
    
    print("\n\nParameters have been trained!") 
    
    return theta, cost_history

<br> 

# Extracting the Features  <a anchor = "anchor" id = "5"></a>

### Feature Extraction for a Single Tweet <a anchor = "anchor" id = "5.1"></a>
* Given a tweet, extract its features and store them in a vector (stacking these vectors for a list of tweets gives a matrix). You will extract two features.
    * The first feature is the sum of the positive-class frequencies of the words in the tweet.
    * The second feature is the sum of the negative-class frequencies of the words in the tweet. 
* Then train your logistic regression classifier on these features.
* Test the classifier on a validation set. 

<b>After feature extraction, the feature vector will look like:</b>

$$ x_m  = [1,\sum_{w}{freqs(w,1)}\hspace{1mm}, \sum_{w}{freqs(w,0)}]$$

<b>where,</b>

$x_m:$ represents the feature vector corresponding to tweet $m$

$1:$ represents the bias term 

$\sum_{w}{freqs(w,1)}\hspace{1mm}:$ represents the sum of the positive-class frequencies of the words in tweet $m$

$\sum_{w}{freqs(w,0)}\hspace{1mm}:$ represents the sum of the negative-class frequencies of the words in tweet $m$
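
The feature vector can be computed by hand for a toy case. In the sketch below, the frequency dictionary, the tokens, and the helper `toy_features` are all made up for illustration, and the tweet is assumed to be tokenized and stemmed already.

```python
# made-up (word, label) counts; label 1 = positive, label 0 = negative
toy_freqs = {
    ("happi", 1): 10, ("happi", 0): 1,
    ("sad", 1): 2,    ("sad", 0): 8,
    ("day", 1): 5,    ("day", 0): 4,
}

def toy_features(tokens, freqs):
    """Return [bias, sum of positive counts, sum of negative counts]."""
    pos = sum(freqs.get((tok, 1), 0) for tok in tokens)
    neg = sum(freqs.get((tok, 0), 0) for tok in tokens)
    return [1, pos, neg]

print(toy_features(["happi", "day"], toy_freqs))  # [1, 15, 5]: 10 + 5 positive, 1 + 4 negative
print(toy_features(["blorb"], toy_freqs))         # [1, 0, 0]: unseen words contribute nothing
```

A tweet made only of unseen words reduces to the bias term, so its prediction depends on the bias parameter alone.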

In [None]:
def extract_features(tweet, frequency):
    '''
    Usage:
      #extract_features --> used to extract features from a given tweet to represent it 
      
    Arguments:
      #tweet --> a string containing a tweet (it is tokenized inside this function)
      #frequency --> a dict mapping from (word, class) to the corresponding frequency 
    
    Returns:
      #x --> 1 × 3 feature vector 
    '''
    
    #process the tweet by tokenizing, stemming, and removing stopwords
    tokens_list= process_tweet(tweet)
    
    #pre-allocating a 1 × 3 feature vector 
    x = np.zeros((1,3))
    
    #adding the bias term 
    x[0,0] = 1
    
    #loop over every token in the list of tokens
    for token in tokens_list:
        #increment the word count corresponding to the positive class
        #(use the frequency argument rather than the global freqs)
        x[0,1] += frequency.get((token,1),0)
        
        #increment the word count corresponding to the negative class
        x[0,2] += frequency.get((token,0),0)
        
        
    #to be safe, we will assert that the shape of our feature vector = (1, 3)
    #if not, the program will stop, and give Assertion Error
    assert(x.shape == (1,3))
    
    return x

In [None]:
# Test the code 
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)

In [None]:
# Extra test 
# check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)

<br>

### Feature Extraction for all the Tweets in the Training Set <a anchor = "anchor" id = "5.2"> </a>
We want to apply the feature extraction function to every tweet in our training set in order to construct the design matrix, so that we can train the model.

In [None]:
def computeDM(train_set, frequency):
    '''
    Usage:
      #computeDM --> used to compute the design matrix for a given training set
      
    Arguments:
      #train_set --> list of tweets 
      #frequency --> a dic mapping from (word, class) to corresponding frequency 
    
    Returns:
      #X --> the design matrix for a given training set 
    '''
    
    #Pre-allocating the design matrix 
    X = np.zeros((len(train_set),3))
    
    #Loop over every tweet in the training set 
    for i in range(len(train_set)):
        #Extract features for the i-th tweet 
        X[i, :] = extract_features(train_set[i],frequency)
        
    
    return X

<br> 

# Training the Model <a anchor = "anchor" id = "6" > </a>

In [None]:
#get the design matrix 
X = computeDM(train_x, freqs)

In [None]:
#Explore X
X

In [None]:
#Explore the shape of X
X.shape

In [None]:
#get the corresponding labels of the design matrix as a rank-one array
Y = train_y.reshape(-1)

In [None]:
#Explore the shape of Y 
Y.shape

In [None]:
#Train the model
theta, cost_history =  gradientDescent(X,Y,theta=np.array([0,0,0]),alpha = 1e-9,num_iters = 1500)

In [None]:
#Explore the updated parameters
theta

<br>

# Test your logistic regression <a anchor = "anchor" id = "7" ></a>

$$ y_{pred} = Sigmoid(Z) = Sigmoid(\theta^{T}X) $$
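
To make the prediction rule concrete, the sketch below applies it to two made-up feature vectors. The parameter values in `theta_example` are illustrative only (they are not the trained values), and `predict_probability` is a hypothetical plain-Python helper.

```python
import math

def predict_probability(x, theta):
    """Sigmoid of the dot product between the parameters and one feature vector."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

theta_example = [0.0, 0.001, -0.001]  # illustrative values, not trained parameters
x_positive = [1, 15, 5]               # more positive than negative counts
x_negative = [1, 3, 12]               # more negative than positive counts

p_pos = predict_probability(x_positive, theta_example)
p_neg = predict_probability(x_negative, theta_example)
for x, p in ((x_positive, p_pos), (x_negative, p_neg)):
    print(x, round(p, 4), "positive" if p > 0.5 else "negative")
```

Because the sigmoid equals 0.5 exactly at $z = 0$, thresholding the probability at 0.5 is the same as checking the sign of $\theta^{T}x$.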

In [None]:
def predict_tweet(tweet, freqs, theta):
    '''
    Usage:
      #predict_tweet --> used to predict whether a tweet is positive or negative
      
    Arguments:
      #tweet --> a string 
      #freqs --> a dic mapping from (word, class) to corresponding frequency
      #theta --> the learned (updated) parameters
    
    Returns:
      #y_pred--> the probability of the tweet being positive
    '''
    
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    
    # make the prediction using x and theta
    y_pred = sigmoid(x,theta)
    
    
    return y_pred

In [None]:
# Test the function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print( '%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta)))

In [None]:
# Test your own tweet
my_tweet = 'I am learning :)'
predict_tweet(my_tweet, freqs, theta)

# Check Performance Using the Test Set <a anchor = "anchor" id = "8"></a>

After training the model, we can test how well it does on unseen data to check its performance.

We can check its performance by computing the accuracy of the model as follows:

$$ Accuracy = \frac {Examples \hspace{1mm} correctly \hspace{1mm} classified} {Total \hspace{1mm} number \hspace{1mm} of \hspace{1mm} examples}$$
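
The accuracy formula can be checked on a handful of made-up labels and predicted probabilities. The helper `accuracy` below is a hypothetical plain-Python sketch of the same computation, using the 0.5 threshold from this lab.

```python
def accuracy(y_true, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for yt, yp in zip(y_true, predictions) if yt == yp)
    return correct / len(y_true)

labels = [1, 1, 0, 0, 1]            # made-up ground truth
probs = [0.9, 0.4, 0.2, 0.7, 0.6]   # made-up model outputs
print(accuracy(labels, probs))      # 3 of 5 predictions match the labels
```

Here the second and fourth examples land on the wrong side of the threshold, so the accuracy is 3/5.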

In [None]:
def test_logistic_regression(test_x, test_y, freqs, theta):
    '''
    Usage:
      #test_logistic_regression --> used to compute the accuracy of the model 
      
    Arguments:
      #test_x --> a list of tweets
      #test_y --> (m,) vector with the corresponding labels for the list of tweets
      #freqs --> a dic mapping from (word, class) to corresponding frequency
      #theta --> the learned (updated) parameters
    
    Returns:
      #accuracy --> (# of tweets classified correctly) / (total # of tweets)
    '''
    
    #Define m_test --> the number of testing examples
    m_test = test_y.shape[0]
    
    #initialize a list for storing predictions 
    y_hat = []
    
    #loop over every tweet in the test set 
    for tweet in test_x:
        
        #get the estimated value of the true label y for every tweet 
        y_pred = predict_tweet(tweet, freqs, theta)
        
        if y_pred > 0.5:
            #append 1 to y_hat 
            y_hat.append(1.0)
            
        else:
            #append 0 to y_hat 
            y_hat.append(0.0)
            
        
    #compute the accuracy of the model 
    accuracy = (np.sum(np.asarray(y_hat) == test_y) / m_test)
    
    return accuracy

In [None]:
#convert test_y to a rank-one array so that the above function works properly 
Y_test = test_y.reshape(-1)

In [None]:
#compute the accuracy of our model 
tmp_accuracy = test_logistic_regression(test_x, Y_test, freqs, theta)
print(tmp_accuracy)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

# Error Analysis <a anchor = "anchor" id = "9"></a>

In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Specifically, what kinds of tweets does your model misclassify?
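
When looking through misclassified tweets, it can help to separate the two kinds of error. The sketch below is one possible way to do that; the helper `split_errors`, the example tweets, and the probabilities are all made up for illustration.

```python
def split_errors(tweets, y_true, probabilities, threshold=0.5):
    """Group misclassified tweets into false positives and false negatives."""
    false_positives, false_negatives = [], []
    for tweet, label, p in zip(tweets, y_true, probabilities):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and label == 0:
            false_positives.append(tweet)   # predicted positive, actually negative
        elif predicted == 0 and label == 1:
            false_negatives.append(tweet)   # predicted negative, actually positive
    return false_positives, false_negatives

tweets = ["great day", "not great at all", "so sad :(", "happy tears"]
labels = [1, 0, 0, 1]
probs = [0.8, 0.7, 0.3, 0.4]  # made-up model outputs
fp, fn = split_errors(tweets, labels, probs)
print("false positives:", fp)
print("false negatives:", fn)
```

Reading each group separately may make patterns easier to spot, for instance negation ("not great") pushing a negative tweet toward the positive class.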

In [None]:
# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(test_x,test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        print('%d\t%0.8f\t%s' % (y, y_hat, ' '.join(process_tweet(x)).encode('ascii', 'ignore')))

# Predict with your Own Tweet <a anchor ="anchor" id = "10"> </a>

In [None]:
def pred(my_tweet, freqs, theta):
    '''
    Usage:
      #pred--> used to predict whether a tweet is positive or negative
      
    Arguments:
      #my_tweet --> a string representing my own tweet
      #freqs --> a dic mapping from (word, class) to corresponding frequency
      #theta --> the learned (updated) parameters
    
    Returns:
      #y_pred--> the probability of the tweet being positive
    '''
    #Compute the probability of the tweet being positive 
    y_pred = predict_tweet(my_tweet, freqs, theta)
    
    if y_pred > 0.5:
        print('Positive Sentiment')
    else:
        print('Negative Sentiment')
    
    return y_pred

In [None]:
#predict your own tweet
my_tweet = "What you must understand about me is that I’m a deeply unhappy person"

prediction = pred(my_tweet, freqs, theta)

# Congratulations!