
In [29]:
# Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
#     The first feature is the number of positive words in a tweet.
#     The second feature is the number of negative words in a tweet. 
# Then train your logistic regression classifier on these features.
# Test the classifier on a validation set.

# Instructions: Implement the extract_features function. 
# This function takes in a single tweet.
# Process the tweet using the imported `process_tweet` function and save the list of tweet words.
# Loop through each word in the list of processed words
#     For each word, check the 'freqs' dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
#     Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)
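
The lookup described above can be sketched on a toy dictionary. This is only an illustration: `toy_freqs`, its counts, and the helper `toy_features` are made up here and are not part of the notebook. It takes an already-tokenized word list rather than a raw tweet.

```python
# Toy frequency dictionary; the words and counts are invented for illustration.
toy_freqs = {
    ("happi", 1.0): 3,
    ("happi", 0.0): 1,
    ("sad", 0.0): 2,
}

def toy_features(words, freqs):
    """Return [bias, sum of positive counts, sum of negative counts]."""
    # dict.get with a default of 0 handles (word, label) pairs that were never seen
    pos = sum(freqs.get((w, 1.0), 0) for w in words)
    neg = sum(freqs.get((w, 0.0), 0) for w in words)
    return [1.0, float(pos), float(neg)]

print(toy_features(["happi", "sad", "unknown"], toy_freqs))
```

Using `freqs.get(key, 0)` avoids a separate membership test for unseen words.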


In [30]:
#process_tweet: cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.
#build_freqs: this counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1' 
#or a negative label '0', then builds the 'freqs' dictionary, where each key is the (word,label) tuple, 
#and the value is the count of its frequency within the corpus of tweets.

import re
import string
import numpy as np

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks    
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean



def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs
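
The same counting can also be written with `collections.Counter`. The sketch below is an alternative, not the notebook's implementation: `build_freqs_counter` is a hypothetical name, and the default `tokenize=str.split` is a simple stand-in for `process_tweet` so the example runs without NLTK data.

```python
from collections import Counter

def build_freqs_counter(tweets, labels, tokenize=str.split):
    """Count (word, label) pairs; `tokenize` stands in for process_tweet."""
    freqs = Counter()
    for label, tweet in zip(labels, tweets):
        # update() increments the count of every (word, label) pair it receives
        freqs.update((word, label) for word in tokenize(tweet))
    return freqs

toy = build_freqs_counter(["good day", "bad day", "good good"], [1.0, 0.0, 1.0])
print(toy[("good", 1.0)], toy[("day", 0.0)], toy[("bad", 1.0)])
```

A `Counter` returns 0 for missing keys, so unseen pairs need no special handling.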

In [31]:
#Import functions and data

# import nltk
import nltk
from os import getcwd

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [32]:
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples 



In [33]:
#Prepare the data
#The twitter_samples corpus contains a subset of five thousand positive tweets,
#a subset of five thousand negative tweets,
#and the full set of 10,000 tweets.
#Using all three datasets would introduce duplicates of the positive and negative tweets,
#so select just the five thousand positive tweets and five thousand negative tweets.

In [34]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [35]:
# Train test split: 20% will be in the test set, and 80% in the training set.
# split the data into two pieces, one for training and one for testing (validation set) 
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg
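
The hard-coded index 4000 corresponds to an 80/20 split of 5,000 items. A small helper can make the ratio explicit. This is a sketch only; `split_train_test` and its `train_percent` parameter are hypothetical names, not part of the notebook.

```python
def split_train_test(items, train_percent=80):
    """Split a list into (train, test), keeping the original order."""
    # integer arithmetic avoids floating-point rounding at the cut point
    cut = len(items) * train_percent // 100
    return items[:cut], items[cut:]

train, test = split_train_test(list(range(5000)))
print(len(train), len(test))
```

Like the slices above, this takes the first part for training and does not shuffle.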

In [36]:
#Create the numpy array of positive labels and negative labels.
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

# Print the shapes of the train and test label arrays
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)
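
The (m, 1) label shape matters because the labels are later subtracted from an (m, 1) vector of predictions. The made-up example below shows what NumPy broadcasting does if the labels are a flat (m,) array instead.

```python
import numpy as np

h = np.full((4, 1), 0.5)   # made-up predictions, shape (4, 1)
y_col = np.ones((4, 1))    # labels as a column vector
y_flat = np.ones(4)        # labels as a flat array

diff_col = h - y_col       # element-wise difference
diff_flat = h - y_flat     # broadcasts to a full matrix

print(diff_col.shape)
print(diff_flat.shape)
```

The first difference keeps the shape (4, 1). The second broadcasts silently to (4, 4), which would corrupt a gradient computation without raising an error.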


In [37]:
#Create the frequency dictionary using the imported build_freqs function. 
#    for y,tweet in zip(ys, tweets):
#        for word in process_tweet(tweet):
#            pair = (word, y)
#            if pair in freqs:
#                freqs[pair] += 1
#            else:
#                freqs[pair] = 1

#Notice how the outer for loop goes through each tweet, 
#and the inner for loop steps through each word in a tweet.

#The 'freqs' dictionary is the frequency dictionary that's being built.
#The key is the tuple (word, label), such as ("happy",1) or ("happy",0). 
#The value stored for each key is the count of how many times the word "happy" was associated with a positive label,
# or how many times "happy" was associated with a negative label.

# create frequency dictionary
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 11436


In [38]:
#Process tweet
#The given function 'process_tweet' tokenizes the tweet into individual words, removes stop words and applies stemming.

# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
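
The regular-expression steps of `process_tweet` can be tried in isolation. The sketch below reuses the four patterns from the function and omits tokenizing, stopword removal, and stemming. The helper name `strip_tweet_markup` and the sample tweet are made up for illustration.

```python
import re

def strip_tweet_markup(tweet):
    """Apply only the regex cleaning steps used in process_tweet."""
    tweet = re.sub(r'\$\w*', '', tweet)                # stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)             # old-style retweet marker
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)  # hyperlinks
    tweet = re.sub(r'#', '', tweet)                    # the hash sign only, not the word
    return tweet

cleaned = strip_tweet_markup("RT $GE is up! #winning https://example.com/x :)")
print(cleaned.split())
```

The emoticon survives these steps, which is consistent with `':)'` appearing in the processed tweet above.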


In [39]:
#Extracting the features

def extract_features(tweet, freqs, process_tweet=process_tweet):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3)) 
    
    # bias term is set to 1
    x[0, 0] = 1 
    
    # loop through each word in the list of words
    for word in word_l:
        # add the word's count under the positive label 1 (0 if the pair is unseen)
        x[0, 1] += freqs.get((word, 1.0), 0)
        
        # add the word's count under the negative label 0 (0 if the pair is unseen)
        x[0, 2] += freqs.get((word, 0.0), 0)
        
    assert(x.shape == (1, 3))
    return x


In [40]:
# Check your function
# test 1
# test on training data
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)

#Expected output
#[[1.000e+00 3.133e+03 6.100e+01]]

# test 2:
# check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)

#Expected output
#[[1. 0. 0.]]


[[1.000e+00 3.133e+03 6.100e+01]]
[[1. 0. 0.]]


In [41]:
#Logistic regression

#Sigmoid
#It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability. 

def sigmoid(z): 
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    
    # calculate the sigmoid of z
    h = 1 / (1 + np.exp(-z))
    
    return h


# Testing your function 
# np.isclose is used because exact floating-point equality is fragile
if np.isclose(sigmoid(0), 0.5):
    print('SUCCESS!')
else:
    print('Oops!')

if np.isclose(sigmoid(4.92), 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')


#Cost function
# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999) # loss is about 9.2

# verify that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value
-1 * np.log(0.0001) # loss is about 9.2


#Gradient
# GRADED FUNCTION: gradientDescent
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    for i in range(0, num_iters):
        
        # get z, the dot product of x and theta
        z = np.dot(x, theta)
        
        # get the sigmoid of z
        h = sigmoid(z)
        
        # calculate the cost function
        J = (-1./m) * (np.dot(y.transpose(), np.log(h)) + np.dot((1 - y).transpose(), np.log(1 - h)))

        # update the weights theta
        theta = (theta) - ((alpha/m) * np.dot(x.transpose(), (h - y)))
        
    # J is a (1,1) array; squeeze it before converting to a Python float
    J = float(np.squeeze(J))
    return J, theta


# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)

# Apply gradient descent
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}")

#Expected output
#The cost after training is 0.67094970.
#The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]

SUCCESS!
CORRECT!
The cost after training is 0.67094970.
The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]
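
One way to gain confidence in the update rule used by `gradientDescent` is to compare the analytic gradient, `x.T @ (h - y) / m`, with a finite-difference estimate of the cost. The sketch below does this on made-up data. The helper names `cost`, `analytic_gradient`, and `numeric_gradient` are hypothetical and not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(x, y, theta):
    """Average cross-entropy cost, matching the formula inside gradientDescent."""
    m = x.shape[0]
    h = sigmoid(x @ theta)
    total = y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)
    return float(np.squeeze(-total / m))

def analytic_gradient(x, y, theta):
    """Gradient of the cost with respect to theta: x^T (h - y) / m."""
    m = x.shape[0]
    return x.T @ (sigmoid(x @ theta) - y) / m

def numeric_gradient(x, y, theta, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j, 0] = eps
        grad[j, 0] = (cost(x, y, theta + step) - cost(x, y, theta - step)) / (2 * eps)
    return grad

# Made-up data: 6 examples, a bias column plus 2 random features
rng = np.random.default_rng(0)
x = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
y = np.array([[1.0], [0.0], [1.0], [1.0], [0.0], [0.0]])
theta = np.array([[0.1], [-0.2], [0.3]])

max_diff = np.max(np.abs(analytic_gradient(x, y, theta) - numeric_gradient(x, y, theta)))
print(max_diff < 1e-6)
```

If the two gradients disagree by more than a small tolerance, the cost and the update rule are probably inconsistent with each other.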


In [42]:
#Training Your Model

#To train the model:
#Stack the features for all training examples into a matrix X.

# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :]= extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")

#Expected Output:
#The cost after training is 0.22522315.
#The resulting vector of weights is [6e-08, 0.00053818, -0.0005583]

The cost after training is 0.22522315.
The resulting vector of weights is [6e-08, 0.00053818, -0.0005583]
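
The row-by-row fill of `X` can also be written by stacking the per-tweet (1, 3) vectors with `np.vstack`. In the sketch below, `toy_extract` is a hypothetical stand-in for `extract_features` (it needs no frequency dictionary) and the tweets are made up.

```python
import numpy as np

def toy_extract(tweet):
    """Hypothetical stand-in for extract_features: bias, word count, count of '!'."""
    return np.array([[1.0, float(len(tweet.split())), float(tweet.count("!"))]])

tweets = ["so happy !", "bad day", "great great great"]

# each call returns a (1, 3) row; vstack joins them into an (m, 3) matrix
X_toy = np.vstack([toy_extract(t) for t in tweets])
print(X_toy.shape)
```

Both approaches give the same matrix. Preallocating with `np.zeros`, as in the cell above, avoids building an intermediate list.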


In [43]:
#Test your logistic regression

#It is time to test the logistic regression function on some new input that the model has not seen before.

#Predict whether a tweet is positive or negative.

#Given a tweet, process it, then extract the features.
#Apply the model's learned weights on the features to get the logits.
#Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).


def predict_tweet(tweet, freqs, theta):
    '''
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the predicted probability that the tweet is positive, as a (1,1) array
    '''
    
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    
    
    return y_pred


# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    # .item() extracts the scalar from the (1,1) prediction array before formatting
    print('%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta).item()))


#Expected Output:

#I am happy -> 0.519275
#I am bad -> 0.494347
#this movie should have been great. -> 0.515979
#great -> 0.516065
#great great -> 0.532096
#great great great -> 0.548062
#great great great great -> 0.563929

I am happy -> 0.519275
I am bad -> 0.494347
this movie should have been great. -> 0.515979
great -> 0.516065
great great -> 0.532096
great great great -> 0.548062
great great great great -> 0.563929
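
The prediction step is a dot product followed by a sigmoid, so it can be checked by hand. The sketch below uses the rounded weights printed after training and the feature vector printed earlier for `train_x[0]`. Because the weights are rounded, the result only approximates what the trained model would return.

```python
import math

theta = [6e-08, 0.00053818, -0.0005583]  # rounded weights printed after training
x = [1.0, 3133.0, 61.0]                  # features printed for train_x[0]

# logit: weighted sum of the bias, positive-count and negative-count features
z = sum(t * f for t, f in zip(theta, x))

# the sigmoid maps the logit to a probability between 0 and 1
p = 1.0 / (1.0 + math.exp(-z))
print(f"logit = {z:.4f}, probability = {p:.4f}")
```

The large positive-count feature dominates the logit, which pushes the probability above 0.5.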


In [44]:
#Check performance using the test set
#After training your model using the training set above, check how your model might perform on real, 
#unseen data, by testing it against the test set.

#Given the test data and the weights of your trained model, calculate the accuracy of your logistic regression model.
#Use your 'predict_tweet' function to make predictions on each tweet in the test set.
#If the prediction is > 0.5, set the model's classification 'y_hat' to 1, otherwise set the model's classification 'y_hat' to 0.
#A prediction is accurate when the y_hat equals the test_y. Sum up all the instances when they are equal and divide by m.

def test_logistic_regression(test_x, test_y, freqs, theta, predict_tweet=predict_tweet):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    
    
    # the list for storing predictions
    y_hat = []
    
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0.0)

    # With the above implementation, y_hat is a list, but test_y is an (m,1) array;
    # convert both to one-dimensional arrays in order to compare them using the '==' operator
    y_hat = np.asarray(y_hat)
    test_y = np.squeeze(test_y)
    accuracy = (y_hat == test_y).sum() / len(test_x)

    
    return accuracy

  
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

#Expected Output:
#0.9950
#Pretty good!

Logistic regression model's accuracy = 0.9950
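
The thresholding and comparison can also be done without a Python loop once the probabilities are in an array. The sketch below uses made-up probabilities and labels, not the notebook's test set.

```python
import numpy as np

probs = np.array([[0.9], [0.4], [0.6], [0.2]])   # made-up predicted probabilities
labels = np.array([[1.0], [0.0], [0.0], [0.0]])  # made-up true labels

# threshold at 0.5, then take the fraction of predictions that match the labels
y_hat = (probs > 0.5).astype(float)
accuracy = float(np.mean(y_hat == labels))
print(accuracy)
```

Both arrays keep the same (m, 1) shape here, so the comparison is element-wise with no broadcasting.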


In [45]:
#Error Analysis

#In this part you will see some tweets that your model misclassified. 
#Why do you think the misclassifications happened? 
#Specifically what kind of tweets does your model misclassify?

# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(test_x,test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        print('%d\t%0.8f\t%s' % (y, y_hat, ' '.join(process_tweet(x)).encode('ascii', 'ignore')))

Label Predicted Tweet
THE TWEET IS: @MarkBreech Not sure it would be good thing 4 my bottom daring 2 say 2 Miss B but Im gonna be so stubborn on mouth soaping ! #NotHavingit :p
THE PROCESSED TWEET IS: ['sure', 'would', 'good', 'thing', '4', 'bottom', 'dare', '2', 'say', '2', 'miss', 'b', 'im', 'gonna', 'stubborn', 'mouth', 'soap', 'nothavingit', ':p']
1	0.48901497	b'sure would good thing 4 bottom dare 2 say 2 miss b im gonna stubborn mouth soap nothavingit :p'
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots
http://t.co/UGQzOx0huu
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48418949	b"i'm play brain dot braindot"
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/aOKldo3GMj http://t.co/xWCM9qyRG5
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48418949	b"i'm play brain dot braindot"
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/R2JBO8iNww http://t.co/ow5BBwdEMY
THE PROCESSED TWEET IS: ["i'm", 'play', 

In [46]:
#Predict with your own tweet

# Feel free to change the tweet below
my_tweet = 'Director Adam McKay latest outing is a biting satire with its crosshairs clearly aimed at politicians and the larger society out there who are apathetic of the looming climate crisis facing the world.'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

['director', 'adam', 'mckay', 'latest', 'outing', 'bite', 'satir', 'crosshair', 'clearli', 'aim', 'politician', 'larger', 'societi', 'apathet', 'loom', 'climat', 'crisi', 'face', 'world']
[[0.49910099]]
Negative sentiment


In [47]:
#Later, we will see how we can use deep learning to improve the prediction performance!