## **Logistic Regression**

In this notebook, we will implement logistic regression for sentiment analysis on tweets. Given a tweet, we will decide whether it has a positive or a negative sentiment.

We will cover the following concepts:

* How to extract features for Logistic Regression given some text
* Implement Logistic Regression
* Apply Logistic Regression on an NLP task
* Test our Logistic Regression classifier
* Perform error analysis

### **Import/Load required libraries and data**

In [1]:
! cp '/content/drive/My Drive/Colab Notebooks/utils.py' '/content'

In [3]:
! cat 'utils.py' | tail -10


    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for label, tweet in zip(labels_list, tweets):
        for word in process_tweet(tweet):
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1

    return freqs

In [4]:
import nltk
from os import getcwd 
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples

from utils import process_tweet, build_freqs

In [5]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')

In [8]:
# split the dataset into train(80%) and test(20%)
train_pos = pos_tweets[:4000]
test_pos = pos_tweets[4000:]
train_neg = neg_tweets[:4000]
test_neg = neg_tweets[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

In [9]:
# create positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [10]:
# print the shape of train and test labels
print("train_y.shape =" + str(train_y.shape))
print("test_y.shape =" + str(test_y.shape))

train_y.shape =(8000, 1)
test_y.shape =(2000, 1)


### **Create Frequency dictionary**

In [11]:
freqs = build_freqs(train_x, train_y)

In [12]:
# check the output
print("type(freqs) =" + str(type(freqs)))
print("len(freqs) =" + str(len(freqs.keys())))

type(freqs) =<class 'dict'>
len(freqs) =11346
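
To make the structure of `freqs` concrete, here is a minimal sketch on a toy corpus. It mirrors the loop shown in the tail of `utils.py`, but `toy_build_freqs` and its whitespace tokenizer are hypothetical stand-ins: the real `process_tweet` also stems and removes stopwords, so the actual keys will differ.

```python
import numpy as np

def toy_build_freqs(tweets, ys):
    """Count how often each (word, label) pair occurs.

    Assumption: lowercase whitespace splitting stands in for process_tweet.
    """
    labels = np.squeeze(ys).tolist()
    freqs = {}
    for label, tweet in zip(labels, tweets):
        for word in tweet.lower().split():
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

# Toy data, made up for illustration only
toy_tweets = ["happy happy day", "sad day", "happy end"]
toy_labels = np.array([[1.0], [0.0], [1.0]])

toy_freqs = toy_build_freqs(toy_tweets, toy_labels)
print(toy_freqs)
```

Each key is a `(word, label)` tuple, so the same word can appear twice in the dictionary, once per class.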


### **Process Tweets**

In [13]:
# test the process_tweet() function
print('This is an example of a positive class tweet: \n', train_x[0])
print('\nThis is an example of processed version of a positive class tweet: \n', process_tweet(train_x[0]))

This is an example of a positive class tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of processed version of a positive class tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


### **Logistic Regression**

#### **Sigmoid Function**

The *Sigmoid* function is defined as:

h(*z*) = 1 / (1 + exp(-*z*))

In [30]:
def sigmoid(z): 
    '''
    Returns the sigmoid of input value.
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    h = 1 / (1 + np.exp(-z))
    return h

In [15]:
# test the sigmoid function
# np.isclose avoids relying on exact floating-point equality
if np.isclose(sigmoid(0), 0.5):
    print('SUCCESS!')
else:
    print('Oops!')

if np.isclose(sigmoid(4.92), 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')

SUCCESS!
CORRECT!
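
One caveat with the direct formula: for large negative `z`, `np.exp(-z)` overflows and NumPy emits a runtime warning, even though the result still rounds to 0. The sketch below is an alternative formulation, not part of the original notebook, that only ever exponentiates non-positive numbers. The name `stable_sigmoid` is hypothetical.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow by exponentiating only -|z|."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # lies in [0, 1], so it cannot overflow
    # For z >= 0: 1 / (1 + exp(-z)); for z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.array([-1000.0, -4.92, 0.0, 4.92, 1000.0])
h = stable_sigmoid(z)
print(h)
```

Both branches are algebraically identical to $1 / (1 + \exp(-z))$, so this can be swapped in wherever `sigmoid` is used.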


#### **Logistic Regression: Linear Regression and a Sigmoid**
Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.

Regression:$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$ The $\theta$ values are simply the "weights" that our model is supposed to learn.

Logistic regression:$$ h(z) = \frac{1}{1+\exp(-z)}$$$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$We will refer to $z$ as the 'logits'.

**Cost function and Gradient**

The cost function used for logistic regression is the average of the log-loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of the i-th training example.
* $h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.

The loss function for a single training example is$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That's the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0.
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0.
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value: $-1 \times (1 - 0) \times \log(1 - 0.9999) \approx 9.2$. The closer the model prediction gets to 1, the larger the loss.

In [16]:
# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999) # loss is about 9.2

9.210340371976294

* Likewise, if the model predicts close to 0 ($h(z) = 0.0001$) but the actual label is 1, the first term in the loss function becomes a large number:

  $-1 \times \log(0.0001) \approx 9.2$. The closer the prediction is to zero, the larger the loss.

In [17]:
# verify that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value
-1 * np.log(0.0001) # loss is about 9.2

9.210340371976182

**Update the weights**

To update our weight vector $\theta$, we will apply gradient descent to iteratively improve our model's predictions.

The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x_j^{(i)} $$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x_j^{(i)}$ is the value of the feature associated with weight $\theta_j$ in the i-th training example.

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$

* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
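
The gradient formula can be sanity-checked numerically. The sketch below uses small random data that is purely illustrative, and compares the analytic gradient with a central finite-difference approximation of the cost. The helper names `cost` and `analytic_grad` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, x, y):
    """Average log-loss over all training examples."""
    h = sigmoid(x @ theta)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

def analytic_grad(theta, x, y):
    """Gradient from the formula: (1/m) * x^T (h - y)."""
    m = x.shape[0]
    return x.T @ (sigmoid(x @ theta) - y) / m

# Illustrative random data: bias column plus two features
rng = np.random.default_rng(0)
x = np.hstack([np.ones((20, 1)), rng.random((20, 2))])
y = (rng.random((20, 1)) > 0.5).astype(float)
theta = rng.normal(size=(3, 1))

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step, x, y) - cost(theta - step, x, y)) / (2 * eps)

gradients_match = np.allclose(numeric, analytic_grad(theta, x, y), atol=1e-6)
print(gradients_match)
```

If the two gradients disagreed, that would point to an indexing or sign mistake in either the cost or the gradient expression.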

#### **Gradient Descent function**

* The number of iterations *num_iters* is the number of times that we will use the entire training set.
* For each iteration, we will calculate the cost function using all training examples, and for all features.
* Instead of updating a single weight $\theta_j$ at a time, we can update all the weights in the column vector at once:
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\ 
\theta_2 
\\ 
\vdots
\\ 
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'. $z = \mathbf{x}\mathbf{\theta}$
  * $\mathbf{x}$ has dimensions (m, n+1)
  * $\mathbf{\theta}$: has dimensions (n+1, 1)
  * $\mathbf{z}$: has dimensions (m, 1)
* The prediction 'h', is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of the vectors 'y' and 'log(h)'. Since both 'y' and 'h' are column vectors (m,1), transpose the vector to the left, so that matrix multiplication of a row vector with column vector performs the dot product.$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot \log(\mathbf{1-h}) \right)$$
* The update of $\theta$ is also vectorized. Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$
* We will use np.dot for matrix multiplication.
* In Python 3, the `/` operator always performs true division, so `-1/m` is already a decimal value. An explicit cast such as `float(1)`, or writing `1.` for the float version of 1, is only needed in Python 2.
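
As a quick check of the vectorized expressions above, the following sketch computes the cost both as an explicit per-example sum and as a matrix product, and prints the shapes of `z`, `h`, the cost, and the gradient. The data is random and only illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
m, n = 6, 2
x = np.hstack([np.ones((m, 1)), rng.random((m, n))])  # (m, n+1), first column is the bias
y = (rng.random((m, 1)) > 0.5).astype(float)          # (m, 1)
theta = rng.normal(size=(n + 1, 1))                   # (n+1, 1)

z = np.dot(x, theta)   # (m, 1) logits
h = sigmoid(z)         # (m, 1) predictions

# Cost as an explicit sum over training examples
J_loop = 0.0
for i in range(m):
    J_loop += y[i, 0] * np.log(h[i, 0]) + (1 - y[i, 0]) * np.log(1 - h[i, 0])
J_loop = -J_loop / m

# Cost and gradient in vectorized form
J_vec = (-1 / m) * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
grad = np.dot(x.T, h - y) / m

print(z.shape, h.shape, J_vec.shape, grad.shape)
print(np.isclose(J_loop, J_vec[0, 0]))
```

Note that the vectorized cost comes out as a (1, 1) array rather than a plain number, which is why it has to be converted to a scalar before printing.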

In [18]:
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Returns the cost and weight vector.
    Input:
        x: matrix of features, i.e. (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations to train the model for
    Output:
        J: the final cost
        theta: our final weight vector
    Hint: we might want to print the cost to make sure that it is going down.
    '''
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    
    for i in range(0, num_iters):
        
        # get z, the dot product of x and theta
        z = np.dot(x, theta)
        
        # get the sigmoid of z
        h = sigmoid(z)
        
        # calculate the cost function
        J = (-1/m) * (np.dot(y.T, np.log(h)) + np.dot((1 - y.T), np.log(1 - h)))

        # update the weights theta
        theta = theta - (alpha/m) * np.dot(x.T, (h - y))
        
    # J is a (1, 1) array; .item() extracts it as a Python float
    J = J.item()
    return J, theta

In [None]:
# test the function
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)

# Apply gradient descent
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}")

The cost after training is 0.67094970.
The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]
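
Following the hint in the docstring, one way to confirm that the cost is going down is to record it at every iteration. The variant below is a sketch, not the notebook's function: `gradient_descent_with_history` is a hypothetical name, and the toy data uses features scaled to [0, 1] with a larger learning rate than the test above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_with_history(x, y, theta, alpha, num_iters):
    """Same update rule as gradientDescent, but keeps the cost of every iteration."""
    m = x.shape[0]
    costs = []
    for _ in range(num_iters):
        h = sigmoid(np.dot(x, theta))
        J = (-1 / m) * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        costs.append(J.item())
        theta = theta - (alpha / m) * np.dot(x.T, h - y)
    return costs, theta

# Toy data (assumption): the label is 1 when the first feature exceeds the second
rng = np.random.default_rng(2)
x = np.hstack([np.ones((50, 1)), rng.random((50, 2))])
y = (x[:, [1]] > x[:, [2]]).astype(float)

costs, theta = gradient_descent_with_history(x, y, np.zeros((3, 1)), 0.1, 200)
print(f"first cost: {costs[0]:.4f}, last cost: {costs[-1]:.4f}")
```

With all weights initialized to zero, every prediction is 0.5, so the first recorded cost equals $\log 2 \approx 0.693$. A cost that rises between iterations usually means the learning rate is too large for the scale of the features.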


### **Extracting Features**

* Given a list of tweets, we will extract the features and store them in a matrix. We will extract two features per tweet (plus a constant bias term):
  * First, the sum of the positive-class frequencies of the words in the tweet.
  * Second, the sum of the negative-class frequencies of the words in the tweet.
* Then we will train our Logistic Regression classifier on those features.
* Next we will test the classifier on a test dataset.

#### ***extract_features()* function**

* This function will take a single tweet at a time.
* It will process the tweet using *process_tweet()* function from *utils.py*, and save the list of tweet words.
* It will then loop through each word in the list of processed words, and do the following:
  * For each word, check the *freqs* dictionary for the count when that word has a positive(1) label. *[Hint: key will be (word, 1.0)]*
  * Do the same thing when the word is associated with the negative(0) label. *[Hint: key will be (word, 0.0)]*
* <font color='orange'>We need to handle the case when (word, label) key is not found in the dictionary.

  [*Hint: dictionary.get() function deals with case when key is not present*] </font>
  


In [33]:
def extract_features(tweet, freqs):
    '''
    Returns a feature vector of dimension (1,3), in the format [1.0, sum(positive_counts), sum(negative_counts)].
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3)) 
    
    #bias term is set to 1
    x[0,0] = 1 
    
    # loop through each word in the list of words
    for word in word_l:
        
        # increment the word count for the positive label 1
        x[0,1] += freqs.get((word, 1), 0)
        
        # increment the word count for the negative label 0
        x[0,2] += freqs.get((word, 0), 0)
        
    assert(x.shape == (1, 3))
    return x

In [32]:
# test the function

# test 1 (on training sample)
tmp1 = extract_features(train_x[0], freqs)
print('Test 1 result:\n')
print(tmp1)

# test 2 (when some words are not there in freqs dictionary)
tmp2 = extract_features('hooola foobar alpharomeo', freqs)
print('\nTest 2 result:\n')
print(tmp2)

Test 1 result:

[[1.00e+00 3.02e+03 6.10e+01]]

Test 2 result:

[[1. 0. 0.]]
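
To see how the three entries are assembled, here is a hand-checkable sketch with a tiny hypothetical frequency dictionary. `toy_extract_features` takes an already tokenized list of words instead of calling `process_tweet`, which is an assumption made to keep the example self-contained.

```python
import numpy as np

# Hypothetical (word, label) counts, made up for illustration
toy_freqs = {
    ("happi", 1.0): 3,
    ("happi", 0.0): 1,
    ("sad", 0.0): 4,
    (":)", 1.0): 5,
}

def toy_extract_features(words, freqs):
    """Return [bias, sum of positive counts, sum of negative counts] as a (1, 3) array."""
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in words:
        x[0, 1] += freqs.get((word, 1.0), 0)  # a missing key contributes 0
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x

features = toy_extract_features("happi happi :) unknown".split(), toy_freqs)
print(features)
```

A word that occurs twice in the tweet contributes its counts twice, matching the loop in `extract_features`. Here the positive sum is 3 + 3 + 5 = 11 and the negative sum is 1 + 1 = 2.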


### **Train Logistic Regression classifier**

In [35]:
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :]= extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")

The cost after training is 0.24216529.
The resulting vector of weights is [7e-08, 0.0005239, -0.00055517]
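
As a worked example, the weights printed above can be applied by hand to the feature vector from Test 1, `[1, 3020, 61]` as displayed there. Because the printed weights are rounded to 8 decimal places, the result is approximate.

```python
import numpy as np

# Weights copied from the printed training output (rounded values)
theta_printed = np.array([[7e-08], [0.0005239], [-0.00055517]])

# Feature vector of train_x[0] as displayed in Test 1
x_test1 = np.array([[1.0, 3020.0, 61.0]])

z = np.dot(x_test1, theta_printed)  # logit
p = 1 / (1 + np.exp(-z))            # sigmoid of the logit
print(f"logit = {z.item():.4f}, probability = {p.item():.4f}")
```

The positive-frequency feature dominates, so the logit is positive and the probability is about 0.82, which agrees with the tweet's positive label.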


### **Test the classifier**

We will now test our logistic regression classifier on some input that the model has not seen before.

#### **Create *predict_tweet()* function**

This function will predict whether a tweet is positive or negative.

**Steps:**
* Given a tweet, process it, and extract the features.
* Apply the learned model weights on the features to get the logits.
* Apply sigmoid to these logits to get prediction between 0 and 1.

In [36]:
def predict_tweet(tweet, freqs, theta):
    '''
    Returns the predicted value between 0 and 1.
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the predicted probability that the tweet is positive
    '''

    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    
    return y_pred

In [37]:
# test the function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print( '%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta)))

I am happy -> 0.518580
I am bad -> 0.494339
this movie should have been great. -> 0.515331
great -> 0.515464
great great -> 0.530898
great great great -> 0.546273
great great great great -> 0.561561


Now let's check the performance on our **test** dataset.

To do that, we will create another function which will give us the accuracy of our classifier on test data.

**Steps:**
* Given the test data and the weights of our trained model, we will calculate the accuracy of our logistic regression model.
* We will use our *predict_tweet()* function to make predictions on each tweet in the test set.
* If the prediction is > 0.5, we will set the model's classification y_hat to 1, otherwise the model's classification y_hat to 0.
* A prediction is accurate when y_hat equals test_y. We will sum up all the instances when they are equal and divide by m.
<font color='orange'>
* *Hints:*
  * np.asarray() to convert a list to a numpy array
  * np.squeeze() to make an (m,1) dimensional array into an (m,) array
</font>

In [38]:
def test_accuracy(test_x, test_y, freqs, theta):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (number of tweets classified correctly) / (total number of tweets)
    """
    
    # the list for storing predictions
    y_hat = []
    
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0.0)

    # y_hat is a list, but test_y is an (m,1) array;
    # reshape y_hat to (m,1) so that '==' compares the two element-wise
    accuracy = np.sum(np.array(y_hat).reshape(-1, 1) == test_y) / len(test_y)
    
    return accuracy

In [39]:
tmp_accuracy = test_accuracy(test_x, test_y, freqs, theta)
print(f"Logistic regression model's test accuracy = {tmp_accuracy:.4f}")

Logistic regression model's test accuracy = 0.9950
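
The shape handling in `test_accuracy` matters because comparing an `(m,)` array with an `(m,1)` array broadcasts to an `(m,m)` matrix instead of comparing element-wise. The sketch below illustrates this with made-up probabilities and shows a vectorized accuracy computation. `accuracy_from_probs` is a hypothetical helper.

```python
import numpy as np

def accuracy_from_probs(probs, labels, threshold=0.5):
    """Threshold the probabilities and compare with the labels as flat arrays."""
    y_hat = (np.asarray(probs).reshape(-1) > threshold).astype(float)
    y = np.asarray(labels).reshape(-1)
    return float(np.mean(y_hat == y))

probs = np.array([0.9, 0.2, 0.6, 0.4])           # made-up predictions, shape (4,)
labels = np.array([[1.0], [0.0], [0.0], [0.0]])  # shape (4, 1)

acc = accuracy_from_probs(probs, labels)
print(acc)

# Pitfall: (4,) compared with (4, 1) broadcasts to a (4, 4) matrix
bad = (probs > 0.5) == labels
print(bad.shape)
```

Flattening both arrays, or reshaping both to `(m, 1)` as the notebook does, keeps the comparison element-wise.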


#### **Experimenting with custom tweets**

In [43]:
new_tweet = "The movie was not good. The prequel was much better and awesome. This movie 0/10."
print('Tweet:')
print(new_tweet)
print('\nProcessed Tweet:')
print(process_tweet(new_tweet), '\n')
y_hat = predict_tweet(new_tweet, freqs, theta)
print('\nPrediction Score:')
print(y_hat, '\n')
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

Tweet:
The movie was not good. The prequel was much better and awesome. This movie 0/10.

Processed Tweet:
['movi', 'good', 'prequel', 'much', 'better', 'awesom', 'movi', '0/10'] 


Prediction Score:
[[0.51151624]] 

Positive sentiment
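
The example above is a misclassification: the text is negative, yet the score is just above 0.5. After preprocessing, 'not' is gone while 'good', 'better', and 'awesom' remain, so the count-based features lean positive. A simple way to carry out error analysis is to list every example whose thresholded prediction disagrees with its label. The sketch below uses hypothetical scores, and `find_errors` is an illustrative helper. In the notebook, the scores would come from `predict_tweet`.

```python
import numpy as np

def find_errors(tweets, labels, probs, threshold=0.5):
    """Return (tweet, label, score) for every misclassified example."""
    errors = []
    for tweet, y, p in zip(tweets, np.ravel(labels), np.ravel(probs)):
        y_hat = 1.0 if p > threshold else 0.0
        if y_hat != y:
            errors.append((tweet, float(y), float(p)))
    return errors

# Toy inputs, made up for illustration
tweets = ["great day :)", "not good at all", "so sad :("]
labels = np.array([[1.0], [0.0], [0.0]])
probs = np.array([0.81, 0.51, 0.12])  # hypothetical model scores

errors = find_errors(tweets, labels, probs)
for tweet, y, p in errors:
    print(f"label={y:.0f} score={p:.4f} tweet={tweet!r}")
```

Reading the misclassified tweets alongside their processed tokens often reveals patterns, such as negation being dropped, that the two count features cannot capture.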
