# Assignment 1: Logistic Regression
Welcome to week one of this specialization. You will learn about logistic regression. Concretely, you will be implementing logistic regression for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will: 

* Learn how to extract features for logistic regression given some text
* Implement logistic regression from scratch
* Apply logistic regression on a natural language processing task
* Test using your logistic regression
* Perform error analysis

We will be using a data set of tweets. Hopefully you will get more than 99% accuracy.  
Run the cell below to load in the packages.

## Import functions and data

In [18]:
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples
from utils import process_tweets, build_freqs

### Prepare the data
* The `twitter_samples` corpus contains a subset of 5,000 positive tweets, a subset of 5,000 negative tweets, and the full set of 10,000 tweets.  
    * Using all three files would introduce duplicates of the positive and negative tweets.  
    * You will select just the 5,000 positive tweets and the 5,000 negative tweets.

In [4]:
# select 5k positive and negative tweets...
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

print("Size of Positive Tweets : ", len(all_positive_tweets))
print()
print("Size of Negative Tweets : ", len(all_negative_tweets))

Size of Positive Tweets :  5000

Size of Negative Tweets :  5000


* Train test split: 20% will be in the test set, and 80% in the training set.

In [9]:
# split the data into two pieces, one for training and one for testing (validation set) 
train_size = 0.8

total_train_positive_samples = int(len(all_positive_tweets) * train_size)
total_train_negative_samples = int(len(all_negative_tweets) * train_size)

print("Total Positive Samples for training : ", total_train_positive_samples)
print()

print("Total Negative Samples for training : ", total_train_negative_samples)
print()

Total Positive Samples for training :  4000

Total Negative Samples for training :  4000



In [10]:
# get the positive and negative tweets in a list...
train_pos = all_positive_tweets[0:total_train_positive_samples]
train_neg = all_negative_tweets[0:total_train_negative_samples]

test_pos = all_positive_tweets[total_train_positive_samples:]
test_neg = all_negative_tweets[total_train_negative_samples:]

train_data = train_pos + train_neg
test_data = test_pos + test_neg

print("Size of Training Data : ", len(train_data))
print()

print("Size of Testing Data : ", len(test_data))

Size of Training Data :  8000

Size of Testing Data :  2000


In [14]:
# combine positive and negative tweet labels....
train_labels = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_labels = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

print("Size of train labels : {0}".format(len(train_labels)))
print("Size of test labels : {0}".format(len(test_labels)))

Size of train labels : 8000
Size of test labels : 2000


In [15]:
# Print the shapes of the train and test label arrays
print("train_labels.shape = " + str(train_labels.shape))
print("test_labels.shape = " + str(test_labels.shape))

train_labels.shape = (8000, 1)
test_labels.shape = (2000, 1)


In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import twitter_samples
import re
import string
import numpy as np

# function to process my input tweets...
def process_tweets(input_tweet):
	# Step 1: To remove all hyperlinks, RT Tags and Hashtags from the string...
	# creating regexp patterns for all them...
	pattern_hyperlinks = r'https?:\/\/.*[\r\n]*'
	pattern_rt_tags = r'^RT[\s]+'
	pattern_hashtags = r'#'

	# get the list of stopwords...
	stopwords_list = list(stopwords.words('english'))
	punctuations = string.punctuation

	cleaned_tweet = ''
	tokenised_tweet = list()
	cleaned_tokenised_list = list()

	stemmed_list = []


	#print('Original Positive Tweet is : '+'\033[92m' + input_tweet)

	# Removing Hyperlinks...
	cleaned_tweet = re.sub(pattern_hyperlinks, '', input_tweet)

	# Removing all hashtags...
	cleaned_tweet = re.sub(pattern_hashtags, '', cleaned_tweet)

	# Removing all RT Tags...
	cleaned_tweet = re.sub(pattern_rt_tags, '', cleaned_tweet)

	#print("Cleaned Tweet is : ")
	#print(cleaned_tweet)

	# Tokenize the tweet...
	tokenizer = TweetTokenizer(preserve_case = False, strip_handles = True, reduce_len = True)
	tokenized_tweet = tokenizer.tokenize(cleaned_tweet)

	#print()
	#print("Tokenized Tweet is : ")
	#print(tokenized_tweet)

	# Remove Stopwords and Punctuations...
	for token in tokenized_tweet:
		if token in stopwords_list:
			pass
		elif token in punctuations:
			pass
		else:
			cleaned_tokenised_list.append(token)

	# print()

	# print("After Removing Punctuations and Stopwords : ")
	# print(cleaned_tokenised_list)

	# Stem the tokens...
	stemmer = PorterStemmer()
	for token in cleaned_tokenised_list:
		stemmed_token = stemmer.stem(token)
		stemmed_list.append(stemmed_token)


	return stemmed_list


# function to build frequency distributions....
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweets(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1    
    return freqs



In [21]:
freqs = build_freqs(train_data, train_labels)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 11345
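
To make the structure of `freqs` concrete, here is a minimal sketch on a three-tweet toy corpus. It assumes a plain whitespace tokenizer (`toy_tokenize`, a hypothetical stand-in for `process_tweets`) so that it runs without the NLTK data; the tweets and labels are made up for illustration.

```python
import numpy as np

def toy_tokenize(tweet):
    # Hypothetical stand-in for process_tweets: lowercase and split on whitespace.
    return tweet.lower().split()

def toy_build_freqs(tweets, ys):
    # Same counting logic as build_freqs, but with the toy tokenizer.
    freqs = {}
    for y, tweet in zip(np.squeeze(ys).tolist(), tweets):
        for word in toy_tokenize(tweet):
            freqs[(word, y)] = freqs.get((word, y), 0) + 1
    return freqs

toy_tweets = ["happy happy day", "sad day", "happy end"]
toy_labels = np.array([[1.0], [0.0], [1.0]])  # 1.0 = positive, 0.0 = negative
toy_freqs = toy_build_freqs(toy_tweets, toy_labels)
print(toy_freqs)
```

Each key is a `(word, label)` tuple, so the same word can appear twice in the dictionary: once with the positive label and once with the negative label. Here `"day"` occurs once under each label, while `"happy"` is counted three times under the positive label only.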


### Process tweet
The function `process_tweets()` defined above removes hyperlinks, retweet markers and hash signs, tokenizes the tweet into individual words, removes stop words and punctuation, and applies stemming.

In [23]:
# test the function below
print('This is an example of a positive tweet: \n', train_data[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweets(train_data[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


# Part 1: Logistic regression 


### Part 1.1: Sigmoid
You will learn to use logistic regression for text classification. 
* The sigmoid function is defined as: 

$$ h(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability. 

<div style="font-size:100%; text-align:center;"><img src="D:/Coursera/Natural Language Processing Specialization/Course 1 - Classification With Vector Spaces/Week 1/utf-8''sigmoid_plot.jpg" alt="Plot of the sigmoid function" style="width:300px;height:200px;" /> Figure 1 </div>

#### Instructions: Implement the sigmoid function
* You will want this function to work if z is a scalar as well as if it is an array.

In [26]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def sigmoid(z): 
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # calculate the sigmoid of z
    # use the form from equation (1); exp(z)/(1 + exp(z)) returns nan for large positive z
    h = 1 / (1 + np.exp(-z))
    ### END CODE HERE ###
    
    return h
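
A quick sanity check can confirm that the function handles both scalars and arrays. The sketch below uses a local helper named `sigmoid_sketch` (a hypothetical name, chosen so that it does not overwrite the graded `sigmoid`) and made-up inputs.

```python
import numpy as np

def sigmoid_sketch(z):
    # Direct translation of h(z) = 1 / (1 + e^{-z}); NumPy applies it element-wise to arrays.
    return 1 / (1 + np.exp(-z))

scalar_out = sigmoid_sketch(0)                            # scalar input
array_out = sigmoid_sketch(np.array([-4.0, 0.0, 4.0]))    # array input

print(scalar_out)   # sigmoid(0) is exactly 0.5
print(array_out)    # values rise from near 0 to near 1
```

Because the sigmoid is symmetric around 0, `sigmoid_sketch(-z) + sigmoid_sketch(z)` should equal 1, which is a convenient property to test.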

### Logistic regression: regression and a sigmoid

Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.

Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
Note that the $\theta$ values are "weights". If you took the Deep Learning Specialization, we referred to the weights with the `w` vector.  In this course, we're using a different variable $\theta$ to refer to the weights.

Logistic regression
$$ h(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
We will refer to 'z' as the 'logits'.
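
As a small worked example of the two steps (linear combination, then sigmoid), the sketch below uses made-up weights and a single made-up example with a bias feature and two other features.

```python
import numpy as np

theta = np.array([[0.5], [1.0], [-2.0]])  # made-up weights, shape (3, 1)
x = np.array([[1.0, 3.0, 1.0]])           # one example: bias feature 1, then two features

z = np.dot(x, theta)        # logit: 0.5*1 + 1.0*3 + (-2.0)*1 = 1.5
h = 1 / (1 + np.exp(-z))    # sigmoid of the logit, a value between 0 and 1
print(z, h)
```

The logit `z` can be any real number; the sigmoid squashes it into a probability-like value, here a little above 0.8.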

### Part 1.2 Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] \tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of the i-th training example.
* $h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.
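
The cost can be sketched in a few lines of NumPy. The labels and predictions below are made up for illustration; the predictions are kept strictly between 0 and 1 so that the logarithms are finite.

```python
import numpy as np

y = np.array([[1.0], [0.0], [1.0]])   # made-up labels
h = np.array([[0.9], [0.2], [0.6]])   # made-up predictions, strictly between 0 and 1

# log loss of each example: only one of the two terms is non-zero per example
per_example = -(y * np.log(h) + (1 - y) * np.log(1 - h))

# the cost J is the average of the per-example losses
J = per_example.mean()
print(per_example.ravel(), J)
```

The confident and correct prediction (0.9 for a label of 1) contributes the smallest loss, and the least confident one (0.6 for a label of 1) contributes the largest.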

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0. 
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0. 
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value: $-1 \times (1 - 0) \times \log(1 - 0.9999) \approx 9.2$. The closer the model prediction gets to 1, the larger the loss.

In [27]:
# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999) # loss is about 9.2

9.210340371976294

* Likewise, if the model predicts close to 0 ($h(z) = 0.0001$) but the actual label is 1, the first term in the loss function becomes a large number: $-1 \times \log(0.0001) \approx 9.2$.  The closer the prediction is to zero, the larger the loss.

In [28]:
-1 * np.log(0.0001)

9.210340371976182

#### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x_j^{(i)} \tag{5}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x_j^{(i)}$ is the value of the feature associated with weight $\theta_j$ in the i-th training example.

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
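
To see the update rule in action for a single weight, here is a minimal sketch with made-up data: three examples, a bias feature plus one other feature, weights initialized to zero, and an arbitrarily chosen learning rate of 0.1.

```python
import numpy as np

x = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # (m, n+1); first column is the bias feature
y = np.array([[1.0], [0.0], [1.0]])                  # made-up labels
theta = np.zeros((2, 1))
alpha = 0.1                                          # arbitrary learning rate for illustration
m = x.shape[0]

h = 1 / (1 + np.exp(-np.dot(x, theta)))   # every prediction is 0.5 while theta is all zeros

j = 1                                     # update only the weight theta_1
grad_j = np.sum((h - y) * x[:, [j]]) / m  # (1/m) * sum_i (h_i - y_i) * x_ij
theta_j_new = theta[j, 0] - alpha * grad_j
print(grad_j, theta_j_new)
```

The gradient is negative here, so subtracting `alpha * grad_j` increases the weight, which pushes the predictions for the two positive examples (both with a positive feature value) toward 1.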


## Instructions: Implement gradient descent function
* The number of iterations `num_iters` is the number of times that you'll use the entire training set.
* For each iteration, you'll calculate the cost function using all training examples (there are `m` training examples), and for all features.
* Instead of updating a single weight $\theta_j$ at a time, we can update all the weights in the column vector:  
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\ 
\theta_2 
\\ 
\vdots
\\ 
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1) 
    * $\mathbf{\theta}$: has dimensions (n+1, 1)
    * $\mathbf{z}$: has dimensions (m, 1)
* The prediction 'h', is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of the vectors 'y' and 'log(h)', plus the dot product of '1-y' and 'log(1-h)'.  Since all of these are column vectors (m,1), transpose the vector on the left, so that matrix multiplication of a row vector with a column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot \log(\mathbf{1-h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$
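
One way to convince yourself that the vectorized gradient matches the per-weight formula is to compute both on random data and compare. The sketch below is only an illustration: the sizes, the seed and the random values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2
x = np.hstack([np.ones((m, 1)), rng.random((m, n))])  # (m, n+1), bias column first
y = (rng.random((m, 1)) > 0.5).astype(float)          # (m, 1)
theta = rng.random((n + 1, 1))                        # (n+1, 1)

h = 1 / (1 + np.exp(-np.dot(x, theta)))               # (m, 1)

# vectorized gradient: (n+1, m) times (m, 1) gives (n+1, 1)
grad_vec = np.dot(x.T, h - y) / m

# the same gradient computed one weight and one example at a time
grad_loop = np.zeros((n + 1, 1))
for j in range(n + 1):
    for i in range(m):
        grad_loop[j, 0] += (h[i, 0] - y[i, 0]) * x[i, j] / m

print(grad_vec.shape, np.allclose(grad_vec, grad_loop))
```

Checking shapes like this before writing the training loop catches most transpose mistakes early.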

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>use np.dot for matrix multiplication.</li>
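    <li>To ensure that the fraction -1/m is a decimal value, cast either the numerator or denominator (or both), like <code>float(1)</code>, or write <code>1.</code> for the float version of 1. In Python 3 the <code>/</code> operator already performs true division, so the cast is only needed in Python 2.</li>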
</ul>
</p>
</details>



In [39]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    
    for i in range(0, num_iters):
        
        # get z, the dot product of x and theta
        z = np.dot(x,theta)
        
        # get the sigmoid of z
        h = sigmoid(z)
        
        # calculate the cost function
        J = -1 * float(1)/m * ((np.dot(np.transpose(y), np.log(h))) + (np.dot(1 - np.transpose(y), np.log(1 - h))))

        # update the weights theta
        theta = theta - alpha/m * (np.dot(np.transpose(x),(h - y)))
        
    ### END CODE HERE ###
    # J is a (1, 1) array; squeeze it to a scalar before converting to a Python float
    J = float(np.squeeze(J))
    return J, theta

In [40]:
# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)

# print(tmp_X)
# print(tmp_Y)

# print(tmp_X.shape)
# print(tmp_Y.shape)

# Apply gradient descent
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}")

The cost after training is 0.67094970.
The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]
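
Following the hint in the docstring, you may want to confirm that the cost actually goes down during training. The sketch below is one way to do that; `gradient_descent_with_history` is a hypothetical variant written for this illustration, not part of the assignment, and it reuses the same synthetic data as the check above.

```python
import numpy as np

def gradient_descent_with_history(x, y, theta, alpha, num_iters):
    # Hypothetical variant of gradientDescent that also records the cost at every iteration.
    m = x.shape[0]
    costs = []
    for _ in range(num_iters):
        h = 1 / (1 + np.exp(-np.dot(x, theta)))
        J = -(np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h))) / m
        costs.append(J.item())                       # J is (1, 1); store it as a float
        theta = theta - alpha / m * np.dot(x.T, h - y)
    return costs, theta

np.random.seed(1)
X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
Y = (np.random.rand(10, 1) > 0.35).astype(float)

costs, _ = gradient_descent_with_history(X, Y, np.zeros((3, 1)), 1e-8, 700)
print(costs[0], costs[-1])
```

With all weights at zero every prediction is 0.5, so the first recorded cost is $\log 2 \approx 0.693$; the last recorded cost should be lower than that.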


## Part 2: Extracting the features

* Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
    * The first feature is the number of positive words in a tweet.
    * The second feature is the number of negative words in a tweet. 
* Then train your logistic regression classifier on these features.
* Test the classifier on a validation set. 

### Instructions: Implement the extract_features function. 
* This function takes in a single tweet.
* Process the tweet using the `process_tweets()` function and save the list of tweet words.
* Loop through each word in the list of processed words
    * For each word, check the `freqs` dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
    * Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)
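
The steps above can be sketched on toy data. The frequency table and the list of already-processed words below are made up for illustration, and the lookup uses `dict.get` with a default of 0 so that unseen pairs do not raise a `KeyError`.

```python
import numpy as np

# Made-up frequency table and an already-processed tweet (both hypothetical).
toy_freqs = {("happi", 1.0): 10, ("happi", 0.0): 2, ("sad", 0.0): 7}
toy_words = ["happi", "sad", "unseen"]

x = np.zeros((1, 3))
x[0, 0] = 1  # bias term
for word in toy_words:
    # dict.get returns 0 for (word, label) pairs that never occurred in training
    x[0, 1] += toy_freqs.get((word, 1.0), 0)
    x[0, 2] += toy_freqs.get((word, 0.0), 0)

# bias 1, positive count 10, negative count 2 + 7 = 9
print(x)
```

The word `"unseen"` contributes nothing to either count, and `"sad"` contributes only to the negative count because it never appeared in a positive tweet.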


In [49]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def extract_features(tweet, freqs):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3)
    '''
    # process_tweets tokenizes, stems, and removes stopwords
    word_l = process_tweets(tweet)
    
    print("Processed Tweet : ", word_l)
    
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))
    print(x)
    
    #bias term is set to 1
    x[0,0] = 1 
    
    print(x)
    
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    
    # loop through each word in the list of words
    for word in word_l:
        
        # increment the word count for the positive label 1
        # (freqs.get returns 0 when the pair was never seen in training)
        x[0,1] += freqs.get((word, 1.0), 0)
        
        # increment the word count for the negative label 0
        x[0,2] += freqs.get((word, 0.0), 0)
        
    ### END CODE HERE ###
    assert(x.shape == (1, 3))
    return x

In [50]:
# Check your function

# test 1
# test on training data
# print(train_data[0])
# print(freqs)

extract_features(train_data[0], freqs)
# tmp1 = extract_features(train_data[0], freqs)
# print(tmp1)

Processed Tweet :  ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
[[0. 0. 0.]]
[[1. 0. 0.]]
