# Week 1 Assignment - Logistic regression

In [8]:
import nltk
import numpy as np
from nltk.corpus import twitter_samples
from utils import process_tweet, build_freqs
from typing import Tuple, Callable

nltk.download("twitter_samples")
nltk.download("stopwords")

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/arpanmajumdar/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arpanmajumdar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
all_positive_tweets = twitter_samples.strings("positive_tweets.json")
all_negative_tweets = twitter_samples.strings("negative_tweets.json")

In [10]:
test_pos = all_positive_tweets[4000:]
test_neg = all_negative_tweets[4000:]
train_pos = all_positive_tweets[:4000]
train_neg = all_negative_tweets[:4000]

X_train = train_pos + train_neg
X_test = test_pos + test_neg
y_train = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
y_test = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [11]:
print("Train size:", len(X_train))
print("Test size:", len(X_test))
print("Train labels shape:", y_train.shape)
print("Test labels shape:", y_test.shape)

Train size: 8000
Test size: 2000
Train labels shape: (8000, 1)
Test labels shape: (2000, 1)


## Prepare the data

In [12]:
freqs = build_freqs(X_train, y_train)

In [13]:
# test the function below
print("This is an example of a positive tweet: \n", X_train[0])
print(
    "\nThis is an example of the processed version of the tweet: \n",
    process_tweet(X_train[0]),
)

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


<a name='1'></a>
## 1 - Logistic Regression 

<a name='1-1'></a>
### 1.1 - Sigmoid
You will learn to use logistic regression for text classification. 
* The sigmoid function is defined as: 

$$ h(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability. 

In [14]:
def sigmoid(z):
    h = 1 / (1 + np.exp(-z))
    return h
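
As a quick sanity check of the definition above, the sketch below evaluates the sigmoid at a few arbitrary points. It redefines `sigmoid` so that it runs on its own; the sample inputs are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied element-wise."""
    return 1 / (1 + np.exp(-z))


z = np.array([-10.0, 0.0, 10.0])
h = sigmoid(z)
print(h)  # a value near 0, exactly 0.5, and a value near 1

# The function is symmetric around z = 0: sigmoid(-z) equals 1 - sigmoid(z)
symmetric = np.allclose(sigmoid(-z), 1 - sigmoid(z))
print(symmetric)
```

Because the output is exactly 0.5 at z = 0, a threshold of 0.5 on the prediction corresponds to checking the sign of the logit.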

<a name='1-2'></a>
### 1.2 - Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] \tag{2} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of training example 'i'.
* $h(z^{(i)})$ is the model's prediction for the training example 'i'.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label 'y' is also 1, the loss for that training example is 0. 
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0. 
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value: $-1 \times (1 - 0) \times \log(1 - 0.9999) \approx 9.2$. The closer the model prediction gets to 1, the larger the loss.

In [15]:
# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999)  # loss is about 9.2

np.float64(9.210340371976294)

In [16]:
# verify that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value
-1 * np.log(0.0001)  # loss is about 9.2

np.float64(9.210340371976182)
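
The zero-loss cases described above treat $0 \times \log(0)$ as 0. In floating point, `np.log(0)` is `-inf` and `0 * -inf` is `nan`, so a prediction of exactly 0 or 1 can break the cost. One common workaround, sketched here as an assumption rather than as part of the assignment, is to clip the predictions slightly away from 0 and 1. The helper name `log_loss` and the value of `eps` are illustrative choices.

```python
import numpy as np


def log_loss(y, h, eps=1e-12):
    """Per-example log loss; `eps` (an illustrative choice) keeps h away from 0 and 1."""
    h = np.clip(h, eps, 1 - eps)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))


# Two confident, correct predictions followed by two confident, wrong ones
y = np.array([1.0, 0.0, 0.0, 1.0])
h = np.array([1.0, 0.0, 0.9999, 0.0001])
losses = log_loss(y, h)
print(losses)  # the first two are close to 0, the last two are about 9.2
```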

#### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x^{(i)}_j \tag{3}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x^{(i)}_j$ is the feature associated with weight $\theta_j$

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
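
To make the update rule concrete, here is a minimal sketch of a single gradient step on a two-example toy problem. The data, the starting weights, and the learning rate are all made up for illustration.

```python
import numpy as np

# Toy data (made up): 2 examples, a bias column of ones plus 1 feature
x = np.array([[1.0, 2.0],
              [1.0, -1.0]])
y = np.array([[1.0],
              [0.0]])
theta = np.zeros((2, 1))
alpha = 0.1  # illustrative learning rate
m = x.shape[0]

h = 1 / (1 + np.exp(-x @ theta))  # every prediction is 0.5 because theta is zero
grad = (x.T @ (h - y)) / m        # one entry per weight theta_j
theta_new = theta - alpha * grad  # the update rule above

print(grad.ravel())       # gradient is 0 for the bias and -0.75 for the feature
print(theta_new.ravel())  # the bias stays at 0, the feature weight moves to about 0.075
```

The feature weight increases because the positive example has the larger feature value, so raising that weight pushes its prediction toward 1.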


<a name='ex-2'></a>
### Exercise 2 - gradient_descent
Implement the gradient descent function.
* The number of iterations 'num_iters' is the number of times that you'll use the entire training set.
* For each iteration, you'll calculate the cost function using all training examples (there are 'm' training examples), and for all features.
* Instead of updating a single weight $\theta_j$ at a time, we can update all the weights in the column vector:  
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\ 
\theta_2 
\\ 
\vdots
\\ 
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1)
    * $\mathbf{\theta}$ has dimensions (n+1, 1)
    * $\mathbf{z}$ has dimensions (m, 1)
* The prediction 'h' is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m, 1).
* The cost function $J$ is calculated from two dot products: 'y' with 'log(h)', and '(1-y)' with 'log(1-h)'.  Since all of these are column vectors of shape (m, 1), transpose the vector on the left so that matrix multiplication of a row vector with a column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot log(\mathbf{1-h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$

In [17]:
def gradient_descent(
    x: np.ndarray, y: np.ndarray, theta: np.ndarray, alpha: float, num_iters: int
) -> Tuple[float, np.ndarray]:
    """
    Input:
        x: feature matrix of shape (m, n+1)
        y: labels of shape (m, 1)
        theta: model weights of shape (n+1, 1)
        alpha: learning rate
        num_iters: number of passes over the training data
    Output:
        J: cost computed in the last iteration
        theta: updated weights
    """
    m = x.shape[0]

    for _ in range(num_iters):
        h = sigmoid(np.dot(x, theta))  # shape: (m, 1)
        J = (-1 / m) * (np.dot(y.T, np.log(h)) + np.dot(1 - y.T, np.log(1 - h)))
        theta = theta - (alpha / m) * np.dot(x.T, (h - y))

    J = float(np.squeeze(J))
    return J, theta

In [18]:
# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)

# Apply gradient descent
tmp_J, tmp_theta = gradient_descent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(
    f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}"
)

The cost after training is 0.67094970.
The resulting vector of weights is [np.float64(4.1e-07), np.float64(0.00035658), np.float64(7.309e-05)]
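
One way to gain confidence in the vectorized gradient is to compare it with a finite-difference estimate of the cost. The sketch below does this on random toy data. The helper names `cost` and `analytic_gradient`, the random data, and the step size are assumptions made for this illustration and are not part of the assignment.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(x, y, theta):
    """Average log loss J(theta) in the vectorized form used above."""
    m = x.shape[0]
    h = sigmoid(x @ theta)
    return float(np.squeeze(-(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m))


def analytic_gradient(x, y, theta):
    """Gradient (1/m) * x^T (h - y), shape (n+1, 1)."""
    m = x.shape[0]
    return x.T @ (sigmoid(x @ theta) - y) / m


rng = np.random.default_rng(0)
x = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # bias + 2 features
y = (rng.random((20, 1)) > 0.5).astype(float)
theta = rng.normal(scale=0.1, size=(3, 1))

# Central finite differences, perturbing one weight at a time
step = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    bump = np.zeros_like(theta)
    bump[j] = step
    numeric[j] = (cost(x, y, theta + bump) - cost(x, y, theta - bump)) / (2 * step)

max_diff = np.max(np.abs(numeric - analytic_gradient(x, y, theta)))
print(f"Largest difference between the two gradients: {max_diff:.2e}")
```

A very small difference suggests that the analytic expression matches the cost being minimized; a large one usually points to a sign or transpose error.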


<a name='2'></a>
## 2 - Extracting the Features

* Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
    * The first feature is the sum, over the words in a tweet, of how often each word appears in positive training tweets.
    * The second feature is the same sum computed with the counts from negative training tweets.
* Then train your logistic regression classifier on these features.
* Test the classifier on the held-out test set.

<a name='ex-3'></a>
### Exercise 3 - extract_features
Implement the extract_features function. 
* This function takes in a single tweet.
* Process the tweet using the imported `process_tweet` function and save the list of tweet words.
* Loop through each word in the list of processed words
    * For each word, check the 'freqs' dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
    * Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)

**Note:** With the implementation described above, the feature vector counts duplicate words, so a word that appears twice in a tweet contributes its frequency twice. This differs from what you have seen in the lecture videos.

In [19]:
def extract_features(
    tweet: str, freqs: dict, process_tweet: Callable[[str], list[str]]
) -> np.ndarray:
    word_list = process_tweet(tweet)
    x = np.zeros(3)
    x[0] = 1

    for word in word_list:
        x[1] += freqs.get((word, 1), 0)
        x[2] += freqs.get((word, 0), 0)
    x = x[None, :]
    assert x.shape == (1, 3)
    return x

In [20]:
# Check your function
# test 1
# test on training data
tmp1 = extract_features(X_train[0], freqs, process_tweet=process_tweet)
print(tmp1)

[[1.000e+00 3.133e+03 6.100e+01]]


In [21]:
# test 2:
# check for when the words are not in the freqs dictionary
tmp2 = extract_features("blorb bleeeeb bloooob", freqs, process_tweet=process_tweet)
print(tmp2)

[[1. 0. 0.]]
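
The effect of counting duplicate words can be seen on a tiny, made-up frequency table. The sketch below is an assumption-laden stand-in for `extract_features`: `toy_freqs` and `toy_features` are hypothetical, and the tokens are passed in directly instead of going through `process_tweet`.

```python
import numpy as np

# Hypothetical frequency table: (word, label) -> count in the training set
toy_freqs = {
    ("happi", 1.0): 10,
    ("happi", 0.0): 1,
    ("sad", 1.0): 2,
    ("sad", 0.0): 8,
}


def toy_features(tokens, freqs):
    """Return [[bias, positive count sum, negative count sum]] for a token list."""
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in tokens:
        x[0, 1] += freqs.get((word, 1.0), 0)
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x


once = toy_features(["happi", "sad"], toy_freqs)
twice = toy_features(["happi", "happi", "sad"], toy_freqs)
print(once)   # bias 1, positive sum 12, negative sum 9
print(twice)  # bias 1, positive sum 22, negative sum 10
```

Repeating "happi" adds its counts a second time, which is the duplicate-counting behaviour mentioned in the note above.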


<a name='3'></a>
## 3 - Training Your Model

To train the model:
* Stack the features for all training examples into a matrix X. 
* Call `gradient_descent`, which you've implemented above.

This section is given to you.  Please read it for understanding and run the cell.

In [36]:
m = len(X_train)
n = 2
X = np.zeros((m, n + 1))

for i in range(m):
    X[i, :] = extract_features(X_train[i], freqs, process_tweet=process_tweet)

y = y_train
theta = np.zeros((n + 1, 1))
alpha = 1e-9
num_iters = 1500
J, theta = gradient_descent(X, y, theta, alpha, num_iters)
print(f"Cost after training: {J:8f}")
print(f"Resulting weight vector: {theta}")

Cost after training: 0.225213
Resulting weight vector: [[ 6.03424369e-08]
 [ 5.38195957e-04]
 [-5.58303150e-04]]
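
The learned weights can be read as a decision boundary in the plane of the two features: the prediction is 0.5 exactly where the logit $\theta_0 + \theta_1 \cdot pos + \theta_2 \cdot neg$ is zero. The sketch below works this out from the weights printed above. The helper `is_positive` is a hypothetical name used only for this illustration.

```python
import numpy as np

# Weights copied from the training output above
theta = np.array([[6.03424369e-08], [5.38195957e-04], [-5.58303150e-04]])
theta0, theta1, theta2 = theta.ravel()

# On the boundary, theta0 + theta1 * pos + theta2 * neg = 0, so
# neg = -(theta0 + theta1 * pos) / theta2
slope = -theta1 / theta2
intercept = -theta0 / theta2
print(f"neg = {slope:.4f} * pos + {intercept:.6f}")


def is_positive(pos, neg):
    """True when the logit is above 0, i.e. the sigmoid is above 0.5."""
    return theta0 + theta1 * pos + theta2 * neg > 0


print(is_positive(3133, 61))  # feature values of X_train[0] printed earlier: True
```

Since the two weights have nearly equal magnitude and opposite sign, the boundary lies close to the line where the positive and negative sums are equal.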


<a name='4'></a>
## 4 -  Test your Logistic Regression

It is time for you to test your logistic regression function on some new input that your model has not seen before. 
<a name='ex-4'></a>
### Exercise 4 - predict_tweet
Implement `predict_tweet`.
Predict whether a tweet is positive or negative.

* Given a tweet, process it, then extract the features.
* Apply the model's learned weights on the features to get the logits.
* Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).

$$y_{pred} = sigmoid(\mathbf{x} \cdot \theta)$$

In [50]:
def predict_tweet(tweet: str, freqs: dict, theta: np.ndarray) -> float:
    X = extract_features(tweet, freqs, process_tweet=process_tweet)
    y_pred = sigmoid(np.dot(X, theta))
    y_pred = float(np.squeeze(y_pred))
    return y_pred

In [33]:
# Run this cell to test your function
for tweet in [
    "I am happy",
    "I am bad",
    "this movie should have been great.",
    "great",
    "great great",
    "great great great",
    "great great great great",
]:
    print("%s -> %f" % (tweet, predict_tweet(tweet, freqs, theta)))

I am happy -> 0.519275
I am bad -> 0.494347
this movie should have been great. -> 0.515980
great -> 0.516065
great great -> 0.532097
great great great -> 0.548063
great great great great -> 0.563930
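
The steady rise in the "great" predictions follows from the duplicate-counting features: each repetition adds the same amounts to the two sums, and the bias weight is close to zero, so the logit grows roughly in proportion to the number of repetitions. As a rough check, the sketch below inverts the sigmoid on the probabilities printed above.

```python
import numpy as np

# Probabilities copied from the output above for 1 to 4 repetitions of "great"
probs = np.array([0.516065, 0.532097, 0.548063, 0.563930])

# The inverse of the sigmoid recovers the logit z = x . theta
logits = np.log(probs / (1 - probs))
ratios = logits / logits[0]
print(np.round(ratios, 2))  # approximately 1, 2, 3, 4
```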


<a name='4-1'></a>
### 4.1 -  Check the Performance using the Test Set
After training your model using the training set above, check how your model might perform on real, unseen data, by testing it against the test set.

<a name='ex-5'></a>
### Exercise 5 - test_logistic_regression
Implement `test_logistic_regression`. 
* Given the test data and the weights of your trained model, calculate the accuracy of your logistic regression model. 
* Use your 'predict_tweet' function to make predictions on each tweet in the test set.
* If the prediction is > 0.5, set the model's classification 'y_hat' to 1, otherwise set the model's classification 'y_hat' to 0.
* A prediction is accurate when `y_hat` equals `y_test`.  Count the instances where they are equal and divide by m, the number of test tweets.

In [51]:
def test_logistic_regression(
    X_test: list[str], y_test: np.ndarray, freqs: dict, theta: np.ndarray
) -> float:
    """
    Input:
        X_test: list of m test tweets (raw strings)
        y_test: label vector of shape (m, 1)
        freqs: dict mapping (word, polarity) -> frequency of occurrence
        theta: logistic regression weights of shape (3, 1)
    Output:
        accuracy: fraction of tweets classified correctly
    """

    # the list for storing predictions
    y_hat = []

    for tweet in X_test:
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            y_hat.append(1.0)
        else:
            y_hat.append(0.0)

    # y_hat is a list, but y_test is an (m, 1) array;
    # convert both to one-dimensional arrays so that '==' compares them element-wise
    y_hat = np.array(y_hat)
    y_test = y_test.flatten()
    accuracy = np.sum(y_hat == y_test) / y_test.shape[0]

    return accuracy

In [44]:
theta

array([[ 6.03424369e-08],
       [ 5.38195957e-04],
       [-5.58303150e-04]])

In [45]:
tmp_accuracy = test_logistic_regression(X_test, y_test, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

Logistic regression model's accuracy = 0.9950
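
The thresholding and accuracy steps can also be written without an explicit loop once the predicted probabilities are available as an array. The sketch below uses made-up probabilities and labels. It also shows why the labels are flattened first: comparing a one-dimensional array with an (m, 1) array broadcasts to an (m, m) matrix.

```python
import numpy as np

# Made-up sigmoid outputs and labels, for illustration only
y_prob = np.array([0.91, 0.40, 0.50, 0.73, 0.08])
y_true = np.array([[1.0], [0.0], [1.0], [0.0], [0.0]])  # shape (m, 1), like y_test

y_hat = (y_prob > 0.5).astype(float)  # a prediction of exactly 0.5 is classified as 0
accuracy = np.mean(y_hat == y_true.flatten())
print(accuracy)  # 0.6

# Without flattening, broadcasting compares every prediction with every label
print((y_hat == y_true).shape)  # (5, 5)
```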


<a name='5'></a>
## 5 -  Error Analysis

In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Specifically what kind of tweets does your model misclassify?

In [47]:
# Some error analysis done for you
print("Label Predicted Tweet")
for x, y in zip(X_test, y_test.flatten()):  # flatten so that y is a scalar for '%d'
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print("THE TWEET IS:", x)
        print("THE PROCESSED TWEET IS:", process_tweet(x))
        print(
            "%d\t%0.8f\t%s"
            % (y, y_hat, " ".join(process_tweet(x)).encode("ascii", "ignore"))
        )

Label Predicted Tweet
THE TWEET IS: @MarkBreech Not sure it would be good thing 4 my bottom daring 2 say 2 Miss B but Im gonna be so stubborn on mouth soaping ! #NotHavingit :p
THE PROCESSED TWEET IS: ['sure', 'would', 'good', 'thing', '4', 'bottom', 'dare', '2', 'say', '2', 'miss', 'b', 'im', 'gonna', 'stubborn', 'mouth', 'soap', 'nothavingit', ':p']
1	0.48942981	b'sure would good thing 4 bottom dare 2 say 2 miss b im gonna stubborn mouth soap nothavingit :p'
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots
http://t.co/UGQzOx0huu
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48418981	b"i'm play brain dot braindot"
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/aOKldo3GMj http://t.co/xWCM9qyRG5
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48418981	b"i'm play brain dot braindot"
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/R2JBO8iNww http://t.co/ow5BBwdEMY
THE PROCESSED TWEET IS: ["i'm", 'play', 

THE TWEET IS: @phenomyoutube u probs had more fun with david than me : (
THE PROCESSED TWEET IS: ['u', 'prob', 'fun', 'david']
0	0.50988296	b'u prob fun david'
THE TWEET IS: pats jay : (
THE PROCESSED TWEET IS: ['pat', 'jay']
0	0.50040366	b'pat jay'
THE TWEET IS: my beloved grandmother : ( https://t.co/wt4oXq5xCf
THE PROCESSED TWEET IS: ['belov', 'grandmoth']
0	0.50000002	b'belov grandmoth'
THE TWEET IS: Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring
THE PROCESSED TWEET IS: ['sr', 'financi', 'analyst', 'expedia', 'inc', 'bellevu', 'wa', 'financ', 'expediajob', 'job', 'job', 'hire']
0	0.50648699	b'sr financi analyst expedia inc bellevu wa financ expediajob job job hire'
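
Several of the misclassified tweets above contain an emoticon written with a space (": )" or ": ("), and the processed token lists no longer include it. One possible experiment, sketched here as an assumption and not as part of the assignment, is to join such spaced emoticons before calling `process_tweet`. The helper `normalize_spaced_emoticons` is hypothetical, and whether `process_tweet` then keeps the joined emoticon as a token has not been verified here.

```python
import re


def normalize_spaced_emoticons(text: str) -> str:
    """Join ': )' and ': (' into ':)' and ':(' (hypothetical pre-processing step)."""
    return re.sub(r":\s+([()])", r":\1", text)


sad = normalize_spaced_emoticons("pats jay : (")
happy = normalize_spaced_emoticons("I'm playing Brain Dots : ) #BrainDots")
print(sad)    # pats jay :(
print(happy)  # I'm playing Brain Dots :) #BrainDots
```

This pattern is deliberately crude: it would also rewrite the "Inc.: (#Bellevue" fragment in the job posting above, where the colon and parenthesis are not an emoticon.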
