##  Logistic Regression
will be implementing logistic regression for sentiment analysis on tweets. Given a tweet, we will decide if it has a positive sentiment or a negative one.

### Specifically we will:
- Learn how to extract features for logistic regression given some text
- Implement logistic regression from scratch
- Apply logistic regression on a natural language processing task
- Test using your logistic regression
- Perform error analysis

In [2]:
# import Functions and Data
import nltk
from os import getcwd

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
filePath= f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

In [4]:
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples

import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

In [5]:
def process_tweet(tweet):
  """
    Process tweet function:
    Input:
      tweet: a string contain a tweet
    Output:
      tweets_clean: a list of words containing the processed tweets
  """

  stemmer= PorterStemmer() # root
  stopwords_english= stopwords.words('english')

  #remove stoch market tickers like $GE
  tweet= re.sub(r'$\w*', '', tweet)

  #remove old style retweet text "RT"
  tweet= re.sub(r'^RT[\s]+', '', tweet)

  #remove hyperlinks
  tweet= re.sub(r'https?://[^\s\n\r]+', '', tweet)

  #remove hashtags
  #only removing the hash # sign from the word
  tweet= re.sub(r'#', '', tweet)

  #tokenize tweets
  tokenizer= TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
  tweet_tokens= tokenizer.tokenize(tweet)

  tweets_clean= []
  for word in tweet_tokens:
    if(word not in stopwords_english and #remove stopwords
       word not in string.punctuation): # remove puntuation
       #tweets_clean.append(word)
       stem_word= stemmer.stem(word) #stemming (root)
       tweets_clean.append(stem_word)

  return tweets_clean

In [6]:
def build_freqs(tweets, ys):
  """
    Build frequencies:
    Input:
      tweets: a list of tweets
      ys: an m x 1 array with sentiment label of each tweet (either 0 or 1)
    Output:
      freqs: a dictionary mapping each (word, sentiment) pair to its requency.
  """
  # convert np array to list since zip needs an iterable.
  # the squeeze is necessary or the list ends up with one element.
  # also this is just a NOP it ys is aleady a list.
  yslist= np.squeeze(ys).tolist() # squeeze to remove the unneeded dimentions

  # start with an empty dictionary and populate it by looping over all tweets.
  # and over all processed words in each tweet.
  freqs= {}
  for y, tweet in zip(yslist, tweets):
    for word in process_tweet(tweet):
      pair = (word, y)
      if pair in freqs:
        freqs[pair] +=1
      else:
        freqs[pair] = 1

  return freqs


### Prepare the Data
The twitter_samples contains subsets of five thousand positive_tweets, five thousand negative_tweets

In [7]:
# select the set of positive and negative tweets
all_positive_tweets= twitter_samples.strings('positive_tweets.json')
all_negative_tweets= twitter_samples.strings('negative_tweets.json')

- Train test split: 20% will be in the test set, and 80% in the reaining set.

In [8]:
# split the data into 2 pieces, one for training and one for testing(validation set)
test_pos= all_positive_tweets[4000:]
train_pos= all_positive_tweets[:4000]

test_neg= all_negative_tweets[4000:]
train_neg= all_negative_tweets[:4000]

train_x= train_pos + train_neg
test_x= test_pos + test_neg

- Create the numpy array of positive labels and negative labels.

In [9]:
# combine positive and negative labels
train_y =np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [10]:
# print the shape of train and test sets
print('train_y.shape= ' + str(train_y.shape))
print('test_y.shape= '+ str(test_y.shape))

train_y.shape= (8000, 1)
test_y.shape= (2000, 1)


- The 'freqs' dictionary is the frequency dictionary that's being built.
- The key is the tuple (word, label), such as ("happy",1) or ("happy",0). The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times "happy" was associated with a negative label.

In [11]:
# create frequency dictionary
freqs= build_freqs(train_x, train_y)

# check the output
print('type(freqs)= '+str(type(freqs)))
print('len(freqs)= '+ str(len(freqs.keys())))

type(freqs)= <class 'dict'>
len(freqs)= 11396


### Process Tweets

In [12]:
# test the function below
print('this is an example of a positive tweet: \n', train_x[0])
print('\nthis is an example of the processed version of the tweet:\n', process_tweet(train_x[0]))

this is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

this is an example of the processed version of the tweet:
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


### Sigmoid

In [13]:
def sigmoid(z):
  """
  Input:
    z: is the input (can be a scalar or an array)
  Output:
    h: the sigmoid of z
  """

  #calculate the sigmoid of z
  h= 1/(1+np.exp(-z))

  return h

In [14]:
# Testing the functions
if(sigmoid(0)==0.5):
  print('Success!')
else:
  print('Oops!')

if(sigmoid(4.92) == 0.9927537604041685):
  print('Correct!')
else:
  print('Oops again!')

Success!
Correct!


### Cost function and Gradient

Implement gradient descent function.

- The number of iterations 'num_iters" is the number of times that you'll use the entire training set.
- For each iteration, you'll calculate the cost function using all training examples (there are 'm' training examples), and for all features.
- Instead of updating a single weight  𝜃𝑖
  at a time, we can update all the weights in the column vector:

In [15]:
def gradientDescent(x, y, theta, alpha, num_iters):
  """
    Input:
      x: matrix of features which is (m, n+1)
      y: corresponding labels of the input matrix x, dimensions (m, 1)
      theta: weight vector of dimensions (n+1, 1)
      num_iters: number of iterations you want to train your model for

    Output:
      j: the final cost
      theta: your final weight vector
  """
  # get 'm', thenumber of rows in matrix x
  m = x.shape[0]
  for i in range(0, num_iters):

    # get z, the dot product of x and theta
    z= np.dot(x, theta)

    #get the sigmoid of z
    h= 1/ (1+ np.exp(-z))

    # calculate teh cost functions
    J = -1/m * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))

    # update the weights theta
    theta= theta - (alpha / m) * (np.dot(x.T, (h-y)))

  J= float(J.item())
  return J, theta

### extracting the Features
- Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
  - The first feature is the number of positive words in a tweet.
  - The second feature is the number of negative words in a tweet.
- Then train your logistic regression classifier on these features.
- Test the classifier on a validation set.

In [16]:
def extract_features(tweet, freqs, process_tweet=process_tweet):
  """
    Input:
      tweet: a string containing one tweet
      freqs: a dictionary corresponding to the frequencies of each tuple(word, label)
    Output:
      x: a feature vector of dimension (1, 3)
  """
  # process_tweet tokenizers, stems, and removes stopwords
  word_list= process_tweet(tweet)

  # 3 elements for [bias, positive, negative] counts
  x= np.zeros(3)

  # bias term is set to 1
  x[0] = 1

  # loop through each word in the list of words
  for word in word_list:

    # increase the word count for the positive label 1
    x[1]+= freqs.get((word, 1), 0)

    # increase the word count for the negative label 0
    x[2]+= freqs.get((word, 0), 0)

  x= x[None, :]
  assert(x.shape == (1, 3))
  return x




In [17]:
# check the function
# test on training data

temp1= extract_features(train_x[0], freqs)
temp1

array([[1.000e+00, 3.133e+03, 6.100e+01]])

In [18]:
# test 2:
# check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)

[[1. 0. 0.]]


### Training Model
To train the model:

- Stack the features for all training examples into a matrix X.
- Call gradientDescent, which we've implemented above.

In [19]:
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))

for i in range(len(train_x)):
  X[i, :]= extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# apply gradient descent
J, theta= gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}");


The cost after training is 0.22524456.
The resulting vector of weights is [np.float64(6e-08), np.float64(0.00053786), np.float64(-0.00055885)]


### Predict Tweet
Implement predict_tweet. Predict whether a tweet is positive or negative.

- Given a tweet, process it, then extract the features.
- Apply the model's learned weights on the features to get the logits.
- Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).


In [20]:
def predict_tweet(tweet, freqs, theta):
  """
    Input:
      tweet: a string
      freqs: a dictionary corresponding to the frequencies of each tuple(word, label)
      theta: (3, 1) vector of weights
    Output:
      y_pred: the probability of a tweet being positive or negative
  """
  # extract the features of the tweet and store it in x
  x= extract_features(tweet, freqs)

  # make the prediction using x and theta
  y_pred= sigmoid(np.dot(x, theta)) # shape: (1, 1)

  return y_pred


In [21]:
# test the function
for tweet in ["I am happy", "I am bad", "this movie should have been great.", "great", "great great",'great great great', "great great great grea"]:
  print("%s -> %f" %(tweet, predict_tweet(tweet, freqs, theta).item()))

I am happy -> 0.519259
I am bad -> 0.494338
this movie should have been great. -> 0.515962
great -> 0.516052
great great -> 0.532070
great great great -> 0.548023
great great great grea -> 0.548023


### Check the Performance using Test Set

Implement test_logistic_regression.

- Given the test data and the weights of my trained model, calculate the accuracy of my logistic regression model.
- Use your 'predict_tweet' function to make predictions on each tweet in the test set.
- If the prediction is > 0.5, set the model's classification 'y_hat' to 1, otherwise set the model's classification 'y_hat' to 0.
- A prediction is accurate when the y_hat equals the test_y. Sum up all the instances when they are equal and divide by m.

In [22]:
def test_logistic_regression(test_x, test_y, freqs, theta, predict_tweet=predict_tweet):
  """
    Input:
      test_x: a list of tweets
      test_y: (m, 1) vector with the corresponding labels for the list of tweets
      freqs: a dictionary with the frequency of each pair (or tuple)
      theta: weight vector of dimension(3, 1)
    Output:
      accuracy: (# of tweets classified correctly) / (total # of tweets)
  """

  #strating the list for storing predictions
  y_hat=[]

  for tweet in test_x:
    #get the label prediction for the tweet
    y_pred= predict_tweet(tweet, freqs, theta)

    if y_pred > 0.5:
      #append 1.0 to the list
      y_hat.append(1.0)
    else:
      #append 0 to the list
      y_hat.append(0.0)

  # with the above implementation, y_hat is a list, but test_y is (m, 1) array
  y_hat= np.array(y_hat)
  test_y= np.squeeze(test_y) # or test_y.reshape(-1)

  #convert both to one_dimensional arrays in order to compare them using the '=' operator
  accuracy= np.sum(y_hat== test_y) / len(test_x)

  return accuracy


In [23]:
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

Logistic regression model's accuracy = 0.9965


### Error Analysis
In this part we will see some tweets that our model misclassified. Why do you think the misclassifications happened? Specifically what kind of tweets does our model misclassify?



In [24]:
# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(test_x,test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        print('%d\t%0.8f\t%s' % (y, y_hat, ' '.join(process_tweet(x)).encode('ascii', 'ignore')))

Label Predicted Tweet
THE TWEET IS: @MarkBreech Not sure it would be good thing 4 my bottom daring 2 say 2 Miss B but Im gonna be so stubborn on mouth soaping ! #NotHavingit :p
THE PROCESSED TWEET IS: ['sure', 'would', 'good', 'thing', '4', 'bottom', 'dare', '2', 'say', '2', 'miss', 'b', 'im', 'gonna', 'stubborn', 'mouth', 'soap', 'nothavingit', ':p']
1	0.48899230	b'sure would good thing 4 bottom dare 2 say 2 miss b im gonna stubborn mouth soap nothavingit :p'


  print('%d\t%0.8f\t%s' % (y, y_hat, ' '.join(process_tweet(x)).encode('ascii', 'ignore')))


THE TWEET IS: off to the park to get some sunlight : )
THE PROCESSED TWEET IS: ['park', 'get', 'sunlight']
1	0.49632433	b'park get sunlight'
THE TWEET IS: @msarosh Uff Itna Miss karhy thy ap :p
THE PROCESSED TWEET IS: ['uff', 'itna', 'miss', 'karhi', 'thi', 'ap', ':p']
1	0.48246197	b'uff itna miss karhi thi ap :p'
THE TWEET IS: @phenomyoutube u probs had more fun with david than me : (
THE PROCESSED TWEET IS: ['u', 'prob', 'fun', 'david']
0	0.50983764	b'u prob fun david'
THE TWEET IS: pats jay : (
THE PROCESSED TWEET IS: ['pat', 'jay']
0	0.50040341	b'pat jay'
THE TWEET IS: my beloved grandmother : ( https://t.co/wt4oXq5xCf
THE PROCESSED TWEET IS: ['belov', 'grandmoth']
0	0.50000001	b'belov grandmoth'
THE TWEET IS: Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring
THE PROCESSED TWEET IS: ['sr', 'financi', 'analyst', 'expedia', 'inc', 'bellevu', 'wa', 'financ', 'expediajob', 'job', 'job', 'hire']
0	0.50647821	b'sr finan

### Predict with our own Tweet


In [28]:
my_tweet= "This is a ridiculously bright movie. The plot was terrible and I was sad until the ending!"
print(process_tweet(my_tweet))
y_hat= predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
  print('Positive sentiment')
else:
  print('Negative sentiment')

['ridicul', 'bright', 'movi', 'plot', 'terribl', 'sad', 'end']
[[0.48122783]]
Negative sentiment
