# Overview

In this notebook, we implement logistic regression from scratch for sentiment analysis on tweets. Given a tweet, we decide whether it has a positive or a negative sentiment. We use the NLTK `twitter_samples` dataset for the sentiment analysis task.


### Load packages

In [69]:
import re
import string
import numpy as np
import pandas as pd
import nltk
from sklearn import metrics
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

## Logistic regression

The logistic regression model uses the logistic (sigmoid) function to squeeze the output of a linear equation between 0 and 1.

#### The logistic function 

$$ h(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input $z$ to a value strictly between 0 and 1, so the output can be treated as a probability.

<div style="font-size:100%; text-align:center;"><img src='../tmp2/sigmoid_plot.jpg' alt="Sigmoid function plot" style="width:300px;height:200px;" /> Figure 1 </div>

Consider the following linear combination of the input features, as used in linear regression:

$$z = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_N x_N$$

Note that the `w` values are "weights". 

In logistic regression, we apply the sigmoid function to the output of the above linear equation:

$$ h(z) = \frac{1}{1+e^{-z}}$$

$$z = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_N x_N$$
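
As a minimal sketch of the two equations above, the snippet below computes $z$ and $h(z)$ for a single example. The weight and feature values are made up for illustration and do not come from the tweet data.

```python
import numpy as np

# Hypothetical weights and features (x_0 = 1 plays the role of the bias input).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 1.5])

z = np.dot(w, x)              # linear part: w_0*x_0 + w_1*x_1 + w_2*x_2
h = 1.0 / (1.0 + np.exp(-z))  # sigmoid maps z into the open interval (0, 1)

print(f"z = {z:.2f}, h(z) = {h:.4f}")
```

A positive $z$ gives a probability above 0.5 and a negative $z$ gives one below 0.5, which is what the decision threshold used later relies on.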

#### Cost function
The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] \tag{2} $$
* $\theta$ is the weight vector (written $w$ elsewhere in this notebook).
* $m$ is the number of training examples.
* $y^{(i)}$ is the actual label of the i-th training example.
* $h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.

We can rewrite it in matrix form:

$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + (\mathbf{1-y})^T \cdot \log(\mathbf{1-h}) \right)$$

The loss for a single training example is:
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$
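
The following sketch checks numerically that the summation form and the matrix form of the cost agree. The labels and predicted probabilities are invented for the example, and the names `J_sum` and `J_mat` are only illustrative.

```python
import numpy as np

# Made-up labels and predicted probabilities for m = 4 examples, shape (m, 1).
y = np.array([[1.0], [0.0], [1.0], [0.0]])
h = np.array([[0.9], [0.2], [0.6], [0.4]])
m = y.shape[0]

# Summation form: average the per-example log loss.
per_example_loss = -(y * np.log(h) + (1 - y) * np.log(1 - h))
J_sum = per_example_loss.mean()

# Matrix form: the dot products perform the same summation.
J_mat = (-(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m).item()

print(J_sum, J_mat)
```

Confident correct predictions (such as 0.9 for a positive label) contribute a small loss, while predictions close to 0.5 contribute more.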

#### Update weights

To update the weight vector $w$, we will apply **gradient descent** to iteratively improve the model's predictions.  

The gradient of the cost function $J$ with respect to one of the weights $w_j$ is:

$$\nabla_{w_j}J(w) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x_j^{(i)} \tag{3}$$
* $i$ is the index across all $m$ training examples.
* $j$ is the index of the weight $w_j$, so $x_j^{(i)}$ is the value of the feature associated with weight $w_j$ in the i-th example.

* To update the weight $w_j$, we subtract a fraction of the gradient determined by $\alpha$:

$$w_j = w_j - \alpha \times \nabla_{w_j}J(w) $$

* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.

In matrix form, all weights are updated at once:

$$\mathbf{w} = \mathbf{w} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$
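
One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the cost. The sketch below does this on a small random dataset and then applies a single update. The helper names `sigmoid` and `cost`, the random data, and the step size are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w):
    """Average log loss in matrix form, returned as a Python float."""
    h = sigmoid(X @ w)
    return (-(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / X.shape[0]).item()

# Small synthetic dataset: a bias column of ones plus two random features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y = np.array([[1.0], [0.0], [1.0], [1.0], [0.0]])
w = np.zeros((3, 1))

# Analytic gradient from the matrix formula.
grad = X.T @ (sigmoid(X @ w) - y) / X.shape[0]

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (cost(X, y, w + step) - cost(X, y, w - step)) / (2 * eps)

# One gradient descent update with a hypothetical learning rate.
alpha = 0.1
w_new = w - alpha * grad
print(np.allclose(grad, numeric, atol=1e-6), cost(X, y, w), cost(X, y, w_new))
```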

#### Sigmoid

In [2]:
def compute_sigmoid(z):
    h = 1/(1 + np.exp(-z))
    return h

In [3]:
# compare floats with a tolerance rather than exact equality
if np.isclose(compute_sigmoid(4.92), 0.9927537604041685):
    print('CORRECT!')

CORRECT!


#### Gradient descent

In [6]:
def compute_gradient_descent(x, y, weight, alpha, n_iters):
    '''
    Input:
        x: matrix of features, dimensions (m, n+1)
        y: corresponding labels of the input matrix x, dimensions (m, 1)
        weight: initial weight vector, dimensions (n+1, 1)
        alpha: learning rate
        n_iters: number of iterations to train the model for (at least 1)
    Output:
        J: the cost computed in the last iteration
        weight: the final weight vector
    '''
    m = x.shape[0]
    for i in range(n_iters):
        # linear combination of features and weights, shape (m, 1)
        z = np.dot(x, weight)
        # predicted probabilities
        h = compute_sigmoid(z)
        # average log loss, a (1, 1) array
        J = -(np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h))) / m

        # update the weights
        weight = weight - (1/m) * alpha * np.dot(x.T, (h - y))

    # extract the scalar cost from the (1, 1) array
    J = J.item()

    return J, weight
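
When tuning the learning rate, it can help to look at how the cost evolves rather than only at its final value. The sketch below is a variant of the function above that records the cost at every iteration. The name `gradient_descent_with_history` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_with_history(x, y, weight, alpha, n_iters):
    """Same update rule as above, but the cost of each iteration is recorded."""
    m = x.shape[0]
    history = []
    for _ in range(n_iters):
        h = sigmoid(x @ weight)
        cost = -(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m
        history.append(cost.item())
        weight = weight - (alpha / m) * (x.T @ (h - y))
    return history, weight

# Toy data: a bias column plus one feature; the label is 1 when the feature is positive.
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])

history, w = gradient_descent_with_history(x, y, np.zeros((2, 1)), alpha=0.1, n_iters=200)
predictions = (sigmoid(x @ w) > 0.5).astype(int)

print(f"first cost: {history[0]:.4f}, last cost: {history[-1]:.4f}")
print(predictions.ravel())
```

A cost that decreases steadily suggests the learning rate is small enough. A cost that oscillates or grows suggests it is too large.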

## Sentiment analysis

In [13]:
#import dataset
nltk.download('twitter_samples')

#import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/diouladoucoure/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/diouladoucoure/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [27]:
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords

#### Data preprocessing
* The `twitter_samples` corpus contains a subset of 5,000 positive tweets and a subset of 5,000 negative tweets.

In [15]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

This is an example of a positive tweet:

In [21]:
all_positive_tweets[2]

'@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!'

* **Train - test split**

In [22]:
# split the data into two pieces, one for training and one for testing (validation set) 
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg

In [23]:
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [24]:
# Print the shapes of the train and test label sets
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)


* **Cleaning**

Data preprocessing is one of the critical steps in any machine learning project. It includes cleaning and formatting the data before feeding it into a machine learning algorithm. For NLP, the preprocessing steps comprise the following tasks:
* Tokenizing the string
* Lowercasing
* Removing stop words and punctuation
* Stemming
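
The four steps listed above can be sketched with the standard library alone. This is only an illustration: the stop-word set is a tiny made-up sample, `naive_stem` is a crude stand-in for a real stemmer, and the regular-expression tokenizer does not handle emoticons or handles the way a tweet-aware tokenizer does.

```python
import re
import string

# Tiny illustrative stop-word set; a real English list is much longer.
STOP_WORDS = {"i", "am", "the", "a", "is", "of", "to", "and", "so"}
PUNCTUATION = set(string.punctuation)

def naive_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def simple_preprocess(text):
    # Lowercase, then tokenize into words and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Remove stop words and punctuation, then stem what is left.
    kept = [t for t in tokens if t not in STOP_WORDS and t not in PUNCTUATION]
    return [naive_stem(t) for t in kept]

cleaned = simple_preprocess("I am enjoying the tracks, and the show ended!")
print(cleaned)
```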

In [25]:
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks (this pattern also drops any text after the URL on the same line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # keep the hashtag word, remove only the '#' sign
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [29]:
# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


##### Word frequency

We are going to build a frequency dictionary. The key is the tuple (word, label), such as ("happy",1) or ("happy",0). The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times "happy" was associated with a negative label.
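
To make the structure of this dictionary concrete, here is a small sketch that counts `(word, label)` pairs with `collections.Counter`. The tokenized tweets and the name `freqs_toy` are invented for the example, and the tokens are assumed to be already preprocessed.

```python
from collections import Counter

# Hypothetical, already-tokenized tweets with their labels (1 = positive, 0 = negative).
tokenized_tweets = [["happi", "day"], ["sad", "day"], ["happi", "happi", "friend"]]
labels = [1, 0, 1]

freqs_toy = Counter()
for tokens, label in zip(tokenized_tweets, labels):
    for word in tokens:
        freqs_toy[(word, label)] += 1

print(freqs_toy[("happi", 1)])  # occurrences of "happi" in positive tweets
print(freqs_toy[("day", 0)])    # occurrences of "day" in negative tweets
print(freqs_toy[("happi", 0)])  # a pair that was never seen counts as 0
```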

In [40]:
def get_freqs(tweets, labels):
    """
    tweets : a list of tweets
    labels : is a list of labels associated to each tweet 
    """
    freq = {}
    for label, tweet in zip(labels, tweets):
        for word in process_tweet(tweet):
            if (word, label) in freq:
                freq[(word, label)] += 1
            else :
                freq[(word, label)] = 1
    return freq

In [42]:
# create frequency dictionary
freqs = get_freqs(train_x, np.squeeze(train_y).tolist())

In [43]:
freqs

{('followfriday', 1.0): 23,
 ('top', 1.0): 30,
 ('engag', 1.0): 7,
 ('member', 1.0): 14,
 ('commun', 1.0): 27,
 ('week', 1.0): 72,
 (':)', 1.0): 2847,
 ('hey', 1.0): 60,
 ('jame', 1.0): 7,
 ('odd', 1.0): 2,
 (':/', 1.0): 5,
 ('pleas', 1.0): 80,
 ('call', 1.0): 27,
 ('contact', 1.0): 4,
 ('centr', 1.0): 1,
 ('02392441234', 1.0): 1,
 ('abl', 1.0): 6,
 ('assist', 1.0): 1,
 ('mani', 1.0): 28,
 ('thank', 1.0): 504,
 ('listen', 1.0): 14,
 ('last', 1.0): 39,
 ('night', 1.0): 55,
 ('bleed', 1.0): 2,
 ('amaz', 1.0): 41,
 ('track', 1.0): 5,
 ('scotland', 1.0): 2,
 ('congrat', 1.0): 15,
 ('yeaaah', 1.0): 1,
 ('yipppi', 1.0): 1,
 ('accnt', 1.0): 2,
 ('verifi', 1.0): 2,
 ('rqst', 1.0): 1,
 ('succeed', 1.0): 1,
 ('got', 1.0): 57,
 ('blue', 1.0): 8,
 ('tick', 1.0): 1,
 ('mark', 1.0): 1,
 ('fb', 1.0): 4,
 ('profil', 1.0): 2,
 ('15', 1.0): 4,
 ('day', 1.0): 187,
 ('one', 1.0): 90,
 ('irresist', 1.0): 2,
 ('flipkartfashionfriday', 1.0): 16,
 ('like', 1.0): 187,
 ('keep', 1.0): 55,
 ('love', 1.0): 336,
 

##### Extracting the features

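* Given a list of tweets, we will extract the features and store them in a matrix. We will extract two features per tweet, in addition to a bias term.
    * The first feature is the sum, over the words in the tweet, of how often each word appears in positive training tweets.
    * The second feature is the sum, over the words in the tweet, of how often each word appears in negative training tweets.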
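
As a small worked example of these two features, the sketch below builds the feature vector for an already-tokenized tweet from a made-up frequency dictionary. The name `features_from_tokens` and the counts in `freqs_toy` are assumptions for the illustration.

```python
import numpy as np

# Hypothetical frequency dictionary keyed by (word, label).
freqs_toy = {("happi", 1): 3, ("day", 1): 1, ("day", 0): 1, ("sad", 0): 1}

def features_from_tokens(tokens, freqs):
    """Return [bias, positive frequency sum, negative frequency sum] as a (1, 3) array."""
    x = np.zeros((1, 3))
    x[0, 0] = 1.0  # bias term
    for word in tokens:
        x[0, 1] += freqs.get((word, 1), 0)  # counts under the positive label
        x[0, 2] += freqs.get((word, 0), 0)  # counts under the negative label
    return x

x_toy = features_from_tokens(["happi", "day", "unknown"], freqs_toy)
print(x_toy)
```

Here "happi" contributes 3 to the positive sum, "day" contributes 1 to each sum, and the unseen word contributes nothing.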

In [44]:
def extract_features(tweet, freqs):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3): bias, positive frequency sum, negative frequency sum
    '''
    word_list = process_tweet(tweet)
    x = np.zeros((1, 3)) 
    
    #bias term set to 1
    x[0,0] = 1 
    
    for word in word_list:
        # increment the word count for the positive label 1
        x[0,1] += freqs.get((word,1),0)
        # increment the word count for the negative label 0
        x[0,2] += freqs.get((word,0),0)
    assert(x.shape == (1, 3))
    return x

##### Training

In [48]:
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :]= extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = compute_gradient_descent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")

The cost after training is 0.24216529.
The resulting vector of weights is [7e-08, 0.0005239, -0.00055517]


#### Make predictions

In [50]:
def predict(tweet, freqs, weight):
    x = extract_features(tweet, freqs)
    z = np.dot(x, weight)
    pred = compute_sigmoid(z)
    return pred

Let's test the model on the following examples:

In [54]:
for tweet in ['such a beautiful movie', "she's pretty", 'I hate pizza.', 'great', 'Awful', 'oh my god, that is horrible']:
    # predict returns a (1, 1) array, so extract the scalar before formatting
    print('%s -> %f' % (tweet, predict(tweet, freqs, theta).item()))

such a beautiful movie -> 0.504373
she's pretty -> 0.500561
I hate pizza. -> 0.494093
great -> 0.515464
Awful -> 0.496098
oh my god, that is horrible -> 0.493987


##### Performance metrics

In [73]:
def get_labels(test_x, freqs, weight):
    """
    This function returns the predicted labels (0 or 1) and the
    predicted probabilities for every tweet in test_x
    """
    predicted_labels = []
    score = []
    for tweet in test_x:
        pred = predict(tweet, freqs, weight)
        score.append(pred)
        
        # label the tweet as positive when the probability exceeds 0.5
        if pred > 0.5 :
            predicted_labels.append(1)
        else :
            predicted_labels.append(0)
            
    return predicted_labels, score

In [59]:
predicted_labels = get_labels(test_x, freqs, theta)[0]

In [66]:
true_labels = list(np.squeeze(test_y))

In [70]:
def metrics_classification_report(observed, predicted, column_name):
    df_metrics_report = pd.DataFrame(columns = [column_name], index = ['Accuracy', 'F1', 'Precision','Recall'])
    # use .loc for a single assignment instead of chained indexing
    df_metrics_report.loc['Accuracy', column_name] = metrics.accuracy_score(observed, predicted)
    df_metrics_report.loc['F1', column_name] = metrics.f1_score(observed, predicted)
    df_metrics_report.loc['Precision', column_name] = metrics.precision_score(observed, predicted)
    df_metrics_report.loc['Recall', column_name] = metrics.recall_score(observed, predicted)
    return df_metrics_report
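
For reference, the four metrics in this report can also be computed by hand from the confusion counts. The sketch below does so with NumPy on made-up labels; it is meant to show the definitions and is not a replacement for the `sklearn.metrics` functions used above.

```python
import numpy as np

# Made-up true and predicted labels for eight examples.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```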

In [71]:
df_performance = metrics_classification_report(true_labels, predicted_labels, 'Metrics')

In [72]:
df_performance

| | Metrics |
|---|---|
| Accuracy | 0.995 |
| F1 | 0.99499 |
| Precision | 0.996988 |
| Recall | 0.993 |


Let's predict the sentiment of our own tweet:

In [80]:
# Feel free to change the tweet below
my_tweet = 'I really enjoyed the movie. It was amazing'
print(process_tweet(my_tweet))
y_hat = predict(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

['realli', 'enjoy', 'movi', 'amaz']
[[0.50441411]]
Positive sentiment
