
## Logistic Regression for Sentiment Analysis

This notebook trains a logistic regression classifier on the `twitter_samples` dataset of tweets that ships with NLTK.

Twitter HOWTO: https://www.nltk.org/howto/twitter.html

### Importing NLTK Library


* `twitter_samples`: when running this notebook on your local computer, download the corpus first:
```python
nltk.download('twitter_samples')
```

* `stopwords`: when running this notebook on your local computer, download the stopword list first:
```python
nltk.download('stopwords')
```

In [1]:
import nltk
nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples

### Creating functions for Processing and Frequency Calculation

#### Process (`process()`)
* Clean the text
* Tokenize into separate words
* Remove Stopwords
* Convert words to stems

#### Frequency (`frequency()`)
* Counts the occurrences of each `(word, label)` pair in the corpus
* Label `1` marks a positive tweet
* Label `0` marks a negative tweet
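
As a rough illustration of the dictionary that `frequency()` builds, the sketch below counts `(word, label)` pairs for a few made-up tweets. It assumes a simplified whitespace tokenizer (`toy_process`, a hypothetical stand-in for `process()`) so that it runs without NLTK data; it is not the notebook's actual preprocessing.

```python
from collections import Counter

def toy_process(tweet):
    # Hypothetical stand-in for process(): lowercase and split on whitespace only.
    return tweet.lower().split()

def toy_frequency(tweets, labels):
    # Count how often each word appears in tweets carrying each label.
    freqs = Counter()
    for tweet, label in zip(tweets, labels):
        for word in toy_process(tweet):
            freqs[(word, label)] += 1
    return dict(freqs)

# Made-up example data: 1 = positive, 0 = negative.
toy_tweets = ["happy happy day", "sad day", "happy news"]
toy_labels = [1, 0, 1]

toy_freqs = toy_frequency(toy_tweets, toy_labels)
print(toy_freqs[("happy", 1)])       # 3
print(toy_freqs[("day", 1)])         # 1
print(toy_freqs[("day", 0)])         # 1
print(toy_freqs.get(("sad", 1), 0))  # 0
```

The same word can appear under both labels, which is what later lets each tweet be summarised by a positive count and a negative count.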

In [3]:
#<----------- Libraries ----------->

import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

#<----------- Processing ----------->

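def process(tweet):
  stemmer = PorterStemmer()
  stopwords_en = stopwords.words('english')
  # Remove stock-market tickers such as $GE
  tweet = re.sub(r'\$\w*','',tweet)
  # Remove the old-style retweet marker "RT" at the start of the tweet
  tweet = re.sub(r'^RT[\s]+','',tweet)
  # Remove hyperlinks (this pattern also drops any text after the link on the same line)
  tweet = re.sub(r'https?:\/\/.*[\r\n]*','',tweet)
  # Remove only the '#' sign, keeping the hashtag word
  tweet = re.sub(r'#','',tweet)

  # Lowercase, strip @handles, and shorten runs of repeated characters
  tokenizer = TweetTokenizer(preserve_case=False,strip_handles=True,reduce_len=True)
  tokens = tokenizer.tokenize(tweet)

  tweet_clean = []
  for word in tokens:
    # Keep tokens that are neither stopwords nor punctuation, then stem them
    if word not in stopwords_en and word not in string.punctuation:
      tweet_clean.append(stemmer.stem(word))
  return tweet_clean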

#<----------- Frequency----------->

def frequency(tweets,labels):
  labels = np.squeeze(labels).tolist()

  freqs = {}

  for label, tweet in zip(labels,tweets):
    for word in process(tweet):
      pair = (word,label)
      if pair in freqs:
        freqs[pair] += 1
      else:
        freqs[pair] = 1
  return freqs

### Positive and Negative Tweets separated

In [4]:
positive_all = twitter_samples.strings('positive_tweets.json')
negative_all = twitter_samples.strings('negative_tweets.json')

In [5]:
#<----------- Splitting data ----------->
train_split = 0.8

#80% training, 20% testing

Pos_train = positive_all[:int(len(positive_all)*train_split)]
Neg_train = negative_all[:int(len(negative_all)*train_split)]
Pos_test = positive_all[int(len(positive_all)*train_split):]
Neg_test = negative_all[int(len(negative_all)*train_split):]

train_X = Pos_train + Neg_train
test_X = Pos_test + Neg_test

train_Y = np.append(np.ones((len(Pos_train),1)),np.zeros((len(Neg_train),1)),axis = 0)
test_Y = np.append(np.ones((len(Pos_test),1)),np.zeros((len(Neg_test),1)),axis = 0)


In [6]:
print("train_Y.shape = "+str(train_Y.shape))
print("test_Y.shape = "+str(test_Y.shape))

train_Y.shape = (8000, 1)
test_Y.shape = (2000, 1)


In [7]:
freqs = frequency(train_X,train_Y)

print('This is an example of a positive tweet: \n', train_X[0])
print('\nThis is an example of the processed version of the tweet: \n', process(train_X[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


### Sigmoid Function

Logistic regression takes the output of a linear regression and passes it through a sigmoid function.

Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
Note that the $\theta$ values are the "weights".

Logistic regression:
$$ h(z) = \frac{1}{1+e^{-z}}$$

where $z$ is the linear combination defined above. The values of $z$ are known as logits.
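
To make the two steps concrete, here is a minimal sketch in plain Python that computes the logit $z$ and the probability $h(z)$ for a single feature vector. The weights and features are made-up values chosen for illustration, not outputs of this notebook, and `sigmoid_scalar` is a hypothetical helper name.

```python
import math

def sigmoid_scalar(z):
    # Logistic function for a single float.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and features; x[0] = 1 acts as the bias input.
theta = [0.0, 0.5, -0.5]
x = [1.0, 4.0, 2.0]

# Linear part: z = theta_0*x_0 + theta_1*x_1 + theta_2*x_2
z = sum(t * xi for t, xi in zip(theta, x))
h = sigmoid_scalar(z)

print(z)            # 1.0
print(round(h, 4))  # 0.7311
```

A logit of 0 maps to a probability of exactly 0.5, positive logits map above 0.5, and negative logits map below it.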


In [8]:
def sigmoid(z):
  h = 1.0/(1.0+np.exp(-z))
  return h


#testing sigmoid function:

if(sigmoid(0)== 0.5):
  print("Hello! Correct")
print(sigmoid(1))

Hello! Correct
0.7310585786300049


### Gradient Descent

#### Cost Function

The cost function in logistic regression is the average of the log losses across all training examples:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right]$$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of the i-th training example.
* $h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0. 
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0. 
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value. The closer the model prediction gets to 1, the larger the loss.
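
The behaviour described in these bullets can be checked numerically. The sketch below evaluates the single-example loss for three cases; the clipping constant `eps` is an assumption added here so that `math.log` never receives 0, and is not part of the formula above.

```python
import math

def log_loss(y, h, eps=1e-15):
    # Clip h away from exactly 0 and 1 so math.log never receives 0.
    # The clipping constant eps is an assumption of this sketch.
    h = min(max(h, eps), 1.0 - eps)
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))

confident_right = log_loss(1, 0.9999)  # label 1, prediction near 1
unsure = log_loss(1, 0.5)              # label 1, prediction 0.5
confident_wrong = log_loss(0, 0.9999)  # label 0, prediction near 1

print(round(confident_right, 4))  # 0.0001
print(round(unsure, 4))           # 0.6931
print(round(confident_wrong, 4))  # 9.2103
```

A confident wrong prediction is penalised far more heavily than an unsure one, which is what pushes the weights away from such mistakes during training.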

#### Update the weights

To update the weight vector $\theta$, gradient descent is applied iteratively to improve the model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x_j^{(i)}$$
* $i$ is the index across all $m$ training examples.
* $j$ is the index of the weight $\theta_j$, so $x_j^{(i)}$ is the feature of the i-th example associated with weight $\theta_j$.

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
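
As a minimal sketch of the update rule, the pure-Python loop below runs gradient descent on a tiny made-up dataset and checks that the cost drops. The data, the learning rate of 0.1, and the helper names are assumptions chosen for this toy problem; the notebook's own implementation uses NumPy and different settings.

```python
import math

def sigmoid_scalar(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(X, y, theta):
    # Average log loss J and its gradient, following the formulas above.
    m = len(X)
    grad = [0.0] * len(theta)
    cost = 0.0
    for x_i, y_i in zip(X, y):
        h_i = sigmoid_scalar(sum(t * v for t, v in zip(theta, x_i)))
        cost += -(y_i * math.log(h_i) + (1 - y_i) * math.log(1.0 - h_i))
        for j, x_ij in enumerate(x_i):
            grad[j] += (h_i - y_i) * x_ij
    return cost / m, [g / m for g in grad]

# Made-up data: each row is [bias, feature]; larger feature values are labelled 1.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
alpha = 0.1  # learning rate, chosen arbitrarily for this toy problem

initial_cost, _ = cost_and_gradient(X, y, theta)
for _ in range(100):
    _, grad = cost_and_gradient(X, y, theta)
    # theta_j := theta_j - alpha * dJ/dtheta_j
    theta = [t - alpha * g for t, g in zip(theta, grad)]
final_cost, _ = cost_and_gradient(X, y, theta)

print(round(initial_cost, 4))     # 0.6931
print(final_cost < initial_cost)  # True
```

With all weights at zero every prediction is 0.5, so the starting cost is $\log 2 \approx 0.6931$; each update then moves the weight on the feature in the direction that separates the two classes.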



In [9]:
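def gradient_descent(x,y,theta,alpha, num_iters):
  # x: (m, n) features, y: (m, 1) labels, theta: (n, 1) weights
  m = x.shape[0]

  for i in range(0,num_iters):
    # Logits and predicted probabilities for all examples
    z = np.dot(x,theta)
    h = sigmoid(z)

    # Average log loss, computed before this iteration's update
    J = ((-1/m)*(np.dot(y.T, np.log(h))+np.dot((1-y).T,np.log(1-h))))

    # Step against the gradient of J
    theta = theta - (alpha/m)*np.dot((x.T),(h-y))

  # J is a (1, 1) array; extract its single element as a Python float
  J = J.item()
  return J, theta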

### Extracting Features

In [10]:
def features_extract(tweet,freqs):
  word_l = process(tweet)

  x = np.zeros((1,3))
  x[0,0] = 1

  for word in word_l:
    x[0,1] += freqs.get((word,1.0),0)
    x[0,2] += freqs.get((word,0.0),0)

  assert(x.shape == (1,3))
  return x
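
To show what the three features look like, here is a small sketch that builds the same `[bias, positive count, negative count]` vector from a hand-made frequency table. The table values and the helper `toy_features` are hypothetical, and the words are assumed to be already processed.

```python
# Hypothetical frequency table in the same (word, label) -> count form.
toy_freqs = {
    ("happi", 1.0): 160, ("happi", 0.0): 20,
    ("sad", 1.0): 5, ("sad", 0.0): 100,
}

def toy_features(words, freqs):
    # [bias, sum of positive counts, sum of negative counts]
    pos = sum(freqs.get((w, 1.0), 0) for w in words)
    neg = sum(freqs.get((w, 0.0), 0) for w in words)
    return [1, pos, neg]

print(toy_features(["happi", "sad"], toy_freqs))  # [1, 165, 120]
print(toy_features(["unseen"], toy_freqs))        # [1, 0, 0]
```

Words missing from the table contribute nothing, and a repeated word contributes its counts once per occurrence.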

### Training the Model

In [11]:
X = np.zeros((len(train_X),3))

for i in range(len(train_X)):
  X[i,:] = features_extract(train_X[i],freqs)

Y = train_Y

J, theta = gradient_descent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost post-training: {J:.8f}.")
print(f"Vector of weights: {[round(t,8) for t in np.squeeze(theta)]}")

The cost post-training: 0.24216529.
Vector of weights: [7e-08, 0.0005239, -0.00055517]


### Testing the Model

In [12]:
def predict(tweet,freqs,theta):
  x = features_extract(tweet,freqs)

  y_pred = sigmoid(np.dot(x,theta))

  return y_pred

In [13]:
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print( '%s -> %f' % (tweet, predict(tweet, freqs, theta).item()))

I am happy -> 0.518580
I am bad -> 0.494339
this movie should have been great. -> 0.515331
great -> 0.515464
great great -> 0.530898
great great great -> 0.546273
great great great great -> 0.561561


### Checking the performance of the model

In [14]:
def Accuracy(test_X,test_Y,freqs,theta):
  y_hat = []

  for tweet in test_X:
    y_pred = predict(tweet,freqs,theta)
    
    # Treat a score of exactly 0.5 as negative so every tweet gets a prediction
    if y_pred > 0.5:
      y_hat.append(1)
    else:
      y_hat.append(0)

  accuracy = (np.asarray(y_hat) == np.squeeze(test_Y)).sum()/len(test_X)

  return accuracy

In [15]:
accuracy = Accuracy(test_X,test_Y,freqs,theta)
print(f"Model Accuracy = {accuracy:.4f}")

Model Accuracy = 0.9950


The model classifies 99.50% of the 2,000 test tweets correctly.

### Test against custom data

In [19]:
# Feel free to change the tweet below
#[A random sentence here]
my_tweet = 'Hello World! Nice to meet you!'
print(process(my_tweet))
y_hat = predict(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

['hello', 'world', 'nice', 'meet']
[[0.51339539]]
Positive sentiment
