
# Sentiment Analysis Project



## Import the necessary libraries and download the "stopwords" dataset from NLTK


In [None]:
import nltk, re, string
from nltk.corpus import stopwords, twitter_samples
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import euclidean_distances
from collections import Counter
import pandas as pd

nltk.download('stopwords')



## Load Dataset 1
5,000 negative tweets and 5,000 positive tweets

In [None]:
nltk.download('twitter_samples')

all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

## Load Dataset 2

10,000 positive tweets and 10,000 negative tweets


In [None]:

positive_df = pd.read_excel('positive_tweets.xlsx', header=None, dtype=str)
negative_df = pd.read_excel('negative_tweets.xlsx', header=None, dtype=str)


all_positive_tweets = positive_df[0].astype(str).tolist()
all_negative_tweets = negative_df[0].astype(str).tolist()

## Pre-process the data
Remove the retweet marker ("RT"), URLs, stock tickers, and hash signs; strip handles; shorten elongated words; filter out stopwords, punctuation, and numbers; and reduce each remaining word to its stem.

In [None]:
import re

def is_number(word):
    if word.isdigit():
        return True

    try:
        float(word)
        return True
    except ValueError:
        pass

    if re.match(r"^[+-]?\d{1,3}(?:,\d{3})*(?:\.\d+)?$", word):
        return True

    return False

words = ["123", "123.45", "1,000", "-123.45", "3.14e2", "abc"]
for word in words:
    if is_number(word):
        print(f"{word} is a number.")
    else:
        print(f"{word} is not a number.")


123 is a number.
123.45 is a number.
1,000 is a number.
-123.45 is a number.
3.14e2 is a number.
abc is not a number.
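One caveat with the `float()` check above: Python's `float()` also accepts the strings "nan", "inf", and "infinity", so those tokens would be classed as numbers and dropped from a tweet. The sketch below is a hypothetical stricter variant (the name `is_number_strict` is not part of this notebook) that relies on a single regular expression instead. It is one possible alternative under that assumption, and it deliberately rejects forms such as ".5" that `float()` would accept.

```python
import re

# Hypothetical stricter variant: one regex instead of float(), so that
# words such as "nan", "inf", or "infinity" are not treated as numbers.
_NUMBER_RE = re.compile(
    r"^[+-]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:[eE][+-]?\d+)?$"
)

def is_number_strict(word):
    """Return True for plain, comma-grouped, decimal, or exponent numbers."""
    return bool(_NUMBER_RE.match(word))

for word in ["123", "123.45", "1,000", "-123.45", "3.14e2", "abc", "nan", "inf"]:
    print(word, is_number_strict(word))
```

The first five words are reported as numbers and the last three are not.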


In [None]:
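def process_tweet(tweet):
    stemmer = nltk.PorterStemmer()
    # A set makes the per-token stopword lookup fast.
    stopwords_english = set(stopwords.words('english'))
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock tickers such as $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading retweet marker
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # URL and the rest of its line
    tweet = re.sub(r'#', '', tweet)                     # hash sign only; the tag text stays
    # Lowercase, drop @handles, and shorten runs of repeated characters.
    tokenizer = nltk.TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        # Keep only content words: no stopwords, punctuation, or numbers.
        if (word not in stopwords_english and
                word not in string.punctuation and not is_number(word)):
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)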

    return tweets_clean
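To see what the four regular expressions in `process_tweet` do on their own, the sketch below applies them to two made-up tweets without the NLTK tokenizer. The helper name `strip_noise` and the sample tweets are illustrative assumptions. Note that the URL pattern ends in `.*`, so it removes everything from the URL to the end of that line, including any words that follow it.

```python
import re

def strip_noise(tweet):
    """Apply only the regex clean-up steps from process_tweet."""
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock tickers
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading "RT "
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # URL to end of line
    tweet = re.sub(r'#', '', tweet)                     # hash sign only
    return tweet

# split() is used here only to normalize whitespace for display.
print(strip_noise("RT @user: Loving $AAPL today #happy https://t.co/abc123").split())
# ['@user:', 'Loving', 'today', 'happy']
print(strip_noise("see https://t.co/abc123 great day").split())
# ['see']
```

In the notebook itself the handle `@user` would then be removed by `TweetTokenizer(strip_handles=True)`.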

## Count how many times each word was mentioned in positive and negative tweets.

    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency

  The output is a frequency dictionary of the form:
  { ('happy', 1) : 1, ('sad', 0) : 2 }

In [None]:
def build_freqs(tweets, ys):
    yslist = np.squeeze(ys).tolist()
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
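The same counting can be written with `collections.Counter`, which the notebook imports but does not use. This is a sketch of an equivalent formulation, not a required change. The name `build_freqs_counter` is hypothetical, and the `tokenize` parameter defaults to `str.split` only so that the example runs without NLTK; in the notebook it would be `process_tweet`.

```python
from collections import Counter

def build_freqs_counter(tweets, ys, tokenize=str.split):
    """Count (word, label) pairs; `tokenize` stands in for process_tweet."""
    freqs = Counter()
    for y, tweet in zip(ys, tweets):
        # update() increments the count of every (word, label) pair it receives.
        freqs.update((word, y) for word in tokenize(tweet))
    return freqs

demo = build_freqs_counter(["happy happy day", "sad day"], [1, 0])
print(demo[("happy", 1)], demo[("day", 0)], demo[("missing", 1)])  # 2 1 0
```

A `Counter` returns 0 for pairs it has never seen, so lookups need no default value.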

## Test the above code

In [None]:
tweets = ['i not happy','i am too tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
res = build_freqs(tweets, ys)
print(res)

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}


## Splitting Data for Training and Testing

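For Dataset 1, the first 4,000 tweets of each class are used for training and the last 1,000 for testing.

For Dataset 2, the first 8,000 tweets of each class are used for training and the last 2,000 for testing. The cell below uses the Dataset 2 split.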

In [None]:
train_pos = all_positive_tweets[:8000]
train_neg = all_negative_tweets[:8000]
test_pos = all_positive_tweets[8000:]
test_neg = all_negative_tweets[8000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
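Because the split size differs between the two datasets, it can help to keep it in one parameter. The helper below is a minimal sketch of that idea using plain lists. The name `split_by_class` and the toy data are assumptions for illustration, and the labels are returned as lists rather than the `(m, 1)` NumPy arrays used in this notebook.

```python
def split_by_class(pos, neg, n_train):
    """Use the first n_train items of each class for training, the rest for testing."""
    train_pos, test_pos = pos[:n_train], pos[n_train:]
    train_neg, test_neg = neg[:n_train], neg[n_train:]
    train_x = train_pos + train_neg
    test_x = test_pos + test_neg
    # Labels: 1 for positive, 0 for negative, in the same order as the tweets.
    train_y = [1] * len(train_pos) + [0] * len(train_neg)
    test_y = [1] * len(test_pos) + [0] * len(test_neg)
    return train_x, train_y, test_x, test_y

# Toy stand-ins for the tweet lists: 5 items per class, 4 used for training.
pos = [f"pos{i}" for i in range(5)]
neg = [f"neg{i}" for i in range(5)]
train_x, train_y, test_x, test_y = split_by_class(pos, neg, 4)
print(len(train_x), len(test_x), test_y)  # 8 2 [1, 0]
```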

## Create Frequency Dictionary

In [None]:
freqs = build_freqs(train_x, train_y)
print("len(freqs) = " + str(len(freqs.keys())))

len(freqs) = 19300


## Test the above data splitting

In [None]:
print('This is an example of a positive tweet: \n', train_x[22])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[22]))

This is an example of a positive tweet: 
 @blitzmegaplex u're just teasing me, right? Well, I'm sold! Is waiting for the 21.30 show @ GI 

This is an example of the processed version of the tweet: 
 ["u'r", 'teas', 'right', 'well', "i'm", 'sold', 'wait', 'show', 'gi']


## Extracting the Features

    Input:
        tweet: a string containing one raw tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3): [bias, positive score, negative score]

In [None]:
def extract_features(tweet, freqs):
    word_l = process_tweet(tweet)
    x = np.zeros((1, 3))
    x[0, 0] = 1
    for word in word_l:
        x[0, 1] += freqs.get((word, 1.0), 0) / max(1, freqs.get((word, 1.0), 0) + freqs.get((word, 0.0), 0))
        x[0, 2] += freqs.get((word, 0.0), 0) / max(1, freqs.get((word, 1.0), 0) + freqs.get((word, 0.0), 0))
    return x
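A small worked example may make the two scores concrete. The sketch below repeats the arithmetic of `extract_features` on already-processed words with a hand-made frequency dictionary. The name `toy_features` and the toy counts are assumptions for illustration, and plain lists replace the NumPy array.

```python
def toy_features(words, freqs):
    """Return [bias, positive score, negative score] for already-processed words."""
    pos_score = neg_score = 0.0
    for word in words:
        pos = freqs.get((word, 1), 0)
        neg = freqs.get((word, 0), 0)
        total = max(1, pos + neg)  # avoids division by zero for unseen words
        pos_score += pos / total
        neg_score += neg / total
    return [1.0, pos_score, neg_score]

toy_freqs = {("happi", 1): 3, ("happi", 0): 1, ("sad", 0): 2}
print(toy_features(["happi", "sad", "unseen"], toy_freqs))  # [1.0, 0.75, 1.25]
```

Each word seen in training contributes a positive share and a negative share that add up to 1, so the two scores together equal the number of known words in the tweet (2 in this example). Unseen words contribute nothing to either score.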

## Test the above function

In [None]:
tmp1 = extract_features(train_x[22], freqs)
print(tmp1)

[[1.         5.14750277 3.85249723]]


## Sigmoid Function and Gradient Descent

In [None]:
def sigmoid(z):
    zz = np.negative(z)
    h = 1 / (1 + np.exp(zz))
    return h

def gradientDescent(x, y, theta, alpha, num_iters):
    m = x.shape[0]
    for i in range(0, num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        cost = -1. / m * (np.dot(y.transpose(), np.log(h)) + np.dot((1 - y).transpose(), np.log(1 - h)))
        theta = theta - (alpha / m) * np.dot(x.transpose(), (h - y))

    cost = float(np.squeeze(cost))  # cost is a (1, 1) array; squeeze it before converting
    return cost, theta
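In matrix form, the loop in `gradientDescent` corresponds to the quantities below, written with the code's own names (`x` as $X$, `theta` as $\theta$, `alpha` as $\alpha$, and `m` as the number of rows of $X$):

$$
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
J(\theta) = -\frac{1}{m}\left[\, y^{\top}\log h + (1 - y)^{\top}\log(1 - h) \,\right]
$$

$$
\theta \leftarrow \theta - \frac{\alpha}{m}\, X^{\top}(h - y)
$$

Since the cost is computed before the update inside each iteration, the value returned by the function is the cost of the weights as they were just before the final update.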

## Training the Data

In [None]:
X_train = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X_train[i, :] = extract_features(train_x[i], freqs)


X_test = np.zeros((len(test_x), 3))
for i in range(len(test_x)):
    X_test[i, :] = extract_features(test_x[i], freqs)

J, theta = gradientDescent(X_train, train_y, np.zeros((3, 1)), 1e-9, 15000)
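The call above uses `alpha = 1e-9` for 15,000 iterations. Since each update moves theta by `alpha / m` times a gradient sum, such a small rate leaves the weights very close to their zero initialization. The sketch below illustrates the effect on toy data with a pure-Python, one-feature logistic regression. The `train` helper, the data, and the comparison rate of 0.1 are illustrative assumptions, and whether a larger rate would help on the tweet features is not something this notebook tests.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, alpha, num_iters):
    """Batch gradient descent for logistic regression with a bias and one weight."""
    b, w = 0.0, 0.0
    m = len(xs)
    for _ in range(num_iters):
        # Prediction error for every sample under the current parameters.
        errs = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        b -= alpha / m * sum(errs)
        w -= alpha / m * sum(e * x for e, x in zip(errs, xs))
    return b, w

xs = [-2.0, -1.0, 1.0, 2.0]   # toy feature values
ys = [0, 0, 1, 1]             # toy labels
tiny = train(xs, ys, 1e-9, 15000)
larger = train(xs, ys, 0.1, 15000)
print(f"alpha=1e-9: w = {tiny[1]:.2e}")
print(f"alpha=0.1:  w = {larger[1]:.2f}")
```

With the tiny rate the weight stays around the order of 1e-5, while the larger rate lets it grow well above 1 on this toy problem.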

    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the predicted probability that the tweet is positive
        # if y_pred > 0.5 => Positive
        # else => Negative

In [None]:
def predict_tweet(tweet, freqs, theta):
    x = extract_features(tweet, freqs)
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred

    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)

In [None]:
def test_logistic_regression(test_x, test_y, freqs, theta):
    y_hat = []
    for tweet in test_x:
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            y_hat.append(1)
        else:
            y_hat.append(0)

    accuracy = (y_hat == np.squeeze(test_y)).sum() / len(test_x)
    return accuracy
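Accuracy alone does not show which class the errors fall in. As a possible complement, the sketch below counts true and false positives and negatives next to the accuracy. The function name `accuracy_and_confusion` and the toy labels are assumptions for illustration; it works on plain lists of 0/1 labels.

```python
def accuracy_and_confusion(y_true, y_pred):
    """Return accuracy plus counts of true/false positives and negatives."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == 1:
            counts["tp" if t == 1 else "fp"] += 1
        else:
            counts["tn" if t == 0 else "fn"] += 1
    accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
    return accuracy, counts

acc, counts = accuracy_and_confusion([1, 1, 0, 0], [1, 0, 0, 0])
print(acc, counts)  # 0.75 {'tp': 1, 'tn': 2, 'fp': 0, 'fn': 1}
```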

## Test the above functions

      Accuracy = Correct Predictions / Total Predictions

In [None]:
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

Logistic regression model's accuracy = 0.6370


## kNN Test Function

In [None]:
def test_knn(test_x, test_y, knn):
    # Note: test_x is not used; predictions come from the global feature matrix X_test.
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(test_y, y_pred)
    return accuracy

## Initialize kNN Classifier and Fit

In [None]:
k = 19
knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
knn.fit(X_train, np.squeeze(train_y))
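Conceptually, a k-nearest-neighbours classifier labels a new point by majority vote among the k closest training points. The pure-Python sketch below shows that idea with Euclidean distance and unweighted votes. It is a conceptual illustration with made-up feature vectors, not scikit-learn's implementation, and the name `knn_predict` is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy [bias, positive score, negative score] vectors with their labels.
points = [(1.0, 3.0, 0.5), (1.0, 2.5, 1.0), (1.0, 0.5, 3.0),
          (1.0, 1.0, 2.5), (1.0, 0.8, 2.8)]
labels = [1, 1, 0, 0, 0]
print(knn_predict(points, labels, (1.0, 2.8, 0.7), k=3))  # 1
```

An odd k avoids tied votes in a two-class problem such as this one.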

## Test the Above Functions

In [None]:
accuracy_knn = test_knn(test_x, np.squeeze(test_y), knn)
print(f"kNN model's accuracy with k={k} = {accuracy_knn:.4f}")

kNN model's accuracy with k=19 = 0.7153


## Choosing the Right K

In [None]:
results = []

for k in range(1, 51, 2):
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, np.squeeze(train_y))
    accuracy_knn = test_knn(test_x, np.squeeze(test_y), classifier)
    results.append([k, accuracy_knn])
    print ("k=",k," Accuracy=", accuracy_knn)

k= 1  Accuracy= 0.67325
k= 3  Accuracy= 0.6985
k= 5  Accuracy= 0.70425
k= 7  Accuracy= 0.70175
k= 9  Accuracy= 0.7045
k= 11  Accuracy= 0.707
k= 13  Accuracy= 0.71075
k= 15  Accuracy= 0.71175
k= 17  Accuracy= 0.71375
k= 19  Accuracy= 0.71525
k= 21  Accuracy= 0.71525
k= 23  Accuracy= 0.71625
k= 25  Accuracy= 0.716
k= 27  Accuracy= 0.71625
k= 29  Accuracy= 0.71675
k= 31  Accuracy= 0.7175
k= 33  Accuracy= 0.717
k= 35  Accuracy= 0.7175
k= 37  Accuracy= 0.71825
k= 39  Accuracy= 0.71625
k= 41  Accuracy= 0.71925
k= 43  Accuracy= 0.71725
k= 45  Accuracy= 0.71775
k= 47  Accuracy= 0.71725
k= 49  Accuracy= 0.71725
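Once the `(k, accuracy)` pairs are collected, the best k can be read off programmatically. The sketch below does this for a few rows copied from the output above; the variable names are illustrative. One caveat, stated as a general observation rather than a finding of this notebook: choosing k on the same test set that is later used to report accuracy tends to make that accuracy optimistic, so a separate validation split would be a cleaner basis for the choice.

```python
# (k, accuracy) pairs copied from a few rows of the printed results.
results = [(1, 0.67325), (19, 0.71525), (37, 0.71825), (41, 0.71925), (49, 0.71725)]

# max() with a key compares the pairs by accuracy only.
best_k, best_acc = max(results, key=lambda pair: pair[1])
print(best_k, best_acc)  # 41 0.71925
```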


## Now let's predict new data using kNN and Logistic regression

### *New Input*

In [None]:
my_tweet = 'i am happy'

### *kNN*

In [None]:
x = extract_features(my_tweet, freqs)
predicted_sentiment_knn = knn.predict(x)

if predicted_sentiment_knn == 1:
    print('kNN prediction: Positive sentiment')
else:
    print('kNN prediction: Negative sentiment')

kNN prediction: Positive sentiment


### *Logistic Regression*

In [None]:
def pre(sentence):
    yhat = predict_tweet(sentence, freqs, theta)
    if yhat > 0.5:
        return 'Positive sentiment'
    else:
        return 'Negative sentiment'

res = pre(my_tweet)
print(f"Logistic regression prediction: {res}")

Logistic regression prediction: Positive sentiment


## What if we input:
    I am NEVER happy

In [None]:
my_tweet = 'I am NEVER happy'

x = extract_features(my_tweet, freqs)
predicted_sentiment_knn = knn.predict(x)

if predicted_sentiment_knn == 1:
    print('kNN prediction: Positive sentiment')
else:
    print('kNN prediction: Negative sentiment')

def pre(sentence):
    yhat = predict_tweet(sentence, freqs, theta)
    if yhat > 0.5:
        return 'Positive sentiment'
    else:
        return 'Negative sentiment'

res = pre(my_tweet)
print(f"Logistic regression prediction: {res}")

kNN prediction: Positive sentiment
Logistic regression prediction: Positive sentiment


## Process the data (with negation)

Why does the original processing give an obviously negative sentence a positive prediction?

Feature extraction does not differentiate between "happy" and "not happy": "not" is removed as a stopword, so both reduce to "happy", which carries a strongly positive score. To fix this, we change the preprocessing step to handle negation words such as "not", "no", "never", and "n't".

In [None]:
def process_tweet(tweet):
    # nltk, re, and string are already imported at the top of the notebook.
    stemmer = nltk.PorterStemmer()
    stopwords_english = stopwords.words('english')
    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tokenizer = nltk.TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    negation = False
    for word in tweet_tokens:
        if word in ["not", "no", "never", "n't"]:
            negation = True
            continue
        if negation and (word not in stopwords_english and word not in string.punctuation and  is_number(word)==False):
            word = "NEG_" + word
            negation = False
        if word not in stopwords_english and word not in string.punctuation and is_number(word)==False:
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
    return tweets_clean
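The negation logic can be tried in isolation on a token list. The sketch below mirrors the idea of the function above: remember that a negation word was seen, then prefix the next content word with "NEG_". The name `mark_negation`, the tiny stopword set, and the sample tokens are assumptions for illustration. It also treats any token ending in "n't" as a negation, on the assumption that the tweet tokenizer keeps contractions such as "don't" as single tokens, as the earlier sample output containing "i'm" suggests.

```python
NEGATORS = {"not", "no", "never"}

def mark_negation(tokens, stopwords=frozenset({"i", "am", "so", "do"})):
    """Prefix the first content word after a negation with 'NEG_'."""
    out = []
    negate = False
    for tok in tokens:
        if tok in NEGATORS or tok.endswith("n't"):
            negate = True          # remember the negation, drop the word itself
            continue
        if tok in stopwords:
            continue               # stopwords neither receive nor clear the flag
        out.append("NEG_" + tok if negate else tok)
        negate = False
    return out

print(mark_negation(["i", "am", "not", "so", "happy"]))  # ['NEG_happy']
print(mark_negation(["i", "don't", "like", "mondays"]))  # ['NEG_like', 'mondays']
```

Only the first content word after a negation is marked, which matches the one-word scope used in the notebook's function.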

## Re-run the rest of the code

In [None]:
train_pos = all_positive_tweets[:4000]
train_neg = all_negative_tweets[:4000]
test_pos = all_positive_tweets[4000:]
test_neg = all_negative_tweets[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

freqs = build_freqs(train_x, train_y)

tmp1 = extract_features(train_x[22], freqs)

X_train = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X_train[i, :] = extract_features(train_x[i], freqs)

X_test = np.zeros((len(test_x), 3))
for i in range(len(test_x)):
    X_test[i, :] = extract_features(test_x[i], freqs)

J, theta = gradientDescent(X_train, train_y, np.zeros((3, 1)), 1e-9, 15000)

tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

k = 20
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, np.squeeze(train_y))

accuracy_knn = test_knn(test_x, np.squeeze(test_y), knn)
print(f"kNN model's accuracy with k={k} = {accuracy_knn:.4f}")

my_tweet = 'I am NOT happy'

x = extract_features(my_tweet, freqs)
predicted_sentiment_knn = knn.predict(x)

if predicted_sentiment_knn == 1:
    print('kNN prediction: Positive sentiment')
else:
    print('kNN prediction: Negative sentiment')

def pre(sentence):
    yhat = predict_tweet(sentence, freqs, theta)
    if yhat > 0.5:
        return 'Positive sentiment'
    else:
        return 'Negative sentiment'

res = pre(my_tweet)
print(f"Logistic regression prediction: {res}")



Logistic regression model's accuracy = 0.6592
kNN model's accuracy with k=20 = 0.7003
kNN prediction: Negative sentiment
Logistic regression prediction: Negative sentiment


## Why did the accuracy change?

Initially, the models were trained on Dataset 1 with the first pre-processing function. Both models perform well because they learn strong correlations between individual words such as "happy" and "sad" and the sentiment label. They both score around 92% in accuracy, with kNN being slightly better.

However, the first pre-processing function discards stopwords such as "not". This means that a statement such as "I am not happy" is classified as positive, because the model only sees "happy" => "happi".

Therefore, a second pre-processing function was introduced, which differentiates between "happy" => "happi" and "not happy" => "NEG_happi".

The models were then tested with Dataset 1 but were still not able to detect "not happy" as a negative sentiment. After analysis, it was clear that Dataset 1 is insufficient to train the models to handle negation.

Thus, Dataset 2 was used, with twice as many positive and negative tweets. After training on it, the models correctly identified "not happy" => "NEG_happi" as a negative sentiment.

## Why did the change in the Dataset lead to a drop in the accuracy?

The accuracy of logistic regression dropped to roughly 64% to 66%, and that of kNN to roughly 70% to 72%.

The sentiment analysis algorithm used by the models is inherently simple: each tweet is reduced to two frequency-based scores. It therefore does not cope well with more complex and more varied tweets, as found in Dataset 2, or with negation handling.

Overall, kNN was a slightly better model.