## SENTIMENT ANALYSIS PROJECT

The project's workflow and objectives are to build and compare four pipelines:

1. Create a "Text ==> VADER" pipeline
2. Create a "Text ==> Remove Stopwords ==> VADER" pipeline
3. Create a "Text ==> Remove Stopwords ==> Bag of Words ==> Custom Model" pipeline
4. Create a "Text ==> Remove Stopwords ==> TF-IDF ==> Custom Model" pipeline

Project scope:

- Data Collection
- Text Processing
- Modelling
- Model Evaluation

## Data loading

In [1]:
#import libraries
import pandas as pd

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt_tab')

stop_words = set(stopwords.words("english"))   # common English words to filter out

from nltk.tokenize import word_tokenize   # breaks text into manageable pieces (tokens)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\t_ongep\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\t_ongep\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [2]:
#load dataset
train_data = pd.read_csv("e commerce reviews train.csv")
train_data.head(10)

Unnamed: 0,labels,text
0,__label__2,Stuning even for the non-gamer: This sound tra...
1,__label__2,The best soundtrack ever to anything.: I'm rea...
2,__label__2,Amazing!: This soundtrack is my favorite music...
3,__label__2,Excellent Soundtrack: I truly like this soundt...
4,__label__2,"Remember, Pull Your Jaw Off The Floor After He..."
5,__label__2,an absolute masterpiece: I am quite sure any o...
6,__label__1,"Buyer beware: This is a self-published book, a..."
7,__label__2,Glorious story: I loved Whisper of the wicked ...
8,__label__2,A FIVE STAR BOOK: I just finished reading Whis...
9,__label__2,Whispers of the Wicked Saints: This was a easy...


In [3]:
print(train_data.iloc[6]["text"])

Buyer beware: This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a "worst book" contest. I can't believe Amazon even sells this kind of thing. Maybe I can offer them my 8th grade term paper on "To Kill a Mockingbird"--a book I am quite sure Ms. Haddon never heard of. Anyway, unless you are in a mood to send a book to someone as a joke---stay far, far away from this one!


In [4]:
test_data = pd.read_csv("e commerce reviews test.csv")
test_data.head()

Unnamed: 0,labels,text
0,__label__2,Great CD: My lovely Pat has one of the GREAT v...
1,__label__2,One of the best game music soundtracks - for a...
2,__label__1,Batteries died within a year ...: I bought thi...
3,__label__2,"works fine, but Maha Energy is better: Check o..."
4,__label__2,Great for the non-audiophile: Reviewed quite a...


In [5]:
# Map the raw label names to readable sentiment names
mapping_values = {
    "__label__1": "negative",
    "__label__2": "positive",
}
train_data["labels"] = train_data["labels"].map(mapping_values)
test_data["labels"] = test_data["labels"].map(mapping_values)
print(train_data.head())
print(test_data.head())

     labels                                               text
0  positive  Stuning even for the non-gamer: This sound tra...
1  positive  The best soundtrack ever to anything.: I'm rea...
2  positive  Amazing!: This soundtrack is my favorite music...
3  positive  Excellent Soundtrack: I truly like this soundt...
4  positive  Remember, Pull Your Jaw Off The Floor After He...
     labels                                               text
0  positive  Great CD: My lovely Pat has one of the GREAT v...
1  positive  One of the best game music soundtracks - for a...
2  negative  Batteries died within a year ...: I bought thi...
3  positive  works fine, but Maha Energy is better: Check o...
4  positive  Great for the non-audiophile: Reviewed quite a...


## Text processing

In [8]:
# Function to remove stopwords
def remove_stopwords(text, stop_words=stop_words):
    """
    Purpose: This function removes stopwords from a given text string to enhance the quality of text data for natural language processing tasks.

    Parameters:
    text: The input text from which stopwords will be removed.
    stop_words: A predefined set of lowercase stopwords (defaults to the module-level stop_words set).

    Process:
    Tokenization: Splits the input text into individual tokens using NLTK's word_tokenize function.
    Filtering: Creates a new list of tokens that excludes any token whose lowercase form is in stop_words.
    Reconstruction: Joins the filtered words back into a single string, effectively reconstructing the text without stopwords.

    Return Value: The function returns the filtered text, which can improve the subsequent analysis and processing of text data by focusing on meaningful words rather than common, less informative ones.
    """
    words = nltk.word_tokenize(text)
    # Remove stopwords from the text
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Reconstruct the text without stopwords
    filtered_text = " ".join(filtered_words)
    
    return filtered_text

In [9]:
train_data["text"].head(10).apply(remove_stopwords)

0    Stuning even non-gamer : sound track beautiful...
1    best soundtrack ever anything . : 'm reading l...
2    Amazing ! : soundtrack favorite music time , h...
3    Excellent Soundtrack : truly like soundtrack e...
4    Remember , Pull Jaw Floor Hearing : 've played...
5    absolute masterpiece : quite sure actually tak...
6    Buyer beware : self-published book , want know...
7    Glorious story : loved Whisper wicked saints ....
8    FIVE STAR BOOK : finished reading Whisper Wick...
9    Whispers Wicked Saints : easy read book made w...
Name: text, dtype: object
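The same filter can be tried without downloading any NLTK data. The sketch below is a simplified stand-in for the function above, not a replacement: it assumes a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, a hypothetical name) and splits on whitespace instead of calling `word_tokenize`, so punctuation would stay attached to words.

```python
# Hypothetical, tiny stopword set for illustration only;
# NLTK's English list is much longer.
DEMO_STOP_WORDS = {"this", "is", "a", "the", "and", "for", "to", "of"}


def remove_stopwords_demo(text, stop_words=DEMO_STOP_WORDS):
    """Drop stopwords case-insensitively, keeping the casing of the kept words."""
    words = text.split()  # whitespace split, unlike nltk.word_tokenize
    kept = [word for word in words if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stopwords_demo("This is a FIVE STAR BOOK for the ages"))
```

Lowercasing only for the membership test is what lets "This" be removed while "FIVE STAR BOOK" keeps its capitals, which matches the output shown above.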

In [10]:
# tqdm displays a progress bar for iterations, which makes long-running operations easier to monitor
from tqdm.notebook import tqdm



tqdm.pandas()
train_data['stop words'] = train_data['text'].progress_apply(remove_stopwords)

  0%|          | 0/3600010 [00:00<?, ?it/s]

In [None]:
#save the processed text data to my local machine
train_data.to_csv(r"C:\Users\t_ongep\Downloads\processed_text.csv", index=False)

In [11]:

tqdm.pandas()
test_data["stop words"] = test_data["text"].progress_apply(remove_stopwords)

  0%|          | 0/400000 [00:00<?, ?it/s]

## Bag of words

In [12]:
# CountVectorizer builds a bag-of-words representation by counting how often each word appears in each document
from sklearn.feature_extraction.text import CountVectorizer
# TfidfVectorizer weights each count (term frequency) by how rare the word is across all documents (inverse document frequency)
from sklearn.feature_extraction.text import TfidfVectorizer




In [13]:
# Create a bag of words (BoW) from the "stop words" column
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(train_data["stop words"])
X_test_bow = vectorizer.transform(test_data["stop words"])
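To make the fit/transform split concrete, here is a minimal plain-Python sketch of a bag-of-words count. It is an illustration under simplifying assumptions, not scikit-learn's implementation: tokens are lowercased and split on whitespace, and `fit_vocabulary` and `transform` are hypothetical helper names.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct lowercase token in the training documents to a column index."""
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a list of counts, one per vocabulary column."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Only tokens present in the fitted vocabulary get a column.
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


train_docs = ["great book great story", "bad book"]
test_docs = ["great terrible book"]  # "terrible" never appeared in training

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))
```

The vocabulary is learned from the training documents only, so a word that first shows up in the test set ("terrible" here) has no column and is ignored. That is why the cell above calls `fit_transform` on the training data and plain `transform` on the test data.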


In [14]:
# Create a TF-IDF matrix from the "stop words" column

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data["stop words"])
X_test_tfidf = tfidf_vectorizer.transform(test_data["stop words"])
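The weighting itself can be sketched by hand. The example below assumes the smoothed formula that `TfidfVectorizer` uses with its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, and simplifies tokenization to a lowercase whitespace split. `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows) with smoothed, L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


vocabulary, rows = tfidf_rows(["great book", "bad book"])
print(vocabulary)
for row in rows:
    print([round(weight, 3) for weight in row])
```

"book" occurs in both documents, so it receives a lower weight than "great" or "bad", each of which occurs in only one.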

In [None]:
test_data.head()

## Modelling and evaluation

In [None]:
# Steps
# 1. VADER on the original sentences
# 2. VADER on the sentences with stopwords removed
# 3. Custom model on X_train_bow
# 4. Custom model on X_train_tfidf
# 5. Identify the model with the best score

In [15]:
# VADER on the original sentences
# VADER is a lexicon- and rule-based sentiment analyzer, so it needs no training on this dataset
import nltk

# Download the VADER lexicon
nltk.download("vader_lexicon")

# SentimentIntensityAnalyzer exposes polarity_scores(), which returns neg/neu/pos/compound scores
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\t_ongep\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [16]:
#sentiment analyzer function to get polarity scores, then we apply the function

def analyze_sentence(sentence, threshold=0):
    """
    Analyze the sentiment of a given sentence and classify it as positive or negative.

    # Parameters:
    - sentence (str): The input sentence whose sentiment is to be analyzed.
    - threshold (float): The cutoff score to determine sentiment classification. 
                         Sentiment is classified as "positive" if the compound score 
                         is greater than this threshold; otherwise, it is classified as "negative".
                         
    # Returns:
    - str: The sentiment classification of the input sentence ("positive" or "negative").
    
    # Example:
    >>> analyze_sentence("I love this product!", threshold=0.1)
    'positive'
    
    >>> analyze_sentence("This is the worst experience I've ever had.", threshold=0.1)
    'negative'
    """
    
    sentiment_scores = analyzer.polarity_scores(sentence)
    compound_score = sentiment_scores['compound']

    if compound_score > threshold:
        sentiment = "positive"
    else:
        sentiment = "negative"

    return sentiment
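The decision rule in `analyze_sentence` can be isolated from VADER to see how the threshold behaves at its boundary. This is a sketch with hypothetical compound scores typed in by hand, not values produced by the analyzer. `classify_compound` is an illustrative name.

```python
def classify_compound(compound_score, threshold=0):
    """Mirror the rule above: strictly greater than the threshold means positive."""
    return "positive" if compound_score > threshold else "negative"


# Hypothetical compound scores in VADER's [-1, 1] range, not computed here.
for score in (0.64, 0.0, -0.48):
    print(score, classify_compound(score))
```

Because the comparison is strict, a compound score of exactly 0 (a neutral result) falls into the "negative" class. With only two labels in this dataset every text has to land somewhere, so the choice of threshold decides where neutral texts go.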

In [17]:

tqdm.pandas()
vader_on_text = test_data['text'].progress_apply(analyze_sentence)

  0%|          | 0/400000 [00:00<?, ?it/s]

USING THE ACCURACY METRIC AND CLASSIFICATION REPORT

In [18]:
from sklearn.metrics import accuracy_score, classification_report

In [19]:
accuracy_score(test_data['labels'], vader_on_text)  # accuracy of VADER on the original text

0.716675

In [20]:
print(classification_report(test_data['labels'],vader_on_text ))

              precision    recall  f1-score   support

    negative       0.87      0.51      0.64    200000
    positive       0.65      0.92      0.76    200000

    accuracy                           0.72    400000
   macro avg       0.76      0.72      0.70    400000
weighted avg       0.76      0.72      0.70    400000
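For readers less familiar with the report's columns, the sketch below computes precision, recall and F1 for one class from scratch. The labels are a small made-up example, not the project's data, and `class_metrics` is a hypothetical helper name.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for a single class, treating `label` as the target."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for truth, pred in pairs if truth == label and pred == label)
    false_pos = sum(1 for truth, pred in pairs if truth != label and pred == label)
    false_neg = sum(1 for truth, pred in pairs if truth == label and pred != label)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
y_true = ["negative"] * 4 + ["positive"] * 4
y_pred = ["negative", "negative", "positive", "positive",
          "positive", "positive", "positive", "negative"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy", accuracy)
for label in ("negative", "positive"):
    print(label, [round(value, 2) for value in class_metrics(y_true, y_pred, label)])
```

In this toy example, as in the report above, recall for the negative class is lower than recall for the positive class: many truly negative items are predicted as positive.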



In [21]:
# VADER on text with stopwords removed
# Repeat the analysis on the "stop words" column to see whether removing context-irrelevant words improves VADER's scores
vader_on_stopwords = test_data['stop words'].progress_apply(analyze_sentence)

  0%|          | 0/400000 [00:00<?, ?it/s]

In [22]:
# Accuracy first, then the classification report
accuracy_score(test_data['labels'], vader_on_stopwords)

0.68083

In [23]:
print(classification_report(test_data['labels'],vader_on_stopwords ))


              precision    recall  f1-score   support

    negative       0.86      0.43      0.57    200000
    positive       0.62      0.93      0.75    200000

    accuracy                           0.68    400000
   macro avg       0.74      0.68      0.66    400000
weighted avg       0.74      0.68      0.66    400000
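One plausible contributor to the lower score, offered as an assumption rather than something measured in this notebook: NLTK's English stopword list contains negations such as "not" and "no", and VADER relies on nearby negations to flip the polarity of a sentiment word. The sketch below uses a hypothetical five-word subset of a stopword list to show what is lost.

```python
# Hypothetical subset of a stopword list that, like NLTK's English list,
# contains negation words.
DEMO_STOP_WORDS = {"this", "is", "a", "not", "no"}
NEGATIONS = {"not", "no"}


def strip_stopwords(text, stop_words=DEMO_STOP_WORDS):
    """Remove stopwords using a plain whitespace split."""
    return " ".join(word for word in text.split() if word.lower() not in stop_words)


def removed_negations(text, stop_words=DEMO_STOP_WORDS):
    """List the negation words that stopword removal would delete."""
    return [word for word in text.split()
            if word.lower() in stop_words and word.lower() in NEGATIONS]


original = "this is not a good book"
print(strip_stopwords(original))
print(removed_negations(original))
```

After filtering, "not a good book" and "a good book" become the same string, so any analyzer that runs afterwards can no longer tell them apart. Checking how often this happens in the test set would confirm or rule out the explanation.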



In [24]:
# Training and testing custom models: Multinomial Naive Bayes
# Chosen for its simplicity, speed, and compatibility with bag-of-words and TF-IDF features
from sklearn.naive_bayes import MultinomialNB
# Create a classifier
classifier = MultinomialNB()
# Fit on the bag-of-words features
classifier.fit(X_train_bow, train_data['labels'])

# Make predictions and evaluate the model

y_pred = classifier.predict(X_test_bow)
accuracy = accuracy_score(test_data['labels'], y_pred)

# Print the results
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(test_data['labels'], y_pred))

Accuracy: 0.85
              precision    recall  f1-score   support

    negative       0.84      0.86      0.85    200000
    positive       0.85      0.84      0.85    200000

    accuracy                           0.85    400000
   macro avg       0.85      0.85      0.85    400000
weighted avg       0.85      0.85      0.85    400000
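For intuition about what the classifier learns from word counts, here is a compact plain-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, the same smoothing strength as `MultinomialNB`'s default `alpha=1.0`. It is a teaching sketch on made-up reviews, not scikit-learn's implementation. `train_nb` and `predict_nb` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods from token counts."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        token_counts[label].update(doc.lower().split())
    vocabulary = {token for counts in token_counts.values() for token in counts}

    model = {}
    for label, n_docs in class_doc_counts.items():
        denominator = sum(token_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                token: math.log((token_counts[label][token] + alpha) / denominator)
                for token in vocabulary
            },
        }
    return model


def predict_nb(model, document):
    """Pick the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for token in document.lower().split():
            # Tokens outside the training vocabulary contribute nothing.
            score += params["log_likelihood"].get(token, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training reviews for illustration only.
docs = ["great book loved it", "amazing soundtrack great",
        "bad book hated it", "terrible waste bad"]
labels = ["positive", "positive", "negative", "negative"]

model = train_nb(docs, labels)
print(predict_nb(model, "great soundtrack"))
print(predict_nb(model, "bad waste"))
```

Smoothing keeps a word that was never seen with a class from driving that class's probability to zero, which matters on a sparse review vocabulary.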



In [25]:
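# Create and train a second classifier on the TF-IDF features
classifier2 = MultinomialNB()
classifier2.fit(X_train_tfidf, train_data["labels"])
# Predict with classifier2, the model fitted on TF-IDF features (not the BoW-trained `classifier`).
# Re-run this cell before relying on the metrics shown below.
y_pred = classifier2.predict(X_test_tfidf)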
accuracy = accuracy_score(test_data['labels'], y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(test_data['labels'], y_pred))

Accuracy: 0.83
              precision    recall  f1-score   support

    negative       0.83      0.84      0.83    200000
    positive       0.84      0.82      0.83    200000

    accuracy                           0.83    400000
   macro avg       0.83      0.83      0.83    400000
weighted avg       0.83      0.83      0.83    400000



**CONCLUSION**
The best-performing pipeline is Text ==> Remove Stopwords ==> Bag of Words ==> MultinomialNB(), with an accuracy of 0.85.

In [26]:
def inference(text):
  filtered_text = remove_stopwords(text)
  bow = vectorizer.transform([filtered_text])
  sentiment = classifier.predict(bow)
  return sentiment

In [None]:
from flask import Flask, request, jsonify
import joblib
import nltk
from nltk.corpus import stopwords

# Download the stopword list and tokenizer data if not already present
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)  # same tokenizer resource the notebook downloads

# Load your models and vectorizer
vectorizer = joblib.load('vectorizer.pkl')               # Make sure you have this model saved
classifier = joblib.load('classifier.pkl')               # Your trained MultinomialNB model

stop_words = set(stopwords.words("english"))

# Initialize Flask app
app = Flask(__name__)

# Function to remove stopwords
def remove_stopwords(text):
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

@app.route('/inference', methods=['POST'])  # POST, because the text travels in the request body
def inference():
    # silent=True returns None instead of raising when the body is missing or not valid JSON
    data = request.get_json(silent=True) or {}
    text = data.get('text')

    if text:
        filtered_text = remove_stopwords(text)
        bow = vectorizer.transform([filtered_text])
        sentiment = classifier.predict(bow)
        return jsonify({"sentiment": str(sentiment[0])})  # Return the predicted label as JSON
    else:
        return jsonify({"error": "No text provided"}), 400  # Bad request if no text is given

if __name__ == '__main__':
    app.run(debug=True)
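A client for this endpoint could look like the sketch below, which uses only the standard library. The address is an assumption (Flask's development server listens on 127.0.0.1:5000 by default), and `build_inference_request` and `get_sentiment` are hypothetical helper names. Only the request is built here; `get_sentiment` would need the app to be running.

```python
import json
import urllib.request

# Assumed address: the Flask development server's default host and port.
INFERENCE_URL = "http://127.0.0.1:5000/inference"


def build_inference_request(text, url=INFERENCE_URL):
    """Build a POST request carrying {"text": ...} as a JSON body."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_sentiment(text):
    """Send the request and decode the JSON reply; requires the app to be running."""
    with urllib.request.urlopen(build_inference_request(text)) as response:
        return json.loads(response.read().decode("utf-8"))


request_preview = build_inference_request("i hate this book.")
print(request_preview.get_method(), request_preview.full_url)
print(request_preview.data)
```

Setting the `Content-Type` header to `application/json` matters because the route reads the body as JSON.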

SAVE MODEL

In [27]:
# Inference function: receive a text, remove stopwords, convert it to a bag-of-words vector, and pass it to the MultinomialNB model

stop_words = set(stopwords.words("english"))

def remove_stopwords(text, stop_words=stop_words):
    words = nltk.word_tokenize(text)
    # Remove stopwords from the text
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Reconstruct the text without stopwords
    filtered_text = " ".join(filtered_words)
    
    return filtered_text

def inference(text):
    filtered_text = remove_stopwords(text)
    bow = vectorizer.transform([filtered_text])
    sentiment = classifier.predict(bow)
    
    return sentiment

In [28]:
example_text = "i hate this book."
inference(example_text)

array(['negative'], dtype='<U8')

In [None]:
import joblib

# Save the model and vectorizer
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(classifier, 'classifier.pkl')

# Load the model and vectorizer
vectorizer = joblib.load('vectorizer.pkl')
classifier = joblib.load('classifier.pkl')
