This project's workflow and objectives are as follows:

Create a "Text==> Vader Pipeline"

Cretae a "Text ==> Remove Stopwords ==> Vader Pipeline"

Create a "Text ==> Remove Stopwords ==>  Bag Of Words ==> Custom Model"

Create a "Text ==> Remove Stopwords ==>  TF-IDF ==> Custom Model"

1. Text ==> Vader Pipeline
This pipeline uses VADER (Valence Aware Dictionary and sEntiment Reasoner), a sentiment analysis tool designed to capture both the polarity and intensity of sentiments.

Text: The raw textual data (e.g., customer feedback) is the input.
VADER: The text is analyzed using VADER, which assigns a sentiment score indicating positive, neutral, or negative sentiment.

2. Text ==> Remove Stopwords ==> Vader Pipeline
This pipeline includes a preprocessing step to remove stopwords (common words like 'and', 'the', etc.) before applying the VADER sentiment analysis.

Text: The raw textual data.
Remove Stopwords: Common words that do not contribute significantly to the sentiment are removed.
VADER: The cleaned text is analyzed using VADER.

3. Text ==> Remove Stopwords ==> Bag Of Words ==> Custom Model
This pipeline extends preprocessing by including the Bag of Words (BoW) representation, which converts text into numerical features before applying a custom model for tasks like classification.

Text: The raw textual data.
Remove Stopwords: Stopwords are removed.
Bag of Words: The text is converted into a BoW vector representation, where each word is represented by a unique integer index, and its frequency is recorded.
Custom Model: A machine learning model (e.g., logistic regression, SVM) is trained on these features for tasks like sentiment classification.

4. Text ==> Remove Stopwords ==> TF-IDF ==> Custom Model
Similar to the previous pipeline, but uses Term Frequency-Inverse Document Frequency (TF-IDF) instead of Bag of Words, which gives more weight to unique words in a document and reduces the impact of common words.

Text: The raw textual data.
Remove Stopwords: Stopwords are removed.
TF-IDF: Converts the cleaned text into TF-IDF vectors, emphasizing unique words while de-emphasizing common ones.
Custom Model: A machine learning model is trained on these TF-IDF features.

In order to accomplish this we neeed to perfoem:

Data Collection

Text Processing

Modelling

Model Evaluation

**DATA LOADING**

In [None]:
import pandas as pd
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
#nltk requires this package to be installed alongside stopwords
nltk.download('punkt')

stop_words = stopwords.words("english")

from tqdm import tqdm

from sklearn.feature_extraction.text import CountVectorizer # for creating bags of words
from sklearn.feature_extraction.text import TfidfVectorizer # creating tf-id

# vader on normal sentences
import nltk
# download the VADER lexicon and model
nltk.download('vader_lexicon')

#next we import the Sentiment Intensity Analyzer class from vader, this class is where the magic happens
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

from sklearn.metrics import accuracy_score, classification_report

In [368]:
#read in the text data: train and test set

In [370]:
train_dataset = pd.read_csv("e commerce reviews train.csv")

In [371]:
test_dataset = pd.read_csv("e commerce reviews test.csv")

In [372]:
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3600010 entries, 0 to 3600009
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   labels  object
 1   text    object
dtypes: object(2)
memory usage: 54.9+ MB


In [373]:
test_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   labels  400000 non-null  object
 1   text    400000 non-null  object
dtypes: object(2)
memory usage: 6.1+ MB


In [374]:
train_dataset.head(10)

Unnamed: 0,labels,text
0,__label__2,Stuning even for the non-gamer: This sound tra...
1,__label__2,The best soundtrack ever to anything.: I'm rea...
2,__label__2,Amazing!: This soundtrack is my favorite music...
3,__label__2,Excellent Soundtrack: I truly like this soundt...
4,__label__2,"Remember, Pull Your Jaw Off The Floor After He..."
5,__label__2,an absolute masterpiece: I am quite sure any o...
6,__label__1,"Buyer beware: This is a self-published book, a..."
7,__label__2,Glorious story: I loved Whisper of the wicked ...
8,__label__2,A FIVE STAR BOOK: I just finished reading Whis...
9,__label__2,Whispers of the Wicked Saints: This was a easy...


In [375]:
# Getting the row at index 6
print(train_dataset.iloc[6]['text'])

Buyer beware: This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a "worst book" contest. I can't believe Amazon even sells this kind of thing. Maybe I can offer them my 8th grade term paper on "To Kill a Mockingbird"--a book I am quite sure Ms. Haddon never heard of. Anyway, unless you are in a mood to send a book to someone as a joke---stay far, far away from this one!


**TEXT PROCESSING**

In [377]:
#FIRST LETS CHANGE THE LABELS

In [378]:
###label 1: 1 and 2 stars ratings ==> negative
###label 2: 4 and 5 stars rating ==> positive

In [379]:
train_dataset['labels'].unique()

array(['__label__2', '__label__1'], dtype=object)

In [380]:
##lets map the labels to sentiment words, positive, negative

mapped_values = {
    "__label__1": "negative",
    "__label__2": "positive"
}

In [381]:
train_dataset['labels'] = train_dataset['labels'].map(mapped_values)

In [382]:
test_dataset['labels'] =test_dataset['labels'].map(mapped_values)

In [383]:
test_dataset.head(10)

Unnamed: 0,labels,text
0,positive,Great CD: My lovely Pat has one of the GREAT v...
1,positive,One of the best game music soundtracks - for a...
2,negative,Batteries died within a year ...: I bought thi...
3,positive,"works fine, but Maha Energy is better: Check o..."
4,positive,Great for the non-audiophile: Reviewed quite a...
5,negative,DVD Player crapped out after one year: I also ...
6,negative,"Incorrect Disc: I love the style of this, but ..."
7,negative,DVD menu select problems: I cannot scroll thro...
8,positive,Unique Weird Orientalia from the 1930's: Exoti...
9,negative,"Not an ""ultimate guide"": Firstly,I enjoyed the..."


In [384]:
train_dataset.head(10)

Unnamed: 0,labels,text
0,positive,Stuning even for the non-gamer: This sound tra...
1,positive,The best soundtrack ever to anything.: I'm rea...
2,positive,Amazing!: This soundtrack is my favorite music...
3,positive,Excellent Soundtrack: I truly like this soundt...
4,positive,"Remember, Pull Your Jaw Off The Floor After He..."
5,positive,an absolute masterpiece: I am quite sure any o...
6,negative,"Buyer beware: This is a self-published book, a..."
7,positive,Glorious story: I loved Whisper of the wicked ...
8,positive,A FIVE STAR BOOK: I just finished reading Whis...
9,positive,Whispers of the Wicked Saints: This was a easy...


# NLTK
NLTK, or the Natural Language Toolkit, is a comprehensive library for natural language processing (NLP) in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

# Use Cases of NLTK
Text Cleaning and Preprocessing: Tokenizing, removing stopwords, stemming, and lemmatizing text to prepare it for further analysis.
Sentiment Analysis: Analyzing the sentiment or emotional tone of text data, such as customer reviews or social media posts.
Language Translation: Assisting in tasks related to translating text between languages.
Named Entity Recognition: Identifying proper names and classifying them into predefined categories like names of persons, organizations, locations, etc.
Text Classification: Automatically categorizing text into predefined categories, useful in spam detection, news categorization, and more.
NLTK is highly regarded in the field of NLP for educational purposes and rapid prototyping due to its comprehensive functionality and ease of use.

# Key Features of NLTK
1. Text Processing Functions:

Tokenization: Splitting text into sentences or words.
Stemming and Lemmatization: Reducing words to their root form or base form.
Part-of-Speech Tagging: Assigning grammatical categories (like noun, verb, adjective) to each word.
Named Entity Recognition (NER): Identifying and classifying named entities in text (such as people, organizations, locations).

2. Corpora and Lexical Resources:

Corpora: NLTK includes access to a variety of text corpora, which are large and structured sets of texts used for linguistic research and NLP tasks.
WordNet: A lexical database of English, which groups words into sets of synonyms called synsets and provides short definitions and usage examples.

3. Text Classification:

NLTK offers tools to classify text into categories using machine learning techniques, which can be trained using labeled data.

4. Parsing and Syntactic Analysis:

Tools for analyzing the grammatical structure of sentences, including context-free grammar (CFG) parsing.

5. Semantic Analysis:

Functions for understanding the meaning of text, including tools for sentiment analysis and more complex semantic tasks.

In [388]:
#LETS REMOVE STOPWORDS
#Bag of word
#tfidf

In [389]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
#nltk requires this package to be installed alongside stopwords
nltk.download('punkt')

stop_words = stopwords.words("english")
# Example sentence
text = "This is an example sentence with some stopwords that we want to remove"


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [390]:

# Tokenize the text (split it into words)
words = nltk.word_tokenize(text)

words

['This',
 'is',
 'an',
 'example',
 'sentence',
 'with',
 'some',
 'stopwords',
 'that',
 'we',
 'want',
 'to',
 'remove']

In [391]:
stop_words[1]

'me'

In [392]:

# Get a list of English stopwords
stop_words = set(stopwords.words("english"))

# Remove stopwords from the text
filtered_words = [word for word in words if word.lower() not in stop_words]

# Reconstruct the text without stopwords
filtered_text = " ".join(filtered_words)

print(filtered_text)


example sentence stopwords want remove


In [394]:
#modularize code into a function
stop_words = set(stopwords.words("english"))
def remove_stopwords(text,stop_words = stop_words):
  '''
  This function takes in words and tokenizes the sentence. 
  filters out the stopwords and reutens a more compact sentence  
  '''
  words = nltk.word_tokenize(text)
  # Remove stopwords from the text
  filtered_words = [word for word in words if word.lower() not in stop_words]
  # Reconstruct the text without stopwords
  filtered_text = " ".join(filtered_words)
  #print(filtered_text)

  return filtered_text

In [395]:
remove_stopwords(text)

'example sentence stopwords want remove'

In [396]:
len(train_dataset)

3600010

In [397]:
train_dataset["text"].head(10)

0    Stuning even for the non-gamer: This sound tra...
1    The best soundtrack ever to anything.: I'm rea...
2    Amazing!: This soundtrack is my favorite music...
3    Excellent Soundtrack: I truly like this soundt...
4    Remember, Pull Your Jaw Off The Floor After He...
5    an absolute masterpiece: I am quite sure any o...
6    Buyer beware: This is a self-published book, a...
7    Glorious story: I loved Whisper of the wicked ...
8    A FIVE STAR BOOK: I just finished reading Whis...
9    Whispers of the Wicked Saints: This was a easy...
Name: text, dtype: object

In [399]:
train_dataset["text"].head(10).apply(remove_stopwords)

0    Stuning even non-gamer : sound track beautiful...
1    best soundtrack ever anything . : 'm reading l...
2    Amazing ! : soundtrack favorite music time , h...
3    Excellent Soundtrack : truly like soundtrack e...
4    Remember , Pull Jaw Floor Hearing : 've played...
5    absolute masterpiece : quite sure actually tak...
6    Buyer beware : self-published book , want know...
7    Glorious story : loved Whisper wicked saints ....
8    FIVE STAR BOOK : finished reading Whisper Wick...
9    Whispers Wicked Saints : easy read book made w...
Name: text, dtype: object

In [156]:
from tqdm import tqdm

In [429]:
##i would love to see a progress bar when we process for all the 3.6 million reviews
total_rows = len(train_dataset)
tqdm.pandas(total = total_rows)
train_dataset['stop words'] = train_dataset['text'].progress_apply(remove_stopwords)

100%|█████████████████████████████████████████████████████████████████████| 3600010/3600010 [1:04:17<00:00, 933.14it/s]


In [430]:
train_dataset

Unnamed: 0,labels,text,stop words
0,positive,Stuning even for the non-gamer: This sound tra...,Stuning even non-gamer : sound track beautiful...
1,positive,The best soundtrack ever to anything.: I'm rea...,best soundtrack ever anything . : 'm reading l...
2,positive,Amazing!: This soundtrack is my favorite music...,"Amazing ! : soundtrack favorite music time , h..."
3,positive,Excellent Soundtrack: I truly like this soundt...,Excellent Soundtrack : truly like soundtrack e...
4,positive,"Remember, Pull Your Jaw Off The Floor After He...","Remember , Pull Jaw Floor Hearing : 've played..."
...,...,...,...
3600005,negative,Don't do it!!: The high chair looks great when...,n't ! ! : high chair looks great first comes b...
3600006,negative,"Looks nice, low functionality: I have used thi...","Looks nice , low functionality : used highchai..."
3600007,negative,"compact, but hard to clean: We have a small ho...","compact , hard clean : small house , really wa..."
3600008,negative,what is it saying?: not sure what this book is...,saying ? : sure book supposed . really rehash ...


In [434]:
total_rows = len(test_dataset)
tqdm.pandas(total = total_rows)
test_dataset['stop words'] = test_dataset['text'].progress_apply(remove_stopwords)

100%|█████████████████████████████████████████████████████████████████████████| 400000/400000 [07:22<00:00, 904.59it/s]


In [435]:
test_dataset

Unnamed: 0,labels,text,stop words
0,positive,Great CD: My lovely Pat has one of the GREAT v...,Great CD : lovely Pat one GREAT voices generat...
1,positive,One of the best game music soundtracks - for a...,One best game music soundtracks - game n't rea...
2,negative,Batteries died within a year ...: I bought thi...,Batteries died within year ... : bought charge...
3,positive,"works fine, but Maha Energy is better: Check o...","works fine , Maha Energy better : Check Maha E..."
4,positive,Great for the non-audiophile: Reviewed quite a...,Great non-audiophile : Reviewed quite bit comb...
...,...,...,...
399995,negative,Unbelievable- In a Bad Way: We bought this Tho...,Unbelievable- Bad Way : bought Thomas son huge...
399996,negative,"Almost Great, Until it Broke...: My son reciev...","Almost Great , Broke ... : son recieved birthd..."
399997,negative,Disappointed !!!: I bought this toy for my son...,Disappointed ! ! ! : bought toy son loves `` T...
399998,positive,Classic Jessica Mitford: This is a compilation...,Classic Jessica Mitford : compilation wide ran...


# LETS CREATE A BAG OF WORDS AND TF-IDF


# Bag of Words (BoW)
A "Bag of Words" (BoW) is a simple and commonly used technique in natural language processing (NLP) and text analysis to represent text data as numerical features. It is used to transform a collection of text documents into a format that can be processed by machine learning algorithms. The idea behind the Bag of Words model is to disregard the order and structure of words in a text and focus only on the frequency of each word's occurrence.

The key idea is that the order of words and the grammatical structure of sentences are ignored, and the analysis is purely based on the presence or absence of specific words and their frequencies.

# TF-IDF
TF-IDF, which stands for "Term Frequency-Inverse Document Frequency," is a numerical statistic used in information retrieval and natural language processing (NLP) to evaluate the importance of a word within a document relative to a collection of documents, typically a corpus.

The TF-IDF score provides a measure of how important a term is within a specific document and across a collection of documents. Terms that appear frequently in a document but rarely in other documents receive higher TF-IDF scores, making them indicative of the content of that document.

In [456]:
from sklearn.feature_extraction.text import CountVectorizer # for creating bags of words
from sklearn.feature_extraction.text import TfidfVectorizer # creating tf-idf

## Creating bag of words from stop words column

In [459]:
vectorizer = CountVectorizer() # You can adjust max_features as needed
train_bow = vectorizer.fit_transform(train_dataset['stop words'])
test_bow = vectorizer.transform(test_dataset['stop words'])

## Creating a tf-idf matrix from stopwords

In [461]:
tfidf_vectorizer = TfidfVectorizer()  # You can adjust max_features as needed
train_tfidf = tfidf_vectorizer.fit_transform(train_dataset['stop words'])
test_tfidf = tfidf_vectorizer.transform(test_dataset['stop words'])

# MODELLING AND EVALUATION

- vader on normal sentences
- vader on sentence with stop words
- custom on train_tf-idf
- custom on train_bow

In [439]:
# vader on normal sentences
import nltk
# download the VADER lexicon and model
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

# vader: is a pretrained model for analysing sentiment sentences

In [441]:
#next we import the Sentiment Intensity Analyzer class from vader, this class is where the magic happens
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

In [443]:
##lets test out the sentiment analyzer with an example text

example_text  = "i love the orange flavor, it is a good product"
sentiment_scores = analyzer.polarity_scores(example_text)

# The sentiment_scores dictionary will contain the scores.
print(sentiment_scores)

{'neg': 0.0, 'neu': 0.36, 'pos': 0.64, 'compound': 0.7964}


In [444]:
#getting the sentiment scores
compound_score = sentiment_scores['compound']
#now lets make a decision for the cut off for a postitive or negative score
if compound_score > 0:
    sentiment = "Positive"
else:
  sentiment = "Negative"

print(f"The sentiment is {sentiment} (Compound Score: {compound_score})")


The sentiment is Positive (Compound Score: 0.7964)


In [445]:
##we want to apply all we just did to all the text in our dataset, so lets first
##create the function, then we apply the function

# Defining a function to analyze sentiment of a given sentence
def analyze_sentence(sentence):
  '''
  takes a sentence 
  get the sentiment using analyzer
  return positive if the compound score > 0, other return negative
  '''
  sentiment_scores = analyzer.polarity_scores(sentence)
  compound_score = sentiment_scores['compound']

  if compound_score > 0:
    sentiment = "positive"
  else:
    sentiment = "negative"

  return sentiment

In [446]:
 analyze_sentence(example_text)

'positive'

In [447]:
#vader on text
test_dataset["vader_on_text"] = test_dataset["text"].apply(analyze_sentence)

In [448]:
# Vader on stopwords column
test_dataset["vader_on_text_without_stopwords"] = test_dataset["stop words"].apply(analyze_sentence)

# training custom models on bag of words and tf-idf

In [None]:
train_tfidf

In [None]:
train_bow

In [452]:
# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

In [453]:
# Creating classifier objects for tf-idf and bow models
classifier_bow = MultinomialNB()
classifier_tfidf = MultinomialNB()

In [454]:
# fit these models 
classifier_bow.fit(train_bow, train_dataset['labels'])
classifier_tfidf.fit(train_tfidf, train_dataset['labels'])

In [463]:
train_dataset['labels']

0          positive
1          positive
2          positive
3          positive
4          positive
             ...   
3600005    negative
3600006    negative
3600007    negative
3600008    negative
3600009    positive
Name: labels, Length: 3600010, dtype: object

In [464]:
test_bow

<400000x980808 sparse matrix of type '<class 'numpy.int64'>'
	with 13961182 stored elements in Compressed Sparse Row format>

In [465]:
test_tfidf

<400000x980808 sparse matrix of type '<class 'numpy.float64'>'
	with 13961182 stored elements in Compressed Sparse Row format>

# Making predictions with classifier

In [466]:
test_dataset['bow'] = classifier_bow.predict(test_bow)

In [467]:
test_dataset['tfidf'] = classifier_bow.predict(test_tfidf)

In [468]:
test_dataset

Unnamed: 0,labels,text,stop words,vader_on_text,vader_on_text_without_stopwords,bow,tfidf
0,positive,Great CD: My lovely Pat has one of the GREAT v...,Great CD : lovely Pat one GREAT voices generat...,positive,positive,positive,positive
1,positive,One of the best game music soundtracks - for a...,One best game music soundtracks - game n't rea...,positive,positive,positive,positive
2,negative,Batteries died within a year ...: I bought thi...,Batteries died within year ... : bought charge...,positive,positive,negative,negative
3,positive,"works fine, but Maha Energy is better: Check o...","works fine , Maha Energy better : Check Maha E...",positive,positive,negative,negative
4,positive,Great for the non-audiophile: Reviewed quite a...,Great non-audiophile : Reviewed quite bit comb...,positive,positive,positive,positive
...,...,...,...,...,...,...,...
399995,negative,Unbelievable- In a Bad Way: We bought this Tho...,Unbelievable- Bad Way : bought Thomas son huge...,positive,positive,negative,negative
399996,negative,"Almost Great, Until it Broke...: My son reciev...","Almost Great , Broke ... : son recieved birthd...",negative,positive,negative,negative
399997,negative,Disappointed !!!: I bought this toy for my son...,Disappointed ! ! ! : bought toy son loves `` T...,positive,positive,negative,negative
399998,positive,Classic Jessica Mitford: This is a compilation...,Classic Jessica Mitford : compilation wide ran...,positive,positive,positive,positive


# Model Evaluation: Using Accuracy Scores and Classification Report

In [476]:
from sklearn.metrics import accuracy_score, classification_report

In [478]:
# what are we evaluating 
test_dataset.head(5)

Unnamed: 0,labels,text,stop words,vader_on_text,vader_on_text_without_stopwords,bow,tfidf
0,positive,Great CD: My lovely Pat has one of the GREAT v...,Great CD : lovely Pat one GREAT voices generat...,positive,positive,positive,positive
1,positive,One of the best game music soundtracks - for a...,One best game music soundtracks - game n't rea...,positive,positive,positive,positive
2,negative,Batteries died within a year ...: I bought thi...,Batteries died within year ... : bought charge...,positive,positive,negative,negative
3,positive,"works fine, but Maha Energy is better: Check o...","works fine , Maha Energy better : Check Maha E...",positive,positive,negative,negative
4,positive,Great for the non-audiophile: Reviewed quite a...,Great non-audiophile : Reviewed quite bit comb...,positive,positive,positive,positive


In [486]:
vader_text_accuracy_score = accuracy_score(test_dataset['labels'], test_dataset['vader_on_text'])

In [488]:
vader_text_accuracy_score*100

71.66550000000001

In [490]:
print(classification_report(test_dataset['labels'], test_dataset['vader_on_text']))

              precision    recall  f1-score   support

    negative       0.87      0.51      0.64    200000
    positive       0.65      0.92      0.76    200000

    accuracy                           0.72    400000
   macro avg       0.76      0.72      0.70    400000
weighted avg       0.76      0.72      0.70    400000



In [491]:
vader_text_stopwords_accuracy_score = accuracy_score(test_dataset['labels'], test_dataset['vader_on_text_without_stopwords'])

In [492]:
vader_text_stopwords_accuracy_score*100

68.08024999999999

In [496]:
print(classification_report(test_dataset['labels'], test_dataset['vader_on_text_without_stopwords']))

              precision    recall  f1-score   support

    negative       0.86      0.43      0.57    200000
    positive       0.62      0.93      0.75    200000

    accuracy                           0.68    400000
   macro avg       0.74      0.68      0.66    400000
weighted avg       0.74      0.68      0.66    400000



In [502]:
bow_score = accuracy_score(test_dataset['labels'], test_dataset['bow'])

In [504]:
bow_score*100

84.8735

In [498]:
print(classification_report(test_dataset['labels'], test_dataset['bow']))

              precision    recall  f1-score   support

    negative       0.84      0.86      0.85    200000
    positive       0.85      0.84      0.85    200000

    accuracy                           0.85    400000
   macro avg       0.85      0.85      0.85    400000
weighted avg       0.85      0.85      0.85    400000



In [506]:
tfidf_score = accuracy_score(test_dataset['labels'], test_dataset['tfidf'])

In [508]:
tfidf_score*100

83.22125

In [510]:
print(classification_report(test_dataset['labels'], test_dataset['tfidf']))

              precision    recall  f1-score   support

    negative       0.83      0.84      0.83    200000
    positive       0.84      0.82      0.83    200000

    accuracy                           0.83    400000
   macro avg       0.83      0.83      0.83    400000
weighted avg       0.83      0.83      0.83    400000



# Conclusion: The best performing model/pipeline is the text to stopwords remover to bow to multinomialNB

In [542]:
def inference(text):
    """
    remove stopwords
    convert the remaining words to bag of words
    convert the bof words to classifier_bow
    return prediction
    """
    filtered_test = remove_stopwords(text)
    bow_single = vectorizer.transform([filtered_test])
    inference = classifier_bow.predict(bow_single)
    return inference[0]

In [544]:
example = 'I love the orange flavor, it is so good'

In [546]:
inference(example)

'positive'

# Model Interpretation and Comparison
1. VADER on Raw Text
- Accuracy: 71.67%
- Precision, Recall, F1-Score:
- Negative: Precision (0.87), Recall (0.51), F1-Score (0.64)
- Positive: Precision (0.65), Recall (0.92), F1-Score (0.76)
- Macro Avg: Precision (0.76), Recall (0.72), F1-Score (0.70)
- Weighted Avg: Precision (0.76), Recall (0.72), F1-Score (0.70)

VADER performed moderately well, with higher precision for negative sentiments but better recall for positive sentiments. The overall accuracy indicates that VADER can classify sentiments, but there's room for improvement, particularly in handling negative sentiments.

2. VADER with Stopwords Removed
- Accuracy: 68.08%
- Precision, Recall, F1-Score:
- Negative: Precision (0.86), Recall (0.43), F1-Score (0.57)
- Positive: Precision (0.62), Recall (0.93), F1-Score (0.75)
- Macro Avg: Precision (0.74), Recall (0.68), F1-Score (0.66)
- Weighted Avg: Precision (0.74), Recall (0.68), F1-Score (0.66)

Removing stopwords slightly decreased the overall accuracy and F1-scores. The precision for negative sentiments remained high, but the recall dropped significantly, indicating that VADER struggled more to correctly identify negative reviews without stopwords.

3. Bag of Words (BoW) Model
- Accuracy: 84.87%
- Precision, Recall, F1-Score:
- Negative: Precision (0.84), Recall (0.86), F1-Score (0.85)
- Positive: Precision (0.85), Recall (0.84), F1-Score (0.85)
- Macro Avg: Precision (0.85), Recall (0.85), F1-Score (0.85)
- Weighted Avg: Precision (0.85), Recall (0.85), F1-Score (0.85)

The BoW model significantly outperformed VADER, achieving higher accuracy and balanced precision and recall scores for both sentiment classes. This suggests that BoW, likely with a more sophisticated classifier, was better at capturing the nuances in the text data.

4. TF-IDF Model
- Accuracy: 83.22%
- Precision, Recall, F1-Score:
- Negative: Precision (0.83), Recall (0.84), F1-Score (0.83)
- Positive: Precision (0.84), Recall (0.82), F1-Score (0.83)
- Macro Avg: Precision (0.83), Recall (0.83), F1-Score (0.83)
- Weighted Avg: Precision (0.83), Recall (0.83), F1-Score (0.83)

The TF-IDF model also performed well, with slightly lower accuracy than the BoW model but still much better than VADER. The balanced scores across all metrics indicate that TF-IDF is effective for this classification task.

# Business Implications
1. Improved Customer Feedback Analysis: The higher accuracy of the BoW and TF-IDF models indicates more reliable sentiment analysis, enabling TechTrends to better understand customer opinions and preferences.
2. Resource Allocation: Automated, accurate sentiment analysis reduces the need for manual review, saving time and resources.
3. Enhanced Customer Satisfaction: By accurately identifying and addressing negative feedback, TechTrends can improve products and services, leading to higher customer satisfaction and loyalty.
4. Data-Driven Decisions: Reliable sentiment data supports strategic decisions, such as product development and marketing strategies, based on real customer feedback.
# Recommendations
1. Deploy BoW or TF-IDF Models: Given their superior performance, these models should be preferred for operational use. Regular updates to the training data should be considered to maintain accuracy.
2. Enhance Data Preprocessing: Further refine preprocessing steps, such as better handling of special characters and synonyms, to improve model performance.
3. Combine Models: Consider an ensemble approach that combines the strengths of multiple models for even better accuracy.
4. Continuous Monitoring and Feedback Loop: Regularly monitor model performance and customer feedback to adjust and improve the sentiment analysis system.
5. Scalability: Ensure the infrastructure can handle large volumes of data as the company grows and more feedback is collected.