# **Sentiment Analysis Application Using Naive Bayes**

## Introduction

This is my first project in Natural Language Processing (NLP). It implements a sentiment analysis system from scratch in Python. Sentiment analysis is a key application of NLP that identifies and categorizes the sentiment expressed in text, such as positive and negative reviews. <br />

I use the Naive Bayes classifier, a probabilistic machine learning algorithm, to perform sentiment classification. The entire pipeline is built from the ground up: preprocessing the text, extracting features through vectorization, and training and testing the classifier. The classifier itself does not rely on any machine learning library (scikit-learn is used only to split the data into training and test sets), which makes the project an educational example of sentiment analysis fundamentals. <br />

This project is valuable for understanding the inner workings of Naive Bayes, as well as essential NLP techniques such as tokenization, stopword removal, and stemming. It demonstrates the entire workflow of text-based machine learning and showcases the practicality of applying foundational concepts to real-world datasets. <br />

The primary objective of this project is to classify Amazon product reviews as positive or negative.

#### Import Required Libraries:


These are the libraries the project needs. **Pandas** reads and manipulates the dataset. **re** provides the regular expressions used to clean text. **nltk** provides text preprocessing tools such as tokenization, stopword removal, and stemming. **NumPy** handles numerical operations and efficient array storage.

In [3]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import numpy as np

#### Download Necessary NLTK Resources:

Two NLTK resources are required: **punkt**, the tokenizer model used to split text into words, and **stopwords**, a list of common words (such as "the" and "is") to exclude from the analysis. Recent NLTK releases may also require the **punkt_tab** resource for `word_tokenize`.

In [4]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Load and Prepare the Dataset:

In [6]:
df = pd.read_csv('amazon.csv')

X = df['reviewText'].fillna("")  # Handle missing values by replacing them with empty strings
Y = df['Positive']  # Labels: 1 for positive sentiment, 0 for negative sentiment

#### Preprocess Reviews:

The following cell standardizes each review by:
1. Converting the text to lowercase.
2. Removing punctuation and other special characters.
3. Tokenizing the text into words.
4. Removing stop words (e.g., "is", "the", "and").
5. Stemming each word (reducing it to its root form).<br />

For example: <br />
* Input: "I absolutely love this product!" <br />
* Output: ['absolut', 'love', 'product']

In [7]:
# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_tweet(tweet):
    tweet = tweet.lower()  # Convert text to lowercase
    tweet = re.sub(r'[^\w\s]', '', tweet)  # Remove punctuation
    tokens = word_tokenize(tweet)  # Split text into individual words
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [ps.stem(word) for word in tokens]  # Reduce words to their root forms
    return tokens

preprocessed_reviews = [preprocess_tweet(review) for review in X]
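To see what each cleaning step contributes without downloading any NLTK data, the same stages can be sketched with the standard library alone. The `simple_preprocess` function and the tiny `TOY_STOPWORDS` set are illustrative assumptions, not NLTK's stopword list, and stemming is deliberately left out, so the result keeps `absolutely` rather than `absolut`.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# NLTK's English list is much longer.
TOY_STOPWORDS = {"i", "this", "is", "the", "and", "it", "a"}

def simple_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # same punctuation pattern as the cell above
    tokens = text.split()                # whitespace split instead of word_tokenize
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

print(simple_preprocess("I absolutely love this product!"))
# ['absolutely', 'love', 'product']
```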


#### Create Vocabulary:

Then we need to build a dictionary mapping each unique word in the dataset to a unique index. <br />

For example:
* Input: ['absolut', 'love', 'product']
* Output: {'absolut': 0, 'love': 1, 'product': 2}

In [8]:
def create_vocabulary(tweets):
    vocab = {}
    index = 0
    for tweet in tweets:
        for word in tweet:
            if word not in vocab:
                vocab[word] = index
                index += 1
    return vocab

vocab = create_vocabulary(preprocessed_reviews)
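As a small illustration of how indices are assigned when several reviews share words, here is a compact variant using `dict.setdefault`. The name `build_vocab` and the toy reviews are assumptions made for this sketch; it is meant to behave like `create_vocabulary` above, not to replace it.

```python
def build_vocab(token_lists):
    """Map each distinct token to the order in which it was first seen."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            # setdefault only inserts when tok is new, so len(vocab)
            # is always the next free index at that moment.
            vocab.setdefault(tok, len(vocab))
    return vocab

toy_reviews = [["love", "product"], ["hate", "product"]]
print(build_vocab(toy_reviews))
# {'love': 0, 'product': 1, 'hate': 2}
```

The repeated word `product` keeps the index it received on first appearance, so the vocabulary size equals the number of distinct tokens.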

#### Convert Reviews into Vectors:

In this part, we will convert each review into a numerical vector based on word frequencies.

In [9]:
def vectorize_tweets(tweets, vocab):
    vectors = []
    for tweet in tweets:
        vector = [0] * len(vocab)  # Initialize a vector of zeros
        for word in tweet:
            if word in vocab:
                vector[vocab[word]] += 1  # Count occurrences of each word
        vectors.append(vector)
    return np.array(vectors)

tweet_vectors = vectorize_tweets(preprocessed_reviews, vocab)
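A toy example may make the resulting vectors easier to picture. The sketch below uses plain lists and `collections.Counter` rather than NumPy; `count_vector` and `toy_vocab` are hypothetical names introduced only for this illustration.

```python
from collections import Counter

def count_vector(tokens, vocab):
    """Return per-word counts as a list ordered by vocabulary index."""
    counts = Counter(tokens)
    vector = [0] * len(vocab)
    for word, index in vocab.items():
        vector[index] = counts[word]  # Counter returns 0 for absent words
    return vector

toy_vocab = {"love": 0, "product": 1, "hate": 2}
print(count_vector(["love", "love", "product", "gadget"], toy_vocab))
# [2, 1, 0]
```

The token `gadget` is not in the vocabulary, so it is silently ignored, which is also what `vectorize_tweets` does with unseen words at prediction time.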

#### Build the Model:

In this project, we use a Naive Bayes classifier for the classification step.

**Naive Bayes Implementation (Baseline)**

A Naive Bayes classifier is a simple yet effective algorithm for text classification. We implement it as follows:

**Calculate Prior Probabilities:**

$$P(\text{positive}) = \frac{\text{Number of positive samples}}{\text{Total samples}}$$

Similarly, calculate $P(negative)$.

**Calculate Likelihoods:** For each word in the vocabulary, calculate:

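$$P(\text{word} \mid \text{positive}) = \frac{\text{Count of the word in positive samples} + 1}{\text{Total words in positive samples} + \text{Vocabulary size}}$$

The $+1$ in the numerator and the vocabulary size in the denominator implement Laplace smoothing, which avoids zero probabilities for words that never appear in a class.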
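The smoothed likelihood formula can be checked by hand on a tiny example. The counts below are made up for illustration, and `smoothed_likelihoods` is a hypothetical helper, not part of the notebook's classifier.

```python
def smoothed_likelihoods(word_counts, alpha=1):
    """Apply add-alpha smoothing (Laplace when alpha=1) to raw class counts."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }

# Made-up counts for the positive class over a three-word vocabulary.
positive_counts = {"love": 3, "product": 2, "bad": 0}
print(smoothed_likelihoods(positive_counts))
# {'love': 0.5, 'product': 0.375, 'bad': 0.125}
```

Even though `bad` never occurs in the positive samples, it receives a small non-zero probability, and the three values still sum to 1.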

**Calculate Posterior Probabilities:** For a new sentence, calculate:

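$$P(\text{positive} \mid \text{sentence}) \propto P(\text{positive}) \times P(\text{word}_1 \mid \text{positive}) \times P(\text{word}_2 \mid \text{positive}) \times \dots$$

Classify the sentence as the class with the higher score. In practice, the implementation sums log-probabilities instead of multiplying probabilities, which produces the same ranking of classes.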
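One reason to work with logarithms is numerical: multiplying many small probabilities underflows to zero in floating point, whereas summing their logs stays finite. The following sketch uses made-up likelihood values purely to demonstrate the effect.

```python
import math

# 100 words, each with a made-up likelihood of 1e-5 under some class.
likelihoods = [1e-5] * 100

direct_product = math.prod(likelihoods)            # underflows to 0.0
log_score = sum(math.log(p) for p in likelihoods)  # stays finite (large and negative)

print(direct_product)
print(log_score)
```

With the direct product, every class would end up with a score of `0.0` for a long review and the comparison would be meaningless; the log scores remain comparable.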

In [11]:
class NaiveBayes:
    def __init__(self):
        self.class_probs = {}
        self.word_probs = {}

    def train(self, X, y):
        self.class_probs = {label: np.mean(y == label) for label in np.unique(y)}  # Prior probabilities
        self.word_probs = {}
        for label in np.unique(y):
            label_features = X[np.array(y) == label]
            word_counts = np.sum(label_features, axis=0) + 1  # Laplace smoothing
            total_count = np.sum(word_counts)
            self.word_probs[label] = word_counts / total_count  # Likelihood

    def predict(self, X):
        predictions = []
        for x in X:
            posteriors = {}
            for label in self.class_probs:
                prior = np.log(self.class_probs[label])  # Log of prior
                likelihood = np.sum(np.log(self.word_probs[label]) * x)  # Log of likelihood
                posteriors[label] = prior + likelihood  # Posterior
            predictions.append(max(posteriors, key=posteriors.get))
        return np.array(predictions)

#### Train-Test Split:

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(tweet_vectors, Y, test_size=0.2, random_state=42)
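If the goal is to avoid scikit-learn entirely, a shuffled split can be sketched with the standard library. `simple_train_test_split` is a hypothetical helper written for this illustration; it returns plain lists and will not reproduce the exact partition that `train_test_split` generates for the same seed.

```python
import random

def simple_train_test_split(samples, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then use the first
    `test_size` fraction as the test set and the rest for training."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

toy_X = list(range(10))
toy_y = [i % 2 for i in toy_X]
X_tr, X_te, y_tr, y_te = simple_train_test_split(toy_X, toy_y)
print(len(X_tr), len(X_te))
# 8 2
```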

#### Train and Evaluate the Model:

Here we train the Naive Bayes model on the training set and evaluate its accuracy on the test set.

In [15]:
nb = NaiveBayes()
nb.train(X_train, np.array(Y_train))

predictions = nb.predict(X_test)

accuracy = np.mean(predictions == np.array(Y_test))
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 88.17%
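Accuracy can hide how the errors are distributed between the two classes, so it may be worth computing precision and recall as well. The sketch below is a minimal, dependency-free version; `binary_report` is a hypothetical helper, and the toy labels are made up rather than taken from the Amazon dataset.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

toy_true = [1, 1, 1, 0, 0, 1, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 1, 1, 0]
print(binary_report(toy_true, toy_pred))
# {'accuracy': 0.625, 'precision': 0.6, 'recall': 0.75}
```

The same function could be called as `binary_report(list(Y_test), list(predictions))` on the notebook's variables.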


#### Test with New Input:

Finally, we can preprocess a new review, convert it to a vector, and predict sentiment.

In [18]:
def classify_text(text, vocab, model):
    processed = preprocess_tweet(text)
    vector = vectorize_tweets([processed], vocab)
    prediction = model.predict(vector)[0]
    return "positive" if prediction == 1 else "negative"

new_review = "This product is fantastic! I absolutely love it."
print(f"Sentiment for '{new_review}': {classify_text(new_review, vocab, nb)}")

Sentiment for 'This product is fantastic! I absolutely love it.': positive


## Conclusion:

In this project, we implemented a sentiment analysis system from scratch, using a Naive Bayes classifier to classify Amazon product reviews as positive or negative. The system reached an accuracy of **88.17%** on the held-out test set (20% of the data), which suggests that Naive Bayes is an effective baseline for text classification, particularly when the two classes use clearly different vocabulary.  

Despite the high accuracy, there are several ways to further improve the model's performance:  

1. **Feature Engineering**:  
   - Include **n-grams** (e.g., bigrams and trigrams) to capture context and relationships between words.  
   - Use **TF-IDF** (Term Frequency-Inverse Document Frequency) instead of simple word counts to weigh important terms more effectively.  

2. **Advanced Preprocessing**:  
   - Incorporate **lemmatization** instead of stemming for more meaningful root words.  
   - Address misspelled words or slang using spell-checking and normalization techniques.  

3. **Larger Vocabulary**:  
   - Train the model on a more extensive vocabulary by using a larger or more diverse dataset.  

4. **Data Augmentation**:  
   - Use techniques like synonym replacement or back-translation to artificially expand the dataset and improve generalization.  

5. **Other Classifiers**:  
   - Experiment with other machine learning algorithms like Logistic Regression or Support Vector Machines (SVM).  
   - Explore deep learning models like LSTMs or Transformers for better context understanding.  
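
As a starting point for the feature engineering ideas above, n-grams and TF-IDF weights can both be sketched in a few lines. The functions `ngrams` and `tfidf` are hypothetical helpers, and the weighting shown is the unsmoothed textbook form, count times log(N / document frequency), which differs slightly from the smoothed variants some libraries use.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Weight raw term counts by log(N / document frequency)."""
    n_docs = len(docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weighted

print(ngrams(["not", "good", "product"]))
# [('not', 'good'), ('good', 'product')]

toy_docs = [["love", "product"], ["hate", "product"]]
print(tfidf(toy_docs)[0])
```

In the toy documents, `product` appears in every document and so receives a weight of zero, while `love` keeps a positive weight. The bigram `('not', 'good')` preserves a negation that single-word counts would lose.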

This project serves as a foundation for understanding the basics of sentiment analysis and the Naive Bayes algorithm. With the improvements suggested above, the model may reach higher accuracy and adapt better to real-world text.