# Naive Bayes

![Logo Tec de Monterrey](https://javier.rodriguez.org.mx/itesm/2014/tecnologico-de-monterrey-blue.png)

**Author**: Mario Ignacio Frias Piña  
**Assignment**: Advanced Artificial Intelligence for Data Science I  
**Teacher**: Dr. Esteban Castillo Juarez  
**Date**: September 8 2024

## Introduction

In this notebook, we explore the Naive Bayes algorithm, a fundamental machine learning model known for its simplicity and efficiency in text classification tasks. Naive Bayes works on the assumption that the features are independent, which, while not always true in real-world data, leads to effective classification results in a wide range of applications, particularly in Natural Language Processing (NLP).

The project focuses on training a Naive Bayes model on a dataset, analyzing its performance, and applying various strategies such as feature selection, rebalancing, and handling stopwords to enhance the model's accuracy. Through this exploration, we aim to understand the strengths and limitations of Naive Bayes in text classification scenarios.

## Experimentation

### Naive Bayes Algorithm

The Naive Bayes algorithm is a probabilistic classification technique based on Bayes' Theorem. It assumes that the features in a dataset are conditionally independent given the class label, which simplifies the computation of the posterior probability for classification. Despite this "naive" assumption of feature independence, the algorithm performs well in many practical applications, particularly in text classification and spam detection.

Bayes' Theorem provides a way to calculate the probability of a class given the observed data, and it is expressed as:

$$P(c|X) = \frac{P(X|c)P(c)}{P(X)}$$

Naive Bayes classifiers are particularly effective when the number of features is large, as is often the case in natural language processing (NLP). The algorithm's simplicity allows it to scale efficiently to large datasets, and it is often used for tasks like document classification, sentiment analysis, and spam filtering.

### Bag of words

Bag of Words (BoW) is a simple and widely used method for text vectorization in Natural Language Processing (NLP). In this approach, a text document is represented as a vector of word frequencies, disregarding grammar and word order. Each unique word in the corpus forms a feature, and the corresponding value in the vector indicates the number of times that word appears in a given document.

In the context of the Naive Bayes implementation, BoW was likely used to convert text data into numerical vectors, which are suitable for input into the model. By limiting the vocabulary to the most important words and removing stopwords, the model focuses on the most informative features, reducing noise and improving classification performance. This method is effective for transforming raw text into a structured format that machine learning algorithms can process.

### The dataset

The dataset is comprised of 4187 training examples, and 867 test examples, divided into 3 categories: Positive, Neutral and Negative. The samples were collected from Twitter. And it was preprocessed in the following way:

- Removed URLs
- Removed Punctuation
- Removed Excess Whitespace
- Removed Non-ASCII Characters

| Category | Training | Test |
| --- | --- | --- |
| Total | 4187 | 867 |
| Positive | 2249 | 358 |
| Neutral | 1079 | 330 |
| Negative | 859 | 179 |

### Optimization Techniques

- Rebalancing

Involves removing extra examples from the dataset in order to balance the classes. Rebalancing is useful to improve the accuracy of the model by reducing the bias of the model towards certain classes.

- Removal of stopwords

Involves eliminating common, often trivial words from text data to improve the focus of analysis or processing. Stopwords are frequently used words (like "the," "and," "in") that might not carry significant meaning for certain analyses, such as text mining or natural language processing.

## Code

### Read the dataset

In [1]:
import math
import plotly.express as px
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

In [2]:
training = pd.read_csv('training.txt', header=None, delimiter='@@@', names=['text', 'sentiment'], engine='python')
test = pd.read_csv('test.txt', header=None, delimiter='@@@', names=['text', 'sentiment'], engine='python')

### Get the vocabulary

In [3]:

def getVocab(input):
    """
    Get the vocabulary of the input

    Args:
        input (DataFrame): The input

    Returns:
        list: The vocabulary
    """
    vocab = {}
    for index, row in training.iterrows():
        for word in row['text'].split():
            if word not in vocab:
                vocab[word] = 0
            vocab[word] += 1
    vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)

    return vocab

vocab = getVocab(training)

vocab[:10]

[('the', 3509),
 ('to', 1978),
 ('in', 1304),
 ('on', 1217),
 ('a', 1152),
 ('and', 1150),
 ('i', 1140),
 ('of', 1026),
 ('for', 985),
 ('is', 849)]

### Do the Naive Bayes training 

In [4]:
# Create functions to be used later
def vectorize(input, features):
    """
    Vectorize the input data using the features

    Args:
        input (pandas.DataFrame): The input data
        features (list): The vocabulary

    Returns:
        list: The vectorized data
        list: The labels
    """
    Vectors = []
    Labels = []

    for index, row in input.iterrows():
        temp = []
        for feature in features:
            if feature[0] in row['text'].split():
                #Calculate the frequency of the word in the document to use bag of words
                temp.append(row['text'].count(feature[0]))
            else:
                temp.append(0)
        Vectors.append(temp)
        Labels.append(row['sentiment'])

    return Vectors, Labels

def NaiveBayesTrain(trainingVectors, trainingLabels):
    """
    Train the Naive Bayes model

    Args:
        Vectors (list): The vectorized data
        labels (list): The labels

    Returns:
        dict: The Naive Bayes model with the probabilities of each word, the probabilities of each label, and the different labels
    """
    numberTrainingDocuments=len(trainingVectors)

    #Number of features in the vectors (training and test have the same number)
    NumberFeatures=len(trainingVectors[0])

    #Calculate the probability that a document is associated to a label
    differentLabels=set(trainingLabels)
    probability = {}
    for label in differentLabels:
        probability[label] = trainingLabels.count(label) / float(numberTrainingDocuments)

    #initialize the numerator and denominator for the p(xi|positive) and p(xi|negative) calculations.
    NumeratorProbability= {}
    for label in differentLabels:
        NumeratorProbability[label] = [1]*NumberFeatures

    DenominatorProbability = {}
    for label in differentLabels:
        DenominatorProbability[label] = 2
    
    #Iterate over training documents
    for x in range(numberTrainingDocuments):
        counter = 0
        for y in trainingVectors[x]:
            NumeratorProbability[trainingLabels[x]][counter] += y
            counter += 1
        DenominatorProbability[trainingLabels[x]] += sum(trainingVectors[x])
    
    WordProbability = {}
    for label in differentLabels:
        WordProbability[label] = []
        for x in NumeratorProbability[label]:
            WordProbability[label].append(math.log(x/float(DenominatorProbability[label])))
    
    return {"WordProbability": WordProbability, "probability": probability, "labels": differentLabels}

def NaiveBayesPredict(vector, model):
    """
    Predict the label of the vector using the Naive Bayes model

    Args:
        vector (list): The vector to predict
        model (dict): The Naive Bayes model

    Returns:
        string: The predicted label
    """

    probabilities = { label: [] for label in model["labels"]}

    #Go through the vector bag of words
    for counter, x in enumerate(vector):
        # Go through the different labels
        for label in model["labels"]:
            probabilities[label].append(x * model["WordProbability"][label][counter])

    p = {}
    for label in model["labels"]:
        p[label] = sum(probabilities[label]) + math.log(model["probability"][label])

    return max(p, key=p.get)

def calculate_results(predictions, testLabels):
    """
    Calculate the accuracy, precision, recall, and F1 score of the model

    Args:
        predictions (list): The predicted labels
        testLabels (list): The true labels

    Returns:
        dict: The results of the model
    """
    results = {}

    results['test_accuracy'] = accuracy_score(testLabels, predictions)
    results['test_precision_macro'] = precision_score(testLabels, predictions, average='macro')
    results['test_recall_macro'] = recall_score(testLabels, predictions, average='macro')
    results['test_f1_macro'] = f1_score(testLabels, predictions, average='macro')

    return results

def model_NB(train, test, numFeatures):
    """
    Train and test the Naive Bayes model

    Args:
        train (DataFrame): The training data
        test (DataFrame): The test data
        numFeatures (int): The number of features to use

    Returns:
        dict: The results of the model
    """
    # Vectorize the data
    trainingVectors, trainingLabels = vectorize(train, vocab[:numFeatures])
    testVectors, testLabels = vectorize(test, vocab[:numFeatures])
    # Manual Naive Bayes
    model = NaiveBayesTrain(trainingVectors, trainingLabels)
    predictions = []
    for index, row in enumerate(testVectors):
        predictions.append(NaiveBayesPredict(row, model))
    manual_results = calculate_results(predictions, testLabels)

    # Sklearn Naive Bayes
    model = MultinomialNB()
    model.fit(trainingVectors, trainingLabels)
    predictions = model.predict(testVectors)
    result = calculate_results(predictions, testLabels)

    return manual_results, result

# Compare the results with different number of features
list_results = {'test_accuracy': [], 'test_precision_macro': [], 'test_recall_macro': [], 'test_f1_macro': []}
list_sklearn_results = {'test_accuracy': [], 'test_precision_macro': [], 'test_recall_macro': [], 'test_f1_macro': []}
for numFeatures in [20, 40, 60, 80, 100, 120, 200]:
    manual_results, result = model_NB(training, test, numFeatures)
    list_results['test_accuracy'].append(manual_results['test_accuracy'])
    list_results['test_precision_macro'].append(manual_results['test_precision_macro'])
    list_results['test_recall_macro'].append(manual_results['test_recall_macro'])
    list_results['test_f1_macro'].append(manual_results['test_f1_macro'])

    list_sklearn_results['test_accuracy'].append(result['test_accuracy'])
    list_sklearn_results['test_precision_macro'].append(result['test_precision_macro'])
    list_sklearn_results['test_recall_macro'].append(result['test_recall_macro'])
    list_sklearn_results['test_f1_macro'].append(result['test_f1_macro'])

df_results = pd.DataFrame(list_results)
df_results

Unnamed: 0,test_accuracy,test_precision_macro,test_recall_macro,test_f1_macro
0,0.381776,0.366771,0.35068,0.309495
1,0.385236,0.380024,0.357689,0.330611
2,0.434833,0.433122,0.416224,0.39684
3,0.467128,0.465163,0.447915,0.437166
4,0.485582,0.48319,0.469031,0.45818
5,0.495963,0.488519,0.478201,0.469663
6,0.512111,0.49689,0.493748,0.487559


## Results

In [5]:
manual_results, result = model_NB(training, test, 200)
px.bar(y=[manual_results['test_accuracy'], result['test_accuracy']], x=["Manual Naive Bayes", "Sklearn Naive Bayes"], title="Accuracy", range_y=[0, 1])

In [6]:
fig = px.line(df_results, y=['test_accuracy', 'test_precision_macro', 'test_recall_macro', 'test_f1_macro'], x=[20, 40, 60, 80, 100, 120, 200], title="Results", range_y=[0, 1])
fig.show()

### Rebalance the training set

In [7]:
#Get the category with the least number of rows in the training set
number = training['sentiment'].value_counts().min()

# Drop the rows that exceed the number of each category
training = training.groupby('sentiment').head(number).reset_index(drop=True)

training['sentiment'].value_counts()

sentiment
positive    859
negative    859
neutral     859
Name: count, dtype: int64

In [8]:
new_manual_results, new_result = model_NB(training, test, 200)

new_manual_results

{'test_accuracy': 0.5155709342560554,
 'test_precision_macro': 0.5125947059321162,
 'test_recall_macro': 0.5176457310535523,
 'test_f1_macro': 0.5051647613201565}

### Remove stopwords

In [9]:
# Remove stopwords from the vocabulary
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(len(vocab))

vocab = [word for word in vocab if word[0] not in ENGLISH_STOP_WORDS]

print(len(vocab))

trainingVectors, trainingLabels = vectorize(training, vocab[:200])
testVectors, testLabels = vectorize(test, vocab[:200])

model = MultinomialNB()
model.fit(trainingVectors, trainingLabels)
predictions = model.predict(testVectors)
result = calculate_results(predictions, testLabels)

result

12702
12438


{'test_accuracy': 0.5178777393310265,
 'test_precision_macro': 0.5174768614794235,
 'test_recall_macro': 0.5263077704418486,
 'test_f1_macro': 0.5110746573344999}

In [10]:
df_final = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1'] * 3,
    'Technique': ['Original Naive Bayes'] * 4 + ['Rebalanced Naive Bayes'] * 4 + ['No Stopwords Naive Bayes'] * 4,
    'Value': df_results.tail(1).values.tolist()[0] + list(new_manual_results.values()) + list(result.values())
}
)

# Plot the results
px.bar(df_final, x='Metric', y='Value', color='Technique', barmode='group', title="Comparison between different techniques for Naive Bayes", range_y=[0, 1])

## Results

The notebook compares the performance of Naive Bayes with and without rebalancing the dataset. Rebalancing is often done to address issues with class imbalance, which can negatively affect the accuracy of classification models. After applying rebalancing techniques, the results showed an improvement in the model's accuracy, suggesting that the model benefits from more balanced data distributions, even though the improvement was not very significant.

One of the key metrics used to evaluate the performance of the Naive Bayes classifier was test accuracy. The rebalanced model achieved a test accuracy of approximately 0.515 after rebalancing and removing stopwords. This value represents an improvement over the initial model but still shows there is room for further enhancement. While this accuracy may seem modest, it demonstrates the challenges that Naive Bayes faces in certain datasets, especially when features are not truly independent.

In addition to accuracy, the notebook also evaluates precision, recall, and F1 scores using macro averaging. These metrics provide a more nuanced view of performance across different classes. The precision, recall, and F1 scores were all around 0.51, indicating that the model is performing consistently across all classes. However, the relatively low values suggest that further work is needed to boost the model's ability to capture the nuances of the dataset.

After removing stopwords from the vocabulary, there was an improvement in the model's precision and recall. This suggests that these common words, which occur frequently across all classes, added noise rather than contributing to the discriminative power of the model. By removing them, the model was able to focus on more informative terms, thereby improving its performance.

## Conclusion

This notebook demonstrates the practical implementation of the Naive Bayes classifier for text classification. We experimented with multiple techniques to improve model performance, including vocabulary selection, rebalancing the dataset, and removing stopwords.

Despite the model’s simplicity, Naive Bayes achieved reasonable accuracy, but there remains room for improvement, especially when dealing with imbalanced data and feature correlations. Overall, Naive Bayes remains a valuable tool for quick and efficient classification tasks, and further improvements can be explored by integrating more sophisticated feature engineering and model tuning techniques.