# Naive Bayes

Example use case: sentiment analysis.

Using Bayes' theorem to classify text as positive, neutral, or negative:

$$
\begin{aligned}
P(positive | texts) &= \frac{P(texts | positive) \cdot P(positive)}{P(texts)} \\
P(neutral | texts) &= \frac{P(texts | neutral) \cdot P(neutral)}{P(texts)} \\
P(negative | texts) &= \frac{P(texts | negative) \cdot P(negative)}{P(texts)}
\end{aligned}
$$

Where $P(positive)$, $P(neutral)$ and $P(negative)$ are the prior probabilities of the classes, equivalent to the fraction of the training set that belongs to each class.

$P(texts | positive)$ is the likelihood of the texts given that they come from the positive class, and likewise for the other classes.

Since $P(texts)$, the denominator, is the same for every class, we can ignore it and compare only the numerators.
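As a small illustration of why the denominator can be dropped, the sketch below normalizes a set of made-up numerators and checks that the winning class does not change. The numbers are hypothetical and are not taken from the dataset.

```python
# Toy numerators P(texts | class) * P(class); the values are made up for illustration.
numerators = {"positive": 0.012, "neutral": 0.003, "negative": 0.0005}

# P(texts) is the sum of the numerators over all classes (law of total probability).
evidence = sum(numerators.values())
posteriors = {cls: score / evidence for cls, score in numerators.items()}

# Dividing every score by the same positive constant cannot change the ordering.
best_by_numerator = max(numerators, key=numerators.get)
best_by_posterior = max(posteriors, key=posteriors.get)

print(best_by_numerator, best_by_posterior)
```

Normalizing is only needed if calibrated probabilities are wanted; for picking a class, the numerators are enough.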

Assuming the words occur independently given the class, the likelihood of the texts under the positive class is (and likewise for the negative and neutral classes):

$$
P(texts | positive) = \prod_{i=1}^{n} P(word_i | positive)
$$

We should take the log of the likelihoods to avoid underflow caused by multiplying many small probabilities. Because the logarithm is monotonically increasing, the log-likelihoods preserve the ordering of the likelihoods, so the predicted class is unchanged.

$$
\log P(texts | positive) = \sum_{i=1}^{n} \log P(word_i | positive)
$$
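The underflow problem is easy to reproduce. The sketch below uses 400 hypothetical word likelihoods of 0.001 each: the plain product collapses to zero in floating point, while the sum of logs stays a finite, comparable number.

```python
import math

# 400 hypothetical word likelihoods of 0.001 each (illustrative values only).
word_probs = [1e-3] * 400

# Direct product: the true value is 1e-1200, far below the smallest positive float.
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: 400 * log(0.001), which is easily representable.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)
```

Once every class score underflows to 0.0, the comparison between classes becomes meaningless, which is why the log form is preferred for longer texts.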

Now for the likelihood of each word given the class, we calculate it as follows:

$$
P(word_i | positive) = \frac{count(text\:with\:word_i, positive)}{count(positive)}
$$

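If $word_i$ does not appear in the training set at all, set $P(word_i | positive) = 1$ for every class, which is the same as skipping the word when scoring. For words that do appear in the training set, the implementation below starts each per-class count at 1, which amounts to assuming that every word appears at least once in every class and avoids zero probabilities. Another way to handle this is Laplace smoothing.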
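For reference, here is a minimal sketch of Laplace (add-one) smoothing in its common multinomial form, where the denominator is the total number of words in the class plus the vocabulary size. This differs from the denominator $count(positive)$ used in this notebook; the function name `laplace_word_prob`, the `alpha` parameter, and the toy counts are assumptions made for illustration.

```python
from collections import Counter


def laplace_word_prob(word, class_counts, vocab_size, alpha=1.0):
    """(count(word, class) + alpha) / (total words in class + alpha * |V|)."""
    total_words_in_class = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total_words_in_class + alpha * vocab_size)


# Hypothetical word counts for the positive class and a four-word vocabulary.
positive_counts = Counter({"good": 3, "great": 2, "movie": 1})
vocab = {"good", "great", "movie", "bad"}

p_seen = laplace_word_prob("good", positive_counts, len(vocab))
p_unseen = laplace_word_prob("bad", positive_counts, len(vocab))

print(p_seen)    # 0.4
print(p_unseen)  # 0.1
```

With this form the smoothed probabilities over the vocabulary still sum to 1 for each class, and a word never seen in a class gets a small non-zero probability instead of zero.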

In [1]:
# imports
import os
import re
import string
from typing import Iterable, Callable
from functools import partial

import numpy as np
import pandas as pd
import emoji
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

nltk.download("stopwords")
nltk.download("punkt_tab")
nltk.download("wordnet")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samridhashrestha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/samridhashrestha/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/samridhashrestha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# Download sentiment analysis dataset
import kagglehub

# Download latest version
path = kagglehub.dataset_download("abhi8923shriv/sentiment-analysis-dataset")

print("Path to dataset files:", path)

Path to dataset files: /Users/samridhashrestha/.cache/kagglehub/datasets/abhi8923shriv/sentiment-analysis-dataset/versions/9


In [3]:
train_path = os.path.join(path, "train.csv")
test_path = os.path.join(path, "test.csv")

assert os.path.exists(train_path)
assert os.path.exists(test_path)

In [4]:
train_df = pd.read_csv(train_path, encoding="latin1")
test_df = pd.read_csv(test_path, encoding="latin1")

print(train_df.shape, train_df.columns)
print(test_df.shape, test_df.columns)

(27481, 10) Index(['textID', 'text', 'selected_text', 'sentiment', 'Time of Tweet',
       'Age of User', 'Country', 'Population -2020', 'Land Area (Km²)',
       'Density (P/Km²)'],
      dtype='object')
(4815, 9) Index(['textID', 'text', 'sentiment', 'Time of Tweet', 'Age of User',
       'Country', 'Population -2020', 'Land Area (Km²)', 'Density (P/Km²)'],
      dtype='object')


In [5]:
train_df.head()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


In [6]:
train_df = train_df[["text", "sentiment"]]
test_df = test_df[["text", "sentiment"]]

print(train_df.shape, train_df.columns)
print(test_df.shape, test_df.columns)

(27481, 2) Index(['text', 'sentiment'], dtype='object')
(4815, 2) Index(['text', 'sentiment'], dtype='object')


In [7]:
train_df["sentiment"].value_counts(), test_df["sentiment"].value_counts()

(sentiment
 neutral     11118
 positive     8582
 negative     7781
 Name: count, dtype: int64,
 sentiment
 neutral     1430
 positive    1103
 negative    1001
 Name: count, dtype: int64)
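The class priors follow directly from these training counts. The sketch below recomputes them from the raw `value_counts()` output above; the notebook itself derives them later from the filtered `y_train`, which has one row fewer, so the values may differ very slightly.

```python
import math

# Training-set class counts copied from the value_counts() output above.
train_counts = {"neutral": 11118, "positive": 8582, "negative": 7781}
total = sum(train_counts.values())

# Prior = fraction of training texts in each class; log-priors are used with log-likelihoods.
priors = {cls: n / total for cls, n in train_counts.items()}
log_priors = {cls: math.log(p) for cls, p in priors.items()}

for cls in train_counts:
    print(f"{cls}: P = {priors[cls]:.3f}, log P = {log_priors[cls]:.3f}")
```

Neutral is the most frequent class, so with no informative words a text defaults to neutral.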

## Preprocess Dataset

Clean the text column by removing URLs, numbers, mentions, and hashtags and converting emojis to text. Then tokenize each text, lowercase the tokens, remove stopwords and punctuation, and lemmatize what remains.

Convert the sentiment labels positive, neutral, and negative to 1, 0, and -1 respectively.

In [8]:
LABEL_TXT2NUM = {"positive": 1, "negative": -1, "neutral": 0}
LABEL_NUM2TXT = {v: k for k, v in LABEL_TXT2NUM.items()}


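def preprocess_text(text_array):
    """
    Tokenize, lowercase, and lemmatize each text, dropping stopwords and punctuation.

    Accepts a single string or an iterable of strings. Returns one token array
    if exactly one text was given, otherwise a list of token arrays.
    """
    lemmatizer = WordNetLemmatizer()
    # Make a set with the stopwords and punctuation
    stop = set(stopwords.words("english") + list(string.punctuation))

    # Wrap a single string so it is processed as one text, not character by character
    if isinstance(text_array, str):
        text_array = np.asarray([text_array])

    X_preprocessed = []
    for text in text_array:
        tokens = np.asarray(
            [
                lemmatizer.lemmatize(w.lower())
                for w in word_tokenize(text)
                if w.lower() not in stop
            ]
        ).astype(text_array.dtype)
        X_preprocessed.append(tokens)

    return X_preprocessed[0] if len(X_preprocessed) == 1 else X_preprocessed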


def preprocess_labels(y: Iterable):
    """
    Maps sentiment labels ('positive', 'negative', 'neutral') to integers
    (1, -1, 0) and converts them to a numpy array.

    Args:
        y (Iterable[str]): An iterable of sentiment labels.

    Returns:
        np.ndarray: An array of integers corresponding to the sentiment labels.
    """
    try:
        y = [LABEL_TXT2NUM[label] for label in y]
    except KeyError as e:
        # Report the offending label together with the allowed ones
        raise ValueError(
            f"Invalid label found: {e.args[0]}. Allowed labels are {list(LABEL_TXT2NUM.keys())}."
        ) from e
    return np.asarray(y).astype(int)


def proprocess_df(df: pd.DataFrame):
    """
    Processes a DataFrame with 'text' and 'sentiment' columns by preprocessing
    the text data and mapping sentiment labels to integers.

    Args:
        df (pd.DataFrame): A pandas DataFrame with columns 'text' and 'sentiment'.

    Returns:
        tuple: Preprocessed text (list of lists) and sentiment labels (numpy array).
    """
    # Ensure required columns exist
    if not {"text", "sentiment"}.issubset(df.columns):
        raise ValueError("DataFrame must contain 'text' and 'sentiment' columns.")
    # Precompile regex patterns
    url_pattern = re.compile(r"http\S+|www\S+")
    number_pattern = re.compile(r"\d+")
    mention_pattern = re.compile(r"@\w+")
    hashtag_pattern = re.compile(r"#\w+")

    def clean_text(text):
        """
        Cleans a single text string by:
        - Removing URLs
        - Removing numbers
        - Converting emojis to text
        - Removing mentions and hashtags
        """
        text = url_pattern.sub("", text)  # Remove URLs
        text = number_pattern.sub("", text)  # Remove numbers
        text = emoji.demojize(text)  # Convert emojis to text
        text = mention_pattern.sub("", text)  # Remove @ mentions
        text = hashtag_pattern.sub("", text)  # Remove hashtags
        return text.strip()

    # Filter out non-string rows (e.g. NaN) and clean text.
    # Copy the filtered frame so the assignment below does not write to a view.
    df = df[df["text"].apply(lambda x: isinstance(x, str))].copy()
    df.loc[:, "text"] = df["text"].map(clean_text)

    X, y = df["text"], df["sentiment"]
    X, y = preprocess_text(X), preprocess_labels(y)
    return X, y

In [9]:
X_train, y_train = proprocess_df(train_df)
X_test, y_test = proprocess_df(test_df)

In [10]:
print(len(X_train), y_train.shape)
print(len(X_test), y_test.shape)

27480 (27480,)
3534 (3534,)


## Required calculations

$$
\begin{aligned}
P(positive) &= \frac{number\ of\ positive\ texts}{total\ number\ of\ texts} \\
\log P(texts | positive) &= \sum_{i=1}^{n} \log P(word_i | positive) \\
P(word_i | positive) &= \frac{\#\ texts\ containing\ word_i\ in\ positive\ class}{\#\ of\ texts\ in\ positive\ class}
\end{aligned}
$$

To eventually calculate:

$$
\begin{aligned}
P(positive | text) &= \frac{P(text | positive) P(positive)}{P(text)} \\
\text{where} \quad P(text | positive) &= P(word_1 | positive) P(word_2 | positive) \cdots P(word_n | positive)
\end{aligned}
$$
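To make the word-likelihood formula concrete, the sketch below estimates it on a tiny hypothetical corpus by counting, for each class, the number of texts that contain a word. Using `set(tokens)` counts a word once per text; note that the notebook's `get_word_freq_dict` below increments once per occurrence instead, so a repeated word counts more than once there. The corpus and helper names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical tokenized texts with their labels.
corpus = [
    (["good", "good", "movie"], "positive"),
    (["great", "movie"], "positive"),
    (["bad", "movie"], "negative"),
]

texts_per_class = defaultdict(int)
texts_with_word = defaultdict(lambda: defaultdict(int))

for tokens, label in corpus:
    texts_per_class[label] += 1
    for word in set(tokens):  # set() counts each word at most once per text
        texts_with_word[word][label] += 1


def word_likelihood(word, label):
    """# texts in the class containing the word / # texts in the class."""
    return texts_with_word[word][label] / texts_per_class[label]


print(word_likelihood("good", "positive"))   # 0.5
print(word_likelihood("movie", "positive"))  # 1.0
```

Without any smoothing, a word that never occurs in a class gets a likelihood of 0 here, which would zero out the whole product for that class.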

In [18]:
def get_class_freq_dict(y: np.ndarray):
    """Count the number of texts per class, keyed by the text label."""
    return {k: y[y == v].shape[0] for k, v in LABEL_TXT2NUM.items()}


def get_word_freq_dict(X: np.ndarray, y: np.ndarray) -> dict:
    """
    Count how often each word occurs in the texts of each class.

    Every per-class count starts at 1, so no word has a zero count in any class.
    """
    word_freq_dict = {}
    for text, label in zip(X, y):
        for word in text:
            if word not in word_freq_dict:
                word_freq_dict[word] = {"positive": 1, "neutral": 1, "negative": 1}
            word_freq_dict[word][LABEL_NUM2TXT[label]] += 1

    return word_freq_dict


def get_prob_word_given_class(
    word: str, cls: str, word_freq_dict: dict, class_freq_dict: dict
) -> float:
    """
    Calculate the conditional probability of a given word occurring in a specific class.
    P(word_i | class) = freq(word_i, class) / freq(class)

    Assume each word appears at least once in each class.
    """
    return word_freq_dict[word][cls] / class_freq_dict[cls]


def get_prob_text_given_class(
    text: list, cls: str, word_freq_dict: dict, class_freq_dict: dict
) -> float:
    """
    Calculate the conditional probability of a given text occurring in a specific class.
    P(text | class) = P(word_1 | class) * P(word_2 | class) * ... * P(word_n | class)
    """
    prob = 1
    for word in text:
        # only compute for words in the word_freq_dict
        if word in word_freq_dict:
            prob *= get_prob_word_given_class(
                word, cls, word_freq_dict, class_freq_dict
            )
    return prob


def get_log_prob_text_given_class(
    text: list, cls: str, word_freq_dict: dict, class_freq_dict: dict
) -> float:
    """
    Calculate the log of the conditional probability of a given text occurring in a specific class.
    log(P(text | class)) = log(P(word_1 | class)) + log(P(word_2 | class)) + ... + log(P(word_n | class))
    """
    prob = 0
    for word in text:
        # only compute for words in the word_freq_dict
        if word in word_freq_dict:
            prob += np.log(
                get_prob_word_given_class(word, cls, word_freq_dict, class_freq_dict)
            )
    return prob


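def clsf_sentiment_naive_bayes(
    text: list, word_freq_dict: dict, class_freq_dict: dict, use_log: bool = False
) -> dict:
    """
    Classify the sentiment of a tokenized text using the Naive Bayes algorithm.

    `text` must be an iterable of tokens (e.g. the output of `preprocess_text`);
    a raw string would be iterated character by character. The predicted class
    is the one with the highest P(text | class) * P(class), or the highest
    log of that product if `use_log` is True.
    """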
    pred = {}
    for cls in LABEL_TXT2NUM.keys():
        if use_log:
            log_p_text_given_cls = get_log_prob_text_given_class(
                text, cls, word_freq_dict, class_freq_dict
            )
            log_p_cls = np.log(class_freq_dict[cls] / sum(class_freq_dict.values()))
            log_p_text_given_cls_likelihood = log_p_text_given_cls + log_p_cls

            pred[cls] = log_p_text_given_cls_likelihood
        else:
            p_text_given_cls = get_prob_text_given_class(
                text, cls, word_freq_dict, class_freq_dict
            )
            p_cls = class_freq_dict[cls] / sum(class_freq_dict.values())
            p_text_given_cls_likelihood = p_text_given_cls * p_cls

            pred[cls] = p_text_given_cls_likelihood
    pred_max_cls = max(pred, key=pred.get)

    return {
        "pred_cls": LABEL_TXT2NUM[pred_max_cls],
        "pred_cls_label": pred_max_cls,
        "pred_cls_likelihood": pred[pred_max_cls],
    }

In [19]:
# get class freq
class_freq_dict = get_class_freq_dict(y_train)
# get freq per class for each word
word_freq_dict = get_word_freq_dict(X_train, y_train)

In [20]:
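# NOTE: the raw string passed here is iterated character by character, so the
# scores are computed from single-character tokens. Pass
# preprocess_text("That movie was garbage") instead to score word tokens.
print(
    clsf_sentiment_naive_bayes(
        "That movie was garbage", word_freq_dict, class_freq_dict
    )
)
print(
    clsf_sentiment_naive_bayes(
        "That movie was garbage", word_freq_dict, class_freq_dict, use_log=True
    )
)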

{'pred_cls': 0, 'pred_cls_label': 'neutral', 'pred_cls_likelihood': 1.2852100867013203e-39}
{'pred_cls': 0, 'pred_cls_label': 'neutral', 'pred_cls_likelihood': np.float64(-89.54989643018753)}


In [24]:
def calculate_metrics(y_test, y_pred):
    """Compute accuracy and support-weighted precision, recall, and F1."""
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="weighted")
    recall = recall_score(y_test, y_pred, average="weighted")
    f1 = f1_score(y_test, y_pred, average="weighted")

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def get_preds(X_test: np.ndarray, naive_bayes_classifier: Callable):
    """Run the classifier on each tokenized text and collect the numeric labels."""
    y_pred = []
    for text in X_test:
        y_pred.append(naive_bayes_classifier(text)["pred_cls"])
    return y_pred

In [25]:
clsf_sentiment_naive_bayes_test = partial(
    clsf_sentiment_naive_bayes,
    word_freq_dict=word_freq_dict,
    class_freq_dict=class_freq_dict,
)
y_pred = get_preds(X_test, clsf_sentiment_naive_bayes_test)

print(calculate_metrics(y_test, y_pred))

{'accuracy': 0.6406338426711942, 'precision': 0.6545125861963154, 'recall': 0.6406338426711942, 'f1': 0.6266208360574069}


In [26]:
clsf_sentiment_naive_bayes_test = partial(
    clsf_sentiment_naive_bayes,
    word_freq_dict=word_freq_dict,
    class_freq_dict=class_freq_dict,
    use_log=True,
)
y_pred = get_preds(X_test, clsf_sentiment_naive_bayes_test)

print(calculate_metrics(y_test, y_pred))

{'accuracy': 0.6403508771929824, 'precision': 0.6542193305834327, 'recall': 0.6403508771929824, 'f1': 0.6262676456080541}
