# Multinomial Naive Bayes for classifying text to movie genres
### An assignment as part of the course INF367: Selected Topics in Artificial Intelligence at the University of Bergen

---


### Description of the problem
In the field of natural language processing, we often want to assign a category (label) to text documents depending on their content. Naive Bayes classifiers are a simple family of approaches to this text categorisation problem that often perform surprisingly well. They rely on a bag-of-words representation of the document and make use of Bayes' rule:

$\Large P(c|d) = \frac{P(d|c)P(c)}{P(d)}$

For classifying text documents, the formula reduces to choosing the class that maximises the product of a likelihood and a prior:

<div>
<img src="../img/naive_bayes_argmax_formula.png" width="400"/>
</div>
<div align="center"><i><b>Fig. 1</b>: The Naive Bayes argmax formula for classification.<br></i></div>

<br>
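
Spelled out, the step from Bayes' rule to this decision rule can be sketched as follows. The notation ($\hat{c}$ for the predicted class, $C$ for the set of classes) is an assumption based on Jurafsky & Martin, since the figure itself is only included as an image:

$\Large \hat{c} = \underset{c \in C}{\mathrm{argmax}}\; P(c|d) = \underset{c \in C}{\mathrm{argmax}}\; \frac{P(d|c)P(c)}{P(d)} = \underset{c \in C}{\mathrm{argmax}}\; P(d|c)P(c)$

The denominator $P(d)$ is the same for every class, so dropping it does not change which class attains the maximum.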

The Multinomial Naive Bayes algorithm makes two important assumptions:
1. The position of words in the document doesn't matter.
2. The feature probabilities are conditionally independent given the class.

These assumptions make calculating the probabilities simple and efficient, though the accuracy of a Naive Bayes classifier is often lower than that of more complex algorithms.
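
The first assumption is what a bag-of-words representation encodes. A minimal sketch, using two made-up documents and Python's `collections.Counter` (neither is part of the implementation in this notebook), shows that reordering the words leaves the representation unchanged:

```python
from collections import Counter

# Two hypothetical documents containing the same words in a different order
doc_a = ["fast", "couple", "shoot", "fly", "fast"]
doc_b = ["fly", "fast", "shoot", "fast", "couple"]

# A bag of words keeps only how often each word occurs, not where it occurs
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a["fast"])   # 2
print(bag_a == bag_b)  # True
```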

The rest of part 1 of this report presents and describes the implementation of the Multinomial Naive Bayes approach.


### Description of the approach
I decided to use the algorithm described on page 62 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf), Third Edition, by Daniel Jurafsky and James H. Martin, as a guideline for my implementation of Multinomial Naive Bayes. The algorithm can be seen in its entirety below:


<div>
<img src="../img/naive_bayes_alg.png" width="600"/>
</div>

<div align="center"><i><b>Fig. 2</b>: The Multinomial Naive Bayes Algorithm. (Jurafsky & Martin, 2022)<br></i></div>
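
As a rough companion to Fig. 2, the training and classification steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the class implemented later in this notebook: the function names `train_naive_bayes` and `classify` are hypothetical, add-one smoothing is applied over the whole vocabulary, and words outside the vocabulary are skipped at test time. The five training documents are the ones used in the testing section of this notebook.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Return log priors, log likelihoods and the vocabulary (add-one smoothing)."""
    vocabulary = {word for doc in documents for word in doc}
    log_prior, log_likelihood = {}, {}
    for c in set(labels):
        class_docs = [doc for doc, label in zip(documents, labels) if label == c]
        # Prior: fraction of training documents labelled c
        log_prior[c] = math.log(len(class_docs) / len(documents))
        # Word counts over all documents of class c
        counts = Counter(word for doc in class_docs for word in doc)
        denominator = sum(counts.values()) + len(vocabulary)
        # Add-one smoothing: every vocabulary word gets a non-zero probability
        log_likelihood[c] = {
            word: math.log((counts[word] + 1) / denominator) for word in vocabulary
        }
    return log_prior, log_likelihood, vocabulary


def classify(document, log_prior, log_likelihood, vocabulary):
    """Return the class with the highest sum of log prior and log likelihoods."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word] for word in document if word in vocabulary
        )
    return max(scores, key=scores.get)


documents = [
    ["fun", "couple", "love", "love"],
    ["fast", "furious", "shoot"],
    ["couple", "fly", "fast", "fun", "fun"],
    ["furious", "shoot", "shoot", "fun"],
    ["fly", "fast", "shoot", "love"],
]
labels = ["comedy", "action", "comedy", "action", "action"]

log_prior, log_likelihood, vocabulary = train_naive_bayes(documents, labels)
print(classify(["fast", "couple", "shoot", "fly"], log_prior, log_likelihood, vocabulary))  # action
```

In this sketch the prior is the share of training documents per class, so every log prior is at most zero.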

It is worth noting that the algorithm sums the logarithms of the probabilities instead of multiplying the probabilities directly as in the formula. Multiplying many small probabilities can cause floating-point underflow, so working in log space keeps the computation numerically stable.
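
The underflow problem is easy to reproduce. The sketch below uses 100 made-up probabilities of `1e-5` each (an arbitrary illustrative choice) and compares the direct product with the sum of logs:

```python
import math

# 100 hypothetical word probabilities, each small but non-zero
probabilities = [1e-5] * 100

# The true product is 1e-500, far below the smallest positive float (about 5e-324)
product = math.prod(probabilities)

# The equivalent sum of logs is an ordinary, representable number
log_sum = sum(math.log(p) for p in probabilities)

print(product)  # 0.0
print(log_sum)  # roughly -1151.29
```

Because the logarithm is monotonic, comparing sums of logs ranks the classes in the same order as comparing the products would.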

I implemented the algorithm to solve the problem of classifying which genre a movie belongs to, based on some text document describing the movie. The document can be any related text such as reviews, a synopsis or the movie's English subtitles. The aim was to be able to classify the simple document `D = fast, couple, shoot, fly` with the label `action`. The classifier was trained with a tiny dataset, consisting of five documents with 3-5 keywords and a label of either `comedy` or `action`. 



### Description of the software
There are two ways to run the software: <br>
1. **As part of this Jupyter Notebook.** <br>
Open this file in Jupyter Notebook or an IDE capable of running notebooks (e.g. PyCharm). Select "Kernel" from the top menu, then "Restart and Run All" to run the notebook in your local environment. 
Since the code has no third-party dependencies, it can also be reused by copy-pasting it into your own Python file. 
2. **By pip installation** <br>
A directory called `nb_sgd_classify` has been provided along with this notebook. To install the software, download the folder, navigate to it in the terminal, and run the command `python3 -m pip install .` (the trailing dot is part of the command and tells pip to install the package in the current directory). This installs both the software package and its required dependency (numpy), unless these are already installed. You can then import and use the SGD and Naive Bayes classes as usual. See the `test` directory for example usage. 
<br><br>

The classifier is implemented as a Python class with three methods: 
1. `set_feature_freq_per_class`: Counts how often each word occurs in the training documents of each class. 
2. `train`: Calculates a log prior term for each class and a log likelihood score for each feature (word), and stores them as instance attributes for use when classifying new documents. 
3. `test`: Predicts the class of a single document. For each class, it adds the log likelihood scores of the words in the test document to the class's log prior term, and returns the class with the highest sum. 

In [1]:
import math
from collections import defaultdict


class MultinomialNaiveBayes:
    """
    A class for creating and testing a Multinomial Naive Bayes classifier
    with optional add-k smoothing.
    """

    def __init__(self):
        self.classes = []  # The classification labels
        self.class_feature_freq = defaultdict(dict)  # The number of times each feature (word) occurs for each class
        self.class_total_feature_freq = {}  # The total number of features per class
        self.total_doc_count = 0  # The total number of documents in the training data
        self.priorlogs = {}  # The log prior term for each class
        self.feature_ll = defaultdict(dict)  # The log likelihood scores for each feature
        self.sum_ll = {}  # The test document's cumulative log scores for each class

    def set_feature_freq_per_class(self, X_train: list, y_train: list, k: int = 1):
        """
        Counts how often each word occurs in the training documents of each class
        and stores the counts in the class_feature_freq attribute.
        :param X_train: The training data (documents). A list of lists.
        :param y_train: The training data classification labels.
        :param k: The add-k smoothing parameter, added once to the count of every word seen for a class. Default is 1.
        """

        # Loop over the training examples and update the frequency of each word for each class
        for doc, label in zip(X_train, y_train):
            for word in doc:
                if word in self.class_feature_freq[label]:
                    self.class_feature_freq[label][word] += 1
                else:
                    # First occurrence of the word for this class: start at 1 plus the smoothing value k
                    self.class_feature_freq[label][word] = 1 + k

    def train(self, X_train: list, y_train: list, k: int = 1):
        """
        Fits the Naive Bayes model to the training data. 
        An implementation inspired by the Multinomial Naive Bayes algorithm presented in
        Speech and Language Processing (Jurafsky & Martin, 2022), page 62, Figure 4.2.
        :param X_train: The training data features. A list of lists containing the tokenized and lemmatized documents.
        :param y_train: The training data labels.
        :param k: The parameter for add-k smoothing (optional).
        """
        
        # Retrieve unique classes in the training data
        self.classes = list(set(y_train))
        print("Unique classes: ", self.classes)

        # Calculates the total number of documents
        self.total_doc_count = len(X_train)
        print("Total number of documents: ", self.total_doc_count)

        # Calculates the frequency of each word per class in the training data
        self.set_feature_freq_per_class(X_train, y_train, k)
        print("Training sample counts per class: ", self.class_feature_freq.items())

        for label in self.classes:
            # Calculates the total number of features per class
            self.class_total_feature_freq[label] = sum(self.class_feature_freq[label].values())
            print(f"{label} total feature counts:", self.class_total_feature_freq[label])
            # Calculates the log prior term for the class, here the log of the class's total
            # (smoothed) feature count divided by the number of training documents
            self.priorlogs[label] = math.log(self.class_total_feature_freq[label] / self.total_doc_count)
            print(f"Log prior for {label}: ", self.priorlogs[label])

            # Calculates a log score for each feature (word) in the class: the log of the word's
            # (smoothed) count relative to the combined count of all other words in the class
            for word, count in self.class_feature_freq[label].items():
                self.feature_ll[label][word] = math.log(count
                                                        / (self.class_total_feature_freq[label] - count))
        print("\nLog likelihood ratio of each feature for each class: \n", self.feature_ll)

    def test(self, tokenized_document: list):
        """
        Classifies a single text document.
        A continuation of the implementation inspired by the Multinomial Naive Bayes algorithm presented in
        Speech and Language Processing (Jurafsky & Martin, 2022), page 62, Figure 4.2.
        :param tokenized_document: Test document in the form of a single list of lemmatized tokens.
        :return: A string with the predicted classification
        """
        smoothing_values = {}
        # Initialises the score of each class with that class's log prior term
        for label in self.classes:
            self.sum_ll[label] = self.priorlogs[label]

            # Values for smoothing the probabilities when words in the test document aren't found in the current genre,
            # but are found in a different one
            smoothing_values[label] = math.log(1 / (self.class_total_feature_freq[label] + self.total_doc_count))

        # Loops over the feature classes and the words in the input document
        # and calculates the sums of the log likelihoods for each class
        for current_class in self.classes:
            for word in tokenized_document:
                # If the word has occurred in training data for current class
                if word in self.feature_ll[current_class]:
                    # Add its log likelihood to the sum of likelihoods for the current class
                    self.sum_ll[current_class] += self.feature_ll[current_class][word]
                # Else, check if the word has occurred in the training data for a different class
                else:
                    other_classes = [label for label in self.classes if label != current_class]
                    for other_class in other_classes:
                        # If it does, add the smoothing value for the current class
                        # (once for every other class that contains the word)
                        if word in self.feature_ll[other_class]:
                            self.sum_ll[current_class] += smoothing_values[current_class]
        print(f"Sum of log likelihoods for D = {tokenized_document}: \n", self.sum_ll, "\n\n")

        # Determines the test document class by selecting the class with the maximum log likelihood
        return max(self.sum_ll, key=self.sum_ll.get)


### Testing the Multinomial Naive Bayes implementation

In [2]:
# Training data
X_train = [
    ["fun", "couple", "love", "love"],
    ["fast", "furious", "shoot"],
    ["couple", "fly", "fast", "fun", "fun"], 
    ["furious", "shoot", "shoot", "fun"],
    ["fly", "fast", "shoot", "love"]
]
y_train = ["comedy", "action", "comedy", "action", "action"]

# Test data (two different samples)
X_test_1 = ["fast", "couple", "shoot", "fly"]
X_test_2 = ["fast", "couple", "love", "furious"]

In [3]:
# Building the classifier
nb = MultinomialNaiveBayes()
nb.train(X_train, y_train, k=2)

Unique classes:  ['comedy', 'action']
Total number of documents:  5
Training sample counts per class:  dict_items([('comedy', {'fun': 5, 'couple': 4, 'love': 4, 'fly': 3, 'fast': 3}), ('action', {'fast': 4, 'furious': 4, 'shoot': 6, 'fun': 3, 'fly': 3, 'love': 3})])
comedy total feature counts: 19
Log prior for comedy:  1.33500106673234
action total feature counts: 23
Log prior for action:  1.5260563034950492

Log likelihood ratio of each feature for each class: 
 defaultdict(<class 'dict'>, {'comedy': {'fun': -1.0296194171811581, 'couple': -1.3217558399823195, 'love': -1.3217558399823195, 'fly': -1.6739764335716716, 'fast': -1.6739764335716716}, 'action': {'fast': -1.55814461804655, 'furious': -1.55814461804655, 'shoot': -1.041453874828161, 'fun': -1.8971199848858813, 'fly': -1.8971199848858813, 'love': -1.8971199848858813}})


In [4]:
# Running predictions on two different test samples
y_pred_1 = nb.test(X_test_1)
y_pred_2 = nb.test(X_test_2)
print(f"Prediction for D = {X_test_1} is {y_pred_1}")
print(f"Prediction for D = {X_test_2} is {y_pred_2}")

Sum of log likelihoods for D = ['fast', 'couple', 'shoot', 'fly']: 
 {'comedy': -6.512761470741268, 'action': -6.302866684440747} 


Sum of log likelihoods for D = ['fast', 'couple', 'love', 'furious']: 
 {'comedy': -6.160540877151917, 'action': -6.819557427659136} 


Prediction for D = ['fast', 'couple', 'shoot', 'fly'] is action
Prediction for D = ['fast', 'couple', 'love', 'furious'] is comedy


In [5]:
# Re-training on a multi-class dataset
# Training data
X_train = [
    ["fun", "couple", "love", "love"],
    ["fast", "furious", "shoot"],
    ["couple", "fly", "fast", "fun", "fun"], 
    ["furious", "shoot", "shoot", "fun"],
    ["fly", "fast", "shoot", "love"],
    ["slow", "couple", "emotional", "impactful", "love"],
    ["great", "emotional", "love", "story", "historic"],
    ["tense", "scary", "frightening", "fast", "shoot", "dark"],
    ["horrific", "tense", "dark", "kill", "clown", "sewer"]

]
y_train = ["comedy", "action", "comedy", "action", "action", "drama", "drama", "horror", "horror"]

# Building the classifier
nb = MultinomialNaiveBayes()
nb.train(X_train, y_train, k=1)

Unique classes:  ['comedy', 'horror', 'drama', 'action']
Total number of documents:  9
Training sample counts per class:  dict_items([('comedy', {'fun': 4, 'couple': 3, 'love': 3, 'fly': 2, 'fast': 2}), ('action', {'fast': 3, 'furious': 3, 'shoot': 5, 'fun': 2, 'fly': 2, 'love': 2}), ('drama', {'slow': 2, 'couple': 2, 'emotional': 3, 'impactful': 2, 'love': 3, 'great': 2, 'story': 2, 'historic': 2}), ('horror', {'tense': 3, 'scary': 2, 'frightening': 2, 'fast': 2, 'shoot': 2, 'dark': 3, 'horrific': 2, 'kill': 2, 'clown': 2, 'sewer': 2})])
comedy total feature counts: 14
Log prior for comedy:  0.44183275227903923
horror total feature counts: 22
Log prior for horror:  0.8938178760220965
drama total feature counts: 18
Log prior for drama:  0.6931471805599453
action total feature counts: 17
Log prior for action:  0.6359887667199967

Log likelihood ratio of each feature for each class: 
 defaultdict(<class 'dict'>, {'comedy': {'fun': -0.916290731874155, 'couple': -1.2992829841302609, 'love'

In [6]:
# Test data (three different samples)
X_test_1 = ["fast", "couple", "shoot", "fly"]
X_test_2 = ["fast", "couple", "love", "furious"]
X_test_3 = ["scary", "historic", "bad", "futuristic", "dark"]

# Running predictions on three different test samples
y_pred_1 = nb.test(X_test_1)
y_pred_2 = nb.test(X_test_2)
y_pred_3 = nb.test(X_test_3)

print(f"Prediction for D = {X_test_1} is {y_pred_1}")
print(f"Prediction for D = {X_test_2} is {y_pred_2}")
print(f"Prediction for D = {X_test_3} is {y_pred_3}")

Sum of log likelihoods for D = ['fast', 'couple', 'shoot', 'fly']: 
 {'comedy': -10.71195760216563, 'horror': -17.447301127906577, 'drama': -24.457152423150195, 'action': -10.311021108166281} 


Sum of log likelihoods for D = ['fast', 'couple', 'love', 'furious']: 
 {'comedy': -7.083986901138687, 'horror': -22.012690443882825, 'drama': -16.179079737571307, 'action': -10.975997411759531} 


Sum of log likelihoods for D = ['scary', 'historic', 'bad', 'futuristic', 'dark']: 
 {'comedy': -8.96464989550841, 'horror': -6.688581111955426, 'drama': -7.977968093128549, 'action': -9.13830084734445} 


Prediction for D = ['fast', 'couple', 'shoot', 'fly'] is action
Prediction for D = ['fast', 'couple', 'love', 'furious'] is comedy
Prediction for D = ['scary', 'historic', 'bad', 'futuristic', 'dark'] is horror
