# Multinomial Naive Bayes

## Overview
- [1. Multinomial Naive Bayes](#1)
- [2. Multinomial Naive Bayes Model from Scratch](#2)
- [3. Multinomial Naive Bayes Model in `sklearn`](#3)
- [4. Multinomial Naive Bayes for `Out of Vocabulary`](#4)
- [5. References](#5)

<a name='1' ></a>
## 1. Multinomial Naive Bayes
**Multinomial Naive Bayes** implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). 

Assume that we have pairs of data points $(\mathbf{w}^{(i)}, y_i)$ for $i=1,2,\dots,n$, where
- $\mathbf{w}^{(i)}$ is an observation (a text containing words)
- $y_i$ is the label of $\mathbf{w}^{(i)}$

Constructing a vocabulary from all texts $\mathbf{w}^{(i)}$ gives $V = \{w_1, w_2, ..., w_d\}$, where $d$ is the number of distinct words. With respect to $V$, each $\mathbf{w}^{(i)}$ is represented as a $(1 \times d)$ vector

$$\mathbf{w}^{(i)} = \begin{bmatrix} N_{i1} & N_{i2} & \dots & N_{id} \end{bmatrix}$$

where $N_{ij}$ is the number of times word $w_j$ appears in the text $\mathbf{w}^{(i)}$, for $j=1,2,\dots,d$.
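As a minimal sketch of this representation, the snippet below builds $V$ and the count vectors for a toy corpus. It assumes whitespace tokenization and an alphabetically sorted vocabulary; both choices, and the helper names `build_vocab` and `to_count_vector`, are illustrative rather than anything the text prescribes.

```python
from collections import Counter

def build_vocab(texts):
    """Return the sorted list of distinct words across all texts."""
    return sorted({word for text in texts for word in text.split()})

def to_count_vector(text, vocab):
    """Return [N_i1, ..., N_id]: how often each vocabulary word occurs in text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

texts = ["football is the king sport", "sport is sport"]
vocab = build_vocab(texts)
vectors = [to_count_vector(text, vocab) for text in texts]

print(vocab)    # ['football', 'is', 'king', 'sport', 'the']
print(vectors)  # [[1, 1, 1, 1, 1], [0, 1, 0, 2, 0]]
```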

The probability that $\mathbf{w}^{(i)}$ belongs to class $y=c$ is

\begin{align*}
P(y=c \mid \mathbf{w}^{(i)}) &= \frac{P(\mathbf{w}^{(i)} \mid y=c) P(y=c)}{P(\mathbf{w}^{(i)})} \\
&\propto \underbrace{P(y=c)}_{\text{prior}} \underbrace{\prod_{j=1}^{d} P(w_j \mid y=c)^{N_{ij}}}_{\text{likelihood}}
\end{align*}

Taking the logarithm turns the product into a sum. Since $P(\mathbf{w}^{(i)})$ does not depend on the class, it only contributes a constant:

\begin{align*}
\log P(y=c \mid \mathbf{w}^{(i)}) &= \log \left(P(y=c) \prod_{j=1}^{d} P(w_j \mid y=c)^{N_{ij}} \right) + \text{const} \\
&= \log \left( P(y=c) \right) + \sum_{j=1}^{d} N_{ij} \log \left( P(w_j \mid y=c) \right) + \text{const}
\end{align*}
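To make the log-space score concrete, here is a small sketch that evaluates $\log P(y=c) + \sum_j N_{ij}\log P(w_j \mid y=c)$ for one count vector and picks the best class. The priors and word probabilities are made-up numbers chosen for illustration, and `log_score` is a hypothetical helper name.

```python
import math

def log_score(counts, log_prior, word_probs):
    """Unnormalized log posterior: log P(y=c) + sum_j N_ij * log P(w_j | y=c)."""
    return log_prior + sum(n * math.log(p) for n, p in zip(counts, word_probs))

counts = [2, 0, 1]  # N_i1, N_i2, N_i3 for one text

# Hypothetical parameters for two classes (each probability row sums to 1).
params = {
    "food":  {"prior": 0.5, "probs": [0.6, 0.3, 0.1]},
    "sport": {"prior": 0.5, "probs": [0.1, 0.3, 0.6]},
}

scores = {c: log_score(counts, math.log(p["prior"]), p["probs"])
          for c, p in params.items()}
best = max(scores, key=scores.get)
print(best)  # food
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest posterior, and summing logs avoids the numerical underflow that a product of many small probabilities can cause.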

Let $\theta_{yj}$ denote the probability $P(w_j \mid y=c)$ of feature $j$ (word $w_j$) appearing in a sample belonging to class $y=c$. The distribution is then parametrized by a vector $\theta_y = (\theta_{y1},\ldots,\theta_{yd})$ for each class $y=c$.

The parameters $\theta_{y}$ are estimated by maximum likelihood, i.e. relative frequency counting:

$$\hat{\theta}_{yj} = \frac{N_{yj}}{N_y}$$

where $N_{yj} = \sum_{i:\, y_i = c} N_{ij}$ is the number of times feature $j$ appears in samples of class $y=c$ in the training set $T$, and $N_{y} = \sum_{j=1}^{d} N_{yj}$ is the total count of all features for class $y=c$.

However, if a word $w_j$ never appears in the training samples of class $y=c$, then $\hat{\theta}_{yj} = 0$ and $\log (0)$ is undefined. To avoid this, we use the smoothed estimate

$$\hat{\theta}_{yj} = \frac{ N_{yj} + \alpha}{N_y + \alpha d}$$

The smoothing prior $\alpha \geq 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting $\alpha=1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
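A short sketch of the smoothed estimate for a single class, assuming the per-class counts $N_{yj}$ are already available as a list; `smoothed_theta` is a hypothetical helper and the counts are toy values.

```python
def smoothed_theta(class_counts, alpha=1.0):
    """Estimate theta_yj = (N_yj + alpha) / (N_y + alpha * d) for one class."""
    d = len(class_counts)
    total = sum(class_counts)  # N_y
    return [(n + alpha) / (total + alpha * d) for n in class_counts]

counts = [3, 0, 1]  # N_y1, N_y2, N_y3: the second word never occurs in this class

laplace = smoothed_theta(counts, alpha=1.0)   # equals [4/7, 1/7, 2/7]
lidstone = smoothed_theta(counts, alpha=0.5)  # equals [3.5/5.5, 0.5/5.5, 1.5/5.5]

print(laplace)
print(lidstone)
```

In both cases the unseen word receives a small positive probability, and the estimates still sum to 1 over the vocabulary.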

<a name='2' ></a>
## 2. Multinomial Naive Bayes Model from Scratch

### Import package

In [1]:
# Import package
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import datasets

### Dataset

In [2]:
# Dataset
X, y = datasets.fetch_20newsgroups(return_X_y=True)

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X)
X_counts.shape

(11314, 130107)

### Build model

In [3]:
class MNaiveBayes():
    """Build Multinomial Naive Bayes from scratch."""
    
    def __init__(self, alpha=1):
        self.alpha = alpha
        self.log_prior = None
        self.log_prob = None     # log likelihood
        self.classes = None     # unique label
        self.class_count = None     # count each unique label
        self.n_features = None    # number of distinct words (V)
        self.n_samples = None    # number of observations (text)
        
    def _likelihood(self, freqs):
        """Compute smoothed probabilities P(w_j | y) from per-class word counts."""
        lmbda = (freqs + self.alpha) / (np.sum(freqs) + self.n_features * self.alpha)
        return lmbda
    
    def fit(self, X, y):
        """
        Train the model to find log_prior and log_prob (the log likelihood).
        
        Args:
            X: (n_samples, n_features) array, each sample is a text
            y: (n_samples, ) labels correspond to samples
        
        """
        self.classes, self.class_count = np.unique(y, return_counts=True)
        self.n_samples, self.n_features = X.shape
        
        # Calculate log prior for P(y=c)
        self.log_prior = np.log(self.class_count / self.n_samples)
        
        # Calculate log likelihood for each class in y
        self.log_prob = []
        for label in self.classes:
            indices = np.where(y == label)[0]
            freqs = np.sum(X[indices, :], axis=0)
            log_prob_label = np.log(self._likelihood(freqs))
            self.log_prob.append(log_prob_label)
            
        self.log_prob = np.array(self.log_prob)

In [4]:
mnb = MNaiveBayes()
mnb.fit(X_counts, y)

In [5]:
mnb.log_prior

array([-3.16001007, -2.96389519, -2.95198016, -2.95367364, -2.97422231,
       -2.94860178, -2.96218433, -2.94691686, -2.94020542, -2.94187906,
       -2.93686652, -2.94523477, -2.95198016, -2.94691686, -2.94860178,
       -2.93853458, -3.0311772 , -2.99874192, -3.19175877, -3.40155099])

In [6]:
mnb.log_prob

array([[[-11.20532976,  -9.03627606, -12.59162412, ..., -12.59162412,
         -12.59162412, -12.59162412]],

       [[ -8.92151479,  -9.58649109, -12.47686285, ..., -12.47686285,
         -12.47686285, -12.47686285]],

       [[ -9.40038692, -10.52885217, -12.92674744, ..., -12.92674744,
         -12.92674744, -12.92674744]],

       ...,

       [[ -9.38895408,  -7.58259581, -12.22216743, ..., -12.91531461,
         -12.91531461, -12.91531461]],

       [[ -9.35181599,  -8.45643194, -12.71911182, ..., -12.71911182,
         -12.71911182, -12.71911182]],

       [[-10.86655118,  -9.58561733, -12.47598909, ..., -12.47598909,
         -12.47598909, -12.47598909]]])

<a name='3' ></a>
## 3. Multinomial Naive Bayes Model in `sklearn`

In [7]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_counts, y)

In [8]:
clf.class_log_prior_

array([-3.16001007, -2.96389519, -2.95198016, -2.95367364, -2.97422231,
       -2.94860178, -2.96218433, -2.94691686, -2.94020542, -2.94187906,
       -2.93686652, -2.94523477, -2.95198016, -2.94691686, -2.94860178,
       -2.93853458, -3.0311772 , -2.99874192, -3.19175877, -3.40155099])

In [9]:
clf.feature_log_prob_

array([[-11.20532976,  -9.03627606, -12.59162412, ..., -12.59162412,
        -12.59162412, -12.59162412],
       [ -8.92151479,  -9.58649109, -12.47686285, ..., -12.47686285,
        -12.47686285, -12.47686285],
       [ -9.40038692, -10.52885217, -12.92674744, ..., -12.92674744,
        -12.92674744, -12.92674744],
       ...,
       [ -9.38895408,  -7.58259581, -12.22216743, ..., -12.91531461,
        -12.91531461, -12.91531461],
       [ -9.35181599,  -8.45643194, -12.71911182, ..., -12.71911182,
        -12.71911182, -12.71911182],
       [-10.86655118,  -9.58561733, -12.47598909, ..., -12.47598909,
        -12.47598909, -12.47598909]])

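**Nice! Our model and sklearn's model produce the same values (our `log_prob` only carries an extra singleton axis). You may have noticed that we have not written a `predict` method for our model yet; the next section adds one, together with handling for words that never appeared in the training data.**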
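For inputs that use the same vocabulary as the training data, prediction from the fitted arrays reduces to a matrix product followed by an argmax. The sketch below assumes dense NumPy arrays (a sparse matrix such as `X_counts` would need converting or a sparse-aware product) and uses made-up parameters; `predict_from_params` is a hypothetical helper, not part of the class above.

```python
import numpy as np

def predict_from_params(X, log_prior, log_prob, classes):
    """Pick, for each row of X, the class with the largest log posterior score."""
    X = np.asarray(X)
    # Collapse a possible (n_classes, 1, n_features) shape to (n_classes, n_features).
    log_prob = np.asarray(log_prob).reshape(len(classes), -1)
    scores = X @ log_prob.T + log_prior  # shape (n_samples, n_classes)
    return np.asarray(classes)[np.argmax(scores, axis=1)]

# Hypothetical fitted parameters for 2 classes and 3 words.
classes = np.array([0, 1])
log_prior = np.log([0.5, 0.5])
log_prob = np.log([[0.6, 0.3, 0.1],
                   [0.1, 0.3, 0.6]])

X_new = np.array([[2, 0, 1],
                  [0, 1, 3]])
pred = predict_from_params(X_new, log_prior, log_prob, classes)
print(pred)  # [0 1]
```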

<a name='4' ></a>
## 4. Multinomial Naive Bayes for `Out of Vocabulary`
Looking at the `predict` function of `MultinomialNB` in [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB), we see that it expects an array of shape `(n_samples, n_features)`, where `n_features` is the number of distinct words seen during training. However, a new test dataset may contain words that never appeared in the training dataset (these are called `out of vocabulary` words). How do we build the input vector for prediction in that case?

So we need to customize our model to handle this problem.

In [1]:
class MNaiveBayesV2():
    """Build Multinomial Naive Bayes for OOV."""
    
    def __init__(self, alpha=1):
        self.alpha = alpha
        self.log_priors = None
        self.log_likelihoods = None
        self.labels_count = None
        self.vocab = None
        self.V = None     # length of vocabulary (d)
        self.freqs = None
        self.label_list = None
        
    def _process(self, sentence):
        """Split a sentence into a list."""
        return sentence.split(' ')
    
    def _get_freqs(self, data, labels):
        """Count how often each (word, label) pair occurs in the training data."""
        freqs = {}
        for sentence, label in zip(data, labels):
            for word in self._process(sentence):
                # key of the dict is the (word, label) pair
                pair = (word, label)
                # increment the count, starting from 0 for a new pair
                freqs[pair] = freqs.get(pair, 0) + 1
        return freqs
    
    def _labels_count(self, label):
        """Calculate the number of words depending on label y=c."""
        Ny = 0
        for pair, freq in self.freqs.items():
            if pair[1] == label:
                Ny += freq
        return Ny
    
    def _likelihood(self, Nyj, Ny):
        """
        Smoothed estimate of the probability P(w_j | y=c); alpha=1 gives Laplace smoothing.
        """
        lmbda = (Nyj + self.alpha) / (Ny + self.V * self.alpha)
        return lmbda 

    def _lookup(self, word, label):
        """
        Args:
            freqs: a dictionary with the frequency of each pair (or tuple)
            word: the word to look up
            label: the label corresponding to the word
        Return:
            the number of times the word with its corresponding label appearing.
        """
        return self.freqs.get((word, label), 0)
    
    def fit(self, X, y):
        """
        Training model to find parameters.
        
        Args:
            X: a list or array containing all training texts
            y: label for each text in X
        """
        # calculate log prior P(y=c)
        n_samples = len(y)
        self.label_list, freqs = np.unique(y, return_counts=True)
        self.log_priors = {label: np.log(freq / n_samples) 
                           for label, freq in zip(self.label_list, freqs)}

        # get labels_count
        self.freqs = self._get_freqs(X, y)
        self.labels_count = {label: self._labels_count(label) for label in self.label_list}

        # calculate likelihood
        self.log_likelihoods = {}
        self.vocab = set([pair[0] for pair in self.freqs.keys()])
        self.V = len(self.vocab)
        for label in self.label_list:
            # total words of label c
            Ny = self.labels_count[label]
            for word in self.vocab:
                # total word w_i in label c
                Nyj = self._lookup(word, label)
                # store log P(w_j | y=c); test-time counts are applied in predict
                theta = np.log(self._likelihood(Nyj, Ny))
                self.log_likelihoods[(word, label)] = theta


    def predict(self, sentence):
        """Predict the label for input text, skipping out-of-vocabulary words."""
        word_list = self._process(sentence)
        score_list = []
        
        # choose label with the maximum log posterior
        for label in self.label_list:
            prob = self.log_priors[label]
            # iterate over every occurrence so a repeated word counts N_ij times
            for word in word_list:
                if word in self.vocab:
                    prob += self.log_likelihoods[(word, label)]

            score_list.append(prob)
        return self.label_list[np.argmax(score_list)]

In [62]:
# Initialize dataset and train model
sentence1 = 'Pho is one of the traditional food on the Hanois street'
sentence2 = 'taste of roasted duck creates a traditional food'
sentence3 = 'football is the king sport sport sport sport'
sentence4 = 'world cup is the biggest football football football festival on the planet'

X_train = [sentence1, sentence2, sentence3, sentence4]
y_train = [0, 0, 1, 1]

# Initialize model
mnbv2 = MNaiveBayesV2()
mnbv2.fit(X_train, y_train)

# Test model
X_test = ['street food, a traditional culture of Vietnamese people', 
          'Messi has been acquired by a football team sport in France']

for sentence in X_test:
    y_pred = mnbv2.predict(sentence)
    print(f'Sentence: {sentence} - Label: {y_pred}')

Sentence: street food, a traditional culture of Vietnamese people - Label: 0
Sentence: Messi has been acquired by a football team sport in France - Label: 1
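
One detail in the test run above: splitting on single spaces keeps punctuation attached, so `food,` is treated as an out-of-vocabulary token even though `food` was seen in training. A possible refinement, sketched below under the assumption that lowercasing and keeping only alphanumeric runs is acceptable for the task, is a regex-based tokenizer; `tokenize` is a hypothetical helper, not part of the class above.

```python
import re

def tokenize(sentence):
    """Lowercase the text and return its runs of letters, digits or underscores."""
    return re.findall(r"\w+", sentence.lower())

sentence = 'street food, a traditional culture of Vietnamese people'
print(sentence.split(' ')[1])  # food,
print(tokenize(sentence)[1])   # food
```

The same tokenizer would have to be applied in both `fit` and `predict`, for example by swapping it into `_process`, so that training and test words are normalized identically.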


<a name='5' ></a>
## 5. References

- [https://en.wikipedia.org/wiki/Naive_Bayes_classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- [https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case)
- [https://machinelearningcoban.com/2017/08/08/nbc/](https://machinelearningcoban.com/2017/08/08/nbc/)
- [https://scikit-learn.org/stable/modules/naive_bayes.html#](https://scikit-learn.org/stable/modules/naive_bayes.html#)
- [https://www.python-engineer.com/courses/mlfromscratch/05_naivebayes/](https://www.python-engineer.com/courses/mlfromscratch/05_naivebayes/)
- [https://phamdinhkhanh.github.io/deepai-book/ch_ml/NaiveBayes.html](https://phamdinhkhanh.github.io/deepai-book/ch_ml/NaiveBayes.html)