# Naive Bayes text classifier

In this exercise, you'll implement a Naive Bayes classifier for text from scratch.

In [None]:
pip install ipytest

In [None]:
import ipytest
from typing import List

ipytest.autoconfig()

### Training the model

  - Calculate $P(y)$ for each class label in the training data
  - Calculate $P(x_i|y)$ for each feature (term) for each class label in the training data using Laplace (add-one) smoothing
  
$$P(x_i|y)=\frac{c_{i,y} + 1}{c_y + m}$$

where
  - $c_{i,y}$ is the number of times term $x_i$ appears in class $y$
  - $c_y$ is the total number of term occurrences in class $y$
  - $m$ is the size of the vocabulary
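
To make the smoothing formula concrete, here is a minimal sketch that applies it to the term counts of a single class. The helper name `smoothed_term_probs` and the toy counts are assumptions for illustration only; they are not part of the exercise code.

```python
from typing import List


def smoothed_term_probs(term_counts: List[int]) -> List[float]:
    """Applies add-one smoothing to the term counts of a single class.

    Hypothetical helper: term_counts[i] plays the role of c_{i,y}.
    """
    c_y = sum(term_counts)  # Total number of term occurrences in the class.
    m = len(term_counts)    # Size of the vocabulary.
    return [(c_iy + 1) / (c_y + m) for c_iy in term_counts]


# Made-up counts for a 4-term vocabulary in one class.
counts = [3, 0, 1, 0]
probs = smoothed_term_probs(counts)
print(probs)       # [0.5, 0.125, 0.25, 0.125]
print(sum(probs))  # 1.0
```

Terms that never occur in the class (count 0) still receive a non-zero probability, and the smoothed values still sum to 1 over the vocabulary.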


### Applying the model

Return the class $y \in Y$ that maximizes $P(y) \prod_{x_i} P(x_i|y)$.

Note that we need to consider $x_i$ at each *word position* in the document, so $P(x_i|y)$ enters the product as many times as $x_i$ appears in the document.
We can therefore rewrite the expression as: $$P(y|x) \propto P(y) \prod_{i \in d} P(x_i|y)^{c_{i,d}}$$ where $c_{i,d}$ is the number of times term $x_i$ appears in document $d$.

Finally, we perform the computations in the log domain, which avoids floating-point underflow when many small probabilities are multiplied. That is, we maximize $$\log P(y) + \sum_{i \in d} c_{i,d} \log P(x_i|y)$$
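
The log-domain scoring rule can be sketched as follows. The two-class model, its probabilities, and the function name `log_score` are made up for illustration; the second half shows why a direct product of many small probabilities is problematic in floating point.

```python
import math
from typing import List


def log_score(prior: float, term_probs: List[float], doc_counts: List[int]) -> float:
    """Returns log P(y) + sum_i c_{i,d} * log P(x_i|y) for one class."""
    score = math.log(prior)
    for c_id, p in zip(doc_counts, term_probs):
        if c_id > 0:  # Terms absent from the document contribute nothing.
            score += c_id * math.log(p)
    return score


# Hypothetical two-class model over a 4-term vocabulary.
priors = [0.5, 0.5]
term_probs = [
    [0.5, 0.125, 0.25, 0.125],  # P(x_i|y=0)
    [0.125, 0.5, 0.125, 0.25],  # P(x_i|y=1)
]
doc = [2, 0, 1, 0]  # c_{i,d} for a single document.

scores = [log_score(priors[y], term_probs[y], doc) for y in range(len(priors))]
predicted = scores.index(max(scores))
print(predicted)  # 0

# Multiplying 200 probabilities of 0.01 directly underflows to 0.0,
# while the equivalent sum of logs stays finite.
product = 1.0
for _ in range(200):
    product *= 0.01
log_sum = 200 * math.log(0.01)
print(product, math.isfinite(log_sum))  # 0.0 True
```

Because the logarithm is monotonic, the class that maximizes the log score is the same class that maximizes the original product.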

## 1) Probability estimations

The estimation of the probabilities $P(x_i|y)$ and $P(y)$ is factored out into a separate class to make it testable.

In [None]:
class NBProbabilityEstimator:

    def get_prior_prob(self, y: int, training_labels: List[int]) -> float:
        """Computes the class prior probability, P(y).

        Args:
            y: Class ID.
            training_labels: Class labels in training data.

        Returns:
            The probability P(y).
        """
        return training_labels.count(y) / len(training_labels)

    def get_term_prob(self,
                      count_term_in_class: int,
                      count_all_terms_in_class: int,
                      num_terms: int) -> float:
        """Computes the smoothed term probability for a given class, P(x_i|y).

        Args:
          count_term_in_class: Number of times the term appears in the given class.
          count_all_terms_in_class: Total number of term occurrences in class.
          num_terms: Size of the vocabulary.

        Returns:
          The probability P(x_i|y).
        """
        return (count_term_in_class + 1) / (count_all_terms_in_class + num_terms)

### Tests

In [None]:
%%ipytest

def test_prior_prob():
    nbpe = NBProbabilityEstimator()
    assert nbpe.get_prior_prob(1, [0, 1, 2, 3]) == 0.25
    assert nbpe.get_prior_prob(1, [1, 1, 2, 3]) == 0.5

def test_term_prob():
    nbpe = NBProbabilityEstimator()
    assert nbpe.get_term_prob(5, 20, 10) == 0.2
    assert nbpe.get_term_prob(74, 90, 10) == 0.75
    assert nbpe.get_term_prob(0, 6, 10) == 0.0625

..                                                                                           [100%]
2 passed in 0.02s




## 2) Naive Bayes classifier

Implement training and prediction for a Naive Bayes classifier. For simplicity, we operate on dense matrices.

In [None]:
import numpy as np
import math

class NBClassifier:

    def __init__(self) -> None:
        self._nbprob = NBProbabilityEstimator()
        self._num_classes = 0
        self._prior_prob = None  # Holds P(y) values
        self._term_prob = None  # Holds P(x_i|y) values


    def fit(self, X_train: List[List[int]], y_train: List[int]) -> None:
        """Fits the model.

        Args:
          X_train: Document-term matrix for training data.
              Rows correspond to documents and columns correspond to terms.
          y_train: Class labels corresponding to training documents.
        """
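        # Note: class labels are assumed to be the integers 0..num_classes-1,
        # since they are used directly as list indices below.
        self._num_classes = len(np.unique(y_train))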
        num_docs = len(X_train)
        num_terms = len(X_train[0])
        self._term_prob = np.zeros((num_terms, self._num_classes))

        # Calculate c_y values, i.e., the total number of term occurrences in each class.
        count_all_terms_in_class = [0] * self._num_classes
        for d in range(num_docs):
            count_all_terms_in_class[y_train[d]] += sum(X_train[d])

        # Iterating through the vocabulary
        for i in range(num_terms):
            # Holds c_{i,y} values, i.e., the number of times term i appears with class y.
            count_term_in_class = [0] * self._num_classes
            for d in range(num_docs):
                count_term_in_class[y_train[d]] += X_train[d][i]

            # Calculate P(x_i|y)
            for y in range(self._num_classes):
                self._term_prob[i, y] = self._nbprob.get_term_prob(
                    count_term_in_class[y],
                    count_all_terms_in_class[y],
                    num_terms)

        # Pre-compute class prior probabilities
        self._prior_prob = []
        for y in range(self._num_classes):
            self._prior_prob.append(self._nbprob.get_prior_prob(y, y_train))


    def _predict_instance(self, x: List[int]) -> int:
        """Predict class for a single instance (document).

        Args:
          x: Document term vector.

        Returns:
          The predicted class label (0-indexed).
        """
        probs = []

        for y in range(self._num_classes):
            p = math.log(self._prior_prob[y])
            for i in range(len(x)):
                if x[i] > 0:
                    p += x[i] * math.log(self._term_prob[i][y])
            probs.append(p)

        # Get the class with the highest log-probability score.
        return probs.index(max(probs))


    def predict(self, X_test: List[List[int]]) -> List[int]:
        """Make predictions for a set of documents.

        Args:
          X_test: Document-term matrix for test data.

        Returns:
          List with predictions.
        """
        return [self._predict_instance(x) for x in X_test]

## 3) Testing on real data

We will be using a subset of the 20Newsgroups collection.

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "soc.religion.christian",
    "talk.religion.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware"
]

train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=123)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=123)

### Feature extraction

Get term frequencies using `CountVectorizer`. (We ignore terms that appear in fewer than 10 documents to speed up computation.)
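
To illustrate what a document-frequency cutoff does to the vocabulary, here is a simplified pure-Python sketch. It is not how `CountVectorizer` is implemented: the function name `count_matrix` is hypothetical, and plain whitespace splitting stands in for the library's own tokenization.

```python
from collections import Counter
from typing import List, Tuple


def count_matrix(docs: List[str], min_df: int = 1) -> Tuple[List[str], List[List[int]]]:
    """Builds a document-term count matrix, keeping terms found in >= min_df docs."""
    # Assumption: lowercase + whitespace split as a stand-in tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
    index = {term: j for j, term in enumerate(vocab)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term in tokens:
            if term in index:  # Terms below the cutoff are dropped.
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix


docs = ["the cat sat", "the dog sat down", "the cat ran"]
vocab, X = count_matrix(docs, min_df=2)
print(vocab)  # ['cat', 'sat', 'the']
print(X)      # [[1, 1, 1], [0, 1, 1], [1, 0, 1]]
```

Raising the cutoff shrinks the vocabulary, and with it the number of columns the classifier has to iterate over.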

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(min_df=10)  # Note: Removing min_df will yield a slower model with better accuracy
X_train_counts = count_vect.fit_transform(train.data)
X_test_counts = count_vect.transform(test.data)

### Train and apply model

Note that we convert sparse matrices to dense ones. This is not efficient and should be avoided when working with large datasets. Nevertheless, this simplifies the implementation for this exercise.
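
As a rough illustration of the cost of densifying, the sketch below compares how many cells a dense matrix holds with how many non-zero entries a sparse representation needs to store. The dict-per-row layout and the name `to_sparse_rows` are assumptions chosen for readability; they do not reproduce the format of the sparse matrices returned by `CountVectorizer`.

```python
from typing import Dict, List


def to_sparse_rows(dense: List[List[int]]) -> List[Dict[int, int]]:
    """Keeps only non-zero counts, as one {column index: count} dict per row."""
    return [{j: c for j, c in enumerate(row) if c != 0} for row in dense]


# Toy document-term matrix: most entries are zero, as is typical for text.
dense = [
    [0, 0, 3, 0, 0, 0, 1, 0],
    [0, 2, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 5],
]
sparse = to_sparse_rows(dense)

dense_cells = sum(len(row) for row in dense)
stored_entries = sum(len(row) for row in sparse)
print(dense_cells, stored_entries)  # 24 4
```

With a real vocabulary of thousands of terms, each document uses only a small fraction of the columns, so the gap between the two numbers grows accordingly.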

In [None]:
nb = NBClassifier()
nb.fit(X_train_counts.toarray(), train.target.tolist())
predicted = nb.predict(X_test_counts.toarray())

### Evaluation

In [None]:
from sklearn import metrics

print(f"{metrics.accuracy_score(test.target, np.asarray(predicted)):.3f}")

0.838


## Optional exercises

If you're done, try implementing the classifier without converting the sparse matrices to dense ones.

Also, do we really need to precompute and store all term probabilities?