# Naive Bayes for Text Classification
  
## Math behind it
Bayes' rule:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$  
  
In multinomial Naive Bayes, the probability of a document $d$ being in class $c$ is computed as  
  
$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$  
  
where $P(t_k|c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$, and $n_d$ is the number of terms in document $d$.  
  
In text classification, our goal is to find the best class for the document. The best class in NB classification is the most likely, or maximum a posteriori (MAP), class $c_{map}$:
  
$c_{map} = \arg\max_{c} P(c|d) = \arg\max_{c} P(c) \prod_{1 \le k \le n_d} P(t_k|c)$
  
Multiplying many small probabilities can cause floating-point underflow, so it is better to take logarithms and turn the product into a sum. Because the logarithm is monotonic, the class with the highest log score is still the most probable class.
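
To make the underflow concrete, here is a minimal sketch that multiplies many small probabilities directly and compares the result with the sum of their logarithms. The value `1e-5` and the document length of 100 terms are made-up numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical document: 100 terms, each with P(t_k|c) = 1e-5 (illustrative values)
term_probs = np.full(100, 1e-5)

# Direct product: 1e-500 is far below the smallest positive float64, so it underflows
direct = np.prod(term_probs)

# Sum of logs: the same quantity on a log scale, with no underflow
log_score = np.sum(np.log(term_probs))

print(direct)     # 0.0
print(log_score)  # approximately -1151.29
```

In log space the decision rule becomes $c_{map} = \arg\max_{c} \left[ \log P(c) + \sum_{1 \le k \le n_d} \log P(t_k|c) \right]$.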

Given the training data, we can estimate the prior probability of each class by a simple count:  
  
$P(c) = \frac{N_c}{N}$  
where $N_c$ is the number of documents in the class $c$, and $N$ is the total number of documents
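
As a small illustration of this count, the sketch below estimates the priors from a list of labels. The labels are hypothetical and are not taken from the data set used later.

```python
from collections import Counter

# Hypothetical training labels (illustrative only)
labels = ["sports", "politics", "sports", "tech", "sports"]

counts = Counter(labels)   # N_c for each class
total = len(labels)        # N
priors = {c: n / total for c, n in counts.items()}

print(priors)  # {'sports': 0.6, 'politics': 0.2, 'tech': 0.2}
```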

The conditional probability $P(t | c)$ can be estimated by the relative frequency of term $t$ in documents belonging to class $c$:  
  
$P(t|c) = \frac{T_{ct}}{T_c}$  
where $T_{ct}$ is the number of occurrences of $t$ in training documents from class $c$, and $T_c$ is the total number of word occurrences in training documents from class $c$
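
One practical issue with the plain relative frequency is that a term never seen in class $c$ gets probability zero, which zeroes out the whole product. A common remedy is additive smoothing. The sketch below compares the two estimates; the vocabulary, the counts, and `delta = 0.1` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical term counts T_ct for one class over a 4-word vocabulary
vocab = ["ball", "game", "vote", "law"]
counts = np.array([3.0, 5.0, 0.0, 2.0])

# Relative frequency: T_ct / T_c
p_mle = counts / counts.sum()

# Additive smoothing with an assumed delta: (T_ct + delta) / (T_c + delta * |V|)
delta = 0.1
p_smooth = (counts + delta) / (counts.sum() + delta * len(vocab))

print(dict(zip(vocab, p_mle)))     # "vote" gets probability 0.0
print(dict(zip(vocab, p_smooth)))  # every term gets a nonzero probability
```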

## Why Naive Bayes?
* Very fast learning and testing (basically just count words)
* Low storage requirements
* Very good in domains with many equally important features
* More robust to irrelevant features than many learning methods
* More robust to concept drift (changing class definition over time)
* A good dependable baseline for text classification 

## Implementation of Naive Bayes for text classification

In the rest of this notebook, I will use Python and NumPy to implement a simple Naive Bayes classifier for a text classification task.

## Read the data

The data is in the format of **label word:count word:count ...**  
For example, "talk.politics.guns a:4 accidents:2 advance:1 age:2 an:1 and:3 any:1"  

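To load the data, split each line on whitespace, take the first item as the label, and split each remaining item on the colon to separate the word from its count. The parser below keeps only the set of distinct words in each document and discards the counts.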

In [1]:
def parse_line(line):
    """
    Parse one line in the file
    return: set of words, class
    """
    l = line.strip().split()
    tag = l[0]
    word_set = set([w.split(':')[0] for w in l[1:]])
    return word_set, tag

def parse_file(filename):
    """
    Parse file
    return: X - word set list, y - class list
    """
    X, y = [], []
    with open(filename, 'r') as f:
        for line in f.readlines():
            word_set, target = parse_line(line)
            X.append(word_set)
            y.append(target)
    return X, y
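
To show what the line format contains, here is a sketch of a variant that keeps the counts instead of discarding them. The function name `parse_line_with_counts` is hypothetical and is not used elsewhere in this notebook.

```python
def parse_line_with_counts(line):
    """Hypothetical variant of parse_line that keeps the per-word counts."""
    items = line.strip().split()
    label = items[0]
    counts = {}
    for item in items[1:]:
        # rsplit guards against a colon inside the word itself
        word, count = item.rsplit(":", 1)
        counts[word] = int(count)
    return counts, label

sample = "talk.politics.guns a:4 accidents:2 advance:1 age:2 an:1 and:3 any:1"
counts, label = parse_line_with_counts(sample)
print(label)        # talk.politics.guns
print(set(counts))  # the same word set that parse_line returns for this line
print(counts["a"])  # 4
```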

## Build the Naive Bayes class
The model takes the raw word sets and their labels as input; the vocabulary (the word-to-index dictionary) is built during training.  
  
The key part is calculating the probabilities, which is easy to implement with NumPy.
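
Before the full class, here is a minimal sketch of that probability calculation on a toy example: smooth a class-by-word count matrix, normalize each row, and score one document in log space. The matrix, the priors, and `delta` are made-up values for illustration.

```python
import numpy as np

# Hypothetical class-by-word count matrix: 2 classes, 3 vocabulary words
counts = np.array([[4.0, 0.0, 1.0],
                   [1.0, 3.0, 0.0]])
priors = np.array([0.5, 0.5])   # assumed class priors
delta = 0.1                     # assumed smoothing constant

# Smooth, then normalize each row so it sums to 1: P(t|c)
cond = counts + delta
cond = cond / cond.sum(axis=1, keepdims=True)

# One test document as a binary indicator vector: contains words 0 and 2
doc = np.array([[1.0, 0.0, 1.0]])

# Log-space score per class: log P(c) + sum of log P(t|c) over the document's words
scores = np.log(priors) + doc @ np.log(cond).T
print(scores.argmax(axis=1))  # [0]
```

The matrix product does the per-class summation for every document at once, which is why no explicit loop over classes is needed.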

In [2]:
import numpy as np
from collections import Counter

class MBNaiveBayes:
    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.idx_word = None
        self.idx_label = None
        self.word_idx = None
        self.label_idx = None
        self.p_label = None
        self.m_org = None
        self.m_train = None

    def __create_feature_class_matrix__(self, X, y):
        bow = set.union(*X)
        self.X_train = X
        self.y_train = y
        self.idx_word = {i: w for i, w in enumerate(sorted(bow))}
        self.word_idx = {v: k for k, v in self.idx_word.items()}
        self.idx_label = {i: l for i, l in enumerate(sorted(set(y)))}
        self.label_idx = {v: k for k, v in self.idx_label.items()}
        m = np.zeros([len(self.idx_label), len(self.idx_word)])
        for X_, y_ in zip(X, y):
            for w in X_:
                m[self.label_idx[y_]][self.word_idx[w]] += 1
        self.m_org = m

    def train(self, X, y, class_prior_delta=0, cond_prob_delta=0.1):
        """
        Training Naive Bayes
        X: training feature (list of word set)
        y: training label (list of class)
        class_prior_delta: smoothing parameter for prior class probability
        cond_prob_delta: smoothing parameter for conditional probability
        """
        self.__create_feature_class_matrix__(X, y)
        label_cnt = Counter(y)
        p_label = np.zeros([len(label_cnt.keys())])
        for k, v in label_cnt.items():
            p_label[self.label_idx[k]] = v
        p_label += class_prior_delta
        self.p_label = p_label / p_label.sum()
        m = self.m_org + cond_prob_delta
        m = m / m.sum(axis=1, keepdims=True)
        self.m_train = m

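    def __predict__(self, X):
        # Log prior of each class; broadcast over all documents below
        log_prior = np.log10(self.p_label)
        # Binary document-term matrix restricted to the training vocabulary
        m_pre = np.zeros([len(X), len(self.idx_word)])
        for idx, X_ in enumerate(X):
            for w in X_:
                pos = self.word_idx.get(w)
                # Compare against None: index 0 is a valid position
                if pos is not None:
                    m_pre[idx][pos] = 1
        # Log-space score: log P(c) + sum of log P(t|c) over the words present
        pre = log_prior + m_pre.dot(np.log10(self.m_train).T)
        return pre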

    def predict(self, X):
        pre = self.__predict__(X)
        return [self.idx_label[i] for i in pre.argmax(axis=1)]

    def evaluate(self, X, y):
        pre = self.__predict__(X).argmax(axis=1)
        true = np.array([self.label_idx.get(i, len(self.label_idx)) for i in y])
        return np.sum(pre==true) / len(pre)

## Train the NB model

It's time to try the model and evaluate it on the test set.

In [3]:
training_data = 'data/train.txt'
test_data = 'data/test.txt'

X_train, y_train = parse_file(training_data)
X_test, y_test = parse_file(test_data)

In [4]:
nb = MBNaiveBayes()
nb.train(X_train, y_train)
nb.evaluate(X_test, y_test)

0.91

The performance is pretty good for such a simple model.