Apply Bayes' Rule to multi-class document classification <br> (from scratch)
-----

<center><img src="http://www.saedsayad.com/images/Bayes_rule.png" width="700"/></center>

Always Add Smoothing to Naive Bayes
------

![](images/laplace.png)
Laplace Smoothing in this case
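
To see why smoothing matters, here is a minimal sketch using made-up counts (the tokens `a`, `b`, `c` and the `counts` variable are illustrative assumptions, not the notebook's data). Without smoothing, a word never seen in a class gets probability zero, which would zero out the whole product for that class.

```python
from collections import Counter

# Hypothetical token counts for one class (illustrative only)
counts = Counter("a a b".split())
vocab = {"a", "b", "c"}          # "c" never appears in this class
total = sum(counts.values())     # 3 tokens in the class

# Maximum-likelihood estimate: unseen words get probability 0
unsmoothed = {w: counts[w] / total for w in vocab}

# Laplace (add-one) smoothing: every word gets a nonzero probability
smoothed = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

print(unsmoothed["c"])  # 0.0
print(smoothed["c"])    # 1/6
```

The smoothed values still sum to 1 over the vocabulary, so they remain a valid probability distribution.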

------
Naive Bayes Classification Steps
-------

1. Get labeled data
1. Preprocess
1. Apply Multinomial Naive Bayes
    1. Calculate document class priors
    1. Calculate conditional probabilities of each word for each class
    1. Calculate the proportional probabilities for each class of new document
    1. Pick the winning class
1. Evaluate with metrics
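
The final step, evaluation, is not carried out in this notebook. As a rough sketch of what it could look like, the snippet below computes accuracy and confusion counts. The `y_true` and `y_pred` lists are hypothetical labels invented for illustration, not predictions from the model built here.

```python
from collections import Counter

# Hypothetical true and predicted labels (illustrative, not from the notebook)
y_true = ["cat", "dog", "fish", "fish", "dog"]
y_pred = ["cat", "dog", "fish", "dog", "dog"]

# Accuracy: fraction of documents whose predicted label matches the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion counts keyed by (true label, predicted label)
confusion = Counter(zip(y_true, y_pred))

print(accuracy)                    # 0.8
print(confusion[("fish", "dog")])  # 1
```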

Get data & preprocess
-----

In [51]:
reset -fs

In [52]:
import sys
import subprocess

try:
    from dataclasses import dataclass
except ModuleNotFoundError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'dataclasses'])
    from dataclasses import dataclass

In [53]:
@dataclass
class Data:
    id_num: int
    label: str
    tokens: list

In [54]:
train = [Data(id_num=42, label='cat', tokens="🐯 😺 🐩 😺".split()),
         Data(43, 'dog',  "🐶 🐶 🐈 🐩 🐈 🐶 🐶".split()),
         Data(44, 'fish', "🐟 🐠 🐠".split()),
         Data(45, 'cat',  "🐶 🐈 🐈 🐈".split()),
         Data(46, 'fish', "🐟 🐬 🐬 🐠 🐠".split()),
         Data(47, 'fish', "🐡 🐡 🐠 🐶".split()),
         Data(48, 'dog',  "🐶 🐶 🐈 🐩 🐶 🐶".split()),
        ]

Calculate document class priors
---- 

$$P(c) = \frac{N_c}{N}$$

In [55]:
labels = {doc.label for doc in train}
labels

{'cat', 'dog', 'fish'}

In [56]:
n_docs = len(train)
n_docs

7

In [57]:
doc_priors = {label:sum(1 for data in train if data.label == label)/n_docs
                 for label in labels}
doc_priors

{'dog': 0.2857142857142857,
 'fish': 0.42857142857142855,
 'cat': 0.2857142857142857}

Calculate conditional probabilities of each word for each class
-----

$$P(w|c) = \frac{count(w,c)+1}{count(c)+|V|}$$

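with Laplace (add-one) smoothing, where $count(w,c)$ is the number of times word $w$ occurs in documents of class $c$, $count(c)$ is the total number of tokens in class $c$, and $|V|$ is the size of the vocabulary.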

In [58]:
from collections import defaultdict
from itertools import chain

In [59]:
flatten = chain.from_iterable

In [60]:
# Cardinality of vocabulary
vocab = set(flatten(doc.tokens for doc in train))
print("Vocab:", vocab)

v = len(vocab)
print("Cardinality of vocab:", v)

Vocab: {'🐠', '😺', '🐬', '🐟', '🐶', '🐈', '🐯', '🐡', '🐩'}
Cardinality of vocab: 9


In [61]:
cond_prob = defaultdict(lambda: defaultdict(int))

for label in labels:
    # Get a single list of all the tokens for all the docs for a given label
    label_tokens = list(flatten(doc.tokens for doc in train if doc.label == label))
    for token in vocab:
        # Laplace-smoothed conditional probability: (token count + 1) / (total token count + |V|)
        cond_prob[label][token] = (label_tokens.count(token) + 1) / (len(label_tokens) + v)
cond_prob

defaultdict(<function __main__.<lambda>()>,
            {'dog': defaultdict(int,
                         {'🐠': 0.045454545454545456,
                          '😺': 0.045454545454545456,
                          '🐬': 0.045454545454545456,
                          '🐟': 0.045454545454545456,
                          '🐶': 0.4090909090909091,
                          '🐈': 0.18181818181818182,
                          '🐯': 0.045454545454545456,
                          '🐡': 0.045454545454545456,
                          '🐩': 0.13636363636363635}),
             'fish': defaultdict(int,
                         {'🐠': 0.2857142857142857,
                          '😺': 0.047619047619047616,
                          '🐬': 0.14285714285714285,
                          '🐟': 0.14285714285714285,
                          '🐶': 0.09523809523809523,
                          '🐈': 0.047619047619047616,
                          '🐯': 0.047619047619047616,
                          '🐡': 0.1428571

In [62]:
# Test that each label's conditional probabilities form a pmf (sum to 1)
for label in labels:
    assert abs(sum(cond_prob[label].values()) - 1) < 1e-9

Given a new document, calculate the proportional probabilities for each class
-------

In [63]:
import operator
from functools import reduce

def product(iterable):
    return reduce(operator.mul, iterable, 1)

In [70]:
test = Data(id_num=90, label=None, tokens="😺".split())
# test = Data(id_num=91, label=None, tokens="🐶 🐶".split()) 
# test = Data(id_num=92, label=None, tokens="🐶 😺".split())
# test = Data(id_num=93, label=None, tokens="🐈 🐈 🐶 🐶 🐡 🐬 🐡 🐬 🐡 🐬".split())
# test = Data(id_num=94, label=None, tokens="🐬".split())
# test = Data(id_num=95, label=None, tokens="🐹".split()) # Out of sample prediction
# test = Data(id_num=96, label=None, tokens="🐹 🐶".split()) # Out of sample prediction

prob_predicted = defaultdict(float)
for label in labels:
    prob_predicted[label] = doc_priors[label] * product(cond_prob[label][token] for token in test.tokens)
    
print(*dict(prob_predicted).items(), sep='\n')

('dog', 0.012987012987012986)
('fish', 0.02040816326530612)
('cat', 0.05042016806722689)
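
The products above are only proportional to the posterior probabilities, and on long documents multiplying many small numbers can underflow. A common alternative, sketched here, is to sum logarithms and then normalize. The `priors` and `likelihoods` dictionaries below restate the values computed earlier for the single-token document "😺"; the variable names are assumptions made for this sketch.

```python
import math

# Values restated from the cells above for the single-token test document "😺"
priors = {"dog": 2 / 7, "fish": 3 / 7, "cat": 2 / 7}
likelihoods = {"dog": [1 / 22], "fish": [1 / 21], "cat": [3 / 17]}  # P(token | class), one per token

# Log-space score: log P(c) + sum of log P(w | c)
log_scores = {c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
              for c in priors}

# Normalize to posteriors that sum to 1 (subtract the max before exponentiating for stability)
m = max(log_scores.values())
unnorm = {c: math.exp(s - m) for c, s in log_scores.items()}
z = sum(unnorm.values())
posteriors = {c: u / z for c, u in unnorm.items()}

print(max(posteriors, key=posteriors.get))  # cat
```

Note that `math.log` raises an error on zero, so this approach relies on smoothing to keep every probability strictly positive.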


Pick the winning class
-----

In [65]:
from operator import itemgetter

In [66]:
# Naive: take the first class with the highest probability (ignores ties)
max(prob_predicted.items(),
    key=itemgetter(1))[0]

'cat'

In [67]:
# Handle ties: return every class that shares the highest probability
max_value = max(prob_predicted.values())
[label for label, prob in prob_predicted.items() if prob == max_value]

['cat']
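
Because `cond_prob` is built on `defaultdict(int)`, a token outside the training vocabulary (such as 🐹 in the commented-out test documents) gets probability 0 for every class, so all classes tie at zero. One possible workaround, sketched below under the assumption that unknown tokens can simply be ignored, is to drop out-of-vocabulary tokens before scoring. The `score` function and the toy model values are hypothetical and chosen only for illustration.

```python
from math import prod  # available in Python 3.8+

# Hypothetical toy model (illustrative values, not the notebook's)
vocab = {"a", "b"}
toy_priors = {"x": 0.5, "y": 0.5}
toy_cond = {"x": {"a": 0.75, "b": 0.25},
            "y": {"a": 0.25, "b": 0.75}}

def score(tokens):
    """Return proportional class probabilities, ignoring tokens outside the vocabulary."""
    known = [t for t in tokens if t in vocab]
    return {c: toy_priors[c] * prod(toy_cond[c][t] for t in known)
            for c in toy_priors}

scores = score(["a", "zzz"])        # "zzz" is out of vocabulary and is skipped
print(max(scores, key=scores.get))  # x
```

If every token is unknown, the product over an empty list is 1, so the scores fall back to the class priors.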

Summary
------

- Naive Bayes is a simple and powerful algorithm for text classification
- Always add smoothing to Naive Bayes
- Use the Standard Library to create elegant and performant code
- Dataclasses are part of the standard library as of Python 3.7; use them

----
Resources
-----

![](images/bayes_slide.png)

In [68]:
# # The data from the slide
# train = [Data(id_num=1, label='c', tokens="C B C".split()),
#          Data(2, 'c', "C C S".split()),
#          Data(3, 'c', "C M".split()),
#          Data(4, 'j', "T J C".split()),
#         ]

# test = Data(5, label=None, tokens="C C C T J".split())
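
As a quick cross-check, the sketch below runs the slide's data through a self-contained version of the same Laplace-smoothed multinomial calculation. It uses plain `(label, tokens)` tuples instead of the `Data` class so that it runs on its own; that representation is an assumption made for this sketch.

```python
from collections import Counter
from math import prod  # available in Python 3.8+

# The slide's data as (label, tokens) pairs
train = [("c", "C B C".split()),
         ("c", "C C S".split()),
         ("c", "C M".split()),
         ("j", "T J C".split())]
test_tokens = "C C C T J".split()

vocab = {tok for _, toks in train for tok in toks}
labels = {label for label, _ in train}

scores = {}
for label in labels:
    # Token counts and total token count for this class
    counts = Counter(tok for lab, toks in train if lab == label for tok in toks)
    total = sum(counts.values())
    prior = sum(1 for lab, _ in train if lab == label) / len(train)
    # Prior times the Laplace-smoothed likelihood of each test token
    scores[label] = prior * prod((counts[t] + 1) / (total + len(vocab)) for t in test_tokens)

print(max(scores, key=scores.get))  # c
```

Class `c` scores about 0.0003 against about 0.0001 for class `j`, so `c` wins.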

The Zoo
------

🐶
🐕
🐩
🐯
🐈
😺
🐟
🐠
🐡
🐬
🐹