Apply Bayes' Rule to multi-class document classification <br> (from scratch)
-----

![](images/bayes_rule.png)

Always Add Smoothing to Naive Bayes
------

![](images/laplace.png)
Laplace Smoothing in this case

------
Naive Bayes Classification Steps
-------

1. Get labeled data
1. Preprocess
1. Apply Multinomial Naive Bayes
    1. Calculate document class priors
    1. Calculate conditional probabilities of each word for each class
    1. Calculate the proportional probabilities for each class of new document
    1. Pick the winning class
1. Evaluate with metrics
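
The steps above can be sketched end to end in a few lines. This is a minimal illustration rather than the notebook's own implementation: the names `fit` and `predict`, the `(label, tokens)` tuple format, and the toy word documents are all assumptions made for the example. It also assumes that tokens outside the training vocabulary are simply skipped at prediction time.

```python
from collections import Counter, defaultdict

def fit(docs):
    """docs: list of (label, tokens) pairs. Returns (priors, cond_prob)."""
    # Step 3.1: document class priors, P(c) = N_c / N
    label_counts = Counter(label for label, _ in docs)
    priors = {label: n / len(docs) for label, n in label_counts.items()}

    # Count every token per class and collect the vocabulary
    token_counts = defaultdict(Counter)
    for label, tokens in docs:
        token_counts[label].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}

    # Step 3.2: Laplace-smoothed P(w | c) = (count(w, c) + 1) / (count(c) + |V|)
    cond_prob = {
        label: {t: (counts[t] + 1) / (sum(counts.values()) + len(vocab))
                for t in vocab}
        for label, counts in token_counts.items()
    }
    return priors, cond_prob

def predict(tokens, priors, cond_prob):
    """Steps 3.3 and 3.4: score each class, then pick the highest score."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for t in tokens:
            # Assumption: tokens never seen in training are ignored
            if t in cond_prob[label]:
                score *= cond_prob[label][t]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, not the notebook's emoji documents
docs = [("cat", ["meow", "purr", "meow"]),
        ("dog", ["woof", "woof", "meow"]),
        ("dog", ["woof", "bark"])]

priors, cond_prob = fit(docs)
print(predict(["purr", "meow"], priors, cond_prob))
print(predict(["woof"], priors, cond_prob))
```

The rest of the notebook builds the same pieces one cell at a time on the emoji data.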

Get data & preprocess
-----

In [2]:
reset -fs

In [3]:
import sys
import subprocess

try:
    from dataclasses import dataclass
except ModuleNotFoundError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'dataclasses'])
    from dataclasses import dataclass

In [4]:
@dataclass
class Data:
    id_num: int
    label: str
    tokens: list

In [5]:
train = [Data(id_num=42, label='cat', tokens="🐯 😺 🐩 😺".split()),
         Data(43, 'dog',  "🐶 🐶 🐈 🐩 🐈 🐶 🐶".split()),
         Data(44, 'fish', "🐟 🐠 🐠".split()),
         Data(45, 'cat',  "🐶 🐈 🐈 🐈".split()),
         Data(46, 'fish', "🐟 🐬 🐬 🐠 🐠".split()),
         Data(47, 'fish', "🐡 🐡 🐠 🐶".split()),
         Data(48, 'dog',  "🐶 🐶 🐈 🐩 🐶 🐶".split()),
        ]

Calculate document class priors
---- 

$$P(c) = \frac{N_c}{N}$$

In [6]:
# What labels are we dealing with?
labels = {d.label for d in train}
labels

{'cat', 'dog', 'fish'}

In [21]:
# How many documents are we dealing with?
n_docs = len(train)
n_docs

7

In [22]:
doc_priors = {label:sum(1 for d in train if d.label == label)/n_docs
                 for label in labels}
doc_priors

{'fish': 0.42857142857142855,
 'dog': 0.2857142857142857,
 'cat': 0.2857142857142857}
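
As an alternative sketch, the same priors can be computed with `collections.Counter`, which counts each label in a single pass instead of rescanning the documents once per label. The list `train_labels` below is an assumption for illustration: it restates the labels of the seven training documents so that the block runs on its own.

```python
from collections import Counter

# Labels of the seven training documents, in order (copied from `train`)
train_labels = ['cat', 'dog', 'fish', 'cat', 'fish', 'fish', 'dog']

label_counts = Counter(train_labels)      # one pass over the data
total = sum(label_counts.values())

# P(c) = N_c / N
priors = {label: n / total for label, n in label_counts.items()}
print(priors)
```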

Calculate conditional probabilities of each word for each class
-----

$$P(w|c) = \frac{count(w,c)+1}{count(c)+|V|}$$

with Laplace smoothing
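
To make the formula concrete, here is a small worked example for the `fish` class. The counts are read off the training data above: the three fish documents contain 12 tokens in total, 🐠 appears 5 times among them, and the vocabulary has 9 distinct tokens. The variable names are illustrative only.

```python
# Counts taken from the fish documents in the training data
count_w_c = 5     # occurrences of 🐠 in fish documents
count_c = 12      # total tokens in fish documents
vocab_size = 9    # |V|, distinct tokens across all documents

# Seen token: smoothing shifts the estimate slightly
unsmoothed = count_w_c / count_c
smoothed = (count_w_c + 1) / (count_c + vocab_size)

# Token in the vocabulary but never seen in fish documents, e.g. 😺
unseen_unsmoothed = 0 / count_c
unseen_smoothed = (0 + 1) / (count_c + vocab_size)

print(unsmoothed, smoothed)
print(unseen_unsmoothed, unseen_smoothed)
```

Without smoothing, a single token that never co-occurred with a class would zero out the whole product for that class; with smoothing it contributes a small, non-zero probability (1/21 here).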

In [9]:
from collections import defaultdict
from itertools import chain

In [10]:
flatten = chain.from_iterable

In [26]:
# Get all tokens, aka the vocabulary
vocab = set(flatten(d.tokens for d in train))
print("Vocab:", vocab)

# Cardinality of vocabulary
v = len(vocab)
print("Cardinality of vocab:", v)

Vocab: {'😺', '🐩', '🐡', '🐟', '🐈', '🐬', '🐶', '🐯', '🐠'}
Cardinality of vocab: 9


In [29]:
cond_prob = defaultdict(lambda: defaultdict(float))

for label in labels:
    # For a given label, collect the tokens of every doc with that label
    label_tokens = list(flatten(d.tokens for d in train if d.label == label))
    for token in vocab:
        # Laplace-smoothed conditional probability:
        # (count(w, c) + 1) / (count(c) + |V|)
        cond_prob[label][token] = (label_tokens.count(token) + 1) / (len(label_tokens) + v)
cond_prob

defaultdict(<function __main__.<lambda>()>,
            {'fish': defaultdict(float,
                         {'😺': 0.047619047619047616,
                          '🐩': 0.047619047619047616,
                          '🐡': 0.14285714285714285,
                          '🐟': 0.14285714285714285,
                          '🐈': 0.047619047619047616,
                          '🐬': 0.14285714285714285,
                          '🐶': 0.09523809523809523,
                          '🐯': 0.047619047619047616,
                          '🐠': 0.2857142857142857}),
             'dog': defaultdict(float,
                         {'😺': 0.045454545454545456,
                          '🐩': 0.13636363636363635,
                          '🐡': 0.045454545454545456,
                          '🐟': 0.045454545454545456,
                          '🐈': 0.18181818181818182,
                          '🐬': 0.045454545454545456,
                          '🐶': 0.4090909090909091,
                          '🐯': 0.0454

In [30]:
# Test that each label's conditional probabilities form a pmf (sum to 1)
for label in labels:
    assert abs(sum(cond_prob[label].values()) - 1) < 1e-9

Given a new document without a label, calculate the proportional probabilities for each class
-------

$$ P(c | X) \propto P(c) \cdot \prod_{i=1}^n P(x_i | c)$$
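
For long documents, multiplying many small probabilities can underflow to `0.0` in floating point. A common workaround, sketched here as an aside and not something the notebook itself does, is to sum logarithms instead: the ranking of classes is unchanged because the logarithm is monotonic. The function name `log_score` and the 400-token document are hypothetical. The sketch also assumes every probability is strictly positive, since `math.log(0)` raises an error.

```python
import math
import operator
from functools import reduce

def log_score(prior, token_probs):
    """Return log P(c) + sum_i log P(x_i | c); all inputs must be > 0."""
    return math.log(prior) + sum(math.log(p) for p in token_probs)

# Hypothetical long document: 400 tokens, each with probability 0.05
token_probs = [0.05] * 400

raw = 0.4 * reduce(operator.mul, token_probs, 1.0)  # underflows to 0.0
logged = log_score(0.4, token_probs)                 # finite negative number

print(raw)
print(logged)
```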

In [32]:
import operator
from functools import reduce

def product(iterable):
    return reduce(operator.mul, iterable, 1)

In [36]:
# test = Data(id_num=90, label=None, tokens="😺".split())
# test = Data(id_num=91, label=None, tokens="🐶 🐶".split()) 
# test = Data(id_num=92, label=None, tokens="🐶 😺".split())
# test = Data(id_num=93, label=None, tokens="🐈 🐈 🐶 🐶 🐡 🐬 🐡 🐬 🐡 🐬".split())
# test = Data(id_num=94, label=None, tokens="🐬".split())
test = Data(id_num=95, label=None, tokens="🐹".split()) # Out-of-vocabulary token: 🐹 never appears in train
# test = Data(id_num=95, label=None, tokens="🐹 🐶".split()) # Mix of out-of-vocabulary and known tokens

prob_predicted = defaultdict(float)
for label in labels:
    prob_predicted[label] = doc_priors[label] * product(cond_prob[label][t] for t in test.tokens)
    
print(*dict(prob_predicted).items(), sep='\n')

('fish', 0.0)
('dog', 0.0)
('cat', 0.0)
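
Every score is zero because 🐹 is not in the training vocabulary: `cond_prob` is a `defaultdict(float)`, so looking up an unknown token returns `0.0`, and that zero wipes out the product. One possible alternative, sketched below under the assumption that unknown tokens carry no evidence for any class, is to skip them when scoring. The function `score_skipping_unknown` is hypothetical, and the two probabilities are the smoothed `fish` values from the table above.

```python
def score_skipping_unknown(tokens, prior, token_probs):
    """Multiply the prior by P(token | class), ignoring unknown tokens."""
    result = prior
    for t in tokens:
        if t in token_probs:          # known token: use its smoothed estimate
            result *= token_probs[t]
        # unknown token: leave the score unchanged (assumption)
    return result

# Smoothed fish probabilities for two tokens, from the table above
fish_probs = {'🐠': 6 / 21, '🐶': 2 / 21}
fish_prior = 3 / 7

only_unknown = score_skipping_unknown(['🐹'], fish_prior, fish_probs)
mixed = score_skipping_unknown(['🐹', '🐶'], fish_prior, fish_probs)

print(only_unknown)
print(mixed)
```

With this choice, a document made up entirely of unknown tokens scores exactly the class prior, which agrees with the fallback to document priors used when picking the winning class.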


Pick the winning class
-----

In [37]:
from operator import itemgetter

In [47]:
# Naive
label, prob = max(prob_predicted.items(),
                  key=itemgetter(1))
print(label)

fish


In [52]:
# Handle ties
label, prob = max(prob_predicted.items(),
                  key=itemgetter(1))
print(*(k for k, v in prob_predicted.items() if v == prob))

fish dog cat


In [54]:
# Handle ties and fall back to document priors if winning probability is zero
label, prob = max(prob_predicted.items(),
                  key=itemgetter(1))
if prob > 0:
    print(*(k for k, v in prob_predicted.items() if v == prob))
else:
    label, prob = max(doc_priors.items(),
                      key=itemgetter(1))
    print(label)

fish
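
The three cells above can be folded into one helper. This is a sketch rather than part of the original notebook: the name `pick_winners` is hypothetical, and returning a sorted list of tied labels (instead of printing them) is a design choice made here so that the result is deterministic and easy to test.

```python
def pick_winners(scores, priors):
    """Return the label(s) with the highest score, as a sorted list.

    Falls back to the highest document prior when every score is zero.
    """
    best = max(scores.values())
    if best > 0:
        return sorted(label for label, s in scores.items() if s == best)
    top_prior = max(priors.values())
    return sorted(label for label, p in priors.items() if p == top_prior)

priors = {'fish': 3 / 7, 'dog': 2 / 7, 'cat': 2 / 7}

# All scores zero, as for the 🐹 document: fall back to the priors
all_zero = pick_winners({'fish': 0.0, 'dog': 0.0, 'cat': 0.0}, priors)

# Hypothetical scores with a tie between two classes
tied = pick_winners({'fish': 0.1, 'dog': 0.1, 'cat': 0.05}, priors)

print(all_zero)
print(tied)
```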


Summary
------

- Naive Bayes is a simple and powerful algorithm for text classification
- Always add smoothing to Naive Bayes
- Use the Standard Library to create elegant and performant code
- Dataclasses are in the standard library from Python 3.7 onward (a backport exists for 3.6); get ready

----
Resources
-----

![](images/bayes_slide.png)

In [19]:
# # The data from the slide
# train = [Data(id_num=1, label='c', tokens="C B C".split()),
#          Data(2, 'c', "C C S".split()),
#          Data(3, 'c', "C M".split()),
#          Data(4, 'j', "T J C".split()),
#         ]

# test = Data(5, label=None, tokens="C C C T J".split())

The Zoo
------

🐶
🐕
🐩
🐯
🐈
😺
🐟
🐠
🐡
🐬
🐹