Writing Naive Bayes From Scratch
-----

<center><img src="https://imgs.xkcd.com/comics/modified_bayes_theorem.png" width="75%"/></center>

![](images/bayes_rule.png)

By The End Of This Session You Should Be Able To:
----

1. Write idiomatic Python to model data and calculate probability
1. List the steps to fit Naive Bayes
1. Implement Naive Bayes

------
Naive Bayes Classification Steps
-------

1. Get labeled data
1. Preprocess
1. Apply Multinomial Naive Bayes
    1. Calculate document class priors
    1. Calculate conditional probabilities of each word for each class
    1. Calculate the proportional (unnormalized) probability of each class for a new document
    1. Pick the winning class
1. Evaluate with metrics
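
The "pick the winning class" step can be written as a single decision rule. This is the standard textbook form of Multinomial Naive Bayes, not something specific to this session; the symbols $d$ (the new document), $w$ (a token in it), and $\hat{c}$ (the predicted class) are notation introduced here for illustration:

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{w \in d} P(w \mid c)$$

The product is only proportional to the true posterior $P(c \mid d)$ because the shared denominator $P(d)$ is dropped. It is the same for every class, so it does not change which class wins.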

Get data & preprocess
-----

In [57]:
reset -fs

```python
# Goal: a data class that holds one labeled document (the class itself is defined in the next cell)
data = LabeledTextData(id_num=42, label='cat', tokens="🐱 🐱 🐈 🐶".split())
```

In [58]:
class LabeledTextData:
    def __init__(self, id_num, label, tokens):
        self.id_num = id_num
        self.label = label 
        self.tokens = tokens

In [59]:
data = LabeledTextData(id_num=42, label='cat', tokens="🐱 🐱 🐈 🐶".split())

__THERE MUST BE A BETTER WAY!__

In [60]:
from dataclasses import dataclass

Learn more about dataclasses [here](https://realpython.com/python-data-classes/)

In [61]:
@dataclass
class LabeledTextData:
    id_num: int
    label: str
    tokens: list
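
To see why the dataclass version is the "better way", here is a small sketch of what the decorator generates. The variable names `a` and `b` are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class LabeledTextData:
    id_num: int
    label: str
    tokens: list


a = LabeledTextData(42, 'cat', "🐱 🐱 🐈 🐶".split())
b = LabeledTextData(42, 'cat', "🐱 🐱 🐈 🐶".split())

# @dataclass generates __init__, __repr__, and __eq__ from the annotated fields
print(a)       # readable repr listing every field by name
print(a == b)  # field-by-field comparison instead of identity comparison
```

The hand-written class above would need explicit `__repr__` and `__eq__` methods to get the same behavior.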

In [62]:
train = [LabeledTextData(42, 'cat',  "🐈 🐯 🐱 🐩 🐱".split()),
         LabeledTextData(43, 'dog',  "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split()),
         LabeledTextData(45, 'cat',  "🐈 🐈 🐯 🐶 🐈".split()),
         LabeledTextData(45, 'cat',  "🐈 🐈 🐈".split()),
         LabeledTextData(48, 'dog',  "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶 ".split()),
        ]

Calculate document class priors
---- 

$$P(c) = \frac{N_c}{N}$$

In [63]:
# What labels are we dealing with?
labels = {d.label for d in train}
labels

{'cat', 'dog'}

In [64]:
# How many documents are we dealing with?
n_docs = len(train)
n_docs

5

In [65]:
from collections import defaultdict

In [66]:
# For each label, compute its prior: the fraction of training documents with that label
doc_priors = defaultdict(float)

for label in labels:
    doc_priors[label] = sum(1 for d in train if d.label == label) / n_docs

print(*doc_priors.items(), sep='\n')

('cat', 0.6)
('dog', 0.4)
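
The same priors can also be computed in one pass with `collections.Counter`. This is an alternative sketch, not the approach used in the cell above; it redefines the data class and training set so it runs on its own, and `label_counts` is a name introduced here:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabeledTextData:
    id_num: int
    label: str
    tokens: list


train = [LabeledTextData(42, 'cat', "🐈 🐯 🐱 🐩 🐱".split()),
         LabeledTextData(43, 'dog', "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split()),
         LabeledTextData(45, 'cat', "🐈 🐈 🐯 🐶 🐈".split()),
         LabeledTextData(45, 'cat', "🐈 🐈 🐈".split()),
         LabeledTextData(48, 'dog', "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶".split()),
        ]

# Count how many documents carry each label (N_c), in a single pass
label_counts = Counter(d.label for d in train)
n_docs = len(train)

# P(c) = N_c / N
doc_priors = {label: count / n_docs for label, count in label_counts.items()}
print(*doc_priors.items(), sep='\n')
```

A single pass scales better than re-scanning `train` once per label, though for two labels and five documents the difference is negligible.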


Calculate conditional probabilities of each word for each class
-----
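
One possible way to carry out this step is sketched below. It assumes add-one (Laplace) smoothing, which this session has not specified, so treat `alpha`, `vocab`, `word_counts`, and `cond_probs` as illustrative names rather than the session's own implementation. Under that assumption, with $V$ the vocabulary:

$$P(w \mid c) = \frac{\text{count}(w, c) + 1}{\sum_{w'} \text{count}(w', c) + |V|}$$

```python
from collections import Counter, defaultdict

# (label, tokens) pairs mirroring the training data above
train = [
    ('cat', "🐈 🐯 🐱 🐩 🐱".split()),
    ('dog', "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split()),
    ('cat', "🐈 🐈 🐯 🐶 🐈".split()),
    ('cat', "🐈 🐈 🐈".split()),
    ('dog', "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶".split()),
]

# Every distinct token seen in training
vocab = {token for _, tokens in train for token in tokens}

# Pool the token counts of all documents that share a label
word_counts = defaultdict(Counter)
for label, tokens in train:
    word_counts[label].update(tokens)

alpha = 1  # assumption: add-one smoothing so unseen words never get probability 0

cond_probs = {}
for label, counts in word_counts.items():
    denominator = sum(counts.values()) + alpha * len(vocab)
    cond_probs[label] = {w: (counts[w] + alpha) / denominator for w in vocab}

for label, probs in cond_probs.items():
    print(label, {w: round(p, 3) for w, p in probs.items()})
```

Without smoothing, any word that never appears in a class would zero out the whole product for that class, regardless of the other words in the document.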

<br>
<br> 
<br>

----