In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

# Naive Bayes Classifier

The Naive Bayes classifier is based on Bayes' theorem and is used for classification problems, for instance NLP text classification tasks such as topic modeling, sentiment analysis, and spam detection.

## Bayes' theorem:

A partition of a set $X$ is a family of subsets $\{A_i \subset X \mid i \in I\}$ such that $X = \bigcup_{i \in I}A_i$ and $A_i \cap A_j = \emptyset$ for every pair $i, j \in I$ with $i \neq j$.

For each subset $A \subset X$, the sets $A$ and $A^c$ form a partition: $X = A \cup A^c$

#### Total probability theorem:
For every partition $A_1, A_2, \dots, A_k$ of $\Omega$ with $P(A_i) > 0$ for each $i$, and every event $B \subset \Omega$:
$$P(B) = \sum_{i=1}^{k}P(B|A_i)P(A_i)$$

#### Theorem (Bayes' theorem):
Let $A_1, A_2, \dots, A_k$ be a partition of $\Omega$ such that $P(A_i) > 0$ for each $i \in \{1, 2, \dots, k\}$. Then for every event $B \subset \Omega$ with $P(B) > 0$ and each $i \in \{1, 2, \dots, k\}$:
$$
P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{k}P(B|A_j)P(A_j)}
$$

#### Note:
We call $P(A_i)$ the prior probability and $P(A_i|B)$ the posterior probability

For the events $A$ and $B$ such that $P(B) \gt 0$ we have:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$
<br>
If we consider the partition of $\Omega$ into $A$ and $A^c$, then from Bayes' theorem we have:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)} = \text{(by the total probability) }\frac{P(B|A)P(A)}{P(B)}
$$

#### Example:
Divide emails into three classes: $A_1 = \text{"spam"}$, $A_2 = \text{"low priority"}$ and $A_3 = \text{"high priority"}$, and let $P(A_1) = 0.7$, $P(A_2) = 0.2$ and $P(A_3) = 0.1$. ($P(A_1) + P(A_2) + P(A_3) = 0.7 + 0.2 + 0.1 = 1$)
<br>
Let $B$ be the event that an email contains the word "free", and suppose we know from previous experience that $P(B|A_1) = 0.9$, $P(B|A_2) = 0.01$ and $P(B|A_3) = 0.01$.
<br>
If we receive an email containing the word "free", what is the probability that it is spam?
From Bayes' theorem:
$$
P(A_1|B) = \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + P(B|A_3)P(A_3)} = \frac{0.9 \cdot 0.7}{0.9 \cdot 0.7 + 0.01 \cdot 0.2 + 0.01 \cdot 0.1} \approx 0.995
$$
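
The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch; the names `priors`, `likelihoods`, `p_b`, and `posterior` are illustrative and do not appear elsewhere in the notebook.

```python
# Priors P(A_i) and likelihoods P(B | A_i) taken from the email example
priors = {'spam': 0.7, 'low priority': 0.2, 'high priority': 0.1}
likelihoods = {'spam': 0.9, 'low priority': 0.01, 'high priority': 0.01}

# Total probability theorem: P(B) = sum_i P(B | A_i) P(A_i)
p_b = sum(likelihoods[c] * priors[c] for c in priors)

# Bayes' theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posterior = {c: likelihoods[c] * priors[c] / p_b for c in priors}

print(round(p_b, 3))                # P(B)
print(round(posterior['spam'], 3))  # P(spam | "free")
```

The posteriors over the three classes sum to 1, because the denominator is exactly $P(B)$ from the total probability theorem.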

## Multi-dimensional case

#### Example Golf and Weather

In [121]:
import pandas as pd
import numpy as np
from pathlib import Path

In [67]:
path = Path('data')
nb = path / 'naive_bayes'
golf_csv = nb / 'golf.csv'

In [174]:
def strip_txt(txt:str) -> str:
    return txt.replace("'", '').strip() if txt else txt

In [175]:
df = pd.read_csv(golf_csv, converters={'outlook':strip_txt,
                                       'temp': strip_txt,
                                       'humidity': strip_txt,
                                       'wind': strip_txt,
                                       'label': strip_txt})
df

Unnamed: 0,outlook,temp,humidity,wind,label
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


Let's calculate the probability of each label (the priors):
<br>
$P(\text{"Yes"}) = 9/14$ and $P(\text{"No"}) = 5/14$

Next, the conditional probabilities of the outlook value 'Sunny' given each label:
<br>
$P(\text{"Sunny"}|\text{"Yes"}) = 2/9$ and $P(\text{"Sunny"}|\text{"No"}) = 3/5$


And the conditional probabilities of the outlook value 'Overcast' given each label:
<br>
$P(\text{"Overcast"}|\text{"Yes"}) = 4/9$ and $P(\text{"Overcast"}|\text{"No"}) = 0/5$
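
The priors $P(\text{label})$ and the conditionals $P(\text{outlook value}|\text{label})$ are just ratios of row counts. The sketch below reproduces them without pandas. The `outlook` list is copied from the `df.outlook` output in this notebook. The labels for rows 0 to 9 come from the displayed table, while the labels for rows 10 to 13 are an assumption: they are inferred so that the per-class counts match the counts printed by the notebook. The helper `cond_prob` is hypothetical.

```python
from collections import Counter

# Outlook values as printed by df.outlook (14 rows)
outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
           'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain']
# Labels: rows 0-9 come from the displayed table; rows 10-13 are an ASSUMPTION,
# inferred so that the per-class counts match the notebook's printed counts
label = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
         'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

label_counts = Counter(label)               # rows per label
pair_counts = Counter(zip(outlook, label))  # rows per (outlook, label) pair

# Priors: P(label) = count(label) / total rows
priors = {lab: cnt / len(label) for lab, cnt in label_counts.items()}

def cond_prob(value, lab):
    """P(outlook = value | label = lab) as a ratio of counts."""
    return pair_counts[(value, lab)] / label_counts[lab]

for value in ('Sunny', 'Overcast', 'Rain'):
    print(value, cond_prob(value, 'Yes'), cond_prob(value, 'No'))
```

Note that the denominator is the number of rows with the given label, not the number of rows with the given outlook value.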

In [176]:
df.outlook

0        Sunny
1        Sunny
2     Overcast
3         Rain
4         Rain
5         Rain
6     Overcast
7        Sunny
8        Sunny
9         Rain
10       Sunny
11    Overcast
12    Overcast
13        Rain
Name: outlook, dtype: object

In [192]:
# Count the rows per label and the total number of rows
y_vals = len(df[df.label == 'Yes'])
n_vals = len(df[df.label == 'No'])
f_vals = len(df)
y_vals, n_vals, f_vals

(9, 5, 14)

In [178]:
df.outlook.unique()

array(['Sunny', 'Overcast', 'Rain'], dtype=object)

In [179]:
# Count the rows for each (outlook, label) combination
sunny_y = len(df[(df.outlook == 'Sunny') & (df.label == 'Yes')])
sunny_n = len(df[(df.outlook == 'Sunny') & (df.label == 'No')])
overcast_y = len(df[(df.outlook == 'Overcast') & (df.label == 'Yes')])
overcast_n = len(df[(df.outlook == 'Overcast') & (df.label == 'No')])
rain_y = len(df[(df.outlook == 'Rain') & (df.label == 'Yes')])
rain_n = len(df[(df.outlook == 'Rain') & (df.label == 'No')])
print(f'sunny_y = {sunny_y}/{y_vals}, sunny_n = {sunny_n}/{n_vals}')
print(f'overcast_y = {overcast_y}/{y_vals}, overcast_n = {overcast_n}/{n_vals}')
print(f'rain_y = {rain_y}/{y_vals}, rain_n = {rain_n}/{n_vals}')

sunny_y = 2/9, sunny_n = 3/5
overcast_y = 4/9, overcast_n = 0/5
rain_y = 3/9, rain_n = 2/5


In [180]:
from typing import Iterator, Tuple

def count_feat(col_name:str, col_val:str) -> int:
    # Number of rows where the column equals the given value
    return len(df[df[col_name] == col_val])

def count_cond(col_name:str, col_val:str, lab:str) -> int:
    # Number of rows where the column equals the given value and the label equals lab
    return len(df[(df[col_name] == col_val) & (df.label == lab)])

def count_probs(col_name:str) -> Iterator[Tuple[int, int, int, str]]:
    # Yield (count with 'Yes', count with 'No', total count, value) for each value of the column
    for col_val in df[col_name].unique():
        val_y = count_cond(col_name, col_val, 'Yes')
        val_n = count_cond(col_name, col_val, 'No')
        val_f = count_feat(col_name, col_val)
        yield val_y, val_n, val_f, col_val
    

In [181]:
temp_vals = list(count_probs('temp'))
temp_vals

[(2, 2, 4, 'Hot'), (4, 2, 6, 'Mild'), (3, 1, 4, 'Cool')]

In [182]:
col_vals = [(col_name, [(ys, ns, fs, vls) for (ys, ns, fs, vls) in count_probs(col_name)]) 
            for col_name in df.columns]
col_vals

[('outlook', [(2, 3, 5, 'Sunny'), (4, 0, 4, 'Overcast'), (3, 2, 5, 'Rain')]),
 ('temp', [(2, 2, 4, 'Hot'), (4, 2, 6, 'Mild'), (3, 1, 4, 'Cool')]),
 ('humidity', [(3, 4, 7, 'High'), (6, 1, 7, 'Normal')]),
 ('wind', [(6, 2, 8, 'Weak'), (3, 3, 6, 'Strong')]),
 ('label', [(0, 5, 5, 'No'), (9, 0, 9, 'Yes')])]

In [183]:
lns = ''
for col_val in col_vals:
    # Print the marginal as a fraction of all rows, and the conditionals as fractions of each label's rows
    ln = f'{col_val[0]}: \n' + '\n'.join(f'P({nm}) = {f_v}/{f_vals}, P({nm}|Yes) = {y_v}/{y_vals}, P({nm}|No) = {n_v}/{n_vals}'
                               for y_v, n_v, f_v, nm in col_val[1]) + '\n'
    lns += ln
    lns += '===============\n'
print(lns)

outlook: 
P(Sunny) = 5/14, P(Sunny|Yes) = 2/9, P(Sunny|No) = 3/5
P(Overcast) = 4/14, P(Overcast|Yes) = 4/9, P(Overcast|No) = 0/5
P(Rain) = 5/14, P(Rain|Yes) = 3/9, P(Rain|No) = 2/5
===============
temp: 
P(Hot) = 4/14, P(Hot|Yes) = 2/9, P(Hot|No) = 2/5
P(Mild) = 6/14, P(Mild|Yes) = 4/9, P(Mild|No) = 2/5
P(Cool) = 4/14, P(Cool|Yes) = 3/9, P(Cool|No) = 1/5
===============
humidity: 
P(High) = 7/14, P(High|Yes) = 3/9, P(High|No) = 4/5
P(Normal) = 7/14, P(Normal|Yes) = 6/9, P(Normal|No) = 1/5
===============
wind: 
P(Weak) = 8/14, P(Weak|Yes) = 6/9, P(Weak|No) = 2/5
P(Strong) = 6/14, P(Strong|Yes) = 3/9, P(Strong|No) = 3/5
===============
label: 
P(No) = 5/14, P(No|Yes) = 0/9, P(No|No) = 5/5
P(Yes) = 9/14, P(Yes|Yes) = 9/9, P(Yes|No) = 0/5
===============



In [185]:
model_vals = {col_name: {vls: (ys, ns, fs) for (ys, ns, fs, vls) in count_probs(col_name)} 
            for col_name in df.columns if col_name != 'label'}
model_vals

{'outlook': {'Sunny': (2, 3, 5), 'Overcast': (4, 0, 4), 'Rain': (3, 2, 5)},
 'temp': {'Hot': (2, 2, 4), 'Mild': (4, 2, 6), 'Cool': (3, 1, 4)},
 'humidity': {'High': (3, 4, 7), 'Normal': (6, 1, 7)},
 'wind': {'Weak': (6, 2, 8), 'Strong': (3, 3, 6)}}

Let's predict the label for a new observation:
<br>
Outlook: "Sunny",
Temperature: "Cool",
Humidity: "High",
Wind: "Strong"
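
The naive Bayes assumption is that the features are conditionally independent given the label. As a sketch, writing $y$ for the label and $x_1, \dots, x_n$ for the feature values (notation introduced here for illustration), the score of each label is:

$$
P(y|x_1, \dots, x_n) \propto P(y)\prod_{i=1}^{n}P(x_i|y)
$$

For the observation above, with the counts stored in `model_vals`, this gives:

$$
P(\text{"Yes"})\prod_{i}P(x_i|\text{"Yes"}) = \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \approx 0.0053
$$

$$
P(\text{"No"})\prod_{i}P(x_i|\text{"No"}) = \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \approx 0.0206
$$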


In [196]:
out_v = model_vals['outlook']['Sunny']
tmp_v = model_vals['temp']['Cool']
hum_v = model_vals['humidity']['High']
wnd_v = model_vals['wind']['Strong']
yes_raw = (out_v[0] / y_vals) * (tmp_v[0] / y_vals) * (hum_v[0] / y_vals) * (wnd_v[0] / y_vals) * (y_vals / f_vals) 
no_raw = (out_v[1] / n_vals) * (tmp_v[1] / n_vals) * (hum_v[1] / n_vals) * (wnd_v[1] / n_vals) * (n_vals / f_vals)
yes_raw, no_raw

(0.005291005291005291, 0.02057142857142857)

In [198]:
p_x = (out_v[2] / f_vals) * (tmp_v[2] / f_vals) * (hum_v[2] / f_vals) * (wnd_v[2] / f_vals)
p_x

0.021865889212827987

In [203]:
yes_pred = yes_raw / p_x
no_pred = no_raw / p_x
print(f'yes_pred = {yes_pred}, no_pred = {no_pred}')

yes_pred = 0.2419753086419753, no_pred = 0.9408
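
Note that the two scores above sum to more than 1 (about 1.18). The denominator `p_x` multiplies the marginal feature probabilities, which treats the features as unconditionally independent; this is not the same quantity as $P(x)$ given by the total probability theorem under the naive Bayes model. A minimal sketch of the alternative normalization, reusing the counts printed above (the names `counts`, `yes_post`, and `no_post` are illustrative):

```python
# (count with 'Yes', count with 'No') for each feature value of the observation,
# copied from the model_vals output
counts = {'Sunny': (2, 3), 'Cool': (3, 1), 'High': (3, 4), 'Strong': (3, 3)}
y_vals, n_vals, f_vals = 9, 5, 14

# Unnormalized scores: P(label) * prod_i P(x_i | label)
yes_raw = y_vals / f_vals
no_raw = n_vals / f_vals
for yes_cnt, no_cnt in counts.values():
    yes_raw *= yes_cnt / y_vals
    no_raw *= no_cnt / n_vals

# Total probability: P(x) = P(x | Yes) P(Yes) + P(x | No) P(No)
p_x = yes_raw + no_raw
yes_post, no_post = yes_raw / p_x, no_raw / p_x
print(f'yes_post = {yes_post:.4f}, no_post = {no_post:.4f}')
```

With these counts the scores normalize to roughly 0.20 for "Yes" and 0.80 for "No". The predicted label ("No") is the same as before, since dividing both scores by a common positive number does not change their ranking.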


In [205]:
def predict(x:tuple) -> tuple:
    out_v = model_vals['outlook'][x[0].capitalize()]
    tmp_v = model_vals['temp'][x[1].capitalize()]
    hum_v = model_vals['humidity'][x[2].capitalize()]
    wnd_v = model_vals['wind'][x[3].capitalize()]
    yes_raw = (out_v[0] / y_vals) * (tmp_v[0] / y_vals) * (hum_v[0] / y_vals) * (wnd_v[0] / y_vals) * (y_vals / f_vals) 
    no_raw = (out_v[1] / n_vals) * (tmp_v[1] / n_vals) * (hum_v[1] / n_vals) * (wnd_v[1] / n_vals) * (n_vals / f_vals)
    p_x = (out_v[2] / f_vals) * (tmp_v[2] / f_vals) * (hum_v[2] / f_vals) * (wnd_v[2] / f_vals)
    yes_pred = yes_raw / p_x
    no_pred = no_raw / p_x
    
    return yes_pred, no_pred

In [208]:
x_vec = ('Sunny', 'Cool', 'High', 'Strong')
y_pr, n_pr = predict(x_vec)
print(f'yes_pred = {y_pr}, no_pred = {n_pr}')

yes_pred = 0.2419753086419753, no_pred = 0.9408


Let $A_1 = \text{"spam"}$ and $A_2 = \text{"not spam"}$, and suppose we have $100$ emails, of which $30$ are spam and $70$ are not spam, so $P(A_1) = 0.3$ and $P(A_2) = 0.7$