## 2.75 Machine Learning - Naive Bayes

### Binary Categorical Features

If all features are categorical, implementing Naive Bayes requires nothing more than a few simple probability calculations based on frequency tables.

Naive Bayes uses Bayes Rule to compute the probability of the class $C_k$, $k \in \{1, 2, \dots, K\}$, given the feature vector $X = (x_1, x_2, \dots, x_n)$:

$ p(C_k | X) = \frac{p(C_k) \times p(X | C_k)}{p(X)}$

where $p(C_k)$ is the prior probability of class k, $p(X | C_k)$ is the likelihood of the feature vector given the class, and $p(X)$ is the probability of the feature vector.

*  The prior $p(C_k)$ is computed using the training data.  
*  The likelihood $p(X | C_k)$ is also computed using the training data.
*  The denominator is not a function of the class, so using the Naive Bayes classifier only requires computing the numerator.

The Naive Bayes classifier assumes that each feature is independent of the other features given the class variable (known as the conditional independence assumption), and so:

$ p(X | C) = p(x_1|C) \times p(x_2|C) \times \dots \times p(x_n|C)$

or

$ p(X | C) = \prod_{i=1}^{n} p(x_i|C) $

This reduces the computation to estimating each $p(x_i | C)$. For binary features and two classes, the assumption cuts the number of parameters to estimate to $2n$, where $n$ is the number of features (one probability per feature per class). Without it we would have $2(2^n-1)$ parameters to estimate (a full joint table over the $2^n$ feature combinations for each class). For categorical features, each $p(x_i | C)$ can be computed from a simple cross tabulation.
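To get a feel for how quickly the two parameter counts diverge, here is a minimal sketch that evaluates both formulas for a few values of $n$. The function names are illustrative only, and the counts assume binary features.

```python
def naive_param_count(n_features: int, n_classes: int = 2) -> int:
    """One probability p(x_i = 1 | C) per binary feature per class."""
    return n_classes * n_features


def full_joint_param_count(n_features: int, n_classes: int = 2) -> int:
    """A joint table over 2**n feature combinations per class,
    minus one per class because each table must sum to 1."""
    return n_classes * (2 ** n_features - 1)


for n in (2, 4, 10, 20):
    print(n, naive_param_count(n), full_joint_param_count(n))
```

With only four binary features the full joint model already needs 30 parameters against 8 for Naive Bayes, and the gap grows exponentially from there.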

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>In reality the conditional independence assumption does not hold - features are often correlated conditional on the class - but even with this violation of the assumption Naive Bayes classifiers work well in practice.  They are especially prevalent in spam detection and text mining.

We start by considering the simple case of binary categorical features.  We will show that applying Bayes rule with the assumption of conditional independence gives the same classification as the scikit-learn Naive Bayes classifier.

## Data

We will use a classic golf weather dataset - it contains four weather features (outlook, temperature, humidity and wind) and has a target describing whether a golfer played golf or not.

We are interested in fitting a model to the training data and then using the model to predict whether the golfer will play golf when the outlook is sunny, the temperature is low, the humidity is low and it is expected to be windy.  In other words we want to compute:

$ p(play=yes | outlook=sunny, temp=low, humidity=low, wind=True) = p(play=yes | weather)$

To compute this we need to compute both outcomes:

$ p(play=yes | outlook=sunny, temp=low, humidity=low, wind=True)= p(play=yes | weather)$
and
$ p(play=no | outlook=sunny, temp=low, humidity=low, wind=True)= p(play=no | weather)$




In [3]:
import pandas as pd

df = pd.read_csv('data/golf_binary.csv')
df.temp = pd.cut(df.temp,bins=[60,70,90],labels=['low','high'])
df.humidity = pd.cut(df.humidity, bins = [0,80,100], labels=['low','high'])
df

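|  | outlook | temp | humidity | wind | play |
|---|---|---|---|---|---|
| 0 | sunny | high | high | False | no |
| 1 | sunny | high | high | True | no |
| 2 | rainy | low | high | False | yes |
| 3 | rainy | low | low | False | yes |
| 4 | rainy | low | low | True | no |
| 5 | sunny | high | high | False | no |
| 6 | sunny | low | low | False | yes |
| 7 | rainy | high | low | False | yes |
| 8 | sunny | high | low | True | yes |
| 9 | rainy | high | high | True | no |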




In [11]:
pd.crosstab(df.wind, df.play)

| wind | play=no | play=yes |
|---|---|---|
| False | 2 | 4 |
| True | 3 | 1 |


## Simple Crosstabs

In [163]:
pd.crosstab(df.outlook,df.play).apply(lambda r: r*1.0/r.sum(), axis=0)

| outlook | play=no | play=yes |
|---|---|---|
| rainy | 0.4 | 0.6 |
| sunny | 0.6 | 0.4 |


In [167]:
pd.crosstab(df.temp,df.play).apply(lambda r: r*1.0/r.sum(), axis=0)

| temp | play=no | play=yes |
|---|---|---|
| low | 0.2 | 0.6 |
| high | 0.8 | 0.4 |


In [166]:
pd.crosstab(df.humidity,df.play).apply(lambda r: r*1.0/r.sum(), axis=0)

| humidity | play=no | play=yes |
|---|---|---|
| low | 0.2 | 0.8 |
| high | 0.8 | 0.2 |


In [165]:
pd.crosstab(df.wind,df.play).apply(lambda r: r*1.0/r.sum(), axis=0)

| wind | play=no | play=yes |
|---|---|---|
| False | 0.4 | 0.8 |
| True | 0.6 | 0.2 |


In [164]:
pd.DataFrame(df.play.value_counts()).apply(lambda r: r*1.0/r.sum(), axis=0)

|  | play |
|---|---|
| no | 0.5 |
| yes | 0.5 |
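The prior and the four likelihood tables above can also be built in a single pass. The sketch below is one possible way to do it, not the notebook's own code: it rebuilds the already-binned training table inline so it runs without the CSV file, and the names `prior` and `likelihoods` are illustrative.

```python
import pandas as pd

# The ten training rows from the Data section (temp and humidity already binned).
rows = [
    ("sunny", "high", "high", False, "no"),
    ("sunny", "high", "high", True, "no"),
    ("rainy", "low", "high", False, "yes"),
    ("rainy", "low", "low", False, "yes"),
    ("rainy", "low", "low", True, "no"),
    ("sunny", "high", "high", False, "no"),
    ("sunny", "low", "low", False, "yes"),
    ("rainy", "high", "low", False, "yes"),
    ("sunny", "high", "low", True, "yes"),
    ("rainy", "high", "high", True, "no"),
]
df = pd.DataFrame(rows, columns=["outlook", "temp", "humidity", "wind", "play"])
features = ["outlook", "temp", "humidity", "wind"]

# Prior p(play): relative frequency of each class.
prior = df["play"].value_counts(normalize=True)

# Likelihoods p(feature = value | play): one column-normalised crosstab per feature.
likelihoods = {
    f: pd.crosstab(df[f], df["play"], normalize="columns") for f in features
}

print(prior)
for f in features:
    print(likelihoods[f], "\n")
```

Passing `normalize="columns"` to `pd.crosstab` does the same job as the `apply(lambda r: r*1.0/r.sum(), axis=0)` step used in the cells above.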


$p(play=yes \mid weather)$ is proportional to the product of:

*  $p(play=yes) = 0.50$
*  $p(outlook=sunny \mid play=yes) = 0.40$
*  $p(temp=low \mid play=yes) = 0.60$
*  $p(humidity=low \mid play=yes) = 0.80$
*  $p(wind=True \mid play=yes) = 0.20$

$p(play=yes \mid weather) \propto 0.50 \times 0.40 \times 0.60 \times 0.80 \times 0.20 = 0.0192$

Similarly $p(play=no \mid weather)$ is proportional to the product of:

*  $p(play=no) = 0.50$
*  $p(outlook=sunny \mid play=no) = 0.60$
*  $p(temp=low \mid play=no) = 0.20$
*  $p(humidity=low \mid play=no) = 0.20$
*  $p(wind=True \mid play=no) = 0.60$

$p(play=no \mid weather) \propto 0.50 \times 0.60 \times 0.20 \times 0.20 \times 0.60 = 0.0072$

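Note that proportionality allows us to ignore the denominator in Bayes Rule, $p(weather)$, which is the same for both play=yes and play=no.

Because the two posterior probabilities must sum to one, we can normalise the two products to get

$p(play=yes \mid weather) = \frac{p(play=yes) \times p(weather \mid play=yes)}{p(play=yes) \times p(weather \mid play=yes) + p(play=no) \times p(weather \mid play=no)}$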

hence,

$p(play=yes \mid weather) = \frac{0.0192}{0.0192+0.0072} \approx 0.7273$

In other words the probability that the person will play given that the weather is sunny, low temp, low humidity and windy is 73%. Since the probability is >50% we predict that play=yes.


In [32]:
yes = 0.5 * 0.4 * 0.6 * 0.8 * 0.2
no = 0.5 * 0.6 * 0.2 * 0.2 * 0.6

print("P(play=yes | weather ) = {0:6.2f}".format(yes / (yes + no)))
print("P(play=no  | weather ) = {0:6.2f}".format(no / (yes + no)))

P(play=yes | weather ) =   0.73
P(play=no  | weather ) =   0.27
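The same calculation can be wrapped in a small function that works for any weather query. This is a minimal sketch in plain Python rather than the notebook's own code: the training rows are typed in from the table in the Data section, `naive_bayes_posterior` is an illustrative name, and no smoothing is applied, so a query whose product is zero for every class would fail at the normalisation step.

```python
FEATURES = ("outlook", "temp", "humidity", "wind")

# (outlook, temp, humidity, wind, play) for the ten training rows.
rows = [
    ("sunny", "high", "high", False, "no"),
    ("sunny", "high", "high", True, "no"),
    ("rainy", "low", "high", False, "yes"),
    ("rainy", "low", "low", False, "yes"),
    ("rainy", "low", "low", True, "no"),
    ("sunny", "high", "high", False, "no"),
    ("sunny", "low", "low", False, "yes"),
    ("rainy", "high", "low", False, "yes"),
    ("sunny", "high", "low", True, "yes"),
    ("rainy", "high", "high", True, "no"),
]


def naive_bayes_posterior(query):
    """Return {class: p(class | query)} under conditional independence."""
    scores = {}
    for label in sorted({r[-1] for r in rows}):
        in_class = [r for r in rows if r[-1] == label]
        score = len(in_class) / len(rows)  # prior p(C)
        for i, name in enumerate(FEATURES):
            matches = sum(1 for r in in_class if r[i] == query[name])
            score *= matches / len(in_class)  # likelihood p(x_i | C)
        scores[label] = score
    total = sum(scores.values())  # plays the role of p(X)
    return {label: s / total for label, s in scores.items()}


weather = {"outlook": "sunny", "temp": "low", "humidity": "low", "wind": True}
posterior = naive_bayes_posterior(weather)
print("P(play=yes | weather ) = {0:6.2f}".format(posterior["yes"]))
print("P(play=no  | weather ) = {0:6.2f}".format(posterior["no"]))
```

Changing the `weather` dictionary is enough to score a different day.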


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Now it is your turn!   What is the probability the golfer plays golf when the weather is (outlook=rainy, temp=low, humidity=high, wind=False)?

## Using scikit learn

We should be able to get the same result using the scikit-learn Naive Bayes classifier.  We need to use `BernoulliNB()` since it models binary (0/1) features in the manner described above.

In [31]:
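from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Encode every column as integers. LabelEncoder sorts the values, so
# rainy=0/sunny=1, high=0/low=1, False=0/True=1 and no=0/yes=1.
dfe = df.apply(LabelEncoder().fit_transform)
x = dfe[['outlook','temp','humidity','wind']]
y = dfe['play']

# alpha=0.0 turns off Laplace smoothing so the result matches the hand calculation
# (some scikit-learn versions warn about, or clip, a zero alpha).
clf = BernoulliNB(alpha=0.0).fit(x,y)

# Query: outlook=sunny (1), temp=low (1), humidity=low (1), wind=True (1).
pred = clf.predict_proba(np.array([1,1,1,1]).reshape(1,-1))

print("P(play=yes | weather ) = {0:6.2f}".format(pred[0][1]))
print("P(play=no  | weather ) = {0:6.2f}".format(pred[0][0]))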


P(play=yes | weather ) =   0.73
P(play=no  | weather ) =   0.27


This agrees with our previous calculation.
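The integer codes passed to `predict_proba` depend on how `LabelEncoder` orders the values in each column, so it is worth checking the mapping before building a query. The sketch below fits one encoder per column on the distinct values that appear in the training table; the `columns` and `mapping` names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Distinct values of each column in the golf data.
columns = {
    "outlook": ["sunny", "rainy"],
    "temp": ["low", "high"],
    "humidity": ["low", "high"],
    "wind": [False, True],
    "play": ["no", "yes"],
}

mapping = {}
for name, values in columns.items():
    enc = LabelEncoder().fit(values)
    # classes_ holds the sorted distinct values; transform gives their codes.
    codes = enc.transform(enc.classes_).tolist()
    mapping[name] = dict(zip(enc.classes_.tolist(), codes))
    print(name, mapping[name])
```

Because the values are sorted, `high` is encoded as 0 and `low` as 1, which is easy to get backwards when writing a query by hand.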

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Validate your previous answer.  Use scikit learn to compute the probability that the golfer plays golf when the weather is (outlook=rainy, temp=low, humidity=high, wind=False)?

## Exact Calculation

When the number of features is small we do not need to invoke the conditional independence assumption.

Here we compute $P(play=yes \mid outlook = sunny, wind = True)$ exactly and by invoking the conditional independence assumption.  We will ignore the other features at this point to simplify the computation - $n=2$.

### Exact

From Bayes rule, 

$P(play=yes \mid sunny, windy) \propto P(play=yes) \times P(outlook=sunny, wind=True \mid play=yes)$

$P(play=no  \mid sunny, windy) \propto P(play=no ) \times P(outlook=sunny, wind=True \mid play=no)$

We can build a 2-d frequency table for each value of `play` to estimate $P(sunny, windy \mid play=yes)$ and $P(sunny, windy \mid play=no)$:

When play = 'no':

In [33]:
pd.crosstab(df[df.play=='no'].outlook, df.wind)

| outlook | wind=False | wind=True |
|---|---|---|
| rainy | 0 | 2 |
| sunny | 2 | 1 |


When play = 'yes':

In [34]:
pd.crosstab(df[df.play=='yes'].outlook, df.wind)

| outlook | wind=False | wind=True |
|---|---|---|
| rainy | 3 | 0 |
| sunny | 1 | 1 |


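In each class exactly one of the five rows is both sunny and windy, so $P(sunny, windy \mid play=yes) = P(sunny, windy \mid play=no) = 1/5 = 0.2$:

$P(play=yes \mid sunny, windy) \propto 0.5 \times 0.2 = 0.10$

$P(play=no \mid sunny, windy) \propto 0.5 \times 0.2 = 0.10$

$P(play=yes \mid sunny, windy ) = \frac{0.10}{0.10 + 0.10} = 0.5$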
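The exact calculation amounts to counting rows. As a rough sketch, assuming the ten training rows from the Data section reduced to the three columns used here (`exact_score` is an illustrative name):

```python
# (outlook, wind, play) for the ten training rows.
rows = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("rainy", False, "yes"),
    ("rainy", False, "yes"), ("rainy", True, "no"), ("sunny", False, "no"),
    ("sunny", False, "yes"), ("rainy", False, "yes"), ("sunny", True, "yes"),
    ("rainy", True, "no"),
]


def exact_score(label):
    """p(play=label) * p(outlook=sunny, wind=True | play=label), by direct counting."""
    in_class = [r for r in rows if r[2] == label]
    prior = len(in_class) / len(rows)
    joint = sum(1 for o, w, _ in in_class if o == "sunny" and w) / len(in_class)
    return prior * joint


yes, no = exact_score("yes"), exact_score("no")
print("unnormalised scores:", yes, no)
print("P(play=yes | sunny, windy) =", yes / (yes + no))
```

No independence assumption is involved: the joint probability of (sunny, windy) is read directly from the rows of each class.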

### Invoking Conditional Independence Assumption

$P(play=yes \mid sunny, windy) \propto P(play=yes) \times P(outlook=sunny \mid play=yes) \times P(wind=True \mid play=yes)$

$P(play=no  \mid sunny, windy) \propto P(play=no ) \times P(outlook=sunny \mid play=no) \times P(wind=True \mid play=no)$

Here are the required crosstabs:

In [40]:
pd.crosstab(df.outlook,df.play).apply(lambda r: r*1.0/r.sum(), axis=0)

| outlook | play=no | play=yes |
|---|---|---|
| rainy | 0.4 | 0.6 |
| sunny | 0.6 | 0.4 |


In [41]:
pd.crosstab(df.wind,df.play).apply(lambda r: r*1.0/r.sum(), axis=0)

| wind | play=no | play=yes |
|---|---|---|
| False | 0.4 | 0.8 |
| True | 0.6 | 0.2 |


$P(play=yes \mid sunny, windy) \propto 0.5 \times 0.4 \times 0.2 = 0.04$

$P(play=no  \mid sunny, windy) \propto 0.5 \times 0.6 \times 0.6 = 0.18$

$P(play=yes \mid sunny, windy ) = \frac{0.04}{0.04 + 0.18} \approx 0.18$
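For comparison, the conditional independence version multiplies the two per-feature frequencies instead of counting the joint event. Again a sketch on the same ten rows, with `naive_score` as an illustrative name:

```python
# (outlook, wind, play) for the ten training rows.
rows = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("rainy", False, "yes"),
    ("rainy", False, "yes"), ("rainy", True, "no"), ("sunny", False, "no"),
    ("sunny", False, "yes"), ("rainy", False, "yes"), ("sunny", True, "yes"),
    ("rainy", True, "no"),
]


def naive_score(label):
    """p(play=label) * p(sunny | label) * p(windy | label)."""
    in_class = [r for r in rows if r[2] == label]
    prior = len(in_class) / len(rows)
    p_sunny = sum(1 for o, _, _ in in_class if o == "sunny") / len(in_class)
    p_windy = sum(1 for _, w, _ in in_class if w) / len(in_class)
    return prior * p_sunny * p_windy


yes, no = naive_score("yes"), naive_score("no")
print("unnormalised scores:", yes, no)
print("P(play=yes | sunny, windy) =", yes / (yes + no))
```

The only difference from an exact count is that `p_sunny * p_windy` stands in for the joint frequency of (sunny, windy) within each class.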

Outlook and wind are not independent within each class: when the golfer does not play, for example, both rainy days are windy but only one of the three sunny days is. The conditional independence assumption is therefore violated, which causes error in the joint probability estimate.  In practice it is generally not feasible to compute the exact joint probability conditional on the class, and even if it were we would end up with a large number of parameters and many feature combinations with no training examples, so that p(X|C)=0. We have no choice but to invoke the conditional independence assumption.
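The sparsity problem is already visible in this small dataset. The sketch below, which assumes the ten training rows from the Data section, counts how many of the 16 possible feature combinations never occur in each class; the `empty_cells` name is illustrative.

```python
from itertools import product

# (outlook, temp, humidity, wind, play) for the ten training rows.
rows = [
    ("sunny", "high", "high", False, "no"),
    ("sunny", "high", "high", True, "no"),
    ("rainy", "low", "high", False, "yes"),
    ("rainy", "low", "low", False, "yes"),
    ("rainy", "low", "low", True, "no"),
    ("sunny", "high", "high", False, "no"),
    ("sunny", "low", "low", False, "yes"),
    ("rainy", "high", "low", False, "yes"),
    ("sunny", "high", "low", True, "yes"),
    ("rainy", "high", "high", True, "no"),
]

# Every possible (outlook, temp, humidity, wind) combination: 2**4 = 16 cells.
feature_values = [("rainy", "sunny"), ("low", "high"), ("low", "high"), (False, True)]
all_cells = set(product(*feature_values))

empty_cells = {}
for label in ("no", "yes"):
    seen = {r[:4] for r in rows if r[4] == label}
    # Cells with no training example get an exact estimate of p(X | C) = 0.
    empty_cells[label] = len(all_cells - seen)
    print(label, "empty cells:", empty_cells[label], "of", len(all_cells))
```

With five rows per class, most cells of the full joint table are empty, so the exact estimate would assign probability zero to most weather conditions.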

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>Overfitting alert!  The conditional independence assumption reduces the variance of the model - so whilst it seems counter intuitive based on the example above it is actually a good thing to do especially as $n$ grows!