The following image illustrates the effect of applying the function `train_test_split` to the data matrix $X$ and response vector $y$. From these, two data matrices and two response vectors are created: one pair for training and one pair for testing.

![alt text](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1543836883/image_6_cfpjpr.png)

Note that each time `train_test_split` is called, it randomly selects which examples to put in the training and test datasets, so two calls generally produce different splits.
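scikit-learn's `train_test_split` accepts a `random_state` argument that makes the split repeatable. The sketch below only illustrates the idea with the standard library: `simple_split` is a hypothetical helper written for this example, not scikit-learn's implementation.

```python
import random

def simple_split(examples, test_fraction=0.25, seed=None):
    """Hypothetical helper: shuffle the examples, then cut them into train/test parts."""
    rng = random.Random(seed)      # seed=None gives a different shuffle on each run
    shuffled = list(examples)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(20))

# With a fixed seed, the same partition is produced every time.
train_a, test_a = simple_split(examples, seed=42)
train_b, test_b = simple_split(examples, seed=42)
print(train_a == train_b and test_a == test_b)   # True

# The two parts never overlap and together cover the whole dataset.
print(sorted(train_a + test_a) == examples)      # True
```

Fixing the seed is useful when results must be reproduced exactly; leaving it unset gives a fresh random split on every call.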

# Naive Bayes Classifier

Naive Bayes Classifier is a probabilistic classification model. The model generated by the Naive Bayes algorithm is a set of *conditional probabilities*.

## Estimating probabilities from data

Let us see an example of how probability estimates can be computed from a dataset. For this, consider the [Play Tennis dataset](https://www.kaggle.com/fredericobreno/play-tennis), which is another toy dataset with four predictors (`outlook`, `temp`, `humidity`, and `wind`) and fourteen examples. The target (`play`) is binary. Each example describes the weather conditions on a particular day. Therefore, the classification task is to predict whether a given day is appropriate to play tennis or not.

In [None]:
import pandas as pd
df_play_tennis = pd.read_csv('play_tennis.csv')
df_play_tennis

Unnamed: 0,day,outlook,temp,humidity,wind,play
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes
5,D6,Rain,Cool,Normal,Strong,No
6,D7,Overcast,Cool,Normal,Strong,Yes
7,D8,Sunny,Mild,High,Weak,No
8,D9,Sunny,Cool,Normal,Weak,Yes
9,D10,Rain,Mild,Normal,Weak,Yes


# Bayes' theorem

The term "Bayesian" comes from [Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes), a British Presbyterian minister who lived in the 18th century and who formulated the famous [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem).

$$
\Pr(y \mid x_1, x_2, \dots, x_n) = \frac{\Pr (y) \Pr (x_1, x_2, \dots, x_n \mid y)} {\Pr(x_1, x_2, \dots, x_n)}
$$

The following is a description of each term in the above expression, in the context of a [statistical classification](https://en.wikipedia.org/wiki/Statistical_classification) task:

* $\Pr(y \mid x_1, \dots, x_n)$ represents the probability of the class $y$, given the values of the attributes of the example $\mathbf{x}$. This term, called **posterior probability**, is what must be determined (learned) by the algorithm.

* $\Pr(x_1, \dots, x_n \mid y)$ represents the probability that a specific combination of values $x_1, \dots, x_n$ will occur in examples associated with a specific value of the target attribute $y$. This term is called **likelihood**.

* $\Pr(y)$ represents the probability that an example selected at random belongs to a given class (i.e., has a given value of the target attribute $y$). This term is called **prior probability**.

* $\Pr(x_1, x_2, \dots, x_n)$ represents the probability that a given combination of values $x_1, x_2, \dots, x_n$ will occur in an example selected at random. This term is usually called **evidence**.

These probabilities are estimated from the training dataset by the Naive Bayes algorithm. The estimates are computed by counting the occurrences of values in a given feature, either separately or in conjunction with values of other features.
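As a quick sanity check of the theorem with a single attribute, take the following counts from the Play Tennis dataset (assuming the full 14-row file matches the standard version of this dataset): 5 of the 14 days are sunny, 5 of the 14 days have $\operatorname{play} = \text{No}$, and 3 of those 5 "No" days are sunny. Plugging these estimates into the formula gives:

$$
\Pr(\operatorname{play} = \text{No} \mid \operatorname{outlook} = \text{Sunny}) = \frac{\Pr(\text{No}) \Pr(\text{Sunny} \mid \text{No})}{\Pr(\text{Sunny})} \approx \frac{\frac{5}{14} \times \frac{3}{5}}{\frac{5}{14}} = \frac{3}{5} = 60\%
$$

This agrees with counting directly: 3 of the 5 sunny days are days with $\operatorname{play} = \text{No}$.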

# Prior probabilities

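Let us see some examples of probability estimates that can be computed from the above dataset. First, let us compute the estimates for the prior probabilities. Of the fourteen examples, nine have $\operatorname{play} = \text{Yes}$ and five have $\operatorname{play} = \text{No}$.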

$$
\Pr(\operatorname{play} = \text{Yes}) \approx \frac{9}{14} \approx 64\%
$$

$$
\Pr(\operatorname{play} = \text{No}) \approx \frac{5}{14} \approx 36\%
$$

The way to interpret these prior probabilities is the following: if you do not know anything about the weather conditions on a given day, then there is approximately a 64% chance that this day is appropriate to play tennis.
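The priors can also be obtained by counting in code. The sketch below hard-codes the `play` column instead of reading the CSV file. The values for days D11 to D14 are not visible in the table above and are assumed to follow the standard version of the Play Tennis dataset.

```python
from collections import Counter
from fractions import Fraction

# Values of the `play` column for days D1..D14.
# ASSUMPTION: D11..D14 follow the standard Play Tennis dataset.
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

counts = Counter(play)
# Prior estimate for each class: (number of examples in the class) / (number of examples).
priors = {label: Fraction(n, len(play)) for label, n in counts.items()}

for label in ("Yes", "No"):
    print(label, priors[label], f"{float(priors[label]):.0%}")
# Yes 9/14 64%
# No 5/14 36%
```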

# Conditional probabilities

See a nice explanation about conditional probabilities [here](https://setosa.io/conditional/).

Consider the estimate $\Pr(\operatorname{play} = \text{No} \mid \operatorname{outlook} = \text{Sunny}) \approx 60\%$ (3 of the 5 sunny days in the dataset have $\operatorname{play} = \text{No}$). This estimate tells us that, on a sunny day, there is a $60\%$ chance that it is not a good day to play tennis. Now, compare this value with the estimate $\Pr(\operatorname{play} = \text{No}) \approx 36\%$. We can conclude that knowing that the day is sunny changes our belief about whether this day is appropriate to play tennis. In other words, there seems to be a **dependence** between the variables `play` and `outlook`.

In general, two events $A$ and $B$ are said to be independent if and only if both identities below are true:

1. $\Pr(A \mid B) = \Pr(A)$
2. $\Pr(B \mid A) = \Pr(B)$
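The two identities can be checked on the data for the events $A$: `play = No` and $B$: `outlook = Sunny`. This is only a sketch: the (outlook, play) pairs are hard-coded, and days D11 to D14 are assumed to follow the standard Play Tennis dataset. With only fourteen examples, unequal estimates suggest a dependence but do not prove it.

```python
from fractions import Fraction

# (outlook, play) pairs for days D1..D14.
# ASSUMPTION: D11..D14 follow the standard Play Tennis dataset.
PAIRS = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No"),
]

n = len(PAIRS)
n_sunny = sum(1 for outlook, _ in PAIRS if outlook == "Sunny")
n_no = sum(1 for _, play in PAIRS if play == "No")
n_both = sum(1 for outlook, play in PAIRS if outlook == "Sunny" and play == "No")

p_no = Fraction(n_no, n)                        # Pr(A)
p_no_given_sunny = Fraction(n_both, n_sunny)    # Pr(A | B)
p_sunny = Fraction(n_sunny, n)                  # Pr(B)
p_sunny_given_no = Fraction(n_both, n_no)       # Pr(B | A)

print(p_no_given_sunny, p_no)       # 3/5 5/14
print(p_sunny_given_no, p_sunny)    # 3/5 5/14
print(p_no_given_sunny == p_no)     # False: the estimates differ
```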

We can also easily compute estimates for the conditional probabilities from the data. Some examples:
- $\Pr(\operatorname{outlook} = \text{Sunny}  \mid \operatorname{play} = \text{No}) \approx \frac{3}{5}.$
- $\Pr(\operatorname{outlook} = \text{Sunny} \text{ and } \operatorname{temp} = \text{Hot} \mid \operatorname{play} = \text{No}) \approx \frac{2}{5} = 40\%$
- $\Pr(\operatorname{play} = \text{No}  \mid \operatorname{outlook} = \text{Sunny}) \approx \frac{3}{5} = 60\%$
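The three estimates above can be reproduced by counting rows. In the sketch below, `cond_prob` is a hypothetical helper written for this example, the dataset is hard-coded, and days D11 to D14 are assumed to follow the standard Play Tennis dataset.

```python
from fractions import Fraction

COLUMNS = ("outlook", "temp", "humidity", "wind", "play")
# Days D1..D14. ASSUMPTION: D11..D14 follow the standard Play Tennis dataset.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]

def cond_prob(event, given):
    """Estimate Pr(event | given) by counting; both arguments are {column: value} dicts."""
    rows_given = [r for r in DATA if all(r[c] == v for c, v in given.items())]
    rows_both = [r for r in rows_given if all(r[c] == v for c, v in event.items())]
    return Fraction(len(rows_both), len(rows_given))

print(cond_prob({"outlook": "Sunny"}, {"play": "No"}))                 # 3/5
print(cond_prob({"outlook": "Sunny", "temp": "Hot"}, {"play": "No"}))  # 2/5
print(cond_prob({"play": "No"}, {"outlook": "Sunny"}))                 # 3/5
```

The helper divides by the number of rows that satisfy the condition, so it raises `ZeroDivisionError` if no row matches `given`.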

As an exercise, compute estimates for the following conditional probabilities (likelihoods):

- $\Pr(\text{outlook} = \text{Sunny} \mid \text{play} = \text{Yes})$
- $\Pr(\text{outlook} = \text{Sunny} \mid \text{play} = \text{No})$ 

- $\Pr(\text{temp} = \text{Hot} \mid \text{play} = \text{Yes})$ 
- $\Pr(\text{temp} = \text{Hot} \mid \text{play} = \text{No})$ 

- $\Pr(\text{humidity} = \text{High} \mid \text{play} = \text{Yes})$
- $\Pr(\text{humidity} = \text{High} \mid \text{play} = \text{No})$ 

- $\Pr(\text{wind} = \text{Weak} \mid \text{play} = \text{Yes})$ 
- $\Pr(\text{wind} = \text{Weak} \mid \text{play} = \text{No})$

## Naive Bayes Classifier - derivation of the algorithm

Naive Bayes Classifier is an algorithm consisting of two steps, which are described below. Formally, let $X$ be a dataset. Also consider that $c_1, c_2, \ldots, c_k$ are the classes of the problem (i.e., the possible values of the target) and that $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ is a new example that should be classified. Let $a_1, a_2, \ldots, a_n$ be the values for the predictive features $x_1, x_2, \ldots, x_n$, respectively.

NBC allows us to compute the probability that a new example $\mathbf{x}$ belongs to each of the classes (i.e. each of the possible values for the target). More concretely, consider that $\mathbf{x}$ is the following:

$$
\mathbf{x} = [\operatorname{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak}]
$$

Hence, we are provided information about the weather conditions of a given day (represented by $\mathbf{x}$), and we want to answer (*predict*) whether or not this is a good day to play tennis. 

There are only two possibilities, Yes (play=Yes) or No (play=No). Therefore, let us define two probability values:

$$
\begin{align*}
\Pr(\text{play = Yes} \mid \mathbf{x}) & \text{: probability that } \mathbf{x} \text{ is a good day to play.}\\
\Pr(\text{play = No} \mid \mathbf{x}) & \text{: probability that } \mathbf{x} \text{ is NOT a good day to play.}
\end{align*}
$$

Notice that, if we know the probability values above, the problem is solved, because we can use these values to make our decision: if $\Pr(\text{play = Yes} \mid \mathbf{x}) > \Pr(\text{play = No} \mid \mathbf{x})$, then we predict that $\mathbf{x}$ is a good day to play tennis, that is, we predict play = Yes. Otherwise, we predict play = No.

But how can we compute those probability values? The answer lies in Bayes' rule. To compute $\Pr(\text{play = Yes} \mid \mathbf{x})$, we use Bayes' rule and write:

$$
\begin{align*}
\Pr(\text{play = Yes} \mid \mathbf{x}) & = \frac{\Pr(\mathbf{x} \mid \text{play = Yes}) \times \Pr(\text{play} = \text{Yes})}{\Pr(\mathbf{x})} = \\
&=  \frac{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes}) \times \Pr(\text{play = Yes})}{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak})}
\end{align*}
$$

To compute $\Pr(\text{play = No} \mid \mathbf{x})$, we write a similar expression:

$$
\begin{align*}
\Pr(\text{play = No} \mid \mathbf{x}) & = \frac{\Pr(\mathbf{x} \mid \text{play = No}) \times \Pr(\text{play} = \text{No})}{\Pr(\mathbf{x})} = \\
&=  \frac{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = No}) \times \Pr(\text{play = No})}{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak})}
\end{align*}
$$

By looking at the two expressions above, it seems there are several probability values we need to compute using the provided dataset. Let us list each one of them:

1. $\Pr(\text{play = Yes})$
2. $\Pr(\text{play = No})$
3. $\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = No})$
4. $\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes})$
5. $\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak})$

Items 1 and 2 are easy to compute, and we already know how to estimate them. More good news: we do not actually need to compute item 5, since this expression appears as the denominator of both $\Pr(\text{play = Yes} \mid \mathbf{x})$ and $\Pr(\text{play = No} \mid \mathbf{x})$, so it does not affect which of the two is larger. We are left with items 3 and 4. Let us apply the chain rule, which follows from the definition of [conditional probability](https://en.wikipedia.org/wiki/Conditional_probability), to one of these expressions (item 4):

$$
\begin{align*}
\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes}) = \\
\Pr(\text{outlook = Sunny} \mid \text{play = Yes}) \times \\ 
\times\Pr(\operatorname{temp} = \text{Hot} \mid \text{outlook} = \text{Sunny}, \text{play = Yes}) \times \\
\times\Pr(\operatorname{humidity} = \text{High} \mid \text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \text{play = Yes}) \times\\
\times\Pr(\operatorname{wind} = \text{Weak} \mid \text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \text{play = Yes})
\end{align*}
$$

By looking at the above expression, it seems that we have to compute a lot of probability estimates from the available data. That is when the naive assumption of the Naive Bayes Classifier comes in handy. The algorithm assumes that the attributes are conditionally independent of each other, once we know the value of the class.

> *Conditional independence*. The term [conditional independence](https://www.probabilitycourse.com/chapter1/1_4_4_conditional_independence.php) corresponds to a somewhat advanced concept in probability theory. Given three variables A, B, and C, we say that A and B are conditionally independent given C if and only if, once the value of C is known, A and B are independent of each other:
$$
\Pr(A \mid B, C) = \Pr(A \mid C)
$$
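To see what this identity means on real data, the sketch below estimates both sides for $A$: `temp = Hot`, $B$: `outlook = Sunny`, and $C$: `play = No` in the Play Tennis dataset. The triples are hard-coded, and days D11 to D14 are assumed to follow the standard version of the dataset. The two estimates differ, which suggests that conditional independence holds only approximately here, although with so few examples the estimates are very noisy.

```python
from fractions import Fraction

# (outlook, temp, play) for days D1..D14.
# ASSUMPTION: D11..D14 follow the standard Play Tennis dataset.
TRIPLES = [
    ("Sunny", "Hot", "No"), ("Sunny", "Hot", "No"), ("Overcast", "Hot", "Yes"),
    ("Rain", "Mild", "Yes"), ("Rain", "Cool", "Yes"), ("Rain", "Cool", "No"),
    ("Overcast", "Cool", "Yes"), ("Sunny", "Mild", "No"), ("Sunny", "Cool", "Yes"),
    ("Rain", "Mild", "Yes"), ("Sunny", "Mild", "Yes"), ("Overcast", "Mild", "Yes"),
    ("Overcast", "Hot", "Yes"), ("Rain", "Mild", "No"),
]

# Restrict to C (play = No), then further to B and C (outlook = Sunny, play = No).
no_days = [(outlook, temp) for outlook, temp, play in TRIPLES if play == "No"]
sunny_no_days = [(outlook, temp) for outlook, temp in no_days if outlook == "Sunny"]

# Pr(A | C) and Pr(A | B, C), both estimated by counting.
p_hot_given_no = Fraction(sum(t == "Hot" for _, t in no_days), len(no_days))
p_hot_given_sunny_no = Fraction(sum(t == "Hot" for _, t in sunny_no_days), len(sunny_no_days))

print(p_hot_given_no)         # 2/5
print(p_hot_given_sunny_no)   # 2/3
```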

The term *naive* stems from the fact that Naive Bayes considers that the attributes are conditionally independent given the class. When considering this hypothesis, the computation of the conditional probabilities can be simplified. Mathematically, we have:
$$
\Pr(x_1, x_2, \dots, x_n \mid y) = \Pr(x_1 \mid y) \times \Pr(x_2 \mid y) \times \ldots \times \Pr(x_n \mid y)
$$

In many practical cases, this statistical independence between predictors does not hold. For example, consider a dataset with information about customers of a company, in which each customer is represented by the following features: *weight*, *education*, *salary*, *age*, etc. In this dataset, the values of the first three features are correlated with the values of age. In this case, at least in theory, the use of Naive Bayes would overestimate the effect of the age feature. However, practice shows that Naive Bayes is often quite effective even in cases where the predictive features are not statistically independent.

Anyway, assuming the naive hypothesis is true, we can simplify the Bayes formula:

$$
\Pr(y \mid x_1, x_2, \dots, x_n) \propto \Pr(y) \times \Pr(x_1 \mid y) \times \Pr(x_2 \mid y) \times \ldots \times \Pr(x_n \mid y)
$$


Naive Bayes Classifier uses this assumption of conditional independence to simplify the computation of the probability estimates that must be produced. By applying this assumption to the estimates above, we end up with the following:

$$
\begin{align*}
\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes}) = \\
\Pr(\text{outlook = Sunny} \mid \text{play = Yes}) \times \\ 
\times\Pr(\operatorname{temp} = \text{Hot} \mid \text{play = Yes}) \times \\
\times\Pr(\operatorname{humidity} = \text{High} \mid \text{play = Yes}) \times\\
\times\Pr(\operatorname{wind} = \text{Weak} \mid \text{play = Yes})
\end{align*}
$$

We can write a similar expression for $\text{play = No}$:

$$
\begin{align*}
\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = No}) = \\
\Pr(\text{outlook = Sunny} \mid \text{play = No}) \times \\ 
\times\Pr(\operatorname{temp} = \text{Hot} \mid \text{play = No}) \times \\
\times\Pr(\operatorname{humidity} = \text{High} \mid \text{play = No}) \times\\
\times\Pr(\operatorname{wind} = \text{Weak} \mid \text{play = No})
\end{align*}
$$
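The practical benefit of the factorization shows up when both versions of the likelihood are estimated by counting. The sketch below compares the direct estimate of the joint likelihood (counting exact matches of the whole combination) with the naive product of per-feature estimates. The function names are hypothetical, the dataset is hard-coded, and days D11 to D14 are assumed to follow the standard Play Tennis dataset.

```python
from fractions import Fraction

# Columns: outlook, temp, humidity, wind, play (days D1..D14).
# ASSUMPTION: D11..D14 follow the standard Play Tennis dataset.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

x = ("Sunny", "Hot", "High", "Weak")

def joint_likelihood(x, label):
    """Estimate Pr(x_1, ..., x_n | label) directly, counting exact matches of x."""
    in_class = [r for r in ROWS if r[-1] == label]
    return Fraction(sum(1 for r in in_class if r[:-1] == x), len(in_class))

def naive_likelihood(x, label):
    """Estimate the same quantity as the product of the per-feature Pr(x_i | label)."""
    in_class = [r for r in ROWS if r[-1] == label]
    result = Fraction(1)
    for i, value in enumerate(x):
        result *= Fraction(sum(1 for r in in_class if r[i] == value), len(in_class))
    return result

for label in ("Yes", "No"):
    print(label, joint_likelihood(x, label), naive_likelihood(x, label))
# Yes 0 8/729
# No 1/5 48/625
```

No "Yes" day matches this exact combination, so the direct estimate is zero. The naive product still gives a usable nonzero value because each factor needs only single-feature counts.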

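In general, the goal is to obtain the posterior probability of every class $c_i$ for the example $\mathbf{x}$:

$$
\Pr(c_i \mid \mathbf{x}), \, 1 \leq i \leq k
$$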

## Inference (prediction)

Steps:

1. Calculate the posterior probabilities $\Pr(c_j \mid \mathbf{x})$, $j = 1, 2, \ldots, k$
2. Classify $\mathbf{x}$ as being of class $c$ such that $\Pr(c \mid \mathbf{x})$ is maximum.


Therefore, to compute the probability that an example $\mathbf{x}$ belongs to a given class, we just need the estimates for $\Pr(y)$ and for $\Pr(x_i \mid y)$, which you already know how to compute (see the above examples for the Play Tennis dataset). Together, these probability values represent the model generated by the Naive Bayes algorithm. 
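Training and inference can be sketched together in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `fit` and `predict` are hypothetical names, no smoothing is applied (so an unseen feature value gives a zero probability), the dataset is hard-coded, and days D11 to D14 are assumed to follow the standard Play Tennis dataset.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Columns: outlook, temp, humidity, wind, play (days D1..D14).
# ASSUMPTION: D11..D14 follow the standard Play Tennis dataset.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def fit(rows):
    """Build the model: priors Pr(y) and a function giving Pr(x_i = value | y)."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)   # (feature index, class) -> counts of values
    for *values, label in rows:
        for i, value in enumerate(values):
            value_counts[(i, label)][value] += 1
    priors = {c: Fraction(n, len(rows)) for c, n in class_counts.items()}

    def likelihood(i, value, label):
        return Fraction(value_counts[(i, label)][value], class_counts[label])

    return priors, likelihood

def predict(x, priors, likelihood):
    """Step 1: posterior of every class. Step 2: pick the class with the largest one."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihood(i, value, label)
        scores[label] = score              # proportional to Pr(label | x)
    total = sum(scores.values())           # would be zero if every score were zero
    posteriors = {label: s / total for label, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

priors, likelihood = fit(ROWS)
label, posteriors = predict(("Sunny", "Hot", "High", "Weak"), priors, likelihood)
print(label, round(float(posteriors[label]), 4))   # No 0.7954
```

Dividing by `total` plays the role of the denominator $\Pr(\mathbf{x})$: it rescales the scores so that they sum to one without changing which class wins.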


## Numerical example

Let us present a numerical example of applying the Naive Bayes classifier to the Play Tennis dataset. For this, consider the following question:

> Is it appropriate or not to play tennis on a sunny, hot, high humidity and light wind day?

This question is equivalent to classifying an example $\mathbf{x}$ corresponding to $[\operatorname{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak}]$. To answer this question, we can apply the Naive Bayes classifier. 

For the **prior probabilities**, we find that:

- $\Pr(\operatorname{play} = \text{Yes}) \approx 9/14$ 
- $\Pr(\operatorname{play} = \text{No}) \approx 5/14$

Similarly, estimates for **conditional probabilities** $\Pr(x_i \mid c_j)$ are calculated:

- $\Pr(\operatorname{outlook} = \text{Sunny} \mid \operatorname{play} = \text{Yes}) \approx 2/9$
- $\Pr(\operatorname{outlook} = \text{Sunny} \mid \operatorname{play} = \text{No}) \approx 3/5$

- $\Pr(\operatorname{temp} = \text{Hot} \mid \operatorname{play} = \text{Yes}) \approx 2/9$
- $\Pr(\operatorname{temp} = \text{Hot} \mid \operatorname{play} = \text{No}) \approx 2/5$

- $\Pr(\operatorname{humidity} = \text{High} \mid \operatorname{play} = \text{Yes}) \approx 3/9$
- $\Pr(\operatorname{humidity} = \text{High} \mid \operatorname{play} = \text{No}) \approx 4/5$

- $\Pr(\operatorname{wind} = \text{Weak} \mid \operatorname{play} = \text{Yes}) \approx 6/9$
- $\Pr(\operatorname{wind} = \text{Weak} \mid \operatorname{play} = \text{No}) \approx 2/5$

The posterior probabilities $\Pr(\operatorname{play} = \text{Yes} \mid \mathbf{x})$ and $\Pr(\operatorname{play} = \text{No} \mid \mathbf{x})$ can now be computed up to a common proportionality constant. In the following, we omit the feature names for simplicity's sake.

For $\Pr(\operatorname{play} = \text{Yes} \mid \mathbf{x})$:

\begin{align*}
\Pr(\operatorname{play} = \text{Yes} \mid \mathbf{x}) & \propto \Pr(\text{Sunny} \mid \text{Yes}) \times \Pr(\text{Hot} \mid \text{Yes}) \times \Pr(\text{High} \mid \text{Yes}) \times \Pr(\text{Weak} \mid \text{Yes}) \times \Pr(\text{Yes}) = \\
&= \frac{2}{9} \times \frac{2}{9} \times \frac{3}{9} \times \frac{6}{9} \times \frac{9}{14} \approx \\
&\approx 0.0110 \times 9/14 \approx \\
&\approx 0.007055.
\end{align*}

For $\Pr(\operatorname{play} = \text{No} \mid \mathbf{x})$:

\begin{align*}
\Pr(\operatorname{play} = \text{No} \mid \mathbf{x}) & \propto \Pr(\text{Sunny} \mid \text{No}) \times \Pr(\text{Hot} \mid \text{No}) \times \Pr(\text{High} \mid \text{No}) \times \Pr(\text{Weak} \mid \text{No}) \times \Pr(\text{No}) = \\
&= \frac{3}{5} \times \frac{2}{5} \times \frac{4}{5} \times \frac{2}{5} \times \frac{5}{14} = \\
&= 0.0768 \times 5/14 \approx \\
&\approx 0.027429.
\end{align*}

Now, since $\Pr(\operatorname{play} = \text{Yes} \mid \mathbf{x}) + \Pr(\operatorname{play} = \text{No} \mid \mathbf{x}) = 1$, the numbers above can be mapped back to probabilities by dividing each one by their sum:

\begin{align*}
\Pr(\operatorname{play} = \text{Yes} \mid \mathbf{x}) &= \frac{0.007055}{0.007055+0.027429} \approx 20.5\%\\
\\
\Pr(\operatorname{play} = \text{No} \mid \mathbf{x}) &= \frac{0.027429}{0.007055+0.027429} \approx 79.5\%
\end{align*}


Since the highest value corresponds to $\operatorname{play} = \text{No}$, the class Naive Bayes predicts for $\mathbf{x}$ is $\text{No}$.

In [None]:
score_yes = (2/9) * (2/9) * (3/9) * (6/9) * (9/14)  # unnormalized score for play = Yes
score_no = (3/5) * (2/5) * (4/5) * (2/5) * (5/14)   # unnormalized score for play = No
print(round(score_yes / (score_yes + score_no), 4))
print(round(score_no / (score_yes + score_no), 4))

0.2046
0.7954
