# Naive Bayes

The naive Bayes approach is based on the premise that the observed frequency of past events can be a good estimate of the probability of future events. For example, to forecast the probability of rain today, we would look at prior days with the same weather conditions as today and report the proportion of those days on which it rained. If it rained on 4 out of 10 of those days, we estimate a 40 percent chance of rain today. This approach is useful in several domains and
problem areas.


The naive Bayes method is named after the 18th-century clergyman and mathematician Thomas
Bayes, who developed mathematical principles for describing the probability of events
and how those probabilities should be revised in light of additional information. Those
foundational principles are known today as Bayesian methods. Applied to
machine learning, an event is the expected outcome (or class), such as “true” or “false”, or “yes” or “no”. A classifier based on Bayesian methods
attempts to predict the class of unlabeled data by answering this question: “Based on
prior evidence, what is the most likely class of a new unlabeled instance?” It does this by:

- Finding all existing instances with the same feature values (or profile) as the
unlabeled instance.
- Determining the most likely class that those instances belong to.
- Assigning the identified class label to the unlabeled instance.

This classification approach uses the concept of conditional probability to determine
the most likely class of an instance.
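
The three steps above can be sketched in base R on a small toy dataset. The data frame `history`, its columns, and the instance `today` are hypothetical and exist only for illustration; the sketch matches the full profile exactly, which is the idea described here rather than the factorized naive Bayes computation.

```r
# Toy labeled data (hypothetical values, for illustration only)
history <- data.frame(
  sky      = c("cloudy", "cloudy", "clear", "cloudy", "clear", "cloudy"),
  pressure = c("low", "low", "high", "low", "high", "high"),
  rained   = c("yes", "no", "no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Unlabeled instance to classify
today <- list(sky = "cloudy", pressure = "low")

# Step 1: find existing instances with the same profile
same_profile <- history[history$sky == today$sky &
                        history$pressure == today$pressure, ]

# Step 2: estimate how likely each class is among those instances
class_share <- prop.table(table(same_profile$rained))

# Step 3: assign the most likely class to the unlabeled instance
predicted <- names(which.max(class_share))
print(class_share)
print(predicted)
```

With real data, exact matches on every feature quickly become rare, which is one motivation for the independence assumption used by naive Bayes.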



# Probability

The probability of an event is how likely the event is to happen. Since most events cannot
be predicted with total certainty, the chance that an event will occur is often described
in terms of the probability of the event. For example, when a coin is tossed, there are
two possible outcomes: heads or tails. The probability of one of those outcomes, heads
for example, is the number of outcomes we care about (heads) divided by the total
number of possible outcomes (heads or tails). Therefore, the probability of heads is $\frac{1}{2}$.
The mathematical notation for this is $P(heads)=\frac{1}{2}$.


# Conditional Probability

For dependent events, instead of simply evaluating the probability that events A and
B occurred, we determine the probability of event A given that event B occurred. This is
known as conditional probability, because the probability of event A is conditioned on the
occurrence of event B. The notation for this is $P(A|B)$, which reads "the probability of A
given B". This relationship can be represented using Bayes' theorem, which describes the
relationship between dependent events A and B as follows:

\begin{equation*}
P(A|B) = \frac{P(A)P(B|A)}{P(B)}
\end{equation*}

There are four parts to this formula. The first part is the conditional probability of A
given that B occurred. This is written as $P(A|B)$ and is known as the posterior probability. The second part of the Bayes formula is known as the prior probability. It is written
as $P(A)$ and describes the probability of event A by itself.

The next part of the Bayes formula represents the inverse of the posterior probability.
It is the probability of B given that A occurred. This is known as the likelihood and is
written as $P(B|A)$. The fourth part of the Bayes formula is called the marginal likelihood. It represents the probability of event B alone and is written as $P(B)$.
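
As a small numeric illustration of how the four parts combine, the posterior is the prior multiplied by the likelihood and divided by the marginal likelihood. The probabilities below are hypothetical values chosen only to make the arithmetic easy to follow.

```r
# Hypothetical probabilities, chosen only for illustration
prior      <- 0.30  # P(A): it rains today
likelihood <- 0.80  # P(B|A): the morning is cloudy, given that it rains
marginal   <- 0.50  # P(B): the morning is cloudy

# Posterior P(A|B) = prior * likelihood / marginal likelihood
posterior <- prior * likelihood / marginal
print(posterior)
```

Under these assumed values, observing a cloudy morning raises the estimated probability of rain from 0.30 to 0.48.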


# Classification
Suppose our dataset consists of $n$ predictors denoted as $x_1$, $x_2$, $x_3$, $...$, $x_n$, and $m$ distinct class values, which are represented as $C_1$, $C_2$, $...$, $C_m$; then, using Bayes' theorem, the conditional probability that an instance belongs to a class $C$ (any one of $C_1$, $...$, $C_m$) is denoted as follows:

\begin{equation*}
P(C|x_1,x_2,...,x_n)  =  \frac{P(C)P(x_1,x_2,...,x_n|C)}{P(x_1,x_2,...,x_n)}
\end{equation*}

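Assuming the predictors are conditionally independent given the class (the "naive" assumption), the likelihood factorizes into a product:

\begin{align*}
P(C|x_1,x_2,...,x_n) & =  \frac{P(C)P(x_1|C)P(x_2|C) \cdots P(x_n|C)}{P(x_1,x_2,...,x_n)} \\
P(C|x_1,x_2,...,x_n) & =  \frac{P(C) \prod_{i=1}^{n}P(x_i|C)}{P(x_1,x_2,...,x_n)}
\end{align*}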

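Consider a binary classification problem with two predictors, $x_1$ and $x_2$. If we have

\begin{equation*}
\frac{P(C_1) P(x_1|C_1)P(x_2|C_1)}{P(x_1,x_2)} >\frac{P(C_2) P(x_1|C_2)P(x_2|C_2)}{P(x_1,x_2)}
\end{equation*}

then, because the denominator $P(x_1,x_2)$ is the same on both sides, it cancels and the comparison reduces to

\begin{equation*}
P(C_1) P(x_1|C_1)P(x_2|C_1) > P(C_2) P(x_1|C_2)P(x_2|C_2)
\end{equation*}

and the instance is assigned to class $C_1$. In general, only the numerators need to be compared to find the most likely class.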
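
The comparison of numerators can be sketched as a small R function. The helper `class_score` and the priors and likelihoods passed to it are hypothetical, introduced here only to show the decision rule.

```r
# Score for one class: prior times the product of per-feature likelihoods.
# The shared denominator P(x_1, ..., x_n) is omitted because it does not
# change which class has the largest score.
class_score <- function(prior, likelihoods) {
  prior * prod(likelihoods)
}

# Hypothetical values for a two-class, two-predictor problem
score_c1 <- class_score(prior = 0.6, likelihoods = c(0.5, 0.2))
score_c2 <- class_score(prior = 0.4, likelihoods = c(0.3, 0.7))

# Assign the class with the largest score
scores <- c(C1 = score_c1, C2 = score_c2)
predicted <- names(which.max(scores))
print(scores)
print(predicted)
```

With many predictors the product of small likelihoods can underflow, so implementations often compare sums of logarithms instead; that variation is not shown here.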
 

# Example:

Suppose a tennis team has recorded certain characteristics of the days that are good for playing as well as those that are not. Among the characteristics are: the outlook, temperature, humidity and wind. 


| Outlook ($x_1$) | Temperature ($x_2$) | Humidity ($x_3$) | Windy ($x_4$) | Class ($C$) |
| --- | --- | --- | --- | --- |
| Sunny    | Hot         | High    | False   | No  |
| Sunny    | Hot         | High    | True    | No  |
| Overcast | Hot         | High    | False   | Yes |
| Rain     | Mild        | High    | False   | Yes |
| Rain     | Cool        | Normal  | False   | Yes |
| Rain     | Cool        | Normal  | True    | No  |
| Overcast | Cool        | Normal  | True    | Yes |
| Sunny    | Mild        | High    | False   | No  |
| Sunny    | Cool        | Normal  | False   | Yes |
| Rain     | Mild        | Normal  | False   | Yes |
| Sunny    | Mild        | Normal  | True    | Yes |
| Overcast | Mild        | High    | True    | Yes |
| Overcast | Hot         | Normal  | False   | Yes |
| Rain     | Mild        | High    | True    | No  |

Based on the above, the team would like to know, given today's conditions (Outlook=rain, Temperature=hot, Humidity=high and Windy=false), whether or not the game will be played.

The procedure consists of:

- Calculating the posterior probability of each class, and
- Assigning the pattern to the class $C_i$ for which $P(C_i|x)$ is greatest.

## Solution:


Outlook:


- $P(sunny|yes) = 2/9$,  $P(sunny|no) = 3/5$ 
- $P(overcast|yes) = 4/9$,  $P(overcast|no) = 0/5 = 0$
- $P(Rain|yes) = 3/9$,  $P(Rain|no) = 2/5$


Temperature:

- $P(hot|yes) = 2/9$,  $P(hot|no) = 2/5$
- $P(mild|yes) = 4/9$,  $P(mild|no) = 2/5$
- $P(cool|yes) = 3/9$,   $P(cool|no) = 1/5$

Humidity:

- $P(high|yes) = 3/9$,   $P(high|no) = 4/5$
- $P(normal|yes) = 6/9$,  $P(normal|no) = 1/5$

Windy:

- $P(false|yes) = 6/9$, $P(false|no) = 2/5$
- $P(true|yes) = 3/9$,  $P(true|no) = 3/5$

Total:

- $P(yes) = 9/14$, $P(no) = 5/14$

Then, given the conditions Outlook=rain, Temperature=hot, Humidity=high and Windy=false, and dropping the denominator $P(x)$ because it is the same for both classes,

\begin{align*}
P(yes|x) & \propto  P(x|yes)P(yes) \\
 & =  P(rain|yes)P(hot|yes)P(high|yes)P(false|yes)P(yes) \\
 & =  \frac{3}{9} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{6}{9} \cdot \frac{9}{14} \\
 & \approx  0.01058
\end{align*}

and, 

\begin{align*}
P(no|x) & \propto  P(x|no)P(no) \\
 & =  P(rain|no)P(hot|no)P(high|no)P(false|no)P(no) \\
 & =  \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{4}{5} \cdot \frac{2}{5} \cdot \frac{5}{14} \\
 & \approx  0.01829
\end{align*}

Since $0.01829 > 0.01058$, the test pattern is assigned to class $No$, which means that the game will not be played.
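
The hand calculation can be reproduced in base R. This is a minimal sketch: the data frame `tennis`, the vector `today`, and the helper `score` are names introduced here for illustration, and the conditional probabilities are plain relative frequencies with no smoothing, as in the worked solution.

```r
# The 14 recorded days from the table
tennis <- data.frame(
  Outlook = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
              "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Temperature = c("Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                  "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Windy = c("False", "True", "False", "False", "False", "True", "True",
            "False", "False", "False", "True", "True", "False", "True"),
  Class = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Today's conditions
today <- c(Outlook = "Rain", Temperature = "Hot",
           Humidity = "High", Windy = "False")

# Unnormalized posterior for one class: P(class) * prod_i P(x_i | class)
score <- function(cls) {
  in_class <- tennis[tennis$Class == cls, ]
  prior <- nrow(in_class) / nrow(tennis)
  likelihoods <- sapply(names(today),
                        function(f) mean(in_class[[f]] == today[[f]]))
  prior * prod(likelihoods)
}

scores <- sapply(c(Yes = "Yes", No = "No"), score)

# Dividing by the sum restores the denominator P(x)
posterior <- scores / sum(scores)
print(round(scores, 5))
print(round(posterior, 3))
```

Normalizing the two scores gives posterior probabilities of roughly 0.37 for Yes and 0.63 for No, which leads to the same decision.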



# Example with R

In [2]:
#install.packages("rsample")
library('rsample')  # data splitting 
library("ISLR")
library("caTools")
library("dplyr")
library("class") # knn
library("ggplot2")
options(repr.plot.width = 16, repr.plot.height = 6)

In [3]:
str(Caravan)

'data.frame':	5822 obs. of  86 variables:
 $ MOSTYPE : num  33 37 37 9 40 23 39 33 33 11 ...
 $ MAANTHUI: num  1 1 1 1 1 1 2 1 1 2 ...
 $ MGEMOMV : num  3 2 2 3 4 2 3 2 2 3 ...
 $ MGEMLEEF: num  2 2 2 3 2 1 2 3 4 3 ...
 $ MOSHOOFD: num  8 8 8 3 10 5 9 8 8 3 ...
 $ MGODRK  : num  0 1 0 2 1 0 2 0 0 3 ...
 $ MGODPR  : num  5 4 4 3 4 5 2 7 1 5 ...
 $ MGODOV  : num  1 1 2 2 1 0 0 0 3 0 ...
 $ MGODGE  : num  3 4 4 4 4 5 5 2 6 2 ...
 $ MRELGE  : num  7 6 3 5 7 0 7 7 6 7 ...
 $ MRELSA  : num  0 2 2 2 1 6 2 2 0 0 ...
 $ MRELOV  : num  2 2 4 2 2 3 0 0 3 2 ...
 $ MFALLEEN: num  1 0 4 2 2 3 0 0 3 2 ...
 $ MFGEKIND: num  2 4 4 3 4 5 3 5 3 2 ...
 $ MFWEKIND: num  6 5 2 4 4 2 6 4 3 6 ...
 $ MOPLHOOG: num  1 0 0 3 5 0 0 0 0 0 ...
 $ MOPLMIDD: num  2 5 5 4 4 5 4 3 1 4 ...
 $ MOPLLAAG: num  7 4 4 2 0 4 5 6 8 5 ...
 $ MBERHOOG: num  1 0 0 4 0 2 0 2 1 2 ...
 $ MBERZELF: num  0 0 0 0 5 0 0 0 1 0 ...
 $ MBERBOER: num  1 0 0 0 4 0 0 0 0 0 ...
 $ MBERMIDD: num  2 5 7 3 0 4 4 2 1 3 ...
 $ MBERARBG: num  5 0 0 

In [4]:
# check classes
summary(Caravan$Purchase)

In [5]:
# check NA values
any(is.na(Caravan))

In [6]:
set.seed(123)
# stratified split 70% for training, and the rest for testing
split <- initial_split(Caravan, prop = 0.7, strata = "Purchase")
train <- training(split)
test  <- testing(split)

In [7]:
# distribution of train
table(train$Purchase) 


  No  Yes 
3825  250 

In [28]:
# distribution of test set
table(test$Purchase)


  No  Yes 
1649   98 

In [21]:
# separate the predictors (features) from the response (Purchase)
features <- setdiff(names(train), "Purchase")
# training
x_train <- train[, features]
y_train <- train$Purchase
# testing
x_test <- test[,features]
y_test <- test$Purchase

In [22]:
library("caret") 
library("klaR") # naive bayes

In [23]:
# set up 10-fold cross validation procedure
train_control <- trainControl(method = "cv", number = 10)

In [24]:
# train model
naive.bayes <- train(x = x_train, y = y_train, method = "nb", trControl = train_control)

"model fit failed for Fold01: usekernel=FALSE, fL=0, adjust=1 Error in NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) : 
  Zero variances for at least one class in variables: PVRAAUT, PWERKT, AVRAAUT, AWERKT
"
"Numerical 0 probability for all classes with observation 118"
"Numerical 0 probability for all classes with observation 266"
"model fit failed for Fold02: usekernel=FALSE, fL=0, adjust=1 Error in NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) : 
  Zero variances for at least one class in variables: PVRAAUT, PWERKT, AVRAAUT, AWERKT
"
"Numerical 0 probability for all classes with observation 16"
"Numerical 0 probability for all classes with observation 109"
"model fit failed for Fold03: usekernel=FALSE, fL=0, adjust=1 Error in NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) : 
  Zero variances for at least one class in variables: PVRAAUT, PWERKT, AVRAAUT, AWERKT
"
"Numerical 0 probability for all classes with observation 328"
"mode

In [26]:
confusionMatrix(naive.bayes)

Cross-Validated (10 fold) Confusion Matrix 

(entries are percentual average cell counts across resamples)
 
          Reference
Prediction   No  Yes
       No  93.9  6.1
       Yes  0.0  0.0
                            
 Accuracy (average) : 0.9387


In [30]:
y_pred <- predict(naive.bayes, x_test, type="raw")

confusionMatrix(y_pred, y_test)

"Numerical 0 probability for all classes with observation 240"
"Numerical 0 probability for all classes with observation 1441"


Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1649   98
       Yes    0    0
                                          
               Accuracy : 0.9439          
                 95% CI : (0.9321, 0.9542)
    No Information Rate : 0.9439          
    P-Value [Acc > NIR] : 0.5268          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.0000          
         Pos Pred Value : 0.9439          
         Neg Pred Value :    NaN          
             Prevalence : 0.9439          
         Detection Rate : 0.9439          
   Detection Prevalence : 1.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : No              
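
In the confusion matrix above, every test instance is predicted as No, so the accuracy equals the no-information rate (0.9439) and Kappa is 0. The training log also reports "Zero variances for at least one class" for some predictors. One way to inspect that is to list the numeric predictors whose variance is zero within at least one class. The following is a sketch in base R; `zero_var_by_class` is a hypothetical helper (not part of caret or klaR), shown on toy data.

```r
# Hypothetical helper: names of numeric predictors whose variance is zero
# within at least one class of y.
zero_var_by_class <- function(x, y) {
  is_num <- vapply(x, is.numeric, logical(1))
  flagged <- vapply(x[is_num], function(col) {
    any(tapply(col, y, var) == 0, na.rm = TRUE)
  }, logical(1))
  names(flagged)[flagged]
}

# Toy data: 'b' is constant within class "Yes"
toy_x <- data.frame(a = c(1, 2, 3, 4, 5, 6), b = c(0, 1, 2, 0, 0, 0))
toy_y <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"))

flagged <- zero_var_by_class(toy_x, toy_y)
print(flagged)

# With the objects defined earlier, the same check would be:
# zero_var_by_class(x_train, y_train)
```

Flagged predictors could then be reviewed or dropped before training. Whether doing so changes the predictions on the Caravan data is not shown here.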
                        