# Naive Bayes Classifier

Naive Bayes Classifier is a probabilistic classification model. The model generated by the Naive Bayes algorithm is a set of *probability values* that are estimated from a training dataset.

## Naive Bayes Classifier - relation to Bayes Theorem

The Naive Bayes Classifier operates in two steps, which are described below. Formally, let $X$ be the training dataset. Let $c_1, c_2, \ldots, c_k$ be the classes of the problem (i.e., the possible values of the target), and let $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ be a new example to be classified. Let $a_1, a_2, \ldots, a_n$ be the values of the predictive features $x_1, x_2, \ldots, x_n$, respectively.

The name "Bayes" comes from [Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes), a British Presbyterian minister who lived in the 18th century and who formulated the famous [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem). In the notation introduced above, the theorem reads:

$$
\Pr(c_j \mid x_1, x_2, \dots, x_n) = \frac{\Pr(c_j) \Pr (x_1, x_2, \dots x_n \mid c_j)} {\Pr(x_1, x_2, \dots, x_n)}
$$

The following is a description of each term in the above expression, in the context of a [statistical classification](https://en.wikipedia.org/wiki/Statistical_classification) task:

* $\Pr(c_j \mid x_1, \dots, x_n)$ represents the probability of the class $c_j$, given the values of the attributes of the example $\mathbf{x}$. These terms, called **posterior probabilities**, are what must be determined (learned) by the algorithm. There is one probability value for each value $c_j$ the target can assume ($c_j \in \mathcal{C}$, where $\mathcal{C} = \{c_1, c_2, \ldots, c_k\}$ is the set of classes).

* $\Pr(x_1, \dots, x_n \mid c_j)$ represents the probability that a specific combination of values $x_1, \dots, x_n$ will occur in examples associated with value $c_j$ of the target attribute. These terms are called **likelihoods**.

* $\Pr(c_j)$ represents the probability that an example selected at random belongs to a given class (i.e., belongs to a given value of the target attribute $y$). This term is called **prior probability**.

* $\Pr(x_1, x_2, \dots, x_n)$ represents the probability that a given combination of values $x_1, x_2, \dots, x_n$ will occur in an example selected at random. This term is called the **evidence**.

Notice that $\Pr(c_j \mid x_1, \dots, x_n)$ are the probability values that we want to determine. If we know these probability values, we can predict the class of the example $[x_1, \dots, x_n]$ by using the following expression:

$$
c_{map} = \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[\Pr(c_j \mid x_1, \dots, x_n) \right] = \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[\frac{\Pr(c_j) \Pr (x_1, x_2, \dots x_n \mid c_j)} {\Pr(x_1, x_2, \dots, x_n)} \right]
$$

The above expression uses the $\operatorname{argmax}$ operator. The interpretation is simple: we predict the class $c_{map}$ that is associated with the maximum value of the expression in brackets. The subscript *map* stands for *maximum a posteriori*.

In the above expression, notice that $\Pr(x_1, x_2, \dots, x_n)$ does not depend on $c_j$. Hence, we can simplify the computation of the $\operatorname{argmax}$ operator:

$$
c_{map} = \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[\frac{\Pr(c_j) \Pr (x_1, x_2, \dots x_n \mid c_j)} {\Pr(x_1, x_2, \dots, x_n)} \right] = \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[\Pr(c_j) \Pr(x_1, x_2, \dots x_n \mid c_j) \right]
$$
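The simplification above can be checked numerically. The sketch below uses made-up values for the priors and likelihoods (they are illustrative assumptions, not estimates from any dataset) and suggests that dividing by the evidence rescales the scores without changing which class attains the maximum.

```python
import numpy as np

# Hypothetical values for a three-class problem (illustrative only).
priors = np.array([0.5, 0.3, 0.2])          # Pr(c_j)
likelihoods = np.array([0.02, 0.10, 0.05])  # Pr(x_1, ..., x_n | c_j)

# Unnormalized scores: Pr(c_j) * Pr(x_1, ..., x_n | c_j)
scores = priors * likelihoods

# The evidence Pr(x_1, ..., x_n) is the sum of the scores over all classes
# (law of total probability), so it is the same number for every class.
evidence = scores.sum()
posteriors = scores / evidence

# Both computations select the same class index.
print(np.argmax(scores), np.argmax(posteriors))
```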

## Naive Bayes Classifier - steps

From the discussion provided above, we can summarize the NBC algorithm in two steps, which are described below. 

1. Calculate the posterior probabilities $\Pr(c_j \mid \mathbf{x})$, $j = 1, 2, \ldots, k$.
2. Classify $\mathbf{x}$ as being of class $c_{map}$ such that $\Pr(c_{map} \mid \mathbf{x})$ is maximum.


By inspecting the expression for $c_{map}$, we can conclude that we need to compute two sets of probability values:

- $\Pr(c_j)$, $c_j \in \mathcal{C}$
- $\Pr(x_1, x_2, \dots, x_n \mid c_j)$, $c_j \in \mathcal{C}$

## Naive Bayes Classifier - computing priors and likelihoods

Recall that $c_1, c_2, \ldots, c_k$ are the classes of the problem (i.e., the possible values of the target) and that $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ is a new example to be classified, where $a_1, a_2, \ldots, a_n$ are the values of its predictive features.

More concretely, consider that $\mathbf{x}$ is the following:

$$
\mathbf{x} = [\operatorname{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak}]
$$

Hence, we are provided information about the weather conditions of a given day (represented by $\mathbf{x}$), and we want to answer (*predict*) whether or not this is a good day to play tennis. 

There are only two possibilities, Yes (play=Yes) or No (play=No). Therefore, let us define two probability values:

$$
\begin{align*}
\Pr(\text{play = Yes} \mid \mathbf{x}) & \text{: probability that } \mathbf{x} \text{ is a good day to play.}\\
\Pr(\text{play = No} \mid \mathbf{x}) & \text{: probability that } \mathbf{x} \text{ is NOT a good day to play.}
\end{align*}
$$

Notice that, if we know the probability values above, the problem is solved. That is because we can use these values to make our decision: if $\Pr(\text{play = Yes} \mid \mathbf{x}) > \Pr(\text{play = No} \mid \mathbf{x})$, then we predict that $\mathbf{x}$ is a good day to play tennis, that is, we predict play = Yes. Otherwise, we predict play = No.

But how can we compute those probability values? The answer lies in Bayes' rule. To compute $\Pr(\text{play = Yes} \mid \mathbf{x})$, we use Bayes' rule and write:

$$
\begin{align*}
\Pr(\text{play = Yes} \mid \mathbf{x}) & = \frac{\Pr(\mathbf{x} \mid \text{play = Yes}) \times \Pr(\text{play} = \text{Yes})}{\Pr(\mathbf{x})} = \\
&=  \frac{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes}) \times \Pr(\text{play = Yes})}{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak})}
\end{align*}
$$

To compute $\Pr(\text{play = No} \mid \mathbf{x})$, we write a similar expression:

$$
\begin{align*}
\Pr(\text{play = No} \mid \mathbf{x}) & = \frac{\Pr(\mathbf{x} \mid \text{play = No}) \times \Pr(\text{play} = \text{No})}{\Pr(\mathbf{x})} = \\
&=  \frac{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = No}) \times \Pr(\text{play = No})}{\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak})}
\end{align*}
$$

By looking at the two expressions above, it seems there are several probability values we need to compute using the provided dataset. Let us list each one of them:

1. $\Pr(\text{play = Yes})$
2. $\Pr(\text{play = No})$
3. $\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = No})$
4. $\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes})$
5. $\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak})$

Items 1 and 2 are easy to estimate: it suffices to count how often each class occurs in the dataset. More good news is that we do not actually need to compute item 5, since this expression appears as the denominator of both $\Pr(\text{play = Yes} \mid \mathbf{x})$ and $\Pr(\text{play = No} \mid \mathbf{x})$ and therefore does not affect their comparison. We are left with items 3 and 4. Let us apply the definition of [conditional probability](https://en.wikipedia.org/wiki/Conditional_probability) repeatedly (the chain rule) to one of these expressions (item 4):

$$
\begin{align*}
\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes}) = \\
\Pr(\text{outlook = Sunny} \mid \text{play = Yes}) \times \\ 
\times\Pr(\operatorname{temp} = \text{Hot} \mid \text{outlook} = \text{Sunny}, \text{play = Yes}) \times \\
\times\Pr(\operatorname{humidity} = \text{High} \mid \text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \text{play = Yes}) \times\\
\times\Pr(\operatorname{wind} = \text{Weak} \mid \text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \text{play = Yes})
\end{align*}
$$

By looking at the above expression, it seems that we have to compute a lot of probability estimates from the available data. That is when the naive assumption of the Naive Bayes Classifier comes in handy. This algorithm assumes that the attributes are conditionally independent of one another, once we know the value of the class.
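To get a rough sense of how much this assumption saves, the sketch below counts how many likelihood values would have to be estimated per class with and without it. It assumes three possible values for `outlook` and `temp` and two for `humidity` and `wind`, as in the Play Tennis dataset; the variable names are illustrative.

```python
from math import prod

# Assumed number of distinct values per feature (Play Tennis dataset).
n_values = {"outlook": 3, "temp": 3, "humidity": 2, "wind": 2}

# Without the naive assumption: one probability for every combination
# of feature values, for each class.
joint_per_class = prod(n_values.values())

# With the naive assumption: one probability for every (feature, value)
# pair, for each class.
naive_per_class = sum(n_values.values())

print(joint_per_class, naive_per_class)  # 36 10
```

The gap widens quickly as features are added, since the first quantity grows multiplicatively and the second additively.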

> *Conditional independence*. The term [conditional independence](https://www.probabilitycourse.com/chapter1/1_4_4_conditional_independence.php) corresponds to a somewhat advanced concept in Probability Theory. Consider three variables A, B, and C. We say that variables A and B are conditionally independent given the variable C if and only if knowing the value of C makes A and B independent of each other. Formally:
$$
\Pr(A \mid B, C) = \Pr(A \mid C)
$$

The term *naive* stems from the fact that Naive Bayes considers that the attributes are conditionally independent given the class. When considering this hypothesis, the computation of the conditional probabilities can be simplified. Mathematically, we have:
$$
\Pr(x_1, x_2, \dots x_n \mid y) = \Pr(x_1 \mid y) \times \Pr(x_2 \mid y) \times \ldots \times \Pr(x_n \mid y)
$$

In many practical cases, this independence between predictors does not hold. For example, consider a dataset with information about the customers of a company, where each customer is represented by the following features: *weight*, *education*, *salary*, *age*, etc. In this dataset, the values of the first three features are correlated with the values of age. In this case, at least in theory, Naive Bayes would overestimate the effect of the age feature. However, practice shows that Naive Bayes is often quite effective even when the predictive features are not statistically independent.

Anyway, assuming the naive hypothesis is true, we can simplify the Bayes formula:

$$
\Pr(y \mid x_1, x_2, \dots, x_n) \propto \Pr(y) \times \Pr(x_1 \mid y) \times \Pr(x_2 \mid y) \times \ldots \times \Pr(x_n \mid y)
$$


The Naive Bayes Classifier uses this assumption of conditional independence to simplify the computation of the probability estimates that must be produced. By applying this assumption to the estimates above, we end up with the following:

$$
\begin{align*}
\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = Yes}) = \\
\Pr(\text{outlook = Sunny} \mid \text{play = Yes}) \times \\ 
\times\Pr(\operatorname{temp} = \text{Hot} \mid \text{play = Yes}) \times \\
\times\Pr(\operatorname{humidity} = \text{High} \mid \text{play = Yes}) \times\\
\times\Pr(\operatorname{wind} = \text{Weak} \mid \text{play = Yes})
\end{align*}
$$

We can write a similar expression for $\text{play = No}$:

$$
\begin{align*}
\Pr(\text{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak} \mid \text{play = No}) = \\
\Pr(\text{outlook = Sunny} \mid \text{play = No}) \times \\ 
\times\Pr(\operatorname{temp} = \text{Hot} \mid \text{play = No}) \times \\
\times\Pr(\operatorname{humidity} = \text{High} \mid \text{play = No}) \times\\
\times\Pr(\operatorname{wind} = \text{Weak} \mid \text{play = No})
\end{align*}
$$

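With these factorized likelihoods, every term needed to compare the posterior probabilities of the classes can be estimated from data:

$$
\Pr(c_j \mid \mathbf{x}), \, 1 \leq j \leq k
$$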

## Estimating priors and likelihoods from data

The probabilities are actually estimated from a training dataset by the Naive Bayes algorithm. These estimates are computed by counting the occurrences of values in a given feature, either separately or in conjunction with values of the target.

Let us see an example of how probability estimates can be computed from data provided as a dataset. For this, consider the [Play Tennis dataset](https://www.kaggle.com/fredericobreno/play-tennis), a toy dataset with four predictors (`outlook`, `temp`, `humidity`, and `wind`) and fourteen examples. The target (`play`) is binary. Each example provides data about the weather conditions on a particular day. Therefore, the classification task is to predict whether a given day is appropriate to play tennis or not.

In [2]:
import pandas as pd
df_play_tennis = pd.read_csv('../datasets/play_tennis.csv')
df_play_tennis

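| Unnamed: 0 | day | outlook | temp | humidity | wind | play |
|---|---|---|---|---|---|---|
| 0 | D1 | Sunny | Hot | High | Weak | No |
| 1 | D2 | Sunny | Hot | High | Strong | No |
| 2 | D3 | Overcast | Hot | High | Weak | Yes |
| 3 | D4 | Rain | Mild | High | Weak | Yes |
| 4 | D5 | Rain | Cool | Normal | Weak | Yes |
| 5 | D6 | Rain | Cool | Normal | Strong | No |
| 6 | D7 | Overcast | Cool | Normal | Strong | Yes |
| 7 | D8 | Sunny | Mild | High | Weak | No |
| 8 | D9 | Sunny | Cool | Normal | Weak | Yes |
| 9 | D10 | Rain | Mild | Normal | Weak | Yes |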
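If the CSV file is not at hand, the dataset can also be built in memory. The sketch below assumes the standard fourteen-row version of the Play Tennis dataset; rows D11 to D14 are not displayed in the output above and are an assumption taken from that standard version.

```python
import pandas as pd

# Columns: day, outlook, temp, humidity, wind, play.
# Rows D11-D14 are assumed from the standard version of the dataset.
rows = [
    ("D1", "Sunny", "Hot", "High", "Weak", "No"),
    ("D2", "Sunny", "Hot", "High", "Strong", "No"),
    ("D3", "Overcast", "Hot", "High", "Weak", "Yes"),
    ("D4", "Rain", "Mild", "High", "Weak", "Yes"),
    ("D5", "Rain", "Cool", "Normal", "Weak", "Yes"),
    ("D6", "Rain", "Cool", "Normal", "Strong", "No"),
    ("D7", "Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("D8", "Sunny", "Mild", "High", "Weak", "No"),
    ("D9", "Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("D10", "Rain", "Mild", "Normal", "Weak", "Yes"),
    ("D11", "Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("D12", "Overcast", "Mild", "High", "Strong", "Yes"),
    ("D13", "Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("D14", "Rain", "Mild", "High", "Strong", "No"),
]
columns = ["day", "outlook", "temp", "humidity", "wind", "play"]
df_play_tennis = pd.DataFrame(rows, columns=columns)

print(df_play_tennis.shape)
print(df_play_tennis["play"].value_counts())
```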


### Priors

First, let us compute the estimates for the prior probabilities.

$$
\Pr(\operatorname{play} = \text{Yes}) \approx \frac{9}{14} \approx 64\%
$$

$$
\Pr(\operatorname{play} = \text{No}) \approx \frac{5}{14} \approx 36\%
$$

The way to interpret these prior probabilities is the following: if you do not know anything about the weather conditions on a given day, then there is approximately a 64% chance that this day is appropriate to play tennis.

In general, to compute estimates for the prior probabilities, we use the following expression:

$$
\Pr(c_j) \approx \frac{q_j}{n}
$$

In the above expression,

- $q_j$ is the number of training examples that belong to class $c_j$;
- $n$ is the total number of training examples.
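The expression above amounts to computing relative frequencies of the target. A minimal sketch with pandas is shown below; it rebuilds only the `play` column from the class counts (9 examples labeled Yes and 5 labeled No) rather than reading the CSV file.

```python
import pandas as pd

# Target column of the Play Tennis dataset: 9 "Yes" and 5 "No" examples.
play = pd.Series(["Yes"] * 9 + ["No"] * 5, name="play")

# q_j / n for every class c_j: relative frequency of each target value.
priors = play.value_counts(normalize=True)

print(priors)
```

With a DataFrame loaded from the CSV file, the same call could be applied to its `play` column.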

### Likelihoods

See a nice explanation about conditional probabilities [here](https://setosa.io/conditional/).

Consider the estimate $\Pr(\operatorname{play} = \text{No} \mid \operatorname{outlook} = \text{Sunny}) \approx 60\%$. This estimate tells us that, on a sunny day, there is a $60\%$ chance that it is not a good day to play tennis. Now, compare this value with the estimate $\Pr(\operatorname{play} = \text{No}) \approx 36\%$. We can conclude that knowing the day is sunny changes our belief about whether it is appropriate to play tennis. In other words, there seems to be a **dependence** between the variables `play` and `outlook`.

In general, two events $A$ and $B$ are said to be independent if and only if both identities below are true:

1. $\Pr(A \mid B) = \Pr(A)$
2. $\Pr(B \mid A) = \Pr(B)$

We can also easily compute estimates for the conditional probabilities from the data. Some examples:
- $\Pr(\operatorname{outlook} = \text{Sunny}  \mid \operatorname{play} = \text{No}) \approx \frac{3}{5}.$
- $\Pr(\operatorname{outlook} = \text{Sunny} \text{ and } \operatorname{temp} = \text{Hot} \mid \operatorname{play} = \text{No}) \approx \frac{2}{5} = 40\%$
- $\Pr(\operatorname{play} = \text{No}  \mid \operatorname{outlook} = \text{Sunny}) \approx \frac{3}{5} = 60\%$

As an exercise, compute estimates for the following conditional probabilities (likelihoods):

- $\Pr(\text{outlook} = \text{Sunny} \mid \text{play} = \text{Yes})$
- $\Pr(\text{outlook} = \text{Sunny} \mid \text{play} = \text{No})$ 

- $\Pr(\text{temp} = \text{Hot} \mid \text{play} = \text{Yes})$ 
- $\Pr(\text{temp} = \text{Hot} \mid \text{play} = \text{No})$ 

- $\Pr(\text{humidity} = \text{High} \mid \text{play} = \text{Yes})$
- $\Pr(\text{humidity} = \text{High} \mid \text{play} = \text{No})$ 

- $\Pr(\text{wind} = \text{Weak} \mid \text{play} = \text{Yes})$ 
- $\Pr(\text{wind} = \text{Weak} \mid \text{play} = \text{No})$

In general, to compute estimates for the conditional probabilities, we use the following expression:

$$
\Pr(x_i \mid c_j) \approx \frac{q_{ij}}{q_j}
$$

In the above expression,

- $q_j$ is the number of training examples that belong to class $c_j$;
- $q_{ij}$ is the number of training examples that belong to class $c_j$ and have the provided value for $x_i$ (the i-th attribute).
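One possible way to obtain the counts $q_{ij}$ and the ratios $q_{ij}/q_j$ for a single attribute is a contingency table. The sketch below uses `pandas.crosstab` on the `outlook` and `play` columns; the values follow days D1 to D14 of the standard fourteen-row dataset, which is an assumption for the rows not displayed earlier in this text.

```python
import pandas as pd

# outlook and play for days D1..D14 (rows D11-D14 assumed from the
# standard version of the dataset).
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
df = pd.DataFrame({"outlook": outlook, "play": play})

# q_ij: number of examples with a given outlook value and a given class.
counts = pd.crosstab(df["outlook"], df["play"])

# q_ij / q_j: normalizing each column (one column per class) yields the
# estimates of Pr(outlook = value | play = class).
likelihoods = pd.crosstab(df["outlook"], df["play"], normalize="columns")

print(counts)
print(likelihoods)
```

Each column of `likelihoods` sums to one, because it is a distribution over the values of `outlook` for a fixed class.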

## Implementation details

### Laplace Smoothing

So far, we know that the calculation of probability estimates in the NBC method is based on frequency counts over the training dataset $\mathcal{D}$. For example, to obtain the estimate for the prior probabilities $\Pr(c_{j})$, it is necessary to determine how often we find examples that belong to class $c_{j}$.

However, there is an additional complication in calculating the estimates for the conditional probabilities: zero frequencies cause the estimate of the conditional probability to be zero. To understand this, realize that it suffices for one of the factors in the equation to be zero for the entire product to also be zero.

$$
\Pr(x_1, x_2, \dots x_n \mid c_j) = \Pr(x_1 \mid c_j) \times \Pr(x_2 \mid c_j) \times \ldots \times \Pr(x_n \mid c_j)
$$

In particular, the frequency $\frac{q_{ij}}{q_{j}}$ is zero when the value of attribute $x_i$ does not occur in any training example labeled with class $c_j$ (because, in this case, $q_{ij} = 0$). To prevent the occurrence of zero frequencies, we must *smooth* the estimates.
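The effect of a single zero frequency can be seen in the short sketch below. The factors are assumed estimates of $\Pr(x_i \mid \operatorname{play} = \text{No})$ for a hypothetical overcast, hot, humid day with weak wind, taken from the standard fourteen-row dataset, in which no overcast day is labeled No.

```python
from math import prod

# Assumed unsmoothed estimates Pr(x_i | play = No), with q_No = 5:
# outlook = Overcast (0/5), temp = Hot (2/5),
# humidity = High (4/5), wind = Weak (2/5).
factors = [0 / 5, 2 / 5, 4 / 5, 2 / 5]

# One zero factor wipes out the whole product, whatever the others are.
print(prod(factors))  # 0.0
```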

In general, the procedure of smoothing a probability estimate $e$ means adding a small, positive constant $\delta$ to it, such that the new estimate is $e+\delta$. The result is that probability estimates that are zero become greater than zero.

One of the techniques used to smooth probability estimates is **Laplace Smoothing**. Remember that the expression used to compute the likelihood estimates is the following:
$$
	\Pr(x_{i} \mid c_{j}) \approx \frac{q_{ij}}{q_{j}}
$$

When applied to the above expression, the Laplace Smoothing technique allows it to be rewritten, as shown below.
$$
	\Pr(x_{i} \mid c_{j}) \approx \frac{q_{ij} + \lambda}{q_{j} + \lambda |\mathcal{C}|}
$$

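In the above expression,

- $|\mathcal{C}|$ denotes the number of different values the target can assume;
- $\lambda$ is the smoothing parameter, usually set to 1.

Many references use the number of distinct values of attribute $x_i$ in the denominator instead of $|\mathcal{C}|$, which makes the smoothed estimates for that attribute sum to one within each class. This text uses $|\mathcal{C}|$ throughout.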

Let us present a numerical example of computing the probability estimates using Laplace Smoothing. For this, consider the following question:

> Is it appropriate or not to play tennis on a sunny, hot, high humidity and light wind day?

This question is equivalent to classifying an example $\mathbf{x}$ corresponding to $[\operatorname{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak}]$. To answer this question, we can apply the Naive Bayes classifier to compute $c_{\text{map}}$. Before this, however, we need to compute probability estimates for the priors and likelihoods.

For the **prior probabilities**, we find that:

$$\Pr(\operatorname{play} = \text{Yes}) \approx \frac{q_{\text{Yes}}}{n} = \frac{9}{14}$$ 

$$\Pr(\operatorname{play} = \text{No}) \approx \frac{q_{\text{No}}}{n} = \frac{5}{14}$$

Similarly, estimates for **conditional probabilities** $\Pr(x_i \mid c_j)$ can be computed. Let us assume we want to compute these probabilities for the value of $\text{Sunny}$ of the attribute $\text{outlook}$. If we set the Laplace smoothing parameter to $\lambda=1$, then:

$$
\begin{align*}
\Pr(\operatorname{outlook} = \text{Sunny} \mid \operatorname{play} = \text{Yes}) &\approx \frac{q_{ij} + \lambda}{q_{j} + \lambda |\mathcal{C}|} = \frac{2 + 1}{9 + 1 \cdot 2} = \frac{3}{11} \approx 0.27\\
\Pr(\operatorname{outlook} = \text{Sunny} \mid \operatorname{play} = \text{No}) &\approx \frac{q_{ij} + \lambda}{q_{j} + \lambda |\mathcal{C}|} = \frac{3 + 1}{5 + 1 \cdot 2} = \frac{4}{7} \approx 0.57
\end{align*}
$$

In [6]:
def laplace_smoothed_probability(q_j, q_i_j, num_values_target, smoothing_parameter=1):
  """
  This function calculates the Laplace-smoothed estimate of a probability.

  Args:
      q_j (int): The total number of examples associated to target c_j.
      q_i_j (int): The total number of examples associated to target c_j, and that have a particular value for predictor x_i.
      num_values_target: number of values the target can assume.
      smoothing_parameter (int, optional): The smoothing parameter (default 1).

  Returns:
      float: The Laplace-smoothed probability estimate.
  """
  return (q_i_j + smoothing_parameter) / (q_j + smoothing_parameter * num_values_target)

print(laplace_smoothed_probability(9, 2, 2))
print(laplace_smoothed_probability(5, 3, 2))

0.2727272727272727
0.5714285714285714


### Log-transformation

In addition to Laplace smoothing, another implementation trick is commonly used in Naive Bayes Classifiers (NBC). Consider again the expression used to compute $\Pr(x_1, x_2, \dots, x_n \mid y)$. Note that this equation presents a product over the factors $\Pr(x_{i} \mid c_{j})$. From a computational point of view, this product represents a complication, because the values $\Pr(x_{i} \mid c_{j})$ are typically very close to $0$, which makes their product even closer to zero. Since computers represent real numbers with finite precision, the product can underflow to zero, in which case the scores of different classes can no longer be told apart.

To circumvent this issue, we leverage a property of the $\arg\max$ operator and the logarithmic function $f(x) = \log(x)$, as described below.

First, note that the function $\log(x)$ is *monotonically increasing*, which means that if  $x_1 \geq x_2$, then $\log(x_1) \geq \log(x_2)$. Therefore, if we apply the logarithmic function to each element of the list passed as an argument to the argmax operator, the result produced by this operator remains the same. In other words:

$$
\underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[f(c_j)\right] = \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[\log f(c_j) \right]
$$

It is also worth noting that we can exploit a property of the logarithm function, which states that the logarithm of a product is equal to the sum of the logarithms. As a result, we can transform the product in the original expression for $c_{map}$ into a sum. This is advantageous from a computational standpoint, since sums are less prone to numerical approximation errors than products. This development is presented below.

$$
\begin{align*}
c_{map} &= \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[ \Pr(c_{j}) \times \prod_{i}\Pr(x_i|c_{j})\right] \\ 
		&= \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \log \left[ \Pr(c_{j}) \times \prod_{i}\Pr(x_i|c_{j})\right] \\ 
		&= \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[ \log \Pr(c_{j}) + \log \prod_{i}\Pr(x_i|c_{j})\right] \\ 
		&= \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[ \log \Pr(c_{j}) + \sum_{i} \log \Pr(x_i|c_{j})\right] \\ 
\end{align*}
$$
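The numerical benefit of working with sums of logarithms can be illustrated with a small experiment. The sketch below uses 400 made-up factors, all equal to 0.01 (an illustrative assumption), and compares the direct product with the sum of logarithms.

```python
import numpy as np

# 400 hypothetical likelihood factors, each equal to 0.01 (illustrative only).
probs = np.full(400, 0.01)

# The exact product is 1e-800, far below the smallest positive float64,
# so the floating-point result underflows to zero.
product = np.prod(probs)

# The sum of logarithms is an ordinary negative number: 400 * log(0.01).
log_sum = np.sum(np.log(probs))

print(product)
print(log_sum)
```

Two classes whose products both underflow would be indistinguishable, whereas their log-scores can still be compared.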

Based on the description provided so far, we can conclude that the most probable class for an example $\mathbf{x}$ can be determined as follows, given the estimates $\Pr(c_{j})$ and $\Pr(x_i|c_{j})$:

$$
\begin{equation*}
		c_{map} = \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[ \log \Pr(c_{j}) + \sum_{i} \log \Pr(x_i|c_{j})\right]
\end{equation*}
$$

If we combine both implementation tricks (Laplace Smoothing and Log-Transformation), we get the final expression for $c_{map}$:
$$
\begin{equation*}
		c_{map} = \underset{c_{j} \in \mathcal{C}}{\operatorname{argmax}} \left[\log \frac{q_j}{n} + \sum_{i} \log \frac{q_{ij} + \lambda}{q_{j} + \lambda |\mathcal{C}|}\right]
\end{equation*}
$$
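As a worked instance of the final expression, the sketch below scores the example $[\operatorname{outlook} = \text{Sunny}, \operatorname{temp} = \text{Hot}, \operatorname{humidity} = \text{High}, \operatorname{wind} = \text{Weak}]$ for both classes. The counts are assumptions taken from the standard fourteen-row Play Tennis dataset, and the names `class_counts`, `feature_counts`, and `log_score` are hypothetical helpers used only for illustration.

```python
import math

n = 14                                   # total number of training examples
class_counts = {"Yes": 9, "No": 5}       # q_j

# q_ij for the example x = [Sunny, Hot, High, Weak]
# (assumed from the standard 14-row dataset).
feature_counts = {
    "Yes": {"outlook=Sunny": 2, "temp=Hot": 2, "humidity=High": 3, "wind=Weak": 6},
    "No": {"outlook=Sunny": 3, "temp=Hot": 2, "humidity=High": 4, "wind=Weak": 2},
}

lam = 1                                  # Laplace smoothing parameter (lambda)
num_classes = len(class_counts)          # |C|


def log_score(c):
    """Log prior plus the sum of smoothed log-likelihoods for class c."""
    score = math.log(class_counts[c] / n)
    for q_ij in feature_counts[c].values():
        score += math.log((q_ij + lam) / (class_counts[c] + lam * num_classes))
    return score


scores = {c: log_score(c) for c in class_counts}
c_map = max(scores, key=scores.get)

print(scores)
print(c_map)
```

Under these assumed counts, the score for No is the larger of the two, so the example would be classified as play = No.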

## *Recommended* object-oriented design

In [25]:
import numpy as np
import pandas as pd

class NaiveBayesClassifier:
  """
  A class implementing the Naive Bayes Classifier algorithm.
  """

  def __init__(self, smoothing_parameter: int=1):
    """
    Initializes the classifier with a smoothing parameter for Laplace Smoothing.

    Args:
        smoothing_parameter (int, optional): The smoothing parameter (default 1).
    """
    self.smoothing_parameter = smoothing_parameter
    self.classes_ = None
    self.feature_counts_ = None  # Dictionary to store feature counts

  def fit(self, X: pd.DataFrame, y: pd.Series):
    """
    Fits the classifier to the training data.

    Args:
        X (pd.DataFrame): The training data features.
        y (pd.Series): The training data labels.
    """
    self.classes_ = np.unique(y)
    self.feature_counts_ = self._calculate_feature_counts(X, y)


  def predict(self, X: pd.DataFrame):
    """
    Predicts the class labels for new data points.

    Args:
        X (pd.DataFrame): The data points to predict labels for.

    Returns:
        numpy.ndarray: The predicted class labels for each data point.
    """
    if self.classes_ is None:
      raise ValueError("Model not fitted yet. Call fit(X, y) first.")
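    # Pick the most probable class for each row and map its column index back
    # to the class label (assumes _predict_proba follows the order of self.classes_).
    probabilities = np.apply_along_axis(self._predict_proba, axis=1, arr=X.values)
    return self.classes_[probabilities.argmax(axis=1)]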

  def predict_proba(self, X: pd.DataFrame):
    """
    Predicts the class probabilities for new data points.

    Args:
        X (pd.DataFrame): The data points to predict class probabilities for.

    Returns:
        numpy.ndarray: The predicted class probabilities for each data point.
    """
    if self.classes_ is None:
      raise ValueError("Model not fitted yet. Call fit(X, y) first.")
    return np.apply_along_axis(self._predict_proba, axis=1, arr=X.values)

  def _calculate_feature_counts(self, X, y):
    """
    Calculates the counts of features for each class using Laplace Smoothing.

    Args:
        X (pd.DataFrame): The training data features.
        y (pd.Series): The training data labels.

    Returns:
        dict: A dictionary containing feature counts for each class.
    """
    ##########################################################################
    # YOUR CODE HERE
    # ... (Implementation of counting features)
    ##########################################################################

  def _predict_proba(self, x):
    """
    Calculates the class probabilities for a single data point using Bayes' theorem.

    Args:
        x (numpy.ndarray): A single data point.

    Returns:
        numpy.ndarray: The class probabilities for the data point.
    """
    ##########################################################################
    # YOUR CODE HERE
    # ... (Implementation of Naive Bayes probability calculation with Laplace smoothing)
    ##########################################################################


#### Additional information on the ``self.feature_counts_`` property:

- Inside the ``__init__`` method, the ``self.feature_counts_`` property is initialized to ``None``; the ``fit`` method replaces it with a dictionary.
- This dictionary will be used to store the calculated feature counts for each class during the training process (``fit`` method).
- The key of the dictionary will be a tuple representing a specific feature and its value (e.g., ("outlook", "Sunny")).
- The value of the dictionary will be another dictionary that stores counts for each class (e.g., {"Yes": 2, "No": 3}).
- This dictionary allows efficient storage and retrieval of feature counts categorized by class and feature value during prediction.

The piece of code below is meant to provide a more concrete understanding of that property.

In [23]:
feature_counts_ = {}
feature_counts_[("outlook", "Sunny")] = {"Yes": 2, "No": 3}
print(feature_counts_[("outlook", "Sunny")])
print(feature_counts_[("outlook", "Sunny")]["Yes"])
print(feature_counts_[("outlook", "Sunny")]["No"])

{'Yes': 2, 'No': 3}
2
3


#### Additional information on the function ``numpy.apply_along_axis``.

The following example demonstrates how ``numpy.apply_along_axis`` can be used with a pandas DataFrame to apply a custom function to each column (``axis=0``) or each row (``axis=1``) of the underlying array.

In [15]:
import pandas as pd
import numpy as np

# Sample DataFrame
data = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': [.1, .2, .3, .4]}
df = pd.DataFrame(data)

# Function to square each element
def square(x):
  return x * x

# Apply the square function to each column (axis=0) using numpy.apply_along_axis
squared_df = pd.DataFrame(np.apply_along_axis(square, axis=0, arr=df.values))

# Set column names for the resulting DataFrame
squared_df.columns = df.columns

print(df)
print(squared_df)

   col1  col2  col3
0     1     5   0.1
1     2     6   0.2
2     3     7   0.3
3     4     8   0.4
   col1  col2  col3
0   1.0  25.0  0.01
1   4.0  36.0  0.04
2   9.0  49.0  0.09
3  16.0  64.0  0.16
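The ``predict`` and ``predict_proba`` methods above call ``numpy.apply_along_axis`` with ``axis=1``, so that the function receives one row at a time. The sketch below illustrates that usage with a hypothetical function, ``row_summary``, that returns a small vector for each row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [5, 6, 7, 8]})


def row_summary(row):
    # `row` is a 1-D array holding the values of one row of the DataFrame.
    return np.array([row.sum(), row.max()])


# axis=1 hands each row to the function; the returned vectors are
# stacked into a 2-D array with one output row per input row.
result = np.apply_along_axis(row_summary, axis=1, arr=df.values)

print(result.shape)  # (4, 2)
print(result)
```

In the classifier, the function applied to each row would return one value per class, so the result would have one column per class.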
