## Background
- Problem type: classification
- Assumptions on data: conditional independence between the features, given the class
- Theory: Bayes' theorem is a mathematical formula used for calculating conditional probabilities.<br>
    $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$<br>
    - A and B are events and $P(B)\neq 0$.
    - $P(A)$ and $P(B)$ are called marginal probabilities: each represents the probability of an event irrespective of the outcomes of other random variables.
    - $P(A|B)$ is a conditional probability: the probability of event A occurring given that B occurred.
    - A and B are usually different events; the formula still holds, trivially, when they coincide.
    - For a data point with n features $x_1, x_2,...,x_n$ and a candidate class $y$ that we would like to assign to it:
### $P(y|x_1, x_2,...,x_n) = \frac{P(x_1, x_2,...,x_n|y) \cdot P(y)}{P(x_1, x_2,...,x_n)}$
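
To make the formula concrete, here is a minimal sketch that applies Bayes' theorem to two events. The function name `bayes_posterior` and all the numbers are hypothetical, chosen only for illustration; $P(B)$ is obtained from the law of total probability.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # Marginal P(B) via the law of total probability.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.05.
posterior = bayes_posterior(0.01, 0.9, 0.05)
```

With these made-up inputs the posterior $P(A|B)$ is about 0.154: although $P(B|A)$ is high, the small prior $P(A)$ keeps the result low.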

## Naive Bayes:

The approach presented above requires a large amount of data in order to estimate the joint probability distribution over all possible combinations of feature values.

Instead, I will use the "Naive Bayes" approach: I will assume __independence__ between every pair of features given the class (and will preprocess the data accordingly). For two independent events, $P(B)⋅P(A|B)=P(B)⋅P(A)$; applied to the features within a class, this gives $P(x_1, x_2,...,x_n|y) = \prod_{i=1}^{n} P(x_i|y)$, and therefore the calculated probability can be simplified to:
### $P(y|x) = \frac{P(y) \cdot \prod_{i=1}^{n} P(x_i|y)}{P(x_1, x_2,...,x_n)}$

We need to determine which class gets the highest probability for each data point. Solely for the purpose of comparison, we can calculate just the numerator, because the denominator $P(x_1, x_2,...,x_n)$ does not depend on $y$ and is therefore the same for every class:
### $P(y|x) \propto P(y) \cdot \prod_{i=1}^{n} P(x_i|y)$

- note: the numerator alone is an unnormalized score rather than a probability (the values do not sum to 1 across classes), so it cannot be used directly for __predicting__ the probability of a data point being in a class. Therefore, we rely mainly on the __comparison__ between classes (by taking the maximum value) rather than on the actual probability value. Dividing each score by the sum of the scores over all classes recovers the actual probabilities.
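
The following sketch illustrates the point for a single data point: the class with the largest numerator is also the class with the largest normalized posterior. The class names, priors and likelihood values are hypothetical, not taken from any dataset.

```python
from math import prod

# Hypothetical priors P(y) and per-feature likelihoods P(x_i | y) for one data point.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.8, 0.3], "ham": [0.1, 0.5]}


def unnormalized_scores(priors, likelihoods):
    """Numerator of the Naive Bayes formula: P(y) * prod_i P(x_i | y)."""
    return {y: priors[y] * prod(likelihoods[y]) for y in priors}


def normalize(scores):
    """Divide by the sum over classes, which plays the role of the denominator."""
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


scores = unnormalized_scores(priors, likelihoods)
posteriors = normalize(scores)

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
```

Both `best_by_score` and `best_by_posterior` select the same class, because normalization divides every score by the same positive number.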
    
Another simplification is needed: the equation above still requires calculating the conditional probabilities $P(x_i|y)$. We can avoid doing so by using the __probability density function__ (PDF); see https://en.wikipedia.org/wiki/Probability_density_function and https://en.wikipedia.org/wiki/Bayes%27_theorem#:~:text=In%20principle%2C%20Bayes'%20theorem%20applies,relevant%20densities%20(see%20Derivation).
For a continuous random variable, the PDF $f$ is defined by $P[a \leq X \leq b] = \int_a^b {f(x)dx}$.
The probability of a single value is therefore $P(X=x_0) = \int_{x_0}^{x_0} {f(x)dx} = F(x_0) - F(x_0) = 0$, where $F$ is an antiderivative of $f$ (Newton-Leibniz formula). Instead of observing a single value, we observe a small interval of width $Δ(x)$ around it, whose probability is $P(|X-x_0| < Δ(x)/2)$.
For a small enough $Δ(x)$, the PDF at $x_0$ can be approximated as $pdf(x_0) \approx \frac {P(|X-x_0| < Δ(x)/2)}{Δ(x)}$.
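
As a numeric check of this idea, the sketch below takes a standard normal variable (an assumption made only for the example), computes the probability of a small interval centred at $x_0$, divides it by the interval width, and compares the result with the exact density. The helper names are hypothetical.

```python
from math import erf, exp, pi, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal variable."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Exact density of a normal variable."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))


def interval_density(x0, width, mu=0.0, sigma=1.0):
    """Probability of an interval of the given width centred at x0, divided by that width."""
    prob = (normal_cdf(x0 + width / 2, mu, sigma)
            - normal_cdf(x0 - width / 2, mu, sigma))
    return prob / width


x0 = 0.5
point_probability = normal_cdf(x0) - normal_cdf(x0)  # a single value has probability 0
exact = normal_pdf(x0)
approximations = {w: interval_density(x0, w) for w in (1.0, 0.1, 0.01)}
```

As the width shrinks, the values in `approximations` get closer to `exact`, while `point_probability` stays at zero.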

Therefore, an alternative way of writing the proportionality is to replace each feature probability with a density multiplied by the interval width. The class $y$ is discrete, so $P(y)$ and $P(y|x)$ remain ordinary probabilities:
### $P(y|x) \propto P(y) \cdot \prod_{i=1}^{n} PDF(x_i|y)Δ(x_i)$
Dividing by the $Δ(x_i)$ keeps the proportion (they are positive constants that do not depend on the class):
### $P(y|x) \propto P(y) \cdot \prod_{i=1}^{n} PDF(x_i|y)$
Summing up, we can compare the __densities__ instead of the actual conditional probabilities to find which class is most likely for a specific data point.
In order to use this method, we need to know (or assume) the distribution of each feature within each class.
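
As an illustration, suppose each feature is normally distributed within each class. This is an assumption of the sketch, not something the text fixes; any distribution with a known density could be substituted. The class labels, priors and (mean, std) pairs below are hypothetical.

```python
from math import exp, pi, prod, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (std * sqrt(2.0 * pi))


# Hypothetical per-class parameters: prior P(y) and a (mean, std) pair per feature.
classes = {
    "a": {"prior": 0.5, "params": [(0.0, 1.0), (5.0, 2.0)]},
    "b": {"prior": 0.5, "params": [(3.0, 1.0), (1.0, 2.0)]},
}


def density_score(x, prior, params):
    """Class prior multiplied by the product of the per-feature densities."""
    return prior * prod(gaussian_pdf(xi, m, s) for xi, (m, s) in zip(x, params))


point = [0.2, 4.5]
scores = {y: density_score(point, c["prior"], c["params"]) for y, c in classes.items()}
predicted = max(scores, key=scores.get)
```

Here `predicted` is the class whose assumed distributions give the point the highest density-based score; the scores themselves are not probabilities.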


## Algorithm:
- Find the distribution of each feature (independent variable) within each class.
- Calculate the density values for each data point being in each class.
- Assign each data point to the class with the highest calculated value among all classes.
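
The three steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: features are assumed to be Gaussian within each class, the function names `fit`, `log_score` and `predict` are hypothetical, and the tiny training set is made up. The scores are computed as sums of logarithms, which preserves the ranking of the classes and avoids numerical underflow when many small densities are multiplied.

```python
from collections import defaultdict
from math import log, pi
from statistics import fmean, pstdev


def fit(X, y):
    """Step 1: estimate the prior and the per-feature mean and std for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        model[label] = {
            "log_prior": log(len(rows) / len(X)),
            "mean": [fmean(c) for c in columns],
            # Tiny constant guards against zero spread; an assumption of this sketch.
            "std": [pstdev(c) + 1e-9 for c in columns],
        }
    return model


def log_score(row, params):
    """Step 2: log prior plus the sum of the Gaussian log densities."""
    total = params["log_prior"]
    for x, mean, std in zip(row, params["mean"], params["std"]):
        z = (x - mean) / std
        total += -0.5 * z * z - log(std) - 0.5 * log(2.0 * pi)
    return total


def predict(model, X):
    """Step 3: assign each point to the class with the highest score."""
    return [max(model, key=lambda label: log_score(row, model[label])) for row in X]


# Made-up training data: two features, two classes.
X_train = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.9, 5.2], [4.1, 4.8], [4.0, 5.0]]
y_train = ["low", "low", "low", "high", "high", "high"]

model = fit(X_train, y_train)
predictions = predict(model, [[1.1, 2.0], [4.2, 5.1]])
```

On this toy data the first query point lies near the "low" cluster and the second near the "high" cluster, so `predictions` assigns them accordingly.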