# Naive Bayes

## Summary

**Keywords**:
- supervised learning
- classification
    - binary
    - multiclass
- **generative model** - captures the joint probability $P(X,Y)$ when specifying the hypothetical random process that generates the data
    - Here, we model the distribution of the features for each label

### Assumptions

#### Overall Model

- **Independence**: All features are independent of one another, given the class
    - This is typically not true in the real world, which is why the algorithm is called "naive"
- **Equal**: All features affect the outcome equally

#### Features

- https://scikit-learn.org/stable/modules/naive_bayes.html
- **Gaussian / normal distribution** - for continuous features
    - Calculate the **likelihood**: ![](../img/naive-bayes-gaussian.png)
- **multinomial distribution** - best for discrete counts, e.g. how many times a word appears in a text; tf-idf values also work in practice
    - The distribution is parameterized by a vector of $\theta$s for each class $y$.
    - $\hat{\theta}_{yi}=\frac{N_{yi}+\alpha}{N_y+\alpha n}$
        - where $n$ is the number of features (in text classification, the size of the vocabulary)
        - where $\theta_{yi}$ is the probability $P(x_i|y)$ of feature $i$ appearing in a sample belonging to class $y$
        - where $N_{yi}$ is the number of times feature $i$ appears in the samples of class $y$ in the training set
        - where $N_y$ is the total count of all features for class $y$
        - setting $\alpha=1$ is Laplace smoothing, and $\alpha < 1$ is Lidstone smoothing.
    - This is a smoothed version of maximum likelihood, i.e. relative frequency counting
- **Bernoulli distribution** - best if your feature vectors are binary, e.g. whether a word is present or absent in text classification
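
The smoothed multinomial estimate $\hat{\theta}_{yi}$ above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation; the function name `smoothed_theta` and the word counts are made up for the example.

```python
def smoothed_theta(feature_counts, alpha=1.0):
    """Return the smoothed estimates theta_yi for a single class y.

    feature_counts[i] is N_yi: the total count of feature i over all
    training samples of class y (hypothetical input format).
    """
    n = len(feature_counts)    # n: number of features (vocabulary size)
    n_y = sum(feature_counts)  # N_y: total count of all features for class y
    return [(n_yi + alpha) / (n_y + alpha * n) for n_yi in feature_counts]


# Hypothetical word counts for one class over a 4-word vocabulary.
counts_spam = [3, 0, 1, 0]

theta_laplace = smoothed_theta(counts_spam, alpha=1.0)   # Laplace smoothing
theta_lidstone = smoothed_theta(counts_spam, alpha=0.5)  # Lidstone smoothing

print(theta_laplace)
print(theta_lidstone)
```

With $\alpha=1$, the words that never appeared in this class still get a probability of $1/8$ instead of $0$, and the estimates still sum to 1.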

### Pros

- Fast prediction, even for very large and/or high-dimensional datasets
- Relatively simple and interpretable
- Requires a small amount of training data to estimate the necessary parameters
- Few (if any) parameters to tune
- Can outperform sophisticated classification methods, especially when the independence assumption holds
- Performs well with categorical input variables
- Predicted posterior probabilities can provide an estimate of uncertainty

### Cons

- If a value of a categorical variable was not observed in training, the model assigns it a probability of 0, which zeroes out the whole product and makes a meaningful prediction impossible.
    - To avoid this, we use smoothing techniques.
- Assumptions, especially the independence assumption, are often violated in practice. As a result, naive Bayes can be a decent classifier but is known to be a bad probability estimator, so the predicted probabilities from `predict_proba` should not be taken too seriously.
- For numerical variables, a normal distribution within each class is typically assumed, which is a strong assumption

### Common Use Cases

- Text classification / spam filtering / sentiment analysis
- Multiclass classification
- Real-time predictions


## How It Works

### Bayes Theorem

Bayes' theorem provides a way of calculating the posterior probability $P(c|x)$ from $P(c)$, $P(x)$ and $P(x|c)$:

![](../img/bayes-theorem.png)

- $x$ is the **evidence**, since it has already been observed.
- $c$ is the **hypothesis** / class, since we are estimating its probability given the evidence.
- $P(c|x)$ is the **posterior probability** of the class ($c$, target) given the predictor ($x$, attributes). It's the probability of an event after the evidence is observed.
- $P(c)$ is the **prior probability** of the class. It's the probability of the event before the evidence is observed.
- $P(x|c)$ is the **likelihood**, which is the probability of the predictor given the class.
- $P(x)$ is the **prior probability** of the predictor, i.e. the overall probability of observing the evidence.
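
As a small worked example of the terms above, the sketch below computes a posterior from a prior, a likelihood, and the evidence. All numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x


# Hypothetical numbers for a spam filter and a single word x.
p_spam = 0.2              # P(c): prior probability of spam
p_word_given_spam = 0.6   # P(x|c): likelihood of the word in spam
p_word_given_ham = 0.05   # P(x|not c): likelihood of the word in non-spam

# P(x): probability of the word overall (law of total probability).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(p_spam_given_word)  # approximately 0.75
```

Observing the word raises the probability of spam from the prior of 0.2 to a posterior of about 0.75.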

### Applied to Multiple Features

By substituting the feature vector $(x_1, \dots, x_n)$ for $x$, expanding with the chain rule, and applying the independence assumption, we get:

![](../img/bayes-theorem-multiple-features.png)

For a given input, the denominator is the same for every class, so it can be ignored when comparing classes.
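
Written out, and assuming the figure above shows the standard naive Bayes factorization, the steps are as follows (with $c$ the class and $x_1, \dots, x_n$ the $n$ features). The chain rule expands the likelihood, and the independence assumption then reduces each factor to $P(x_i|c)$:

$$P(x_1, \dots, x_n | c) = P(x_1|c)\,P(x_2|x_1, c) \cdots P(x_n|x_1, \dots, x_{n-1}, c) = \prod_{i=1}^{n} P(x_i|c)$$

$$P(c | x_1, \dots, x_n) = \frac{P(c)\prod_{i=1}^{n} P(x_i|c)}{P(x_1, \dots, x_n)} \propto P(c)\prod_{i=1}^{n} P(x_i|c)$$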

### Applied to a Classifier

- Calculate the posterior probability for each class. The class with the highest posterior probability is the predicted class.
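
A minimal sketch of this decision rule for categorical features is shown below. The priors and likelihood tables are hypothetical and would normally be estimated from training data. Summing log-probabilities instead of multiplying probabilities is an implementation choice made here to avoid numerical underflow; it does not change which class wins.

```python
import math


def predict(priors, likelihoods, sample):
    """Return (best_class, scores) for one sample.

    scores[c] = log P(c) + sum_i log P(x_i | c), which is the log of the
    posterior up to the class-independent denominator.
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(sample):
            score += math.log(likelihoods[c][i][value])
        scores[c] = score
    best_class = max(scores, key=scores.get)
    return best_class, scores


# Hypothetical priors and per-feature likelihood tables P(x_i = value | c).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.3, "no": 0.7}],
    "ham": [{"yes": 0.1, "no": 0.9}, {"yes": 0.5, "no": 0.5}],
}

label, scores = predict(priors, likelihoods, ("yes", "no"))
print(label, scores)
```

For this sample the unnormalized posteriors are $0.4 \cdot 0.8 \cdot 0.7 = 0.224$ for spam and $0.6 \cdot 0.1 \cdot 0.5 = 0.03$ for ham, so the sketch picks spam.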

### Smoothing Techniques

- **zero-frequency problem** - if a class label and a certain attribute value never occur together in the training data, the frequency-based probability estimate is zero. Because all the probabilities are multiplied, the whole product becomes zero. To address this, we can use smoothing.
- **Laplace smoothing**: add 1 to each count in the numerator and the number of possible attribute values to the denominator (+2 for a binary attribute)
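
The effect of smoothing on a zero count can be sketched as follows. The helper name `conditional_prob` and the counts are hypothetical; the example assumes a binary attribute, so the denominator grows by 2.

```python
def conditional_prob(count_value_and_class, count_class, n_values, alpha=0.0):
    """Estimate P(value | class) with additive smoothing.

    alpha=0 gives the raw relative frequency; alpha=1 gives Laplace smoothing.
    n_values is the number of distinct values the attribute can take.
    """
    return (count_value_and_class + alpha) / (count_class + alpha * n_values)


# Hypothetical counts: the class was seen 10 times, never with this value.
count_class = 10
unsmoothed = conditional_prob(0, count_class, n_values=2)
smoothed = conditional_prob(0, count_class, n_values=2, alpha=1.0)

# One zero factor wipes out the whole product, however strong the others are.
other_likelihood = 0.9
print(other_likelihood * unsmoothed)  # 0.0
print(other_likelihood * smoothed)    # small but non-zero
```

With Laplace smoothing the unseen combination gets probability $1/12$ rather than $0$, so the other features can still influence the prediction.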


## Improving the Model

- If continuous features do not have a normal distribution, use a transformation or another method to bring them closer to a normal distribution.
- If the test data set has the zero-frequency issue (values unobserved in training), apply smoothing techniques so the model can still predict classes for the test data.
- Remove correlated features: highly correlated features are effectively counted twice in the model, which over-inflates their importance.
- There are not many (if any) parameters to tune. It's better to focus on feature selection and engineering.
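
To illustrate why correlated features matter, the sketch below feeds the same piece of evidence into a two-class naive Bayes posterior once and then twice, as if a perfectly correlated copy of the feature were present. The numbers and the function name `posterior_positive` are hypothetical.

```python
def posterior_positive(prior_positive, likelihood_pairs):
    """Posterior of the positive class in a two-class naive Bayes model.

    likelihood_pairs holds one (P(x_i | positive), P(x_i | negative))
    tuple per feature.
    """
    positive = prior_positive
    negative = 1 - prior_positive
    for p_pos, p_neg in likelihood_pairs:
        positive *= p_pos
        negative *= p_neg
    return positive / (positive + negative)


feature = (0.8, 0.4)  # hypothetical likelihoods for one observed feature

once = posterior_positive(0.5, [feature])
twice = posterior_positive(0.5, [feature, feature])  # duplicated feature

print(once, twice)
```

The duplicated feature adds no new information, yet the posterior moves from about 0.67 to about 0.8, because the model multiplies in the same evidence twice.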
