<h1 align=center>Naive Bayes Algorithm In Depth</h1>

![alt text](../Images/ml/naviepiccs.png)

- Supervised Algorithm
- Used in classification problems
- Affected by imbalanced datasets
- Does not require feature scaling
- Robust to outliers
- Can handle missing data
- Can be heavily used for text classification and text analysis
- Can be used with categorical and numerical data classification
- The naive Bayes classifier is fast because its probability calculations are simple
- Naive Bayes is usually faster than linear models and suits very large, high-dimensional datasets, but it is often less accurate than linear models
- The naive Bayes algorithm is based on Bayes' theorem

## Bayes Theorem

$$
P(A|B) = \frac {P(B|A) P(A)}{P(B)}\\
$$

- Below is how we derive Bayes' theorem from conditional probability:

$$
P(A|B) = \frac {P(A\cap B)}{P(B)}\\
P(B|A) = \frac {P(B\cap A)}{P(A)}\\
P(A\cap B)=P(B\cap A)\\
P(A|B)P(B)= P(A\cap B)\\
P(B|A)P(A)= P(B\cap A)\\
P(A|B)P(B)=P(B|A)P(A)\\
P(A|B) = \frac {P(B|A)P(A)}{P(B)}
$$

- P(A|B): the probability of event A given event B (posterior probability)
- P(B|A): the probability of event B given event A (likelihood)
- P(A): the probability of event A (prior)
- P(B): the probability of event B (marginal)
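
As a quick check of how the four terms fit together, here is a minimal sketch of the formula in Python. The function name `posterior` and the three input probabilities are assumptions made up for illustration; they do not come from any dataset in this document.

```python
def posterior(likelihood: float, prior: float, marginal: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if marginal <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / marginal


# Hypothetical numbers, chosen only to illustrate the formula.
p_b_given_a = 0.9  # likelihood P(B|A)
p_a = 0.2          # prior P(A)
p_b = 0.3          # marginal P(B)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
print(round(p_a_given_b, 2))
```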

### How naive Bayes trains on a dataset:

$$
P(A|B) = \frac {P(B|A)P(A)}{P(B)}\\
\text{Dataset:}\;\; X = \{x_1,x_2,x_3,\dots,x_n\}, \;\; \{y\}\\
P(y|x_1,x_2,x_3,\dots,x_n)=\frac{P(x_1|y)P(x_2|y)P(x_3|y)\dots P(x_n|y)\,P(y)}{P(x_1)P(x_2)P(x_3)\dots P(x_n)}\\
P(y|x_1,x_2,x_3,\dots,x_n)=\frac{P(y)\;\prod_{i=1}^n P(x_i|y)}{P(x_1)P(x_2)P(x_3)\dots P(x_n)}
$$

- X: the set of independent features (the inputs)
- y: the dependent variable (the class label to predict)
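
The numerator of the formula above can be sketched in a few lines of Python. Because the denominator is the same for every class, comparing the numerators is enough to pick the most likely class. The helper `class_scores`, the class names, and all probabilities below are hypothetical values chosen for illustration.

```python
from math import prod


def class_scores(priors, likelihoods):
    """Return P(y) * prod_i P(x_i | y) for every class y."""
    return {y: priors[y] * prod(likelihoods[y]) for y in priors}


# Hypothetical priors P(y) and per-feature likelihoods P(x_i | y).
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {"yes": [0.5, 0.8], "no": [0.25, 0.5]}

scores = class_scores(priors, likelihoods)
# The predicted class is the one with the largest score.
prediction = max(scores, key=scores.get)
print(prediction)
```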

### Naive Bayes Numerical Example

`Problem`: Given the dataset below, we are asked to decide whether the player should play when the weather is sunny.

**Goal:**

- Find P(Yes | Sunny) → The probability that the player plays when it is sunny
- Find P(No | Sunny) → The probability that the player does not play when it is sunny
- Compare both; the outcome with the higher probability is selected

![alt text](../Images/ml/naive1.png)

#### P(Yes | Sunny):

$$
P(A|B) = \frac {P(B|A)P(A)}{P(B)}, \;\;\;\; \text{Bayes' theorem formula}\\
\text{So we can write:}\\
P(Yes|Sunny) = \frac {P(Sunny|Yes)P(Yes)}{P(Sunny)}\\
P(Sunny|Yes) = \frac {3}{10} = 0.3\\
P(Yes) = \frac {10}{14} = 0.71\\
P(Sunny) = \frac {5}{14}=0.35 \\
P(Yes|Sunny) = \frac {0.3\times0.71}{0.35} = 0.60
$$

#### P(No| Sunny):

$$
P(No|Sunny) = \frac {P(Sunny|No)P(No)}{P(Sunny)}\\ P(Sunny|No) = \frac {2}{4} = 0.5\\ P(No) = \frac {4}{14} = 0.28\\ P(Sunny) = \frac {5}{14}=0.35 \\ P(No|Sunny) = \frac {0.5\times0.28}{0.35} = 0.4
$$

`Note`: Since P(Yes|Sunny) > P(No|Sunny), the player plays when the weather is sunny.
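
The arithmetic above can be re-checked with exact fractions, which avoids the rounding in the intermediate decimals. This is a small sketch that uses only the counts quoted in the two derivations; the variable names are chosen here for readability.

```python
from fractions import Fraction

# Counts taken from the worked example above.
p_sunny_given_yes = Fraction(3, 10)
p_yes = Fraction(10, 14)
p_sunny_given_no = Fraction(2, 4)
p_no = Fraction(4, 14)
p_sunny = Fraction(5, 14)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny
print(p_yes_given_sunny, p_no_given_sunny)  # 3/5 2/5

# Pick the outcome with the higher posterior probability.
decision = "Yes" if p_yes_given_sunny > p_no_given_sunny else "No"
print(decision)  # Yes
```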

### Naive Bayes Text Classification Example

Below is our dataset:

![alt text](../Images/ml/text2naive.png)

Before performing any operations, we need to preprocess the text data and convert our text data to numerical data.

**Text Preprocessing Steps:**

- Converting to lower case
- Tokenising
- Removing stop words
- Stemming words
- Removing punctuation
- Stripping out HTML tags
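
Most of these steps can be sketched with the Python standard library alone. The `preprocess` function and the tiny `STOP_WORDS` set below are hypothetical, and stemming is left out because it normally relies on a dedicated library; treat this as an illustration rather than a complete pipeline.

```python
import re
import string

# Tiny hypothetical stop-word list; real projects use a much larger one.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}


def preprocess(text):
    """Strip HTML tags, lower-case, drop punctuation, tokenise, remove stop words."""
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = text.lower()                   # convert to lower case
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                 # whitespace tokenisation
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("<p>The food is good!</p>")
print(tokens)  # ['food', 'good']
```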

After text preprocessing, the next step is to convert the words to vectors. For the above dataset we use a bag-of-words representation (count vectorizer), and the final dataset is:

- 1 → Positive
- 0 → Negative

![alt text](../Images/ml/navietext.png)

Now we will use the naive Bayes algorithm on the first sentence.

Sentence1: The food is good → after text preprocessing: food good

- food → we will call it feature x1
- good → we will call it feature x2

#### P(y=1|sentence1):

$$
P(A|B) = \frac {P(B|A)P(A)}{P(B)}, \;\;\;\; \text{Bayes' theorem formula}\\
\text{Based on the above formula we can write:}\\
P(y=1|sentence1) = P(y=1|(x_1,x_2))\\
= \frac {P(x_1|y=1)P(x_2|y=1)P(y=1)}{P(x_1)P(x_2)}
$$

$$
P(y=1)=\frac3{5}=0.6\\ P(x_1|y=1)=\frac{3}{3}=1 \\P(x_2|y=1)=\frac{1}{3}=0.33 \\ P(x_1)=\frac{1}{2}=0.5\\ P(x_2)=\frac{1}{2}=0.5
$$

$$
 P(y=1|(x_1,x_2)) = \frac {1 \times0.33 \times0.6}{0.5 \times 0.5} = 0.8
$$

#### P(y=0|sentence1):

$$
P(y=0|sentence1) = P(y=0|(x_1,x_2))\\= \frac {P(x_1|y=0)P(x_2|y=0)P(y=0)}{P(x_1)P(x_2)}\\ P(y=0)=\frac{2}{5}=0.4\\ P(x_1|y=0)=\frac{2}{2}=1 \\P(x_2|y=0)=\frac{0}{2}=0 \\ P(x_1)=\frac{1}{2}=0.5\\ P(x_2)=\frac{1}{2}=0.5
$$

$$
 P(y=0|(x_1,x_2)) = \frac { 1\times0 \times0.4}{0.5 \times 0.5} = 0
$$

`Note`: To normalize these two values so that they sum to 1, we proceed as below.

In probability theory, normalization often refers to scaling probabilities so that they sum to 1. This is particularly useful when dealing with conditional probabilities or probability distributions where the total probability must equal 1.

$$
P(y=1) = \frac {0.8}{0.8+0} = 1\\P(y=0) = 1 - P(y=1)=1-1=0\\
$$

From the above example we see that the probability of the positive class is higher, so we classify the sentence as positive.
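
The same calculation and normalization can be reproduced in a few lines of Python. The helper name `normalize` is an assumption made here; the input values are the ones used in the derivation above.

```python
def normalize(scores):
    """Scale the scores so that they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("at least one score must be non-zero")
    return {label: s / total for label, s in scores.items()}


# Values taken from the worked example above.
score_pos = (1 * (1 / 3) * 0.6) / (0.5 * 0.5)  # about 0.8
score_neg = (1 * 0 * 0.4) / (0.5 * 0.5)        # exactly 0

probs = normalize({1: score_pos, 0: score_neg})
print(probs)
```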

### **Types of Naive Bayes:**

1. **Gaussian Naive Bayes**: Assumes that features follow a Gaussian (normal) distribution. It is suitable for continuous data.
2. **Multinomial Naive Bayes**: Suitable for discrete features. It's commonly used in text classification tasks where features represent word counts or frequencies.
3. **Bernoulli Naive Bayes**: Assumes that features are binary-valued (e.g., presence or absence of a feature).
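
For the Bernoulli variant, a minimal sketch with scikit-learn's `BernoulliNB` could look like the following. The tiny binary dataset is hypothetical: each column marks the presence (1) or absence (0) of a feature, and the first column is constructed to separate the two classes.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features: 1 = feature present, 0 = feature absent.
X = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])
y = np.array([1, 1, 0, 0])  # the first feature separates the classes

clf = BernoulliNB()  # default Laplace smoothing (alpha=1.0)
clf.fit(X, y)

pred = clf.predict(np.array([[1, 0, 0], [0, 1, 1]]))
print(pred)
```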

### **Advantages of Naive Bayes:**

- Simple and easy to implement.
- Works well with high-dimensional data.
- Efficient and fast for training and prediction.
- Can handle both categorical and numerical data.
- Performs well with small datasets.

### **Limitations of Naive Bayes:**

- Strong feature independence assumption, which might not hold true in some cases.
- It is known to be a poor probability estimator, so its predicted probabilities are often not very accurate even when the predicted class is correct.
- Sensitivity to the presence of irrelevant features.
- Requires a relatively large amount of data for accurate estimation of probabilities.

```python
# GaussianNB: suited to continuous features such as the iris measurements
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
# Hold out half of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
# Fit on the training split, then predict labels for the test split
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print(y_pred)
```

Output:

```
[2 1 0 2 0 2 0 1 1 1 1 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1 1 1 2 0 2 0 0 1 2 2 1 2 1 2 1 1 2 1 1 2 1 2 1 0 2 1 1 1 1 2 0 0 2 1 0 0
 1]
```


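```python
# MultinomialNB: suited to count features such as word counts
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(1)
# Six samples with 100 random count features each (values 0 to 4)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB()
clf.fit(X, y)
# Predict the class of the third training sample
print(clf.predict(X[2:3]))
```

Output:

```
[3]
```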
