<span style="font-size: 24px;"> Naive Bayes </span> 

Naive Bayes is a simple classification algorithm based on Bayes' theorem. It estimates the probability of each class given the observed features, using probabilities learned from the training data. It often performs well on high-dimensional data.

<b> Pros and Cons for Naive Bayes </b>

Pros:
- Naive Bayes can be used on small amounts of training data.
- It can handle continuous and discrete data.
- It is simple and fast.
- Can be used for both binary and multi-class classification problems.

Cons:
- The assumption that the features are conditionally independent given the class rarely holds exactly in real data.
- If there is a categorical value in the test data set that doesn't appear in the training set, it will be allocated a probability of 0.

<span style="font-size: 20px;"> Bayes' Theorem </span> 

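Bayes' theorem gives the probability of a class C given a feature vector X:

P(C|X) = P(X|C) P(C) / P(X)

Here P(C|X) is the posterior, P(X|C) is the likelihood, P(C) is the prior and P(X) is the evidence.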
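
As a rough numeric illustration of the formula, the sketch below plugs made-up probabilities into Bayes' theorem for a two-class spam filter. The class names, the word "free" and all the numbers are assumptions chosen for the example, not values from any dataset.

```python
# Hypothetical inputs: prior class probabilities P(C) and the likelihood
# P(X|C) of seeing the word "free" in each class (illustrative numbers only).
prior = {"spam": 0.3, "ham": 0.7}
likelihood_free = {"spam": 0.6, "ham": 0.1}

# Evidence P(X): total probability of seeing "free" across both classes.
evidence = sum(likelihood_free[c] * prior[c] for c in prior)

# Posterior P(C|X) for each class via Bayes' theorem.
posterior = {c: likelihood_free[c] * prior[c] / evidence for c in prior}

print(posterior)
```

Even though spam is the less common class under these assumed priors, the word is so much more likely in spam that the posterior for spam ends up larger.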

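The 'naive' part is assuming the features are independent of one another. Two events A and B are independent when the probability that both occur equals the product of their individual probabilities, i.e. P(A ∩ B) = P(A)P(B). Naive Bayes makes this assumption conditionally on the class, so the likelihood of the whole feature vector factorises into per-feature terms: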

P(X1, X2, ..., Xn | C) = P(X1|C) P(X2|C) ... P(Xn|C)
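
A minimal sketch of this factorisation, assuming three features and two classes with invented per-feature likelihoods: the joint likelihood for each class is just the product of the individual terms.

```python
import math

# Hypothetical per-feature likelihoods P(Xi|C) for one observation with
# three features (illustrative numbers, not estimated from data).
per_feature_likelihood = {
    "spam": [0.6, 0.4, 0.1],
    "ham": [0.1, 0.3, 0.5],
}

# Under the naive assumption, P(X1, X2, X3 | C) is the product of the terms.
joint_likelihood = {
    c: math.prod(terms) for c, terms in per_feature_likelihood.items()
}

print(joint_likelihood)
```

Because each feature contributes a single factor, the number of quantities to estimate grows with the number of features rather than with the number of feature combinations.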

Naive Bayes works well for text classification because text datasets typically have high dimensionality. Text classification involves relatively simple relationships and is less likely to violate the naive independence assumption. Naive Bayes works well with high dimensionality because of th

<b> Types of Naive Bayes: </b>

- If the features are continuous and are assumed to follow a normal distribution within each class, you can use Gaussian Naive Bayes. You can normalise the data prior to using this.
- If the features are discrete counts, such as word counts in text, you can use multinomial Naive Bayes. The multinomial distribution generalises the binomial from two possible outcomes per trial to k possible outcomes.
- If the features are binary, you can use Bernoulli Naive Bayes.
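
The three variants map onto three scikit-learn classes. The sketch below, assuming scikit-learn and NumPy are installed, fits each one to a tiny invented dataset of the matching feature type. The arrays are hypothetical and far too small to say anything about real accuracy.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical labels shared by all three toy datasets.
y = np.array([0, 0, 1, 1])

# Continuous features, suited to GaussianNB.
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 4.2]])
# Count features (e.g. word counts), suited to MultinomialNB.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Binary presence/absence features, suited to BernoulliNB.
X_binary = (X_counts > 0).astype(int)

models = {
    "gaussian": (GaussianNB(), X_continuous),
    "multinomial": (MultinomialNB(), X_counts),
    "bernoulli": (BernoulliNB(), X_binary),
}

# Fit each model and predict on its own training data as a smoke test.
predictions = {}
for name, (model, X) in models.items():
    predictions[name] = model.fit(X, y).predict(X)
    print(name, predictions[name])
```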

Step 1) For each input feature value, calculate how often that value occurs relative to the number of examples to get P(X), then calculate the prior probability of each class, P(C). The likelihood P(X|C) can then be calculated from the assumed distribution (Gaussian, multinomial or Bernoulli).
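
For the Gaussian case, Step 1 might look like the sketch below: estimate a prior, a mean and a variance per class, then evaluate the normal density at a new value. The single feature, the class names `a` and `b`, and the numbers are all hypothetical.

```python
import math
from statistics import mean, pvariance

# Hypothetical training values of one continuous feature, grouped by class.
values_by_class = {"a": [1.0, 1.2, 0.8], "b": [3.0, 3.4, 2.6]}

# Prior P(C): fraction of training examples in each class.
n_total = sum(len(v) for v in values_by_class.values())
priors = {c: len(v) / n_total for c, v in values_by_class.items()}

# Per-class mean and (population) variance of the feature.
params = {c: (mean(v), pvariance(v)) for c, v in values_by_class.items()}


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Likelihood P(x|C) of a new observation under each class.
x_new = 1.1
likelihoods = {c: gaussian_pdf(x_new, mu, var) for c, (mu, var) in params.items()}

print(priors)
print(likelihoods)
```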

Step 2) Laplace Smoothing 

This is a technique for smoothing categorical data. A small-sample correction, or pseudo-count, is added to every count before the probabilities are estimated, so no estimated probability is exactly zero. This prevents a single unseen value from making the whole product 0 (and avoids division by 0 when a count is empty). A related numerical issue is underflow, where multiplying many very small probabilities reaches the limit of floating-point precision and the result is rounded down to 0. Smoothing alone does not prevent this; it is usually handled by summing log-probabilities instead of multiplying probabilities.
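
A minimal sketch of the pseudo-count idea, assuming the common additive form (count + alpha) / (total + alpha * k), where k is the number of possible categories. The function name and the counts are hypothetical.

```python
def smoothed_probability(count, total, n_categories, alpha=1.0):
    """Additive (Laplace) smoothing: add alpha to every category's count."""
    return (count + alpha) / (total + alpha * n_categories)


# Hypothetical counts of three categories within one class (10 examples).
counts = {"red": 0, "green": 4, "blue": 6}
total = sum(counts.values())

unsmoothed = {v: n / total for v, n in counts.items()}
smoothed = {
    v: smoothed_probability(n, total, len(counts)) for v, n in counts.items()
}

print(unsmoothed)  # "red" has probability 0 here
print(smoothed)    # every category now has a non-zero probability
```

The smoothed estimates still sum to 1, because alpha is added once per category in both the numerator and the denominator.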

Step 3) Calculate the posterior probability P(C|X) via Bayes' theorem.

Step 4) The class with the highest posterior probability is the predicted class.
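
The four steps can be sketched from scratch for purely categorical features. This is an illustrative outline rather than a reference implementation: the function names and the toy weather-style data are hypothetical, it assumes `alpha > 0`, and it sums log-probabilities instead of multiplying raw probabilities to stay numerically stable.

```python
import math
from collections import Counter


def fit_categorical_nb(rows, labels, alpha=1.0):
    """Step 1: count classes and per-class feature values."""
    n_features = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[c][i][v]: number of class-c rows whose feature i equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    categories = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
            categories[i].add(v)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, class_counts, value_counts, categories, alpha


def predict_categorical_nb(model, row):
    """Steps 2-4: smooth, compute posteriors, pick the most probable class."""
    priors, class_counts, value_counts, categories, alpha = model
    log_scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for i, v in enumerate(row):
            k = len(categories[i])
            # Step 2: Laplace-smoothed likelihood P(Xi = v | C = c)
            p = (value_counts[c][i][v] + alpha) / (class_counts[c] + alpha * k)
            score += math.log(p)
        log_scores[c] = score
    # Step 3: normalise the scores so they sum to 1 (dividing by P(X))
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    evidence = sum(unnormalised.values())
    posteriors = {c: u / evidence for c, u in unnormalised.items()}
    # Step 4: the class with the highest posterior is the prediction
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "no", "yes", "yes"]

model = fit_categorical_nb(rows, labels)
predicted, posteriors = predict_categorical_nb(model, ("overcast", "mild"))
print(predicted, posteriors)
```

In this toy data "overcast" never occurs with the class "no", so without smoothing the posterior for "no" would be exactly 0. With add-one smoothing it stays small but non-zero.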

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Stand-in data so the cell runs end to end (assumption: replace with your own X and y)
X, y = load_iris(return_X_y=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Gaussian Naive Bayes classifier
naive_bayes_classifier = GaussianNB()

# Train the classifier on the training data
naive_bayes_classifier.fit(X_train, y_train)

# Make predictions on the testing data
predictions = naive_bayes_classifier.predict(X_test)

# Calculate and report the accuracy of the classifier
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")