# Naive Bayes

### Bayes' Theorem
* Think about making a prediction given a certain set of information.
* You have the **Prior**, which is what you would believe before seeing the data. Typically, this is the proportion of training samples that belong to a given category.
* You have the **Likelihood** function, which is the probability of observing the data given the category. This is separate from the posterior, which is the probability of the category given the data.
* You have the **Normalization**, which is how likely you are to observe the data at all. You divide by it so that the result is a probability conditioned on ("given") the data.

>$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$
0. $P(Y|X) = \text{Posterior}$
1. $P(Y) = \text{Prior}$
2. $P(X|Y) = \text{Likelihood}$
3. $P(X) = \text{Normalization}$
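
As a quick check of the formula above, here is a minimal sketch with made-up numbers. The two category names and all probabilities are illustrative assumptions, not taken from any dataset. In this sketch the normalization $P(X)$ is obtained by summing $P(X|Y)P(Y)$ over the categories, which guarantees that the posteriors sum to 1.

```python
# Hypothetical two-category example; every number here is made up.
toy_priors = {"spam": 0.3, "ham": 0.7}        # P(Y)
toy_likelihoods = {"spam": 0.8, "ham": 0.1}   # P(X|Y) for one observed X

# Normalization P(X): total probability of observing X under any category
toy_normalization = sum(toy_likelihoods[c] * toy_priors[c] for c in toy_priors)

# Posterior P(Y|X) for each category via Bayes' theorem
toy_posteriors = {
    c: toy_likelihoods[c] * toy_priors[c] / toy_normalization
    for c in toy_priors
}

print(toy_posteriors)
```

Even though "ham" has the larger prior, the much larger likelihood for "spam" makes it the more probable category once the data is taken into account.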

### The "Naive" assumption
* To estimate $P(X|Y)$ exactly, you would need to have already observed cases with exactly the same combination of feature values. This is unlikely when the data has many features that each take many values.
* Instead, you can naively assume that all of the features are independent of each other given the category. The likelihood can then be expressed as the product of the per-feature probabilities: $P(X|Y) = \prod_i P(x_i|Y)$.
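
The product of per-feature probabilities can be sketched as below. This is only an illustration under stated assumptions: the function name `naive_likelihood`, its parameters, and the toy data are hypothetical. The `alpha` parameter applies additive (add-one) smoothing, which keeps a single unseen feature value from forcing the whole product to zero.

```python
import numpy as np

def naive_likelihood(x_cat, sample, alpha=1.0, n_values=2):
    """Estimate P(sample | category) as a product of per-feature frequencies.

    x_cat    : 2D array whose rows are training samples of ONE category
    sample   : 1D array of feature values to score
    alpha    : additive smoothing strength (0 disables smoothing)
    n_values : number of distinct values a feature can take (2 for binary)
    """
    x_cat = np.asarray(x_cat)
    sample = np.asarray(sample)
    n = x_cat.shape[0]
    # For each feature, count the rows whose value matches the sample's value
    matches = np.count_nonzero(x_cat == sample, axis=0)
    # Smoothed estimate of P(x_i | category) for every feature
    probs = (matches + alpha) / (n + alpha * n_values)
    # Naive assumption: multiply the per-feature probabilities
    return float(np.prod(probs))

# Made-up binary data: 4 samples of one category, 3 features
toy_x = np.array([[1, 0, 1],
                  [1, 1, 1],
                  [0, 0, 1],
                  [1, 0, 1]])
toy_sample = np.array([1, 0, 0])  # third feature value never seen in toy_x

print(naive_likelihood(toy_x, toy_sample, alpha=0.0))  # unsmoothed
print(naive_likelihood(toy_x, toy_sample, alpha=1.0))  # add-one smoothing
```

Without smoothing the unseen value in the third feature makes the likelihood exactly zero; with `alpha=1.0` it stays small but positive.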

<center>Resources</center>

https://en.wikipedia.org/wiki/Additive_smoothing

https://stats.stackexchange.com/questions/218492/how-does-naive-bayes-work-with-continuous-variables

In [12]:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import sklearn as skl

# Additive smoothing (add-one smoothing)


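# Naive Bayes expressed in code
def naive_bayes(y, x, categories, data):
    """Estimate P(category | data) for one sample from counts in (x, y)."""
    y = np.array(y)
    x = np.array(x)
    n_samples = len(y)

    # Normalization: product of the marginal frequencies of each observed
    # feature value. It does not depend on the category, so compute it once.
    # Note: this treats the features as unconditionally independent, so it
    # only approximates P(X) and the posteriors need not sum to 1.
    normalization = 1
    for i in range(len(data)):
        normalization *= np.count_nonzero(x[:, i] == data[i]) / n_samples

    posteriors = {}
    likelihood_all = []
    prior_all = []
    normalization_all = []

    for category in categories:
        given_cat = (y == category)
        n_cat = np.count_nonzero(given_cat)

        # Prior: fraction of training samples in this category
        prior = n_cat / n_samples

        # Likelihood: product over features of P(feature value | category)
        x_given_cat = x[given_cat]
        likelihood = 1
        for i in range(len(data)):
            feature = x_given_cat[:, i]
            likelihood *= np.count_nonzero(feature == data[i]) / n_cat

        likelihood_all.append(likelihood)
        prior_all.append(prior)
        normalization_all.append(normalization)

        posteriors[category] = prior * likelihood / normalization
    return (posteriors, prior_all, likelihood_all, normalization_all)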

In [13]:
# Trying to use Naive bayes with Reuters dataset

from tensorflow.keras.datasets import reuters

np_load_old = np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words = 10000)

np.load = np_load_old

def vectorizer(sequences, dimension=10000):
    # Convert lists of different sizes to boolean vectors with length equal to the total number of words
    vectorized = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        vectorized[i, sequence] = 1
    return vectorized

x_train = vectorizer(train_data)
x_test = vectorizer(test_data)
y_train = np.array(train_labels)
y_test = np.array(test_labels)

categories = np.arange(max(train_labels)+1)

posteriors, priors, likelihoods, normalizations = naive_bayes(y_train, x_train, categories, x_test[0])

In [18]:
import matplotlib
import matplotlib.pyplot as plt

plt.plot(np.arange(len(priors)), priors, c = 'r', label = 'Prior')
plt.plot(np.arange(len(likelihoods)), likelihoods, c = 'g', label = 'Likelihood')
plt.plot(np.arange(len(normalizations)), normalizations, c = 'b', label = 'Normalization')
plt.xlabel('Category')
plt.ylabel("Value of Component of Bayes' Theorem")
plt.title('Relationship between Value of Prior, Likelihood and Normalization and Category')
plt.legend()

<Figure size 432x288 with 1 Axes>

### Type of Data
* Particularly use