**Disclaimers:**
- All code was typed by me, not copied and pasted
- The code is from the book "Machine Learning with Python Cookbook" by Chris Albon (1st ed.).
- All comments are written in my own words, based on my previous knowledge and on the comments of the book. 
- My comments are usually short and concise, since the goal of this notebook is to implement the code, not to teach about the model

# 18. Naive Bayes
## 18.0 Introduction

- The Naive Bayes classifier will predict the class with the highest probability _a posteriori_.
- Assuming that the features of a dataset are independent, the full likelihood can be computed as the product of the likelihood for each feature (_Naive_ Assumption)
- We have to previously define the statistical distribution of the likelihood (usually Gaussian, Multinomial or Bernoulli).
- To classify a new observation, compute the posterior probability of each class from the likelihood and the _prior_ probability of that class. The class with the highest posterior probability is the model's prediction.
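
The steps above can be sketched in plain Python. This is only a minimal illustration, assuming Gaussian likelihoods and a made-up toy dataset (the labels `"a"`/`"b"`, the values, and the helper names `gaussian_pdf` and `posteriors` are all hypothetical, not from the book):

```python
import math

# Hypothetical toy data: class label -> list of (feature_1, feature_2) rows
train = {
    "a": [(1.0, 5.1), (1.2, 4.9), (0.8, 5.3), (1.1, 5.0)],
    "b": [(3.0, 2.0), (3.2, 2.4), (2.8, 1.9), (3.1, 2.2)],
}

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posteriors(observation):
    """Return P(class | observation) for every class under the naive assumption."""
    n_total = sum(len(rows) for rows in train.values())
    scores = {}
    for label, rows in train.items():
        prior = len(rows) / n_total  # prior estimated from class frequency
        likelihood = 1.0
        for j, x in enumerate(observation):
            column = [row[j] for row in rows]
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            likelihood *= gaussian_pdf(x, mean, var)  # product over features
        scores[label] = prior * likelihood
    evidence = sum(scores.values())  # normalising constant
    return {label: score / evidence for label, score in scores.items()}

result = posteriors((1.0, 5.0))
prediction = max(result, key=result.get)  # class with the highest posterior
print(result, prediction)
```

Real implementations usually work with log-probabilities instead of multiplying raw densities, to avoid numerical underflow when there are many features.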



## 18.1 Training a Classifier for Continuous Features
**Problem:**
- Train a Naive Bayes Classifier with a dataset that has continuous features only

**Solution:**
- Gaussian Naive Bayes

In [2]:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
features = iris.data
target = iris.target

classifier = GaussianNB()

model = classifier.fit(features, target)

In [3]:
new_observation = [[4, 4, 4, 0.4]]

model.predict(new_observation)

array([1])

**Discussion:**
- In the previous case, the priors for each class were estimated from the class frequencies in the dataset (frequentist approach).
- We can instead define the prior probabilities of each class ourselves, as below.
- NOTE: This approach could be more _Bayesian_ if we did some Bayesian modelling hehe. (See the pymc library and this PyData talk as an introduction.)
- https://www.youtube.com/watch?v=911d4A1U0BE

In [5]:
clf = GaussianNB(priors=[0.25, 0.25, 0.5])
model = clf.fit(features, target)
model.predict(new_observation)

array([1])
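
A quick way to see the difference between the two models is to inspect the fitted `class_prior_` attribute. This is a small sketch, not from the book; the variable names `default_model` and `custom_model` are my own:

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
features, target = iris.data, iris.target

# Priors estimated from the data (iris has 50 observations per class)
default_model = GaussianNB().fit(features, target)

# Priors supplied by us
custom_model = GaussianNB(priors=[0.25, 0.25, 0.5]).fit(features, target)

print(default_model.class_prior_)
print(custom_model.class_prior_)
```

Since iris is balanced, the estimated priors should all be one third, while the second model simply stores the priors we passed in.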

**Discussion:**
- The raw predicted probabilities from Gaussian Naive Bayes (`predict_proba`) are not calibrated! They should not be interpreted as the true probability of each class (see 18.4)

## 18.2 Training a Classifier for Discrete and Count Features
**Problem:**
- Given discrete or count data, train a Naive Bayes Classifier

**Solution:**
- The dreaded but amazing Multinomial distribution
- See this video for an intro to the Multinomial: 
- https://www.youtube.com/watch?v=Dkc_hcVWDpA
- Yes, I do feel very _statistical_ by knowing a bit about the Multinomial Distribution xD

In [6]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

text_data = np.array(['I love Brazil. Brazil!',
                      'Brazil is best',
                      'Germany beats both'])

**Comment:**
- Hahahahah this is actually the example from the book. See page 333.
- I'm pretty sure he's referring to the World Cup and the 7x1
- It is crazy to find a reference to this in an ML book.


In [9]:
# Create bag of words - Bilbo would be proud
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

features = bag_of_words.toarray()
target = np.array([0, 0, 1])

# Note that 'brazil' is column index 3 (the 4th column); the single-letter token 'I' is dropped
features

array([[0, 0, 0, 2, 0, 0, 1],
       [0, 1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0, 0]], dtype=int64)
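
To check which column belongs to which word, the fitted vectorizer exposes a `vocabulary_` dictionary mapping each token to its column index. A small sketch (the variable `brazil_column` is my own name):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text_data = np.array(['I love Brazil. Brazil!',
                      'Brazil is best',
                      'Germany beats both'])

count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# vocabulary_ maps each (lowercased) token to its column index
for word, index in sorted(count.vocabulary_.items(), key=lambda item: item[1]):
    print(index, word)

brazil_column = count.vocabulary_['brazil']
```

With the default settings, tokens are lowercased and one-character tokens such as "I" are ignored, which is why there are 7 columns and not 8.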

In [None]:
# Priors as given in the book; note that they do not sum to 1
classifier = MultinomialNB(class_prior=[0.25, 0.5])
model = classifier.fit(features, target)

**Discussion:**
- We are predicting if a message (tweet?) comes from a pro-Brazil or pro-Germany person.
- We created a bag of words to encode the text data. A better encoding is possible
- We specified the prior probabilities of each class
- MultinomialNB has an ``alpha`` parameter that does additive (Laplace/Lidstone) _smoothing_: `alpha` is added to every feature count, so a word never seen in a class does not get a likelihood of zero. The default is `alpha=1.0`
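
A rough sketch of what `alpha` does, assuming it is the additive smoothing described in the scikit-learn documentation: the smoothed likelihood of feature `i` in class `y` is `(count_yi + alpha) / (count_y + alpha * n_features)`. The manual computation below (variable names are my own) is compared against the fitted `feature_log_prob_`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Same bag of words as above
features = np.array([[0, 0, 0, 2, 0, 0, 1],
                     [0, 1, 0, 1, 0, 1, 0],
                     [1, 0, 1, 0, 1, 0, 0]])
target = np.array([0, 0, 1])

alpha = 1.0
model = MultinomialNB(alpha=alpha).fit(features, target)

# Per-class word counts, smoothed by hand
counts = model.feature_count_
n_features = counts.shape[1]
manual = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_features)

# The same likelihoods as estimated by scikit-learn
sklearn_probs = np.exp(model.feature_log_prob_)

print(manual[:, 3])         # likelihood of 'brazil' in each class
print(sklearn_probs[:, 3])
```

Without smoothing, "brazil" would have a likelihood of exactly zero in class 1, and any message containing it could never be assigned to that class.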

## 18.3 Training a Naive Bayes Classifier for Binary Features
**Problem:**
- Given binary data, train a Naive Bayes Classifier

**Solution:**
- The Bernoulli distribution, mother of all distribs
- All hail the Bernoulli!
- Lots of jokes in this notebook, huh? Sorry for that

In [17]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB

features = np.random.randint(2, size=(100, 3))
target = np.random.randint(2, size=(100,1)).ravel() # ravel flattens (100, 1) to (100,); size=100 would give that shape directly

In [18]:
# Priors as given in the book; note that they do not sum to 1
classifier = BernoulliNB(class_prior=[0.25, 0.5])
model = classifier.fit(features, target)

**Discussion:**
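- `BernoulliNB` also has an $\alpha$ smoothing parameter
- For the discrete NB classifiers (`MultinomialNB`, `BernoulliNB`), we can specify a uniform prior by setting `fit_prior=False` and leaving `class_prior` unset. `GaussianNB` has no `fit_prior`; use its `priors` argument instead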
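
A small sketch of the uniform prior, assuming `fit_prior=False` with no `class_prior` given. It uses a seeded NumPy generator (my own choice, for reproducibility) and creates the target directly with `size=100`, which gives the same 1-D shape as the `ravel()` version:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
features = rng.integers(2, size=(100, 3))
target = rng.integers(2, size=100)  # already shape (100,), no ravel needed

# fit_prior=False: do not learn the priors from the data, use a uniform prior
model = BernoulliNB(fit_prior=False).fit(features, target)

uniform_prior = np.exp(model.class_log_prior_)
print(target.shape, uniform_prior)
```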

## 18.4 Calibrating Predicted Probabilities
**Problem:**
- Calibrate the predicted probabilities from the NB classifiers so they are interpretable (i.e. they can be read as the probability of each class)

**Solution:**
- ``CalibratedClassifierCV``

In [19]:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

In [21]:
iris = datasets.load_iris()
features = iris.data
target = iris.target

classifier = GaussianNB()

classifier_sigmoid = CalibratedClassifierCV(classifier, cv=2, method='sigmoid')

classifier_sigmoid.fit(features, target)

new_observation = [[2.6, 2.6, 2.6, 0.4]]

classifier_sigmoid.predict_proba(new_observation)

array([[0.31859969, 0.63663466, 0.04476565]])

**Discussion:**
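- We are calibrating the predicted probabilities of each class with `method='sigmoid'` (Platt scaling).
- As I understand it now, this is more than "squashing" the results between 0 and 1: a sigmoid (logistic) curve is fitted that maps the classifier's outputs to calibrated probabilities.
- With `cv=2`, the data is split into 2 folds (k-fold cross validation). In each split the classifier is trained on the training part and the calibrator is fitted on the held-out part; the returned probabilities are the average over the folds.
- The cross validation is there so that the calibrator is not fitted on the same data that was used to train the classifier, which would bias the calibration.
- The book mentions two possible calibration methods: Platt's sigmoid model and isotonic regression (`method='isotonic'`).
- Research the details later!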
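
As a starting point for that research, here is a minimal sketch comparing the two `method` options of `CalibratedClassifierCV` on the same observation. The loop and the `probabilities` dictionary are my own additions, not from the book:

```python
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
features, target = iris.data, iris.target
new_observation = [[2.6, 2.6, 2.6, 0.4]]

probabilities = {}
for method in ("sigmoid", "isotonic"):
    # Each fold trains a GaussianNB and fits a calibrator on the held-out part
    calibrated = CalibratedClassifierCV(GaussianNB(), cv=2, method=method)
    calibrated.fit(features, target)
    probabilities[method] = calibrated.predict_proba(new_observation)[0]
    print(method, probabilities[method])
```

The two methods will generally give different numbers. As far as I know, isotonic regression is more flexible but tends to need more data than the sigmoid method, so on a dataset as small as iris the sigmoid option is probably the safer choice.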