# Notebook 4 - Naive-Bayes and Logistic Regression in NLP

## 7. Naive-Bayes Classifier

As stated in the previous notebook, Naive-Bayes is a supervised, probabilistic classifier. It applies Bayes' theorem, which describes how the occurrence of one event changes the probability of another. But how exactly does it work?

### 7.1 NB Fundamentals
The general purpose of a classifier is to *classify* samples from a dataset into two or more **classes**. Since we want to classify text, we will use the term **document** instead of *sample*. Thus, the classifier's task is to take an input document *d* and, out of all possible classes, return the class *c* to which *d* belongs.

Now, since NB is a probabilistic classifier, its job is to return the class c that **maximizes the probability** of the class given the input document d.
<div style="text-align:center"><img src="res/eq1.png" alt="!!!equation 4.1!!!" width="300"/></div>

The intuition of Bayesian classification is to use **Bayes’ rule** to rewrite the probability above in terms of other probabilities that have some useful properties.
<div style="text-align:center"><img src="res/eq2.png" alt="!!!equation 4.2!!!" width="300"/></div>

We then substitute the second equation (Bayes' rule) into the first one to get:
<div style="text-align:center"><img src="res/eq3.png" alt="!!!equation 4.3!!!" width="400"/></div>

We can conveniently simplify the above equation by dropping the denominator *P(d)*. This is possible because we compute *P(d|c)P(c) / P(d)* for each possible class, but *P(d)* does not change from class to class; we are always asking about the most likely class for the same document *d*, which must have the same probability *P(d)*. Thus, we can choose the class that maximizes the simpler formula:

<div style="text-align:center"><img src="res/eq4.png" alt="!!!equation 4.4!!!" width="400"/></div>
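
For readers who prefer to see the whole chain in one place, the steps above can be written out in LaTeX roughly as follows. The notation is an assumption based on the surrounding text rather than a copy of the equation images: $\hat{c}$ stands for the predicted class and $C$ for the set of all classes.

```latex
\begin{aligned}
\hat{c} &= \operatorname*{argmax}_{c \in C} P(c \mid d) \\
        &= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} \\
        &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)
\end{aligned}
```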

Okay, but how do we actually represent a document *d*? We can represent a document as a set of **features** `d = (f1, f2, f3 ... fn)`. One way to define these features is to use the Bag-of-words (BOW) model introduced in Notebook 2. After constructing the BOW for the complete dataset, we can express each document as a vector of word counts. Each vector entry corresponds to a different word and acts as a separate feature, telling us which words (and optionally how many times) are used in the document. Here we also introduce the first of two **simplifying assumptions**: since we use BOW, the **word ordering doesn't matter**. We don't care about the position of a word in a document.
<div style="text-align:center"><img src="res/eq5.png" alt="!!!equation 4.6!!!" width="400"/></div>
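
To make the "word ordering doesn't matter" assumption concrete, here is a minimal sketch of a bag-of-words count vector in plain Python. The toy corpus, the whitespace tokenization and the helper name `to_count_vector` are assumptions made for illustration; they are not taken from Notebook 2.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and tokenized by whitespace.
corpus = ["great match", "match boring", "close election"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})


def to_count_vector(document, vocabulary):
    """Return the word counts of `document`, one entry per vocabulary word."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


print(vocabulary)
# Two documents with the same words in a different order...
print(to_count_vector("match great match", vocabulary))
print(to_count_vector("great match match", vocabulary))
```

Both documents map to the same vector, so any classifier built on these features cannot tell them apart.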

However, calculating `P(f1, f2, f3 ... fn | c)` directly would require estimating a probability for every possible combination of features (and even more if the BOW uses sum pooling!). We therefore need another simplifying assumption, called the **naive Bayes assumption**: the features are conditionally independent given the class. Hence, we can multiply the individual probabilities as follows:
<div style="text-align:center"><img src="res/eq6.png" alt="!!!equation 4.7!!!" width="400"/></div>

This results in the final equation:
<div style="text-align:center"><img src="res/eq8.png" alt="!!!equation 4.8!!!" width="400"/></div>
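
The decision rule can be sketched in a few lines of Python: score each class by its prior multiplied by the product of the per-feature likelihoods, then pick the class with the highest score. The function names and the probabilities below are made up for illustration only.

```python
import math


def nb_score(prior, word_probs, features):
    """Unnormalized posterior: P(c) times the product of P(f_i | c)."""
    return prior * math.prod(word_probs[f] for f in features)


def nb_classify(priors, likelihoods, features):
    """Return the class with the highest score, plus all class scores."""
    scores = {c: nb_score(priors[c], likelihoods[c], features) for c in priors}
    return max(scores, key=scores.get), scores


# Made-up probabilities for two hypothetical classes "A" and "B".
priors = {"A": 0.5, "B": 0.5}
likelihoods = {
    "A": {"x": 0.5, "y": 0.25},
    "B": {"x": 0.125, "y": 0.5},
}

best, scores = nb_classify(priors, likelihoods, ["x", "y", "x"])
print(best, scores)
```

Note that a repeated feature (here `"x"`) simply contributes its probability once per occurrence.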


### 7.2 NB Training

So how do we train the classifier? How does it learn *P(c)* and *P(f|c)*? Starting with the first probability, we can simply use frequencies and derive it from the definition of probability: the probability of a class is the number of documents of this class divided by the total number of documents in the dataset.
<div style="text-align:center"><img src="res/eq8,5.png" alt="!!!equation 4.11!!!" width="200"/></div>
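
As a quick illustration, the prior estimate is just a relative frequency over the training labels. The helper name `class_priors` and the label list are hypothetical.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(c) as (documents labelled c) / (all documents)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels: three documents of one class, one of another.
labels = ["spam", "ham", "spam", "spam"]
print(class_priors(labels))
```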

Learning the probability of a feature given a class, *P(f<sub>i</sub>|c)*, isn't much more complicated. We assume a feature is just the existence of a word in the document’s bag of words (drawn from the vocabulary *V*), so we want *P(w<sub>i</sub>|c)*, which we compute as the **fraction of times the word w<sub>i</sub> appears among all words in all documents of class c**.
<div style="text-align:center"><img src="res/eq9.png" alt="!!!equation 4.12!!!" width="400"/></div>

Let's consider the following example:

This is our training data:

| **Text** | **Label** |
|----------|-----------|
|"What a great match" | sports |
|"The election results will be out tomorrow"| not sports |
|"The match was very boring"| sports |
|"It was a close election"| not sports |

To make the example easier to follow, let’s assume we applied some pre-processing to the sentences and removed stopwords. The resulting sentences are:

| **Text** | **Label** |
|----------|------------|
|"great match"|  sports |
|"election results tomorrow"| not sports |
|"match boring"| sports |
|"close election"| not sports |

In our small corpus, we have 2 classes, each with 2 sentences. Hence, the probability of each class is:

        P("sports") = 2/4 = 0.5
        P("not sports") = 2/4 = 0.5

The "sports" documents contain 4 word tokens in total (great, match, match, boring; 3 unique words) and the "not sports" documents contain 5 (election, results, tomorrow, close, election; 4 unique words). Following the definition of *P(w|c)* above, these totals are the denominators of the word likelihoods.

Let's say we want to assign a class to the following sentence: "that was a very close, great match". After stop word removal it becomes "**close great match**". Now we need to perform some calculations:

1. Likelihood P("close great match"|sports) = P("close"|sports) * P("great"|sports) * P("match"|sports)
2. Likelihood P("close great match"|not sports) = P("close"|not sports) * P("great"|not sports) * P("match"|not sports)

| word | P(word\|sports) | P(word\|not sports)|
|------|----------------|-------------------|
| close | 0/4 | 1/5 |
| great | 1/4 | 0/5 |
| match | 2/4 | 0/5 |

We suspect that the correct class is "sports", right? Let's see what happens with the likelihood:

P("close great match"|sports) = 0/4 * 1/4 * 2/4 = 0

Oops... This example shows a very common situation: no training document labelled "sports" contains the word "close". As a result, P("close"|sports) is a painful zero, which makes the whole product of probabilities equal 0. We can solve this issue using **smoothing**.
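
The zero can be reproduced with a short sketch in plain Python. It assumes the pre-processed training sentences from the table above and uses the total number of word tokens in a class as the denominator, following the definition of *P(w|c)* in section 7.2. The helper name `likelihood` is hypothetical.

```python
from collections import Counter
import math

# Pre-processed training sentences from the example, grouped by class.
docs_by_class = {
    "sports": ["great match", "match boring"],
    "not sports": ["election results tomorrow", "close election"],
}


def likelihood(words, documents):
    """Unsmoothed P(words | c): product of count(w, c) / (all tokens in c)."""
    counts = Counter(word for doc in documents for word in doc.split())
    total = sum(counts.values())
    # Counter returns 0 for words never seen in this class.
    return math.prod(counts[w] / total for w in words)


test_words = "close great match".split()
for label, documents in docs_by_class.items():
    print(label, likelihood(test_words, documents))
```

Both classes end up with a likelihood of zero: "close" never occurs in "sports", and "great" and "match" never occur in "not sports", so without smoothing the classifier has nothing to compare.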

### 7.4 NB Smoothing & Unknown words

<!-- TODO -->


### 7.5 Completed NB example & Implementation

OK, so let's complete our example using Laplace (add-one) smoothing: we add 1 to every word count in the numerator and the vocabulary size |V| to every denominator.

The size of our vocabulary is |V| = 7, and the classes contain 4 ("sports") and 5 ("not sports") word tokens, so the denominators become 4 + 7 = 11 and 5 + 7 = 12.

| word | P(word\|sports) | P(word\|not sports)|
|------|----------------|-------------------|
| close | 1/11 | 2/12 |
| great | 2/11 | 1/12 |
| match | 3/11 | 1/12 |

P("close great match"|sports) = 1/11 * 2/11 * 3/11 ≈ 0.0045

P("close great match"|not sports) = 2/12 * 1/12 * 1/12 ≈ 0.0012

Now we have to multiply each likelihood by the prior probability of its class (posterior ∝ likelihood * prior), which results in:

P("close great match"|sports) * P(sports) ≈ 0.0045 * 0.5 ≈ 0.0023

P("close great match"|not sports) * P(not sports) ≈ 0.0012 * 0.5 ≈ 0.0006

Thus, there is a higher probability that the test document is indeed about sports and this would be the decision of the NB classifier.
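
Before moving to a library, the smoothed calculation can be checked by hand with a small sketch. It assumes add-one smoothing with the denominator "word tokens in the class plus |V|", and it uses `fractions.Fraction` to keep the arithmetic exact. The helper name `smoothed_score` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
import math

# Pre-processed training sentences from the example, grouped by class.
docs_by_class = {
    "sports": ["great match", "match boring"],
    "not sports": ["election results tomorrow", "close election"],
}
vocabulary = {w for docs in docs_by_class.values() for d in docs for w in d.split()}
n_docs = sum(len(docs) for docs in docs_by_class.values())


def smoothed_score(words, label):
    """P(c) * product of (count(w, c) + 1) / (tokens in c + |V|)."""
    documents = docs_by_class[label]
    counts = Counter(word for doc in documents for word in doc.split())
    denominator = sum(counts.values()) + len(vocabulary)
    prior = Fraction(len(documents), n_docs)
    return prior * math.prod(Fraction(counts[w] + 1, denominator) for w in words)


test_words = "close great match".split()
scores = {label: smoothed_score(test_words, label) for label in docs_by_class}
for label, score in scores.items():
    print(label, score, float(score))
print("predicted:", max(scores, key=scores.get))
```

No class score is zero any more, so the comparison between the two classes is meaningful.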

Now, let's implement exactly the same example using `Python` and `scikit-learn`!

In [None]:
# 1 - sports, 0 - not sports
training_corpus = ["What a great match",
          "The election results will be out tomorrow",
          "The match was very boring",
          "It was a close election",
          ]
training_labels = [1, 0, 1, 0]

test_doc = "that was a very close, great match"

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer. You can specify what kind of preprocessing it performs.
# In this case we lowercase all text, remove English stop words, and keep only runs of word characters (\w+).
count_vector = CountVectorizer(lowercase=True, stop_words='english', token_pattern=r'\w+')

# Fit training data
training_data = count_vector.fit_transform(training_corpus)
# Let's inspect vectors
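# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (scikit-learn >= 1.0)
print(count_vector.get_feature_names_out())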
print(training_data.toarray())

# transform test data
test_doc_transform = count_vector.transform([test_doc])

In [None]:
# Create the classifier object and fit it to the data. By default, MultinomialNB uses Laplace smoothing (alpha=1.0)
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, training_labels)

# Make predictions.
predictions = naive_bayes.predict(test_doc_transform)
predictions

A result of 1 means the classifier predicts that the test sentence belongs to the "sports" category. Now, let's try this one:

In [None]:
test_doc2 = "that was a very close, great election"
test_doc2_transform = count_vector.transform([test_doc2])
predictions = naive_bayes.predict(test_doc2_transform)
predictions

In [None]:
# We can also inspect the calculated probabilities.
print(naive_bayes.classes_)  # print the order of classes
print(naive_bayes.predict_proba(test_doc_transform))
print(naive_bayes.predict_proba(test_doc2_transform))
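
The values returned by `predict_proba` are normalized posteriors: each class score *P(d|c)P(c)* divided by the sum of the scores over all classes. The sketch below performs that normalization on hand-computed scores for "close great match". It assumes add-one smoothing over 4 ("sports") and 5 ("not sports") training tokens and a vocabulary of 7 words, so with default settings the result should agree with the first `predict_proba` output up to rounding. The helper name `normalize` is hypothetical.

```python
from fractions import Fraction


def normalize(scores):
    """Divide each class score by the sum of all scores so they add up to 1."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hand-computed scores P(d | c) * P(c); keys follow the 0/1 label encoding.
scores = {
    0: Fraction(1, 2) * Fraction(2, 12) * Fraction(1, 12) * Fraction(1, 12),
    1: Fraction(1, 2) * Fraction(1, 11) * Fraction(2, 11) * Fraction(3, 11),
}

posteriors = normalize(scores)
print({label: round(float(p), 4) for label, p in posteriors.items()})
```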


This was just a simple example to get familiar with the NB classifier. Let's look at a real-world scenario!

### 7.6 News classification using NB

## 8. Logistic Regression Classifier