# Data Science Ex 06 - Classification (Naïve Bayes)

27.03.2022, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some fun with Classification and Model Evaluation!

In this exercise you are going to get an introduction to classification and how you can evaluate and visualize the performance of your model.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Introduction

### Data

Reference: https://scikit-learn.org/stable/datasets.html

First, we need some data to run our classification algorithm on.
The scikit-learn package (`sklearn`), which we will also use for the algorithms and evaluation, offers some sample datasets that we can use to work with.
The following lines download articles/messages and their assigned categories.

Please note: The data will be downloaded the first time. Thus, the execution will take some time during the first run.

In [None]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()

With the following calls, we get a glimpse of the structure of our demo data and what's in there.

`keys()` returns the root level keys of the data object

In [None]:
data.keys()

We can now use one of these keys to access parts of the demo data that are relevant to us.

In [None]:
len(data["data"])

In [None]:
data["data"][0]

In [None]:
data["target"]

The array from above contains the index of the corresponding label that the data belongs to.
The cleartext names of these categories are the following:

In [None]:
data["target_names"]

So, the example from above belongs to the following category:

In [None]:
dataId = 0

#### Message

In [None]:
data["data"][dataId]

#### Target/Category/Label

In [None]:
catId = data["target"][dataId]
data["target_names"][catId]

### Naïve Bayes

Reference: https://scikit-learn.org/stable/modules/naive_bayes.html

Since the articles contain text, we need some preparation of the data so the Naïve Bayes algorithm can "understand" (read: work with) it.
In case of the text, we will use the [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) algorithm to assign a significance to every word within the data.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

To train and test the algorithm, we need two distinct sets of data.
Additionally, within this introduction, we'll use only a subset of all categories, which is defined in the `categories` list.

In [None]:
categories = ['sci.crypt', 'sci.electronics',  'sci.med',  'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',  'talk.politics.mideast',  'talk.politics.misc',  'talk.religion.misc']
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

Setting the `subset` parameter returns the predefined train or test part of the data, so the split is already done for us.
Later, you'll see a function that lets you do the same with any dataset.

Training the model consists of two steps:
1. Preparing the data

In [None]:
tfidf = TfidfVectorizer()
train_mod = tfidf.fit_transform(train["data"])
train_mod

We won't go into detail about [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf).

In short: TF-IDF stands for *term frequency - inverse document frequency*.
The algorithm takes every word of a document, counts how often it occurs compared to all words in that document (*term frequency* part), and couples its significance to how rarely it occurs across all documents (*inverse document frequency* part).
The idea is that a word used many times in one document, but seldom in the others, is important for finding similar documents.
And if a word is quite frequent in all documents, it's not that useful for finding similar documents.
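The effect described above can be seen on a tiny, made-up corpus. The following is only a sketch: the three sentences and the names `docs` and `toy_tfidf` are invented for illustration and are not part of the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented mini-documents; "the" occurs in all of them, "rocket" in only one
docs = [
    "the rocket reached orbit",
    "the doctor prescribed medicine",
    "the cipher protects the message",
]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(docs)   # sparse matrix: one row per document, one column per word

col_the = toy_tfidf.vocabulary_["the"]
col_rocket = toy_tfidf.vocabulary_["rocket"]

# Inverse document frequency: the rarer a word is across documents, the higher the value
print("idf 'the':   ", toy_tfidf.idf_[col_the])
print("idf 'rocket':", toy_tfidf.idf_[col_rocket])

# Weights of both words within the first document
print("weight 'the' in doc 0:   ", weights[0, col_the])
print("weight 'rocket' in doc 0:", weights[0, col_rocket])
```

Both words occur once in the first document, but *rocket* should get the higher weight because it appears in no other document.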

If you are interested, feel free to dig deeper.
But for this course, going deeper is not relevant and is thus out of scope.

2. Actually training the model

In [None]:
model = MultinomialNB()
model.fit(train_mod, train["target"])

Now we have a trained model of our Naïve Bayes algorithm.

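Before we can make predictions (use the model), we also have to process the testing data by running it through the same `TfidfVectorizer()`.
Note that we only call `transform()` here (not `fit_transform()`), so the vocabulary learned from the training data is reused.
And then predictions can be made by calling the `predict()` method on the model.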

In [None]:
test_mod = tfidf.transform(test["data"]) 
pred = model.predict(test_mod)
pred

And that's it.
As you see, we need to call a `fit()` method to train the model and with a call to `predict()` we can get some predictions for another dataset.

#### Feature Pipelines

Now, as you also saw, we had to do some preprocessing (vectorizing) before we could build the model.
These steps are usually the same for every approach:
- Preprocess the train data
- Train the model
- Preprocess the test data
- Use the test data to make predictions

To avoid repetitive code and, for that matter, errors in that code, scikit-learn introduced a concept called *Feature Pipelines*.

In [None]:
from sklearn.pipeline import make_pipeline

Since the algorithms included in sklearn always provide the same interface (preprocessing steps offer `fit()` and `transform()`, classifiers offer `fit()` and `predict()`), we just have to provide and configure the algorithms.
The calls to the methods are made by a `Pipeline` object.
The code from above can be rewritten as:

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train["data"], train["target"])
pred = model.predict(test["data"])
pred

As you can see, the code looks much cleaner and clearer, but it does the same as the lines above.
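If you want to convince yourself that both variants behave the same, you can compare them on a small example. This is a minimal sketch with invented toy texts (`toy_train`, `toy_test`) and labels; it is not the newsgroups data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: label 0 = space, label 1 = medicine
toy_train = ["rocket launch orbit", "moon orbit mission",
             "doctor patient medicine", "medicine dose patient"]
toy_target = [0, 0, 1, 1]
toy_test = ["orbit of the moon", "the patient needs medicine"]

# Manual variant: vectorize, train, vectorize again, predict
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(toy_train), toy_target)
pred_manual = clf.predict(vec.transform(toy_test))

# Pipeline variant: the same steps, but the pipeline makes the calls for us
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(toy_train, toy_target)
pred_pipe = pipe.predict(toy_test)

print(pred_manual, pred_pipe)
print(np.array_equal(pred_manual, pred_pipe))
```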

Having our predictions, we can go on and evaluate them.

### Model Evaluation

First of all, we can simply check how well our predictions match the expected results.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(test["target"], pred)

This means that our predictions are correct nearly 80% of the time - and as you will see later on, the model is probably better than this number suggests.
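What `accuracy_score()` calculates is simply the share of predictions that match the expected category. A minimal sketch with invented values (`y_true` and `y_pred` are made up for this illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2])   # invented expected categories
y_pred = np.array([0, 1, 1, 2])   # invented predictions, one of four is wrong

print(accuracy_score(y_true, y_pred))   # share of matching entries
print(np.mean(y_true == y_pred))        # the same value, computed by hand
```

Both lines should print `0.75`, since three of the four predictions are correct.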

#### Crossvalidation

Sometimes, a simple score isn't enough to determine if a model works well with our data.
We want to evaluate the model on our training data as well - before we even make predictions.
In this case, we can use crossvalidation to check multiple combinations of our data against the model and see how well each part performs.

In [None]:
from sklearn.model_selection import cross_val_score

To use crossvalidation, we have to provide a model (in this case our pipeline object) and the dataset, split into data and categories.
And with the `cv` parameter, we specify how many parts (folds) the data should be split into for crossvalidation (in this case 5).

In [None]:
crossVal = cross_val_score(model, train["data"], train["target"], cv=5)
print(f"{crossVal} -> {crossVal.mean()}")

Based on our training data, we can see that our model is 86.5% correct with its predictions.
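To make the `cv=5` parameter a bit more concrete, the following sketch writes the loop out by hand on invented toy data. It assumes, as described in the scikit-learn documentation, that an integer `cv` combined with a classifier results in stratified folds (`StratifiedKFold`). The names `toy_texts`, `toy_labels` and `toy_model` are made up for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: 10 "space" texts (label 0) and 10 "medicine" texts (label 1)
toy_texts = ([f"rocket orbit mission number{i}" for i in range(10)]
             + [f"doctor patient medicine number{i}" for i in range(10)])
toy_labels = np.array([0] * 10 + [1] * 10)

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# What cross_val_score() does for us
auto_scores = cross_val_score(toy_model, toy_texts, toy_labels, cv=5)

# The same idea written out: train on 4 folds, score on the remaining fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(toy_texts, toy_labels):
    fold_model = clone(toy_model)   # fresh, untrained copy for every fold
    fold_model.fit([toy_texts[i] for i in train_idx], toy_labels[train_idx])
    manual_scores.append(
        fold_model.score([toy_texts[i] for i in test_idx], toy_labels[test_idx])
    )

print(auto_scores)
print(np.array(manual_scores))
```

Every text is used for testing exactly once, and the model that is scored on a fold has never seen that fold during training.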

### Retraining

Since crossvalidation only trains our model with parts of the train set, we need to retrain our model with the full train set at the end (before we start actually using the model).
And to be sure we won't use any legacy settings, we recreate the model as well.

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train["data"], train["target"])
pred = model.predict(test["data"])
accuracy_score(test["target"], pred)

### Confusion Matrix

Another way to visualize the performance of a model is to plot a confusion matrix.
sklearn offers a method to create this confusion matrix, and using `seaborn` we can plot it nicely.

In [None]:
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(test["target"], pred)   # Comparing the expected categories with the predicted categories
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(matrix.T, square=True, annot=True, fmt="d", cbar=True, xticklabels=train.target_names, yticklabels=train.target_names, ax=ax)
ax.set(xlabel="True Category", ylabel="Predicted Category")

As you can see here, the algorithm performed really well (bright diagonal line), except that it confused *soc.religion.christian* and *talk.religion.misc*.
But I'd say in this case, this confusion isn't that severe, as religion is the overall topic.
So it wasn't that wrong.
The same holds for *talk.politics.guns* and *talk.politics.misc* or *sci.crypt* and *sci.electronics*.
It points in the right direction.
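If the layout of the matrix is unclear, a tiny example helps. In this sketch, `y_true` and `y_pred` are invented values. `confusion_matrix()` puts the true categories in the rows and the predicted categories in the columns, which is why the heatmap above plots `matrix.T` to get the true categories on the x-axis.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2]   # invented expected categories
y_pred = [0, 1, 1, 1, 2]   # invented predictions; one item of category 0 was predicted as 1

toy_matrix = confusion_matrix(y_true, y_pred)
print(toy_matrix)
# [[1 1 0]
#  [0 2 0]
#  [0 0 1]]
```

The `1` in the first row, second column is the item that belongs to category `0` but was predicted as category `1`.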

`confusion_matrix()` can also be used to reorder the labels and to select only a subset of them.
If we just want to focus on the `sci` labels, and in reversed order, we can set the `labels` parameter.

In [None]:
labelIdx = [3,2,1,0]
labels = np.array(train["target_names"])[labelIdx]
labels

In this case, where the labels are indices pointing to the actual name of the label, we have to split the information into `labelIdx` (filtering in `confusion_matrix()`) and `labels` (ticks in the visualization).

In [None]:
matrix = confusion_matrix(test["target"], pred, labels=labelIdx)   # Comparing the expected categories with the predicted categories
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(matrix.T, square=True, annot=True, fmt="d", cbar=True, xticklabels=labels, yticklabels=labels, ax=ax)
ax.set(xlabel="True Category", ylabel="Predicted Category")
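The effect of the `labels` parameter is easier to follow on a small example. This sketch uses invented values for `y_true` and `y_pred` and asks only for the categories `1` and `0`, in that order.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2]   # invented expected categories
y_pred = [0, 1, 1, 1, 2]   # invented predictions

# Only categories 1 and 0, in reversed order; category 2 is left out
subset = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(subset)
# [[2 0]
#  [1 1]]
```

The first row and column now belong to category `1`, the second to category `0`, and the item of category `2` no longer shows up.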

### Using the model

Now that we saw that the model performed pretty well, we can use it for our own input.
To simplify the usage, we write a simple function that saves us from writing the same lines of code all the time.
The function takes our text, the model and the categories, predicts the category (`p` contains the index of the category), and uses this index to return the name of the category.

*Please note:* This is only necessary for this model since the categories are encoded as numbers in the data. 

In [None]:
def predict_category(text, model=model, categories=train["target_names"]):
    p = model.predict([text])   # predict() expects a list of texts, so we wrap our single text in a list
    return categories[p[0]]     # The result is an array with one entry per text; we provided one text, so we take the first entry

In [None]:
predict_category("I believe in god")

In [None]:
predict_category("Fly me to the moon")

In [None]:
predict_category("How do hash algorithms work?")

And sometimes, because of the limited data and probably the age of the dataset, the results can be a bit funny.

In [None]:
predict_category("Android vs iOS")

In [None]:
predict_category("Data science is fun. Hard to learn, but if you master it, your life gets easier.")

In [None]:
predict_category("Coronavirus")

In [None]:
predict_category("Corona virus")

You may have noticed that longer inputs tend to give better predictions, since the model has more words to work with.
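One way to look at this is to ask the model how sure it is, using `predict_proba()` instead of `predict()`. The following is a minimal sketch on invented toy data (`toy_texts`, `short_text` and `long_text` are made up), so the numbers say nothing about the newsgroups model. It only illustrates the tendency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: label 0 = space, label 1 = medicine
toy_texts = ["rocket launch orbit moon", "moon mission orbit rocket",
             "doctor patient medicine dose", "medicine patient doctor therapy"]
toy_labels = [0, 0, 1, 1]
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(toy_texts, toy_labels)

short_text = "moon"
long_text = "moon rocket orbit mission launch"

# predict_proba() returns one row per text with one probability per category
proba_short = toy_model.predict_proba([short_text])[0]
proba_long = toy_model.predict_proba([long_text])[0]

print("short:", proba_short)
print("long: ", proba_long)
```

On this toy data, the longer text should get a higher probability for the space category than the single word.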

### Train & Test Sets

Compared to the demo data from the beginning, we do not always (read: never) have the luxury that the data already comes with a train and test set.
In this case, we need to split our data into a test set and training set.

Since this is a common requirement in Data Science, there is a method we can use.

In [None]:
from sklearn.model_selection import train_test_split

Just as an example, we will split our `train` data into a subset for training, `X_train` (data) and `y_train` (categories), and one for testing, `X_test` (data) and `y_test` (categories).
With `train_size`, we provide the share of the data used for the training set; the rest will be part of the test set (there is also a `test_size` that you could use instead).
And `random_state` ensures that we always split the data the same way (for reproducibility).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train["data"], train["target"], train_size=.75, random_state=42)

Now, this example simply took 75% of the original data as train set.
And the other 25% are in the test set.
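To see the sizes and the effect of `random_state` directly, here is a small sketch with eight invented items (`toy_X` and `toy_y` are made up for this illustration):

```python
from sklearn.model_selection import train_test_split

toy_X = [f"text {i}" for i in range(8)]   # eight invented documents
toy_y = [0, 1, 0, 1, 0, 1, 0, 1]          # invented categories

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, train_size=.75, random_state=42)
print(len(X_tr), len(X_te))   # 6 2

# Same random_state, same split
X_tr2, X_te2, _, _ = train_test_split(toy_X, toy_y, train_size=.75, random_state=42)
print(X_te == X_te2)
```

With eight items and `train_size=.75`, six items end up in the train set and two in the test set.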

### More Naïve Bayes

scikit-learn offers some more Naïve Bayes algorithms that could be interesting for you (list not exhaustive):
- **MultinomialNB:** For features that have a number of occurrences (how many times? Used above)
- **BernoulliNB:** If we just say that a feature is present or not (`0` or `1` | `True` or `False`)
- **GaussianNB:** If the features follow a Gaussian (normal) distribution
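All of these classifiers are used the same way, via `fit()` and `predict()`. As a rough sketch, here are the two variants not used above, each on a few invented numbers (`X_cont`, `X_bin` and the labels are made up):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# GaussianNB: one invented continuous feature, two well-separated groups
X_cont = np.array([[1.0], [1.2], [0.9], [5.0], [5.1], [4.8]])
y_cont = np.array([0, 0, 0, 1, 1, 1])
gauss_pred = GaussianNB().fit(X_cont, y_cont).predict([[1.1], [5.05]])

# BernoulliNB: two invented features that are either present (1) or absent (0)
X_bin = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_bin = np.array([0, 0, 1, 1])
bern_pred = BernoulliNB().fit(X_bin, y_bin).predict([[1, 0], [0, 1]])

print(gauss_pred, bern_pred)
```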

## Exercises

### Ex01 - News Categories

In the introduction, you saw the usage of the *20newsgroups* dataset with some categories.
Now, create and train a model using the other categories (*alt.atheism* to *rec.sport.hockey*) and do some predictions.

First, specify the categories you want to use.

Now, fetch the `train` and `test` sets for these categories.

Build the pipeline using the `MultinomialNB` classifier.

Train the model.

Predict the categories for the test set.

Calculate the confusion matrix.

Plot the heatmap of the confusion matrix.

Predict the categories for:

- "General Motors is a car manufacturer."

- "The Boston Red Sox actually wear red socks."

- "Have you tried turning it off and on again? Maybe a reboot helps."

#### Solution

In [None]:
# %load ./Ex06_01_Sol.py

### Ex02 - Spam Filter

In this exercise, you are going to train a simple spam filter.
First, load the data from **Ex06_02_Data.csv**.

As you can see, the column *Label* tells us whether a message was spam (`spam`) or not spam (`ham`).

Create the train and test sets.
The train set should contain 70% of the data.

Create, train and predict the labels using the same approach as shown in the introduction.
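But this time, we will set a parameter on the `MultinomialNB()` classifier - set `alpha=.1`.
Your constructor call should look like `MultinomialNB(alpha=.1)`.
`alpha` is the additive smoothing parameter (default `1.0`); a smaller value smooths the word probabilities less, which makes the classifier a bit more radical in deciding.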
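What "more radical" means can be seen on a tiny example. This is only a sketch with invented word counts (`X_counts`, `y_counts` and `sample` are made up): the same classifier is trained once with the default `alpha=1.0` and once with `alpha=.1`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented word counts for two words; class 0 only uses word 0, class 1 only word 1
X_counts = np.array([[3, 0], [0, 3]])
y_counts = np.array([0, 1])
sample = np.array([[1, 0]])   # a new message that contains word 0 once

# Probability of class 0 for the sample, with default and with smaller smoothing
p_default = MultinomialNB(alpha=1.0).fit(X_counts, y_counts).predict_proba(sample)[0, 0]
p_small = MultinomialNB(alpha=.1).fit(X_counts, y_counts).predict_proba(sample)[0, 0]

print("alpha=1.0:", p_default)
print("alpha=0.1:", p_small)
```

With less smoothing, the classifier is more certain about class `0` (roughly 0.97 instead of 0.8 in this toy case).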

Plot the confusion matrix as heatmap to see how well the classifier performs.
Entries labeled as `spam` should be shown in the top left corner.

And now, use the model to predict:

- "Whazaaaap!"

- "Congratulations, you've won the lottery."

- "Sorry, I'll be late."

- "I'm a nigerian prince who needs to transfer some gold. You can have $1'000'000 if you work with me."

As you can see, your model was nearly always correct - except for the last one.

#### Solution

In [None]:
# %load ./Ex06_02_Sol.py

### Ex03 - Alexa Reviews (Part 1)

In this exercise, we will try to predict if a customer liked a product or not.
We will do this by analysing the reviews of the Amazon Alexa.
As you will see, the exercise is split into two parts since it's a bit bigger than the previous ones.

*Please note:* If you can't complete this part, you can also start with the next exercise.
The idea is that you take your data from this exercise to the next one, but we've also provided the input data for the next exercise.

Load the file **Ex06_03_Data.csv** and show the first couple of lines.

As you can see, we have *verified_reviews* that we will use to predict the *feedback*.
Here, `1` stands for like, and `0` for dislike.

Create the train and test sets with 80% of the data in the train set.

Create the model, the same way you've done it before (here, we won't use `alpha`).
And show the crossvalidation score with `cv=5`.

As you see, the model performs really well.

So, train the model with your train set and predict the feedback for your test set.

Show the result as confusion matrix.

As you now see, your model performed really well, because it just assumed everyone liked Alexa.
Conspiracy theorists probably think that Alexa and scikit-learn are working together - but that's not the case - well, probably.

Plot the number of positive and negative feedback entries.

*Hint:* Use `value_counts()` on the column and the rest is easy.

As you can see, we have a severe case of class imbalance.
And the model just figured that dislikes are so rare, that it's easier just to assume all reviews are positive (or at least nearly all the time).
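Why a high score can be misleading here is easy to show with invented numbers. In this sketch, `y_true` contains nine likes and one dislike, and `y_always_like` is a made-up "model" that predicts like for everything.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1] * 9 + [0]      # invented feedback: nine likes, one dislike
y_always_like = [1] * 10    # a "model" that always predicts like

toy_accuracy = accuracy_score(y_true, y_always_like)
toy_matrix = confusion_matrix(y_true, y_always_like)

print(toy_accuracy)   # 0.9
print(toy_matrix)
# [[0 1]
#  [0 9]]
```

The accuracy is 90%, although not a single dislike was found (first row: the one true dislike was predicted as like).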

To fix this, we will upsample the negative reviews.
This means that we duplicate negative reviews until the imbalance isn't a problem anymore.
To do so, we use a function called `resample()`.
The following four parameters are relevant here:
- `arrays`: The data to take samples from
- `replace`: Boolean parameter to state if duplicates are allowed
- `n_samples`: Number of samples to take from `arrays`
- `random_state`: Seed to reproduce the sampling

In [None]:
from sklearn.utils import resample
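To get a feeling for how `resample()` behaves, here is a minimal sketch on an invented mini dataset. The `DataFrame` `toy` and its contents are made up; only the column names follow the exercise data.

```python
import pandas as pd
from sklearn.utils import resample

# Invented toy reviews: six likes (1) and two dislikes (0)
toy = pd.DataFrame({
    "verified_reviews": ["great", "love it", "nice", "works well",
                         "good", "cool", "bad", "broken"],
    "feedback": [1, 1, 1, 1, 1, 1, 0, 0],
})
pos = toy[toy["feedback"] == 1]
neg = toy[toy["feedback"] == 0]

# Draw with replacement, so the two dislikes can be picked several times
neg_up = resample(neg, replace=True, n_samples=4, random_state=42)

# Combine the untouched likes with the upsampled dislikes
balanced = pd.concat([pos, neg_up])
print(balanced["feedback"].value_counts())
```

Because `replace=True`, we can draw more samples than there are rows; here the two dislikes become four.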

Upsample the negative reviews to 2/3 of positive reviews.
And build a new `DataFrame` containing the positive reviews and your newly upsampled negative reviews.

*Hint:* You need `pd.concat()` to create the new dataset.

Now, plot the *feedback* again to show that the imbalance is gone.

#### Solution

In [None]:
# %load ./Ex06_03_Sol.py

### Ex04 - Alexa Reviews (Part 2)

In this exercise, you will actually build the Amazon Alexa like/dislike classifier.
You can take the data from the previous exercise, or you can load **Ex06_04_Data.csv**.
This file should contain roughly the same data as you created in the exercise above.

Load the data, create your train and test sets (choose your own split ratio), create a model (`alpha=.1`) and show the crossvalidation scores.

As you can see, the model performs equally well as before.

Create the model (again with `alpha=.1`), train it and predict the feedback for the test set.

Plot the confusion matrix to check that the classifier now performs better.

As you can see, the classifier works now.
At least negative reviews are found.

Now predict the feedback for the following reviews:

- "I love my Alexa!"

- "I hate it!!!!"

- "It does not work. Sound quality is bad."

- "It's a cool tool. My life got way easier."

- "Just the works product ever"

- "The NSA is probably listening..."

As you can see, it performs quite well.
I'm just not quite sure about the last one...

#### Solution

In [None]:
# %load ./Ex06_04_Sol.py