# Applications of the Naive Bayes classifier

## Task 1
The _Naive Bayes_ classifier is often used for the classification of textual data but can also be used for any other classification task. The underlying math comes from _Bayes' theorem_, which describes the probability of an event given prior knowledge about conditions related to that event (the _prior_). Taking this prior knowledge into account often allows for a more accurate prediction.  
As we will use the classifier on text data, we first recap some useful preprocessing techniques for _Natural Language Processing_ (NLP), as you will need these to solve the task at hand.
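
For reference, the theorem and the resulting decision rule can be written as follows. The notation is our own and does not come from the exercise sheet: $y$ denotes the class and $x_1, \dots, x_n$ the features (for text, the word counts).

```latex
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The second line relies on the "naive" assumption that the features are conditionally independent given the class. The denominator is dropped because it is the same for every class and therefore does not change which class scores highest.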

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Remove punctuation
#### List comprehension
There are different techniques to remove punctuation from our data. The first uses the _string_ library and a list comprehension.

In [None]:
import string

print(string.punctuation)

s = ("This is a test text to show off some of the relevant preprocessing techniques for NLP problems. These include the removal of punctuation, the conversion to lower case as well as the removal of stopwords.")

s_wo_punct = [c for c in s if c not in string.punctuation]
s_wo_punct

In [None]:
s_wo_punct = "".join(s_wo_punct)
s_wo_punct

#### Regular expression
The second approach uses _regular expressions_ to find specific characters and replace them with an empty string. You are not limited to string.punctuation but can define any set of characters you want to delete from the string (see the first substitution in the code below).

In [None]:
import re

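# Custom character set; the hyphen is escaped so it is not read as a range
s_wo_punct2 = re.sub(r"[.,!?:;\-='\"@#_]", "", s)
# re.escape protects characters such as ], \ and ^ inside the character class
s_wo_punct3 = re.sub(f"[{re.escape(string.punctuation)}]", "", s)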

print(s_wo_punct2)
print(s_wo_punct3)

### Lower case
Transform the text to lowercase.

In [None]:
s_wo_punct_lower = s_wo_punct.lower()
s_wo_punct_lower

### Remove stopwords
As text classification does not usually rely on a deep understanding of the underlying text, pronouns, articles and prepositions often add little value for these kinds of tasks. They are thus removed from the text corpus entirely to reduce the dimensionality of the input data.  
We use the Python NLP package __NLTK__, which requires you to download the stopwords the first time you use it. Later runs do not need to fetch these files again. Stopword lists are available for different languages, as can be seen in the code below.

In [None]:
# The stopwords corpus has to be downloaded once before the lists can be loaded
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

print(stopwords.words('german'))
print(stopwords.words('english'))

In [None]:
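# Build the stopword set once instead of reloading the list for every word
english_stopwords = set(stopwords.words("english"))
s_wo_punct_lower_nosw = " ".join([w for w in s_wo_punct_lower.split() if w not in english_stopwords])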

s_wo_punct_lower_nosw

### Count Vectorizer
A _Count Vectorizer_ represents each text as a vector of word counts over the vocabulary of the whole corpus. These features (__X__ in this case) can then be used to train a classifier.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

sample_data = [
    'This is the fourth exercise.',
    'This exercise is not online yet',
    'Exercise four is boring, I want another exercise',
    'Is this the first exercise?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)

print(vectorizer.get_feature_names_out())

Here the transformed input gets stored in a dataframe. _X_ could also be used directly for training (via model.fit(X, y), where y holds one label per text).

In [None]:
print(X.toarray())

df2 = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
df2

### Naive Bayes classifier
In order to use the classifier you first need to import it. sklearn ships several preimplemented variants, among them the _MultinomialNB_, which is typically used for text data represented as word counts, the _CategoricalNB_, which handles categorical features, and the _GaussianNB_ for continuous features. In this exercise you will get to apply the first two.

In [None]:
from sklearn.naive_bayes import MultinomialNB, CategoricalNB

naive_bayes = MultinomialNB()

# Arbitrary toy labels, one per sentence in sample_data
naive_bayes.fit(X, [0, 0, 0, 1])
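
Once the model is fitted, new text can be classified. The following is a minimal sketch that repeats the toy setup from above so it runs on its own; the sentence in `new_text` is a made-up input, and the labels remain arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

sample_data = [
    "This is the fourth exercise.",
    "This exercise is not online yet",
    "Exercise four is boring, I want another exercise",
    "Is this the first exercise?",
]
labels = [0, 0, 0, 1]  # same arbitrary toy labels as above

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)
naive_bayes = MultinomialNB().fit(X, labels)

# New text goes through transform (not fit_transform) so that it is mapped
# onto the vocabulary learned from sample_data
new_text = ["Is the fourth exercise online?"]  # hypothetical input
X_new = vectorizer.transform(new_text)

prediction = naive_bayes.predict(X_new)
probabilities = naive_bayes.predict_proba(X_new)
print(prediction)
print(probabilities)
```

Words that do not occur in the training vocabulary are simply ignored by `transform`.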

## Task 2
As written in the accompanying PDF, please download the spam dataset ("emails.csv") from __[here](https://github.com/DataScienceLabFHSWF/machine-learning-book/tree/main/data/naive_bayes)__ and load it into a pandas DataFrame. Get familiar with the dataset.

### Visualization
Visualize different aspects of the dataset (e.g. class distribution, text length of the different entries) by using matplotlib or seaborn. The text length of each sample should be stored in an extra column called _length_.

Get the length of each text sample and store it in the column _length_ of the dataframe.
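
One possible way to do this is sketched below on a small toy dataframe. The column names `text` and `spam` are assumptions; check the actual column names of "emails.csv" and adapt them.

```python
import pandas as pd

# Toy stand-in for emails.csv; the column names "text" and "spam" are assumptions
df = pd.DataFrame({
    "text": ["Win money now!!!", "Meeting at 10?", "Free prize, click here", "Lunch tomorrow"],
    "spam": [1, 0, 1, 0],
})

# Number of characters per message
df["length"] = df["text"].str.len()
print(df)
```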

Plot the distribution of lengths for _spam_ and _non-spam_ samples.
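
A minimal sketch with matplotlib, again on toy data with the assumed column names `text` and `spam`. Both histograms use the same value range so that their bins line up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for emails.csv; column names are assumptions
df = pd.DataFrame({
    "text": ["Win money now!!!", "Meeting at 10?", "Free prize, click here", "Lunch tomorrow"],
    "spam": [1, 0, 1, 0],
})
df["length"] = df["text"].str.len()

fig, ax = plt.subplots()
value_range = (0, df["length"].max())
# One semi-transparent histogram per class over the same value range
for label, name in [(0, "non-spam"), (1, "spam")]:
    ax.hist(df.loc[df["spam"] == label, "length"],
            bins=10, range=value_range, alpha=0.6, label=name)
ax.set_xlabel("length (characters)")
ax.set_ylabel("number of messages")
ax.legend()
# In a notebook the figure is rendered when the cell finishes
```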

Display the shortest and the longest message stored in this dataframe.
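
Assuming the _length_ column already exists, `idxmin` and `idxmax` give the row labels of the extremes. The data and column names below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for emails.csv; column names are assumptions
df = pd.DataFrame({
    "text": ["Ok", "Win a free prize now", "See you at lunch"],
    "spam": [0, 1, 0],
})
df["length"] = df["text"].str.len()

# idxmin/idxmax return the index label of the first minimum/maximum
shortest = df.loc[df["length"].idxmin(), "text"]
longest = df.loc[df["length"].idxmax(), "text"]

print("shortest:", shortest)
print("longest:", longest)
```

If several messages share the same length, only the first one is returned.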

Visualize the most frequent words (for spam and non-spam texts) with the help of the package _wordcloud_. Do you notice any meaningful differences between these two wordclouds? What is the problem with some of the frequent words (for both cases) and how would you rate the added value of these problematic words when it comes to actually training a classifier?
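
A word cloud is essentially a picture of word frequencies. The sketch below computes those frequencies per class with `collections.Counter` instead of drawing them, so it does not depend on the _wordcloud_ package; the data, the column names and the helper `word_counts` are assumptions for illustration. To our knowledge, `WordCloud.generate_from_frequencies` accepts such a dictionary if you want to render it.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for emails.csv; column names are assumptions
df = pd.DataFrame({
    "text": [
        "free prize for you",
        "free money for free",
        "see you at the meeting",
        "the meeting is at noon",
    ],
    "spam": [1, 1, 0, 0],
})


def word_counts(texts):
    """Count lowercase, whitespace-separated tokens over all texts."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower().split())
    return counter


spam_counts = word_counts(df.loc[df["spam"] == 1, "text"])
ham_counts = word_counts(df.loc[df["spam"] == 0, "text"])

print(spam_counts.most_common(3))
print(ham_counts.most_common(3))
```

Even in this tiny example, words such as "for", "the" and "at" rank high without saying anything about the class.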

Calculate the class distribution between spam and non-spam data in percent and then use a barplot (or countplot if you use seaborn) to present it visually.
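
A minimal sketch, assuming the label column is called `spam` and uses 0 for non-spam and 1 for spam. `value_counts(normalize=True)` returns relative frequencies, which are scaled to percent here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy labels; the column name "spam" is an assumption
df = pd.DataFrame({"spam": [0, 0, 0, 1]})

# Share of each class in percent, ordered by class label (0, then 1)
class_percent = df["spam"].value_counts(normalize=True).sort_index() * 100
print(class_percent)

fig, ax = plt.subplots()
ax.bar(["non-spam", "spam"], class_percent.to_numpy())
ax.set_ylabel("share of messages (%)")
```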

### Preprocess dataset
Define your functions for text cleaning here and then preprocess the text. 
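
The preprocessing steps from Task 1 can be combined into one function, for example as sketched below. The function name `clean_text` and the tiny stopword set are placeholders; in the notebook you would rather pass `set(stopwords.words("english"))` from NLTK.

```python
import string

# Small illustrative stopword set (placeholder for the NLTK list)
STOPWORDS = {"the", "a", "is", "to", "of", "and", "this"}


def clean_text(text, stopwords=STOPWORDS):
    """Remove punctuation, convert to lower case and drop stopwords."""
    no_punct = "".join(c for c in text if c not in string.punctuation)
    tokens = no_punct.lower().split()
    return " ".join(t for t in tokens if t not in stopwords)


print(clean_text("This is the BEST offer, click to win a prize!"))
```

Applied to a dataframe, this could look like `df["clean_text"] = df["text"].apply(clean_text)`, where both column names are again assumptions.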

Now that we have cleaned our sample texts we can use the dataset to perform the training and test procedure. Use the _CountVectorizer_ to generate the features. 
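
One way to set this up is sketched below on made-up cleaned texts. The split ratio and `random_state` are arbitrary choices. The vocabulary is learned on the training texts only and then reused for the test texts, so the test set does not influence the features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up cleaned texts and labels (1 = spam, 0 = non-spam)
texts = [
    "free prize click now", "win free money now",
    "cheap loans click here", "claim free prize today",
    "meeting tomorrow noon", "see you meeting tomorrow",
    "lunch plans today", "project report attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)  # learn vocabulary on train only
X_test = vectorizer.transform(X_test_text)        # reuse that vocabulary

print(X_train.shape, X_test.shape)
```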

### Training
As we are dealing with features generated from text data, we use the _sklearn.naive_bayes.MultinomialNB_ as our underlying model.

Performance on test dataset:
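
A minimal sketch of training and test evaluation, repeated on made-up data so that it runs on its own. Accuracy and the confusion matrix are only two of several possible metrics; with imbalanced classes, precision and recall are usually worth a look as well.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up cleaned texts and labels (1 = spam, 0 = non-spam)
texts = [
    "free prize click now", "win free money now",
    "cheap loans click here", "claim free prize today",
    "meeting tomorrow noon", "see you meeting tomorrow",
    "lunch plans today", "project report attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print("accuracy:", accuracy)
print(cm)
```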

Performance on own text samples:
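
To classify your own texts, they have to pass through the same vectorizer as the training data. A `Pipeline` takes care of this automatically; the sketch below uses made-up training data and made-up samples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data (1 = spam, 0 = non-spam)
train_texts = [
    "free prize click now",
    "win free money now",
    "meeting tomorrow at noon",
    "see you at the meeting tomorrow",
]
train_labels = [1, 1, 0, 0]

# Vectorizer and classifier chained into one estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

own_samples = ["claim your free prize now", "are we still meeting tomorrow"]
predictions = model.predict(own_samples)
probabilities = model.predict_proba(own_samples)

# Column 1 of predict_proba belongs to class 1 (spam), see model.classes_
for text, label, proba in zip(own_samples, predictions, probabilities):
    print(f"{text!r}: class {label}, P(spam) = {proba[1]:.2f}")
```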

### Evaluation

## Task 3
In this task we will analyze the "flu.csv" dataset which can be downloaded __[here](https://github.com/DataScienceLabFHSWF/machine-learning-book/tree/main/data/naive_bayes)__. It is a very small toy dataset to showcase the encoding of categorical features as well as the usage of another variant of the Naive Bayes classifier.

### Preprocessing
Get familiar with the dataset. The goal is to predict whether or not a person has the flu.  
What are the feature columns and what is the target column in this example?

Use the LabelEncoder (sklearn.preprocessing.LabelEncoder()) to encode the data in the columns.
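
A minimal sketch on a toy dataframe. The column names and values are assumptions and will differ from the real "flu.csv". One encoder is kept per column so that the numeric codes can be translated back later with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for flu.csv; column names and values are assumptions
df = pd.DataFrame({
    "chills": ["Y", "Y", "N", "N"],
    "fever": ["mild", "no", "strong", "mild"],
    "flu": ["N", "Y", "Y", "N"],
})

encoders = {}
encoded = pd.DataFrame(index=df.index)
for column in df.columns:
    encoders[column] = LabelEncoder()
    # Classes are sorted, then mapped to 0, 1, 2, ...
    encoded[column] = encoders[column].fit_transform(df[column])

print(encoded)
print(encoders["fever"].classes_)
```

The sklearn documentation describes LabelEncoder as intended for target values; for feature columns, `OrdinalEncoder` is the alternative it suggests.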

Now use the encoded features (f1 to f4) to build a new dataframe for the training of our classifier. The _zip()_ function might be useful here.
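
`zip()` turns the four encoded columns into one tuple per sample, which pandas can read as rows. The values below are made up.

```python
import pandas as pd

# Hypothetical encoded feature columns, one entry per person
f1 = [1, 1, 0, 0]
f2 = [0, 1, 1, 0]
f3 = [0, 1, 2, 0]
f4 = [1, 0, 1, 1]

# zip yields (f1[i], f2[i], f3[i], f4[i]) for each sample i
features = pd.DataFrame(list(zip(f1, f2, f3, f4)), columns=["f1", "f2", "f3", "f4"])
print(features)
```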

### Training
We do not use the _MultinomialNB_ but instead _sklearn.naive_bayes.CategoricalNB_ as we have to deal with categorical data here.

Generate a few input samples to feed into the classifier and print the predictions as well as the predicted probabilities for each target class (_model.predict_proba()_).
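
A minimal sketch of training and prediction with made-up encoded data. The input samples only use category codes that also occur in the training data, since CategoricalNB may raise an error for codes it has not seen in a feature.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Made-up encoded features and target (1 = flu, 0 = no flu)
features = pd.DataFrame({
    "f1": [1, 1, 1, 0, 0, 0, 0, 1],
    "f2": [0, 1, 0, 1, 0, 1, 1, 1],
    "f3": [0, 0, 1, 0, 0, 1, 1, 1],
    "f4": [1, 0, 1, 1, 0, 1, 0, 1],
})
target = [0, 1, 1, 1, 0, 1, 0, 1]

model = CategoricalNB()
model.fit(features, target)

# Hypothetical new persons, encoded in the same way as the training data
samples = pd.DataFrame([[1, 0, 1, 1], [0, 0, 0, 0]], columns=features.columns)

predictions = model.predict(samples)
probabilities = model.predict_proba(samples)  # one column per entry in model.classes_

print(predictions)
print(probabilities)
```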