# Naive Bayes

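Naive Bayes is a simple method for classifying text. It is called "naive" because it assumes that the words in a message are independent of one another once the class is known. Despite being less sophisticated than many modern techniques, it can still be quite powerful, as you will see in this workshop.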
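
As a rough sketch of the idea, a multinomial Naive Bayes classifier picks the class with the highest score below. The notation is an assumption introduced here, not part of the workshop code: $c$ is a class (spam or not spam), $V$ is the vocabulary, and $x_w$ is the number of times word $w$ occurs in the message.

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in V} x_w \log P(w \mid c) \right]
$$

Each word adds its own term to the sum regardless of the words around it, which is the "naive" part.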

Each unique word in the corpus will be considered a feature, and the value of that feature for a given sample will be its frequency. For example, if "the" occurs 4 times in a sample, the value of the "the" feature will be 4.  
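
To make the feature definition concrete, here is a minimal sketch that counts word frequencies in a single made-up sample. The sentence and the variable names are illustrative assumptions, not part of the dataset.

```python
from collections import Counter

# Made-up sample text, used only to illustrate frequency features.
sample = "the cat sat on the mat because the mat was warm and the cat was tired"

# Each distinct word is a feature; its value is how often it occurs.
counts = Counter(sample.split())

print(counts["the"])  # "the" occurs 4 times in this sample
print(counts["mat"])  # "mat" occurs 2 times
print(counts["dog"])  # a word that never occurs has value 0
```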

A critical step in preparing data for a Naive Bayes classifier is deciding what qualifies as a unique word. For our purposes, we recommend converting everything to lower case and removing all punctuation, numbers, and extra whitespace. However, you can decide how strict you want to be.
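
To see why strictness matters, the sketch below compares two possible conventions on a made-up message. The message and the names `loose` and `strict` are illustrative assumptions, and the regular expression is only one of many reasonable choices.

```python
import re

# Made-up message, used only to compare two tokenization conventions.
text = "WINNER!! Call 0800-123 now, claim your FREE prize."

# Loose: lower-case and split on whitespace; punctuation and digits stay attached.
loose = text.lower().split()

# Strict: lower-case and keep only runs of the letters a-z.
strict = re.findall(r"[a-z]+", text.lower())

print(loose)
print(strict)
```

Under the loose convention, "now," and "now" would count as two different features; under the strict one they collapse into a single feature.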

Complete the tokenize function below:

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint</b></font>
</summary>
<p>
<ul>
    <li>You can use .split() on a string to create a list of tokens separated by whitespace.</li>

</ul>
</p>
</details>

In [None]:
def tokenize(message):
  message = message.lower()
  # make sure to remove all extraneous characters
  return ... # a list of all the tokens in the message (including duplicates)

Now, we'll use that tokenization function to get the tokens of each message and put them in a list, messages_tokens. At the same time, we'll keep track of all the unique tokens in vocab. This is also a good time to create y, which has the labels for each sample. In this case, the labels represent spam or not spam.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint</b></font>
</summary>
<p>
<ul>
    <li>Use your tokenize function!</li>

</ul>
</p>
</details>

In [5]:
vocab = set()
y = []
messages_tokens = []

# read the file and build the vocabulary
with open('smsspam_raw.data', 'r') as file:
  for line in file:
    label, message = line.strip().split('\t', 1)
    tokens = ...
    messages_tokens.append(tokens) # used later for populating x
    vocab.update(tokens)
    y.append(1 if label == 'True' else 0)

vocab = sorted(vocab)
vocab_dict = {word: idx for idx, word in enumerate(vocab)} # to easily get feature's index from a token

The code block below takes the tokenized messages and the vocabulary collected above and uses them to build a matrix, x, that we can feed to our model. Each row of x is a message, and each column holds the count of one vocabulary word.

In [None]:
from collections import defaultdict
import numpy as np

features = len(vocab)
x = np.zeros((len(messages_tokens), features)) # shape is (samples, features)
for i, tokens in enumerate(messages_tokens):
    word_count = defaultdict(int)
    for token in tokens:
        if token in vocab_dict:
            word_count[token] += 1
    for token, count in word_count.items():
        x[i, vocab_dict[token]] = count
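
To check your understanding of the matrix, here is a small sketch of the same construction on three made-up messages. It uses plain Python lists instead of NumPy so the result is easy to read; every `toy_` name is a hypothetical stand-in for the real variables above.

```python
from collections import Counter

# Made-up, already tokenized messages.
toy_messages_tokens = [
    ["free", "prize", "free"],
    ["see", "you", "soon"],
    ["free", "soon"],
]

# Sorted vocabulary and a token -> column index lookup, as in the cell above.
toy_vocab = sorted({t for tokens in toy_messages_tokens for t in tokens})
toy_vocab_dict = {word: idx for idx, word in enumerate(toy_vocab)}

# One row per message, one column per vocabulary word.
toy_x = [[0] * len(toy_vocab) for _ in toy_messages_tokens]
for i, tokens in enumerate(toy_messages_tokens):
    for token, count in Counter(tokens).items():
        toy_x[i][toy_vocab_dict[token]] = count

print(toy_vocab)
for row in toy_x:
    print(row)
```

The first row has a 2 in the "free" column because "free" appears twice in the first message.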

When training models, it's important to keep training and testing data separate. If we trained on all the data and then tested on that same data, the evaluation wouldn't be fair, because the model could simply memorize the right answers. We'll hold out 30% of the data for testing and train on the rest.

In [7]:
from sklearn.model_selection import train_test_split as TTS
x_train, x_test, y_train, y_test = TTS(x, y, test_size=0.3) # 30% testing
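
The split above is random, so your numbers will change from run to run. If you want repeatable results, one option is sketched below on made-up data. The `random_state` and `stratify` arguments are not used in the workshop cell; they are optional parameters of `train_test_split`, and the toy data and the `split` helper are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Made-up toy data: 20 samples with one feature each, 6 of them labelled spam (1).
toy_x = [[i] for i in range(20)]
toy_y = [1] * 6 + [0] * 14

def split(seed):
    # random_state fixes the shuffle; stratify keeps the spam ratio similar in both parts.
    return train_test_split(
        toy_x, toy_y, test_size=0.3, random_state=seed, stratify=toy_y
    )

x_train_a, x_test_a, y_train_a, y_test_a = split(seed=0)
x_train_b, x_test_b, y_train_b, y_test_b = split(seed=0)

print(len(x_train_a), len(x_test_a))  # sizes of the two parts
print(x_test_a == x_test_b)           # same seed, same split
print(sum(y_train_a), sum(y_test_a))  # spam messages in each part
```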

Now that the data is split into training and testing sets, we can train the model. Call nb.fit on x_train and y_train.

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(..., ...)
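
If you are curious what fitting involves, the sketch below trains a tiny multinomial Naive Bayes model from scratch on made-up data. It is a simplified illustration under stated assumptions, not scikit-learn's implementation: `fit_toy_nb`, `predict_toy_nb`, and the toy data are hypothetical, and `alpha=1.0` is add-one smoothing, chosen to match the default `alpha` of `MultinomialNB`.

```python
import math
from collections import Counter

# Made-up toy training data: (tokens, label), where 1 = spam and 0 = not spam.
train = [
    (["free", "prize", "free"], 1),
    (["free", "call", "now"], 1),
    (["see", "you", "soon"], 0),
    (["call", "you", "soon"], 0),
]
toy_vocab = sorted({t for tokens, _ in train for t in tokens})

def fit_toy_nb(samples, vocab, alpha=1.0):
    """Return log priors and smoothed log word likelihoods for each class."""
    log_prior, log_likelihood = {}, {}
    for c in sorted({label for _, label in samples}):
        docs = [tokens for tokens, label in samples if label == c]
        log_prior[c] = math.log(len(docs) / len(samples))
        counts = Counter(t for tokens in docs for t in tokens)
        # Smoothing keeps words unseen in a class from getting probability zero.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_toy_nb(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log score; out-of-vocabulary words are skipped."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit_toy_nb(train, toy_vocab)
print(predict_toy_nb(["free", "prize"], log_prior, log_likelihood))
print(predict_toy_nb(["see", "you"], log_prior, log_likelihood))
```

Working in log space turns a product of many small probabilities into a sum, which avoids numerical underflow on long messages.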

Now that our model is trained, let's see how good it is. Use nb.predict on the test data, and compare that to the actual values in y_test!

In [None]:
from sklearn.metrics import classification_report

y_pred = ...
print(classification_report(y_test, y_pred))
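
The report lists precision, recall, and F1 for each class. As a minimal sketch of what those numbers mean for the spam class, the block below computes them by hand on made-up labels. The `toy_` lists are hypothetical and unrelated to your actual predictions.

```python
# Made-up labels for illustration only; 1 = spam, 0 = not spam.
toy_y_test = [1, 0, 0, 1, 0, 1, 0, 0]
toy_y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

pairs = list(zip(toy_y_test, toy_y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not spam flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

For a spam filter, low precision means legitimate messages get flagged, while low recall means spam gets through, so it is worth looking at both rather than accuracy alone.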

If you've done everything right, you should be getting pretty good results! Now, feel free to do some experimentation to see if you can improve upon the model. Maybe you can change your tokenizer or remove unnecessary features? Let us know what your best results are!