## Part 2 - Classification with Bag of Words and TF-IDF

This is part 2 of the **Natural Language Processing** series. If you haven't already done or watched part 1, you can find the [notebook and presentation here](https://github.com/IBMDeveloperUK/Analysing-Jokes-with-Natural-Language-Processing). 

In this notebook we will look at the task of classification. We will continue to work with the dataset of jokes that were scraped from the web but this time we're interested in how we can automatically categorize jokes.

The libraries we'll be using are:
- [NLTK (the Natural Language Toolkit)](https://www.nltk.org/), an easy-to-use Python library for working with text data and NLP.
- [scikit-learn](https://scikit-learn.org/stable/), a Python library for machine learning and related statistical modelling.

### Pre-processing

To make things easier for this notebook, we're going to define one big function that will handle all of the pre-processing steps we want to do. We covered how all of these steps work in part 1 but let's quickly remind ourselves.

1. First up, we take every word and make it **lowercase**. This avoids case-sensitive duplicates.
2. Next, we expand some very common **contractions** in the English language (e.g. we'll = we will).
3. Thirdly, we strip all other **punctuation** out of the text.
4. We then turn our jokes into lists of words (**tokens**).
5. We remove any tokens that are included in our **stopwords**.
6. We tag the **part of speech** of each word.
7. Finally, we use the part of speech to reduce each token to its **lemma**.

By doing all of this we drastically reduce the number of unique words we have to work with whilst still maintaining almost all of the meaning. 
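
To make the first three steps concrete before we define the full function, here is a minimal sketch on a single sentence. It only handles three contractions, and the final `.split()` is a stand-in for NLTK's `word_tokenize` so that the snippet runs without any downloads. The function name `expand_and_strip` and the example sentence are made up for illustration.

```python
import re
import string

def expand_and_strip(text):
    """Lowercase, expand a few contractions, then replace punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"won't", "will not", text)  # irregular form is handled first
    text = re.sub(r"n't", " not", text)        # general "n't" suffix
    text = re.sub(r"'ll", " will", text)
    # Map every punctuation character to a space, as the full function does.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # Whitespace splitting stands in for word_tokenize in this sketch.
    return text.translate(table).split()

tokens = expand_and_strip("We'll see, but it WON'T work!")
print(tokens)
# ['we', 'will', 'see', 'but', 'it', 'will', 'not', 'work']
```

Notice that the specific rule for "won't" has to run before the general "n't" rule, otherwise we would end up with "wo not".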

In [None]:
import string
import re
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer 


def preprocess_jokes(jokes):
    # Make them all lowercase
    jokes_lower = [joke.lower() for joke in jokes]
    
    # Replace English-language contractions
    jokes_expanded = []
    for joke in jokes_lower:
        # specific
        joke = re.sub(r"won\'t", "will not", joke)
        joke = re.sub(r"can\'t", "can not", joke)

        # general
        joke = re.sub(r"n\'t", " not", joke)
        joke = re.sub(r"\'re", " are", joke)
        joke = re.sub(r"\'s", " is", joke)
        joke = re.sub(r"\'d", " would", joke)
        joke = re.sub(r"\'ll", " will", joke)
        joke = re.sub(r"\'t", " not", joke)
        joke = re.sub(r"\'ve", " have", joke)
        joke = re.sub(r"\'m", " am", joke)
        jokes_expanded.append(joke)
    
    #Remove all other punctuation
    jokes_no_punct = [joke.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))) for joke in jokes_expanded]
    
    #Tokenize each of the jokes
    jokes_tokenized = [word_tokenize(joke) for joke in jokes_no_punct]
    
    #Remove stopwords
    stop_words = set(stopwords.words('english'))
    jokes_no_stops = []
    for joke in jokes_tokenized:
        joke_no_stops = [word for word in joke if word not in stop_words]
        jokes_no_stops.append(joke_no_stops)
    
    #Tag parts of speech using nltk pos_tag and convert to WN format
    pos_tagged_jokes = [pos_tag(joke) for joke in jokes_no_stops]
    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # Default to noun for any other tag
            return wordnet.NOUN
    wn_pos_jokes = []
    for joke in pos_tagged_jokes:
        wn_pos_joke = [(val[0], get_wordnet_pos(val[1])) for val in joke]
        wn_pos_jokes.append(wn_pos_joke)
    
    #Lemmatize all the individual words
    lemmatizer = WordNetLemmatizer() 
    lemmatized_jokes = []
    for joke in wn_pos_jokes:
        lemmatized = [lemmatizer.lemmatize(token[0], pos = token[1]) for token in joke]
        lemmatized_jokes.append(lemmatized)

    
    return lemmatized_jokes

### Importing the data

We're using the same file as last time, `stupidstuff.json`. We will load it into a variable called `data` and then remove any entries whose joke body is empty.

In [None]:
import json
with open('stupidstuff.json') as json_file:
    data = json.load(json_file)
print("Number of data entries {}".format(len(data)))
data = [item for item in data if item['body'] != '']
print("Number of entries after removing empties {}".format(len(data)))

### Looking at the categories

Because we are interested in [classifying](https://en.wikipedia.org/wiki/Statistical_classification) our jokes, we need to look at the categories that have been assigned to each of them so that we can design our model. Let's count how many are in each category and then sort them.

In [None]:
from collections import Counter

categories = [entry["category"] for entry in data]
categories_counts = Counter(categories)
# Sort the categories by ascending joke count
sorted_categories = dict(sorted(categories_counts.items(), key=lambda item: item[1]))
sorted_categories
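
As a side note, `Counter` can do the sorting for us. The sketch below uses a handful of made-up records (`toy_data` is hypothetical, not the real dataset) to show `most_common()`, which returns the categories with the largest count first.

```python
from collections import Counter

# Hypothetical toy records standing in for the real `data` list.
toy_data = [
    {"category": "Men"}, {"category": "Insults"}, {"category": "Men"},
    {"category": "Miscellaneous"}, {"category": "Men"}, {"category": "Insults"},
]

toy_counts = Counter(entry["category"] for entry in toy_data)
# most_common() returns (category, count) pairs, largest count first.
print(toy_counts.most_common())
# [('Men', 3), ('Insults', 2), ('Miscellaneous', 1)]
```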

### Choosing an appropriate subset

We can see that there are actually LOTS of categories. On top of this, many of the categories have very few jokes in them at all. These will be particularly difficult for us to work with because we don't have enough training data. 

There is also one category `Miscellaneous` that is significantly bigger than the others. In order to achieve reasonable results in multiclass classification we ideally want a similar number of training examples for each class. 

Let's ignore the `Miscellaneous` class and take the next 7 biggest categories to build our classification model. That still gives us a reasonable spread of categories and a decent number (100+) of training examples for each.

In [None]:
category_dict = {'Insults':0, 'Men':1, 'Women':2, 'Yo Mama':3, 'Light Bulbs':4, 'Religious':5, 'Political':6}
jokes = []
labels = []
for entry in data:
    if entry['category'] in category_dict:
        jokes.append(entry['body'])
        labels.append(category_dict[entry['category']])
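
We typed `category_dict` out by hand above, but the same selection could be done programmatically. Here is a rough sketch under the assumption that we want the `n` largest categories after excluding some names. The helper `pick_categories` and the toy counts are invented for illustration, and the label numbers it assigns follow the size ranking, so they would not necessarily match the hand-written dictionary.

```python
from collections import Counter

def pick_categories(counts, n, exclude=("Miscellaneous",)):
    """Return {category: label} for the n largest categories not in `exclude`."""
    kept = [name for name, _ in counts.most_common() if name not in exclude]
    return {name: label for label, name in enumerate(kept[:n])}

# Hypothetical counts, not the real numbers from the dataset.
toy_counts = Counter({"Miscellaneous": 50, "Men": 9, "Insults": 7, "Women": 5, "Puns": 1})
print(pick_categories(toy_counts, 3))
# {'Men': 0, 'Insults': 1, 'Women': 2}
```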

### Pre-processing the data subset

You'll notice that we also created a `labels` list which will store the category of each joke. This is important for both training and testing, as you'll find out later on.

We can now preprocess the data using the function we defined at the beginning of this notebook.

In [None]:
jokes_raw = jokes
jokes = preprocess_jokes(jokes)
jokes[0]

You can see from the first joke printed that the tokens look nice and clean: they are all words that are relevant to the joke. We can see immediately that this joke is about Bill and Hillary Clinton. Maybe that tells you something about the age of this dataset 😎

### Picking an example to work through

Let's choose an individual joke to work with so that we can get a feel for how the bag of words model works.

In [None]:
print(jokes_raw[6])

And here's what the joke looks like after pre-processing, as well as its category:

In [None]:
joke = jokes[6]
print(joke)
joke_label = labels[6]
print('The joke\'s category is: '+list(category_dict.keys())[list(category_dict.values()).index(joke_label)] +' which has the label value '+str(joke_label))
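
Looking up a category name from its label with `list(...).index(...)` works, but it is a bit of a mouthful. One alternative, sketched here, is to build the reverse mapping once and reuse it. The name `label_to_category` is our own choice.

```python
category_dict = {'Insults': 0, 'Men': 1, 'Women': 2, 'Yo Mama': 3,
                 'Light Bulbs': 4, 'Religious': 5, 'Political': 6}

# Swap keys and values so that a numeric label maps back to its category name.
label_to_category = {label: name for name, label in category_dict.items()}

print(label_to_category[3])
# Yo Mama
```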

### Building a bag of words dictionary

The first thing we need to do with any [bag of words model](https://en.wikipedia.org/wiki/Bag-of-words_model) is to build a **Dictionary**. 

The **Dictionary** allows us to map the individual tokens to their indices in a vector later on. It involves collecting all of the unique words in the training data and storing an entry for each one in the dictionary.

In [None]:
unique_words = set(joke)
index = 0
dictionary = {}
for word in unique_words:
    dictionary[word] = index
    index += 1
dictionary
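
One thing to be aware of: iterating over a Python `set` of strings does not guarantee the same order from one session to the next, so the indices printed above can change between runs. If you want reproducible indices, one option (a sketch, not the only way) is to sort the unique words first. The tokens below are made up.

```python
# Hypothetical tokens standing in for a pre-processed joke.
toy_tokens = ["bill", "hillary", "bill", "dinner"]

# Sorting the unique words gives every word a stable, alphabetical index.
toy_dictionary = {word: index for index, word in enumerate(sorted(set(toy_tokens)))}
print(toy_dictionary)
# {'bill': 0, 'dinner': 1, 'hillary': 2}
```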

### Compiling a bag of words vector

Once we've built our dictionary, we can now create the **Bag of Words Vector** for our joke. This is the important part because it is the vector representation of the joke that we are trying to classify. 

By representing each joke as a vector (i.e. a series of numbers) we are making it computer friendly. It allows us to run mathematical models and algorithms on the data in ways we couldn't do with data stored as a string.

In [None]:
bag_of_words = len(dictionary) * [0]
for word in joke:
    if word in dictionary:
        index = dictionary[word]
        bag_of_words[index] += 1
print(bag_of_words)
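
As a small taste of the maths that vectors make possible, here is a sketch that measures how similar two bag of words vectors are using cosine similarity. This is not something the classifier below relies on; the function and the toy vectors are just for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no words with anything
    return dot / (norm_a * norm_b)

# Hypothetical vectors over a three-word dictionary.
toy_vector_1 = [2, 1, 0]
toy_vector_2 = [1, 0, 1]
similarity = cosine_similarity(toy_vector_1, toy_vector_2)
print(round(similarity, 3))
# 0.632
```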

### Creating vectors for new jokes

Now that we've created a dictionary and a vector for our joke, let's try vectorizing another joke with the same dictionary and see what happens...

In [None]:
joke_2 = jokes[500]
print(joke_2)
joke_label = labels[500]
print('The joke\'s category is: '+list(category_dict.keys())[list(category_dict.values()).index(joke_label)] +' which has the label value '+str(joke_label))
bag_of_words = len(dictionary) * [0]
for word in joke_2:
    if word in dictionary:
        index = dictionary[word]
        bag_of_words[index] += 1
print(bag_of_words)

### Bag of words caveats

Oh no! You can see that the bag of words vector is completely empty (all zero values). That's because none of the words from our second joke appear in the first.

This is one of the problems with the bag-of-words model: it can't represent words it hasn't seen before. That's why it's important to build the dictionary from lots of examples, so that it is long enough that new jokes rarely contain unseen words.
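
If you want to put a number on this problem, one simple check is the fraction of a joke's tokens that are missing from the dictionary. The helper name `oov_rate` (for "out of vocabulary") and the toy data are our own invention.

```python
def oov_rate(tokens, dictionary):
    """Fraction of tokens that do not appear in the dictionary."""
    if not tokens:
        return 0.0
    unseen = [token for token in tokens if token not in dictionary]
    return len(unseen) / len(tokens)

# Hypothetical dictionary and joke tokens.
toy_dictionary = {"light": 0, "bulb": 1, "change": 2}
toy_joke = ["many", "light", "bulb", "screw"]
print(oov_rate(toy_joke, toy_dictionary))
# 0.5
```

A rate of 1.0 means every token is unseen, which is exactly the all-zero vector situation described above.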

Let's take these two jokes we've used and keep them as examples for later. We can then test our machine learning model with them to see if it works!

In [None]:
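# Remove jokes 1 and 2 from the training data.
# Pop the higher index first: popping index 6 first would shift every later
# item down by one, so index 500 would no longer point at our second joke.
joke_2_tokens, label_2 = jokes.pop(500), labels.pop(500)
joke_1_tokens, label_1 = jokes.pop(6), labels.pop(6)
jokes_for_testing = [joke_1_tokens, joke_2_tokens]
labels_for_testing = [label_1, label_2]
labels_for_testing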

### Dictionaries at scale

Last time we built a bag-of-words dictionary it was only for a single joke. To create our classifier we're going to need a dictionary that represents all of the words in the training set. 

In [None]:
def build_bow_dictionary(jokes):
    index = 0
    dictionary = {}
    for joke in jokes:
        unique_words = set(joke)
        for word in unique_words:
            if word not in dictionary:
                dictionary[word] = index
                index += 1
    return dictionary
dictionary = build_bow_dictionary(jokes)
len(dictionary)

### Bag of words vector function

Let's also build a function that will allow us to generate the bag of words vectors for any given dictionary. This will be handy if we want to try out different datasets or a different subset of our joke dataset.

In [None]:
def build_bow_vector(joke, dictionary):
    bow_vector = len(dictionary) * [0]
    for word in joke:
        if word in dictionary:
            index = dictionary[word]
            bow_vector[index] += 1
    return bow_vector
print(build_bow_vector(joke, dictionary))

We can now use this function to generate vectors for every single one of our jokes:

In [None]:
bow_vectors = [build_bow_vector(joke, dictionary) for joke in jokes]
print(bow_vectors[:5])

### Splitting the data

Before we train our model, we want to separate our dataset into two parts: a **training set** and a **test set**.

Our **training set**, along with its corresponding labels, will be used to train the model. The **test set** will then be run through that model to give us an accuracy value, i.e. the proportion of the test set's labels that the classifier got right!

In [None]:
#Split to train and test data here
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(bow_vectors, labels, train_size=0.8, random_state=97)
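
To demystify what a split like this does, here is a rough pure-Python sketch of the idea: shuffle the indices once with a fixed seed, then cut them at the 80% mark. It illustrates the concept only and is not scikit-learn's actual implementation, so it will not reproduce the same split as `train_test_split`. The function `simple_split` and the toy data are made up.

```python
import random

def simple_split(items, labels, train_size=0.8, seed=97):
    """Shuffle indices once, then cut them into a training part and a test part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = int(len(indices) * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]
    X_train = [items[i] for i in train_idx]
    X_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: ten "vectors" whose label equals the vector's first value.
toy_vectors = [[i, i + 1] for i in range(10)]
toy_labels = list(range(10))
toy_X_train, toy_X_test, toy_y_train, toy_y_test = simple_split(toy_vectors, toy_labels)
print(len(toy_X_train), len(toy_X_test))
# 8 2
```

The important detail is that the vectors and labels are indexed with the same shuffled indices, so each joke stays paired with its own label.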

### Building the model

We're going to use the [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) algorithm from the scikit-learn library. It is a common choice for multi-class text classification and works particularly well with the word counts produced by the bag of words model.

We import this and then train it on our **training set** `X_train` and **training labels** `y_train`.

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
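
If you're curious what is going on inside a multinomial Naive Bayes classifier, here is a simplified pure-Python sketch. For each class it stores a log prior (how common the class is) and a smoothed log likelihood per word, then picks the class with the highest total score. This is a teaching sketch with our own function names and a made-up four-word vocabulary, not scikit-learn's code; the `alpha` argument plays the same role as the `alpha` smoothing parameter of `MultinomialNB`.

```python
import math
from collections import defaultdict

def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Return {label: (log_prior, log_likelihoods)} learned from count vectors."""
    n_features = len(vectors[0])
    class_doc_counts = defaultdict(int)
    class_word_counts = defaultdict(lambda: [0] * n_features)
    for vector, label in zip(vectors, labels):
        class_doc_counts[label] += 1
        for i, count in enumerate(vector):
            class_word_counts[label][i] += count

    model = {}
    for label, doc_count in class_doc_counts.items():
        counts = class_word_counts[label]
        # Additive smoothing: pretend every word was seen `alpha` extra times.
        total = sum(counts) + alpha * n_features
        log_prior = math.log(doc_count / len(vectors))
        log_likelihoods = [math.log((c + alpha) / total) for c in counts]
        model[label] = (log_prior, log_likelihoods)
    return model

def predict_multinomial_nb(model, vector):
    """Pick the label with the highest log prior plus weighted log likelihoods."""
    scores = {
        label: log_prior + sum(c * ll for c, ll in zip(vector, log_likelihoods))
        for label, (log_prior, log_likelihoods) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical vocabulary: ["bulb", "change", "church", "pray"]
toy_vectors = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1]]
toy_labels = [4, 4, 5, 5]  # 4 = Light Bulbs, 5 = Religious
toy_model = train_multinomial_nb(toy_vectors, toy_labels)
print(predict_multinomial_nb(toy_model, [1, 0, 0, 0]))  # a joke containing only "bulb"
# 4
```

Without the smoothing term, a single word that never appeared in a class would give that class a probability of zero, no matter what the other words say.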

### Making a prediction

Now that the model has been built, we can make predictions on our test set and see how well it performs:

In [None]:
y_pred = classifier.predict(X_test)

### Checking the accuracy

Finally, we can then compare the predictions the model made with the actual labels for each of our test jokes. This will give us an accuracy value that is indicative of how good our model is.

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
accuracy
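
Accuracy itself is simple enough to compute by hand: it is the number of correct predictions divided by the total number of predictions. The sketch below also counts which (actual, predicted) pairs were confused, which can hint at where a model struggles. The labels here are made up and are not our model's real output.

```python
from collections import Counter

def simple_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)

# Hypothetical true labels and predictions.
toy_y_true = [0, 1, 2, 2, 1]
toy_y_pred = [0, 2, 2, 2, 1]

print(simple_accuracy(toy_y_true, toy_y_pred))
# 0.8

# Count each (actual, predicted) pair that the classifier got wrong.
toy_confusions = Counter(
    (truth, pred) for truth, pred in zip(toy_y_true, toy_y_pred) if truth != pred
)
print(toy_confusions)
# Counter({(1, 2): 1})
```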

### Checking with our jokes from earlier

Let's feed in the two jokes we set aside earlier and see what the classifier predicts!

In [None]:
unseen_joke_vectors = [build_bow_vector(joke, dictionary) for joke in jokes_for_testing]
unseen_preds = classifier.predict(unseen_joke_vectors)
for i in range(len(unseen_preds)):
    correct_or_not = 'incorrectly'
    if unseen_preds[i] == labels_for_testing[i]:
        correct_or_not = 'correctly'
    label_verbose = list(category_dict.keys())[list(category_dict.values()).index(unseen_preds[i])]
    print("The classifier {} predicted the category to be: {}".format(correct_or_not, label_verbose))

### Great stuff! It predicted our categories correctly.

In future, when you're working with the bag-of-words model, you don't actually have to write out the functions yourself. 

We did it because it was a good way to understand what's going on, but you can use the `CountVectorizer` class from scikit-learn to do it much more quickly:
```
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# CountVectorizer expects strings, so join each joke's tokens back together first
joined_jokes = [' '.join(joke) for joke in jokes]
term_frequencies = vectorizer.fit_transform(joined_jokes)
```

### Term Frequency - Inverse Document Frequency

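Our model performed pretty well, but can we do any better with a different approach? Let's try building a model using a TF-IDF representation instead of plain bag-of-words counts. TF-IDF weights each word's frequency in a joke by how rare that word is across all of the jokes, so words that appear everywhere count for less.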

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
joined_jokes = [' '.join(joke) for joke in jokes]
tfidf = vectorizer.fit_transform(joined_jokes)
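
To see what TF-IDF is doing, here is a small hand-rolled sketch using a common textbook formula: term frequency (the word's share of the joke) multiplied by `log(N / df)`, where `N` is the number of jokes and `df` is the number of jokes containing the word. Note that `TfidfVectorizer` uses a smoothed variant of this formula and normalises each row by default, so its numbers will differ from the ones below. The function `tf_idf` and the toy jokes are made up, and the sketch assumes no joke is empty.

```python
import math

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1

    weighted = []
    for doc in docs:
        weights = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            idf = math.log(n_docs / df[word])
            weights[word] = tf * idf
        weighted.append(weights)
    return weighted

# Hypothetical pre-processed jokes.
toy_docs = [["light", "bulb", "change"], ["light", "bulb", "bulb"], ["church", "light"]]
toy_weights = tf_idf(toy_docs)
print(toy_weights[2])
```

In this toy example "light" appears in every joke, so its weight is zero everywhere, while "church" appears in only one joke and gets the highest weight there.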


### Splitting the data

Again, let's separate our data into a **training set** and a **test set**.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, labels, train_size=0.8, random_state=97)

And once again, build the model...

In [None]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

and create predictions and evaluate the results.

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
accuracy

### Good stuff! Our model is now even better!

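TF-IDF has performed slightly better, but only marginally in this case. That's most likely because our pre-processing had already removed the stopwords, which are exactly the kind of very common words that TF-IDF would otherwise down-weight.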