# Bag of words approach for Natural Language Processing

We have seen in the last workshop how to train a model on tabular data. We saw that it was straightforward on numerical columns, but that it required a bit more work on categorical data. 

Well, what happens if we only have non-numerical data, like text, or images? This kind of data is called unstructured data, because it does not fit nicely in a table. 

In this tutorial, we will try to classify some text extracts into some given categories - so it's supervised learning like last time, only with text. 

Natural Language Processing typically requires a lot of preprocessing of the raw dataset, which we are going to gloss over today in the interest of time. 


We are going to use a classic dataset for text classification, 20 Newsgroups. It contains extracts from 20 Usenet newsgroups, and the goal is to predict which newsgroup a given extract belongs to.

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train')

We can have a look at the 20 categories of the dataset.

In [None]:
newsgroups_train.target_names

The labels are directly extracted from the Usenet newsgroup hierarchy, so we can still see the tree structure.

There is a wide variety of categories, some of them closely related, like `comp.sys.ibm.pc.hardware` and `comp.sys.mac.hardware`, some of them are quite unique, like `misc.forsale`, and some of them are opposing pairs, like `alt.atheism` and `soc.religion.christian`. 

Let's have a look at the kind of text we want to classify.

In [None]:
newsgroups_train.data[0]

Hm, that's kind of messy: there is a lot of "metadata", or at least information we don't really want our classifier to learn about (such as email addresses or newsgroup headers). 

Fortunately, scikit-learn has implemented the cleaning step for us.

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'))

In [None]:
newsgroups_train.data[0]

Much better. Now the text extract is clean and contains only the content of the message. 

But how do we handle it from there?

## Bag of words

The first way of encoding text is to use an approach called bag of words:
* We define a vocabulary.
* For each text extract to classify, we count the number of occurrences of each word, and fill the appropriate index in the vocabulary vector.
* All words which are in the vocabulary but not in the text get a 0 value. Words that are in the text but not in the vocabulary are ignored.
* The (sparse) matrix made of concatenating those vectors (size n_samples x vocabulary_size) is then fed to the classifier.

Let's look at an example. We have the following dataset:

| index | text                         |
|-------|------------------------------|
| **1** | All cats are mortal.         |
| **2** | Socrates is mortal.          |
| **3** | Therefore Socrates is a cat. |

We define our vocabulary to be:
```
    voc = ['Socrates', 'cat', 'cats', 'mortal', 'therefore']
```

Then, our encoded matrix is:

| index | Socrates | cat | cats | mortal | therefore |
|-------|----------|-----|------|--------|-----------|
| **1** | 0        | 0   | 1    | 1      | 0         |
| **2** | 1        | 0   | 0    | 1      | 0         |
| **3** | 1        | 1   | 0    | 0      | 1         |
    

I'm sure that all the Pythonistas already have an idea of how to implement this bag of words encoding with `Counter` and a clever list comprehension. 

We are not going to do that here (but feel welcome to give it a try at home); scikit-learn will take care of that for us.
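
For anyone who wants to compare notes after trying it at home, here is one possible sketch of the encoding on the toy dataset above. The tokenizer (lowercasing and splitting on non-letters) and the helper name `encode` are assumptions made for illustration; this is not how scikit-learn implements it internally.

```python
from collections import Counter
import re

docs = ["All cats are mortal.",
        "Socrates is mortal.",
        "Therefore Socrates is a cat."]
voc = ['Socrates', 'cat', 'cats', 'mortal', 'therefore']


def encode(text, vocabulary):
    """Return the bag-of-words count vector of `text` for `vocabulary` (hypothetical helper)."""
    # Assumption: lowercase the text and treat every run of letters as a word.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # A Counter returns 0 for missing keys, and words outside the vocabulary are ignored.
    return [counts[word.lower()] for word in vocabulary]


encoded = [encode(doc, voc) for doc in docs]
for row in encoded:
    print(row)
```

Each printed row should match the corresponding row of the table above.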

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(newsgroups_train.data)

We can have a look at the size of the vocabulary.

In [None]:
len(vectorizer.vocabulary_)

Wow, more than 100,000 words, that's a lot! 

Without further instructions, the `CountVectorizer` keeps in the vocabulary every word appearing even only once in the whole corpus. That's not quite what we want.

As a rule of thumb, we prefer to train on "long" matrices (with a lot of rows), rather than on "wide" matrices (with a lot of columns). The intuition behind that is that, if we have enough columns, the classifier will probably learn by heart a unique combination for each row, and thus not generalize well. That is overfitting! 

So, to force the classifier to generalize, we want to feed it a matrix with fewer columns, i.e. fewer words in the vocabulary. How about we keep the 3000 most frequent words?

In [None]:
vectorizer = CountVectorizer(max_features=3000)
vectorizer.fit(newsgroups_train.data)
len(vectorizer.vocabulary_)

Let's have a look at the words in the vocabulary.

In [None]:
list(vectorizer.vocabulary_)[:15]

Well, people sure are "wondering if anyone out there could \[help\] them". 

Joking aside, we can see that some words in the vocabulary, such as `was`, `this`, `the`, do not carry a lot of semantic meaning. We call those stopwords, and we don't want to keep them in our limited vocabulary.

In [None]:
vectorizer = CountVectorizer(max_features=3000,
                             stop_words='english')
vectorizer.fit(newsgroups_train.data)
list(vectorizer.vocabulary_)[:15]

That looks better. 
Now we can encode our dataset with this vocabulary.

In [None]:
X_train = vectorizer.transform(newsgroups_train.data)
X_train

In [None]:
len(newsgroups_train.data)

The shape of the sparse matrix we get looks correct. Can we check the first row?

In [None]:
newsgroups_train.data[0]

In [None]:
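# Column indices follow the alphabetical order of the vocabulary, so sort it once.
feature_names = sorted(vectorizer.vocabulary_)
_, col_index = X_train[0].nonzero()
for i in col_index:
    print(feature_names[i])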

Seems good. It's time to train our classifier!

## Naive Bayes

The first classifier we will use is called Naive Bayes. It uses Bayes rule to make decisions:
$$P(y=C_i \mid \mbox{features}) = \frac{P(\mbox{features} \mid y=C_i) \, P(y=C_i)}{P(\mbox{features})}$$

It supposes that all features (here, the number of occurrences of each word in the text) are conditionally independent given the class (that's the naive part). 

So the decision function can be rewritten as:
$$ P(y=C_i \mid \mbox{features}) \propto P(y=C_i) \prod_{\mbox{feature}} P(\mbox{feature} \mid y=C_i) $$
since $P(\mbox{features})$ is a constant that we do not need to compute.

$P(y=C_i)$ is just the relative frequency of the class $C_i$ in the training set, so we only need to compute $P(\mbox{feature} \mid y=C_i)$ for each feature.

In the multinomial flavour we are using here, the likelihood of each feature given a class is simply computed using a smoothed relative frequency count. (See [scikit-learn documentation](http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes) for more details.)
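
To make the "smoothed relative frequency count" concrete, here is a minimal pure-Python sketch of the multinomial computation on a tiny made-up count matrix. The function names, the toy data and the smoothing value `alpha=1.0` are assumptions chosen for illustration; it works in log space to avoid multiplying many small probabilities.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count rows X and labels y."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        # P(y = c): relative frequency of the class in the training set.
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature over all samples of class c.
        feature_counts = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(feature_counts)
        # Smoothed relative frequency: (count + alpha) / (total + alpha * n_features).
        log_likelihood[c] = [math.log((count + alpha) / (total + alpha * n_features))
                             for count in feature_counts]
    return log_prior, log_likelihood


def predict(x, log_prior, log_likelihood):
    """Return the class maximising log P(y) + sum_j x_j * log P(feature_j | y)."""
    scores = {c: log_prior[c] + sum(n * ll for n, ll in zip(x, log_likelihood[c]))
              for c in log_prior}
    return max(scores, key=scores.get)


# Toy data: 3 documents, 3 vocabulary words, 2 classes.
X_toy = [[2, 1, 0], [1, 0, 1], [0, 2, 1]]
y_toy = [0, 0, 1]
log_prior, log_likelihood = fit_multinomial_nb(X_toy, y_toy)
print(predict([1, 0, 0], log_prior, log_likelihood))
print(predict([0, 3, 1], log_prior, log_likelihood))
```

The smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.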


In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, newsgroups_train.target)

In [None]:
# train accuracy
nb_classifier.score(X_train, newsgroups_train.target)

Now that our classifier is trained, we want to evaluate how good it would be on new data. We can get a test set using the option `subset='test'`. Now it's your turn to do the preprocessing again.

In [None]:
# %load solutions/preprocessing_test.py
newsgroups_test = fetch_20newsgroups(subset='test',
                                      remove=('headers', 'footers', 'quotes'))
X_test = ...

And now compute the test accuracy.

In [None]:
# %load solutions/test_score_nb.py


We have an overall accuracy of 70% on the training set, and only 55% on the test set. It might seem quite low, but let's not forget that we have 20 classes, so if we were making random predictions, we would have an accuracy of roughly 5%.

We can check that by computing a random baseline.

In [None]:
from sklearn.dummy import DummyClassifier

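# 'stratified' draws random labels following the training class frequencies;
# recent scikit-learn versions default to always predicting the most frequent class.
dummy = DummyClassifier(strategy='stratified', random_state=0)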
dummy.fit(X_train, newsgroups_train.target)
dummy.score(X_train, newsgroups_train.target)

In [None]:
dummy.score(X_test, newsgroups_test.target)

Another thing we might want to check is how errors are distributed across classes. For example, it is understandable that the classifier would confuse extracts from `comp.sys.ibm.pc.hardware` and `comp.sys.mac.hardware`. 

The tool to visualize that is called a confusion matrix. We define a function to prettify scikit-learn output.

In [None]:
# %load solutions/confusion_matrix.py
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
import pandas as pd

def confusion_matrix(y_true, y_predicted, labels):
    df = pd.DataFrame(data=sk_confusion_matrix(y_true, y_predicted), index=labels, columns=labels)
    df.index.name = 'true classes'
    df.columns.name = 'predicted classes'
    return df


In [None]:
y_estimated_test = nb_classifier.predict(X_test) 
confusion_mat = confusion_matrix(newsgroups_test.target, y_estimated_test, newsgroups_test.target_names)
confusion_mat

We can plot the matrix as a heatmap.

In [None]:
import seaborn as sns
%matplotlib inline

sns.heatmap(confusion_mat)


## Logistic Regression

We can use other classifiers than Naive Bayes. A personal favorite of mine is Logistic Regression, which is the most badly named linear classifier ever, but it has the advantage of retaining some interpretability.

In [None]:
from sklearn.linear_model import LogisticRegression

lr_classifier = LogisticRegression(multi_class='multinomial',
                                   solver='lbfgs')

Now, you can train and evaluate this classifier. Remember that all scikit-learn classifiers share a common interface, so you can probably use the same methods as for the Naive Bayes classifier (or you can check the documentation).

In [None]:
# %load solutions/log_reg_training.py


We have improved our train accuracy a lot! Unfortunately, that does not transfer to the test accuracy: we are probably overfitting. 

The main hyperparameter of logistic regression is called `C`. It is a positive float, the inverse of the regularisation strength. The smaller `C`, the smoother our decision function will be, which means we are less likely to overfit.
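
To see why `C` acts as an inverse regularisation strength, it can help to look at the quantity being minimised. As a sketch, assuming the binary case with labels $y_i \in \{-1, +1\}$ and an L2 penalty (the symbols $w$, $b$ and $x_i$ are introduced here for illustration only), the cost has the form:

$$ \min_{w, b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (x_i^T w + b)\right)\right) $$

With a small `C`, the data-fit term on the right counts for less relative to the penalty on $w$, so the weights are pushed towards zero. The multinomial case follows the same idea with one weight vector per class.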

Let's modify our code to add some regularisation. `C` defaults to `1`. We want more regularisation, so let's try something smaller.

In [None]:
# %load solutions/log_reg_training_with_regularisation.py


Depending on the exact value you chose, the test accuracy might be slightly better or worse, but how can we choose the optimal value for `C`?

We are going to use cross-validation. If you don't remember what cross-validation is, here is a quick summary.

    At each iteration, we use 90% of the data to train a model, and the remaining 10% to evaluate how good the model is. 
    And we repeat that 10 times, using a different 10% to evaluate each time. 

![Cross-validation](img/crossValidation.png)
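
To make the procedure concrete, here is a small sketch of how the sample indices could be split into 10 folds by hand. The function name `k_fold_indices` is hypothetical, and the sketch assumes the samples are already in random order (it does not shuffle).

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(20, n_folds=10))
for train, validation in folds[:3]:
    print(len(train), "training samples, validating on", validation)
```

Every sample ends up in the validation part exactly once, so each of the 10 scores is computed on data the corresponding model has not seen.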

And the good news is that we don't even have to do that by hand: scikit-learn provides us with a class for that!

In [None]:
from sklearn.linear_model import LogisticRegressionCV

In [None]:
# %load solutions/log_reg_cv_training.py


In [None]:
confusion_mat = confusion_matrix(newsgroups_test.target, lr_cv_classifier.predict(X_test), newsgroups_test.target_names)
confusion_mat

In [None]:
sns.heatmap(confusion_mat)

Even if the improvement did not seem like much on the accuracy value, the confusion matrix looks much nicer with the Logistic Regression. 

Some samples are now predicted for the `comp.os.ms-windows.misc` class, and the classifier is much less confused among the `comp` classes, which were mixed up before. 

The improvement is not that spectacular with the class `talk.religion.misc`, but it is not worse. 

**Bonus step: Optimise the values of hyperparameters with cross-validation**

`C` is not the only hyperparameter we can tune with cross-validation. Remember that we chose the number of words in the vocabulary at the beginning of the notebook, and that choice was quite arbitrary too. 

We can tune both these parameters to improve accuracy, and the best way would be to tune them at the same time. 

Here are some hints to solve this advanced exercise:
 * We want to do a grid search over those two hyperparameters, that is to say, try every possible combination and keep the best one.
 * scikit-learn can probably help you, have a look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
 * Consider using `sklearn.pipeline.Pipeline` to merge both steps (vectorizer and classifier) into one estimator.
 * The number of models to train grows as the Cartesian product of the two lists of hyperparameters to try, so don't be too greedy! Trying out 3 values for the vocabulary size (1500, 3000, 5000) and 5 values for `C` (.001, .01, .05, .1, .5) is probably a good start. 
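
The last hint can be checked with a few lines of plain Python before launching anything expensive. This sketch only counts the combinations; it does not solve the exercise. The value `n_folds = 5` is an assumption about the number of cross-validation folds used for each combination.

```python
from itertools import product

vocabulary_sizes = [1500, 3000, 5000]
c_values = [0.001, 0.01, 0.05, 0.1, 0.5]

# Every (vocabulary size, C) pair the grid search would have to try.
grid = list(product(vocabulary_sizes, c_values))

n_folds = 5  # assumption: 5-fold cross-validation for each combination
n_fits = len(grid) * n_folds

print(len(grid), "combinations,", n_fits, "models to train")
```

Adding a third hyperparameter with only 4 candidate values would multiply the number of fits by 4 again, which is why the lists should stay short.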

In [None]:
# %load solutions/grid_search.py


## Understanding the results

One great thing about using interpretable classifiers such as logistic regression is that we can have a look at why the classifier is making the prediction it makes. 

We are going to use a Python library called eli5 which provides tools to visualize the inner workings of ML algorithms.

In [None]:
import eli5
eli5.show_weights(lr_cv_classifier, 
                  vec=vectorizer, 
                  top=10,
                  target_names=newsgroups_test.target_names)

We can observe that often the features on which the logistic regression relies make sense: we find `bike` in the `rec.motorcycles` class, `encryption` for the `sci.crypt` class, and so on. 

We can also observe that `soc.religion.christian` and `talk.religion.misc` share quite a lot of features, which explains why the classifier is so confused between the two classes.

eli5 has another cool feature which is explaining the prediction for a given sample. Words highlighted in red are contributing negatively to the class (making the class less likely), and words in green are contributing towards the class. The deeper the color, the higher the contribution. 

In [None]:
eli5.show_prediction(lr_cv_classifier, 
                     newsgroups_test.data[0], 
                     vec=vectorizer,
                     target_names=newsgroups_test.target_names)

We have a lot of classes, so that's a bit messy. We can show only the top 5 classes.

In [None]:
eli5.show_prediction(lr_cv_classifier, 
                     newsgroups_test.data[0], 
                     vec=vectorizer,
                     target_names=newsgroups_test.target_names, 
                     top_targets=5)

We can also have a look at a sample that is wrongly classified to get a sense of what went wrong.

In [None]:
newsgroups_test.target_names[newsgroups_test.target[1]]

In [None]:
eli5.show_prediction(lr_cv_classifier, 
                     newsgroups_test.data[1], 
                     vec=vectorizer,
                     target_names=newsgroups_test.target_names, 
                     top_targets=5)

One reason for misclassification might be that words are split in two when they contain a dash. That's something we might be able to fix by changing options in `CountVectorizer`. 
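
As a rough sketch of that idea, the snippet below compares the default `token_pattern` of `CountVectorizer` (`(?u)\b\w\w+\b`) with an alternative pattern that keeps dashes inside words. The alternative pattern and the sample sentence are assumptions made for illustration; the comparison uses only the `re` module, so the effect is visible without refitting anything.

```python
import re

# Default token pattern of CountVectorizer: runs of 2 or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Assumed alternative: also allow dashes between the first and last word character.
dash_pattern = r"(?u)\b\w[\w-]*\w\b"

text = "a state-of-the-art anti-virus tool"

default_tokens = re.findall(default_pattern, text)
dash_tokens = re.findall(dash_pattern, text)

print(default_tokens)
print(dash_tokens)
```

Such a pattern could then be passed as `CountVectorizer(token_pattern=dash_pattern, ...)`. Whether it actually improves the test accuracy would have to be checked, for instance with the cross-validation tools used above.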

## Conclusion 

We have seen how to encode text using bag of words. One advantage of this approach is that, if combined with the right classifier, the results remain interpretable.

We have seen that many machine learning algorithms, whether we are dealing with text or not, have a lot of hyperparameters to fine tune. But fortunately, there are existing tools to help us with this endeavour.

What we have not seen: all the Natural Language Processing steps necessary for handling raw text: tokenization (splitting the text into words), lemmatization (reducing words to their base form, 'was' -> 'be'), cleaning of irrelevant words, etc.