Things I need to teach in this tutorial

- Bag of Words
- Creating a SpaCy TextCategorizer
- Training the TextCategorizer
- Using a TextCategorizer to make predictions
- Evaluating a TextCategorizer

# Text Classification with SpaCy

A common task in NLP is text classification, which covers things like spam detection, sentiment analysis, and tagging customer queries. In this tutorial, I'll show you how to build a text classification model with SpaCy. The classifier will be trained to detect spam messages. This is a very common use case: spam detectors run in the background of nearly all email clients these days.

# Bag of Words
Machine learning models can't learn from raw text data. Instead, you need to convert the documents into a vector representation you can use as model input. Ideally these vectors will be close together for similar documents and far apart for dissimilar documents.

A simple representation we can use is a variation of one-hot encoding. Each document is represented as a vector of term frequencies, with one element for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents).

As an example, let's take the sentences "Tea is life. Tea is love." and "Tea is healthy, calming, and delicious." as our corpus. The vocabulary then is `{"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"}`.

For each document, we count how many times each term occurs and place that count in the corresponding element of a vector. The first sentence contains "tea" twice, and "tea" is the first term in our vocabulary, so we put the number 2 in the first element of its vector. Our sentences as vectors then look like this:

$$
\begin{align}
v_1 &= \left[\begin{matrix} 2 & 2 & 1 & 1 & 0 & 0 & 0 & 0 \end{matrix}\right] \\
v_2 &= \left[\begin{matrix} 1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \end{matrix}\right]
\end{align}
$$

This is called the "bag of words" representation. You can see that documents with similar terms will have similar vectors. Typical vocabularies have tens of thousands of terms, so these vectors can be very large.
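To make the counting concrete, here is a minimal sketch in plain Python that rebuilds the two vectors above. The helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) are made up for this illustration, and the regex tokenizer is a crude stand-in, not what SpaCy does internally.

```python
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Collect each distinct term, in order of first appearance."""
    vocabulary = []
    seen = set()
    for document in documents:
        for term in tokenize(document):
            if term not in seen:
                seen.add(term)
                vocabulary.append(term)
    return vocabulary

def bag_of_words(document, vocabulary):
    """Return the term-count vector for one document."""
    counts = Counter(tokenize(document))
    return [counts[term] for term in vocabulary]

vocabulary = build_vocabulary(corpus)
vectors = [bag_of_words(document, vocabulary) for document in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Keeping the vocabulary in order of first appearance is only a convenience so the positions line up with the example; any fixed ordering works as long as every document uses the same one.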

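Another common representation is TF-IDF (term frequency, inverse document frequency). It is similar to bag of words, except that each term count is scaled down according to how many documents in the corpus contain that term, so terms that show up everywhere count for less. Using TF-IDF can potentially improve your models, but you won't be needing it here. Feel free to look it up though!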
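If you're curious what that scaling looks like, here is a rough sketch on the tea corpus. It assumes one simple variant, weight = count × log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries typically use smoothed or normalized versions, so treat the exact formula and the `tf_idf` helper as assumptions for illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document, using count * log(N / df)."""
    tokenized = [tokenize(document) for document in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights

weights = tf_idf(corpus)
for document_weights in weights:
    print(document_weights)
```

With this variant, "tea" and "is" get a weight of zero because they occur in every document, while terms unique to one document keep a positive weight.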

### Building a Bag of Words model

Once you have your documents in a bag of words representation, you can use those vectors as input to any machine learning model. SpaCy handles both the bag of words conversion and the construction of a simple linear model for you with the `TextCategorizer` class.
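To give a feel for what "a simple linear model on bag of words vectors" means, here is a small sketch of logistic regression trained with gradient descent in plain Python. This is not SpaCy's implementation. The four training messages are invented for illustration, and the function names, learning rate, and epoch count are arbitrary assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical toy data, invented for illustration only: 1 = spam, 0 = not spam
train_data = [
    ("win a free prize now", 1),
    ("free cash win now", 1),
    ("are we meeting for lunch today", 0),
    ("see you at lunch", 0),
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

vocabulary = sorted({term for text, _ in train_data for term in tokenize(text)})
index = {term: i for i, term in enumerate(vocabulary)}

def vectorize(text):
    """Bag of words vector; terms outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        if term in index:
            vector[index[term]] = count
    return vector

def predict_proba(weights, bias, vector):
    """Probability of class 1: a sigmoid applied to a weighted sum of counts."""
    score = bias + sum(w * x for w, x in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-score))

def train(data, epochs=50, learning_rate=0.5):
    """Fit one weight per vocabulary term with per-example gradient updates."""
    weights = [0.0] * len(vocabulary)
    bias = 0.0
    for _ in range(epochs):
        for text, label in data:
            vector = vectorize(text)
            error = label - predict_proba(weights, bias, vector)
            weights = [w + learning_rate * error * x for w, x in zip(weights, vector)]
            bias += learning_rate * error
    return weights, bias

weights, bias = train(train_data)

def classify(text):
    """Return 1 (spam) or 0 (not spam) for a new message."""
    return 1 if predict_proba(weights, bias, vectorize(text)) >= 0.5 else 0

print(classify("free prize now"))
print(classify("lunch today"))
```

The model learns one weight per term, so terms seen mostly in spam push the score up and terms seen mostly in ordinary messages push it down.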

The TextCategorizer is a SpaCy pipe. Pipes are classes for processing and transforming tokens. When you create a SpaCy model with `nlp = spacy.load('en_core_web_sm')`, there are default pipes that perform part of speech tagging, entity recognition, and other transformations. When you run text through a model with `doc = nlp("Some text here")`, the outputs of the pipes are attached to the tokens in the `doc` object. The lemmas you get from `token.lemma_` come from one of these pipes.

You can add pipes to models or remove them. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models have a tokenizer). Then, we'll create a TextCategorizer pipe and add it to the empty model.

In [1]:
import spacy

# Create an empty model
nlp = spacy.blank("en")

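# Create the TextCategorizer with exclusive classes and "bow" architecture.
# Note: this is the spaCy v2 API. In spaCy v3, nlp.add_pipe() expects the
# component name as a string instead of a component object.
textcat = nlp.create_pipe(
    "textcat",
    config={
        "exclusive_classes": True,
        "architecture": "bow",
    },
)

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)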

We've configured it with the bag of words (`"bow"`) architecture. SpaCy also provides a convolutional neural network architecture, but it's more complex than what you need for now. Also, since we have exclusive classes (each text gets exactly one label), `"exclusive_classes"` is set to `True`.

In [None]:
# Add NEGATIVE and POSITIVE labels to text classifier
textcat.add_label("NEGATIVE")
textcat.add_label("POSITIVE")