**[Natural Language Processing Home Page](https://www.kaggle.com/learn/natural-language-processing)**

---


# Text Classification with SpaCy

A common task in NLP is text classification, which covers problems such as spam detection, sentiment analysis, and tagging customer queries. In this tutorial, I'll show you how to build a text classification model with SpaCy. The classifier will be trained to detect spam messages. This is a very common application: spam detectors run in the background of nearly all email clients these days.

In [None]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')
spam.head(10)

# Bag of Words
Machine learning models can't learn from raw text data. Instead, you need to convert the documents into a vector representation you can use as model input. Ideally these vectors will be close together for similar documents and far apart for dissimilar documents.

A simple representation we can use is a variation of one-hot encoding. You represent each document as a vector of term frequencies, with one element for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents).

As an example, let's take the sentences "Tea is life. Tea is love." and "Tea is healthy, calming, and delicious." as our corpus. The vocabulary then is `{"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"}` (ignoring punctuation).

For each document, we count up how many times a term occurs. We place that count in the appropriate element of a vector. The first sentence has "tea" twice and that is the first position in our vocabulary, so we put the number 2 in the first element of the vector. Our sentences as vectors then look like 

$$
\begin{align}
v_1 &= \left[\begin{matrix} 2 & 2 & 1 & 1 & 0 & 0 & 0 & 0 \end{matrix}\right] \\
v_2 &= \left[\begin{matrix} 1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \end{matrix}\right]
\end{align}
$$

This is called the "bag of words" representation. You can see that documents with similar terms will have similar vectors. Vocabularies typically have tens of thousands of terms, so these vectors will be very large.
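
To make the counting concrete, here is a minimal sketch in plain Python that rebuilds the vocabulary and the two vectors from the tea example. The helper names `tokenize` and `bag_of_words` are made up for this illustration, and the regex tokenizer is a simplifying assumption (it lowercases and drops punctuation); SpaCy's tokenizer is more sophisticated.

```python
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance across the corpus
vocabulary = []
for document in corpus:
    for token in tokenize(document):
        if token not in vocabulary:
            vocabulary.append(token)

def bag_of_words(text, vocabulary):
    """Return the term-count vector of `text` over `vocabulary`."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(document, vocabulary) for document in corpus]
print(vocabulary)
for vector in vectors:
    print(vector)
```

The two printed vectors match $v_1$ and $v_2$ above, with one element per vocabulary term.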

Another common representation is TF-IDF (Term Frequency - Inverse Document Frequency), which is similar to bag of words except that each term count is scaled down according to how many documents in the corpus contain that term, so words that appear almost everywhere carry less weight. Using TF-IDF can potentially improve your models, but you won't be needing it here. Feel free to look it up though!
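
If you're curious, here is a rough sketch of one common TF-IDF variant, where each count is multiplied by `log(N / df)` (`N` is the number of documents, `df` the number of documents containing the term). This exact formula is an assumption for illustration; libraries usually apply smoothed or normalized versions. The helper names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(document) for document in corpus]
n_documents = len(tokenized)

# Document frequency: in how many documents does each term appear?
document_frequency = Counter()
for tokens in tokenized:
    document_frequency.update(set(tokens))

def tf_idf(tokens):
    """Map each term in `tokens` to count * log(N / document frequency)."""
    counts = Counter(tokens)
    return {term: count * math.log(n_documents / document_frequency[term])
            for term, count in counts.items()}

weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0])
```

With this variant, "tea" and "is" get a weight of zero because they occur in every document, while terms unique to one document keep a positive weight.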

# Building a Bag of Words model

Once you have your documents in a bag of words representation, you can use those vectors as input to any machine learning model. SpaCy handles the bag of words conversion and building a simple linear model for you with the `TextCategorizer` class.

The TextCategorizer is a SpaCy pipe. Pipes are classes for processing and transforming tokens. When you create a SpaCy model with `nlp = spacy.load('en_core_web_sm')`, there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. When you run text through a model with `doc = nlp("Some text here")`, the outputs of the pipes are attached to the tokens in the `doc` object. The lemmas for `token.lemma_` come from one of these pipes.

You can remove or add pipes to models. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models have a tokenizer). Then, we'll create a TextCategorizer pipe and add it to the empty model.

In [None]:
import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

Since each message is either ham or spam, and never both, we set `"exclusive_classes"` to `True`. We've also configured it with the bag of words (`"bow"`) architecture. SpaCy provides a convolutional neural network architecture as well, but it's more complex than what you need for now.
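
To get a feel for what mutually exclusive classes mean for the output scores, here is a generic sketch (not SpaCy's internal code). With exclusive classes, a classifier typically normalizes its raw scores with a softmax so they sum to 1; with non-exclusive labels, each label usually gets its own independent sigmoid. The raw scores below are made up for illustration.

```python
import math

def softmax(raw_scores):
    """Normalize raw scores into probabilities that sum to 1."""
    largest = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(score - largest) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]

def sigmoid(raw_score):
    """Squash a single raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))

raw_scores = [2.0, -1.0]  # made-up raw scores for ("ham", "spam")

# Exclusive classes: the probabilities compete and sum to 1
exclusive = softmax(raw_scores)
# Non-exclusive labels: each probability is computed on its own
independent = [sigmoid(score) for score in raw_scores]

print(exclusive, sum(exclusive))
print(independent, sum(independent))
```

For spam detection the exclusive setting is the natural fit, since a higher "ham" probability should always mean a lower "spam" probability.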

Next we'll add the labels to the model. Here "ham" is the label for real messages and "spam" is the label for spam messages.

In [None]:
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

# Training a Text Categorizer Model

Now we need to convert the labels in the data to the form TextCategorizer requires. For each example we create a dictionary with boolean values for each class. In this case, if a text is "ham", we need a dictionary `{'ham': True, 'spam': False}`. The model is looking for these labels inside another dictionary with the key `'cats'`.

In [None]:
# Build the training texts and labels
# (this cell uses the full dataset; no data is held out for testing here)
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}} 
                for label in spam['label']]

Then we combine the texts and labels into a single list.

In [None]:
train_data = list(zip(train_texts, train_labels))
train_data[:3]

With the data prepared, we'll train the classifier. First, we'll set some random seeds so we get repeatable outcomes. Then we need to create an `optimizer` using `nlp.begin_training()`. SpaCy uses this optimizer to update the model. In general it's more efficient to train models in small batches. For this, SpaCy provides the `minibatch` function that returns a generator yielding minibatches for training. Finally, the minibatches are split into texts and labels, then used with `nlp.update` to update the model's parameters.

In [None]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
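
If the batching and the `zip(*batch)` trick feel opaque, this plain-Python sketch shows the idea on toy data. `simple_minibatch` is a hypothetical, simplified stand-in for `spacy.util.minibatch` that only supports a fixed integer batch size, and the toy texts and labels are invented for the example.

```python
def simple_minibatch(items, size):
    """Yield consecutive chunks of `items`, each with at most `size` elements."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy (text, label) pairs in the same shape as train_data
toy_data = [
    ("text %d" % i, {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
    for i in range(5)
]

batches = list(simple_minibatch(toy_data, size=2))
for batch in batches:
    # zip(*batch) turns a list of (text, label) pairs
    # into one tuple of texts and one tuple of labels
    texts, labels = zip(*batch)
    print(texts, len(labels))
```

Five examples with a batch size of 2 give batches of 2, 2, and 1 examples; the last batch is simply smaller.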

This is just one training loop (or epoch) through the data. The model will typically need multiple epochs. For this, you wrap the loop in another for loop and shuffle the training data at the beginning of each epoch.

In [None]:
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

# Making Predictions

Once your model is trained you can try making predictions. The easiest way to do this is by using the `predict()` method of the TextCategorizer pipe. First the input text needs to be tokenized with `nlp.tokenizer`. Then you pass the tokens to the predict method, which returns scores. Each score is the probability that the input text belongs to the corresponding class.

In [None]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

The scores are used to predict a single class or label by choosing the label with the highest probability. You get the index of the highest probability with `scores.argmax`, then use the index to get the label string from `textcat.labels`.

In [None]:
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

Evaluating the model is straightforward once you have the predictions. To measure the accuracy, calculate how many correct predictions are made on some test data, divided by the total number of predictions.
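
As a minimal sketch of that calculation, the hypothetical `accuracy` helper below compares predicted label strings against true label strings. The labels are made up for illustration; in practice they would come from held-out test data and the model's predictions.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predicted_labels and true_labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Made-up labels for illustration only
true_labels = ["ham", "spam", "ham", "ham"]
predicted_labels = ["ham", "spam", "spam", "ham"]

score = accuracy(predicted_labels, true_labels)
print(score)
```

Here three of the four predictions are correct, so the accuracy is 0.75.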

Next up, you'll train a text classifier model to predict the sentiment of Yelp reviews.

---
*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*