# Text Classification with spaCy

A common task in NLP is **text classification**. This is classification in the conventional machine learning sense, and it is applied to text. Common examples include **spam detection**, **sentiment analysis**, and **tagging customer queries**.  
    
This tutorial shows how to do text classification with spaCy. The classifier will detect spam messages, a common feature of most email clients.

## Bag of Words  
Machine learning models need numeric input, so text must first be converted to numbers.
  
The simplest common representation is a variation of **one-hot encoding**. In this representation, each document is a vector of term frequencies, with one element for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents).  
  
As an example, take the sentences "Tea is life. Tea is love." and "Tea is healthy, calming, and delicious." **as our corpus.** The **vocabulary** is then: `{"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"}` (ignoring punctuation).

  
For each document, count how many times each term occurs, and place that count in the corresponding element of a vector. The first sentence contains "tea" twice, and "tea" is the first term in our vocabulary, so we put the number 2 in the first element of the vector. Our sentences as vectors then look like `v1 = [2, 2, 1, 1, 0, 0, 0, 0]` and `v2 = [1, 1, 0, 0, 1, 1, 1, 1]`. This is called the **bag of words** representation. You can see that documents with similar terms will have similar vectors. Vocabularies frequently have tens of thousands of terms, so these vectors can be very large.  
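
The counting described above can be sketched in a few lines of plain Python. This is only an illustration: the `tokenize` helper is a hypothetical regex-based stand-in (spaCy's tokenizer is more sophisticated), and the vocabulary is ordered by first appearance so that it matches the order used in the text.

```python
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep alphabetic runs only,
    # which drops punctuation.
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance across the corpus
vocabulary = []
for doc in corpus:
    for term in tokenize(doc):
        if term not in vocabulary:
            vocabulary.append(term)

def bag_of_words(text):
    # Count each term, then read the counts off in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

v1, v2 = (bag_of_words(doc) for doc in corpus)
print(vocabulary)
print(v1)
print(v2)
```

Each vector has one element per vocabulary term, which is why the vectors grow with the size of the vocabulary.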
  
Another common representation is **TF-IDF (Term Frequency-Inverse Document Frequency)**. TF-IDF is similar to bag of words, except that each term's count is scaled by how rare the term is across the documents in the corpus, so terms that appear in nearly every document carry less weight. Using TF-IDF can potentially improve your models.  
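
As a rough sketch of the idea, the snippet below computes one common, unsmoothed TF-IDF variant: the term count multiplied by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. This formula is an assumption chosen for illustration; libraries often use smoothed or normalized variants, and the `tokenize` helper is again a hypothetical stand-in for a real tokenizer.

```python
import math
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    # Hypothetical helper: lowercase and keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    # Term count scaled by the (unsmoothed) inverse document frequency
    counts = Counter(tokens)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0])
```

With this variant, "tea" and "is" appear in both documents and receive a weight of zero, while terms unique to one document keep a positive weight.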

  
  


## Building a Bag of Words model  
Once you have your documents in a bag of words representation, **those vectors can be used as input to any machine learning model.** spaCy handles the bag of words conversion and building a simple linear model for you with the `TextCategorizer` class.  
  

The `TextCategorizer` is a spaCy **pipe**. Pipes are <u>classes for processing and transforming tokens.</u> When you create a spaCy model with `nlp = spacy.load('en_core_web_sm')`, there are default pipes that perform **part of speech tagging**, **entity recognition**, and other transformations. When you run text through a model with `doc = nlp("Some text here")`, the output of the pipes is attached to the tokens in the `doc` object. The lemmas for `token.lemma_` come from one of these pipes.  
  
You can remove or add pipes to models.

In [None]:
import spacy

# Note: this cell uses the spaCy v2 API (create_pipe with a config dict,
# add_pipe with a component object); spaCy v3 changed these calls.

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)


Each message is either ham or spam, never both, so we set `"exclusive_classes"` to `True`. We've also configured it with the bag of words (`"bow"`) architecture. spaCy provides a CNN architecture too.  

Now we'll add labels to the model. The label "ham" is for the real messages, and "spam" is for the spam messages.

In [None]:
# Add labels to the text classifier
textcat.add_label("ham") 
textcat.add_label("spam")

## Training a Text Categorizer Model  
Next, convert the labels in the data to the form the `TextCategorizer` requires.  
For each document, you'll create a dictionary of boolean values, one for each class.  
  
For example, if a text is "ham", we need the dictionary `{'ham': True, 'spam': False}`.
The model looks for these labels inside another dictionary under the key `'cats'`.  
  
  

In [None]:
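# `spam` is assumed to be a pandas DataFrame loaded earlier,
# with a 'text' column and a 'label' column ('ham' or 'spam').
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}}
                for label in spam['label']]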

Then we combine the texts and labels into a single list. 

In [None]:
train_data = list(zip(train_texts, train_labels))
train_data[:3]
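
To make the resulting structure concrete, here is a small sketch that builds the same kind of list from two made-up messages. The names `toy_texts`, `toy_labels`, and `toy_train_data` are hypothetical stand-ins for the `spam` columns and `train_data`, and the messages are invented for illustration.

```python
# Hypothetical toy data standing in for spam['text'] and spam['label']
toy_texts = ["See you at lunch?", "WIN a FREE prize now!!!"]
toy_labels = ["ham", "spam"]

# One {'cats': {...}} dictionary per document, with a boolean per class
toy_cats = [
    {"cats": {"ham": label == "ham", "spam": label == "spam"}}
    for label in toy_labels
]

# Pair each text with its annotation dictionary
toy_train_data = list(zip(toy_texts, toy_cats))
for text, annotation in toy_train_data:
    print(text, annotation)
```

Each element is a `(text, annotation)` tuple, which is the shape the training loop later unpacks.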

Now we train the model. First, create an `optimizer` using `nlp.begin_training()`. spaCy uses this optimizer to update the model. In general it's more efficient to train models in small batches. spaCy provides the `minibatch` function that returns a generator yielding minibatches for training. Finally, the minibatches are split into texts and labels, then used with `nlp.update` to update the model's parameters. 

In [None]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
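
The batching and the `zip(*batch)` split can be tried without spaCy. The sketch below uses a hypothetical `simple_minibatch` generator as a stand-in for `spacy.util.minibatch` (it only mimics the fixed-size case and is not spaCy's implementation), together with made-up toy data.

```python
from itertools import islice

def simple_minibatch(items, size):
    """Yield successive lists of at most `size` items.

    Hypothetical stand-in for spacy.util.minibatch with a fixed batch size.
    """
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Toy (text, annotation) pairs in the same shape as train_data
toy_data = [
    (f"text {i}", {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
    for i in range(5)
]

batches = list(simple_minibatch(toy_data, size=2))
print([len(batch) for batch in batches])

# zip(*batch) turns a list of (text, label) pairs into
# one tuple of texts and one tuple of labels
texts, labels = zip(*batches[0])
print(texts)
print(labels)
```

The last batch is smaller than the others when the data size is not a multiple of the batch size.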

This is just one training loop (or epoch) through the data. The model will typically need multiple epochs. Use another loop for more epochs, and optionally re-shuffle the training data at the beginning of each loop. 

In [None]:
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

## Making Predictions  
Now that you have a trained model, you can make predictions with the `predict()` method. The input text needs to be tokenized with `nlp.tokenizer`. Then you pass the tokens to the `predict()` method, which returns scores. Each score is the probability that the input text belongs to the corresponding class.

In [None]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

The scores are used to predict a single class or label by choosing the label with the highest probability. You get the index of the highest probability with `scores.argmax`, then use the index to get the label string from `textcat.labels`.

In [None]:
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])
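
The same lookup can be illustrated in plain Python. The scores below are made up for illustration, and the label order is assumed to match the order in which the labels were added ("ham", then "spam").

```python
# Assumed label order, matching the order of the add_label calls
labels = ("ham", "spam")

# Hypothetical scores: one row per document, one column per label
scores = [
    [0.98, 0.02],
    [0.10, 0.90],
]

# For each row, find the index of the largest score and look up its label
predicted = [
    labels[max(range(len(row)), key=row.__getitem__)]
    for row in scores
]
print(predicted)
```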

Evaluating the model is straightforward once you have the predictions. To measure the accuracy, calculate how many correct predictions are made on some test data, divided by the total number of predictions.
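
As a minimal sketch of that calculation, the hypothetical `accuracy` helper below compares predicted labels with true labels; the two label lists are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical predictions and true labels for four test messages
predicted = ["ham", "spam", "ham", "ham"]
actual = ["ham", "spam", "spam", "ham"]
print(accuracy(predicted, actual))  # 3 of 4 correct -> 0.75
```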

## Reference  
https://www.kaggle.com/matleonard/text-classification

---