# Natural Language Classification

In this first exercise, you will train a model that classifies Yelp reviews as having "good" or "bad" sentiment. You'll use spaCy to train a text classifier, then use the model to predict the sentiment of text examples.

The data consists of the text body of each review along with the star rating. The star ratings have been grouped into sentiments: ratings with 1-2 stars are "negative" and ratings with 4-5 stars are "positive", while 3-star ratings are "neutral" and have been dropped from the data.
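
The grouping described above can be sketched as a small function. This is only an illustration: the helper name `stars_to_sentiment` is hypothetical, and the 0/1 encoding is inferred from the `sentiment` column of the data rather than taken from the code that prepared the dataset.

```python
from typing import Optional


def stars_to_sentiment(stars: float) -> Optional[int]:
    """Map a star rating to 0 (negative), 1 (positive), or None (neutral, dropped)."""
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    return None  # 3-star reviews are neutral and excluded


ratings = [1.0, 5.0, 3.0, 4.0, 2.0]
labeled = [(s, stars_to_sentiment(s)) for s in ratings]

# Drop the neutral reviews, as was done for this dataset
kept = [(s, y) for s, y in labeled if y is not None]
print(kept)
```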

<img src="https://i.imgur.com/7l6vwIr.png" width=400px>

The goal, then, is to use the text and sentiment of each review to train a classification model for predicting the sentiment of new text. To do this, you'll use spaCy's TextCategorizer component.

In [1]:
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex1 import *
print("\nSetup complete")


Setup complete


Load in the data here.

In [2]:
all_data = pd.read_csv('../input/yelp_ratings.csv', index_col=0)
all_data.head()

|   | text | stars | sentiment |
|---|------|-------|-----------|
| 0 | Total bill for this horrible service? Over $8G... | 1.0 | 0 |
| 1 | I *adore* Travis at the Hard Rock's new Kelly ... | 5.0 | 1 |
| 2 | I have to say that this office really has it t... | 5.0 | 1 |
| 3 | Went in for a lunch. Steak sandwich was delici... | 5.0 | 1 |
| 4 | Today was my second out of three sessions I ha... | 1.0 | 0 |


## 1) Exercise: Create the model

For the first exercise, create the text classifier model and add the labels `"NEGATIVE"` and `"POSITIVE"`. For the model, use the `"bow"` (bag of words) architecture. The other architectures will likely give better performance, but they train much more slowly.

In [3]:
import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
            "textcat",
            config={
                "exclusive_classes": True,
                "architecture": "bow"})
nlp.add_pipe(textcat)

# Add NEGATIVE and POSITIVE labels to text classifier
textcat.add_label("NEGATIVE")
textcat.add_label("POSITIVE")

1

## Loading the Data
Here I've included a function to load the data and split it into training and validation slices.

In [4]:
def load_data(csv_file, split=0.8):
    data = pd.read_csv(csv_file)
    
    # Shuffle data
    train_data = data.sample(frac=1, random_state=7)
    
    texts = train_data.text.values
    labels = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)}
              for y in train_data.sentiment.values]
    split = int(len(train_data) * split)
    
    return texts[:split], labels[:split], texts[split:], labels[split:]

In [5]:
train_texts, train_labels, val_texts, val_labels = load_data('../input/yelp_ratings.csv')
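
To make the label format and the split arithmetic in `load_data` concrete, here is a minimal sketch on toy values. The `sentiments` list is made up for illustration and only stands in for the `sentiment` column.

```python
# Toy values standing in for the `sentiment` column (1 = positive, 0 = negative)
sentiments = [0, 1, 1, 1, 0]

# Same label format as in load_data: one boolean per category
labels = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in sentiments]

# Same split arithmetic as in load_data with split=0.8
split = int(len(labels) * 0.8)
train_part, val_part = labels[:split], labels[split:]

print(labels[0])
print(len(train_part), len(val_part))
```

Because the categories are exclusive, exactly one of the two flags is `True` for each review.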

## 2) Exercise: Train Function

Implement a function `train` that updates a model with training data.

In [6]:
from spacy.util import minibatch
import random

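def train(model, train_data, optimizer, batch_size=8):
    """Run one pass over train_data, updating the model batch by batch.

    Returns the dictionary of accumulated losses.
    """
    losses = {}
    # Shuffle in place so each pass sees the examples in a different order
    random.shuffle(train_data)
    batches = minibatch(train_data, size=batch_size)
    for batch in batches:
        # Need to get separate iterables for texts and labels
        texts, labels = zip(*batch)
        model.update(texts, labels, sgd=optimizer, drop=0.2, losses=losses)
    return losses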
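
The batching and the `zip(*batch)` step can be seen in isolation with plain Python. In this sketch, `simple_minibatch` is a hypothetical stand-in written for illustration, not spaCy's `minibatch`, and the toy examples are made up.

```python
from itertools import islice


def simple_minibatch(items, size):
    """Yield consecutive lists of at most `size` items (illustrative stand-in)."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            break
        yield batch


toy_data = [
    ("Great food", {"cats": {"POSITIVE": True, "NEGATIVE": False}}),
    ("Terrible service", {"cats": {"POSITIVE": False, "NEGATIVE": True}}),
    ("Would come back", {"cats": {"POSITIVE": True, "NEGATIVE": False}}),
]

batches = list(simple_minibatch(toy_data, size=2))

# zip(*batch) turns a list of (text, annotation) pairs into two parallel tuples
texts, annotations = zip(*batches[0])
print(texts)
print(annotations)
```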

In [7]:
optimizer = nlp.begin_training()

train_data = list(zip(train_texts, [{"cats": labels} for labels in train_labels]))

losses = train(nlp, train_data, optimizer)
print(losses['textcat'])

10.471749642463003


We can try this slightly trained model on some example text and look at the probabilities assigned to each label.

In [8]:
text = "This tea cup was full of holes. Do not recommend."
doc = nlp(text)
print(doc.cats)

{'NEGATIVE': 0.8552083373069763, 'POSITIVE': 0.1447916030883789}


These probabilities look reasonable. Now you should turn them into an actual prediction.
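
For a single document, one simple way to do this is to pick the key with the largest value in the `doc.cats` dictionary. The sketch below hard-codes the scores printed above (rounded) rather than calling the model.

```python
# Scores copied (rounded) from the doc.cats output above
cats = {"NEGATIVE": 0.8552, "POSITIVE": 0.1448}

# The predicted label is the category with the highest score
predicted_label = max(cats, key=cats.get)
print(predicted_label)
```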

## 3) Exercise: Making Predictions

Implement a function `predict` that uses a model to predict the sentiment of text examples. The function takes a spaCy model (with a TextCategorizer) and a list of texts. First, tokenize the texts using `model.tokenizer`. Then, pass those docs to the TextCategorizer, which you can get from `model.get_pipe`. Use the `textcat.predict` method to get scores for each document, then choose the class with the highest score (probability) as the predicted class.

In [9]:
def predict(model, texts): 
    # Use the tokenizer to tokenize each input text example
    docs = [model.tokenizer(text) for text in texts]
    
    # Use textcat to get the scores for each doc
    textcat = model.get_pipe('textcat')
    scores, _ = textcat.predict(docs)
    
    # From the scores, find the class with the highest score/probability
    predicted_class = scores.argmax(axis=1)
    
    return predicted_class
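
The row-wise argmax in `predict` can be illustrated without a model. This sketch uses made-up scores and plain lists in place of the array returned by `textcat.predict`. It assumes the label order `("NEGATIVE", "POSITIVE")`, which matches the order in which the labels were added.

```python
# Assumed label order: index 0 = NEGATIVE, index 1 = POSITIVE
labels = ("NEGATIVE", "POSITIVE")

# Made-up scores: one row per document, one column per label
scores = [
    [0.86, 0.14],
    [0.08, 0.92],
    [0.35, 0.65],
]

# For each row, take the index of the largest score (what argmax(axis=1) does)
predicted_class = [max(range(len(row)), key=row.__getitem__) for row in scores]
predicted_labels = [labels[i] for i in predicted_class]

print(predicted_class)
print(predicted_labels)
```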

In [10]:
predictions = predict(nlp, val_texts[23:27])
texts = val_texts[23:27]

for p, t in zip(predictions, texts):
    print(f"{textcat.labels[p]}: {t} \n")

POSITIVE: Some of the best sushi I've ever had. Reasonable prices. Excellent service and drinks. 

NEGATIVE: I would be remiss if I said nothing was good, because the egg rolls were good and the white rice, because it's white rice and that was pretty good. But everything else was kind of sub par. Maybe it's what I ordered. The quality of the chicken on the general tso's was really bad. You hope for crispy chicken with a spicy sauce, but this is soggy and the breading is gross. The lo mien hardly has any vegetables, and is also pretty bleh. Stay far away from the crab rangoons. 

I wanted to like this place, I had gone to the one in Vermilion and was even more disappointed. Seeing these reviews, and knowing it was under different management  I thought it would be better. The people in the restaurant love it, but it's gross. 

POSITIVE: One of my favorite Asian restaurants. The food is not typical and seemingly more authentic. There are items on the menu I would have to be a bit more adv

By eye, it looks like your model is working well after going through the data just once. However, you need to calculate a metric for the model's performance on the hold-out validation data.

## 4) Exercise: Evaluating a Trained Model

Implement a function that evaluates a `TextCategorizer` model. This function `evaluate` takes a model along with texts and labels. It returns the accuracy of the model, the number of correct predictions divided by all predictions.

First, use the `predict` function you wrote earlier to get the predicted class for each text in `texts`. Then, find where the predicted labels match the true "gold-standard" labels and calculate the accuracy.

In [11]:
def evaluate(model, texts, labels):
    """ Returns the accuracy of a TextCategorizer model. 
    
        Arguments
        ---------
        model: spaCy model with a TextCategorizer
        texts: Text samples, from load_data function
        labels: True labels, from load_data function
    
    """
    # Get predictions from textcat model
    predicted_class = predict(model, texts)
    
    # From labels, get the true class as a list of integers (POSITIVE -> 1, NEGATIVE -> 0)
    true_class = [int(each['POSITIVE']) for each in labels]
    
    # A boolean or int array indicating correct predictions
    correct_predictions = predicted_class == true_class
    
    # The accuracy, number of correct predictions divided by all predictions
    accuracy = correct_predictions.mean()
    
    return accuracy

In [12]:
accuracy = evaluate(nlp, val_texts, val_labels)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9432
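
As a worked example of the accuracy calculation, the following sketch uses made-up predictions and gold labels with plain lists instead of arrays.

```python
# Made-up predicted classes (1 = POSITIVE, 0 = NEGATIVE)
predicted_class = [1, 0, 1, 1, 0]

# Made-up gold labels in the same format that load_data produces
gold_labels = [
    {"POSITIVE": True, "NEGATIVE": False},
    {"POSITIVE": False, "NEGATIVE": True},
    {"POSITIVE": False, "NEGATIVE": True},
    {"POSITIVE": True, "NEGATIVE": False},
    {"POSITIVE": False, "NEGATIVE": True},
]

# POSITIVE -> 1, NEGATIVE -> 0
true_class = [int(each["POSITIVE"]) for each in gold_labels]

# Accuracy = number of matches / number of predictions
correct = [p == t for p, t in zip(predicted_class, true_class)]
accuracy = sum(correct) / len(correct)
print(f"Accuracy: {accuracy:.2f}")
```

Four of the five predictions match here, so the accuracy is 4 / 5 = 0.8.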


With the functions implemented, you can train and evaluate in a loop.

In [13]:
n_iters = 5
train_data = list(zip(train_texts, [{"cats": labels} for labels in train_labels]))
for i in range(n_iters):
    losses = train(nlp, train_data, optimizer)
    accuracy = evaluate(nlp, val_texts, val_labels)
    print(f"Loss: {losses['textcat']:.3f} \t Accuracy: {accuracy:.3f}")

Loss: 6.479 	 Accuracy: 0.946
Loss: 5.242 	 Accuracy: 0.946
Loss: 4.799 	 Accuracy: 0.950
Loss: 4.446 	 Accuracy: 0.949
Loss: 4.168 	 Accuracy: 0.948


## 5) What would you do to find the best model?

In this exercise, you built only the components necessary to train a text classifier with spaCy. What more could you do to optimize the model and get the best accuracy on the hold-out data?

Answer: There are various hyperparameters to work with here. The biggest one is the TextCategorizer architecture. You used the simplest model which trains faster but likely has worse performance than the CNN and ensemble models. You can adjust the dropout parameter to reduce overfitting. Also, you can save the model after each training pass through the data and use the model with the best validation accuracy.
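
The last suggestion, keeping the model with the best validation accuracy, could be sketched as follows. The function `fit_and_keep_best` and its callback parameters are hypothetical names introduced for illustration. The toy "model" is just a dictionary, and the accuracies are the ones printed by the training loop above. With a real spaCy pipeline, you might write a snapshot to disk with `nlp.to_disk` instead of using `copy.deepcopy`.

```python
import copy


def fit_and_keep_best(model, train_one_pass, evaluate_fn, n_iters):
    """Train for n_iters passes and return a copy of the best-scoring model."""
    best_accuracy = float("-inf")
    best_model = None
    for _ in range(n_iters):
        train_one_pass(model)
        accuracy = evaluate_fn(model)
        # Keep a snapshot only when validation accuracy improves
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_model = copy.deepcopy(model)
    return best_model, best_accuracy


# Toy demonstration: the "model" only counts how many passes it has seen
toy_model = {"passes": 0}
accuracies = iter([0.946, 0.946, 0.950, 0.949, 0.948])


def toy_train(model):
    model["passes"] += 1


def toy_evaluate(model):
    return next(accuracies)


best_model, best_accuracy = fit_and_keep_best(
    toy_model, toy_train, toy_evaluate, n_iters=5
)
print(best_model["passes"], best_accuracy)
```

In this toy run, the snapshot from the third pass is kept because later passes do not improve on its accuracy.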

## Next Up!

In the next lesson, you'll learn how to use spaCy to represent tokens as vectors, then use these vectors to train machine learning models.