# Natural Language Classification

In the first exercise, you will train a model that classifies Yelp reviews as having "good" or "bad" sentiment. You'll use spaCy to build and train the classifier.

The original data comes from https://www.kaggle.com/yelp-dataset/yelp-dataset#yelp_academic_dataset_review.json. That file is about 5 GB, so I sampled roughly 50,000 reviews from it. I converted the ratings (the "stars" field in the JSON data) to binary labels, 0 and 1. Reviews with 1 or 2 stars are considered "negative" and encoded with sentiment 0. Reviews with 4 or 5 stars are considered "positive" and encoded with sentiment 1. Reviews with 3 stars are "neutral" and excluded from the data.
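
The star-to-sentiment rule described above can be sketched in a few lines. This is only an illustration of the mapping, not the script that produced the sample; the function name `stars_to_sentiment` and the example ratings are hypothetical.

```python
def stars_to_sentiment(stars):
    """Map a star rating to a binary sentiment, or None for neutral reviews."""
    if stars <= 2:
        return 0  # 1 or 2 stars: negative
    if stars >= 4:
        return 1  # 4 or 5 stars: positive
    return None   # 3 stars: neutral, excluded from the data

# Hypothetical ratings, for illustration only
ratings = [1.0, 5.0, 3.0, 4.0, 2.0]
sentiments = [s for s in map(stars_to_sentiment, ratings) if s is not None]
print(sentiments)  # [0, 1, 1, 0]
```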

The resulting sample is around 75% positive, so you might need to resample to balance the labels.
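
One possible way to balance the labels is to downsample the majority class. The sketch below assumes each row is a dict with a `sentiment` key; the helper `downsample_majority` and the toy rows are hypothetical, and the exercise itself does not require this step.

```python
import random

def downsample_majority(rows, label_key="sentiment", seed=7):
    """Randomly drop rows from the larger class until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced

# Toy data with a 75/25 skew similar to the sample
rows = [{"text": f"review {i}", "sentiment": 1 if i % 4 else 0} for i in range(100)]
balanced = downsample_majority(rows)
print(len(balanced), sum(r["sentiment"] for r in balanced))  # 50 25
```

Downsampling throws away data, so it trades a balanced label distribution for a smaller training set.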

In [1]:
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex1 import *
print("\nSetup complete")


Setup complete


In [2]:
all_data = pd.read_csv('../input/yelp_ratings.csv', index_col=0)
all_data.head()

                                                text  stars  sentiment
0  Total bill for this horrible service? Over $8G...    1.0          0
1  I *adore* Travis at the Hard Rock's new Kelly ...    5.0          1
2  I have to say that this office really has it t...    5.0          1
3  Went in for a lunch. Steak sandwich was delici...    5.0          1
4  Today was my second out of three sessions I ha...    1.0          0


## 1) Exercise: Create the model

Here, create the text classifier model and add the labels `"NEGATIVE"` and `"POSITIVE"`. For the model, use the "bow" architecture. The other architectures will likely give better accuracy, but they train much more slowly.

In [3]:
import spacy

def create_model():

    # Create an empty model
    nlp = spacy.blank("en")

    # Create the TextCategorizer with exclusive classes and "bow" architecture
    textcat = nlp.create_pipe(
                "textcat",
                config={
                    "exclusive_classes": True,
                    "architecture": "bow"})
    nlp.add_pipe(textcat)

    # Add NEGATIVE and POSITIVE labels to text classifier
    textcat.add_label("NEGATIVE")
    textcat.add_label("POSITIVE")
    
    return nlp
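
For intuition about the "bow" (bag-of-words) choice, the sketch below shows the core idea with plain unigram counts: word order is discarded. This is not spaCy's implementation, which uses n-gram features; the function `bag_of_words` and the example sentences are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("good food not bad service")
b = bag_of_words("bad food not good service")
print(a == b)  # True: the two reviews look identical once order is dropped
```

This loss of ordering is one reason order-aware architectures can be more accurate, at the cost of slower training.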

## Loading the Data
Here I've included a function to load the data and split it into training and validation slices.

In [4]:
def load_data(csv_file, split=0.8):
    data = pd.read_csv(csv_file)
    
    # Shuffle data
    train_data = data.sample(frac=1, random_state=7)
    
    texts = train_data.text.values
    labels = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)}
              for y in train_data.sentiment.values]
    split = int(len(train_data) * split)
    
    return texts[:split], labels[:split], texts[split:], labels[split:]

In [5]:
train_texts, train_labels, val_texts, val_labels = load_data('../input/yelp_ratings.csv')
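
To make the label format concrete, here is a small sketch of what `load_data` builds for each review and how the split index is computed. The helpers `to_cats` and `split_at` are hypothetical stand-ins that mirror the logic above on a toy list.

```python
def to_cats(sentiments):
    """Turn 0/1 sentiments into the per-label dicts that load_data builds."""
    return [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in sentiments]

def split_at(items, fraction=0.8):
    """Split a sequence into a leading training slice and a trailing validation slice."""
    cut = int(len(items) * fraction)
    return items[:cut], items[cut:]

labels = to_cats([0, 1, 1, 0, 1])
train_part, val_part = split_at(labels)
print(train_part[0])                   # {'POSITIVE': False, 'NEGATIVE': True}
print(len(train_part), len(val_part))  # 4 1
```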

## 2) Exercise: Train Function

Implement a function `train` that updates a model with training data.

In [6]:
from spacy.util import minibatch
import random

def train(model, train_data, optimizer, batch_size=8):
    print("Training model!")
    
    losses = {}
    random.shuffle(train_data)
    batches = minibatch(train_data, size=batch_size)
    for batch in batches:
        # Need to get separate iterables for texts and labels
        texts, labels = zip(*batch)
        model.update(texts, labels, sgd=optimizer, drop=0.2, losses=losses)
    return losses
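
The batching and `zip(*batch)` steps can be tried without spaCy. The sketch below uses a hypothetical fixed-size helper, `simple_minibatch`, in place of `spacy.util.minibatch`, with made-up training pairs in the same `(text, {"cats": ...})` shape.

```python
def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

train_data = [
    ("great sushi", {"cats": {"POSITIVE": True, "NEGATIVE": False}}),
    ("soggy chicken", {"cats": {"POSITIVE": False, "NEGATIVE": True}}),
    ("friendly staff", {"cats": {"POSITIVE": True, "NEGATIVE": False}}),
]

for batch in simple_minibatch(train_data, size=2):
    texts, labels = zip(*batch)  # unzip (text, annotation) pairs into two tuples
    print(texts, len(labels))
# ('great sushi', 'soggy chicken') 2
# ('friendly staff',) 1
```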

In [9]:
nlp = create_model()
optimizer = nlp.begin_training()

train_data = list(zip(train_texts, [{"cats": labels} for labels in train_labels]))

losses = train(nlp, train_data, optimizer)
print(losses['textcat'])

Training model!
10.147658804418384


## 3) Exercise: Making Predictions

Implement a function `predict` that uses a model to predict the sentiment of text examples.

In [10]:
def predict(model, texts): 
    # Use the tokenizer to tokenize each input text example
    docs = [model.tokenizer(text) for text in texts]
    
    # Use textcat to get the scores for each doc
    textcat = model.get_pipe('textcat')
    scores, _ = textcat.predict(docs)
    
    # From the scores, find the class with the highest score/probability
    predicted_class = scores.argmax(axis=1)
    
    return predicted_class
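
The argmax step can be illustrated in plain Python. The scores below are made up, and the sketch assumes the score columns follow the order in which the labels were added in `create_model` (NEGATIVE first, then POSITIVE); `argmax_rows` is a hypothetical stand-in for `scores.argmax(axis=1)`.

```python
LABELS = ["NEGATIVE", "POSITIVE"]  # assumed column order, matching create_model

def argmax_rows(scores):
    """Return the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Hypothetical scores: one row per document, one column per label
scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
predicted = argmax_rows(scores)
print(predicted)                       # [1, 0, 1]
print([LABELS[i] for i in predicted])  # ['POSITIVE', 'NEGATIVE', 'POSITIVE']
```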

In [19]:
print(predict(nlp, val_texts[23:27]))
print(val_texts[23:27])

[1 0 1 0]
["Some of the best sushi I've ever had. Reasonable prices. Excellent service and drinks."
 "I would be remiss if I said nothing was good, because the egg rolls were good and the white rice, because it's white rice and that was pretty good. But everything else was kind of sub par. Maybe it's what I ordered. The quality of the chicken on the general tso's was really bad. You hope for crispy chicken with a spicy sauce, but this is soggy and the breading is gross. The lo mien hardly has any vegetables, and is also pretty bleh. Stay far away from the crab rangoons. \n\nI wanted to like this place, I had gone to the one in Vermilion and was even more disappointed. Seeing these reviews, and knowing it was under different management  I thought it would be better. The people in the restaurant love it, but it's gross."
 "One of my favorite Asian restaurants. The food is not typical and seemingly more authentic. There are items on the menu I would have to be a bit more adventurous to tr

By eye, it looks like your model is working well after going through the data just once. However, you need to calculate a metric for the model's performance on the hold-out validation data.

## 4) Exercise: Evaluating a Trained Model

Implement a function that evaluates a `TextCategorizer` model. This function, `evaluate`, takes a model along with texts and labels. It returns the accuracy of the model: the number of correct predictions divided by the total number of predictions.

In [20]:
def evaluate(model, texts, labels):
    """ Returns the accuracy of a TextCategorizer model. 
    
        Arguments
        ---------
        model: spaCy model with a TextCategorizer
        texts: Text samples, from load_data function
        labels: True labels, from load_data function
    
    """
    # Get predictions from textcat model
    predicted_class = predict(model, texts)
    
    # From labels, get the true class as a list of integers (POSITIVE -> 1, NEGATIVE -> 0)
    true_class = [int(each['POSITIVE']) for each in labels]
    
    # A boolean or int array indicating correct predictions
    correct_predictions = predicted_class == true_class
    
    # The accuracy, number of correct predictions divided by all predictions
    accuracy = correct_predictions.mean()
    
    return accuracy

In [21]:
accuracy = evaluate(nlp, val_texts, val_labels)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9413
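
Because roughly 75% of the sample is positive, it helps to compare this accuracy against a trivial baseline. The sketch below uses toy labels with that same skew, not the actual validation set, and shows that always predicting "positive" already scores 0.75.

```python
def accuracy(predicted, true):
    """Fraction of positions where the prediction matches the true class."""
    matches = sum(p == t for p, t in zip(predicted, true))
    return matches / len(true)

# Toy labels with the same 75% positive share as the sample
true_class = [1, 1, 1, 0] * 25
always_positive = [1] * len(true_class)
print(accuracy(always_positive, true_class))  # 0.75
```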


With the functions implemented, you can put it all into a training loop.

In [23]:
nlp = create_model()
optimizer = nlp.begin_training()
train_data = list(zip(train_texts, [{"cats": labels} for labels in train_labels]))
for i in range(5):
    losses = train(nlp, train_data, optimizer)
    accuracy = evaluate(nlp, val_texts, val_labels)
    print(f"Loss: {losses['textcat']:.3f} \t Accuracy: {accuracy:.3f}")

Training model!
Loss: 9.925 	 Accuracy: 0.942
Training model!
Loss: 6.106 	 Accuracy: 0.945
Training model!
Loss: 5.325 	 Accuracy: 0.949
Training model!
Loss: 4.870 	 Accuracy: 0.948
Training model!
Loss: 4.579 	 Accuracy: 0.947
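
In this run the loss keeps falling while validation accuracy peaks on the third pass. One common response, which is not part of the exercise, is to record the accuracy after each pass and keep the best one. A minimal sketch using the accuracies printed above:

```python
# Validation accuracies copied from the run above
accuracies = [0.942, 0.945, 0.949, 0.948, 0.947]

# Index of the highest accuracy; passes are numbered from 1 in the text
best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
print(best_index + 1, accuracies[best_index])  # 3 0.949
```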


In the next lesson, you'll learn how to use spaCy to represent tokens as vectors, then use these vectors to train machine learning models.