**[Natural Language Processing Home Page](https://www.kaggle.com/learn/natural-language-processing)**

---


# Natural Language Classification

You did a great such a great job for DeFalco's restaurant in the previous exercise that the chef has hired you for a new project.

The restaurant's menu includes an email address where visitors can give feedback about their food. 

The manager wants you to create a tool that automatically sends him all the negative reviews so he can fix them, while automatically sending all the positive reviews to the owner, so the manager can ask for a raise. 

You will first build a model to distinguish positive reviews from negative reviews using Yelp reviews because these reviews include a rating with each review. Your data consists of the text body of each review along with the star rating. Ratings with 1-2 stars count as "negative", and ratings with 4-5 stars are "positive". Ratings with 3 stars are "neutral" and have been dropped from the data.

Let's get started. First, run the next code cell.

In [2]:
# Enabling print for all lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


import pandas as pd
import spacy

In [1]:
# import pandas as pd

# Set up code checking
# !pip install -U -t /kaggle/working/ git+https://github.com/Kaggle/learntools.git
# from learntools.core import binder
# binder.bind(globals())
# from learntools.nlp.ex2 import *
# print("\nSetup complete")

Collecting git+https://github.com/Kaggle/learntools.git
  Cloning https://github.com/Kaggle/learntools.git to /tmp/pip-req-build-hjse080l
  Running command git clone -q https://github.com/Kaggle/learntools.git /tmp/pip-req-build-hjse080l
Building wheels for collected packages: learntools
  Building wheel for learntools (setup.py) ... [?25ldone
[?25h  Created wheel for learntools: filename=learntools-0.3.4-py3-none-any.whl size=205145 sha256=707ed4297b6e4b6f22cbb950cf2e0ebdde0b16d63b59083bc11b02fd4df822c2
  Stored in directory: /tmp/pip-ephem-wheel-cache-i0pitg2_/wheels/dd/d7/6b/0fc758f52767fd281d6dceded6757c6cb5bb90ccd2dbb1de9f
Successfully built learntools
Installing collected packages: learntools
Successfully installed learntools-0.3.4

Setup complete


# Step 1: Evaluate the Approach

Is there anything about this approach that concerns you? After you've thought about it, run the function below to see one point of view.

In [2]:
# Check your answer (Run this code cell to receive credit!)
# step_1.solution()

<IPython.core.display.Javascript object>

<span style="color:#33cc99">Solution:</span> Any way of setting up an ML problem will have multiple strengths and weaknesses.  So you may have thought of different issues than listed here.

The strength of this approach is that it allows you to distinguish positive email messages from negative emails even though you don't have historical emails that you have labeled as positive or negative.

The weakness of this approach is that emails may be systematically different from Yelp reviews in ways that make your model less accurate. For example, customers might generally use different words or slang in emails, and the model based on Yelp reviews won't have seen these words.

If you wanted to see how serious this issue is, you could compare word frequencies between the two sources. In practice, manually reading a few emails from each source may be enough to see if it's a serious issue. 

If you wanted to do something fancier, you could create a dataset that contains both Yelp reviews and emails and see whether a model can tell a reviews source from the text content. Ideally, you'd like to find that model didn't perform well, because it would mean your data sources are similar. That approach seems unnecessarily complex here.

# Step 2: Review Data and Create the model

Moving forward with your plan, you'll need to load the data. Here's some basic code to load data and split it into a training and validation set. Run this code.

In [3]:
def load_data(csv_file, split=0.9):
    data = pd.read_csv(csv_file)
    
    # Shuffle data
    train_data = data.sample(frac=1, random_state=7)
    
    texts = train_data.text.values
    labels = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)}
              for y in train_data.sentiment.values]
    split = int(len(train_data) * split)
    
    train_labels = [{"cats": labels} for labels in labels[:split]]
    val_labels = [{"cats": labels} for labels in labels[split:]]
    
    return texts[:split], train_labels, texts[split:], val_labels

train_texts, train_labels, val_texts, val_labels = load_data('yelp_ratings.csv')

You will use this training data to build a model. The code to build the model is the same as what you saw in the tutorial. So that is copied below for you.

But because your data is different, there are **two lines in the modeling code cell that you'll need to change.** Can you figure out what they are? 

First, run the cell below to look at a couple elements from your training data.

In [4]:
print('Texts from training data\n------')
print(train_texts[:2])
print('\nLabels from training data\n------')
print(train_labels[:2])

Texts from training data
------
["Some of the best sushi I've ever had....and I come from the East Coast.  Unreal toro, have some of it's available."
 "One of the best burgers I've ever had and very well priced. I got the tortilla burger and is was delicious especially with there tortilla soup!"]

Labels from training data
------
[{'cats': {'POSITIVE': True, 'NEGATIVE': False}}, {'cats': {'POSITIVE': True, 'NEGATIVE': False}}]


Now, having seen this data, find the two lines that need to be changed.

In [6]:
import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# Add labels to text classifier
textcat.add_label("NEGATIVE")
textcat.add_label("POSITIVE")

# Check your answer
# step_2.check()

1

1

In [7]:
# Lines below will give you a hint or solution code
#step_2.hint()
#step_2.solution()

# Step 3: Train Function

Implement a function `train` that updates a model with training data. Most of this is general data munging, which we've filled in for you. Just add the one line of code necessary to update your model.

In [10]:
from spacy.util import minibatch
import random

def train(model, train_data, optimizer):
    losses = {}
    random.seed(1)
    random.shuffle(train_data)
    optimizer = nlp.begin_training()
    
    batches = minibatch(train_data, size=8)
    for batch in batches:
        # train_data is a list of tuples [(text0, label0), (text1, label1), ...]
        # Split batch into texts and labels
        texts, labels = zip(*batch)
        
        # Update model with texts and labels
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
        
    return losses

# Check your answer
# step_3.check()

In [11]:
# Lines below will give you a hint or solution code
#step_3.hint()
#step_3.solution()

In [12]:
# Fix seed for reproducibility
spacy.util.fix_random_seed(1)
random.seed(1)

# This may take a while to run!
optimizer = nlp.begin_training()
train_data = list(zip(train_texts, train_labels))
losses = train(nlp, train_data, optimizer)
print(losses['textcat'])

8.701602560770688


We can try this slightly trained model on some example text and look at the probabilities assigned to each label.

In [13]:
text = "This tea cup was full of holes. Do not recommend."
doc = nlp(text)
print(doc.cats)

{'NEGATIVE': 0.7731374502182007, 'POSITIVE': 0.22686253488063812}


These probabilities look reasonable. Now you should turn them into an actual prediction.

# Step 4: Making Predictions

Implement a function `predict` that predicts the sentiment of text examples. 
- First, tokenize the texts using `nlp.tokenizer()`. 
- Then, pass those docs to the TextCategorizer which you can get from `nlp.get_pipe()`. 
- Use the `textcat.predict()` method to get scores for each document, then choose the class with the highest score (probability) as the predicted class.

In [14]:
def predict(nlp, texts): 
    # Use the model's tokenizer to tokenize each input text
    docs = [nlp.tokenizer(text) for text in texts]
    
    # Use textcat to get the scores for each doc
    textcat = nlp.get_pipe('textcat')
    scores, _ = textcat.predict(docs)
    scores
    
    # From the scores, find the class with the highest score/probability
    predicted_class = scores.argmax(axis=1)
#     [textcat.labels[label] for label in predicted_labels]
    
    return predicted_class

# Check your answer
# step_4.check()

In [15]:
# Lines below will give you a hint or solution code
#step_4.hint()
#step_4.solution()

In [16]:
texts = val_texts[34:38]
predictions = predict(nlp, texts)

for p, t in zip(predictions, texts):
    print(f"{textcat.labels[p]}: {t} \n")

POSITIVE: Came over and had their "Pick 2" lunch combo and chose their best selling 1/2 chicken sandwich with quinoa.  Both were tasty, the chicken salad is a bit creamy but was perfect with quinoa on the side.  This is a good lunch joint, casual and clean! 

POSITIVE: Went here last night and got oysters, fried okra, fries, and onion rings. I cannot complain. The portions were great and tasty!!! I will definitely be back for more. I cannot wait to try the crawfish boudin and soft shell crab. 

POSITIVE: This restaurant was fantastic! 
The concept of eating without vision was intriguing. The dinner was filled with laughs and good conversation. 

We were lead in a line to our table and each person to their seat. This was not just dark but you could not see something right in front of your face. 

The waiters/waitresses were all blind and allowed us to see how aware you need to be without the vision. 

Taking away one sense is said to increase your other senses so as taste and hearing wh

It looks like your model is working well after going through the data just once. However you need to calculate some metric for the model's performance on the hold-out validation data.

# Step 5: Evaluate The Model

Implement a function that evaluates a `TextCategorizer` model. This function `evaluate` takes a model along with texts and labels. It returns the accuracy of the model, which is the number of correct predictions divided by all predictions.

First, use the `predict` method you wrote earlier to get the predicted class for each text in `texts`. Then, find where the predicted labels match the true "gold-standard" labels and calculate the accuracy.

In [17]:
def evaluate(model, texts, labels):
    """ Returns the accuracy of a TextCategorizer model. 
    
        Arguments
        ---------
        model: ScaPy model with a TextCategorizer
        texts: Text samples, from load_data function
        labels: True labels, from load_data function
    
    """
    # Get predictions from textcat model (using your predict method)
    predicted_class = predict(model, texts)
    
    # From labels, get the true class as a list of integers (POSITIVE -> 1, NEGATIVE -> 0)
    true_class = [int(each['cats']['POSITIVE']) for each in labels]
    
    # A boolean or int array indicating correct predictions
    correct_predictions = predicted_class == true_class
    
    # The accuracy, number of correct predictions divided by all predictions
    accuracy = correct_predictions.mean()
    
    return accuracy

# step_5.check()

In [18]:
# Lines below will give you a hint or solution code
#step_5.hint()
# step_5.solution()

In [19]:
accuracy = evaluate(nlp, val_texts, val_labels)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9486


With the functions implemented, you can train and evaluate in a loop.

In [20]:
# This may take a while to run!
n_iters = 5
for i in range(n_iters):
    losses = train(nlp, train_data, optimizer)
    accuracy = evaluate(nlp, val_texts, val_labels)
    print(f"Loss: {losses['textcat']:.3f} \t Accuracy: {accuracy:.3f}")

Loss: 5.722 	 Accuracy: 0.944
Loss: 5.195 	 Accuracy: 0.947
Loss: 4.906 	 Accuracy: 0.942
Loss: 4.828 	 Accuracy: 0.942
Loss: 4.840 	 Accuracy: 0.946


# Step 6: Keep Improving

You've built the necessary components to train a text classifier with spaCy. What could you do further to optimize the model?

Run the next line to check your answer.

In [21]:
# Check your answer (Run this code cell to receive credit!)
# step_6.solution()

## Keep Going

The next step is a big one. See how you can **[represent tokens as vectors that describe their meaning](https://www.kaggle.com/matleonard/word-vectors)**, and plug those into your machine learning models.

---
**[Natural Language Processing Home Page](https://www.kaggle.com/learn/natural-language-processing)**





*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161466) to chat with other Learners.*

In [22]:
# Loading the spam data. Ham is the label for non-spam messages
spam = pd.read_csv('spam.csv')
spam.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Since the classes are either ham or spam, we set "exclusive_classes" to True. We've also configured it with the bag of words ("bow") architecture. spaCy provides a convolutional neural network architecture as well, but it's more complex than you need for now.

In [23]:
# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# We'll add the labels to the model
textcat.add_label("ham")
textcat.add_label("spam")

1

1

**Training a Text Categorizer Model**

- We'll convert the labels in the data to the form TextCategorizer requires. For each document, create a dictionary of boolean values for each class.
- For example, if a text is "ham", we need a dictionary {'ham': True, 'spam': False}. The model is looking for these labels inside another dictionary with the key 'cats'

In [24]:
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham','spam': label == 'spam'}} for label in spam['label']]

# We combine the texts and labels into a single list
train_data = list(zip(train_texts, train_labels))
train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

**Training the model**

- First, create an optimizer using `nlp.begin_training()`
- spaCy uses this optimizer to update the model
- In general it's more efficient to train models in small batches
- spaCy provides the `minibatch function` that returns a `generator yielding minibatches` for training
- Finally, the minibatches are split into texts and labels, then used with `nlp.update` to update the model's parameters

In [25]:
from spacy.util import minibatch

# Random seed
spacy.util.fix_random_seed(123)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)

# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
    
# This is just one training loop (or epoch) through the data. The model will typically need multiple epochs

In [26]:
# Use another loop for more epochs, and optionally re-shuffle the training data at the begining of each loop

import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

losses = {}
for epoch in range(20):
    
    random.shuffle(train_data)
    batches = minibatch(train_data, size=8)
    
    # Iterate through minibatches
    for batch in batches:
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 0.43189746531965056}
{'textcat': 0.6474976854055399}
{'textcat': 0.7842155156951605}
{'textcat': 0.8716684436774109}
{'textcat': 0.9280940163479938}
{'textcat': 0.9655780743921334}
{'textcat': 0.9939652660071991}
{'textcat': 1.0127977430832646}
{'textcat': 1.0275638589514353}
{'textcat': 1.037853221864903}
{'textcat': 1.0460189377707307}
{'textcat': 1.0524558944385225}
{'textcat': 1.0577781848589516}
{'textcat': 1.062299988672305}
{'textcat': 1.066389671523694}
{'textcat': 1.0699026189959342}
{'textcat': 1.072965325231249}
{'textcat': 1.0756358728320985}
{'textcat': 1.0779066511342426}
{'textcat': 1.0861850921354943}


**Making predictions**

In [27]:
# The input text needs to be tokenized with nlp.tokenizer. Then you pass the tokens to the predict method which returns scores.
# The scores are the probability the input text belongs to the classes

texts = ["Are you ready for the tea party????? It's gonna be wild","URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)
scores

# The scores are used to predict a single class or label by choosing the label with the highest probability.
# Get the index of the highest probability with scores.argmax, then use the index to get the label string from textcat.labels

# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
[textcat.labels[label] for label in predicted_labels]

array([[9.9999678e-01, 3.1974694e-06],
       [9.5907319e-04, 9.9904090e-01]], dtype=float32)

['ham', 'spam']

Evaluating the model is straightforward once you have the predictions. To measure the accuracy, calculate how many correct predictions are made on some test data, divided by the total number of predictions