# Classifying Voice Commands

For voice commands, Siri needs to be able to figure out *what* the speaker wants, and then *how* to accomplish that request.

<img src="https://www.cheatsheet.com/wp-content/uploads/2016/01/Siri-in-iOS-9-640x305.png" width=400>

Recall that we had a two-part goal:

a) predict the intent of the speaker of a voice command

and

b) extract the interesting named entities within the command.

It's now time to focus on part (b), known as **named entity recognition (NER)**, which complements the sentence-level intent classification system we started in the 2nd notebook!

<img src="https://miro.medium.com/max/2594/1*rq7FCkcq4sqUY9IgfsPEOg.png" width="500">

---

In this notebook we'll be:
*   Implementing a joint ML model for Intent Classification and NER



**IMPORTANT**: Since the BERT model we will be using in this notebook is so large, we need to do one step before continuing. Please go to the 'Runtime' tab, and click on 'Change Runtime Type'; then select **GPU** under the dropdown for Hardware accelerator.

In [None]:
#@title Run this code to get started
%tensorflow_version 2.x
%pip install -q transformers

import tensorflow as tf
from urllib.request import urlretrieve
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from transformers import BertTokenizer
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import SparseCategoricalAccuracy

model_name = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)

# SNIPS_DATA_BASE_URL = (
#     "https://github.com/ogrisel/slot_filling_and_intent_detection_of_SLU/blob/"
#     "master/data/snips/"
# )
# for filename in ["train", "valid", "test", "vocab.intent", "vocab.slot"]:
#     path = Path(filename)
#     if not path.exists():
#       print(f"Downloading {filename}...")
#       urlretrieve(SNIPS_DATA_BASE_URL + filename + "?raw=true", path)

!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Siri%20(Bert)%20Voice%20Commands/train'
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Siri%20(Bert)%20Voice%20Commands/valid'
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Siri%20(Bert)%20Voice%20Commands/test'
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Siri%20(Bert)%20Voice%20Commands/vocab.intent'
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Siri%20(Bert)%20Voice%20Commands/vocab.slot'



def parse_line(line):
    data, intent_label = line.split(" <=> ")
    items = data.split()
    words = [item.rsplit(":", 1)[0] for item in items]
    word_labels = [item.rsplit(":", 1)[1] for item in items]
    return {
        "intent_label": intent_label,
        "words": " ".join(words),
        "word_labels": " ".join(word_labels),
        "length": len(words),
    }

def encode_dataset(text_sequences):
    # Create token_ids array (initialized to all zeros), where
    # rows are a sequence and columns are encoding ids
    # of each token in given sequence.
    token_ids = np.zeros(shape=(len(text_sequences), max_token_len),
                         dtype=np.int32)

    for i, text_sequence in enumerate(text_sequences):
        encoded = tokenizer.encode(text_sequence)
        token_ids[i, 0:len(encoded)] = encoded

    attention_masks = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_masks": attention_masks}


train_lines = Path("train").read_text().strip().splitlines()
valid_lines = Path("valid").read_text().strip().splitlines()
test_lines = Path("test").read_text().strip().splitlines()

df_train = pd.DataFrame([parse_line(line) for line in train_lines])
df_valid = pd.DataFrame([parse_line(line) for line in valid_lines])
df_test = pd.DataFrame([parse_line(line) for line in test_lines])

max_token_len = 43

encoded_train = encode_dataset(df_train["words"])
encoded_valid = encode_dataset(df_valid["words"])
encoded_test = encode_dataset(df_test["words"])

intent_names = Path("vocab.intent").read_text().split()
intent_map = dict((label, idx) for idx, label in enumerate(intent_names))
intent_train = df_train["intent_label"].map(intent_map).values
intent_valid = df_valid["intent_label"].map(intent_map).values
intent_test = df_test["intent_label"].map(intent_map).values

base_bert_model = TFBertModel.from_pretrained("bert-base-cased")

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


## Intent Classification + NER

Let's now refine our Natural Language Understanding system by capturing the important named elements within each voice command.

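To do this, we will classify each word (actually each *token*) into one of the BIO labels: `B-` marks the beginning of a named entity, `I-` marks a token inside (continuing) that entity, and `O` marks a token outside any entity.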

```
      Book : O
         a : O
     table : O
       for : O
       two : B-party_size_number
        at : O
        Le : B-restaurant_name
         R : I-restaurant_name
     ##itz : I-restaurant_name
       for : O
    Friday : B-timeRange
     night : I-timeRange
         ! : O
```

Note: Since we have *word* level tags but BERT uses a tokenizer, we need to align the BIO labels with the BERT *tokens*.

In [None]:
# Inspect the word-level BIO labels of the training set.
df_train["word_labels"]

0        O B-entity_name I-entity_name I-entity_name O ...
1        O B-entity_name I-entity_name O B-playlist_own...
2        O O B-music_item O B-artist I-artist O O B-pla...
3        O O B-music_item O B-playlist_owner B-playlist...
4        O B-entity_name I-entity_name I-entity_name I-...
                               ...                        
13079    O O B-location_name I-location_name O B-movie_...
13080    O O O O B-movie_type I-movie_type B-spatial_re...
13081    O O B-movie_type I-movie_type B-spatial_relati...
13082    O B-movie_type I-movie_type O O O B-location_n...
13083      O B-object_type I-object_type O O B-timeRange O
Name: word_labels, Length: 13084, dtype: object

First, let's load the list of possible word token labels and augment it with an additional padding label so we can ignore special tokens:

In [None]:
# Build a map from slot name to a unique id.
slot_names = ["[PAD]"] + Path("vocab.slot").read_text().strip().splitlines()
slot_map = {}
for label in slot_names:
    slot_map[label] = len(slot_map)
slot_map

{'B-album': 1,
 'B-artist': 2,
 'B-best_rating': 3,
 'B-city': 4,
 'B-condition_description': 5,
 'B-condition_temperature': 6,
 'B-country': 7,
 'B-cuisine': 8,
 'B-current_location': 9,
 'B-entity_name': 10,
 'B-facility': 11,
 'B-genre': 12,
 'B-geographic_poi': 13,
 'B-location_name': 14,
 'B-movie_name': 15,
 'B-movie_type': 16,
 'B-music_item': 17,
 'B-object_location_type': 18,
 'B-object_name': 19,
 'B-object_part_of_series_type': 20,
 'B-object_select': 21,
 'B-object_type': 22,
 'B-party_size_description': 23,
 'B-party_size_number': 24,
 'B-playlist': 25,
 'B-playlist_owner': 26,
 'B-poi': 27,
 'B-rating_unit': 28,
 'B-rating_value': 29,
 'B-restaurant_name': 30,
 'B-restaurant_type': 31,
 'B-served_dish': 32,
 'B-service': 33,
 'B-sort': 34,
 'B-spatial_relation': 35,
 'B-state': 36,
 'B-timeRange': 37,
 'B-track': 38,
 'B-year': 39,
 'I-album': 40,
 'I-artist': 41,
 'I-city': 42,
 'I-country': 43,
 'I-cuisine': 44,
 'I-current_location': 45,
 'I-entity_name': 46,
 'I-facil

#### Word to Token Encodings

The following function generates *token-aligned* integer ids from the BIO *word-level* annotations. <img src="https://www.emoji.co.uk/files/twitter-emojis/symbols-twitter/11214-anticlockwise-downwards-and-upwards-open-circle-arrows.png" width=20>

If a certain word is broken down into multiple tokens by BERT, the word-level label is replicated for all of the word's tokens. The "B-" prefix is only used for the 1st of the tokens, while the rest of the tokens have the same label but with the "I-" prefix.
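To make the replication rule concrete, here is a minimal sketch that works on label strings rather than integer ids. The helper `expand_word_labels` and the `toy_tokenize` function are illustrative names, not part of the notebook: `toy_tokenize` is a hand-written stand-in for the BERT tokenizer, with splits hard-coded to mirror the ones shown in this notebook's output. Unlike the notebook's function, this sketch does not check that the `I-` variant exists in the label vocabulary.

```python
def expand_word_labels(words, word_labels, tokenize):
    """Replicate word-level BIO labels over the sub-word tokens of each word."""
    token_labels = []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        # The first piece keeps the word-level label unchanged.
        token_labels.append(label)
        # Remaining pieces continue the entity, so "B-x" becomes "I-x";
        # "I-x" and "O" are simply repeated.
        continuation = "I-" + label[2:] if label.startswith("B-") else label
        token_labels.extend([continuation] * (len(pieces) - 1))
    return token_labels


# Hypothetical stand-in for tokenizer.tokenize (assumption: hard-coded splits).
toy_pieces = {"Ritz": ["R", "##itz"], "Saclay": ["Sa", "##c", "##lay"]}


def toy_tokenize(word):
    return toy_pieces.get(word, [word])


words = ["at", "Le", "Ritz"]
labels = ["O", "B-restaurant_name", "I-restaurant_name"]
aligned = expand_word_labels(words, labels, toy_tokenize)
print(aligned)
```

"Ritz" is split into two tokens, so its `I-restaurant_name` label appears twice in the aligned list, giving one label per token.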



In [None]:
# Uses the slot_map of slot name to unique id, defined above, as well
# as the BERT tokenizer, to create a np array with each row corresponding
# to a given sequence, and the columns as the id of the given token slot labels.
def encode_token_labels(text_sequences, true_word_labels):
    encoded = np.zeros(shape=(len(text_sequences), max_token_len), dtype=np.int32)
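    for i, (text_sequence, word_labels) in enumerate(
            zip(text_sequences, true_word_labels)):
        encoded_labels = []
        for word, word_label in zip(text_sequence.split(), word_labels.split()):
            tokens = tokenizer.tokenize(word)
            # The first token of the word keeps the word-level label.
            encoded_labels.append(slot_map[word_label])
            # Any remaining tokens get the matching "I-" label; fall back to
            # the original label when no "I-" variant exists in slot_map.
            expand_label = word_label.replace("B-", "I-")
            if expand_label not in slot_map:
                expand_label = word_label
            encoded_labels.extend([slot_map[expand_label]] * (len(tokens) - 1))
        # Position 0 is left at 0 for [CLS], so the labels start at index 1.
        encoded[i, 1:len(encoded_labels) + 1] = encoded_labels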
    return encoded

#### Exercise 1

Let's encode the token labels for train, validation, & test:

In [None]:
# Encode the token labels and store in variables slot_train, slot_valid, slot_test.
### YOUR CODE HERE ###
slot_train = encode_token_labels(df_train["words"], df_train["word_labels"])
slot_valid = encode_token_labels(df_valid["words"], df_valid["word_labels"])
slot_test = encode_token_labels(df_test["words"], df_test["word_labels"])

Let's look at what the encoded token labels for the 1st training sequence are:

In [None]:
slot_train[0]

array([ 0, 72, 72, 10, 46, 46, 46, 72, 26, 25, 60, 60, 60, 60, 60, 60, 72,
       72,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0], dtype=int32)

In [None]:
# Encoded token labels for the 1st validation sequence.
slot_valid[0]

array([ 0,  2, 41, 41, 72, 72, 72, 26, 25, 72, 72,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0], dtype=int32)

In [None]:
# Encoded token labels for the 1st test sequence.
slot_test[0]

array([ 0, 72, 72, 72, 72, 72, 72, 72, 17, 72, 26, 25, 60, 60, 60, 72, 72,
       72,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0], dtype=int32)

Remember that the special tokens `[CLS]` and `[SEP]`, as well as all padded positions, have a 0 label.

#### Exercise 2

Let's finish filling out the code below to build our **joint sequence and token classification model** which will be trained on our encoded dataset with the NER labels <img src="https://www.dictionary.com/e/wp-content/uploads/2018/08/victory-hand.png" width=20>:


In [None]:
# Define the class for the model that will create predictions
# for the overall intent of a sequence, as well as the NER token labels.
class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                 dropout_prob=0.1):
        # This defines the model structure.
        super().__init__(name="joint_intent_slot")

        self.bert = base_bert_model

        # TODO: define the dropout, intent & slot classifier layers
        self.dropout = Dropout(dropout_prob)
        self.intent_classifier = Dense(intent_num_labels, name="intent_classifier")
        self.slot_classifier = Dense(slot_num_labels, name="slot_classifier")

    def call(self, inputs, **kwargs):
        # Extract features from the inputs using pre-trained BERT.
        # TODO: what does the bert model return?
        # With return_dict=False it returns a tuple: the per-token hidden
        # states, followed by the pooled representation of the full sequence.
        tokens_output, pooled_output = self.bert(inputs, **kwargs, return_dict=False)
        training = kwargs.get("training", False)

        # TODO: use the new layers to predict slot class (logits) for each
        # token position in input sequence (size: (batch_size, seq_len, slot_num_labels)).
        tokens_output = self.dropout(tokens_output, training=training)
        slot_logits = self.slot_classifier(tokens_output)

        # TODO: define a second classification head for the sequence-wise
        # predictions (size: (batch_size, intent_num_labels)).
        # (Hint: create pooled_output to get the intent_logits).
        # Remember that the 2nd output of the main BERT layer is size
        # (batch_size, output_dim) & gives a "pooled" representation for
        # the full sequence from the hidden state corresponding to [CLS].
        pooled_output = self.dropout(pooled_output, training=training)
        intent_logits = self.intent_classifier(pooled_output)

        return slot_logits, intent_logits

# TODO: create an instantiation of this model
joint_model = JointIntentAndSlotFillingModel(intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

In [None]:
# Define one classification loss for each output, in the order the model
# returns them (slot/NER logits first, then intent logits):
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]

joint_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
                    loss=losses,
                    metrics=[SparseCategoricalAccuracy('accuracy')],
                    run_eagerly=True)
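For intuition about what a sparse categorical cross-entropy with `from_logits=True` computes, here is a rough NumPy sketch: apply a log-softmax to the raw scores, then take the negative log-probability of the true class id. The function name and the toy numbers are made up for illustration, and the plain average over all positions is an assumption about the reduction rather than a copy of the Keras implementation.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean negative log-likelihood of integer labels given raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Subtract the per-row max before exponentiating, for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Select the log-probability assigned to the true class of each row.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    return float(-picked.mean())


# Two toy examples with three classes each (hypothetical values).
toy_logits = np.array([[2.0, 1.0, 0.1],
                       [0.0, 0.0, 0.0]])
toy_labels = np.array([0, 2])
toy_loss = sparse_categorical_crossentropy_from_logits(toy_logits, toy_labels)
print(round(toy_loss, 4))
```

"Sparse" here only means the labels are integer class ids (such as the entries of `slot_train`) rather than one-hot vectors.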

In [None]:
# Train the model.
history = joint_model.fit(
    encoded_train["input_ids"], (slot_train, intent_train),
    validation_data=(encoded_valid["input_ids"], (slot_valid, intent_valid)),
    epochs=1, batch_size=32)



We should be able to achieve 99% validation accuracy for both tasks (sequence & token predictions) after only training for one epoch!
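One caveat when reading the token-level number: padded positions carry label 0 and are easy to predict, so an accuracy computed over every position may look better than the accuracy on real tokens. The sketch below compares the two on a tiny hand-written example. The helper `masked_token_accuracy` and the toy arrays are illustrative assumptions, not part of the notebook, and the mask also drops `[CLS]` and `[SEP]` because they share label 0.

```python
import numpy as np


def masked_token_accuracy(true_ids, predicted_ids, pad_id=0):
    """Accuracy over positions whose true label is not the padding id."""
    true_ids = np.asarray(true_ids)
    predicted_ids = np.asarray(predicted_ids)
    mask = true_ids != pad_id
    if not mask.any():
        return float("nan")
    return float((predicted_ids[mask] == true_ids[mask]).mean())


# One toy sequence of length 8: three real tokens, five positions labeled 0.
toy_true = np.array([[0, 72, 10, 46, 0, 0, 0, 0]])
toy_pred = np.array([[0, 72, 10, 72, 0, 0, 0, 0]])  # one real token is wrong

plain_accuracy = float((toy_pred == toy_true).mean())
masked_accuracy = masked_token_accuracy(toy_true, toy_pred)
print(plain_accuracy, masked_accuracy)
```

In this toy case one mistake among three real tokens costs a third of the masked accuracy but only an eighth of the plain accuracy.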

#### Classification

<img src="https://orbitcarrot.com/wp-content/uploads/2014/12/predict.png" width=100>

Whew! All that's left is the following function, which uses our trained model to make a prediction on a single text sequence and displays both the sequence-wise and the token-wise class labels.


#### Exercise 3

Let's finish the following function to make predictions:

In [None]:
# Use the model we trained to get the intent & slot logits
# and print the actual string of the class corresponding to
# highest logit score for each token, and the sentence overall.
def show_predictions(text, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = joint_model(inputs)
    slot_logits, intent_logits = outputs  ### YOUR CODE HERE ###
    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, 1:-1]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]
    print("## Intent:", intent_names[intent_id])  ### YOUR CODE HERE ###
    print("## Slots:")
    for token, slot_id in zip(tokenizer.tokenize(text), slot_ids):
        print(f"{token:>10} : {slot_names[slot_id]}")
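The `argmax(axis=-1)[0, 1:-1]` step can be illustrated without the trained model. The sketch below uses a hand-written logits array of shape `(1, 5, 3)` for a hypothetical five-position input (`[CLS]`, three word tokens, `[SEP]`) and a made-up three-label vocabulary; all values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical label vocabulary: index 0 plays the role of the padding label.
toy_slot_names = ["[PAD]", "O", "B-city"]

# Logits of shape (batch_size=1, seq_len=5, num_labels=3).
toy_slot_logits = np.array([[
    [4.0, 0.1, 0.2],  # [CLS]
    [0.1, 3.0, 0.5],  # first word token
    [0.2, 2.5, 0.1],  # second word token
    [0.1, 0.3, 3.5],  # third word token
    [5.0, 0.2, 0.1],  # [SEP]
]])

# Pick the highest-scoring label at each position, take the only sequence in
# the batch, and drop the first and last positions ([CLS] and [SEP]).
toy_slot_ids = toy_slot_logits.argmax(axis=-1)[0, 1:-1]
toy_labels = [toy_slot_names[i] for i in toy_slot_ids]
print(toy_labels)
```

After slicing, the remaining ids line up one-to-one with the output of `tokenizer.tokenize(text)`, which is why `show_predictions` can simply `zip` the two.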

Let's see how our classification function works on some examples!

In [None]:
show_predictions("Book a table for two at Le Ritz for Friday night!", intent_names, slot_names)

## Intent: BookRestaurant
## Slots:
      Book : O
         a : O
     table : O
       for : O
       two : B-party_size_number
        at : O
        Le : B-restaurant_name
         R : I-restaurant_name
     ##itz : I-restaurant_name
       for : O
    Friday : B-timeRange
     night : O
         ! : O


In [None]:
show_predictions("Will it snow tomorrow in Saclay?", intent_names, slot_names)

## Intent: GetWeather
## Slots:
      Will : O
        it : O
      snow : B-condition_description
  tomorrow : B-timeRange
        in : O
        Sa : B-city
       ##c : I-city
     ##lay : I-city
         ? : O


In [None]:
show_predictions("I would like to listen to Anima by Thom Yorke.", intent_names, slot_names)

## Intent: PlayMusic
## Slots:
         I : O
     would : O
      like : O
        to : O
    listen : O
        to : O
        An : B-artist
     ##ima : I-album
        by : O
      Thom : B-artist
      York : I-artist
       ##e : I-artist
         . : O


### Turning Predictions into Structured Knowledge

A system like Siri shouldn't have to handle any excess information; ultimately, it needs to transform a speaker's verbal command into a clean, structured format.

For completeness, the following functions turn the predicted BIO token ids and intent id into a simple structured representation:

In [None]:
def decode_predictions(text, intent_names, slot_names,
                       intent_id, slot_ids):
    info = {"intent": intent_names[intent_id]}
    collected_slots = {}
    active_slot_words = []
    active_slot_name = None
    for word in text.split():
        tokens = tokenizer.tokenize(word)
        current_word_slot_ids = slot_ids[:len(tokens)]
        slot_ids = slot_ids[len(tokens):]
        current_word_slot_name = slot_names[current_word_slot_ids[0]]
        if current_word_slot_name == "O":
            if active_slot_name:
                collected_slots[active_slot_name] = " ".join(active_slot_words)
                active_slot_words = []
                active_slot_name = None
        else:
            # Naive BIO handling: treat B- and I- the same...
            new_slot_name = current_word_slot_name[2:]
            if active_slot_name is None:
                active_slot_words.append(word)
                active_slot_name = new_slot_name
            elif new_slot_name == active_slot_name:
                active_slot_words.append(word)
            else:
                collected_slots[active_slot_name] = " ".join(active_slot_words)
                active_slot_words = [word]
                active_slot_name = new_slot_name
    if active_slot_name:
        collected_slots[active_slot_name] = " ".join(active_slot_words)
    info["slots"] = collected_slots
    return info
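Because the function above stores slots in a dictionary keyed by slot name, a second entity of the same type overwrites the first. As a possible alternative (a sketch, not the notebook's method), the helper below groups word-level BIO labels into a list of `(slot_name, text)` pairs and starts a new span at every `B-` tag. The name `group_bio_spans` and the hand-written labels are assumptions for illustration; the labels are not model output.

```python
def group_bio_spans(words, labels):
    """Group word-level BIO labels into a list of (slot_name, text) spans."""
    spans = []
    current_name, current_words = None, []

    def close_span():
        # Store the open span, if there is one.
        if current_name is not None:
            spans.append((current_name, " ".join(current_words)))

    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A "B-" tag always starts a new span.
            close_span()
            current_name, current_words = label[2:], [word]
        elif label.startswith("I-") and current_name == label[2:]:
            # An "I-" tag of the same type continues the open span.
            current_words.append(word)
        else:
            # "O", or an "I-" tag that continues nothing: close the open
            # span and skip this word.
            close_span()
            current_name, current_words = None, []
    close_span()
    return spans


toy_words = "Book a seat at Pepper House for tomorrow at 5:45pm".split()
toy_labels = ["O", "O", "O", "O", "B-restaurant_name", "I-restaurant_name",
              "O", "B-timeRange", "O", "B-timeRange"]
toy_spans = group_bio_spans(toy_words, toy_labels)
print(toy_spans)
```

Returning a list keeps both `timeRange` spans in this toy input, where a dictionary keyed by slot name would keep only the last one.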

In [None]:
def nlu(text, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = joint_model(inputs)
    slot_logits, intent_logits = outputs
    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, 1:-1]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]

    return decode_predictions(text, intent_names, slot_names, intent_id, slot_ids)

Let's test this on the same examples:

In [None]:
nlu("Book a table for two at Le Ritz for Friday night", intent_names, slot_names)

{'intent': 'BookRestaurant',
 'slots': {'party_size_number': 'two',
  'restaurant_name': 'Le Ritz',
  'timeRange': 'Friday'}}

In [None]:
nlu("Will it snow tomorrow in Saclay", intent_names, slot_names)

{'intent': 'GetWeather',
 'slots': {'city': 'Saclay',
  'condition_description': 'snow',
  'timeRange': 'tomorrow'}}

In [None]:
nlu("I would like to listen to Anima by Thom Yorke", intent_names, slot_names)

{'intent': 'PlayMusic', 'slots': {'artist': 'Thom Yorke', 'service': 'Anima'}}

In [None]:
nlu("Book a seat at Pepper House for tomorrow at 5:45pm", intent_names, slot_names)

{'intent': 'BookRestaurant',
 'slots': {'restaurant_name': 'Pepper House', 'timeRange': '5:45pm'}}


**Discuss**:

We focused on the NLU/NLP aspect of turning a string of words in a verbal command into a simple representation for Siri to utilize.

What do you think Siri would actually do next with those structured predictions?

## Limitations

1. **Language**

BERT is pretrained primarily on English content. Therefore, it can only be expected to extract meaningful features from English text.

Note that there are alternative pretrained models that use a mix of different languages (e.g. [XLM](https://github.com/facebookresearch/XLM/)) and certain models that have been trained on other languages entirely. For instance [CamemBERT](https://camembert-model.fr/) is pretrained on French text. Both kinds of models are available in the transformers package:

https://github.com/huggingface/transformers#model-architectures

The public SNIPS dataset we used is for fine-tuning in English only. To build a model for another language we would need to collect and annotate a similar corpus (body of text) with diverse, representative samples.


2. **Biases in the Pre-Trained Model**

The original data used to pre-train BERT was collected from the Internet and covers a wide variety of content, including offensive and hateful speech.

While our voice command understanding system is unlikely to be affected by those biases, they could be a serious problem for other kinds of applications.

It is therefore strongly recommended to spend time auditing the biases embedded in pre-trained models before deploying any system that derives from them.

3. **Computational Resources**

The original BERT model has many parameters, which take up a lot of memory. It is also very computationally intensive and usually requires powerful [GPUs](https://en.wikipedia.org/wiki/Graphics_processing_unit) or [TPUs](https://en.wikipedia.org/wiki/Tensor_processing_unit) to process data at a *reasonable* speed (both for training and for making predictions).

Designing alternative architectures with fewer parameters or more efficient training and prediction methods is still an area of active research.

Depending on the problem, simpler architectures based on convolutional neural networks (CNNs) and LSTMs might have a better speed / accuracy trade-off.