# Joint Intent Classification and Slot Filling with BERT
This notebook is based on the paper __BERT for Joint Intent Classification and Slot Filling__ by Chen et al. (2019), https://arxiv.org/abs/1902.10909, but uses a different dataset created for a class project.

Ideas were also taken from https://github.com/monologg/JointBERT, which is a PyTorch implementation of the paper with the original dataset.


## Install transformers

In [None]:
!pip install transformers



## Download data

In [None]:
!wget https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/train.json

--2021-01-29 16:53:59--  https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/train.json
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/train.json [following]
--2021-01-29 16:53:59--  https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/train.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5055766 (4.8M) [text/plain]
Saving to: ‘train.json.3’


2021-01-29 16:54:00 (95.0 MB/s) - ‘train.json.3’ saved [5055766/5055766]



In [None]:
!wget https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/dev.json

--2021-01-29 16:54:00--  https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/dev.json
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/dev.json [following]
--2021-01-29 16:54:00--  https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/dev.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 248459 (243K) [text/plain]
Saving to: ‘dev.json.3’


2021-01-29 16:54:00 (42.2 MB/s) - ‘dev.json.3’ saved [248459/248459]



## Read data from json files

Each record in the JSON files has the following format:
````json5
{
  "text": "",
  "positions": {},
  "slots": {},
  "intent": ""
}
````

We will use `text` as the input and `slots` and `intent` as labels.

In [None]:
import json
import os

class RawData(object):
    def __init__(self, id, intent, positions, slots, text):
        self.id = id
        self.intent = intent
        self.positions = positions
        self.slots = slots
        self.text = text

    def __repr__(self):
        return str(json.dumps(self.__dict__, indent=2))


def read_train_json_file(filename):
    """Read JSON from the data file and return a list of RawData objects."""
    if os.path.exists(filename):
        intents = []

        with open(filename, "r", encoding="utf-8") as json_file:
            data = json.load(json_file)

            for k in data.keys():
                intent = data[k]["intent"]
                positions = data[k]["positions"]
                slots = data[k]["slots"]
                text = data[k]["text"]

                temp = RawData(k, intent, positions, slots, text)
                intents.append(temp)

        return intents
    else:
        raise FileNotFoundError("No file found with that path!")

# read from json file
train_data = read_train_json_file("train.json")

In [None]:
example = train_data[0]
example

{
  "id": "0",
  "intent": "AddToPlaylist",
  "positions": {
    "music_item": [
      6,
      9
    ],
    "playlist_owner": [
      14,
      15
    ],
    "playlist": [
      17,
      32
    ]
  },
  "slots": {
    "music_item": "tune",
    "playlist_owner": "my",
    "playlist": "elrow Guest List"
  },
  "text": "Add a tune to my elrow Guest List"
}
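
The `positions` entries in the record above look like inclusive character offsets into `text`. A small check, using only the values printed above, suggests that slicing with `end + 1` recovers each slot value. The helper name `slot_values_from_positions` is made up for this illustration.

```python
# Record copied from the example output above.
record = {
    "text": "Add a tune to my elrow Guest List",
    "positions": {
        "music_item": [6, 9],
        "playlist_owner": [14, 15],
        "playlist": [17, 32],
    },
    "slots": {
        "music_item": "tune",
        "playlist_owner": "my",
        "playlist": "elrow Guest List",
    },
}


def slot_values_from_positions(text, positions):
    """Slice each slot value out of text, treating the end offset as inclusive."""
    return {name: text[start:end + 1] for name, (start, end) in positions.items()}


recovered = slot_values_from_positions(record["text"], record["positions"])
print(recovered == record["slots"])  # prints True
```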

## Load Tokenizer from transformers

We will use the pretrained BERT model `bert-base-cased` for both the tokenizer and the classifier.

In [None]:
import tensorflow as tf
from transformers import AutoTokenizer

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Encode texts from the dataset

We have to encode the texts using the tokenizer to create tensors for training the classifier.

In [None]:
# https://huggingface.co/transformers/preprocessing.html

def encode_texts(tokenizer, texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="tf")

texts = [d.text for d in train_data]
tds = encode_texts(tokenizer, texts)
tds.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
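
To make `padding=True` concrete: the tokenizer pads every sequence in the batch to the length of the longest one and marks real tokens with 1 in `attention_mask`. The sketch below mimics that on hand-written id lists. `pad_batch` and the ids are illustrative only, not the tokenizer's implementation or its actual output.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each id list to the longest length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ids, for illustration only.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```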

In [None]:
encoded_texts = tds

## Encode labels
### Intents

In [None]:

intents = [d.intent for d in train_data]
intent_names = list(set(intents))
intent_names

['AddToPlaylist',
 'BookRestaurant',
 'GetWeather',
 'RateBook',
 'PlayMusic',
 'SearchScreeningEvent',
 'SearchCreativeWork']

In [None]:
intent_map = dict() # intent -> index
for idx, ui in enumerate(intent_names):
    intent_map[ui] = idx
intent_map

{'AddToPlaylist': 0,
 'BookRestaurant': 1,
 'GetWeather': 2,
 'PlayMusic': 4,
 'RateBook': 3,
 'SearchCreativeWork': 6,
 'SearchScreeningEvent': 5}
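
One caveat about `list(set(intents))`: the iteration order of a set of strings can differ between Python sessions, so the indices above are not guaranteed to be the same after a restart. If the mapping has to be reused later (for example with saved model weights), sorting the names first is one way to make it reproducible. A minimal sketch, with a hypothetical helper `build_label_maps` and a made-up sample list:

```python
def build_label_maps(labels):
    """Return (name -> index, index -> name) with a stable, sorted order."""
    names = sorted(set(labels))
    name_to_idx = {name: idx for idx, name in enumerate(names)}
    idx_to_name = {idx: name for name, idx in name_to_idx.items()}
    return name_to_idx, idx_to_name


sample_intents = ["PlayMusic", "AddToPlaylist", "GetWeather", "PlayMusic"]
intent_to_idx, idx_to_intent = build_label_maps(sample_intents)
print(intent_to_idx)  # {'AddToPlaylist': 0, 'GetWeather': 1, 'PlayMusic': 2}
```

The inverse map is convenient at inference time, when a predicted index has to be turned back into a name.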

In [None]:
# map to train_data values
def encode_intents(intents, intent_map):
    encoded = []
    for i in intents:
        encoded.append(intent_map[i])
    # convert to tf tensor
    return tf.convert_to_tensor(encoded, dtype="int32")

encoded_intents = encode_intents(intents, intent_map)

### Slots

To pad all the texts to the same length, the tokenizer adds special padding tokens. To handle those, we add `<PAD>` to `slot_names` at index 0. Any other symbol would work as well.

In [None]:
# encode slots
slot_names = set()
for td in train_data:
    slots = td.slots
    for slot in slots:
        slot_names.add(slot)
slot_names = list(slot_names)
slot_names.insert(0, "<PAD>")
slot_names

['<PAD>',
 'spatial_relation',
 'music_item',
 'object_name',
 'geographic_poi',
 'service',
 'artist',
 'playlist',
 'object_part_of_series_type',
 'playlist_owner',
 'sort',
 'cuisine',
 'state',
 'year',
 'rating_unit',
 'location_name',
 'restaurant_name',
 'object_type',
 'country',
 'object_select',
 'timeRange',
 'album',
 'entity_name',
 'movie_type',
 'served_dish',
 'city',
 'poi',
 'movie_name',
 'party_size_number',
 'genre',
 'party_size_description',
 'restaurant_type',
 'object_location_type',
 'best_rating',
 'track',
 'condition_description',
 'rating_value',
 'facility',
 'current_location',
 'condition_temperature']

In [None]:
slot_map = dict() # slot -> index
for idx, us in enumerate(slot_names):
    slot_map[us] = idx
slot_map

{'<PAD>': 0,
 'album': 21,
 'artist': 6,
 'best_rating': 33,
 'city': 25,
 'condition_description': 35,
 'condition_temperature': 39,
 'country': 18,
 'cuisine': 11,
 'current_location': 38,
 'entity_name': 22,
 'facility': 37,
 'genre': 29,
 'geographic_poi': 4,
 'location_name': 15,
 'movie_name': 27,
 'movie_type': 23,
 'music_item': 2,
 'object_location_type': 32,
 'object_name': 3,
 'object_part_of_series_type': 8,
 'object_select': 19,
 'object_type': 17,
 'party_size_description': 30,
 'party_size_number': 28,
 'playlist': 7,
 'playlist_owner': 9,
 'poi': 26,
 'rating_unit': 14,
 'rating_value': 36,
 'restaurant_name': 16,
 'restaurant_type': 31,
 'served_dish': 24,
 'service': 5,
 'sort': 10,
 'spatial_relation': 1,
 'state': 12,
 'timeRange': 20,
 'track': 34,
 'year': 13}

In [None]:
# gets slot name from its values
def get_slot_from_word(word, slot_dict):
    for slot_label,value in slot_dict.items():
        if word in value.split():
            return slot_label
    return None

print(train_data[0].text)
print(train_data[0].slots)
print("slot_name for my is : ", get_slot_from_word("my", train_data[0].slots))

Add a tune to my elrow Guest List
{'music_item': 'tune', 'playlist_owner': 'my', 'playlist': 'elrow Guest List'}
slot_name for my is :  playlist_owner
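
Note that `get_slot_from_word` matches by word text only, so a word that occurs both inside and outside a slot value receives the slot label at every occurrence. The `positions` field could be used instead to label words by character span. The following is a sketch of that alternative, not what the notebook does; `label_words_by_position` is a hypothetical helper, and it assumes inclusive end offsets as in the example record.

```python
import re


def label_words_by_position(text, positions):
    """Pair each whitespace-separated word with the slot whose span contains it."""
    labelled = []
    for match in re.finditer(r"\S+", text):
        label = None
        for slot_name, (start, end) in positions.items():
            # end is inclusive, so the word must lie within [start, end]
            if match.start() >= start and match.end() - 1 <= end:
                label = slot_name
                break
        labelled.append((match.group(), label))
    return labelled


text = "Add a tune to my elrow Guest List"
positions = {"music_item": [6, 9], "playlist_owner": [14, 15], "playlist": [17, 32]}
labelled_words = label_words_by_position(text, positions)
for word, label in labelled_words:
    print(word, label)
```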


In [None]:
import numpy as np

# find the max encoded text length
# tokenizer pads all texts to same length anyway so
# just get the length of the first one's input_ids
max_len = len(encoded_texts["input_ids"][0])

def encode_slots(all_slots, all_texts,
                 tokenizer, slot_map, max_len=max_len):
    encoded_slots = np.zeros(shape=(len(all_texts), max_len), dtype=np.int32)
    
    for idx, text in enumerate(all_texts):
        enc = [] # for this idx, to be added at the end to encoded_slots
        
        # slot names for this idx
        slot_names = all_slots[idx]
        
        # raw word tokens
        # not using bert for this block because bert uses
        # a wordpiece tokenizer which will make 
        # the slot label to word mapping
        # difficult
        raw_tokens = text.split()

        # words or slot_values associated with a certain
        # slot_name are contained in the values of the
        # dict slots_names
        # now this becomes a two way lookup
        # first we check if a word belongs to any
        # slot label or not and then we add the value from
        # slot map to encoded for that word
        for rt in raw_tokens:
            # use bert tokenizer
            # to get wordpiece tokens
            bert_tokens = tokenizer.tokenize(rt)
            
            # find the slot name for a token
            rt_slot_name = get_slot_from_word(rt, slot_names)
            if rt_slot_name is not None:
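                # fill with the slot_map value for all BERT tokens for rt
                enc.append(slot_map[rt_slot_name])
                enc.extend([slot_map[rt_slot_name]] * (len(bert_tokens) - 1))

            else:
                # rt is not associated with any slot name, so every
                # wordpiece of rt gets label 0; this keeps enc aligned
                # with the BERT tokens when rt splits into several pieces
                enc.extend([0] * len(bert_tokens))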

        
        # now add to encoded_slots
        # ignore the first and the last elements
        # in encoded text as they're special chars
        encoded_slots[idx, 1:len(enc)+1] = enc
    
    return encoded_slots
    

In [None]:
all_slots = [td.slots for td in train_data]
all_texts = [td.text for td in train_data]

In [None]:
encoded_slots = encode_slots(all_slots, all_texts, tokenizer, slot_map)

In [None]:
encoded_slots[0]

array([0, 0, 0, 2, 0, 9, 7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)
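
The start of this row can be reproduced by hand. The sketch below uses a made-up word-to-wordpiece table in place of the BERT tokenizer; the split of `elrow` into two pieces is an assumption inferred from the four `7`s in the output, and `align_labels` is a hypothetical helper.

```python
# Hypothetical wordpiece splits; only "elrow" is assumed to be split.
WORDPIECES = {"elrow": ["el", "##row"]}

slot_map = {"<PAD>": 0, "music_item": 2, "playlist": 7, "playlist_owner": 9}
word_labels = [
    ("Add", None), ("a", None), ("tune", "music_item"), ("to", None),
    ("my", "playlist_owner"), ("elrow", "playlist"),
    ("Guest", "playlist"), ("List", "playlist"),
]


def align_labels(word_labels, slot_map, max_len):
    """Repeat each word's label once per wordpiece, leaving index 0 for [CLS]."""
    row = [0] * max_len
    pos = 1  # position 0 is reserved for [CLS]
    for word, label in word_labels:
        pieces = WORDPIECES.get(word, [word])
        label_id = slot_map[label] if label is not None else 0
        for _ in pieces:
            row[pos] = label_id
            pos += 1
    return row


aligned = align_labels(word_labels, slot_map, max_len=12)
print(aligned)  # [0, 0, 0, 2, 0, 9, 7, 7, 7, 7, 0, 0]
```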

## Classifier Model

### Definition

In [None]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                 model_name=model_name, dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        self.intent_classifier = Dense(intent_num_labels,
                                       name="intent_classifier")
        self.slot_classifier = Dense(slot_num_labels,
                                     name="slot_classifier")

    def call(self, inputs, **kwargs):
        # two outputs from BERT
        trained_bert = self.bert(inputs, **kwargs)
        pooled_output = trained_bert.pooler_output
        sequence_output = trained_bert.last_hidden_state
        
        # sequence_output will be used for slot_filling / classification
        sequence_output = self.dropout(sequence_output,
                                       training=kwargs.get("training", False))
        slot_logits = self.slot_classifier(sequence_output)

        # pooled_output for intent classification
        pooled_output = self.dropout(pooled_output,
                                     training=kwargs.get("training", False))
        intent_logits = self.intent_classifier(pooled_output)

        return slot_logits, intent_logits

In [None]:
joint_model = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### Hyperparams, Optimizer and Loss function

In [None]:
opt = Adam(learning_rate=3e-5, epsilon=1e-08)

# two outputs, one for slots, another for intents
# we have to fine tune for both
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]

metrics = [SparseCategoricalAccuracy("accuracy")]
# compile model
joint_model.compile(optimizer=opt, loss=losses, metrics=metrics)
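
When `compile` receives a list of losses for a multi-output model, Keras applies one loss per output and, with default loss weights, minimises their sum. The pure-Python sketch below works through that arithmetic on made-up logits. It is only meant to illustrate the computation, not to replace the Keras losses. Because `<PAD>` is an ordinary class here, padded positions contribute to the slot loss as well.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Made-up logits: one intent vector, and one vector per token for slots.
intent_logits, intent_label = [2.0, 0.5, -1.0], 0
slot_logits = [[3.0, 0.0], [0.0, 3.0], [3.0, 0.0]]
slot_labels = [0, 1, 0]  # the last position stands in for padding

intent_loss = sparse_ce_from_logits(intent_logits, intent_label)
slot_loss = sum(
    sparse_ce_from_logits(z, y) for z, y in zip(slot_logits, slot_labels)
) / len(slot_labels)  # mean over all token positions, padding included

total_loss = slot_loss + intent_loss
print(round(slot_loss, 4), round(intent_loss, 4), round(total_loss, 4))
```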

### Train

In [None]:
x = {"input_ids": encoded_texts["input_ids"], "token_type_ids": encoded_texts["token_type_ids"],  "attention_mask": encoded_texts["attention_mask"]}

history = joint_model.fit(
    x, (encoded_slots, encoded_intents), epochs=2, batch_size=32, shuffle=True)

Epoch 1/2
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f18c3b86110> is not a module, class, method, function, traceback, frame, or code object
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f18c3b86110> is not a module, class, method, function, traceback, frame, or code object


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).


Cause: while/else statement not yet supported


The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.


Cause: while/else statement not yet supported


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.


Epoch 2/2


## Inference

In [None]:
def nlu(text, tokenizer, model, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits, intent_logits = outputs

    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, :]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]

    info = {"intent": intent_names[intent_id], "slots": {}}

    out_dict = {}
    # get all slot names and add to out_dict as keys
    predicted_slots = set([slot_names[s] for s in slot_ids if s != 0])
    for ps in predicted_slots:
      out_dict[ps] = []

    # recover the tokens from the encoded ids, including the special
    # tokens ([CLS], [SEP]), so that they line up one-to-one with
    # slot_ids, which were predicted for the full encoded sequence
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    for idx, (token, slot_id) in enumerate(zip(tokens, slot_ids)):
        # add all to out_dict
        slot_name = slot_names[slot_id]

        # skip padding labels and the special tokens themselves
        if slot_name == "<PAD>" or token in tokenizer.all_special_tokens:
            continue

        # collect tokens
        collected_tokens = [token]

        # see if it starts with ##
        # then it belongs to the previous token
        if token.startswith("##"):
          # check if the token already exists or not
          if tokens[idx - 1] not in out_dict[slot_name]:
            collected_tokens.insert(0, tokens[idx - 1])

        # add collected tokens to slots
        out_dict[slot_name].extend(collected_tokens)

    # process out_dict
    for slot_name in out_dict:
        tokens = out_dict[slot_name]
        slot_value = tokenizer.convert_tokens_to_string(tokens)

        info["slots"][slot_name] = slot_value.strip()

    return info


In [None]:
nlu("add Madchild to Electro Latino", tokenizer, joint_model, 
    intent_names, slot_names)

{'intent': 'AddToPlaylist',
 'slots': {'entity_name': 'Madchild', 'playlist': 'Electro Latino'}}

In [None]:
nlu("add Brian May to my Reggae Infusions list", tokenizer, joint_model, 
    intent_names, slot_names)

{'intent': 'AddToPlaylist',
 'slots': {'artist': 'Brian May',
  'playlist': 'Reggae Infusions',
  'playlist_owner': 'my'}}
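
The token-merging step in `nlu` can be looked at in isolation. Below is a simplified sketch (not the notebook's exact logic) that groups wordpieces by predicted slot and glues `##` continuations onto the previous piece. The tokens, including the split of `Madchild`, and the slot ids are hand-written stand-ins for tokenizer and model output, and `group_slot_tokens` is a hypothetical helper.

```python
def group_slot_tokens(tokens, slot_ids, slot_names,
                      special=("[CLS]", "[SEP]", "[PAD]")):
    """Collect the words predicted for each slot, merging '##' wordpieces."""
    words_by_slot = {}
    prev_slot = None  # slot of the previous token, None if it had no slot
    for token, slot_id in zip(tokens, slot_ids):
        slot = slot_names[slot_id]
        if token in special or slot == "<PAD>":
            prev_slot = None
            continue
        words = words_by_slot.setdefault(slot, [])
        if token.startswith("##") and words and prev_slot == slot:
            words[-1] += token[2:]  # continuation of the previous word
        else:
            words.append(token.lstrip("#"))
        prev_slot = slot
    return {slot: " ".join(words) for slot, words in words_by_slot.items()}


slot_names_demo = ["<PAD>", "entity_name", "playlist"]
tokens_demo = ["[CLS]", "add", "Mad", "##child", "to", "Electro", "Latino", "[SEP]"]
slot_ids_demo = [0, 0, 1, 1, 0, 2, 2, 0]
grouped = group_slot_tokens(tokens_demo, slot_ids_demo, slot_names_demo)
print(grouped)  # {'entity_name': 'Madchild', 'playlist': 'Electro Latino'}
```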

In [None]:
import calendar
import time

# to generate timestamps for prediction file
def get_time_stamp():
    ts = calendar.timegm(time.gmtime())
    return ts

get_time_stamp()

1611939468
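
`calendar.timegm(time.gmtime())` yields whole seconds since the Unix epoch. One possible use, sketched here with a hypothetical `prediction_filename` helper, is to put the timestamp in the output filename so that repeated runs do not overwrite each other. The notebook itself writes to a fixed `prediction.json`.

```python
import calendar
import time


def prediction_filename(prefix="prediction"):
    """Build a filename such as prediction_1611939468.json (UTC epoch seconds)."""
    ts = calendar.timegm(time.gmtime())
    return f"{prefix}_{ts}.json"


filename = prediction_filename()
print(filename)
```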

## Generate prediction.json

This section writes the prediction results for all inputs from `dev.json` to a file.

In [None]:
def read_dev_data(file="dev.json"):
    dev_texts = []
    with open(file, "r", encoding="utf-8") as json_file:
        data = json.load(json_file)

        for k in data.keys():
          text = data[k]["text"]
          dev_texts.append(text)
          
    return dev_texts
dev_texts = read_dev_data()

In [None]:
from tqdm import tqdm

results = []
for i in tqdm(range(len(dev_texts))):
    res = nlu(dev_texts[i], tokenizer, joint_model, intent_names, slot_names)
    results.append(res)

100%|██████████| 2887/2887 [02:58<00:00, 16.21it/s]


In [None]:
# process results
results_dict = dict()

for idx, res in enumerate(results):
    results_dict[str(idx)] = res

In [None]:
with open("prediction.json", "w") as f:
    json.dump(results_dict, f, indent=2)
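
If gold labels are available for the dev set (an assumption; the notebook only reads `text` from `dev.json`), the predictions could be scored along these lines. `evaluate` is a hypothetical helper, and the two small dicts are made-up stand-ins for the gold file and `prediction.json`.

```python
def evaluate(gold, pred):
    """Return intent accuracy and micro-averaged slot precision/recall/F1."""
    correct_intents = 0
    tp = fp = fn = 0
    for key, gold_item in gold.items():
        pred_item = pred.get(key, {"intent": None, "slots": {}})
        correct_intents += gold_item["intent"] == pred_item["intent"]
        gold_slots = set(gold_item["slots"].items())
        pred_slots = set(pred_item["slots"].items())
        tp += len(gold_slots & pred_slots)  # slot name and value both match
        fp += len(pred_slots - gold_slots)
        fn += len(gold_slots - pred_slots)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"intent_accuracy": correct_intents / len(gold),
            "slot_precision": precision, "slot_recall": recall, "slot_f1": f1}


# Made-up records in the same shape as prediction.json.
gold = {
    "0": {"intent": "AddToPlaylist", "slots": {"playlist": "hot 50"}},
    "1": {"intent": "PlayMusic", "slots": {"artist": "Brian May"}},
}
pred = {
    "0": {"intent": "AddToPlaylist", "slots": {"playlist": "hot 50"}},
    "1": {"intent": "AddToPlaylist", "slots": {"artist": "Brian"}},
}
scores = evaluate(gold, pred)
print(scores)
```

Exact matching on slot values is strict: a partially correct value such as `Brian` for `Brian May` counts as both a false positive and a false negative.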

In [None]:
!head prediction.json

{
  "0": {
    "intent": "AddToPlaylist",
    "slots": {
      "entity_name": "changes & things",
      "playlist": "hot 50"
    }
  },
  "1": {
    "intent": "AddToPlaylist",
