# **Leveraging BERT for Natural Language Understanding of Domain-Specific Knowledge**

This notebook implements the approach from the paper __BERT for Joint Intent Classification and Slot Filling__ by Chen et al. (2019), --> https://arxiv.org/abs/1902.10909
It builds on Shawon Ashraf's notebook, available here --> https://github.com/ShawonAshraf/nlu-jointbert-dl2021.

This notebook contains the code accompanying the paper Leveraging BERT for Natural Language Understanding of Domain-Specific Knowledge, by V.I. Iga and G.C. Silaghi, available here --> https://github.com/IonutIga/Domain-Specific-NLU-BERT/



  **In order for this notebook to run properly**, load the datasets available at --> https://github.com/IonutIga/Domain-Specific-NLU-BERT, from the Datasets folder.
You can train ATIS, SNIPS and/or the custom generated dataset (generated using the Dialogue Simulator available here --> https://github.com/IonutIga/Dialogue-Simulator).

This version of the notebook has added functionality for converting text formatted data (ATIS, SNIPS) into JSON format, **while keeping all original labelings**.

## Dataset format

The data has the following format:
````json5
{
  "text": "",
  "positions": [{}],
  "slots": [{}],
  "intent": ""
}
````

We will be using `text` as the input and `slots` and `intent` as labels.
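
For illustration, a single record might look like the sketch below. The utterance, labels and intent are made up for this example. The layout (each slot label mapped to its text, and to inclusive start and end character offsets) is assumed from what `convert_text_format_to_JSON` writes later in this notebook.

```python
import json

# Hypothetical record, not taken from any of the datasets
record = {
    "text": "flights from boston to denver",
    "positions": {
        "B-fromloc.city_name": [13, 18],  # inclusive character offsets
        "B-toloc.city_name": [23, 28],
    },
    "slots": {
        "B-fromloc.city_name": "boston",
        "B-toloc.city_name": "denver",
    },
    "intent": "atis_flight",
}

# Each position pair recovers its slot value from the text
recovered = {
    label: record["text"][start:end + 1]
    for label, (start, end) in record["positions"].items()
}

print(json.dumps(record, indent=2))
```

Here `recovered` equals `record["slots"]`, which is a quick way to check that offsets and slot values agree.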

## Install and import required libraries

In [None]:
!pip install transformers
!pip install datasets
!pip install seqeval
import re
import json
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy
from tqdm import tqdm
from datasets import load_metric
import os

## Define functions

In [None]:
# calculate the start and end index of a word in a phrase
def entityDetails(phrase, entity, start_index = 0):
  phrase = phrase[start_index:]
  startIndex = phrase.find(entity)
  endIndex = startIndex + len(entity) - 1
  return startIndex + start_index, endIndex + start_index

# read all lines from a file into a list (without trailing newlines)
def read_lines(file):
  with open(file) as f:
    lines = f.read().splitlines()
  return lines

**Convert text format into JSON**. ATIS and SNIPS store phrases as plain text, while our generated dataset stores information in JSON format. To train our model on these two datasets, we design a function that converts the plain texts and their intent and slot labels into JSON objects. A text may contain the same slot more than once, but JSON needs unique keys, so a slot label can be stored only once per object. To avoid losing the repeated slots, the function appends a number to each slot label that repeats within an utterance; later functions remove the number and keep only the standard label. In ATIS and SNIPS, most texts contain each slot only once, so this affects few examples.
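
The comments in the next cell describe the approach used since v1.1: a number is appended to each slot label that repeats within an utterance, so that it can still serve as a unique key. A minimal sketch of that idea follows, using a hypothetical helper `number_repeated_labels` that is not part of the notebook and simplifies the bookkeeping done by `convert_text_format_to_JSON`.

```python
def number_repeated_labels(labels):
    """Give each repeated label a numeric suffix so it can be a dict key."""
    seen = set()
    counter = 0
    unique = []
    for label in labels:
        if label in seen:
            # label already used in this utterance: append a running number
            unique.append(f"{label}{counter}")
            counter += 1
        else:
            unique.append(label)
            seen.add(label)
    return unique


numbered = number_repeated_labels(["B-city", "B-state", "B-city", "B-city"])
print(numbered)  # ['B-city', 'B-state', 'B-city0', 'B-city1']
```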

In [None]:
# provide the name of the resulted JSON file for out_file
# v1.1, added support for converting multi-label texts into multi-label JSON format, keeping all original labels
# how it works: we add a number to each slot that repeats in a user utterance. Later, all the numbers will be removed by other functions and only keep the standard label

def convert_text_format_to_JSON(labels, seqin, seqout, out_file):

  annotated_phrases = {}
  index = 0
  i = 0
  annotated_phrase = {'text' : '',
                      'slots' : {},
                      'positions':{},
                      'intent': ''}
  for i in range(len(seqin)):
    start_index = 0
    annotated_phrase['text'] = seqin[i]
    annotated_phrase['intent'] = labels[i]
    tok_seqout = seqout[i].split(' ')
    tok_seqin = seqin[i].split(' ')
    complete_word = ''
    complete_label = ''
    for k, v in zip(tok_seqout, tok_seqin):
      if k != 'O':
        if k.startswith('B'):

          if complete_word:
            si, ei = entityDetails(seqin[i], complete_word, start_index)
            start_index = ei + 1
            annotated_phrase['slots'][complete_label] = complete_word
            annotated_phrase['positions'][complete_label] = [si, ei]
            complete_word = v
            if k in annotated_phrase['slots'].keys():
              complete_label = f'{k}{index}'
              index += 1
            else:
              complete_label = f'{k}'

          else:
            complete_word += v

            if k in annotated_phrase['slots'].keys():
              complete_label = f'{k}{index}'
              index += 1
            else:
              complete_label = f'{k}'
        elif k.startswith('I'):
          if complete_word != '':
            complete_word += f' {v}'

      elif complete_word:
        si, ei = entityDetails(seqin[i], complete_word, start_index)
        start_index = ei + 1
        annotated_phrase['slots'][complete_label] = complete_word
        annotated_phrase['positions'][complete_label] = [si, ei]
        complete_label = ''
        complete_word = ''
    if complete_word:
      si, ei = entityDetails(seqin[i], complete_word, start_index)
      start_index = ei + 1
      annotated_phrase['slots'][complete_label] = complete_word
      annotated_phrase['positions'][complete_label] = [si, ei]
    annotated_phrases[i] = annotated_phrase
    annotated_phrase = {'text' : '',
                      'slots' : {},
                      'positions':{},
                      'intent': ''}
  # write all annotated phrases to the output file
  with open(out_file, 'w') as f:
    json.dump(annotated_phrases, f, indent = 4)
  return None

In [None]:
class RawData(object):
    def __init__(self, id, intent, positions, slots, text):
        self.id = id
        self.intent = intent
        self.positions = positions
        self.slots = slots
        self.text = text

    def __repr__(self):
        return str(json.dumps(self.__dict__, indent=2))


"""
read_train_json_file (defined in the next cell) reads JSON from a data file
and returns a list of RawData objects
"""



In [None]:
def read_train_json_file(filename):
    if os.path.exists(filename):
        intents = []

        with open(filename, "r", encoding="utf-8") as json_file:
            data = json.load(json_file)

            for k in data.keys():
                intent = data[k]["intent"]
                positions = data[k]["positions"]
                slots = data[k]["slots"]
                text = data[k]["text"]

                temp = RawData(k, intent, positions, slots, text)
                intents.append(temp)

        return intents
    else:
        raise FileNotFoundError("No file found with that path!")

In [None]:
# encode intents into tensors
def encode_intents(intents, intent_map):
    encoded = []
    for i in intents:
        encoded.append(intent_map[i])
    # convert to tf tensor
    return encoded, tf.convert_to_tensor(encoded, dtype="int32")

In [None]:
# gets slot name from its values
def get_slot_from_word(word, slot_dict):
    for slot_label,value in slot_dict.items():
        if word in value.split():
            return slot_label
    return None

In [None]:
# tokenize input by a pattern, removing unnecessary spaces
def tokenize(pattern, text):
  final_text = []
  tokens = re.split(pattern,text)
  for t in tokens:
    if t not in ['', ' ']:
      final_text.append(t)
  return final_text
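
The pattern passed to `tokenize` later in this notebook, `'( |,)'`, contains a capturing group, so `re.split` returns the separators as well as the pieces between them. The function drops spaces and empty strings, which means commas survive as tokens of their own. A small sketch with a made-up input string:

```python
import re

# re.split keeps the text matched by a capturing group in its result
parts = re.split('( |,)', 'boston, denver')
print(parts)  # ['boston', ',', '', ' ', 'denver']

# same filtering as in tokenize: drop empty strings and single spaces
tokens = [p for p in parts if p not in ('', ' ')]
print(tokens)  # ['boston', ',', 'denver']
```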

In [None]:
# function to encode slots from each text
# v1.1, added support for converting multi-label texts into multi-label JSON format, keeping all original labels
# how it works: added standard_slots parameter, with the standard names for each slot label. Then, each slot name that contains a number at the end will be converted to the standard name

def encode_slots(all_slots, all_texts,
                 tokenizer, slot_map, max_len, standard_slots = []):
    encoded_slots = np.zeros(shape=(len(all_texts), max_len), dtype=np.int32)

    for idx, text in enumerate(all_texts):
        enc = [] # for this idx, to be added at the end to encoded_slots

        # slot names for this idx
        slot_names = all_slots[idx]

        # raw word tokens
        # not using bert for this block because bert uses
        # a wordpiece tokenizer which will make
        # the slot label to word mapping
        # difficult
        raw_tokens = tokenize('( |,)', text)

        # words or slot_values associated with a certain
        # slot_name are contained in the values of the
        # dict slots_names
        # now this becomes a two way lookup
        # first we check if a word belongs to any
        # slot label or not and then we add the value from
        # slot map to encoded for that word
        for rt in raw_tokens:
            # use bert tokenizer
            # to get wordpiece tokens
            bert_tokens = tokenizer.tokenize(rt)

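            # find the slot name for a token
            # numbered labels such as 'B-toloc.city_name0' are mapped back to
            # their standard name by stripping the trailing digits; exact
            # matching avoids confusing e.g. 'B-time' with 'B-time_relative',
            # and values of repeated slots are joined so none is overwritten
            if standard_slots:
                copy_slots = {}
                for k, v in slot_names.items():
                    base = re.sub(r'\d+$', '', k)
                    if base in standard_slots:
                        copy_slots[base] = f"{copy_slots.get(base, '')} {v}".strip()
            else:
                # no standard names given: use the slot labels as they are
                copy_slots = slot_names

            rt_slot_name = get_slot_from_word(rt, copy_slots)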
            if rt_slot_name is not None:
                # fill with the slot_map value for all bert tokens for rt
                enc.append(slot_map[rt_slot_name])
                enc.extend([slot_map[rt_slot_name]] * (len(bert_tokens) - 1))
            else:
                # rt is not associated with any slot name
                enc.append(0)
                enc.extend([0] * (len(bert_tokens) - 1))

        # now add to encoded_slots
        # ignore the first and the last elements
        # in encoded text as they're special chars
        encoded_slots[idx, 1:len(enc)+1] = enc

    return encoded_slots

**Classifier definition**:
The model uses BERT as a base transformer layer. On top of it, a Dropout layer is added. Finally, two Dense layers, one for Intent Detection and the other for Slot Filling, are placed at the top of the model. It returns logit values.
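
Because the model returns logits rather than probabilities, a prediction is simply the index of the largest logit: one index per token for slots and one per utterance for the intent. The sketch below illustrates this with made-up numbers and a hypothetical `argmax` helper; the label counts are assumptions for the example only.

```python
def argmax(scores):
    """Index of the largest score in a flat list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical logits for one utterance: 4 tokens x 3 slot labels, 2 intents
slot_logits = [
    [2.1, 0.3, -1.0],
    [0.2, 1.7, 0.1],
    [0.0, 0.4, 2.2],
    [1.5, -0.3, 0.2],
]
intent_logits = [-0.5, 1.2]

slot_ids = [argmax(token_scores) for token_scores in slot_logits]
intent_id = argmax(intent_logits)
print(slot_ids, intent_id)  # [0, 1, 2, 0] 1
```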

In [None]:
model_name = "bert-base-cased"

class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                 model_name=model_name, dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        self.intent_classifier = Dense(intent_num_labels,
                                       name="intent_classifier")
        self.slot_classifier = Dense(slot_num_labels,
                                     name="slot_classifier")

    def call(self, inputs, **kwargs):
        # two outputs from BERT
        trained_bert = self.bert(inputs, **kwargs)
        pooled_output = trained_bert.pooler_output
        sequence_output = trained_bert.last_hidden_state

        # sequence_output will be used for slot_filling / classification
        sequence_output = self.dropout(sequence_output,
                                       training=kwargs.get("training", False))
        slot_logits = self.slot_classifier(sequence_output)

        # pooled_output for intent classification
        pooled_output = self.dropout(pooled_output,
                                     training=kwargs.get("training", False))
        intent_logits = self.intent_classifier(pooled_output)

        return slot_logits, intent_logits

**The NLU pipeline** is used to predict intent and slots from a specific text.
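
BERT's WordPiece tokenizer may split a word into several pieces, with continuation pieces prefixed by `##`, so slot predictions arrive per piece and have to be joined back into words. The following sketch shows the joining step in isolation; `merge_wordpieces` is a hypothetical helper and the token split is invented for the example, not produced by the real tokenizer.

```python
def merge_wordpieces(tokens):
    """Join '##' continuation pieces onto the word that precedes them."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # strip the '##' marker and append
        else:
            words.append(token)
    return words


pieces = ["flights", "to", "ta", "##com", "##a"]
merged = merge_wordpieces(pieces)
print(merged)  # ['flights', 'to', 'tacoma']
```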

In [None]:
def nlu(text, tokenizer, model, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits, intent_logits = outputs

    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, :]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]

    info = {"intent": intent_names[intent_id], "slots": {}}

    out_dict = {}
    # get all slot names and add to out_dict as keys
    predicted_slots = set([slot_names[s] for s in slot_ids if s != 0])
    for ps in predicted_slots:
      out_dict[ps] = []

    tokens = tokenizer.tokenize(text, add_special_tokens=True)

    # enumerate gives the true position of each token; tokens.index(token)
    # would return the first occurrence when a token repeats in the text
    for idx, (token, slot_id) in enumerate(zip(tokens, slot_ids)):
        # add all to out_dict
        slot_name = slot_names[slot_id]

        if slot_name == "O":
            continue

        # collect tokens
        collected_tokens = [token]

        # see if it starts with ##
        # then it belongs to the previous token
        if token.startswith("##"):
          # check if the token already exists or not
          if tokens[idx - 1] not in out_dict[slot_name]:
            collected_tokens.insert(0, tokens[idx - 1])

        # add collected tokens to slots
        out_dict[slot_name].extend(collected_tokens)
    # process out_dict
    for slot_name in out_dict:
        tokens = out_dict[slot_name]
        slot_value = tokenizer.convert_tokens_to_string(tokens)
        info["slots"][slot_name] = slot_value.strip()

    return info, slot_ids, intent_id

### Load Tokenizer from transformers

We will use the pretrained BERT model `bert-base-cased` for both the tokenizer and our classifier.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

## **ATIS dataset**

In [None]:
# download train data for ATIS

In [None]:
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/atis/train/label
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/atis/train/seq.in
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/atis/train/seq.out

In [None]:
#download test data for ATIS

In [None]:
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/atis/test/label
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/atis/test/seq.in
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/atis/test/seq.out

In [None]:
# concatenate the two datasets into one, so all the preprocessing is done at once
labels = read_lines("label") + read_lines('label.1')
seqin = read_lines('seq.in') + read_lines('seq.in.1')
seqout = read_lines('seq.out') + read_lines('seq.out.1')

In [None]:
convert_text_format_to_JSON(labels, seqin, seqout, 'atis.json')

In [None]:
# read from json file
train_data = read_train_json_file("atis.json")

In [None]:
example = train_data[1]
example

In [None]:
len(train_data)

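Encode texts from the dataset.
We have to encode the texts using the tokenizer to create tensors for training the classifier.
The ATIS training examples come first and the test examples start at index 4478. Because Python slices exclude their end index, `texts[:4477]` below covers indices 0 to 4476, so the example at index 4477 is used in neither split.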
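
Python slices exclude their end index, so a single boundary `n` yields `items[:n]` and `items[n:]` with no gap and no overlap between the two parts. A small sketch with a hypothetical `split_at` helper and toy data:

```python
def split_at(items, n_train):
    """First n_train items for training, the remaining ones for testing."""
    return items[:n_train], items[n_train:]


items = list(range(10))
train, test = split_at(items, 7)
print(train[-1], test[0])  # 6 7
print(len(train) + len(test) == len(items))  # True
```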


In [None]:
# https://huggingface.co/transformers/preprocessing.html


def encode_texts(tokenizer, texts):
    return tokenizer(texts[:4477], padding=True, truncation=True, return_tensors="tf"), tokenizer(texts[4478:], padding=True, truncation=True, return_tensors="tf")

texts = [d.text for d in train_data]
tds, test_tds = encode_texts(tokenizer, texts)
tds.keys()

In [None]:
encoded_texts = tds
encoded_texts_test = test_tds

In [None]:
# get the unique list of intents' names
intents = [d.intent for d in train_data]
intent_names = list(set(intents))
intent_names

In [None]:
# map intents' names to indexes, which is going to be used by the model to assign predictions
intent_map = dict() # intent -> index
for idx, ui in enumerate(intent_names):
    intent_map[ui] = idx
intent_map

In [None]:
# reverse intents' names with ids, to use when predicting for converting an ID into natural language
id_to_intent_name = {v: k for k, v in intent_map.items()}
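
The cells above build the name list with `list(set(...))`. For strings, the iteration order of a `set` can differ between Python sessions, so ids built this way are only meaningful within a single run. Sorting the names first is one way to make the mapping reproducible. A sketch with a hypothetical `build_label_map` helper and made-up intent names:

```python
def build_label_map(labels):
    """Return (name -> id, id -> name) with ids assigned in sorted order."""
    names = sorted(set(labels))
    name_to_id = {name: idx for idx, name in enumerate(names)}
    id_to_name = {idx: name for name, idx in name_to_id.items()}
    return name_to_id, id_to_name


name_to_id, id_to_name = build_label_map(
    ["atis_flight", "atis_airfare", "atis_flight"])
print(name_to_id)  # {'atis_airfare': 0, 'atis_flight': 1}
```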

In [None]:
true_intents, encoded_intents = encode_intents(intents, intent_map)

### Slots

To pad all the texts to the same length, the tokenizer uses special tokens. To handle those, we need to add `O` to `slot_names`. It can be some other symbol as well.
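
In `encode_slots`, each label row starts as all zeros (the id of `O`), and the per-token slot ids are written starting at position 1, which leaves position 0 for the `[CLS]` token and the tail for `[SEP]` and padding. A pure-Python sketch of that layout, with a hypothetical `pad_slot_ids` helper and made-up ids:

```python
def pad_slot_ids(token_slot_ids, max_len, pad_id=0):
    """Place slot ids after the [CLS] position and pad the rest with pad_id."""
    if len(token_slot_ids) > max_len - 2:
        # room is needed for [CLS] at the start and [SEP] at the end
        raise ValueError("sequence does not fit into max_len")
    row = [pad_id] * max_len
    row[1:len(token_slot_ids) + 1] = token_slot_ids
    return row


padded = pad_slot_ids([0, 5, 0, 7], max_len=8)
print(padded)  # [0, 0, 5, 0, 7, 0, 0, 0]
```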

In [None]:
# with other functions modifications, we can convert texts into JSON format and keep all the original labelings

standard_slots = ['O',
 'B-arrive_time.period_mod',
 'B-compartment',
 'B-arrive_date.day_number',
 'B-fare_amount',
 'B-flight_stop',
 'B-month_name',
 'B-flight',
 'B-booking_class',
 'B-stoploc.state_code',
 'B-return_date.day_number',
 'B-meal_code',
 'B-arrive_date.month_name',
 'B-arrive_time.time',
 'B-return_date.date_relative',
 'B-return_date.month_name',
 'B-stoploc.city_name',
 'B-time',
 'B-day_number',
 'B-depart_time.period_mod',
 'B-depart_time.period_of_day',
 'B-return_time.period_of_day',
 'B-city_name',
 'B-stoploc.airport_name',
 'B-depart_date.day_number',
 'B-toloc.airport_name',
 'B-arrive_time.end_time',
 'B-fromloc.state_code',
 'B-airline_name',
 'B-fare_basis_code',
 'B-aircraft_code',
 'B-days_code',
 'B-toloc.airport_code',
 'B-return_date.today_relative',
 'B-meal',
 'B-class_type',
 'B-transport_type',
 'B-day_name',
 'B-or',
 'B-depart_date.day_name',
 'B-airline_code',
 'B-arrive_date.today_relative',
 'B-toloc.country_name',
 'B-meal_description',
 'B-fromloc.state_name',
 'B-depart_time.end_time',
 'B-depart_time.start_time',
 'B-flight_time',
 'B-flight_mod',
 'B-arrive_time.period_of_day',
 'B-fromloc.airport_code',
 'B-airport_name',
 'B-economy',
 'B-toloc.state_code',
 'B-restriction_code',
 'B-connect',
 'B-depart_date.month_name',
 'B-state_code',
 'B-fromloc.airport_name',
 'B-toloc.city_name',
 'B-fromloc.city_name',
 'B-today_relative',
 'B-arrive_date.date_relative',
 'B-depart_time.time_relative',
 'B-toloc.state_name',
 'B-depart_date.date_relative',
 'B-stoploc.airport_code',
 'B-arrive_date.day_name',
 'B-airport_code',
 'B-return_date.day_name',
 'B-time_relative',
 'B-mod',
 'B-state_name',
 'B-return_time.period_mod',
 'B-depart_time.time',
 'B-flight_days',
 'B-arrive_time.start_time',
 'B-cost_relative',
 'B-depart_date.today_relative',
 'B-depart_date.year',
 'B-round_trip',
 'B-flight_number',
 'B-period_of_day',
 'B-arrive_time.time_relative']

In [None]:
# encode slots
slot_names = set()
for td in train_data:
    slots = td.slots
    for slot in slots:
      for s in standard_slots:
        if s in slot:
          slot_names.add(s)
slot_names = list(slot_names)
slot_names.insert(0, "O")
slot_names

In [None]:
# map slots' names to indexes, which is going to be used by the model to assign predictions
slot_map = dict() # slot -> index
for idx, us in enumerate(slot_names):
    slot_map[us] = idx
slot_map

In [None]:
# reverse slots' names with ids, to use when predicting for converting an ID into natural language
id_to_slot_name = {v: k for k, v in slot_map.items()}

In [None]:
# test to see if the slots are aligned with the texts, by getting the slot name for a token within a text
print(train_data[1].text)
print(train_data[1].slots)
print("slot_name for 1000 is : ", get_slot_from_word("1000", train_data[1].slots))

In [None]:
# find the max encoded text length
# tokenizer pads all texts to same length anyway so
# just get the length of the first one's input_ids
max_len = len(encoded_texts["input_ids"][0])

In [None]:
# get all the slots and texts, to encode the slots
all_slots = [td.slots for td in train_data]
all_texts = [td.text for td in train_data]

In [None]:
encoded_slots = encode_slots(all_slots, all_texts, tokenizer, slot_map, max_len, standard_slots)

In [None]:
encoded_slots[1]

In [None]:
# define the model
joint_model_atis = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

### Hyperparams, Optimizer and Loss function

In [None]:
opt = Adam(learning_rate=3e-5, epsilon=1e-08)

# two outputs, one for slots, another for intents
# we have to fine tune for both
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]

metrics = [SparseCategoricalAccuracy("accuracy")]
# compile model
joint_model_atis.compile(optimizer=opt, loss=losses, metrics=metrics)
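
Both outputs are trained with sparse categorical cross-entropy computed from logits. The sketch below works through that loss by hand for made-up numbers. It assumes that the two losses are simply added with equal weight, which is the Keras default when no `loss_weights` are given; `sparse_ce_from_logits` is a hypothetical helper.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits: 2 tokens x 2 slot labels, and 3 intents
slot_logits = [[2.0, 0.0], [0.0, 2.0]]
slot_labels = [0, 1]
intent_logits = [0.0, 0.0, 0.0]
intent_label = 2

# slot loss is averaged over tokens, intent loss is one value per utterance
slot_loss = sum(
    sparse_ce_from_logits(l, y) for l, y in zip(slot_logits, slot_labels)
) / len(slot_labels)
intent_loss = sparse_ce_from_logits(intent_logits, intent_label)
total_loss = slot_loss + intent_loss
print(round(slot_loss, 4), round(intent_loss, 4), round(total_loss, 4))
```

With uniform intent logits over three classes, the intent loss equals `log(3)`, which gives a useful reference value for an untrained classifier.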

### Train

In [None]:
x = {"input_ids": encoded_texts["input_ids"], "token_type_ids": encoded_texts["token_type_ids"],  "attention_mask": encoded_texts["attention_mask"]}

history = joint_model_atis.fit(
    x, (encoded_slots[:4477], encoded_intents[:4477]), epochs=32, batch_size=32, shuffle=True)

### Inference

In [None]:
# slot labels for the text below
#O O O O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name O O O O O B-stoploc.city_name I-stoploc.city_name

In [None]:
nlu("i would like to find a flight from charlotte to las vegas that makes a stop in st. louis", tokenizer, joint_model_atis,
    intent_names, slot_names)

In [None]:
#slot labels for the text below
#O B-depart_date.month_name B-depart_date.day_number O O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name

In [None]:
nlu("on april first i need a ticket from tacoma to san jose departing before 7 am", tokenizer, joint_model_atis,
    intent_names, slot_names)

**TESTING**

In [None]:
# predict intent and slots for each text from test dataset

results = []
slots_ids = []
intent_ids = []
for i in tqdm(range(len(all_texts[4478:]))):
    res, slot_id, intent_id = nlu(all_texts[i+4478], tokenizer, joint_model_atis, intent_names, slot_names)
    results.append(res)
    slots_ids.append(slot_id)
    intent_ids.append(intent_id)

In [None]:
# calculate slot metrics

metric = load_metric("seqeval")

all_predictions = []
all_labels = []
# a single pass over the test set: each (prediction, label) pair is visited once
for prediction, label in zip(slots_ids, encoded_slots[4478:]):
    for predicted_idx, label_idx in zip(prediction, label):
        all_predictions.append(id_to_slot_name[predicted_idx])
        all_labels.append(id_to_slot_name[label_idx])
metrics_to_write = metric.compute(predictions=[all_predictions], references=[all_labels])
metrics_to_write

In [None]:
# calculate intent metrics

all_predictions = []
all_labels = []
# a single pass over the test set: each (prediction, label) pair is visited once
for predicted_idx, label_idx in zip(intent_ids, true_intents[4478:]):
    all_predictions.append(id_to_intent_name[predicted_idx])
    all_labels.append(id_to_intent_name[label_idx])
metrics_to_write_intent = metric.compute(predictions=[all_predictions], references=[all_labels])
metrics_to_write_intent

In [None]:
# manual slots accuracy

correct = 0
counter = 0
for prediction, label in zip(slots_ids, encoded_slots[4478:]):
  for p, l in zip(prediction, label):
    counter += 1
    if p == l:
      correct += 1
print(correct/counter)

In [None]:
# manual overall accuracy

correct = 0
counter = 0
final_correct = 0
i = 4478
j = 0
test_data_length = 893
for prediction, label in zip(slots_ids, encoded_slots[4478:]):
  print(f'current turn: {j + 1}')
  for p, l in zip(prediction, label):
    counter += 1
    if p == l:
      correct += 1
  if counter == correct:
    if intent_ids[j] == true_intents[i]:
      final_correct +=1
      print(final_correct)
  correct = 0
  counter = 0
  i += 1
  j += 1
print(final_correct/test_data_length)
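
The overall accuracy computed above counts an utterance as correct only when every slot id and the intent id match. The same idea can be written more compactly, as in the sketch below. `overall_accuracy` is a hypothetical helper and the ids are made up; as in the cell above, `zip` stops at the shorter of the prediction and the padded label row.

```python
def overall_accuracy(pred_slots, true_slots, pred_intents, true_intents):
    """Share of utterances with all slots and the intent predicted correctly."""
    correct = 0
    for ps, ts, pi, ti in zip(pred_slots, true_slots,
                              pred_intents, true_intents):
        slots_match = all(p == t for p, t in zip(ps, ts))
        if slots_match and pi == ti:
            correct += 1
    return correct / len(true_intents)


acc = overall_accuracy(
    pred_slots=[[0, 1, 0], [0, 2, 0]],
    true_slots=[[0, 1, 0, 0], [0, 3, 0, 0]],  # padded label rows
    pred_intents=[1, 0],
    true_intents=[1, 0],
)
print(acc)  # 0.5
```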

In [None]:
# manual intent accuracy

correct = 0
for p, l in zip(intent_ids, true_intents[4478:]):
    if p == l:
      correct += 1
print(correct/test_data_length)

## **SNIPS dataset**

In [None]:
# download train data for SNIPS

In [None]:
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/snips/train/label
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/snips/train/seq.in
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/snips/train/seq.out

In [None]:
#download test data for SNIPS

In [None]:
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/snips/test/label
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/snips/test/seq.in
!wget https://raw.githubusercontent.com/monologg/JointBERT/master/data/snips/test/seq.out

In [None]:
# concatenate the two datasets into one, so all the preprocessing is done at once
labels = read_lines("label.2") + read_lines('label.3')
seqin = read_lines('seq.in.2') + read_lines('seq.in.3')
seqout = read_lines('seq.out.2') + read_lines('seq.out.3')

In [None]:
convert_text_format_to_JSON(labels, seqin, seqout,'snips.json')

In [None]:
# read from json file
train_data = read_train_json_file("snips.json")

In [None]:
example = train_data[0]
example

Encode texts from the dataset.
We have to encode the texts using the tokenizer to create tensors for training the classifier.
The SNIPS training set runs from index 0 to 13083 and the test set from index 13084 to the end.


In [None]:
# https://huggingface.co/transformers/preprocessing.html

def encode_texts(tokenizer, texts):
    return tokenizer(texts[:13084], padding=True, truncation=True, return_tensors="tf"), tokenizer(texts[13084:], padding=True, truncation=True, return_tensors="tf")

texts = [d.text for d in train_data]
tds, test_tds = encode_texts(tokenizer, texts)
tds.keys()

In [None]:
encoded_texts = tds
encoded_texts_test = test_tds

In [None]:
# get the unique list of intents' names

intents = [d.intent for d in train_data]
intent_names = list(set(intents))
intent_names

In [None]:
# map intents' names to indexes, which is going to be used by the model to assign predictions
intent_map = dict() # intent -> index
for idx, ui in enumerate(intent_names):
    intent_map[ui] = idx
intent_map

In [None]:
# reverse intents' names with ids, to use when predicting for converting an ID into natural language

id_to_intent_name = {v: k for k, v in intent_map.items()}

In [None]:
true_intents, encoded_intents = encode_intents(intents, intent_map)

In [None]:
# with other functions modifications, we can convert texts into JSON format and keep all the original labelings

standard_slots = ['O',
 'B-poi',
 'B-entity_name',
 'B-restaurant_type',
 'B-country',
 'B-served_dish',
 'B-genre',
 'B-object_name',
 'B-service',
 'B-sort',
 'B-state',
 'B-restaurant_name',
 'B-city',
 'B-movie_type',
 'B-movie_name',
 'B-condition_description',
 'B-object_location_type',
 'B-year',
 'B-geographic_poi',
 'B-party_size_description',
 'B-playlist',
 'B-object_select',
 'B-cuisine',
 'B-artist',
 'B-track',
 'B-party_size_number',
 'B-spatial_relation',
 'B-album',
 'B-timeRange',
 'B-object_type',
 'B-best_rating',
 'B-object_part_of_series_type',
 'B-rating_unit',
 'B-rating_value',
 'B-playlist_owner',
 'B-location_name',
 'B-music_item',
 'B-facility',
 'B-condition_temperature',
 'B-current_location']

In [None]:
# encode slots
slot_names = set()
for td in train_data:
    slots = td.slots
    for slot in slots:
      for s in standard_slots:
        if s in slot:
          slot_names.add(s)
slot_names = list(slot_names)
slot_names.insert(0, "O")
slot_names

In [None]:
# map slots' names to indexes, which is going to be used by the model to assign predictions

slot_map = dict() # slot -> index
for idx, us in enumerate(slot_names):
    slot_map[us] = idx
slot_map

In [None]:
# reverse slots' names with ids, to use when predicting for converting an ID into natural language

id_to_slot_name = {v: k for k, v in slot_map.items()}

In [None]:
# test to see if the slots are aligned with the texts, by getting the slot name for a token within a text

print(train_data[0].text)
print(train_data[0].slots)
print("slot_name for westbam is : ", get_slot_from_word("westbam", train_data[0].slots))

In [None]:
# find the max encoded text length
# tokenizer pads all texts to same length anyway so
# just get the length of the first one's input_ids

max_len = len(encoded_texts["input_ids"][0])

In [None]:
# get all the slots and texts, to encode the slots

all_slots = [td.slots for td in train_data]
all_texts = [td.text for td in train_data]

In [None]:
encoded_slots = encode_slots(all_slots, all_texts, tokenizer, slot_map, max_len, standard_slots)

In [None]:
joint_model_snips = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

In [None]:
opt = Adam(learning_rate=3e-5, epsilon=1e-08)

# two outputs, one for slots, another for intents
# we have to fine tune for both
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]

metrics = [SparseCategoricalAccuracy("accuracy")]
# compile model
joint_model_snips.compile(optimizer=opt, loss=losses, metrics=metrics)

### Train

In [None]:
x = {"input_ids": encoded_texts["input_ids"], "token_type_ids": encoded_texts["token_type_ids"],  "attention_mask": encoded_texts["attention_mask"]}

history = joint_model_snips.fit(
    x, (encoded_slots[:13084], encoded_intents[:13084]), epochs=32, batch_size=32, shuffle=True)

### Inference

In [None]:
# slots labels for text below
#O B-artist I-artist O O B-playlist I-playlist O

In [None]:
nlu("add sabrina salerno to the grime instrumentals playlist", tokenizer, joint_model_snips,
    intent_names, slot_names)

In [None]:
#slots labels for text below
#O O O O B-party_size_number O O O O O O B-spatial_relation O B-poi O O B-restaurant_type O

In [None]:
nlu("i want to bring four people to a place that s close to downtown that serves churrascaria cuisine", tokenizer, joint_model_snips,
    intent_names, slot_names)

In [None]:
#predict intent and slots for texts in the test dataset

from tqdm import tqdm

results = []
slots_ids = []
intent_ids = []
for i in tqdm(range(len(all_texts[13084:]))):
    res, slot_id, intent_id = nlu(all_texts[i+13084], tokenizer, joint_model_snips, intent_names, slot_names)
    results.append(res)
    slots_ids.append(slot_id)
    intent_ids.append(intent_id)

In [None]:
# calculate slots metrics

metric = load_metric("seqeval")

all_predictions = []
all_labels = []
# a single pass over the test set: each (prediction, label) pair is visited once
for prediction, label in zip(slots_ids, encoded_slots[13084:]):
    for predicted_idx, label_idx in zip(prediction, label):
        all_predictions.append(id_to_slot_name[predicted_idx])
        all_labels.append(id_to_slot_name[label_idx])
metrics_to_write = metric.compute(predictions=[all_predictions], references=[all_labels])
metrics_to_write

In [None]:
# calculate intent metrics

all_predictions = []
all_labels = []
# a single pass over the test set: each (prediction, label) pair is visited once
for predicted_idx, label_idx in zip(intent_ids, true_intents[13084:]):
    all_predictions.append(id_to_intent_name[predicted_idx])
    all_labels.append(id_to_intent_name[label_idx])
metrics_to_write_intent = metric.compute(predictions=[all_predictions], references=[all_labels])
metrics_to_write_intent

In [None]:
#manual slots accuracy

correct = 0
counter = 0
for prediction, label in zip(slots_ids, encoded_slots[13084:]):
  for p, l in zip(prediction, label):
    counter += 1
    if p == l:
      correct += 1
print(correct/counter)

In [None]:
# manual overall accuracy

correct = 0
counter = 0
final_correct = 0
i = 13084
j = 0
train_set_length = 700  # size of the SNIPS test set, despite the variable name
for prediction, label in zip(slots_ids, encoded_slots[13084:]):
  for p, l in zip(prediction, label):
    counter += 1
    if p == l:
      correct += 1
  if counter == correct:
    print(i,j)
    if intent_ids[j] == true_intents[i]:
      final_correct +=1
  correct = 0
  counter = 0
  i += 1
  j += 1
print(final_correct/train_set_length)

In [None]:
# manual intent metrics

correct = 0
for p, l in zip(intent_ids, true_intents[13084:]):
    if p == l:
      correct += 1
print(correct/train_set_length)

## **Custom Generated Dataset**

In [None]:
# if files do not load via drag and drop, you can use the files.upload function

from google.colab import files

uploaded = files.upload()

In [None]:
# read from json file
# the notebook is set to train for the best generated dataset (as described in the paper) by default. Any change to that requires multiple changes in the continuing code

train_data = read_train_json_file("train_1250_updated.json")

In [None]:
example = train_data[0]
example

In [None]:
len(train_data)

In [None]:
# https://huggingface.co/transformers/preprocessing.html
# for the default set (train.json), use 8750 as train limit and 8751 as test starting index.
# for the train_625.json set, use 2000 as train limit and 2001 as test starting index.
# for the train_1250.json set, use 3800 as train limit and 3801 as test starting index.
# for the train_5000.json set, use 16500 as train limit and 16501 as test starting index.

limit = 3800
test_start_index = 3801

def encode_texts(tokenizer, texts):
    return tokenizer(texts[:limit], padding=True, truncation=True, return_tensors="tf"), tokenizer(texts[test_start_index:], padding=True, truncation=True, return_tensors="tf")

texts = [d.text for d in train_data]
tds, test_tds = encode_texts(tokenizer, texts)
tds.keys()
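
The split above is done purely by slicing: `texts[:limit]` for training and `texts[test_start_index:]` for testing. The sketch below reproduces that slicing on a toy list, without a tokenizer, to show which indexes land where. The helper name `split_by_index` and the toy values are assumptions for illustration. Note that when `test_start_index` is `limit + 1`, as in the suggested settings, the example at index `limit` ends up in neither set.

```python
def split_by_index(items, limit, test_start_index):
    """Mirror the slicing in encode_texts: items[:limit] and items[test_start_index:]."""
    return items[:limit], items[test_start_index:]


toy_texts = [f"utterance {i}" for i in range(10)]
toy_train, toy_test = split_by_index(toy_texts, limit=7, test_start_index=8)

# items that fall into neither split
skipped = [t for t in toy_texts if t not in toy_train and t not in toy_test]

print(len(toy_train), toy_test, skipped)
```

If the skipped example should be part of the test set, setting `test_start_index` equal to `limit` would include it.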

In [None]:
encoded_texts = tds
encoded_texts_test = test_tds

In [None]:
len(encoded_texts['input_ids'])

In [None]:
len(encoded_texts_test['input_ids'])

In [None]:
# get the unique list of intents' names

intents = [d.intent for d in train_data]
# sorted so that the intent -> index mapping is the same on every run
intent_names = sorted(set(intents))
intent_names

In [None]:
# map intent names to indexes, which the model uses as prediction targets

intent_map = dict() # intent -> index
for idx, ui in enumerate(intent_names):
    intent_map[ui] = idx
intent_map

In [None]:
# invert the map (index -> intent name), so a predicted ID can be converted back into a readable name

id_to_intent_name = {v: k for k, v in intent_map.items()}
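
The forward and reverse maps work as a pair: labels are encoded to integer IDs for training and decoded back to names at prediction time. A minimal round-trip sketch follows. The helper `build_label_maps` and the toy intents are assumptions for illustration (only `insert` appears in the notebook's own example), and this is not the notebook's `encode_intents` function.

```python
def build_label_maps(labels):
    """Return (names, name_to_id, id_to_name) with a deterministic ordering."""
    names = sorted(set(labels))
    name_to_id = {name: idx for idx, name in enumerate(names)}
    id_to_name = {idx: name for name, idx in name_to_id.items()}
    return names, name_to_id, id_to_name


# toy intent labels, hypothetical apart from "insert"
toy_intents = ["insert", "delete", "insert", "update"]
toy_names, toy_intent_map, toy_id_to_intent = build_label_maps(toy_intents)

toy_encoded = [toy_intent_map[name] for name in toy_intents]  # names -> ids
toy_decoded = [toy_id_to_intent[idx] for idx in toy_encoded]  # ids -> names

print(toy_encoded, toy_decoded)
```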

In [None]:
true_intents, encoded_intents = encode_intents(intents, intent_map)

In [None]:
# standard slot names: numbered variants such as B-remove_param0, B-remove_param1 etc. are collapsed into B-remove_param
# together with the changes to the other functions, this lets us convert text data into JSON format and keep all the original labels

standard_slots = ['O',
 'B-hasStatus',
 'B-instance_type',
 'B-old_values_hasStatus',
 'B-remove_param',
 'B-new_values_hasClass',
 'B-hasName',
 'B-old_values_hasCode',
 'B-instance',
 'B-hasCode',
 'B-hasClass',
 'B-hasRole',
 'B-new_values_hasStatus',
 'B-new_values_hasRole',
 'B-procedure',
 'B-new_values_hasName',
 'B-hasManager',
 'B-old_values_hasRole',
 'B-old_values_hasName',
 'B-new_values_hasManager',
 'B-activeEntity',
 'B-old_values_hasManager',
 'B-entity',
 'B-old_values_hasClass',
 'B-new_values_hasCode']



In [None]:
# encode slots
slot_names = set()
for td in train_data:
    slots = td.slots
    for slot in slots:
      for s in standard_slots:
        if s in slot:
          slot_names.add(s)
# sorted for a reproducible slot -> index mapping; "O" goes first so it gets index 0
slot_names = sorted(slot_names - {"O"})
slot_names.insert(0, "O")
slot_names
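
The cell above collapses numbered slot names onto the entries of `standard_slots` by substring matching. An alternative way to express the same normalization is to strip the trailing index with a regular expression and then check membership. This is a swapped-in technique shown only as a sketch: `normalize_slot` and `toy_standard` are hypothetical names, and the approach assumes that numbered variants differ from the standard name only by trailing digits.

```python
import re


def normalize_slot(slot, standard_slots):
    """Map a numbered slot such as 'B-remove_param0' to its standard name.

    Returns None when the stripped name is not a known standard slot.
    """
    base = re.sub(r"\d+$", "", slot)  # drop a trailing index, if any
    return base if base in standard_slots else None


toy_standard = ["O", "B-remove_param", "B-hasName"]

print(normalize_slot("B-remove_param1", toy_standard))
print(normalize_slot("B-hasName", toy_standard))
print(normalize_slot("B-unknown3", toy_standard))
```

Exact matching after stripping avoids accidental hits when one standard name happens to be a substring of another.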

In [None]:
# map slot names to indexes, which the model uses as prediction targets

slot_map = dict() # slot -> index
for idx, us in enumerate(slot_names):
    slot_map[us] = idx
slot_map

In [None]:
# invert the map (index -> slot name), so a predicted ID can be converted back into a readable name

id_to_slot_name = {v: k for k, v in slot_map.items()}

In [None]:
len(slot_map.keys())

In [None]:
# find the max encoded text length
# the tokenizer pads all training texts to the same length,
# so the length of the first example's input_ids is enough

max_len = len(encoded_texts["input_ids"][0])

In [None]:
# get all slots and texts, to encode slots

all_slots = [td.slots for td in train_data]
all_texts = [td.text for td in train_data]

In [None]:
encoded_slots = encode_slots(all_slots, all_texts, tokenizer, slot_map, max_len, standard_slots)

In [None]:
len(encoded_texts_test['input_ids'])

In [None]:
joint_model = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

In [None]:
opt = Adam(learning_rate=3e-5, epsilon=1e-08)

# two outputs, one for slots, another for intents
# we have to fine tune for both
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]

metrics = [SparseCategoricalAccuracy("accuracy")]
# compile model
joint_model.compile(optimizer=opt, loss=losses, metrics=metrics)
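
When `compile` receives a list of losses and no `loss_weights`, Keras minimises their unweighted sum, so the slot and intent heads are fine-tuned jointly. The sketch below works through that sum in plain Python for a single utterance. It assumes the slot loss is averaged over token positions (the default reduction) and is not the Keras implementation; the function names are illustrative.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one position, given raw logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(slot_logits, slot_labels, intent_logits, intent_label):
    """Mean slot loss over tokens plus the intent loss (unweighted sum)."""
    slot_loss = sum(
        sparse_ce_from_logits(logits, label)
        for logits, label in zip(slot_logits, slot_labels)
    ) / len(slot_labels)
    intent_loss = sparse_ce_from_logits(intent_logits, intent_label)
    return slot_loss + intent_loss


# toy example: 2 tokens with 2 slot classes, 3 intent classes, all logits equal
toy_loss = joint_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1], [0.0, 0.0, 0.0], 2)
print(toy_loss)
```

With uniform logits each head is maximally uncertain, so the total equals `log(2) + log(3)`.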

In [None]:
len(encoded_texts_test['input_ids'])

In [None]:
len(encoded_texts['input_ids'])

### Train

In [None]:
x = {"input_ids": encoded_texts["input_ids"], "token_type_ids": encoded_texts["token_type_ids"],  "attention_mask": encoded_texts["attention_mask"]}

history = joint_model.fit(
    x, (encoded_slots[:limit], encoded_intents[:limit]), validation_split = 0.2, epochs=4, batch_size=32, shuffle=True)

### Inference

In [None]:
joint_model.summary()

In [None]:
# expected intent and slots (with their character positions) for the text below
#
#        "slots": {
#            "B-hasName": "Pitch-Footballer"
#        },
#        "positions": {
#            "B-hasName": [
#               16,
#               31
#            ]
#        },
#        "intent": "insert"

nlu("(agree) name is Pitch-Footballer", tokenizer, joint_model,
    intent_names, slot_names)
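
The `positions` in the expected output above are character offsets into the text, with an inclusive end index. A minimal sketch of how such a span can be computed with `str.find` is shown below; `find_span` is a hypothetical helper written for illustration and is not the notebook's `entityDetails` function.

```python
def find_span(text, value, start=0):
    """Return (start, end) character indexes of value in text, end inclusive.

    Returns None when value does not occur at or after `start`.
    """
    begin = text.find(value, start)
    if begin == -1:
        return None
    return begin, begin + len(value) - 1


toy_text = "(agree) name is Pitch-Footballer"
toy_span = find_span(toy_text, "Pitch-Footballer")
print(toy_span)  # (16, 31), matching the positions listed above
```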

In [None]:
# predict intent and slots for each text in the test dataset

results = []
slots_ids = []
intent_ids = []
for i in tqdm(range(len(all_texts[test_start_index:]))):
    # pass intent_names first and slot_names second, matching the single-text call above
    res, slot_id, intent_id = nlu(all_texts[i+test_start_index], tokenizer, joint_model, intent_names, slot_names)
    results.append(res)
    slots_ids.append(slot_id)
    intent_ids.append(intent_id)

In [None]:
len(slots_ids)

In [None]:
len(all_texts)

In [None]:
# calculate slots metrics

metric = load_metric("seqeval")

all_predictions = []
all_labels = []
# visit each test example once; wrapping this in a loop over range(len(slots_ids))
# would only append the same pairs repeatedly
for prediction, label in zip(slots_ids, encoded_slots[test_start_index:]):
    for predicted_idx, label_idx in zip(prediction, label):
        all_predictions.append(id_to_slot_name[predicted_idx])
        all_labels.append(id_to_slot_name[label_idx])
metrics_to_write = metric.compute(predictions=[all_predictions], references=[all_labels])
metrics_to_write
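
To sanity-check the seqeval numbers, precision, recall and F1 can also be counted by hand. The sketch below decodes ID sequences sentence by sentence and computes micro-averaged scores over non-`O` tokens. It is a simplified token-level count, not seqeval's entity-level algorithm; since every tag in `standard_slots` is a single `B-` tag, the two should track each other closely, but that is an assumption. All names and toy values are illustrative.

```python
def decode(ids, id_to_name):
    """Convert a sequence of label ids into label names."""
    return [id_to_name[i] for i in ids]


def micro_prf(pred_seqs, true_seqs, outside="O"):
    """Micro precision, recall and F1 over tokens whose tag is not `outside`."""
    tp = fp = fn = 0
    for pred, true in zip(pred_seqs, true_seqs):
        for p, t in zip(pred, true):
            if p == t and p != outside:
                tp += 1
            else:
                if p != outside:
                    fp += 1  # predicted a slot that is wrong or absent
                if t != outside:
                    fn += 1  # missed or mislabelled a true slot
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_id_to_slot = {0: "O", 1: "B-hasName", 2: "B-hasCode"}
toy_pred = [decode(s, toy_id_to_slot) for s in [[0, 1, 0], [2, 0, 1]]]
toy_true = [decode(s, toy_id_to_slot) for s in [[0, 1, 0], [2, 1, 0]]]

toy_precision, toy_recall, toy_f1 = micro_prf(toy_pred, toy_true)
print(toy_precision, toy_recall, toy_f1)
```

Keeping one list per sentence, rather than one flat list, also makes it easier to inspect individual errors.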

In [None]:
# calculate intent metrics

all_predictions = []
all_labels = []
# visit each test example once; an extra loop over range(len(intent_ids))
# would only append the same pairs repeatedly
for predicted_idx, label_idx in zip(intent_ids, true_intents[test_start_index:]):
    all_predictions.append(id_to_intent_name[predicted_idx])
    all_labels.append(id_to_intent_name[label_idx])
metrics_to_write_intent = metric.compute(predictions=[all_predictions], references=[all_labels])
metrics_to_write_intent

In [None]:
# manually calculated slot accuracy

correct = 0
counter = 0
for prediction, label in zip(slots_ids, encoded_slots[test_start_index:]):
  for p, l in zip(prediction, label):
    counter += 1
    if p == l:
      correct += 1
print(correct/counter)

In [None]:
# manually calculated overall accuracy

correct = 0
counter = 0
final_correct = 0
i = test_start_index
j = 0
for prediction, label in zip(slots_ids, encoded_slots[test_start_index:]):
  for p, l in zip(prediction, label):
    counter += 1
    if p == l:
      correct += 1
  if counter == correct:
    print(i,j)
    if intent_ids[j] == true_intents[i]:
      final_correct +=1
  correct = 0
  counter = 0
  i += 1
  j += 1
print(final_correct/len(encoded_slots[test_start_index:]))

In [None]:
# manually calculated intent accuracy

correct = 0
for p, l in zip(intent_ids, true_intents[test_start_index:]):
    if p == l:
      correct += 1
print(correct/len(encoded_slots[test_start_index:]))