# Advanced: Bert-Based Model for Dialogue Act Tagging

In this last section we use BERT to leverage contextual word embeddings, following on from the lab you have just done. This is an advanced part of the assignment and is worth 10 marks (20%) in total. You could use your BERT-based text classifier here (instead of the CNN utterance-level classifier) and see whether a pre-trained BERT language model helps. One possible downside of using BERT is the domain difference between its pre-training text and conversational data. Explore techniques that efficiently transfer knowledge from conversational data and improve model performance on DA tagging.

**Downgrade TensorFlow to 1.15, because tf.Session() is not available in TensorFlow 2.0**

**swda.zip and tokenization.py must be uploaded before running the cells below**

In [1]:
!pip3 install tensorflow==1.15

Collecting tensorflow==1.15
  Downloading https://files.pythonhosted.org/packages/3f/98/5a99af92fb911d7a88a0005ad55005f35b4c1ba8d75fba02df726cd936e6/tensorflow-1.15.0-cp36-cp36m-manylinux2010_x86_64.whl (412.3MB)
Collecting tensorflow-estimator==1.15.1
  Downloading https://files.pythonhosted.org/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503kB)
Collecting gast==0.2.2
  Downloading https://files.pythonhosted.org/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Collecting tensorboard<1.16.0,>=1.15.0
  Downloading https://files.pythonhosted.org/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl (3.8MB)
Building

In [0]:
import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub
import os
import re
import numpy as np
from tokenization import FullTokenizer
from tqdm import tqdm_notebook
from tensorflow.keras import backend as K

# Initialize session
sess = tf.Session()

# Params for bert model and tokenization
bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"


**Unzip swda dataset**

In [3]:
!unzip swda.zip

Archive:  swda.zip
   creating: swda/
  inflating: swda/.DS_Store          
   creating: __MACOSX/
   creating: __MACOSX/swda/
  inflating: __MACOSX/swda/._.DS_Store  
   creating: swda/sw00utt/
  inflating: swda/sw00utt/sw_0001_4325.utt.csv  
   creating: __MACOSX/swda/sw00utt/
  inflating: __MACOSX/swda/sw00utt/._sw_0001_4325.utt.csv  
  inflating: swda/sw00utt/sw_0002_4330.utt.csv  
  inflating: swda/sw00utt/sw_0003_4103.utt.csv  
  inflating: swda/sw00utt/sw_0004_4327.utt.csv  
  inflating: swda/sw00utt/sw_0005_4646.utt.csv  
  inflating: swda/sw00utt/sw_0006_4108.utt.csv  
  inflating: swda/sw00utt/sw_0007_4171.utt.csv  
  inflating: swda/sw00utt/sw_0008_4321.utt.csv  
  inflating: swda/sw00utt/sw_0009_4329.utt.csv  
  inflating: swda/sw00utt/sw_0010_4356.utt.csv  
  inflating: swda/sw00utt/sw_0011_4358.utt.csv  
  inflating: swda/sw00utt/sw_0012_4360.utt.csv  
  inflating: swda/sw00utt/sw_0013_4617.utt.csv  
  inflating: swda/sw00utt/sw_0014_4619.utt.csv  
  inflating: swda/sw00u

Importing libraries

In [0]:
import pandas as pd
import glob
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np

import sklearn.metrics
import tensorflow as tf
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook as tqdm

Reading the file

In [0]:
f = glob.glob("swda/sw*/sw*.csv")
frames = []
for i in range(0, len(f)):
    frames.append(pd.read_csv(f[i]))

result = pd.concat(frames, ignore_index=True)

In [6]:
print("Number of utterances in the dataset:", len(result))

Number of utterances in the dataset: 223606


Only the "text" and "act_tag" columns are needed for this problem

In [0]:
reduced_df = result[['act_tag','text']]

Reducing the number of tags (classes)

In [0]:
# Imported from "https://github.com/cgpotts/swda"
# Convert the combination tags to the generic 43 tags

import re
def damsl_act_tag(input):
        """
        Seeks to duplicate the tag simplification described at the
        Coders' Manual: http://www.stanford.edu/~jurafsky/ws97/manual.august1.html
        """
        d_tags = []
        tags = re.split(r"\s*[,;]\s*", input)
        for tag in tags:
            if tag in ('qy^d', 'qw^d', 'b^m'): pass
            elif tag == 'nn^e': tag = 'ng'
            elif tag == 'ny^e': tag = 'na'
            else: 
                tag = re.sub(r'(.)\^.*', r'\1', tag)
                tag = re.sub(r'[\(\)@*]', '', tag)            
                if tag in ('qr', 'qy'):                         tag = 'qy'
                elif tag in ('fe', 'ba'):                       tag = 'ba'
                elif tag in ('oo', 'co', 'cc'):                 tag = 'oo_co_cc'
                elif tag in ('fx', 'sv'):                       tag = 'sv'
                elif tag in ('aap', 'am'):                      tag = 'aap_am'
                elif tag in ('arp', 'nd'):                      tag = 'arp_nd'
                elif tag in ('fo', 'o', 'fw', '"', 'by', 'bc'): tag = 'fo_o_fw_"_by_bc'            
            d_tags.append(tag)
        # Dan J says (p.c.) that it makes sense to take the first;
        # there are only a handful of examples with 2 tags here.
        return d_tags[0]
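To make the modifier-stripping step concrete, here is a minimal standalone sketch of the two regular-expression substitutions used above. The helper name `strip_modifiers` is hypothetical and not part of the swda code, and it deliberately leaves out the special cases and the tag merging.

```python
import re

def strip_modifiers(tag):
    """Illustrative helper (hypothetical name): apply only the two
    regex clean-up steps from damsl_act_tag, without the tag merging."""
    # Drop everything from the first '^' onwards, keeping the character before it.
    tag = re.sub(r'(.)\^.*', r'\1', tag)
    # Remove leftover annotation characters: parentheses, '@' and '*'.
    tag = re.sub(r'[\(\)@*]', '', tag)
    return tag

for raw in ["sd^e", "sd(^q)", "aa@", "b"]:
    print(raw, "->", strip_modifiers(raw))
```

Tags without any modifier, such as "b", pass through unchanged.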

In [9]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
reduced_df = reduced_df.assign(act_tag=reduced_df["act_tag"].apply(damsl_act_tag))


Getting the sentences from the dataset

In [0]:
sentences = []
for i in range(0, len(reduced_df)):
    sentences.append(reduced_df['text'].iloc[i].split(" "))

Getting the set of unique_tags

In [0]:
unique_tags = set()
for tag in reduced_df['act_tag']:
    unique_tags.add(tag)

One hot encoding for the tags

In [0]:
one_hot_encoding_dic = pd.get_dummies(list(unique_tags))

In [0]:
tags_encoding = []
for i in range(0, len(reduced_df)):
    tags_encoding.append(one_hot_encoding_dic[reduced_df['act_tag'].iloc[i]])
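The lookup above maps every tag to a one-hot vector. A minimal sketch of the same idea without pandas is shown below; the names `tag_to_index` and `one_hot` are illustrative and not from the notebook. The one deliberate difference is that the tags are sorted first, so the column order is the same on every run, whereas the order of `list(unique_tags)` depends on set iteration order.

```python
# Toy tag sequence standing in for reduced_df['act_tag'] (illustrative data).
tags = ["sd", "b", "sv", "sd"]

# Sorting gives a deterministic index for every tag.
tag_to_index = {tag: i for i, tag in enumerate(sorted(set(tags)))}

def one_hot(tag):
    """Return a list with a single 1 at the position assigned to `tag`."""
    vec = [0] * len(tag_to_index)
    vec[tag_to_index[tag]] = 1
    return vec

encoded = [one_hot(tag) for tag in tags]
print(tag_to_index)
print(encoded)
```

Keeping `tag_to_index` around also makes it easy to map a predicted class index back to its tag later.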

Splitting into test and training dataset

In [0]:
from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(sentences,  np.array(tags_encoding))
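Because train_test_split is called without test_size, scikit-learn falls back to its default of 25% for the test set, with the test size rounded up. The sketch below only checks that arithmetic, assuming the 223606 rows reported earlier. No random_state is passed, so the rows that end up in each part differ from run to run.

```python
import math

n_samples = 223606          # number of rows reported for the dataset
test_fraction = 0.25        # scikit-learn's default when test_size is not given

n_test = math.ceil(test_fraction * n_samples)   # the test size is rounded up
n_train = n_samples - n_test

print(n_train, n_test)
```

These sizes agree with the 167704 training rows reported by train_input_ids.shape further down.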

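max_seq_length is the length of the longest utterance after splitting on spaces, which is 137 for this dataset. BERT's WordPiece tokenizer can produce more tokens than that for an utterance, in which case convert_single_example truncates it to max_seq_length - 2 tokens.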

In [0]:
HIDDEN_SIZE = len(unique_tags) 
max_seq_length = len(max(sentences, key=len))

Calculating class weights to compensate for the imbalance between tags

In [0]:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_integers = np.argmax(tags_encoding, axis=1)
# keyword arguments are accepted by old and new scikit-learn releases alike
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_integers), y=y_integers)
d_class_weights = dict(enumerate(class_weights))
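The 'balanced' setting weights each class by n_samples / (n_classes * class_count), so rare tags get larger weights. A minimal pure-Python sketch of that formula follows; the function name `balanced_class_weights` is illustrative, not a scikit-learn API.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }

# Toy labels: class 0 is three times as frequent as class 1.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

For these weights to affect training, they would have to be passed to model.fit through its class_weight argument; the fit call later in this notebook does not pass d_class_weights.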

# BERT MODEL

We create a custom BERT layer to use in place of the CNN from PART_B task 2.

First we create the input features for the BERT layer from X_train and X_test. BERT uses its own tokenizer to convert each utterance into input ids, an input mask and segment ids.

In [17]:

class PaddingInputExample(object):
    """Fake example so the number of input examples is a multiple of the batch size."""


class InputExample(object):

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs an InputExample."""
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

def create_tokenizer_from_hub_module():

    bert_module =  hub.Module(bert_path)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    vocab_file, do_lower_case = sess.run(
        [
            tokenization_info["vocab_file"],
            tokenization_info["do_lower_case"],
        ]
    )

    return FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

def convert_single_example(tokenizer, example, max_seq_length=256):
    """Converts a single `InputExample` into a single `InputFeatures`."""

    if isinstance(example, PaddingInputExample):
        input_ids = [0] * max_seq_length
        input_mask = [0] * max_seq_length
        segment_ids = [0] * max_seq_length
        label = 0
        return input_ids, input_mask, segment_ids, label

    tokens_a = tokenizer.tokenize(example.text_a)
    if len(tokens_a) > max_seq_length - 2:
        tokens_a = tokens_a[0 : (max_seq_length - 2)]

    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    return input_ids, input_mask, segment_ids, example.label

def convert_examples_to_features(tokenizer, examples, max_seq_length=256):
    """Convert a set of `InputExample`s to a list of `InputFeatures`."""

    input_ids, input_masks, segment_ids, labels = [], [], [], []
    for example in tqdm_notebook(examples, desc="Converting examples to features"):
        input_id, input_mask, segment_id, label = convert_single_example(
            tokenizer, example, max_seq_length
        )
        input_ids.append(input_id)
        input_masks.append(input_mask)
        segment_ids.append(segment_id)
        labels.append(label)
    return (
        np.array(input_ids),
        np.array(input_masks),
        np.array(segment_ids),
        np.array(labels).reshape(-1, 1),
    )

def convert_text_to_examples(texts, labels):
    """Create InputExamples"""
    InputExamples = []
    for text, label in zip(texts, labels):
        InputExamples.append(
            InputExample(guid=None, text_a=" ".join(text), text_b=None, label=label)
        )
    return InputExamples

# Task 1
# Instantiate tokenizer
tokenizer = create_tokenizer_from_hub_module()
# Convert data to InputExample format
# X_train, X_test, y_train, y_test
input_train_ex = convert_text_to_examples(X_train, y_train)
input_test_ex = convert_text_to_examples(X_test, y_test)
# Convert to features
train_input_ids, train_input_masks, train_segment_ids, _ = convert_examples_to_features(tokenizer, input_train_ex, max_seq_length)
test_input_ids, test_input_masks, test_segment_ids, _ = convert_examples_to_features(tokenizer, input_test_ex, max_seq_length)
# End of Task 1

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



In [18]:
train_input_ids.shape

(167704, 137)
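The feature construction in convert_single_example can be illustrated on a toy input without loading BERT. The sketch below is an assumption-laden simplification: `build_features` is a hypothetical helper, it takes token ids directly instead of running the tokenizer, and the ids used for [CLS], [SEP] and padding are passed in as illustrative defaults.

```python
def build_features(token_ids, max_seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Wrap token ids as [CLS] ... [SEP], then pad ids and mask to max_seq_length."""
    # Leave room for the two special tokens.
    token_ids = token_ids[: max_seq_length - 2]
    input_ids = [cls_id] + token_ids + [sep_id]
    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)
    padding = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * padding
    input_mask = input_mask + [0] * padding
    # Single-sentence input, so every position belongs to segment 0.
    segment_ids = [0] * max_seq_length
    return input_ids, input_mask, segment_ids

short = build_features([7, 8, 9], max_seq_length=6)
long = build_features([1, 2, 3, 4, 5, 6], max_seq_length=4)
print(short)
print(long)
```

The second call shows the truncation case: only the first max_seq_length - 2 tokens survive.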

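Creating the custom BERT layer. There are 110,302,011 parameters to be trained, which is too many to fine-tune in full here. We therefore fine-tune only the last 5 encoder layers (together with the pooler's dense layer when pooling is "first"), so n_fine_tune_layers = 5.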
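The layer decides which variables to fine-tune purely by matching substrings of the variable names. A minimal sketch of that selection on made-up variable names is shown below; `select_trainable` and the names in `names` are illustrative assumptions. Unlike the build method in this notebook, the sketch appends a trailing "/" to each pattern so that "layer_1" can never match "layer_10" or "layer_11".

```python
def select_trainable(var_names, n_fine_tune_layers, pooling="first", n_layers=12):
    """Return the variable names that would be fine-tuned."""
    # With "first" pooling the pooler's dense layer is fine-tuned as well.
    patterns = ["pooler/dense"] if pooling == "first" else []
    # Take the top encoder layers: layer_11, layer_10, ...
    for i in range(n_fine_tune_layers):
        patterns.append(f"encoder/layer_{n_layers - 1 - i}/")
    return [name for name in var_names if any(p in name for p in patterns)]

# Made-up variable names in the style of a BERT checkpoint.
names = [
    "bert/embeddings/word_embeddings:0",
    "bert/encoder/layer_1/output/dense/kernel:0",
    "bert/encoder/layer_10/output/dense/kernel:0",
    "bert/encoder/layer_11/output/dense/kernel:0",
    "bert/pooler/dense/kernel:0",
]

top_two = select_trainable(names, n_fine_tune_layers=2)
mean_one = select_trainable(names, n_fine_tune_layers=1, pooling="mean")
print(top_two)
print(mean_one)
```

Everything that is not selected, including the embeddings and the lower encoder layers, stays frozen.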

In [0]:
class BertLayer(tf.keras.layers.Layer):
    def __init__(
        self,
        n_fine_tune_layers=5,
        pooling="first",
        bert_path="https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1",
        **kwargs,
    ):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.trainable = True
        self.output_size = 768
        self.pooling = pooling
        self.bert_path = bert_path
        if self.pooling not in ["first", "mean"]:
            raise NameError(
                f"Undefined pooling type (must be either first or mean, but is {self.pooling})"
            )

        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.bert = hub.Module(
            self.bert_path, trainable=self.trainable, name=f"{self.name}_module"
        )

        # Remove unused layers
        trainable_vars = self.bert.variables
        if self.pooling == "first":
            trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name]
            trainable_layers = ["pooler/dense"]

        elif self.pooling == "mean":
            trainable_vars = [
                var
                for var in trainable_vars
                if not "/cls/" in var.name and not "/pooler/" in var.name
            ]
            trainable_layers = []
        else:
            raise NameError(
                f"Undefined pooling type (must be either first or mean, but is {self.pooling})"
            )

        # Select how many layers to fine tune
        for i in range(self.n_fine_tune_layers):
            trainable_layers.append(f"encoder/layer_{str(11 - i)}")

        # Update trainable vars to contain only the specified layers
        trainable_vars = [
            var
            for var in trainable_vars
            if any([l in var.name for l in trainable_layers])
        ]

        # Add to trainable weights
        for var in trainable_vars:
            self._trainable_weights.append(var)

        for var in self.bert.variables:
            if var not in self._trainable_weights:
                self._non_trainable_weights.append(var)

        super(BertLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        if self.pooling == "first":
            pooled = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                "pooled_output"
            ]
        elif self.pooling == "mean":
            result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                "sequence_output"
            ]

            mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1)
            masked_reduce_mean = lambda x, m: tf.reduce_sum(mul_mask(x, m), axis=1) / (
                    tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
            input_mask = tf.cast(input_mask, tf.float32)
            pooled = masked_reduce_mean(result, input_mask)
        else:
            raise NameError(f"Undefined pooling type (must be either first or mean, but is {self.pooling})")

        return pooled

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_size)
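The "mean" pooling branch averages the token vectors while ignoring padded positions. The following pure-Python sketch shows the same computation for a single sequence; `masked_mean` is a hypothetical helper, and the small epsilon mirrors the 1e-10 used in the layer to avoid dividing by zero.

```python
def masked_mean(vectors, mask, eps=1e-10):
    """Average the vectors whose mask entry is 1; padded positions contribute nothing."""
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec, m in zip(vectors, mask):
        for j in range(dim):
            total[j] += vec[j] * m
    # Number of real tokens, plus eps so an all-padding row does not divide by zero.
    count = sum(mask) + eps
    return [t / count for t in total]

# Two real tokens followed by one padding position with arbitrary values.
pooled = masked_mean([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], mask=[1, 1, 0])
print(pooled)
```

The padded vector has no influence on the result, which is why the mask is applied before summing.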

In [23]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Bidirectional, LSTM, Reshape, concatenate, Activation
from tensorflow.keras.optimizers import Adam

# input layers for input_ids, input_masks and segment_ids
input_ids = Input(shape=(max_seq_length, ), dtype='int32')
input_masks = Input(shape=(max_seq_length, ), dtype='int32')
segment_ids = Input(shape=(max_seq_length, ), dtype='int32')
# custom bert layer with input as a list of input_ids, input_masks and segment_ids
bert = BertLayer()([input_ids, input_masks, segment_ids])
# a dense layer with relu activation and bert as the input
dense1 = Dense(HIDDEN_SIZE, activation='relu')(bert)
# dropout layer to prevent overfitting
dropout1 = Dropout(0.2)(dense1)
# reshaping the output of the bert + dense + dropout to (None, 1, 43) so that it is compatible with the input for BiLSTM
reshape = Reshape((1, HIDDEN_SIZE))(dropout1)

# Bidirectional LSTM 1 with reshape as input
biLSTM1 = Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True))(reshape)
# Bidirectional LSTM 2
biLSTM2 = Bidirectional(LSTM(HIDDEN_SIZE))(biLSTM1)
# dense with relu activation
dense2 = Dense(HIDDEN_SIZE, activation='relu')(biLSTM2)
# dropout to prevent overfitting
dropout2 = Dropout(0.2)(dense2)

# concatenate the output of bert + dense + dropout with BiLSTM + dense + dropout
merged_2 = concatenate([dropout1, dropout2])
# output dense layer with 43 units for multiclass classification
dense_3 = Dense(HIDDEN_SIZE)(merged_2)
# softmax activation layer for multiclass classification
output = Activation('softmax')(dense_3)

optimizer = Adam()

model = Model(inputs=[input_ids, input_masks, segment_ids], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()


def initialize_vars(sess):
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    K.set_session(sess)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 137)]        0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            [(None, 137)]        0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            [(None, 137)]        0                                            
__________________________________________________________________________________________________
bert_layer_1 (BertLayer)        (None, 768)          110104890   input_4[0][0]                    
                                                                 input_5[0][0]              

In [24]:
initialize_vars(sess)
# train on 90% of the training data and validate on the remaining 10%
model.fit(
    [train_input_ids, train_input_masks, train_segment_ids], 
    y_train,
    validation_split = 0.1,
    epochs=1,
    batch_size=32
)

Train on 150933 samples, validate on 16771 samples


<tensorflow.python.keras.callbacks.History at 0x7f72b0d9be80>

Evaluating the model on the test set

In [25]:
score = model.evaluate([test_input_ids, test_input_masks, test_segment_ids], y_test, batch_size=100)



The model is very large and takes about 3 hours to train on a Google Colab TPU, so the output of this cell could not be obtained.
The training and validation accuracy are low because only 5 layers are fine-tuned (n_fine_tune_layers = 5) and the model was trained for just 1 epoch.



Getting the model's predictions on the test set

In [0]:
predictions = model.predict([test_input_ids, test_input_masks, test_segment_ids], batch_size=100)

Computing the confusion matrix

In [0]:
matrix = sklearn.metrics.confusion_matrix(y_test.argmax(axis=1), predictions.argmax(axis=1))

Getting the per-class accuracy (recall) for the br and bf tags

In [0]:
acc_class = matrix.diagonal()/matrix.sum(axis=1)

index_br = list(one_hot_encoding_dic["br"][one_hot_encoding_dic["br"]==1].index)[0]
br_accuracy = acc_class[index_br]*100
print("br accuracy: {}".format(br_accuracy))

index_bf = list(one_hot_encoding_dic["bf"][one_hot_encoding_dic["bf"]==1].index)[0]
bf_accuracy = acc_class[index_bf]*100
print("bf accuracy: {}".format(bf_accuracy))
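Dividing the diagonal of the confusion matrix by its row sums gives, for each class, the fraction of its true examples that were predicted correctly. The sketch below reproduces this in pure Python on toy labels; `per_class_recall` is a hypothetical helper. It builds the matrix over a fixed number of classes, so a row index always corresponds to the same class even if a class never occurs. With sklearn.metrics.confusion_matrix the same effect needs the labels argument, since otherwise only labels seen in the data get a row.

```python
def per_class_recall(y_true, y_pred, n_classes):
    """Return diagonal / row sum for each class, or None when a class has no true examples."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # rows are true classes, columns are predictions
    recalls = []
    for c in range(n_classes):
        support = sum(matrix[c])
        recalls.append(matrix[c][c] / support if support else None)
    return recalls

# Toy labels with 4 possible classes; class 3 never occurs.
recalls = per_class_recall(
    y_true=[0, 0, 1, 2, 2, 2],
    y_pred=[0, 1, 1, 2, 2, 0],
    n_classes=4,
)
print(recalls)
```

Returning None for an empty class avoids the division by zero that the vectorised expression would hit in that case.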