<a href="https://colab.research.google.com/github/Mahsalo/TestRepo/blob/master/Deep_Learning_for_NLP_(ICME_Summer_Workshop%2C_2019).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning for Natural Language Processing

*Instructor: Luke de Oliveira*

*Teaching Assistant: Alex Matton*

*Date: August 16th, 2019*

*Contact email: [lukedeo@ldo.io](mailto:lukedeo@ldo.io)*

## Structure

This notebook is split up into three parts. 

The first part is an introduction to "wrangling" text data in Python and how to prepare text data for use with Machine Learning algorithms. 

The second part walks through an implementation of a sentiment detection model with a Long Short-Term Memory network (LSTM). 

The third part will use a demo dataset to train a Semantic Chunking model for a conversational agent. 

To use a hardware accelerator (i.e., a GPU), navigate in the menu above to **`Runtime > Change runtime type > GPU`**.

## License

All code examples and code downloads are licensed under the (extremely permissive) [MIT license](https://opensource.org/licenses/MIT). My goal is to have this be a useful base for you, should you so desire.

## Datasets

This notebook will use two datasets: 

* A binary sentiment dataset, with reviews / produced content from Yelp, Amazon, and Twitter
* A semantic chunking dataset for a virtual assistant use case, where we'll be understanding weather queries

To download the sentiment dataset, run this cell:

In [0]:
!wget -q https://ldo.io/icme-sws/2019/sentiment-data.json

To download the virtual assistant dataset, run this cell:

In [0]:
!wget -q https://ldo.io/icme-sws/2019/weather-assistant.json

To download the `icmenlp` package, we run:

In [0]:
!wget -q https://ldo.io/icme-sws/2019/icmenlp.py

Let's take a look in our VM's directory...

In [0]:
ls

## Setup

We'll be using Colaboratory's built-in libraries (scikit-learn & Keras) to avoid any setup!

Let's set up our imports below.

In [0]:
# Make sure if we change any imports, they're reflected in our notebook
%load_ext autoreload
%autoreload 2

In [0]:
import keras

Now, we'll use our library for this tutorial - `icmenlp`. This library provides two main utilities -- first, a principled way to load the data for this session, and second, a vocabulary container, which we will describe later.

In [0]:
import icmenlp

Let's open our datasets using data loading functions that will provide us with a train-test-validate split.

In [0]:
ASSISTANT_DATA = 'weather-assistant.json'
SENTIMENT_DATA = 'sentiment-data.json'

In [0]:
assistant_data = icmenlp.load_data(ASSISTANT_DATA, 'assistant')

In [0]:
assistant_data['train'][4]

In [0]:
# Define a small utility to display chunked data a bit more easily
def display_chunking(chunks):
    sent = ''
    for ch in chunks:
        label = ch.get('label')
        text = ch['text']
        if not label:
            sent += text
        else:
            sent += '[{} | {}]'.format(text, label.upper())
    return sent

In [0]:
display_chunking(assistant_data['train'][4])

# Manipulating Text Data for ML

One of the questions most frequently asked by both students and industry practitioners is how to prepare text data for deep learning. Today, we're going to focus on **embeddings** (one of the more popular incarnations of this is Word2Vec). Right now, we'll learn how to prepare data for use in a model that learns to embed text.

The first step of any such pipeline is **tokenization**, that is, converting a single text or document into a *sequence* of *tokens*. For word-level models, these tokens roughly correspond to words / contractions; in a character-level model, they correspond to individual characters (or bytes). Many modern methods use **subword** information, which allows you to make predictions over text containing words that were not seen during training (the so-called OOV, or out-of-vocabulary, problem).

For example, the sentence
    
    Is there a minimum balance I need to maintain in my accounts?
    
could be *tokenized* as:
    
    'Is', 'there', 'a', 'minimum', 'balance', 'I', 'need', 'to', 'maintain', 'in', 'my', 'accounts', '?'

How, then, can one systematically convert text into these *tokens*? It turns out this is one of the most critical and underappreciated steps. It differs drastically from language to language, and requires a lot of care to ensure consistency. This is one of the reasons why **character-level** or **subword** models can be so useful in applied settings with inconsistent spelling, grammar, and nomenclature.

The dominant approach to word-based tokenization is to use a [**regular expression**](https://en.wikipedia.org/wiki/Regular_expression). Regular expressions define a formal language for searching through strings for matches to a query. We'll use one here to work with our text in this tutorial.

In [0]:
# re is Python's built-in regular expression module
import re

def tokenize(text):
    return [
        # Make sure there is no leading or trailing whitespace
        x.strip()
        # Split the text on runs of non-word characters (\W+). The capture
        # group keeps those runs (e.g. punctuation) in the output as tokens
        for x in re.split(r'(\W+)', text)
        # Only include the token if it is not empty after stripping
        if x.strip()
    ]
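As a quick check, the snippet below applies the same regular expression to the example sentence from earlier. The function is repeated from the cell above only so that the snippet runs on its own; the second input string is made up for illustration.

```python
import re

def tokenize(text):
    # Same logic as the cell above, repeated so this snippet is self-contained
    return [x.strip() for x in re.split(r'(\W+)', text) if x.strip()]

sentence = 'Is there a minimum balance I need to maintain in my accounts?'
tokens = tokenize(sentence)
print(tokens)

# A run of punctuation stays together as one token, because \W+ matches the
# whole run of non-word characters. Note that the apostrophe also splits "It's".
print(tokenize("Wow!!! It's great"))
```

Note that this simple pattern splits contractions such as `It's` into three tokens (`It`, `'`, `s`), which is one example of why tokenization rules need care.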

Let's load the sentiment dataset to try this out!

In [0]:
sentiment_data = icmenlp.load_data(SENTIMENT_DATA, 'sentiment')

# Let's grab a single training document (the third one in the list)
text = sentiment_data['train'][0][2]

In [0]:
print(text)

In [0]:
print(tokenize(text))

We will get into more detail about how embedding models work later on, but for now, let's discuss the conversion of text into a format that is useful for embedding models. Embedding models rely on an **integer** representation for each word, since we will use that integer as an index into a **lookup table**.
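To make the lookup-table idea concrete, here is a minimal sketch in plain Python. The tiny vocabulary, the table values, and the function name `embed` are made up for illustration; in a real model the table is a trainable weight matrix rather than a hand-written list.

```python
# Hypothetical toy vocabulary: each token is assigned an integer ID
token_to_id = {'this': 0, 'product': 1, 'was': 2, 'bad': 3}

# The lookup table has one row per ID; each row is that token's vector.
# The numbers are arbitrary placeholders, not trained values.
embedding_table = [
    [0.1, -0.3],   # this
    [0.7, 0.2],    # product
    [-0.2, 0.5],   # was
    [-0.9, -0.8],  # bad
]

def embed(tokens):
    """Map tokens -> integer IDs -> rows of the lookup table."""
    ids = [token_to_id[tok] for tok in tokens]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed(['this', 'product', 'was', 'bad'])
print(ids)      # [0, 1, 2, 3]
print(vectors)
```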

An important consideration when using deep learning models is the length of a piece of text, as well as a signal of the beginning and end of a sentence. Most deep learning models require that each batch of text passed in to the model have identical sentence lengths. We commonly solve this using **padding**, or adding a series of meaningless tokens to increase the length of a document. We can then use **masking** to ensure that our model does not incorporate these into the learning procedure.
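Padding and masking can be sketched on sentences that have already been converted to integers. The sketch assumes that ID `0` is reserved for padding, which is a convention chosen here for illustration and not necessarily what `icmenlp` does; the function name `pad_batch` is likewise hypothetical.

```python
PAD_ID = 0  # assumption: ID 0 is reserved for the padding token

def pad_batch(sequences, pad_length=None):
    """Right-pad (or truncate) every sequence to the same length."""
    if pad_length is None:
        pad_length = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        seq = seq[:pad_length]
        n_pad = pad_length - len(seq)
        padded.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

batch = [[5, 8, 2, 9], [7, 3]]
padded, mask = pad_batch(batch)
print(padded)  # [[5, 8, 2, 9], [7, 3, 0, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 0, 0]]
```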

We'll use `<PAD>` as the pad token, and `<S>`/`</S>` to delimit the beginning of a sentence and the end of a sentence respectively. In addition, any tokens that are unknown to us (for example, a word that is in the test set but not the train set) are mapped to the `<UNK>` token.

To map from tokens to integers, we will define a bijective (two-way) map from words to integers that we will learn from data.

We're going to use the one from our `icmenlp` package.

In [0]:
# Let's look inside the code of the utility provided
icmenlp.VocabularyContainer??

In [0]:
# Create a vocab collection object with the tokenizer we defined above
vocab = icmenlp.VocabularyContainer(tokenizer=tokenize)

In [0]:
text = sentiment_data['train'][0]

In [0]:
vocab.fit(text)

In [0]:
vocab.transform(['this product was bad!'])

In [0]:
vocab.transform(['this product was bad!', 'meh'], pad_length=10)

In [0]:
vocab.inverse_transform(vocab.transform(['this product was bad!']))

In [0]:
vocab.inverse_transform(vocab.transform(['this product was abhorrent!']))

In [0]:
vocab.inverse_transform(
    vocab.transform(
        ['this product was abhorrent!'], 
        pad_length=9, 
        add_start=True, 
        add_end=True
    )
)

We now know how to preprocess text data for use in deep learning. To summarize:

1.) Each document gets split into tokens, a process called tokenization.

2.) A mapping is learned from the vocabulary to unique integers.

3.) We have four special tokens -- the pad token `<PAD>`, the start-of-sentence token `<S>`, the end-of-sentence token `</S>`, and the unknown token `<UNK>`.

4.) We pad sentences with the `<PAD>` token to make them the same size.
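
The four steps above can be sketched end to end in plain Python. This is not the `icmenlp.VocabularyContainer` implementation: the class name `ToyVocabulary`, its method signatures, and the order in which IDs are assigned are assumptions made for illustration, and a whitespace split stands in for the tokenizer.

```python
class ToyVocabulary:
    """Hypothetical, minimal stand-in for a vocabulary container."""

    SPECIALS = ['<PAD>', '<UNK>', '<S>', '</S>']

    def __init__(self, tokenizer=str.split):
        self.tokenizer = tokenizer
        self.id_to_token = list(self.SPECIALS)
        self.token_to_id = {t: i for i, t in enumerate(self.id_to_token)}

    def fit(self, documents):
        # Step 2: learn a token <-> integer mapping from the training data
        for doc in documents:
            for tok in self.tokenizer(doc):  # Step 1: tokenization
                if tok not in self.token_to_id:
                    self.token_to_id[tok] = len(self.id_to_token)
                    self.id_to_token.append(tok)
        return self

    def transform(self, documents, pad_length):
        unk, pad = self.token_to_id['<UNK>'], self.token_to_id['<PAD>']
        out = []
        for doc in documents:
            # Step 3: wrap in <S> ... </S>; unseen tokens become <UNK>
            ids = [self.token_to_id['<S>']]
            ids += [self.token_to_id.get(t, unk) for t in self.tokenizer(doc)]
            ids.append(self.token_to_id['</S>'])
            # Step 4: pad (or truncate) to a common length
            ids = ids[:pad_length]
            out.append(ids + [pad] * (pad_length - len(ids)))
        return out

    def inverse_transform(self, encoded):
        # The mapping is two-way, so integers can be turned back into tokens
        return [[self.id_to_token[i] for i in ids] for ids in encoded]


toy_vocab = ToyVocabulary().fit(['this product was bad', 'this was great'])
encoded = toy_vocab.transform(['this product was abhorrent'], pad_length=8)
print(encoded)
print(toy_vocab.inverse_transform(encoded))
```

The word `abhorrent` never appears in the data passed to `fit`, so it comes back as `<UNK>` after the round trip.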

Now, let's learn about deep learning for NLP!

# Sentiment Analysis

For the next segment, we're going to train a sentiment prediction model using deep learning. In particular, we're going to use a recurrent neural network (RNN) to process text **token by token**. We'll be using words as tokens.

An RNN uses a **cell** to process each timestep of a sequence. In this case, the cell of our RNN will process text one word at a time. A common problem with RNNs is that information from early timesteps is forgotten, so by the end of the sequence the network has little signal about how it began. A Long Short-Term Memory (LSTM) network addresses this by introducing a better method for retaining state/memory via a **memory cell**, and introducing a **forget gate**, which allows the network to learn when to forget.
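
To make the roles of the memory cell and the forget gate concrete, here is a sketch of a single LSTM step on scalar inputs, following the standard LSTM equations. The function name `lstm_step` and all parameter values are made up for illustration; real LSTM layers use weight matrices and vectors rather than scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM timestep on scalars. `p` maps gate name -> (w, u, b)."""
    def gate(name, activation):
        w, u, b = p[name]
        return activation(w * x + u * h_prev + b)

    f = gate('forget', sigmoid)       # how much old memory to keep
    i = gate('input', sigmoid)        # how much new information to write
    o = gate('output', sigmoid)       # how much memory to expose
    g = gate('candidate', math.tanh)  # the new information itself

    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Hypothetical parameters: a large forget bias and a very negative input
# bias make the cell hold on to its memory across timesteps
params = {
    'forget': (0.0, 0.0, 10.0),
    'input': (0.0, 0.0, -10.0),
    'output': (0.0, 0.0, 0.0),
    'candidate': (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0  # start with something stored in the memory cell
for x in [0.5, -1.0, 2.0, 0.3]:
    h, c = lstm_step(x, h, c, params)
print(c)  # stays close to the initial value of 1.0
```

With these values the forget gate is almost fully open and the input gate almost fully closed, so the memory cell carries its initial content through all four timesteps nearly unchanged.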

Let's walk through how to build a model for our sentiment analysis task. 

*Best-practice tip*: we're going to create and use a **new class** to make sure we can keep track of the preprocessing that goes into such a model.

In [0]:
import numpy as np
from sklearn.preprocessing import LabelEncoder


class BinarySentimentModel():

    def __init__(self, embedding_dim=128, lstm_size=256, bidirectional=False,
                 optimizer='adam'):
        """Create a new model to handle a binary sentiment task.

        The wrapper class will hold all state with respect to preprocessing,
        transformation, and model training.

        Args:
            embedding_dim (int): The dimension of the "word vectors" to be
                trained in the model.
            lstm_size (int): The number of hidden units in the LSTM.
            bidirectional (bool): Whether or not to process the data using a
                bidirectional or single-directional LSTM.
            optimizer (str | keras.optimizers.Optimizer): The optimizer to use
                when optimizing the loss on the training set using Stochastic
                Gradient Descent (SGD).
        """
        # We need to convert the names of labels to integers for our model
        self.labelencoder = LabelEncoder()

        # We're going to use our vocab container from before to store all the
        # token -> ID mappings
        self.vocab = icmenlp.VocabularyContainer(tokenizer=tokenize)
        self.model = None

        # We're storing *hyperparameters* of the model here
        self.embedding_dim = embedding_dim
        self.lstm_size = lstm_size
        self.bidirectional = bidirectional
        self.optimizer = optimizer

    def make_model(self, vocab_size):
        """Creates a new keras model for the class.

        Args:
            vocab_size (int): The number of unique tokens contained in the
                training vocabulary

        Returns:
            keras.models.Model: A built and compiled Keras model for the task.
        """
        
        # The None tells Keras that we can expect differing sentence lengths
        text = keras.Input(shape=(None, ), dtype='int32')
        
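        # TODO: Write model together! Below is one possible completion (a
        # sketch, not necessarily the architecture written in the workshop).
        # We assume vocab_size already counts the special tokens.
        x = keras.layers.Embedding(vocab_size, self.embedding_dim)(text)
        lstm = keras.layers.LSTM(self.lstm_size)
        if self.bidirectional:
            lstm = keras.layers.Bidirectional(lstm)
        x = lstm(x)

        # A single sigmoid unit gives the probability of the positive class
        output = keras.layers.Dense(1, activation='sigmoid')(x)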

        model = keras.Model(text, output)

        # We're going to use crossentropy as our loss function here
        model.compile(optimizer=self.optimizer,
                      loss='binary_crossentropy',
                      metrics=['acc'])
        return model

    def fit(self, documents, labels, validation_data=None, pad_length='max',
            **kwargs):
        """Trains (or fits) the model on training data while validating
        on validation data

        Args:
            documents (List[str]): A list of training documents
            labels (List[str | int]): A list of training labels associated to
                each document.
            validation_data (Tuple[List[str], List[str]]): A tuple of
                validation data of the form (val_documents, val_labels)
            pad_length (str | int): The padding length to use for training.
            **kwargs: Passed to keras.Model.fit

        Returns:
            self

        Raises:
            ValueError: If validation data is not of the correct format.
        """
        # X will be an array of integers corresponding to words
        X = np.array(
            self.vocab.fit(documents).transform(
                documents, pad_length=pad_length)
        )

        # y will be an array of integers corresponding to sentiment labels (0
        # or 1)
        y = self.labelencoder.fit_transform(labels)

        if validation_data:
            # Process validation data the same way we process our training data
            if not len(validation_data) == 2:
                raise ValueError('Validation data must be a tuple (X, y)')
            documents_val, labels_val = validation_data
            validation_data = (
                np.array(self.vocab.transform(
                    documents_val, pad_length=pad_length)),
                self.labelencoder.transform(labels_val)
            )
            _ = kwargs.pop('validation_data', None)

        # Construct our Keras model
        self.model = self.make_model(vocab_size=self.vocab.vocab_size)

        # In practice, we would set up a *callback* in order to stop
        # training when the validation error is minimized
        self.model.fit(X, y, validation_data=validation_data, **kwargs)
        return self

    def predict_proba(self, documents, pad_length='max', **kwargs):
        # To run a prediction, we have to run all the way from:
        # text -> tokenization -> integers -> model
        # Here, we only care to get the probabilities for each class
        indices = np.array(self.vocab.transform(
            documents, pad_length=pad_length))
        return self.model.predict(indices, **kwargs).ravel()

    def predict(self, documents, pad_length='max', **kwargs):
        # predict_proba gives the probability of the positive class (label
        # index 1); we threshold it at 0.5 and recover the original label
        proba = self.predict_proba(documents, pad_length=pad_length,
                                   **kwargs)
        return self.labelencoder.inverse_transform((proba > 0.5).astype(int))


In [0]:
sentiment_model = BinarySentimentModel(bidirectional=True)

In [0]:
text, labels = sentiment_data['train']

In [0]:
text[:5]

In [0]:
# In a real application, we would use EarlyStopping and ModelCheckpoint
# callbacks to stop training at the bottom of the validation loss curve.
sentiment_model.fit(
    *sentiment_data['train'], 
    validation_data=sentiment_data['val'], 
    epochs=3
)
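
The stopping rule mentioned in the comment of the cell above can be sketched without Keras. The function `best_epoch` and the loss values are hypothetical; in Keras the same kind of rule is provided by the `EarlyStopping` callback (through its `patience` argument), typically paired with `ModelCheckpoint` to keep the best weights.

```python
def best_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a patience-based stopping rule.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch is the one to keep.
    """
    best, best_idx, waited = float('inf'), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_idx
    # Ran out of epochs before the patience was exhausted
    return len(val_losses) - 1, best_idx

# Made-up validation losses: they improve, then start to get worse
losses = [0.69, 0.52, 0.45, 0.47, 0.50, 0.58]
stop, best = best_epoch(losses, patience=2)
print(stop, best)  # 4 2
```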

In [0]:
sentiment_model.predict([
    'this was the worst product ive ever used!!!', 
    'pretty awesome product!!!'
])

In [0]:
sentiment_model.predict_proba([
    'this was the worst product ive ever used!!!', 
    'this was the best product ive ever used!!!'
])

# Virtual Assistant Dataset

In this example, we're going to train a model that is able to perform query understanding on a dataset of weather requests to a virtual assistant. We're going to build an LSTM model to predict which recognized component of a query each word in a piece of dialog corresponds to.

In [0]:
# Let's get an example to understand
example = assistant_data['train'][60]

In [0]:
print(example)

How can entities be encoded into labels? There is much debate on this, but there are generally three approaches:

* **IO-encoding**: Only encodes that a token is an entity of a given type or not
* **BIO-encoding**: Encodes the beginning of an entity with a `B`, and any following tokens in an entity with an `I`
* **BILUO-encoding**: Encodes the beginning of an entity with a `B`, any following tokens in an entity with an `I`, and the last token of an entity with an `L`. Single-token entities are a `U`

For example:

*How cold is it tomorrow evening?*

IO:
*How cold [`TEMP`] is it tomorrow [`TIME`] evening [`TIME`]?*

BIO:
*How cold [`B-TEMP`] is it tomorrow [`B-TIME`] evening [`I-TIME`]?*

BILUO:
*How cold [`U-TEMP`] is it tomorrow [`B-TIME`] evening [`L-TIME`]?*
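
The three schemes can be sketched as one small function over labeled chunks. The function name `encode_tags`, the `(tokens, label)` input format, and the use of `-` for tokens outside any entity are choices made for this illustration.

```python
def encode_tags(chunks, scheme='BIO'):
    """Turn [(tokens, label_or_None), ...] into one tag per token."""
    tags = []
    for tokens, label in chunks:
        n = len(tokens)
        if n == 0:
            continue
        if label is None:
            tags.extend(['-'] * n)
        elif scheme == 'IO':
            tags.extend([label] * n)
        elif scheme == 'BIO':
            tags.extend(['B-' + label] + ['I-' + label] * (n - 1))
        elif scheme == 'BILUO':
            if n == 1:
                tags.append('U-' + label)
            else:
                tags.extend(['B-' + label]
                            + ['I-' + label] * (n - 2)
                            + ['L-' + label])
        else:
            raise ValueError('Unknown scheme: {}'.format(scheme))
    return tags


# "How cold is it tomorrow evening?" split into labeled chunks
chunks = [
    (['How'], None),
    (['cold'], 'TEMP'),
    (['is', 'it'], None),
    (['tomorrow', 'evening'], 'TIME'),
    (['?'], None),
]
for scheme in ('IO', 'BIO', 'BILUO'):
    print(scheme, encode_tags(chunks, scheme))
```

One practical difference: under IO-encoding, two adjacent entities of the same type are indistinguishable from one longer entity, whereas the `B` tag in the other two schemes marks where each entity starts.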

In [0]:
def get_word_labels(example, tokenizer=tokenize):
    """
    We define a function that can take this data, and 
    return a sequence of tokens and a sequence of token-labels.
    """
    tokenized_text = []
    labels = []
    for chunk in example:
        # Use the tokenizer that was passed in
        tokenized_chunk_text = tokenizer(chunk['text'])
        if 'label' in chunk:
            # We're doing a suboptimal thing here: every token in a chunk
            # gets the same label (i.e., IO-encoding)
            chunk_labels = [chunk['label'].upper()] * len(tokenized_chunk_text)
        else:
            chunk_labels = ['-'] * len(tokenized_chunk_text)
        tokenized_text.extend(tokenized_chunk_text)
        labels.extend(chunk_labels)
    return tokenized_text, labels

In [0]:
example

In [0]:
get_word_labels(example)

In [0]:

class WeatherAssistantModel():

    def __init__(self, embedding_dim=128, lstm_size=256, optimizer='adam'):
        """Create a new model to handle a weather assistant use case.

        The wrapper class will hold all state with respect to preprocessing,
        transformation, and model training.

        Args:
            embedding_dim (int): The dimension of the "word vectors" to be
                trained in the model.
            lstm_size (int): The number of hidden units in the LSTM.
            optimizer (str | keras.optimizers.Optimizer): The optimizer to use
                when optimizing the loss on the training set using Stochastic
                Gradient Descent (SGD).
        """
        # We encode the label space and the vocabulary as `VocabularyContainer`
        # for padding. Our text is pretokenized, so we don't need a tokenizer
        self.labelencoder = icmenlp.VocabularyContainer(tokenizer=lambda x: x)
        self.vocab = icmenlp.VocabularyContainer(tokenizer=lambda x: x)
        self.model = None

        # Store our hyperparameters
        self.embedding_dim = embedding_dim
        self.lstm_size = lstm_size
        self.optimizer = optimizer

    def make_model(self, nb_classes, vocab_size):
        """Creates a new keras model for the class.

        Args:
            nb_classes (int): The total number of entity classes to predict
            vocab_size (int): The number of unique tokens contained in the
                training vocabulary

        Returns:
            keras.models.Model: A built and compiled Keras model for the task.
        """

        # The None tells Keras that we can expect differing sentence lengths
        text = keras.Input(shape=(None, ), dtype='int32')
        
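        # TODO: Write model together! Below is one possible completion (a
        # sketch, not necessarily the architecture written in the workshop).
        # We assume vocab_size already counts the special tokens.
        x = keras.layers.Embedding(vocab_size, self.embedding_dim)(text)

        # return_sequences=True keeps one output per token, which we need
        # because we predict a label for every token
        x = keras.layers.LSTM(self.lstm_size, return_sequences=True)(x)

        # A softmax over the entity classes at every timestep
        output = keras.layers.Dense(nb_classes, activation='softmax')(x)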

        model = keras.Model(text, output)
        model.compile(optimizer=self.optimizer,
                      loss='sparse_categorical_crossentropy')

        return model

    def fit(self, chunked_documents, validation_data=None, **kwargs):
        """Trains (or fits) the model on training data while validating
        on validation data

        Args:
            chunked_documents (List[List[Dict]]): Training data
            validation_data (List[List[Dict]]): Validation data
            **kwargs: Passed to keras.Model.fit

        Returns:
            self
        """
        # Get paired pre-tokenized documents with per-token tags
        documents, labels = zip(*[
            get_word_labels(example, tokenizer=tokenize)
            for example in chunked_documents
        ])

        # We want our input documents and our output labels to be integers
        X = np.array(self.vocab.fit(documents).transform(
            documents, pad_length=30))
        Y = np.expand_dims(
            np.array(self.labelencoder.fit(
                labels).transform(labels, pad_length=30)),
            -1
        )

        # The number of classes is the size of the label vocabulary
        nb_classes = self.labelencoder.vocab_size

        if validation_data:
            # Process validation data identically to training data
            documents_val, labels_val = zip(*[
                get_word_labels(example, tokenizer=tokenize)
                for example in validation_data
            ])
            validation_data = (
                np.array(self.vocab.transform(documents_val, pad_length=30)),
                np.expand_dims(np.array(self.labelencoder.transform(
                    labels_val, pad_length=30)), -1)
            )
            _ = kwargs.pop('validation_data', None)

        self.model = self.make_model(nb_classes=nb_classes,
                                     vocab_size=self.vocab.vocab_size)

        # In practice, we would set up a callback in order to stop
        # training when the validation error is minimized
        self.model.fit(X, Y, validation_data=validation_data, **kwargs)

        return self

    def predict(self, document):
        # To run inference on a single document, we tokenize it, obtain
        # indices, run it through the model, then decode the most likely tags
        # per token.
        segments, _ = get_word_labels([{'text': document}], tokenizer=tokenize)
        indices = np.array(self.vocab.transform(segments))
        label_inv = self.model.predict(indices).argmax(-1).astype(int).tolist()
        return list(zip(
            segments, self.labelencoder.inverse_transform(label_inv)
        ))


In [0]:
assistant_model = WeatherAssistantModel(lstm_size=256, embedding_dim=64)

In [0]:
# In a real application, we would use EarlyStopping and ModelCheckpoint
# callbacks to stop training at the bottom of the validation loss curve.
assistant_model.fit(assistant_data['train'], assistant_data['val'], 
                    epochs=20, batch_size=32)

In [0]:
result = assistant_model.predict("How cold is it tomorrow?")

In [0]:
result

In [0]:
understood_entities = [
    (word, ent_type[0])
    for word, ent_type in result
    if ent_type[0] != '-'
]

In [0]:
understood_entities