<a href="https://colab.research.google.com/github/Mahsalo/TestRepo/blob/master/Deep_Learning_for_NLP_(Part_1%2C_ICME_Summer_Workshop%2C_2019).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning for Natural Language Processing

*Instructor: Luke de Oliveira*

*Teaching Assistant: Alex Matton*

*Date: August 16th, 2019*

*Contact email: [lukedeo@ldo.io](mailto:lukedeo@ldo.io)*

## Structure

This notebook is split up into three parts. 

The first part is an introduction to "wrangling" text data in Python and how to prepare text data for use with Machine Learning algorithms. 

The second part walks through an implementation of a sentiment detection model with a Long Short-Term Memory network (LSTM). 

The third part will use a demo dataset to train a Semantic Chunking model for a conversational agent. 

To use a hardware accelerator (i.e., a GPU), navigate in the menu above to **`Runtime > Change runtime type > GPU`**.

## License

All code examples and code downloads are licensed under the (extremely permissive) [MIT license](https://opensource.org/licenses/MIT). My goal is to have this be a useful base for you, should you so desire.

## Datasets

This notebook will use two datasets: 

* A binary sentiment dataset, with reviews / produced content from Yelp, Amazon, and Twitter
* A semantic chunking dataset for a virtual assistant use case, where we'll be understanding weather queries

To download the sentiment dataset, run this cell:

In [0]:
!wget -q https://ldo.io/icme-sws/2019/sentiment-data.json

To download the virtual assistant dataset, run this cell:

In [0]:
!wget -q https://ldo.io/icme-sws/2019/weather-assistant.json

To download the `icmenlp` package, we run:

In [0]:
!wget -q https://ldo.io/icme-sws/2019/icmenlp.py

Let's take a look in our VM's directory...

In [0]:
ls

## Setup

We'll use libraries that come preinstalled in Colaboratory (scikit-learn and Keras) to avoid any setup!

Let's set up our imports below.

In [0]:
# Reload imported modules automatically so that code changes are reflected in the notebook
%load_ext autoreload
%autoreload 2

In [0]:
import keras

Now, we'll use our library for this tutorial, `icmenlp`. It provides two main utilities: a principled way to load the data for this session, and a vocabulary container, which we will describe later.

In [0]:
import icmenlp

Let's open our datasets using data loading functions that will provide us with a train-test-validate split.

In [0]:
ASSISTANT_DATA = 'weather-assistant.json'
SENTIMENT_DATA = 'sentiment-data.json'

In [0]:
assistant_data = icmenlp.load_data(ASSISTANT_DATA, 'assistant')

In [0]:
assistant_data['train'][4]

In [0]:
# Define a small utility to display chunked data a bit more easily
def display_chunking(chunks):
    sent = ''
    for ch in chunks:
        label = ch.get('label')
        text = ch['text']
        if not label:
            sent += text
        else:
            sent += '[{} | {}]'.format(text, label.upper())
    return sent

In [0]:
display_chunking(assistant_data['train'][4])
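To make the expected input format concrete, here is a minimal sketch that runs the same utility on a hand-written example. The chunk text and the `location` / `date` label names are made up for illustration and are not taken from the dataset. The only assumption carried over from the utility above is that each chunk is a dict with a `text` key and an optional `label` key.

```python
def display_chunking(chunks):
    # Same utility as above, repeated so this cell runs on its own
    sent = ''
    for ch in chunks:
        label = ch.get('label')
        text = ch['text']
        if not label:
            sent += text
        else:
            sent += '[{} | {}]'.format(text, label.upper())
    return sent

# Hypothetical chunks: made-up text and label names, for illustration only
example_chunks = [
    {'text': 'will it rain in '},
    {'text': 'Palo Alto', 'label': 'location'},
    {'text': ' '},
    {'text': 'tomorrow', 'label': 'date'},
    {'text': '?'},
]

rendered = display_chunking(example_chunks)
print(rendered)
# will it rain in [Palo Alto | LOCATION] [tomorrow | DATE]?
```

Unlabeled chunks are passed through unchanged, while labeled chunks are wrapped in brackets with the label in upper case.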

# Manipulating Text Data for ML

One of the most frequently asked questions, from students and industry alike, is how to prepare text data for deep learning. Today, we're going to focus on **embeddings** (Word2Vec is one of the more popular incarnations). For now, we'll learn how to prepare data for use in a model that learns to embed text.

The first step of any such pipeline is **tokenization**, that is, converting a single text or document into a *sequence* of *tokens*. For word-level models, these tokens roughly correspond to words and contractions; for character-level models, they correspond to individual characters (or bytes). Many modern methods use **subword** information, which allows you to make predictions over text containing words that were not seen during training (the so-called OOV, or out-of-vocabulary, problem).

For example, the sentence
    
    Is there a minimum balance I need to maintain in my accounts?
    
could be *tokenized* as:
    
    'Is', 'there', 'a', 'minimum', 'balance', 'I', 'need', 'to', 'maintain', 'in', 'my', 'accounts', '?'

How, then, can one systematically convert text into these *tokens*? It turns out this is one of the most critical and underappreciated steps. It differs drastically from language to language and requires a lot of care to ensure consistency. This is one of the reasons why **character-level** or **subword** models can be so useful in applied settings with inconsistent spelling, grammar, and nomenclature.
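As a rough illustration of why character-level and subword units tolerate inconsistent spelling, the sketch below splits text into characters and into character n-grams, one simple form of subword unit. The function names `char_tokenize` and `char_ngrams` are hypothetical helpers written for this example; they are not part of `icmenlp`.

```python
def char_tokenize(text):
    # Character-level tokenization: every character is its own token
    return list(text)

def char_ngrams(word, n=3):
    # Subword units: all character n-grams of a word, with '<' and '>'
    # marking the word boundaries
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

correct = set(char_ngrams('minimum'))
misspelled = set(char_ngrams('minimun'))
shared = correct & misspelled

print(char_tokenize('my accounts?'))
print(sorted(shared))
```

A word-level model treats `minimun` as a completely unknown token, whereas here the misspelling still shares five of its seven trigrams with `minimum`.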

The dominant approach to word-based tokenization uses a [**regular expression**](https://en.wikipedia.org/wiki/Regular_expression). Regular expressions define a formal language for searching through strings for matches to a query. We'll use one here to tokenize the text in this tutorial.

In [0]:
# re is Python's regular expression module
import re

def tokenize(text):
    return [
        # Make sure there is no leading or trailing whitespace
        x.strip()
        # Split on runs of non-word characters (whitespace, punctuation);
        # the capturing group keeps those runs so punctuation is not lost
        for x in re.split(r'(\W+)', text)
        # Drop pieces that are empty or whitespace-only
        if x.strip()
    ]
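As a quick sanity check, the sketch below applies the same tokenizer to the example sentence from earlier, and to a contraction to show one of its limitations. The tokenizer is repeated in compact form so that the cell runs on its own.

```python
import re

def tokenize(text):
    # Same regex-based tokenizer as above, repeated so this cell runs on its own
    return [x.strip() for x in re.split(r'(\W+)', text) if x.strip()]

sentence = 'Is there a minimum balance I need to maintain in my accounts?'
tokens = tokenize(sentence)
print(tokens)
# ['Is', 'there', 'a', 'minimum', 'balance', 'I', 'need', 'to', 'maintain', 'in', 'my', 'accounts', '?']

# Contractions are split at the apostrophe, which may or may not be desired
contraction_tokens = tokenize("I don't know")
print(contraction_tokens)
# ['I', 'don', "'", 't', 'know']
```

The second result is a reminder that a simple regular expression encodes language-specific choices. Whether `don't` should be one token or three depends on the application.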

Let's load the sentiment dataset to try this out!

In [0]:
sentiment_data = icmenlp.load_data(SENTIMENT_DATA, 'sentiment')

# Let's grab a sample training point
text = sentiment_data['train'][0][2]

In [0]:
print(text)

In [0]:
print(tokenize(text))

We will get into more detail about how embedding models work later on, but for now, let's discuss how to convert text into a format that embedding models can use. Embedding models rely on an **integer** representation for each word, since that integer is used as an index into a **lookup table**.
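The lookup idea can be sketched in a few lines of plain Python. The vocabulary, the ids, and the vector values below are all made up for illustration; in a real model the table is a matrix of weights learned during training.

```python
# Hypothetical toy vocabulary and embedding table, for illustration only
token_to_id = {'this': 0, 'product': 1, 'was': 2, 'bad': 3, '!': 4}

# One row per vocabulary entry; each row is that token's 2-dimensional vector.
# A real model would learn these values during training.
embedding_table = [
    [0.1, 0.3],
    [0.7, -0.2],
    [0.0, 0.5],
    [-0.9, 0.4],
    [0.2, 0.2],
]

tokens = ['this', 'was', 'bad']
ids = [token_to_id[t] for t in tokens]        # tokens -> integers
vectors = [embedding_table[i] for i in ids]   # integers -> rows of the table

print(ids)
print(vectors)
```

Each integer simply selects a row, which is why every token needs a unique integer id before it can be fed to an embedding layer.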

Two important considerations when using deep learning models are the length of a piece of text and how to signal the beginning and end of a sentence. Most deep learning models require every sequence in a batch to have the same length. We commonly solve this with **padding**, that is, adding a series of meaningless tokens to bring each document up to a common length. We can then use **masking** to ensure that our model does not incorporate these tokens into the learning procedure.
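A minimal sketch of padding and masking on integer sequences follows. The helper names `pad_batch` and `padding_mask`, the choice of `0` as the pad id, and padding on the right are assumptions made for this example, not a description of any particular library.

```python
PAD_ID = 0  # hypothetical integer id reserved for the pad token

def pad_batch(sequences, pad_id=PAD_ID):
    # Right-pad every sequence of ids to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def padding_mask(padded, pad_id=PAD_ID):
    # 1 marks a real token, 0 marks padding that the model should ignore
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in padded]

batch = [[5, 8, 2, 9], [7], [3, 4]]
padded = pad_batch(batch)
mask = padding_mask(padded)

print(padded)  # [[5, 8, 2, 9], [7, 0, 0, 0], [3, 4, 0, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 0, 0, 0], [1, 1, 0, 0]]
```

Whether padding is added at the start or the end of a sequence varies between tools, so it is worth checking the convention of whichever library you use.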

We'll use `<PAD>` as the pad token, and `<S>`/`</S>` to delimit the beginning of a sentence and the end of a sentence respectively. In addition, any tokens that are unknown to us (for example, a word that is in the test set but not the train set) are mapped to the `<UNK>` token.

To map from tokens to integers, we will define a bijective (two-way) map from words to integers that we will learn from data.

We're going to use the one from our `icmenlp` package.
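Before looking at the package's implementation, here is a rough sketch of the underlying idea. It is not how `icmenlp.VocabularyContainer` is implemented; the function names `build_vocab`, `encode`, and `decode`, and the specific ids assigned to the special tokens, are assumptions made for this example.

```python
SPECIAL_TOKENS = ['<PAD>', '<S>', '</S>', '<UNK>']

def build_vocab(tokenized_docs):
    # Special tokens take the first ids; each new token then gets the next id
    token_to_id = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in token_to_id:
                token_to_id[tok] = len(token_to_id)
    # The reverse map makes the mapping two-way
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token

def encode(tokens, token_to_id):
    # Tokens that were never seen during fitting fall back to <UNK>
    unk = token_to_id['<UNK>']
    return [token_to_id.get(tok, unk) for tok in tokens]

def decode(ids, id_to_token):
    return [id_to_token[i] for i in ids]

# Hypothetical, already-tokenized training documents
train_docs = [['this', 'product', 'was', 'bad', '!'], ['this', 'was', 'great']]
token_to_id, id_to_token = build_vocab(train_docs)

ids = encode(['this', 'product', 'was', 'abhorrent', '!'], token_to_id)
round_trip = decode(ids, id_to_token)

print(ids)         # [4, 5, 6, 3, 8]
print(round_trip)  # ['this', 'product', 'was', '<UNK>', '!']
```

Note that the map is only two-way for tokens seen during fitting: every unseen word collapses to `<UNK>`, so the original word cannot be recovered when decoding.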

In [0]:
# Let's look inside the code of the utility provided
icmenlp.VocabularyContainer??

In [0]:
# Create a vocab collection object with the tokenizer we defined above
vocab = icmenlp.VocabularyContainer(tokenizer=tokenize)

In [0]:
text = sentiment_data['train'][0]

In [0]:
vocab.fit(text)

In [0]:
vocab.transform(['this product was bad!'])

In [0]:
vocab.transform(['this product was bad!', 'meh'], pad_length=10)

In [0]:
vocab.inverse_transform(vocab.transform(['this product was bad!']))

In [0]:
vocab.inverse_transform(vocab.transform(['this product was abhorrent!']))

In [0]:
vocab.inverse_transform(
    vocab.transform(
        ['this product was abhorrent!'], 
        pad_length=9, 
        add_start=True, 
        add_end=True
    )
)

We now know how to preprocess text data for use in deep learning. To summarize:

1.) Each document gets split into tokens, a process called tokenization.

2.) A mapping is learned from the vocabulary to unique integers.

3.) We have four special tokens: the pad token `<PAD>`, the start-of-sentence token `<S>`, the end-of-sentence token `</S>`, and the unknown token `<UNK>`.

4.) We pad sentences with the `<PAD>` token to make them the same size.

Now, let's learn about deep learning for NLP!

# End of Part 1!