# Natural Language Processing (Part 2)

Many of the examples and pieces of code are taken from the "Deep Learning with Python" book by Francois Chollet. We've recommended it before, but it is a well-done book, so let us recommend it again!

In [None]:
!pip install gensim

In [None]:
import gensim.downloader as api
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.decomposition import PCA
from tensorflow.keras import preprocessing
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.utils import to_categorical

np.set_printoptions(precision=2, suppress=True, linewidth=140)
%matplotlib inline

## Text data can be challenging to work with

Text data introduces a variety of unique challenges, some of which we've already discussed in previous lectures. We summarize some of these challenges below:

* Textual data is inherently _not_ numeric while the methods we have learned require numeric data
* The same word can be used differently in different contexts, e.g. "The new Apple product is great." vs "This apple is delicious"
* Text data is high dimensional: depending on how words are counted, a language's vocabulary can run from roughly 100,000 to 1,000,000 words
* Text data is often not structured in the same way that we're used to. How should we account for document structure like sentences, paragraphs, and quotes?

## Review:

Let's quickly review some of the concepts from our previous lecture on NLP.

Our main goal was to "tokenize" the text that we were working with. Tokenization is the process by which we convert the textual data into a numeric representation. In our last notebook, we did this in two steps:

* Preprocessing
* Bag of words

### Preprocessing

In order to process textual data, we needed a way to convert it into numerical data. The first step was to preprocess the data by separating the words and normalizing them via a few steps. For example,

> The quick brown fox jumped over the lazy dog.

became

```python
processed_text = [
    "the", "quick", "brown", "fox",
    "jump", "over", "the", "lazy", "dog"
]
```
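A minimal sketch of this kind of preprocessing, assuming the steps are lowercasing, stripping punctuation, splitting on whitespace, and mapping words to a base form. `preprocess` and `TOY_LEMMAS` are hypothetical names introduced for illustration, and the hand-written lemma table stands in for a real lemmatizer.

```python
import string

# Hypothetical, hand-written lemma table; a real pipeline would use a lemmatizer
TOY_LEMMAS = {"jumped": "jump", "jumps": "jump"}


def preprocess(text):
    """Lowercase, drop punctuation, split on whitespace, and map words to lemmas."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [TOY_LEMMAS.get(word, word) for word in no_punct.split()]


processed_text = preprocess("The quick brown fox jumped over the lazy dog.")
print(processed_text)
```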

### Bag of words

Once we had processed the text, we used two versions of the _bag of words_ algorithm. In a bag of words style algorithm for converting text to numbers, we ignore the order of the words and simply consider an $N$ element vector where $N$ is the "size of our vocabulary". The elements of this vector are assigned based on the algorithm used.

* Binary presence: Each word in the vocabulary becomes an element of the vector and receives a 1 if the word appears at all in the text and a 0 if it does not
* Count / frequency: Each element of the vector holds either the number of times that a particular word appears or the frequency with which it appears
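To make these vectors concrete, here is a small NumPy sketch that builds the binary, count, and frequency versions by hand. The two toy documents and the alphabetical column order are assumptions made for illustration; they are not what the Keras `Tokenizer` produces.

```python
import numpy as np

docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "fox", "finds", "the", "fox"],
]

# Vocabulary: every distinct word, sorted so that the column order is reproducible
vocab = sorted({word for doc in docs for word in doc})
col = {word: j for j, word in enumerate(vocab)}

# Count how often each vocabulary word appears in each document
counts = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc:
        counts[i, col[word]] += 1

binary = (counts > 0).astype(int)                   # 1 if the word appears at all
freq = counts / counts.sum(axis=1, keepdims=True)   # share of the document's words

print(vocab)
print(binary)
print(counts)
print(freq)
```

Each row is one document and each column is one vocabulary word, so the vectors have $N$ elements regardless of document length.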

**Shortcomings**

The main shortcoming of this style of algorithm is that it ignores order! Ignoring order makes it difficult to distinguish how a word like "apple" is used in different contexts, as in the example above.
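A quick way to see the problem is to compare two sentences that use the same words in a different order. The sentences below are made up for illustration.

```python
from collections import Counter

a = "the dog bit the man".split()
b = "the man bit the dog".split()

# A bag of words only records how often each word occurs
same_bag = Counter(a) == Counter(b)
print(same_bag)   # True: the two sentences are indistinguishable
```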

Keras makes both of these easy, and we should use its functionality whenever possible!

In [None]:
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "The quick red fox jumps over the sleepy dog",
    "The quick brown fox finds the destructive groundhogs"
]

In [None]:
tokenizer = Tokenizer(num_words=14)

tokenizer.fit_on_texts(sentences)

tokenizer.word_index

In [None]:
tokenizer.texts_to_matrix(sentences, mode="binary")

In [None]:
tokenizer.texts_to_matrix(sentences, mode="count")

In [None]:
tokenizer.texts_to_matrix(sentences, mode="freq")

## Advanced tokenization

We discuss two additional methods that can be used to tokenize our data:

* one-hot encoding
* word embeddings

### One-hot encoding

One-hot encoding of text uses the same idea as one-hot encoding of categorical data. We consider $N$ unique possible words and then represent each word in the sentence by a vector that has a single 1 and 0s everywhere else.

For example,

> The quick brown fox jumped over the lazy dog

would be expressed by:

```python
tokenized_sentence = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0],  # The
    [0, 1, 0, 0, 0, 0, 0, 0],  # quick
    [0, 0, 1, 0, 0, 0, 0, 0],  # brown
    [0, 0, 0, 1, 0, 0, 0, 0],  # fox
    [0, 0, 0, 0, 1, 0, 0, 0],  # jump
    [0, 0, 0, 0, 0, 1, 0, 0],  # over
    [1, 0, 0, 0, 0, 0, 0, 0],  # the
    [0, 0, 0, 0, 0, 0, 1, 0],  # lazy
    [0, 0, 0, 0, 0, 0, 0, 1],  # dog
])
```

Make note that the columns could be in a different order. (Why?)
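A matrix like this can be built by hand in a few lines. The sketch below assigns columns in order of first appearance, which is one arbitrary choice among many; `one_hot_rows` is a name chosen to avoid clashing with the Keras `one_hot` function.

```python
import numpy as np

words = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog"]

# Assign each distinct word a column in order of first appearance
column = {}
for word in words:
    column.setdefault(word, len(column))

ids = np.array([column[word] for word in words])
one_hot_rows = np.eye(len(column), dtype=int)[ids]   # row i is the one-hot vector of word i

print(one_hot_rows.shape)   # (9, 8): nine words, eight distinct words
```

Both occurrences of "the" share column 0, so rows 0 and 6 are identical.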

**Implementing one-hot encoding**

While it would be relatively straightforward to write our own version of one-hot encoding, `keras` takes care of various details for us automatically and we should leverage these tools.

In [None]:
sentence = "The quick brown fox jumped over the lazy dog"

In [None]:
one_hot(sentence, 9)

In [None]:
to_categorical(
    one_hot(sentence, n=9)
)

Wait... This doesn't look right.

Under the hood, keras is actually using one-hot hashing: each word is hashed to an integer index rather than looked up in a dictionary. If the size `n` that you pass is "too small" relative to the number of distinct words, then you can wind up with "hash collisions", which make the algorithm think that two distinct words are the same... We found this on [StackOverflow](https://stackoverflow.com/questions/66507613/confused-by-output-of-keras-text-preprocessing-one-hot).
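The collision problem can be reproduced without keras. The sketch below is an illustration of the hashing trick under two assumptions: indices fall in the range 1 to `n - 1`, and md5 is used as the hash so that results are reproducible across runs. `hashed_index` is a hypothetical helper, not a keras function.

```python
import hashlib


def hashed_index(word, n):
    """Map a word to an integer in 1..n-1 using a stable hash (md5 is an assumption)."""
    digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return digest % (n - 1) + 1


words = "the quick brown fox jumped over the lazy dog".split()
unique_words = set(words)

for n in (5, 1000):
    indices = {word: hashed_index(word, n) for word in unique_words}
    n_distinct = len(set(indices.values()))
    print(f"n={n}: {len(unique_words)} unique words -> {n_distinct} distinct indices")
```

With `n=5` there are only four possible indices for eight distinct words, so collisions are unavoidable. A larger `n` makes collisions unlikely but never rules them out.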

Let's try with a larger vocabulary.

In [None]:
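# Choose n to be very large so that collisions are unlikely
word_nums = one_hot(sentence, n=1000)
# Give each distinct hashed value its own column, in order of first appearance
# (the repeated word "the" must reuse its column rather than claim a new one)
my_encoding = {
    num: col for col, num in enumerate(dict.fromkeys(word_nums))
}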

ohe = to_categorical(
    np.array([my_encoding[x] for x in word_nums])
)

In [None]:
my_encoding

In [None]:
word_nums

In [None]:
ohe

**Sparsity of one-hot encoding representations**

The output of the one-hot encoding tokenization is a very large (but very sparse!) representation of the text. Each row contains exactly one non-zero element.
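To make the sparsity concrete, here is a small sketch. The sizes (a 20-word review and a 10,000-word vocabulary) are illustrative assumptions, and the word indices are random stand-ins for a tokenized review.

```python
import numpy as np

n_words, vocab_size = 20, 10_000   # illustrative sizes (assumptions)

# Random word indices stand in for a tokenized review
ids = np.random.default_rng(0).integers(0, vocab_size, size=n_words)

one_hot_review = np.zeros((n_words, vocab_size), dtype=np.float32)
one_hot_review[np.arange(n_words), ids] = 1.0   # exactly one 1 per row

nonzero = np.count_nonzero(one_hot_review)
print(one_hot_review.shape)            # (20, 10000)
print(nonzero, one_hot_review.size)    # 20 200000
```

Only 20 of the 200,000 stored entries carry any information.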

This is a relatively inefficient way to store the data... Is there a way to lower the dimensionality?

Yes! By using something called "word embeddings"

### Word embeddings

Word embeddings are an alternative to one-hot encoding representations of text. Whereas one-hot encoding is exceptionally high dimensional and sparse, word embeddings aim to be lower dimensional but dense.

One can obtain a word embedding in one of two ways:

1. Use a prepackaged word embedding model trained by someone else (something akin to transfer learning)
2. Learn a word embedding in conjunction with the main task that you're attempting to complete

**Desired properties of an embedding**

1. Similar words produce similar outputs. "jog" and "run" should map to nearby vectors, as should "frog" and "toad"
2. Linear substructures. The canonical example is `"king" - "man" + "woman" ≈ "queen"`
3. Sufficiently low dimensional to be useful
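The first two properties can be illustrated with a toy example. The 2-D vectors below are hypothetical and hand-picked (the axes loosely stand for "royalty" and "gender"); real embeddings are learned and have many more dimensions. `cosine` and `analogy` are helper names introduced for this sketch.

```python
import numpy as np

# Hypothetical 2-D vectors chosen by hand; real embeddings are learned from data
toy = {
    "man": np.array([0.1, 1.0]),
    "woman": np.array([0.1, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "prince": np.array([0.8, 1.0]),
    "princess": np.array([0.8, -1.0]),
}


def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = toy[a] - toy[b] + toy[c]
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy[w], target))


print(cosine(toy["king"], toy["prince"]))   # close to 1: similar words, similar vectors
print(analogy("king", "man", "woman"))      # queen
```

Excluding the input words from the candidates is a common convention, since the nearest vector to `a - b + c` is often one of the inputs themselves.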

**Common word embedding models**

There are a few word embedding models that are commonly used:

* Google's `Word2Vec` model trained on their Google News dataset which had ~100 billion words
* GloVe is a model trained by researchers at Stanford using a different methodology than `Word2Vec`

We will use the Google News `word2vec` model to illustrate word embeddings.

In [None]:
# Big file! Will take a bit to download/load
w2v = api.load('word2vec-google-news-300')


In [None]:
w2v.get_vector("queen").shape  # 300 dimensional

In [None]:
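def display_pca_scatterplot(model, words):
    """Project the word vectors onto their first two principal components and plot them."""
    word_vectors = np.array([model[w] for w in words])

    # Keep only the two directions that explain the most variance
    twodim = PCA(n_components=2).fit_transform(word_vectors)

    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    # Label each point with its word, offset slightly so the text does not cover the marker
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)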

In [None]:
display_pca_scatterplot(
    w2v,
    [
        "man", "woman",
        "king", "queen",
        "prince", "princess"
    ]
)

In [None]:
display_pca_scatterplot(
    w2v,
    [
        "walk", "jog", "run",
        "frog", "toad",
    ]
)

In [None]:
display_pca_scatterplot(
    w2v,
    [
        "tall", "taller", "tallest",
        "long", "longer", "longest"
    ]
)

### Training our own embedding

We can also learn our own embeddings based on a specific task.

In this example, we'll use IMDB movie review text to classify reviews as either positive or negative. We will use preprocessed data that keras provides for us.

* The `x` data contains an array of lists of integers. Each element of the array represents a single review, and its list of integers encodes the words used in that review. For example, if `"test" -> 1` and `"case" -> 2`, then `np.array([[1, 2, 1], [2, 1, 2]])` would represent two reviews with the words `"test case test"` and `"case test case"`.
* The `y` data contains an array of 0s and 1s. If element $i$ of this array is 1 (0) then review $i$ was positive (negative).

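When we load the data, we specify the number of words that we would like to keep (the $n$ most commonly used words), and we then keep only $m$ words from each review. Note that `pad_sequences` truncates from the front by default (`truncating="pre"`), so these are the last $m$ words of each review; shorter reviews are padded with zeros on the left.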
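With its default arguments (`padding="pre"`, `truncating="pre"`), `pad_sequences` drops items from the start of long sequences and pads short ones with zeros on the left. A small pure-Python sketch that mimics that default behaviour, where `pad_or_truncate` is a hypothetical helper and not part of keras:

```python
import numpy as np


def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the default "pre" behaviour: keep the last maxlen items, pad on the left."""
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept


reviews = [[5, 6, 7, 8, 9, 10], [3, 4]]
padded = np.array([pad_or_truncate(r, maxlen=4) for r in reviews])
print(padded)
# [[ 7  8  9 10]
#  [ 0  0  3  4]]
```

Every review ends up with the same length, which is what lets us stack them into a single 2-D array.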

In [None]:
nvocab = 10_000
nkeep = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=nvocab
)

x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=nkeep)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=nkeep)

**`Embedding` layer**

The `Embedding` layer takes an input of integers of shape `(samples, sequence_length)` and converts them into vectors of floats with size `(samples, sequence_length, embedding_dimensionality)`.

It does this by mapping each unique integer to its own floating point vector. For example, suppose the word `good` is represented by the integer `1` and mapped to `[0.1, 0.2, 0.3]`, and the word `bad` is represented by the integer `2` and mapped to `[-0.1, -0.2, -0.3]`. Then a sample of two reviews that said `"good good good"` and `"bad bad bad"` would be represented by

```python
np.array([[1, 1, 1], [2, 2, 2]])
```

and would be assigned an embedding of

```python
np.array([
    [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3], [0.1, 0.2, 0.3]],
    [[-0.1, -0.2, -0.3], [-0.1, -0.2, -0.3], [-0.1, -0.2, -0.3]]
])
```

Training the embedding layer attempts to find these vectors.
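Conceptually, the lookup is just array indexing into a table with one row per integer. The NumPy sketch below uses the `good`/`bad` vectors from the example; the table and its unused row 0 are assumptions for illustration, and this is not how keras implements the layer internally.

```python
import numpy as np

# Hypothetical lookup table: row 0 is unused, row 1 is "good", row 2 is "bad"
table = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [-0.1, -0.2, -0.3],
])

reviews = np.array([[1, 1, 1], [2, 2, 2]])   # shape (samples, sequence_length)
embedded = table[reviews]                    # shape (samples, sequence_length, embedding_dim)

# The same result via one-hot vectors times the table; the lookup just skips the multiplication
one_hot_reviews = np.eye(len(table))[reviews]
assert np.allclose(one_hot_reviews @ table, embedded)

print(embedded.shape)   # (2, 3, 3)
```

In a trained model, the rows of `table` are the weights that gradient descent adjusts.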

Let's train an embedding for our IMDB data.

In [None]:
nvocab

In [None]:
nkeep

In [None]:
embedding_model = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 32, input_length=nkeep),
        # Converts from 3D to 2D of shape (samples, nkeep*32)
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

embedding_model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

In [None]:
embedding_model.summary()

In [None]:
history = embedding_model.fit(
    x_train, y_train,
    epochs=10, batch_size=64,
    validation_data=(x_test, y_test)
)

In [None]:
def make_acc_loss_plot(history):

    epoch = history.epoch

    fig, ax = plt.subplots(2, 1, figsize=(10, 8))

    # Accuracy
    ax[0].plot(epoch, history.history["acc"], linestyle="-.", label="Training")
    ax[0].plot(epoch, history.history["val_acc"], linestyle="-", label="Validation")
    ax[0].legend()
    ax[0].set_title("Model accuracy")

    # Loss
    ax[1].plot(epoch, history.history["loss"], linestyle="-.")
    ax[1].plot(epoch, history.history["val_loss"])
    ax[1].set_title("Loss")

    fig.tight_layout()

    return fig

make_acc_loss_plot(history);

## Can we do better?

The dense layer we used in training our own embedding seems to have performed relatively well, but the dense layer only observes each word as a separate entity and ignores the fact that combinations of words might mean something...

Do we know any methods that allow us to analyze data sequentially?

**Recurrent neural networks strike again!**

The other main application of recurrent neural networks is text analysis because they use "memory" to understand the context of certain sentences.

In our example, the sentence "This movie is the bomb" means something very different from the sentence "This movie is a bomb"...
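The "memory" is a hidden state that is updated one word at a time. Below is a minimal NumPy sketch of the simple recurrence $h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$. The weights are random stand-ins for trained ones and the input is random noise standing in for 20 embedded words, so the numbers themselves carry no meaning. `simple_rnn_forward` is a hypothetical helper, not the keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 32, 8, 20

# Randomly initialised weights stand in for trained ones (an assumption of this sketch)
W_x = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)


def simple_rnn_forward(sequence):
    """Apply h_t = tanh(x_t W_x + h_{t-1} W_h + b) word by word; return the last state."""
    h = np.zeros(hidden_dim)
    for x_t in sequence:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h


review = rng.normal(size=(seq_len, embed_dim))   # stand-in for 20 embedded words
h_forward = simple_rnn_forward(review)
h_reversed = simple_rnn_forward(review[::-1])

print(h_forward.shape)                      # (8,)
print(np.allclose(h_forward, h_reversed))   # the same words in reverse order give a different state
```

Unlike a bag of words, the final state depends on the order in which the words arrive.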

_Simple RNN_

In [None]:
simple_rnn = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 32, input_length=nkeep),
        tf.keras.layers.SimpleRNN(8, return_sequences=False),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

simple_rnn.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

simple_rnn.summary()

In [None]:
simple_rnn_history = simple_rnn.fit(
    x_train, y_train,
    epochs=10, batch_size=64,
    validation_data=(x_test, y_test)
)

In [None]:
make_acc_loss_plot(simple_rnn_history);

_LSTM RNN_

In [None]:
lstm_rnn = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 32, input_length=nkeep),
        tf.keras.layers.LSTM(8, return_sequences=False),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

lstm_rnn.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

lstm_rnn.summary()

In [None]:
lstm_rnn_history = lstm_rnn.fit(
    x_train, y_train,
    epochs=10, batch_size=64,
    validation_data=(x_test, y_test)
)

In [None]:
make_acc_loss_plot(lstm_rnn_history);

_GRU RNN_

In [None]:
gru_rnn = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 32, input_length=nkeep),
        tf.keras.layers.GRU(8, return_sequences=False),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

gru_rnn.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

gru_rnn.summary()

In [None]:
gru_rnn_history = gru_rnn.fit(
    x_train, y_train,
    epochs=10, batch_size=64,
    validation_data=(x_test, y_test)
)

In [None]:
make_acc_loss_plot(gru_rnn_history);

### Challenge: Train a better model

I'm going to restrict myself to about 10 minutes to train a better model for sentiment analysis using an RNN. I'll post the output of each of my models below and we can talk about why I tried some of the things that I tried.

In [None]:
def test_a_model(model):
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
    # summary() prints the table itself and returns None, so no print() is needed
    model.summary()
    history = model.fit(
        x_train, y_train,
        epochs=10, batch_size=64,
        validation_data=(x_test, y_test)
    )

    make_acc_loss_plot(history)

    return model

In [None]:
model_1 = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 10, input_length=nkeep),
        tf.keras.layers.GRU(4, return_sequences=False),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

test_a_model(model_1)

In [None]:
model_2 = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 10, input_length=nkeep),
        tf.keras.layers.GRU(8, dropout=0.1, return_sequences=False),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

test_a_model(model_2)

In [None]:
model_3 = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 10, input_length=nkeep),
        tf.keras.layers.GRU(8, return_sequences=True),
        tf.keras.layers.GRU(2, dropout=0.1, return_sequences=False),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

test_a_model(model_3)

In [None]:
model_4 = tf.keras.Sequential(
    [
        # Embedding layer
        tf.keras.layers.Embedding(nvocab, 20, input_length=nkeep),
        tf.keras.layers.GRU(8, return_sequences=True),
        tf.keras.layers.GRU(2, dropout=0.1, return_sequences=False),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

test_a_model(model_4)

In [None]:
model_5 = tf.keras.Sequential(
    [
        # Embedding layer
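        tf.keras.layers.Embedding(nvocab, 10, input_length=nkeep),
        # Flatten to (samples, nkeep*10) so that the model makes one prediction per review
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(8, activation="relu"),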
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

test_a_model(model_5)

## References

* [Gensim documentation](https://radimrehurek.com/gensim/index.html)
* [GloVe documentation](https://nlp.stanford.edu/projects/glove/)