# Example: Word-based Language Modeling

In the previous notebook we have seen an example of a character-based language model. Let's improve that, by building a word-based language model.

We still build our language model based on a sequence of $N=10$ words.

In addition, this time we use LSTM instead of SimpleRNN.

In [1]:
N=10

## DATASET

In [2]:
import numpy as np 
import pandas as pd 
import os

### Download

In [3]:
INPUT_FILE = "wonderland.txt"

In [5]:
text_file = open(INPUT_FILE, 'rb')

In [6]:
# List, which will contain all the lines of the book. Each line is stored as a string
lines = []
for line in text_file:
    line = line.strip()
    # We transform all the characters to lowercase
    line = line.lower()
    line = line.decode("ascii", "ignore")
    #line = line.
    # We skip empty lines
    if len(line) == 0:
        continue
    lines.append(line)
#text_file.close()

In [7]:
# Single string, containing the whole text
text = " ".join(lines)

In [8]:
# List of words
words_text = text.split(' ')

In [9]:
words_text[:10]

['project',
 'gutenbergs',
 'alices',
 'adventures',
 'in',
 'wonderland,',
 'by',
 'lewis',
 'carroll',
 'this']

In [10]:
len(words_text)

29697

### Set of all possible words

In [11]:
# Set of all the possible words in our dataset
words = set(words_text)

# Number of all the possible words
n_words = len(words)
n_words

5071

### PREPROCESSING

### Mapping words into integers

In [12]:
# Dictionary, which maps words into the corresponding integers (i.e. indeces) 
word2index = dict((w, i) for i, w in enumerate(words))

# Dictionary, which maps integers/indeces into the corresponding words 
index2word = dict((i, w) for i, w in enumerate(words))

### Inputs and targets

In [13]:
# List which will contain all the possible instances x, which are all the possible sequences of N adjacent words
inputs = []

# List which will contain the targets for the corresponding instances x
targets = []

# We iterate over all the possible words in the text
# Actually, we don't consider the last N characters
for i in range(0, len(words_text)-N):
  # Instance x: it consists in the N consecutive words starting from the index 'i'
  x = words_text[i : i+N]
  inputs.append(x)

  # Target corresponding to the instance x: it is the word after N characters. It is the word rigth after the sequence
  y = words_text[i+N]
  targets.append(y)

In [14]:
inputs[0]

['project',
 'gutenbergs',
 'alices',
 'adventures',
 'in',
 'wonderland,',
 'by',
 'lewis',
 'carroll',
 'this']

In [15]:
targets[0]

'ebook'

In [16]:
# Number of instances x in our dataset, where an instance x is a sequence of N=10 consecutive words
M = len(inputs)
M

29687

### Transforming words into integers
We transform each word into the corresponding integer/index. We do that both in the inputs and in the targets

In [17]:
inputs_integers = [[word2index[w] for w in input] for input in inputs]
inputs_integers[0]

[601, 2294, 1035, 4302, 1009, 1986, 3420, 125, 4875, 920]

In [18]:
targets_integers = [word2index[target] for target in targets]
targets_integers[0]

167

### Next preprocessing step
The next preprocessing step could be to one-hot encode the words, as seen in the last notebook for characters.

However, we follow a better approach: word embeddings.

https://machinelearningmastery.com/what-are-word-embeddings/#:~:text=A%20word%20embedding%20is%20a,challenging%20natural%20language%20processing%20problems.

https://www.tensorflow.org/text/guide/word_embeddings

We map each word into a vector of `embedding_dim` values. Basically, we map the words into an embedding space. The aim of this is to have a vectorial representation of words in which similar words are near to each others.

This mapping into the embedding space is learnt. The embedding is put as a layer into our NN.

In [19]:
embedding_dim = 128

## FIRST MODEL

In [21]:
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Dropout 
from tensorflow.keras import Model

In [22]:
# Input x: sequence of N word, where each word is an integer/index
xin = Input(shape=(N,))

# Embedding: we trasform each word in the sequence to a vector of 'embedding_dim' values.
# This layer has parameters which must be learnt
x = Embedding(n_words, embedding_dim)(xin)

h_outputs = LSTM(units=256, return_sequences=True)(x)
h_outputs = Dropout(0.2)(h_outputs)

# LSTM: it takes in input a sequence of N words, where each word is a vector of 'embedding_dim' values.
# We keep only the last output h_N, which is a vector with 128 values
last_h = LSTM(units=256)(h_outputs)
last_h = Dropout(0.2)(last_h)

# Dense layer: it takes in input h_n, and it produces y_hat, which is the categorical distribution over all the possible words, represented as integers
y_hat = Dense(units=n_words, activation='softmax')(last_h)

model = Model(inputs=xin, outputs=y_hat)

In [23]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding (Embedding)       (None, 10, 128)           649088    
                                                                 
 lstm (LSTM)                 (None, 10, 256)           394240    
                                                                 
 dropout (Dropout)           (None, 10, 256)           0         
                                                                 
 lstm_1 (LSTM)               (None, 256)               525312    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense (Dense)               (None, 5071)              130324

As it can be seen, the Embedding layer has many parameters which must be learnt

### Compile

We use **sparse categorical crossentropy** as loss function, since our target data `target_integers` contain words represented as integers. The words are represented with the labels, and not with the true categorical distrbution: therefore, we use sparse categorical crossentropy and not simply categorical crossentropy.

In [24]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")

### Train

In [25]:
 model.fit(inputs_integers, targets_integers, batch_size=128, epochs=100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7f46c6b65b90>

### Generating text

In [26]:
def generate_text(l, K):
  """ Generated K new charactes after the given string list of words `l`. List of N words. """

  for i in range(K):
    # We transform the list of words 'l' into the list of corresponding integers/indeces.
    # Basically, we transform each word w into the corresponding integer/index.
    x = [word2index[w] for w in l]
    # We add the batch dimension into x, since our NN processes only batches. The shape of x_batch is 1*N
    x_batch = np.expand_dims(x, 0)

    # We apply the NN, and we get the predicted categorical distribution for the next character. Actually, we get a batch
    y_hat_batch = model.predict(x_batch)
    # We extract the categorical distribution from the batch
    y_hat = y_hat_batch[0]

    # Predicted word, corresponding to the word with higher probabiliry
    w_hat = index2word[np.argmax(y_hat)]
    # We print that
    print(w_hat, end=" ")

    # We update our list of N words, by removing the first one, and by appending the new one
    l.append(w_hat)
    l = l[1:]

In [27]:
l = ['alice', 'said', 'she', 'wanted', 'to', 'go', 'outside', 'the', 'place', 'where']
generate_text(l, 20)

the gryphon very sitting and and an right on his house who had not another right as he used to 

In [28]:
l = ['nor', 'did', 'alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of']
generate_text(l, 10)

the way to hear the rabbit say to talk oh 

In [29]:
l = ['alice', 'saw', 'the', 'rabbit', 'and', 'the', 'mad', 'hatter', 'and', 'thought']
generate_text(l, 10)

the rabbit, of her time, what did something get to 

## VECTORIZATION

https://keras.io/examples/nlp/pretrained_word_embeddings/

https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization

We did by hand the process of mapping words into integers. This could have been done automatically by a keras layer: `TextVectorization`.

In [30]:
from tensorflow.keras.layers import TextVectorization

In [31]:
# We define our vectorizer, such that it finds the words by splitting using the whitespace. It performs also lower-case transformation and strip of the punctuation
vectorizer = TextVectorization(max_tokens=n_words, standardize='lower_and_strip_punctuation', split='whitespace')

In [32]:
# Compute the voucaboulary on our dataset, which is a list of words
vectorizer.adapt(words_text)

In [33]:
print(len(words_text))
print(words_text[:5])

29697
['project', 'gutenbergs', 'alices', 'adventures', 'in']


In [34]:
# Compute the integers/indeces corresponding to our words
print(vectorizer(words_text).shape)
print(vectorizer(words_text)[:5])

(29697, 1)
tf.Tensor(
[[  48]
 [1517]
 [ 243]
 [ 370]
 [  11]], shape=(5, 1), dtype=int64)


In [35]:
# Get the vocabulary. It is a list of words, ordered according to their integer/index
voc = vectorizer.get_vocabulary()
voc[:5]

['', '[UNK]', 'the', 'and', 'to']

In [36]:
# Build the dictionary for the mapping word -> index
word2index = dict(zip(voc, range(len(voc))))

# Build the dictionary for the mapping index -> word
index2word = dict(zip(range(len(voc)), voc))

In [37]:
# The size of our new vocubulary is smaller than the number of all possible words computed before. Because now we have removed the punctuation
print(n_words)
print(len(voc))

5071
3255


In [38]:
index2word[word2index['alice.']]

KeyError: ignored

In [39]:
print(vectorizer(['alice.']))
print(vectorizer(['alice']))

tf.Tensor([[13]], shape=(1, 1), dtype=int64)
tf.Tensor([[13]], shape=(1, 1), dtype=int64)


This is the actual size of our new vocubulary. This is the actual number of possible different words, instead of the old `n_words`.

In [40]:
n_words = len(voc)+2

### Transform the dataset using the vectorizer
We transform `inputs` and `targets`, which contain words represented as strings, to `inputs_integers` and `targets_integers`, which contain words represented as integers/indeces. 

In [41]:
inputs_integers = vectorizer(np.array([s for s in inputs]).reshape((M,N,1))).numpy().reshape((M,N))
targets_integers = vectorizer(np.array([s for s in targets])).numpy().reshape((M,))

In [42]:
inputs_integers.shape

(29687, 10)

In [43]:
inputs_integers[0]

array([  48, 1517,  243,  370,   11,  448,   60,  848,  913,   22])

In [44]:
targets_integers.shape

(29687,)

In [45]:
targets_integers[0]

437

## PRE-TRAINED EMBEDDING
Before we have trained our embedding. We can also import and use a pre-trained embedding. We use the GloVe pre-trained embedding. We use $100$-dimensional embeddings (i.e. $100$ values).

https://keras.io/examples/nlp/pretrained_word_embeddings/

In [46]:
embedding_dim = 100

In [47]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2022-07-02 12:54:51--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-07-02 12:54:51--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-07-02 12:54:51--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [48]:
path_to_glove_file = 'glove.6B.100d.txt'

We load the embedding. We load the mapping from words to embeddings. Map from real words to vectors.

In [49]:
word2embedding = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, embedding = line.split(maxsplit=1)
        embedding = np.fromstring(embedding, "f", sep=" ")
        word2embedding[word] = embedding

print("Found %s word vectors." % len(word2embedding))

Found 400000 word vectors.


Now we build the mapping from our integers/indeces to embeddings. With our integers/indeces we mean the ones built before using the vectorizer.

In [50]:
# We count the number of words in our (vectorized) dataset which are not found in the imported embedding
hits = 0
misses = 0

index2embedding = {}
for word, i in word2index.items():
    embedding_vector = word2embedding.get(word)
    if embedding_vector is not None:
        # Words not found in word2embedding will be all-zeros.
        word2embedding[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 2970 words (285 misses)


Now, for using our pre-trained embedding, we feed it into an `Embedding` layer. Basically, we give to the `Embedding` layer the `index2embedding` dictionary. In this way, the pre-trained embedding will trasform words represented as integers/indeces to words represented as embedding.

Actually, the `Embedding` layer does not accept a `dict` datatype: he want a numpy array. Let's transform the `index2embedding` dictionary into a numpy array: numpy matrix which containes, in the row $i$, the embedding for that integer/index. 

In [51]:
embedding_matrix = np.zeros((n_words, embedding_dim))
for i, embedding_vector in index2embedding.items():
    embedding_matrix[i] = embedding_vector

## MODEL
Let's now define again our model. We use the vectorizer and the pre-trained Embedding.

In [52]:
import tensorflow.keras as ks

In [58]:
# Input x: sequence of N word, where each word is an index/integer
xin = Input(shape=(N,))

# Embedding: we put as embedding layer our pre-trained embedding.
# It transforms a word represented as an index to an embedding vector, with 'embedding_dim' values
embedding_layer = Embedding(
    n_words,
    embedding_dim,
    embeddings_initializer=ks.initializers.Constant(embedding_matrix),
    trainable=False,  # WE SET IT AS NON-TRAINABLE
)
x = embedding_layer(xin)

h_outputs = LSTM(units=256,  return_sequences=True)(x)
h_outputs = Dropout(0.2)(h_outputs)

# LSTM: it takes in input a sequence of N words, where each word is a vector of 'embedding_dim' values.
# We keep only the last output h_N, which is a vector with 128 values
last_h = LSTM(units=256)(h_outputs)
last_h = Dropout(0.2)(last_h)

# Dense layer: it takes in input h_n, and it produces y_hat, which is the categorical distribution over all the possible words, represented as integers
y_hat = Dense(units=n_words)(last_h) #activation='softmax')(last_h)

model = Model(inputs=xin, outputs=y_hat)

In [59]:
model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding_2 (Embedding)     (None, 10, 100)           325700    
                                                                 
 lstm_4 (LSTM)               (None, 10, 256)           365568    
                                                                 
 dropout_4 (Dropout)         (None, 10, 256)           0         
                                                                 
 lstm_5 (LSTM)               (None, 256)               525312    
                                                                 
 dropout_5 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 3257)              8370

As it can be seen, the Embedding layer has all the parameters which are non-trainable.

### Compile

We use **sparse categorical crossentropy** as loss function, since our target data `target_integers` contain words represented as integers. The words are represented with the labels, and not with the true categorical distrbution: therefore, we use sparse categorical crossentropy and not simply categorical crossentropy.

In [55]:
from tensorflow.keras.optimizers import Adam 
from tensorflow.keras.losses import SparseCategoricalCrossentropy

In [61]:
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True), optimizer=Adam())

In [None]:
# DELETE
targets_oneHotEncoded = np.zeros((M,n_words))
for i,word_idx in enumerate(targets_integers):
  targets_oneHotEncoded[i, word_idx] = 1.0

In [None]:
model.compile(loss="categorical_crossentropy", optimizer="adam")

### Train

In [62]:
 model.fit(inputs_integers, targets_integers, batch_size=128, epochs=100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7f46bd1a4e50>

### Generating text

In [63]:
l = ['alice', 'said', 'she', 'wanted', 'to', 'go', 'outside', 'the', 'place', 'where']
generate_text(l, 20)

the the the the the the the the the the the the the the the the the the the the 

In [64]:
l = ['nor', 'did', 'alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of']
generate_text(l, 10)

the the the the the the the the the the 

In [65]:
l = ['alice', 'saw', 'the', 'rabbit', 'and', 'the', 'mad', 'hatter', 'and', 'thought']
generate_text(l, 10)

the the the the the the the the the the 