# Week 4: Predicting the next word

Welcome to this assignment! This week you saw how to create a model that predicts the next word in a text sequence. Now you will implement such a model and train it on a text corpus, while also writing some helper functions to pre-process the data. The original assignment uses a corpus of Shakespeare's sonnets; this notebook loads José Hernández's *La vuelta de Martín Fierro* instead.


Let's get started!

_**NOTE:** To prevent errors from the autograder, please avoid editing or deleting non-graded cells in this notebook. Please only put your solutions between the `### START CODE HERE` and `### END CODE HERE` code comments, and refrain from adding any new cells._

In [None]:
!python3.8 -m pip install ipykernel
!python3.8 -m ipykernel install --user --name python38 --display-name "Python 3.8"


In [None]:
import sys
print(sys.version)


In [2]:
# grader-required-cell

import numpy as np 
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical 
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

In [6]:
# grader-required-cell
! conda install -y gdown
!gdown --id 1AOpA6ogBNWuDE8ocYfRiKFJs0GX8SUzx


In [7]:
# grader-required-cell

# Define path for file with sonnets
MARTIN_F_FILE = './hernandez_jose_-_la_vuelta_de_martin_fierro.txt'

# Read the data
with open(MARTIN_F_FILE) as f:
    data = f.read()

lines = data.split("\n")
non_empty_lines = [line for line in lines if line.strip()]
data = "\n".join(non_empty_lines)

# Convert to lower case and save as a list
corpus = data.lower().split("\n")

print(f"There are {len(corpus)} lines of sonnets\n")
print(f"The first 5 lines look like this:\n")
for i in range(5):
  print(corpus[i])

There are 2049 lines of sonnets

The first 5 lines look like this:

i
atención pido al silencio y silencio a la atención, que voy en esta ocasión, si me ayuda la memoria,
a mostrarles que a mi historia le faltaba lo mejor.
viene uno como dormido cuando vuelve del desierto; veré si a esplicarme acierto entre gente tan bizarra,
y si al sentir la guitarra de mi sueño me dispierto.


## Tokenizing the text


In [8]:
# grader-required-cell

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

When converting the text into sequences you can use the `texts_to_sequences` method as you have done throughout this course.

In the next graded function you will need to process this corpus one line at a time. Given this, keep in mind that the way you feed the data into this method affects the result. The following example makes this clearer.

The corpus entry at index 1 is a string and looks like this:

In [4]:
# grader-required-cell

corpus[1]

'atención pido al silencio y silencio a la atención, que voy en esta ocasión, si me ayuda la memoria,'

If you pass this text directly into the `texts_to_sequences` method you will get an unexpected result:

In [None]:
# grader-required-cell

tokenizer.texts_to_sequences(corpus[1])

This happened because `texts_to_sequences` expects a list and you are providing a string. However, a string is still an iterable in Python, so you will get the word index of every character in the string.

Instead, you need to place the example within a list before passing it to the method:

In [12]:
# grader-required-cell

tokenizer.texts_to_sequences([corpus[1]])

[[290,
  581,
  16,
  218,
  3,
  218,
  6,
  5,
  290,
  1,
  123,
  7,
  85,
  131,
  24,
  9,
  782,
  5,
  291]]

Notice that you received the sequence wrapped inside a list, so to get only the desired sequence you need to explicitly take the first item of the list, like this:

In [13]:
# grader-required-cell

tokenizer.texts_to_sequences([corpus[1]])[0]

[290, 581, 16, 218, 3, 218, 6, 5, 290, 1, 123, 7, 85, 131, 24, 9, 782, 5, 291]
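
The string-versus-list behaviour described above can be reproduced without Keras. The sketch below uses a hypothetical `toy_texts_to_sequences` helper and a made-up `word_index`; it only mimics the "iterate over the input, then look up each word" behaviour and is not the real `Tokenizer` implementation.

```python
# Toy stand-in for Tokenizer.texts_to_sequences (hypothetical helper, not the Keras code).
# It iterates over `texts`, splits each item on whitespace and looks up every piece.
word_index = {"a": 1, "la": 2, "y": 3, "si": 4}  # made-up vocabulary


def toy_texts_to_sequences(texts, word_index):
    sequences = []
    for text in texts:  # iterating a bare string yields one character at a time
        tokens = text.lower().split()
        sequences.append([word_index[t] for t in tokens if t in word_index])
    return sequences


line = "si a la y"

as_string = toy_texts_to_sequences(line, word_index)   # one entry per character
as_list = toy_texts_to_sequences([line], word_index)   # one entry per line

print(len(as_string))  # 9, the number of characters in `line`
print(as_list)         # [[4, 1, 2, 3]]
print(as_list[0])      # [4, 1, 2, 3]
```

Passing the bare string produces one (mostly empty) sequence per character, while wrapping it in a list produces a single sequence for the whole line.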

## Generating n_grams

Now complete the `n_gram_seqs` function below. This function receives the fitted tokenizer and the corpus (which is a list of strings) and should return a list containing the `n_gram` sequences for each line in the corpus:

In [9]:
# grader-required-cell

# GRADED FUNCTION: n_gram_seqs
def n_gram_seqs(corpus, tokenizer):
    """
    Generates a list of n-gram sequences
    
    Args:
        corpus (list of string): lines of texts to generate n-grams for
        tokenizer (object): an instance of the Tokenizer class containing the word-index dictionary
    
    Returns:
        input_sequences (list of int): the n-gram sequences for each line in the corpus
    """
    input_sequences = []

    ### START CODE HERE
    for line in corpus:

        # Tokenize the current line
        token_list = tokenizer.texts_to_sequences([line])[0]

        # Loop over the line several times to generate the subphrases
        for i in range(1, len(token_list)):

            # Generate the subphrase
            n_gram_sequence = token_list[:i+1]

            # Append the subphrase to the sequences list
            input_sequences.append(n_gram_sequence)
    ### END CODE HERE
    
    return input_sequences
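
As a quick sanity check on the loop above, a line with `n` tokens should yield `n - 1` prefix sequences, from length 2 up to length `n`. The sketch below uses a hypothetical `prefixes` helper on made-up token ids, with no tokenizer involved.

```python
def prefixes(token_list):
    """Return every prefix of length >= 2 (hypothetical helper mirroring the inner loop)."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


tokens = [7, 3, 9, 4]  # made-up token ids for one line
seqs = prefixes(tokens)

for seq in seqs:
    print(seq)
# [7, 3]
# [7, 3, 9]
# [7, 3, 9, 4]

# A line with a single token produces no training sequences at all.
print(prefixes([5]))  # []
```

This also explains why very short lines in the corpus contribute little or nothing to the training set.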

In [16]:
# grader-required-cell

# Test your function with one example
first_example_sequence = n_gram_seqs([corpus[5]], tokenizer)

print("n_gram sequences for first example look like this:\n")
first_example_sequence

n_gram sequences for first example look like this:



[[1847, 1],
 [1847, 1, 25],
 [1847, 1, 25, 240],
 [1847, 1, 25, 240, 1848],
 [1847, 1, 25, 240, 1848, 1],
 [1847, 1, 25, 240, 1848, 1, 10],
 [1847, 1, 25, 240, 1848, 1, 10, 1108],
 [1847, 1, 25, 240, 1848, 1, 10, 1108, 25],
 [1847, 1, 25, 240, 1848, 1, 10, 1108, 25, 132]]

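**Expected Output** (computed on the original Shakespeare sonnets corpus; with the Martín Fierro text loaded above, the token ids and sizes in this and the later expected-output blocks will differ):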

```
n_gram sequences for first example look like this:

[[34, 417],
 [34, 417, 877],
 [34, 417, 877, 166],
 [34, 417, 877, 166, 213],
 [34, 417, 877, 166, 213, 517]]
```

In [None]:
# grader-required-cell

# Test your function with a bigger corpus
next_3_examples_sequence = n_gram_seqs(corpus[1:4], tokenizer)

print("n_gram sequences for next 3 examples look like this:\n")
next_3_examples_sequence

**Expected Output:**

```
n_gram sequences for next 3 examples look like this:

[[8, 878],
 [8, 878, 134],
 [8, 878, 134, 351],
 [8, 878, 134, 351, 102],
 [8, 878, 134, 351, 102, 156],
 [8, 878, 134, 351, 102, 156, 199],
 [16, 22],
 [16, 22, 2],
 [16, 22, 2, 879],
 [16, 22, 2, 879, 61],
 [16, 22, 2, 879, 61, 30],
 [16, 22, 2, 879, 61, 30, 48],
 [16, 22, 2, 879, 61, 30, 48, 634],
 [25, 311],
 [25, 311, 635],
 [25, 311, 635, 102],
 [25, 311, 635, 102, 200],
 [25, 311, 635, 102, 200, 25],
 [25, 311, 635, 102, 200, 25, 278]]
```

Apply the `n_gram_seqs` transformation to the whole corpus and save the maximum sequence length to use it later:

In [10]:
# grader-required-cell

# Apply the n_gram_seqs transformation to the whole corpus
input_sequences = n_gram_seqs(corpus, tokenizer)

# Save max length 
max_sequence_len = max([len(x) for x in input_sequences])

print(f"n_grams of input_sequences have length: {len(input_sequences)}")
print(f"maximum length of sequences is: {max_sequence_len}")

n_grams of input_sequences have length: 21293
maximum length of sequences is: 78


**Expected Output:**

```
n_grams of input_sequences have length: 15462
maximum length of sequences is: 11
```

## Add padding to the sequences

Now code the `pad_seqs` function which will pad any given sequences to the desired maximum length. Notice that this function receives a list of sequences and should return a numpy array with the padded sequences: 

In [11]:
# grader-required-cell

# GRADED FUNCTION: pad_seqs
def pad_seqs(input_sequences, maxlen):
    """
    Pads tokenized sequences to the same length
    
    Args:
        input_sequences (list of int): tokenized sequences to pad
        maxlen (int): maximum length of the token sequences
    
    Returns:
        padded_sequences (array of int): tokenized sequences padded to the same length
    """
    ### START CODE HERE
    padded_sequences = pad_sequences(input_sequences, maxlen=maxlen, padding="pre")
    
    return padded_sequences
    ### END CODE HERE
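
For intuition, "pre" padding can be written in a few lines of plain Python. `pad_pre` below is a hypothetical helper, not part of Keras. It assumes a padding value of 0 and `maxlen >= 1` and, like the default `truncating='pre'` in `pad_sequences`, drops tokens from the front when a sequence is longer than `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and, if needed, left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # keep at most the last maxlen tokens
        padding = [value] * (maxlen - len(seq))   # zeros go in front of the tokens
        padded.append(padding + seq)
    return padded


toy_sequences = [[7, 3], [7, 3, 9], [7, 3, 9, 4]]  # made-up n-gram sequences
toy_padded = pad_pre(toy_sequences, maxlen=4)

for row in toy_padded:
    print(row)
# [0, 0, 7, 3]
# [0, 7, 3, 9]
# [7, 3, 9, 4]

print(pad_pre([[1, 2, 3, 4, 5]], maxlen=3))  # [[3, 4, 5]]
```

Padding at the front keeps the most recent words, and therefore the label, aligned at the end of every row.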

In [12]:
# grader-required-cell

# Test your function with the n_grams_seq of the first example
first_padded_seq = pad_seqs(first_example_sequence, max([len(x) for x in first_example_sequence]))
first_padded_seq


**Expected Output:**

```
array([[  0,   0,   0,   0,  34, 417],
       [  0,   0,   0,  34, 417, 877],
       [  0,   0,  34, 417, 877, 166],
       [  0,  34, 417, 877, 166, 213],
       [ 34, 417, 877, 166, 213, 517]], dtype=int32)
```

In [None]:
# grader-required-cell

# Test your function with the n_grams_seq of the next 3 examples
next_3_padded_seq = pad_seqs(next_3_examples_sequence, max([len(s) for s in next_3_examples_sequence]))
next_3_padded_seq

**Expected Output:**

```
array([[  0,   0,   0,   0,   0,   0,   8, 878],
       [  0,   0,   0,   0,   0,   8, 878, 134],
       [  0,   0,   0,   0,   8, 878, 134, 351],
       [  0,   0,   0,   8, 878, 134, 351, 102],
       [  0,   0,   8, 878, 134, 351, 102, 156],
       [  0,   8, 878, 134, 351, 102, 156, 199],
       [  0,   0,   0,   0,   0,   0,  16,  22],
       [  0,   0,   0,   0,   0,  16,  22,   2],
       [  0,   0,   0,   0,  16,  22,   2, 879],
       [  0,   0,   0,  16,  22,   2, 879,  61],
       [  0,   0,  16,  22,   2, 879,  61,  30],
       [  0,  16,  22,   2, 879,  61,  30,  48],
       [ 16,  22,   2, 879,  61,  30,  48, 634],
       [  0,   0,   0,   0,   0,   0,  25, 311],
       [  0,   0,   0,   0,   0,  25, 311, 635],
       [  0,   0,   0,   0,  25, 311, 635, 102],
       [  0,   0,   0,  25, 311, 635, 102, 200],
       [  0,   0,  25, 311, 635, 102, 200,  25],
       [  0,  25, 311, 635, 102, 200,  25, 278]], dtype=int32)
```

In [13]:
# grader-required-cell

# Pad the whole corpus
input_sequences = pad_seqs(input_sequences, max_sequence_len)

print(f"padded corpus has shape: {input_sequences.shape}")

padded corpus has shape: (21293, 78)


**Expected Output:**

```
padded corpus has shape: (15462, 11)
```

## Split the data into features and labels

Before feeding the data into the neural network you should split it into features and labels. In this case the features will be the padded n_gram sequences with the last word removed from them and the labels will be the removed word.

Complete the `features_and_labels` function below. This function expects the padded n_gram sequences as input and should return a tuple containing the features and the one hot encoded labels.

Notice that the function also receives the total number of words in the corpus. This parameter is very important when one-hot encoding the labels, since any word in the vocabulary can be a label. If you need a refresher on how the `to_categorical` function works, take a look at the [docs](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical).

In [14]:
# grader-required-cell

# GRADED FUNCTION: features_and_labels
def features_and_labels(input_sequences, total_words):
    """
    Generates features and labels from n-grams
    
    Args:
        input_sequences (list of int): sequences to split features and labels from
        total_words (int): vocabulary size
    
    Returns:
        features, one_hot_labels (array of int, array of int): arrays of features and one-hot encoded labels
    """
    ### START CODE HERE
    features = input_sequences[:,:-1]
    labels = input_sequences[:,-1]
    one_hot_labels = to_categorical(labels, total_words, dtype="float32")
    ### END CODE HERE

    return features, one_hot_labels
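
The split can also be traced by hand on a tiny example. The sketch below is plain Python with a hypothetical `split_features_labels` helper and made-up token ids; it assumes index 0 is reserved for padding, which is why the vocabulary size must be larger than the highest word index.

```python
def split_features_labels(padded, vocab_size):
    """Split each row into (all but last token, one-hot vector of the last token)."""
    features = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    one_hot = [
        [1.0 if i == label else 0.0 for i in range(vocab_size)]
        for label in labels
    ]
    return features, one_hot


toy_rows = [[0, 0, 7, 3], [0, 7, 3, 2], [7, 3, 2, 4]]  # made-up padded n-grams
toy_features, toy_one_hot = split_features_labels(toy_rows, vocab_size=8)

print(toy_features)    # [[0, 0, 7], [0, 7, 3], [7, 3, 2]]
print(toy_one_hot[0])  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Each label row has exactly one non-zero entry, at the position of the removed word's index.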

In [21]:
# grader-required-cell

# Test your function with the padded n_grams_seq of the first example
first_features, first_labels = features_and_labels(first_padded_seq, total_words)

print(f"labels have shape: {first_labels.shape}")
print("\nfeatures look like this:\n")
first_features


**Expected Output:**

```
labels have shape: (5, 3211)

features look like this:

array([[  0,   0,   0,   0,  34],
       [  0,   0,   0,  34, 417],
       [  0,   0,  34, 417, 877],
       [  0,  34, 417, 877, 166],
       [ 34, 417, 877, 166, 213]], dtype=int32)
```

In [15]:
# grader-required-cell

# Split the whole corpus
features, labels = features_and_labels(input_sequences, total_words)

print(f"features have shape: {features.shape}")
print(f"labels have shape: {labels.shape}")

features have shape: (21293, 77)
labels have shape: (21293, 4958)


**Expected Output:**

```
features have shape: (15462, 10)
labels have shape: (15462, 3211)
```

## Create the model

Now you should define a model architecture capable of achieving an accuracy of at least 80%.

Some hints to help you in this task:

- An appropriate `output_dim` for the first layer (Embedding) is 100; this is already provided for you.
- A Bidirectional LSTM is helpful for this particular problem.
- The last layer should have the same number of units as the total number of words in the corpus and a softmax activation function.
- This problem can be solved with only two layers (excluding the Embedding) so try out small architectures first.
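
To get a feel for how large such a two-layer architecture is, the parameter count can be worked out by hand. The numbers below are assumptions for illustration: a vocabulary of 4958 words (the label width printed above), an embedding size of 100 as in the hint, and 150 LSTM units per direction, which is an arbitrary choice. The formulas are the standard ones for Embedding, LSTM and Dense layers with biases.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)


vocab_size = 4958     # assumed: label width from the shapes printed earlier
embedding_dim = 100   # from the hint above
lstm_units = 150      # assumed value, chosen only for illustration

embedding_count = vocab_size * embedding_dim
bidirectional_count = 2 * lstm_params(embedding_dim, lstm_units)  # forward + backward
dense_count = (2 * lstm_units) * vocab_size + vocab_size          # weights + biases
total_count = embedding_count + bidirectional_count + dense_count

print(f"Embedding:     {embedding_count}")
print(f"Bidirectional: {bidirectional_count}")
print(f"Dense:         {dense_count}")
print(f"Total:         {total_count}")
```

Under these assumptions roughly two thirds of the parameters sit in the output layer, because it scales with the vocabulary size.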

In [16]:
!pip install wandb -qqq
import wandb

In [17]:
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [24]:
# 20 experiments
import random
import os
from tensorflow.keras.callbacks import Callback, ModelCheckpoint
from wandb.keras import WandbCallback
from tensorflow.keras.optimizers import Adam



def run_model(total_words, max_sequence_len, features, labels, num_test=1):
    # Initialize the variables that keep track of the best results
    best_acc = 0
    best_model = None
    for run in range(num_test):
        # Start a run, tracking hyperparameters
        wandb.init(
        project="NLP_Martin_Fierro_La_Vuelta",
        # Set entity to specify your username or team name
        # ex: entity="carey",
        entity="marioherlein",
        config={
            "embedding_dims": int(random.uniform(200, 256)),
            "cells1": int(random.uniform(150, 300)),
            "optimizer": "adam",
            "loss": "categorical_crossentropy",
            "metric": "accuracy",
            "epoch": 20,
            "learning_rate": random.uniform(0.001, 0.01),
        })
        config = wandb.config


        # Build the model (layer sizes come from the wandb config above)

        model = Sequential()

        model.add(Embedding(total_words, config.embedding_dims, input_length=max_sequence_len-1))   #240

        model.add(Bidirectional(LSTM(config.cells1)))   #150

        model.add(Dense(total_words, activation='softmax'))

        # Create the optimizer
        adam = Adam(learning_rate=config.learning_rate)  #0.01

        # Compile the model
        model.compile(optimizer=adam,loss=config.loss,metrics=[config.metric])

        #model.compile(loss="categorical_crossentropy", optimizer=adam, metrics=['accuracy'])
        # Set up the callbacks
        wandb_callback = WandbCallback()
        checkpoint_callback = ModelCheckpoint(f'model{run}.h5', monitor='accuracy', save_best_only=True, mode='max')



        history = model.fit(features,labels, epochs=config.epoch, verbose=1,callbacks=[wandb_callback, checkpoint_callback])
        wandb.finish()

        acc = history.history['accuracy'][-1]

        # Check whether this run reached the best training accuracy so far
        # (no validation data is passed to fit, so this is not a validation metric)
        if acc > best_acc:
            best_acc = acc
            best_model = model
            id_model = run

    # Save the best model
    best_model.save(f'{id_model}best_model_final.h5')
    return best_model
    

In [None]:
# Create wandb config


# Run model
# history = model.fit(features, labels, epochs=50, verbose=1)
model = run_model(total_words, max_sequence_len, features, labels,  20)

Run 1 (epochs 1/20 to 20/20), wandb run summary:
accuracy: 0.67294
epoch: 19.0
loss: 1.33225

Run 2 (epochs 1/20 to 20/20), wandb run summary:
accuracy: 0.89128
epoch: 19.0
loss: 0.4394

Run 3 (epochs 1/20 to 20/20): no run summary in the captured output.

In [None]:
# Take a look at the training curves of your model

# Use the best model returned by run_model in the previous cell. A model
# reloaded from disk with load_model would not carry its training history.
best_model = model

# Get the training history that tf.keras attached to the model during fit()
history = best_model.history.history


acc = history['accuracy']
loss = history['loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'b', label='Training accuracy')
plt.title('Training accuracy')

plt.figure()

plt.plot(epochs, loss, 'b', label='Training Loss')
plt.title('Training loss')
plt.legend()

plt.show()

## See your model in action

After all your work it is finally time to see your model generating text. 

Run the cell below to generate the next 100 words of a seed text.

After submitting your assignment you are encouraged to try out training for different amounts of epochs and seeing how this affects the coherency of the generated text. Also try changing the seed text to see what you get!

In [None]:
seed_text = "aquí me pongo a cantar"
next_words = 100
  
for _ in range(next_words):
    # Convert the text into sequences
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pad the sequences
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Get the probabilities of predicting a word
    predicted = best_model.predict(token_list, verbose=0)
    # Choose the next word based on the maximum probability
    predicted = np.argmax(predicted, axis=-1).item()
    # Get the actual word from the word index
    output_word = tokenizer.index_word[predicted]
    # Append to the current text
    seed_text += " " + output_word

print(seed_text)
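
The loop above is greedy decoding: at each step it appends the single most probable word. The sketch below shows the same idea on a toy setup, where `toy_next_word_probs` is a hypothetical stand-in for `best_model.predict` and the vocabulary is made up.

```python
index_word = {1: "me", 2: "pongo", 3: "a", 4: "cantar", 5: "hoy"}  # made-up vocabulary
word_index = {word: idx for idx, word in index_word.items()}


def toy_next_word_probs(token_ids):
    """Hypothetical stand-in for a trained model: favours the id after the last token."""
    last = token_ids[-1] if token_ids else 0
    favourite = last % 5 + 1
    probs = [0.0] + [0.05] * 5  # position 0 is the padding id and never predicted here
    probs[favourite] = 0.8
    return probs


def greedy_generate(seed_text, next_words):
    text = seed_text
    for _ in range(next_words):
        token_ids = [word_index[w] for w in text.split() if w in word_index]
        probs = toy_next_word_probs(token_ids)
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        word = index_word.get(predicted)
        if word is None:  # id 0 (padding) has no word attached
            break
        text += " " + word
    return text


print(greedy_generate("me pongo", 4))  # me pongo a cantar hoy me
```

The `.get` lookup is a defensive choice in this sketch: id 0 is reserved for padding and has no entry in `index_word`, so a direct dictionary lookup would fail if the model ever predicted it.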