# Letter generation

### Exercise objective
- Become autonomous with Natural Language Processing
- Generate text letter by letter

<hr>

In this exercise, we will try to generate some text. The underlying idea is to give an input sequence and predict what the next letter is going to be. To do that, we will first create a dataset for this task, and then train an RNN to do the prediction.

# The data

❓ **Question** ❓ First, let's load the data. Here, it is the IMDB reviews again, but we are only interested in the sentences, not in whether the review is positive or negative.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, and your RAM can overflow. For that reason, you can start with 20% of the sentences and see if your computer handles it. Otherwise, rerun with a lower number. On the other hand, you can increase the number if you feel like it.

**At the end of the notebook, to improve the model, you may need to increase the number of loaded sentences**

In [1]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} index and shift it by 3 to make room for the special tokens
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X = load_data(percentage_of_sentences=10)



❓ **Question** ❓ Write a function that, given a string (list of letters), returns
- a string (list of letters) that corresponds to a part of the sentence; this string should be of size 300
- the letter that follows that string

❗ **Remark** ❗ There is no reason for the extracted string to start at the beginning of the input string.

Example:
- Input : 'This is a good movie'
- Output: ('a good m', 'o') [Except the first part should be of size 300 instead of 8]

❗ **Remark** ❗ If the input is too short to provide 300 letters plus the letter that follows them, return None
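
To make the expected output concrete, here is a minimal sketch with a fixed start index and a short window, so the result is reproducible. The helper name `window_and_next` and its `start` and `size` parameters are illustrative assumptions; the exercise itself expects a random start and a window of 300.

```python
def window_and_next(text, start, size):
    """Return (text[start:start+size], next character), or None if text is too short."""
    # The window needs `size` characters plus one more for the target letter
    if start < 0 or start + size >= len(text):
        return None
    return text[start:start + size], text[start + size]


sample = window_and_next("This is a good movie", start=8, size=8)
print(sample)  # ('a good m', 'o')
```

Returning `None` for short inputs lets the caller simply skip reviews that cannot provide a full window.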

In [2]:
import numpy as np

In [3]:
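def part_sent(sentence):
    # We need 300 characters plus the one that follows them
    if len(sentence) < 301:
        return None
    # Random start such that begin_index + 300 is still a valid index
    begin_index = np.random.randint(0, len(sentence) - 300)
    return (sentence[begin_index:begin_index+300], sentence[begin_index+300])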

❓ **Question** ❓ Check that the function is working on some strings from the loaded data

In [4]:
part_sent(X[0])

('re was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and yo',
 'u')

❓ **Question** ❓ Write a function that, based on the previous function and the loaded sentences, generates a dataset X and y:
- each sample of X is a string
- the corresponding y is the letter that comes just after it in the input string

❗ **Remark** ❗ This question is not very guided, as it is similar to what you have done in the previous exercises.

In [5]:
def data_gen(nb):
    dataX = []
    datay = []
    while len(dataX) < nb:
        # Pick a random review and try to extract a (string, next letter) pair
        index = np.random.randint(0, len(X))
        sample = part_sent(X[index])
        if sample is not None:
            xi, yi = sample
            dataX.append(xi)
            datay.append(yi)
    return dataX, datay

In [6]:
data_gen(4)

(["racker bag last night before a preview screening of disney's holes i don't know who decided to show it but i'm so very glad they did cracker bag is an absolute gem a snapshot of australia in the early 80s as seen through a child's eye the conversations between eddie and her brother were hilarious an",
  "at much of the world we are seeing is about to be swept away in the cataclysm of world war 2 and the communist revolution br br which makes the central character's desire to adhere to old customs and traditions all the more poignant br br but the film also raises issues which are of vital importance",
  "why damn fox canceled the season3 although season2 was not as good as season1 which is excellent indeed i like it so much that i even thinking about buying dvd on amazon failed i am a chinese student and it's inconvenient for me to get a international credit card and i just hope fox can bring back d",
  'y happened viewers will recognize his co workers the actors clarence kolb donal

❓ **Question** ❓ Split X and y in train and test data. Store it in `string_train`, `string_test`, `y_train` and `y_test`

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
string, letter = data_gen(7000)

In [10]:
len(letter)

7000

In [13]:
string_train, string_test, y_train, y_test = train_test_split(string, letter, test_size=0.2)

❓ **Question** ❓ Create a dictionary which stores a unique token for each letter: the key is the letter while the value is the corresponding token. You have to build your dictionary based on the letters that are in `string_train` and `y_train` only, as you are not supposed to know the test set (and the new letters that might appear, which is unlikely, but still possible).

❗ **Remark** ❗ To account for the fact that there might be letters in the test set that are not in the train set, add a particular token for that, whose corresponding key can be `UNKNOWN`.

❗ **Remark** ❗ By letter, we actually mean any character, as texts also contain numbers (`1`, `2`, ...) and symbols such as `?`, `!`, `@`, ...

In [14]:
len(string_train)

5600

In [15]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Build the vocabulary from the training inputs and the training targets only
big_string = "".join(string_train) + "".join(y_train)

tokenizer = Tokenizer(char_level=True, oov_token='UNKNOWN')
tokenizer.fit_on_texts([big_string])

In [18]:
dico_size = len(tokenizer.word_index)

In [17]:
len(tokenizer.word_index), tokenizer.word_index

(61,
 {'UNKNOWN': 1,
  ' ': 2,
  'e': 3,
  't': 4,
  'a': 5,
  'i': 6,
  'o': 7,
  's': 8,
  'n': 9,
  'r': 10,
  'h': 11,
  'l': 12,
  'd': 13,
  'c': 14,
  'm': 15,
  'u': 16,
  'f': 17,
  'y': 18,
  'g': 19,
  'w': 20,
  'b': 21,
  'p': 22,
  'v': 23,
  'k': 24,
  "'": 25,
  'j': 26,
  'x': 27,
  'z': 28,
  'q': 29,
  '0': 30,
  '1': 31,
  '9': 32,
  '2': 33,
  '3': 34,
  '5': 35,
  '4': 36,
  '7': 37,
  '8': 38,
  '6': 39,
  'é': 40,
  '\x96': 41,
  '\x85': 42,
  '´': 43,
  'ä': 44,
  '\x97': 45,
  'ç': 46,
  'ï': 47,
  'ã': 48,
  'è': 49,
  '“': 50,
  '”': 51,
  'å': 52,
  'ö': 53,
  'à': 54,
  'ü': 55,
  '–': 56,
  '’': 57,
  'ó': 58,
  'ù': 59,
  'í': 60,
  '\xa0': 61})

❓ **Question** ❓ Based on the previous dictionary, tokenize the strings and store them in `X_train` and `X_test`.

❗ **Remark** ❗ Convert your lists to NumPy arrays

In [19]:
X_train = np.array(tokenizer.texts_to_sequences(string_train))

In [22]:
np.unique(X_train)

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
       53, 54, 55, 56, 57, 58, 59, 60, 61])

In [23]:
X_test = np.array(tokenizer.texts_to_sequences(string_test))

❓ **Question** ❓ The outputs are currently letters. We first need to tokenize them, using the previous dictionary.

❗ **Remark** ❗ Remember that some values in `y_test` may be unknown.
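
The fallback to the unknown token can be sketched as follows. The tiny vocabulary and the `encode` helper are assumptions made for illustration only; with the Keras `Tokenizer`, the same effect comes from setting `oov_token` before fitting.

```python
def encode(text, char_to_id, unknown_key="UNKNOWN"):
    """Turn a string into a list of tokens, using the unknown token for unseen characters."""
    unknown_id = char_to_id[unknown_key]
    return [char_to_id.get(char, unknown_id) for char in text]


# Hypothetical vocabulary learned on the training set only
char_to_id = {"UNKNOWN": 1, " ": 2, "o": 3, "a": 4, "d": 5, "g": 6}

print(encode("a good", char_to_id))  # [4, 2, 6, 3, 3, 5]
print(encode("z", char_to_id))       # [1]
```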

In [24]:
y_train_tok = np.array(tokenizer.texts_to_sequences(y_train))

In [25]:
np.unique(y_train_tok)

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39])

In [27]:
y_test_tok = np.array(tokenizer.texts_to_sequences(y_test))

In [28]:
np.unique(y_test_tok)

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31])

❓ **Question** ❓ Now, let's convert the tokenized outputs to one-hot encoded categories! There should be as many categories as different letters in the previous dictionary! So be careful that your outputs are of the right shape, with the same number of one-hot encoded categories in both the train and test sets.
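
One way to guarantee matching shapes is to fix the number of categories explicitly instead of inferring it from the largest token present. The NumPy sketch below illustrates the idea; the `one_hot` helper and the toy token values are assumptions, and in Keras the `num_classes` argument of `to_categorical` plays the same role.

```python
import numpy as np


def one_hot(tokens, num_classes):
    """One-hot encode integer tokens into an array of shape (len(tokens), num_classes)."""
    tokens = np.asarray(tokens).reshape(-1)
    encoded = np.zeros((tokens.size, num_classes), dtype="float32")
    encoded[np.arange(tokens.size), tokens] = 1.0
    return encoded


vocab_size = 6                    # hypothetical: largest token in the dictionary is 6
y_train_tok = [[2], [3], [6]]     # tokens as returned by a tokenizer, one per sample
y_test_tok = [[4], [1]]           # the test set may contain fewer distinct tokens

# +1 because token 0 is reserved and never assigned to a character
y_train_cat = one_hot(y_train_tok, vocab_size + 1)
y_test_cat = one_hot(y_test_tok, vocab_size + 1)
print(y_train_cat.shape, y_test_cat.shape)  # (3, 7) (2, 7)
```

With the width fixed this way, the train and test targets stay compatible with the same output layer even when some letters never occur as a target.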

In [None]:
X_train[:100]

# Baseline model

❓ **Question** ❓ What is the baseline accuracy?

In [None]:
counts = list(tokenizer.word_counts.values())
baseline = max(counts) / sum(counts)  # share of the most frequent character
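
An alternative way to look at the baseline is to count the targets directly: always predicting the most frequent next letter gives an accuracy equal to that letter's share of the targets. This is a sketch on made-up targets, and `baseline_accuracy` is a hypothetical helper, not part of the original solution.

```python
from collections import Counter


def baseline_accuracy(targets):
    """Accuracy of a model that always predicts the most frequent target."""
    counts = Counter(targets)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(targets)


toy_targets = [" ", "e", " ", "t", " ", "e", "a", " "]
print(baseline_accuracy(toy_targets))  # 0.5
```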

# The model

❓ **Question** ❓ Write an RNN with all the appropriate layers, and compile it.

In [30]:
from tensorflow.keras.utils import to_categorical

In [31]:
y_train_cat = to_categorical(y_train_tok)

In [32]:
y_train_cat

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [33]:
X_train.shape, y_train_cat.shape

((5600, 300), (5600, 40))

In [40]:
from tensorflow.keras import layers, models

model_rnn = models.Sequential()

model_rnn.add(layers.Embedding(input_dim = dico_size+1, input_length=300, output_dim=20))

model_rnn.add(layers.LSTM(20))

model_rnn.add(layers.Dense(50, activation = 'relu'))
model_rnn.add(layers.Dense(25, activation = 'relu'))
model_rnn.add(layers.Dense(40, activation = 'softmax'))

model_rnn.compile(loss = "categorical_crossentropy",
                  optimizer = 'rmsprop',
                  metrics = ["accuracy"])

In [41]:
model_rnn.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 300, 20)           1240      
_________________________________________________________________
lstm_2 (LSTM)                (None, 20)                3280      
_________________________________________________________________
dense_5 (Dense)              (None, 50)                1050      
_________________________________________________________________
dense_6 (Dense)              (None, 25)                1275      
_________________________________________________________________
dense_7 (Dense)              (None, 40)                1040      
Total params: 7,885
Trainable params: 7,885
Non-trainable params: 0
_________________________________________________________________


In [42]:
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, restore_best_weights=True)

model_rnn.fit(X_train, y_train_cat,
              epochs=100,
              batch_size=64,
              validation_split=0.3,
              callbacks=[es])

Epoch 1/100
...
Epoch 81/100


<tensorflow.python.keras.callbacks.History at 0x150f771c0>

❓ **Question** ❓ Fit the model - you can use a large batch size to accelerate the convergence. The model will probably hit the baseline performance at some point. If the loss keeps decreasing after that, the accuracy will improve beyond the baseline.

You should get an accuracy better than 35% 

In [None]:
np.array(y_train).reshape(1400,)

In [None]:
X_train[:100]

❓ **Question** ❓ Evaluate your model on the test set

In [45]:
y_test_cat = to_categorical(y_test_tok, 40)

In [46]:
model_rnn.evaluate(X_test, y_test_cat)



[2.279909610748291, 0.3171428442001343]

❓ **Question** ❓ Even though the model is not perfect, you can look at its predictions on a string of your choice. Don't forget to decode the predicted token to know which letter it corresponds to.

You will have to convert your string to a list of tokens, then get the most probable class and convert it back to a letter.

You should do it in a function.

In [None]:
# YOUR CODE HERE
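
A possible shape for such a function is sketched below. To keep it runnable without a trained network, the model is replaced by a stand-in `predict_fn` that returns one probability per token; with the Keras model from this notebook, one would presumably pass something like `model_rnn.predict` instead. The names `predict_next_char`, `predict_fn` and `seq_len`, and the choice to left-pad short inputs with token 0, are assumptions.

```python
import numpy as np


def predict_next_char(text, char_to_id, predict_fn, seq_len=300):
    """Predict the character that most likely follows `text`."""
    id_to_char = {token: char for char, token in char_to_id.items()}
    unknown_id = char_to_id["UNKNOWN"]
    # Keep the last `seq_len` characters and convert them to tokens
    tokens = [char_to_id.get(char, unknown_id) for char in text.lower()[-seq_len:]]
    # Left-pad with 0 so the input always has length `seq_len`
    tokens = [0] * (seq_len - len(tokens)) + tokens
    # predict_fn takes a batch of shape (1, seq_len) and returns (1, n_classes)
    probabilities = np.asarray(predict_fn(np.array([tokens])))[0]
    best_token = int(np.argmax(probabilities))
    # Token 0 (padding) has no character attached, hence the default
    return id_to_char.get(best_token, "")


char_to_id = {"UNKNOWN": 1, " ": 2, "o": 3, "m": 4}


def dummy_predict(batch):
    # Stand-in for a trained model: always favours token 3 ('o')
    return np.array([[0.0, 0.1, 0.2, 0.6, 0.1]] * len(batch))


print(predict_next_char("this is a good m", char_to_id, dummy_predict, seq_len=8))  # o
```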

❓ **Question** ❓ Now, write a function that takes a string as input, predicts the next letter, appends that letter to the initial string, then makes a new prediction, and so on.

For instance : 
- 'this is a good' => ' '
- 'this is a good ' => 'm'
- 'this is a good m' => 'o'
...

The function should take as input the number of times you repeat the operation

You can have some fun trying different input sequences here.

In [None]:
# YOUR CODE HERE
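
The repeated prediction can be sketched as a simple loop. Here `next_char_fn` is a hypothetical callable that returns the next letter for a given string (for example a wrapper around the trained model), and the toy rule used below only exists to make the sketch runnable.

```python
def generate_text(seed, n_steps, next_char_fn):
    """Append `n_steps` predicted characters to `seed`, one at a time."""
    text = seed
    for _ in range(n_steps):
        # Each prediction sees everything generated so far
        text += next_char_fn(text)
    return text


def toy_next_char(text):
    # Stand-in for a model: spell out " movie" after the seed, then repeat "!"
    continuation = " movie"
    position = len(text) - len("this is a good")
    return continuation[position] if 0 <= position < len(continuation) else "!"


print(generate_text("this is a good", 6, toy_next_char))  # this is a good movie
```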

❓ **Question** ❓ Try to optimize your architecture to improve your performance. You can also try to load more data in the first function.