# Letter generation

### Exercise objective
- Get autonomous with Natural Language Processing
- Generate Letters

<hr>

In this exercise, we will try to generate some text. The underlying idea is, given an input sequence, to predict what the next letter is going to be. To do that, we will first create a dataset for this task, and then train an RNN to do the prediction.

# The data

❓ Question ❓ First, let's load the data. Here, we use the IMDB reviews again, but we are only interested in the sentences, not in whether the review is positive or negative.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, because your RAM can overflow. For that reason, **you should start with 10% of the sentences** and see if your computer handles it. Otherwise, rerun with a lower number.

**At the end of the notebook, to improve the model, you may need to increase the number of loaded sentences**

In [1]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} index and shift it by 3 to make room for the special tokens
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X_data = load_data(percentage_of_sentences=10)

❓ **Question** ❓ Write a function that, given a string (list of letters), returns
- a string (list of letters) that corresponds to part of the sentence - this string should be of size 300
- the letter that follows the previous string

❗ **Remark** ❗ There is no reason for your strings to start at the beginning of the input string.

Example:
- Input : 'This is a good movie'
- Output: ('a good m', 'o') [Except the first part should be of size 300 instead of 8]

❗ **Remark** ❗ If the input is too short (300 letters or fewer, so that no letter follows the 300-letter string), return None

In [2]:
import numpy as np

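def split_string(text):
    # `text` is used instead of `str` to avoid shadowing the built-in type
    if len(text) > 300:
        # Index of the target letter; the input is the 300 letters just before it
        start_point = np.random.randint(300, len(text))
        return (text[start_point-300:start_point], text[start_point])
    else:
        return None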
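
The windowing idea can be checked on a short string by making the window size a parameter. This is only a sketch: the name `split_window` and the `window` and `rng` arguments are hypothetical additions, not part of the exercise, and the seeded generator is there so that the result is reproducible.

```python
import numpy as np

def split_window(text, window=300, rng=None):
    """Return (substring of length `window`, next character), or None if text is too short."""
    if rng is None:
        rng = np.random.default_rng()
    if len(text) <= window:
        return None
    # `end` is the index of the target letter (upper bound is exclusive)
    end = int(rng.integers(window, len(text)))
    return text[end - window:end], text[end]

sample = "this is a good movie"
chunk, target = split_window(sample, window=8, rng=np.random.default_rng(0))
print(repr(chunk), repr(target))
```

Whatever the random position, `chunk + target` is always a 9-character slice of `sample`, which is an easy property to assert when testing the function.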

❓ **Question** ❓ Check that the function is working on some strings from the loaded data

In [3]:
print(split_string(X_data[0]))
print(split_string(X_data[1]))
print(split_string(X_data[127]))

(" location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection wit", 'h')
('hair is big lots of boobs bounce men wear those cut tee shirts that show off their stomachs sickening that men actually wore them and the music is just synthesiser trash that plays over and over again in almost every scene there is trashy music boobs and paramedics taking away bodies and the gym sti', 'l')
("son michalka was bad all the way through chris gallinger who played the love interest of amanda was playing a french guy but had an awful accent one good thing about this movie was the completely adorable michael trevino who played alyson's love interest just something to keep in mind if this movie ", 'h')


❓ **Question** ❓ Write a function that, based on the previous function and the loaded sentences, generates a dataset X and y:
- each sample of X is a string
- the corresponding y is the letter that comes just after in the input string

❗ **Remark** ❗ This question is only lightly guided, as it is similar to what you have done in the previous exercises.

In [4]:
def x_and_y_generator(X_input):
    
    X = []
    y = []
    
    for X_sample in X_input:
        if (result := split_string(X_sample)) is not None:
            X.append(result[0])
            y.append(result[1])
            
    return X, y

In [5]:
X, y = x_and_y_generator(X_data)

❓ **Question** ❓ Split X and y into train and test data. Store them in `string_train`, `string_test`, `y_train` and `y_test`

In [28]:
from sklearn.model_selection import train_test_split

string_train, string_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

❓ **Question** ❓ Create a dictionary which stores a unique token for each letter: the key is the letter while the value is the corresponding token. You have to build your dictionary based on the letters that are in `string_train` and `y_train` only, as you are not supposed to know the test set (nor the new letters that might appear in it, which is unlikely but still possible).

❗ **Remark** ❗ To account for the fact that there might be letters in the test set that are not in the train set, add a particular token for that, whose corresponding key can be `UNKNOWN`.

❗ **Remark** ❗ By letter, we actually mean any character, since the texts also contain digits (`1`, `2`, ...) and symbols such as `?`, `!`, `@`, ...

In [43]:
#string_train_cleaned = [text.replace(" ", "") for text in string_train]
#y_train_cleaned = [text.replace(" ", "") for text in y_train]

# 'unknown' comes first so that it gets token 0
unique_char = ['unknown']
for text in string_train:
    for c in text:
        if c not in unique_char:
            unique_char.append(c)

for text in y_train:
    for c in text:
        if c not in unique_char:
            unique_char.append(c)
            
print(unique_char)

['unknown', ' ', 'a', 'n', 'y', 't', 'h', 'i', 'g', 'o', 'd', 'w', 'e', 'r', 'l', 'm', 'k', 'b', 'u', 's', 'p', 'c', 'f', 'j', "'", 'v', '2', 'q', 'z', '1', '0', 'x', '9', '4', 'é', '3', '7', '8', '5', '6', '\x95', '\xa0', '\x96', '\x85', 'è', '\x97', 'ü', '´', '–', 'ó']


In [44]:
# Each character gets its position in `unique_char` as token
token_dict = {char: token for token, char in enumerate(unique_char)}

token_dict

{'unknown': 0,
 ' ': 1,
 'a': 2,
 'n': 3,
 'y': 4,
 't': 5,
 'h': 6,
 'i': 7,
 'g': 8,
 'o': 9,
 'd': 10,
 'w': 11,
 'e': 12,
 'r': 13,
 'l': 14,
 'm': 15,
 'k': 16,
 'b': 17,
 'u': 18,
 's': 19,
 'p': 20,
 'c': 21,
 'f': 22,
 'j': 23,
 "'": 24,
 'v': 25,
 '2': 26,
 'q': 27,
 'z': 28,
 '1': 29,
 '0': 30,
 'x': 31,
 '9': 32,
 '4': 33,
 'é': 34,
 '3': 35,
 '7': 36,
 '8': 37,
 '5': 38,
 '6': 39,
 '\x95': 40,
 '\xa0': 41,
 '\x96': 42,
 '\x85': 43,
 'è': 44,
 '\x97': 45,
 'ü': 46,
 '´': 47,
 '–': 48,
 'ó': 49}
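
As an alternative sketch, the dictionary can be built from a sorted set of characters, which avoids scanning a list for every character. The helper names `build_token_dict` and `tokenize` are hypothetical, and the resulting token values differ from the ones printed above because the characters are sorted rather than kept in order of first appearance. Unseen characters fall back to token 0 through `dict.get`.

```python
def build_token_dict(strings, unknown_key="unknown"):
    """Map each distinct character to an integer token; token 0 is reserved for unseen characters."""
    chars = sorted(set("".join(strings)))
    token_dict = {unknown_key: 0}
    for i, c in enumerate(chars, start=1):
        token_dict[c] = i
    return token_dict

def tokenize(text, token_dict, unknown_key="unknown"):
    """Convert a string to a list of tokens, using the unknown token for unseen characters."""
    return [token_dict.get(c, token_dict[unknown_key]) for c in text]

toy_train = ["ab a", "ba"]          # toy data, not the IMDB strings
toy_dict = build_token_dict(toy_train)
print(toy_dict)                     # {'unknown': 0, ' ': 1, 'a': 2, 'b': 3}
print(tokenize("abz", toy_dict))    # [2, 3, 0]
```

Because the key `unknown` is longer than one character, it cannot collide with any single character found in the texts.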

❓ **Question** ❓ Based on the previous dictionary, tokenize the strings and store them in `X_train` and `X_test`.

❗ **Remark** ❗ Convert your lists to NumPy arrays

In [53]:
X_train = [[token_dict[_] for _ in x] for x in string_train]
X_test = [[token_dict[_] if _ in token_dict else token_dict['unknown'] for _ in x ] for x in string_test]

X_train = np.array(X_train)
X_test = np.array(X_test)

❓ **Question** ❓ The outputs are currently letters. We first need to tokenize them, thanks to the previous dictionary.

❗ **Remark** ❗ Remember that some values in `y_test` may be unknown.

In [47]:
y_train_token = [token_dict[x] for x in y_train]
y_test_token = [token_dict[x] if x in token_dict else token_dict['unknown'] for x in y_test]

❓ **Question** ❓ Now, let's convert the tokenized outputs to one-hot encoded categories! There should be as many categories as there are entries in the previous dictionary. Make sure your outputs have the right shape, in particular that the train and test outputs are encoded with the same number of categories.

In [48]:
from tensorflow.keras.utils import to_categorical

y_train_cat = to_categorical(y_train_token, num_classes=len(token_dict))
y_test_cat = to_categorical(y_test_token, num_classes=len(token_dict))
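
To see what shape to expect, here is a NumPy-only sketch of one-hot encoding. It is not the Keras implementation; the `one_hot` helper is a hypothetical stand-in that indexes the rows of an identity matrix, and the token values are toy data.

```python
import numpy as np

def one_hot(tokens, num_classes):
    """Return an array of shape (len(tokens), num_classes) with a single 1 per row."""
    tokens = np.asarray(tokens)
    return np.eye(num_classes, dtype="float32")[tokens]

encoded = one_hot([2, 0, 3], num_classes=4)
print(encoded.shape)  # (3, 4)
print(encoded)
```

Fixing `num_classes` explicitly matters: if it were inferred from the data, a split that happens not to contain the highest token would be encoded with fewer columns than the other.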

# Baseline model

❓ **Question** ❓ What is the baseline accuracy?

In [49]:
from sklearn.metrics import accuracy_score

unique, counts = np.unique(y_train, return_counts=True)
counts = dict(zip(unique, counts))
print('Number of labels in train set', counts)

w = -1
y_pred = ''
for k, v in counts.items():
    if v > w:
        y_pred = k
        w = v

print('Baseline accuracy: ', accuracy_score(y_test, [y_pred]*len(y_test)))

Number of labels in train set {' ': 315, "'": 16, '0': 2, '1': 2, '2': 1, '4': 3, '7': 1, 'a': 109, 'b': 30, 'c': 30, 'd': 43, 'e': 173, 'f': 24, 'g': 30, 'h': 65, 'i': 91, 'j': 1, 'k': 11, 'l': 61, 'm': 39, 'n': 79, 'o': 121, 'p': 26, 'q': 1, 'r': 77, 's': 99, 't': 128, 'u': 30, 'v': 21, 'w': 25, 'x': 3, 'y': 27, 'z': 2}
Baseline accuracy:  0.20193637621023514
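
The baseline used here always predicts the most frequent letter of the train set (the space, in the output above). A compact sketch of the same idea with `collections.Counter`, on toy labels, could look like this; `baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent train label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == most_common_label for label in test_labels)
    return most_common_label, hits / len(test_labels)

# Toy labels: the space is the most frequent train label
label, acc = baseline_accuracy(list("  e a t "), list(" e  a"))
print(repr(label), acc)  # ' ' 0.6
```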


# The model

❓ **Question** ❓ Write an RNN with all the appropriate layers, and compile it.

In [51]:
from tensorflow.keras import Sequential, layers

def init_model(vocab_size):
    model = Sequential()
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=30))
    model.add(layers.GRU(30, activation='tanh'))
    model.add(layers.Dense(30, activation='relu'))
    model.add(layers.Dense(vocab_size, activation='softmax'))
    
    
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    
    return model

model = init_model(len(token_dict))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 30)          1500      
                                                                 
 gru (GRU)                   (None, 30)                5580      
                                                                 
 dense (Dense)               (None, 30)                930       
                                                                 
 dense_1 (Dense)             (None, 50)                1550      
                                                                 
Total params: 9,560
Trainable params: 9,560
Non-trainable params: 0
_________________________________________________________________


❓ **Question** ❓ Fit the model. You can use a large batch size to speed up training. The accuracy will probably reach the baseline at some point and, hopefully, keep improving from there.

You should get an accuracy better than 35% 

In [52]:
from tensorflow.keras.callbacks import EarlyStopping

model.fit(X_train, y_train_cat,
          epochs=400, 
          batch_size=50,
          # Callbacks are passed as a list
          callbacks=[EarlyStopping(patience=5, monitor='val_loss')],
          validation_split=0.3)

Epoch 1/400
...
Epoch 49/400

<keras.callbacks.History at 0x293871270>

❓ **Question** ❓ Evaluate your model on the test set

In [54]:
model.evaluate(X_test, y_test_cat)



[2.555826425552368, 0.28215768933296204]

❓ **Question** ❓ Even though the model is not perfect, you can look at its prediction with a string of your choice. Don't forget to decode the predicted token to know which letter it corresponds to.

You will have to convert your input string to a list of tokens, get the most probable output class, and then convert it back to a letter.

You should do it in a function.

In [55]:
token_to_letter = {v: k for k, v in token_dict.items()}

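def get_predicted_letter(string):
    # Tokenize the input; characters unseen during training map to the 'unknown' token
    string_convert = [token_dict.get(c, token_dict['unknown']) for c in string]

    # The model expects a batch, hence the extra dimension
    pred = model.predict(np.array([string_convert]), verbose=0)
    pred_class = np.argmax(pred[0])
    pred_letter = token_to_letter[pred_class]
    
    return pred_letter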

string = 'this is a good'

get_predicted_letter(string)



' '

❓ **Question** ❓ Now, write a function that takes a string as an input, predicts the next letter, appends the letter to the initial string, then redoes the prediction, and so on.

For instance:
- 'this is a good' => ' '
- 'this is a good ' => 'm'
- 'this is a good m' => 'o'
...

The function should also take the number of times you repeat the operation as an input.

You can have some fun trying different input sequences here.

In [57]:
def repeat_prediction(string, repetition):
    string_tmp = string
    for i in range(repetition):
        predicted_letter = get_predicted_letter(string_tmp)
        string_tmp = string_tmp + predicted_letter

    return string_tmp

strings = ['what i like is ',
          ]

[repeat_prediction(string, 20) for string in strings]



['what i like is the the the the the ']

❓ **Question** ❓ Try to optimize your architecture to improve your performance. You can also try to load more data in the first function.