# Letter generation

### Exercise objective
- Get autonomous with Natural Language Processing
- Generate Letters

<hr>
<hr>

In this exercise, we will try to generate some text. The underlying idea is, given a input sequence predict what the next letter is going to be. To do that, we will first create a dataset for this task, and then run a RNN to do the prediction.

# The data

❓ Question ❓ First, let's load the data. Here, it is the IMDB reviews again, but we are only interested in the sentences, not the positiveness or negativeness of the review. 

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, there are chances that too many sentences will make your compute slow down, or even freeze - your RAM can overflow. For that reason, **you should start with 10% of the sentences** and see if your computer handles it. Otherwise, rerun with a lower number. 

**At the end of the notebook, to improve the model, you would maybe need to increase the number of loaded sentences**

In [2]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {interger: word} representation
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X = load_data(percentage_of_sentences=10)

❓ **Question** ❓ Write a function that, given a string (list of letters), returns
- a string (list of letters) that corresponds to part of the sentence - this string should be of size 300
- the letter that follow the previous string

❗ **Remark** ❗ There is no reason your first strings to start at the beginning of the input string.

Example:
- Input : 'This is a good movie"
- Output: ('a good m', 'o') [Except the first part should be of size 300 instead of 8]

❗ **Remark** ❗ If the input is shorter than 300 letters, return None

In [117]:
def get_strings(string):
    size = 300
    idx = np.random.randint(0,(len(X)))
    
    stri = string[idx :(idx+size)] 
    letter = string[(idx+size):(idx+size)+1]
    
    if len(stri) < 300:
        return None
    
    return stri, letter
        

❓ **Question** ❓ Check that the function is working on some strings from the loaded data

In [218]:
strii = get_strings(X[0])
print(strii)


('ilm the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a', ' ')


In [147]:
strii[1]

'e'

❓ **Question** ❓ Write a function, that, based on the previous function and the loaded sentences, generate a dataset X and y:
- each sample of X is a string
- the corresponding y is the letter that comes just after in the input string

❗ **Remark** ❗ This question is not much guided as it is similar to what you have done in the previous exercises.

In [239]:
def get_X(df):
    return [get_strings(df[sample]) for sample in range(len(df))]


❓ **Question** ❓ Split X and y in train and test data. Store it in `string_train`, `string_test`, `y_train` and `y_test`

In [240]:
X_bis = get_X(X)


In [249]:
X_bis = list(filter(None,X_bis))
a, b = X_bis[1]
b

'h'

In [251]:
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(X_bis,test_size=0.3) 

In [254]:
len(X_test), len(X_train)

(274, 639)

In [281]:
X_test[1][1]

'i'

In [285]:
string_test = [X_test[i][0] for i in range(0,274)]
y_test = [X_test[i][1] for i in range(0,274)]
string_train = [X_train[i][0] for i in range(0,274)]
y_train = [X_train[i][1] for i in range(0,274)]

#string_train, y_train = X_train

❓ **Question** ❓ Create a dictionary which stores a unique token for each letter: the key is the letter while the value is the corresponding token. You have to build you dictionary based on the letters that are in `string_train` and `y_train` only, as you are not supposed to know the test set (and the new letters that might appear, which is unlikely, but still possible).

❗ **Remark** ❗ To account for the fact that there might be letters in the test set that are not in the train set, add a particular token for that, whose corresponding key can be `UNKNOWN`.

❗ **Remark** ❗ By letter, we actually mean any character. As there happen to be numbers (`1`, `2`, ...) or `?`, `!`, `@`, ... in texts.

In [None]:
key = {x for l in string_train for x in l}

In [350]:
def make_dict(X,y):
    dico = {}
    key_X = {x for l in X for x in l}
    key_Y = {x for l in y for x in l}
    key = key_X.union(key_Y)
    i = 0.01
    
    for k in key:
        dico[k] = round(i,2)
        i = i+0.01
    dico['UNKNOW'] = round(i,2)
            
    return dico

token = make_dict(string_train, y_train)
token['x']

0.01

❓ **Question** ❓ Based on the previous dictionary, tokenize the strings and store them in `X_train` and `X_tests`.

❗ **Remark** ❗ Convert your lists to NumPy arrays

In [356]:
def tokenize_string(X):
    for sample in X:
        for word in sample:
            for l in word:
                word.replace(l,str(token[l]))
    return X

X_try = tokenize_string(X_train)

KeyError: 'ã'

❓ **Question** ❓ The outputs are currently letters. We first need to tokenize them, thanks to the previous dictionary.

❗ **Remark** ❗ Remember that some values in `y_test` are maybe unknown.

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Now, let's convert the tokenized outputs to one-hot encoded categories! There should be as many categories as different letters in the previous dictionary! So be careful that your outputs are of the right shape, especially as many one-hot encoded categories in both.

In [None]:
# YOUR CODE HERE

# Baseline model

❓ **Question** ❓ What is the baseline accuracy?

In [None]:
# YOUR CODE HERE

# The model

❓ **Question** ❓ Write a RNN with all the appropriate layers, and compile it.

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Fit the model - you can use a large batch size to accelerate the convergence. The model will probably hit the baseline performance at some point, and hopefully keep decreasing from here. 

You should get an accuracy better than 35% 

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Evaluate your model on the test set

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Even though the model is not perfect, you can look at its prediction with a string of your choice. Don't forget to decode the predicted token to know which letter it corresponds to.

You will have to convert your input string to a list of tokens, get the most probable output class, and then convert it back to a letter.

You should do it in a function.

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Now, write a function that takes a string as an input, predicts the next letter, appends the letter to the initial string, then redoes the prediction, etc etc.

For instance : 
- 'this is a good' => ' '
- 'this is a good ' => 'm'
- 'this is a good m' => 'o'
...

The function should also take the number of times you repeat the operation as an input.

You can have some fun trying different input sequences here.

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Try to optimize your architecture to improve your performance. You can also try to load more data in the first function.