# Text generation

In this notebook, based on https://github.com/m2dsupsdlclass/lectures-labs/blob/master/labs/06_seq2seq/french_numbers.py, we build a character-based text generation model.

In [9]:
import numpy as np

from french_numbers import generate_translations, to_french_phrase

from tensorflow.contrib.keras.python.keras.models import Sequential
from tensorflow.contrib.keras.python.keras.layers import Embedding, Dropout, LSTM, GRU, Dense
from tensorflow.contrib.keras.python.keras.optimizers import Adam
from tensorflow.contrib.keras.python.keras.preprocessing import sequence
from tensorflow.contrib.keras.python.keras.preprocessing.text import Tokenizer


from sklearn.model_selection import train_test_split

from IPython.display import Image
from IPython.core.display import HTML 

# seed for randomness
SEED = 42

#maximum sentence length
MAXLEN = 20

#learning rate 
LR = 1e-3

#maximum number of words
MAX_NB_WORDS = 50

#BATCH SIZE
BATCH_SIZE = 32

We look at a few example numbers...

In [4]:
for x in [21, 80, 81, 300, 213, 1100, 1201, 301000, 80080]:
    print("{} {}".format(x,to_french_phrase(x)))

21 vingt et un
80 quatre vingts
81 quatre vingt un
300 trois cents
213 deux cent treize
1100 mille cent
1201 mille deux cent un
301000 trois cent un mille
80080 quatre vingt mille quatre vingts


...and then generate the training, validation and test sets.

In [5]:
numbers, french_numbers = generate_translations(
    low=1, high=int(1e6) - 1, exhaustive=5000, random_seed=0)
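# Assumed split: half for training, the rest divided evenly into validation and
# test, which matches the 10000/5000 sample counts reported during training
num_trn, num_dev, fr_trn, fr_dev = train_test_split(
    numbers, french_numbers, test_size=0.5, random_state=SEED)

num_val, num_tst, fr_val, fr_tst = train_test_split(
    num_dev, fr_dev, test_size=0.5, random_state=SEED)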

## Seq2Seq with an LSTM

We now define a sequence-to-sequence model using an LSTM.

In [6]:
Image(url= "basic_seq2seq.png")

First, we turn the input and the output into token lists.

In [7]:
def make_input_output(source_tokens, target_tokens):
    input_tokens = source_tokens + ['_GO'] + target_tokens
    output_tokens = target_tokens + ['_EOS']
    return input_tokens, output_tokens
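
To make the layout of the two lists concrete, here is a minimal, self-contained sketch that applies the function to the number 2882. Splitting the phrase on whitespace and the number into single digits is an assumption about how the token lists are built; it agrees with the sample pair printed further down in this notebook.

```python
def make_input_output(source_tokens, target_tokens):
    # Same logic as the function above, repeated so the example runs on its own
    input_tokens = source_tokens + ['_GO'] + target_tokens
    output_tokens = target_tokens + ['_EOS']
    return input_tokens, output_tokens

# Assumed tokenization: whitespace-separated French words, one token per digit
source = 'deux mille huit cent quatre vingt deux'.split()
target = list('2882')

input_tokens, output_tokens = make_input_output(source, target)
print(' '.join(input_tokens))   # source words, then _GO, then the digits
print(' '.join(output_tokens))  # the digits, then _EOS
```

The input contains the full target after `_GO`, so during training the model always sees the correct previous digits (teacher forcing) and only has to predict the next one.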

Next, we join input and output into one string per example and fit a tokenizer on the training pairs.

In [8]:

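tokenizer = Tokenizer()
# Assumed format, matching the sample below: "<French words> _GO <spaced digits>"
[pairs_trn, pairs_val, pairs_tst] = [
    ['{} _GO {}'.format(fr, ' '.join(str(num))) for fr, num in zip(frs, nums)]
    for frs, nums in [(fr_trn, num_trn), (fr_val, num_val), (fr_tst, num_tst)]]
tokenizer.fit_on_texts(pairs_trn)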

In [11]:
pairs_trn[0]

'deux mille huit cent quatre vingt deux _GO 2 8 8 2'

We also define a dictionary that maps token ids back to words.

In [10]:
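# Invert the tokenizer's word -> id mapping
idx2word = {idx: word for word, idx in tokenizer.word_index.items()}
idx2word[0] = ''   # padding
idx2word[40] = ''  # id 40 marks the end of the sequence (_EOS) in the targets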
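
As a small illustration of how such a reverse dictionary is used, the sketch below decodes a target sequence back into a number string. The `word_index` excerpt is hypothetical: only the ids for the digits 2 and 8 are taken from the sample sequences in this notebook, and everything else is left out.

```python
# Hypothetical excerpt of a word index (word -> id)
word_index = {'2': 6, '8': 9}

# Invert it and map the two special ids to the empty string
idx2word = {idx: word for word, idx in word_index.items()}
idx2word[0] = ''    # padding
idx2word[40] = ''   # end of sequence

token_ids = [0, 0, 6, 9, 9, 6, 40]
decoded = ''.join(idx2word[token_id] for token_id in token_ids)
print(decoded)
```

Mapping the padding and end-of-sequence ids to empty strings means a padded prediction can be joined directly without filtering.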

The training data is transformed with this tokenizer.

In [13]:
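[X_trn, X_val, X_tst] = [tokenizer.texts_to_sequences(p) for p in (pairs_trn, pairs_val, pairs_tst)]
# Assumed targets, matching Y_trn[0] below: digit ids followed by 40 (_EOS)
[Y_trn, Y_val] = [[ids + [40] for ids in tokenizer.texts_to_sequences(
    [' '.join(str(num)) for num in nums])] for nums in (num_trn, num_val)]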

In [14]:
X_trn[0]

[17, 3, 19, 1, 4, 14, 17, 2, 6, 9, 9, 6]

In [16]:
Y_trn[0]

[6, 9, 9, 6, 40]

Afterwards, we pad all sequences to the same length.

In [17]:
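[X_trn, X_val, X_tst, Y_trn, Y_val] = [  # zero-pad on the left (the default) to MAXLEN
    sequence.pad_sequences(seqs, maxlen=MAXLEN) for seqs in (X_trn, X_val, X_tst, Y_trn, Y_val)]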

In [18]:
X_trn[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0, 17,  3, 19,  1,  4, 14, 17,  2,  6,
        9,  9,  6], dtype=int32)
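
The following sketch reproduces the left-padding in plain Python and shows why it matters here: once both sequences are right-aligned, the target at each position is the token that follows the input at that position. `pad_left` is a hypothetical helper written for this illustration; it mimics the Keras defaults (padding and truncation both at the start) rather than calling the library.

```python
MAXLEN = 20

def pad_left(ids, maxlen=MAXLEN, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen entries."""
    ids = ids[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids

x = pad_left([17, 3, 19, 1, 4, 14, 17, 2, 6, 9, 9, 6])  # X_trn[0] from above
y = pad_left([6, 9, 9, 6, 40])                          # Y_trn[0] from above

# Keep only the positions where a target is expected (non-zero y)
aligned = [(t, x_t, y_t) for t, (x_t, y_t) in enumerate(zip(x, y)) if y_t != 0]
for t, x_t, y_t in aligned:
    print('step {}: input {} -> target {}'.format(t, x_t, y_t))
```

At step 15 the input is id 2 (apparently the `_GO` marker) and the target is the first digit; at step 19 the input is the last digit and the target is id 40, the end-of-sequence marker.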

For text generation we use an LSTM on top of an embedding layer.

In [21]:
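# Hypothetical layer sizes (not given in the source); they must match any weights loaded later
EMBEDDING_DIM, HIDDEN_DIM = 32, 128
simple_seq2seq = Sequential([
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAXLEN, mask_zero=True),
    # return_sequences=True yields one prediction per time step
    LSTM(HIDDEN_DIM, return_sequences=True),
    #Dropout(0.2),
    # softmax over the token ids 0..MAX_NB_WORDS-1
    Dense(MAX_NB_WORDS, activation='softmax')
])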

simple_seq2seq.compile(optimizer=Adam(lr=LR), loss='sparse_categorical_crossentropy')

We now train this model on the tokenized data.

In [30]:
simple_seq2seq.optimizer.lr = 1e-4

In [22]:
simple_seq2seq.fit(X_trn, np.expand_dims(Y_trn, -1),
                             validation_data = (X_val, np.expand_dims(Y_val, -1)),
                             epochs = 5, batch_size = BATCH_SIZE)

Train on 10000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.contrib.keras.python.keras.callbacks.History at 0x7f502daf4f28>

We save the weights of the fitted model.

In [60]:
simple_seq2seq.save_weights('../models/french_num_weights.h5')

In [24]:
simple_seq2seq.load_weights('../../refereeReports/talks/deepLearning/models/french_num_weights.h5')

Now we check the model on a few examples.

In [25]:
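# Assumed check on the first test example: per-step argmax of the predicted distribution
pred_ids = simple_seq2seq.predict(X_tst[:1]).argmax(axis=-1)[0]
print("prediction token ids:", pred_ids)
print("predicted number:", ''.join(idx2word[token_id] for token_id in pred_ids))
print("test number:", num_tst[0])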

prediction token ids: [ 0  0  0  0  0  0  0  0  0  0  0  0  0  5  6 13 12 11 15 40]
predicted number: 329750
test number: 329750


Now we introduce a greedy translation.

In [32]:
def greedy_translate(model, source_sequence):
    """Greedy decoder recursively predicting one token at a time"""
    # Token id 2 is appended as the start-of-output marker (_GO)
    input_ids = tokenizer.texts_to_sequences([source_sequence])[0] + [2]

    # Prepare a fixed size numpy array that matches the expected input
    # shape for the model
    input_array = np.zeros(shape=(1, model.input_shape[1]),
                           dtype=np.int32)
    decoded_tokens = []
    while len(input_ids) <= MAXLEN:
        # Update input_array: right-align the ids, zeros stay on the left
        input_array[0, -len(input_ids):] = input_ids

        # Predict the next output: greedy decoding with argmax
        next_token_id = int(model.predict(input_array)[0, -1].argmax())

        # Stop decoding if the network predicts end of sentence:
        if next_token_id in [0, 40]:
            break

        # Otherwise append output to decoded tokens and input_ids
        decoded_tokens.append(next_token_id)
        input_ids.append(next_token_id)

    return ''.join([idx2word[token_id] for token_id in decoded_tokens])
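
The decoding loop can be tried without a trained network. The sketch below is a simplified, model-free version of the same idea: `toy_next_token_scores` is a hypothetical stand-in for `model.predict` that simply prefers to copy the source ids after the start marker and then to stop. Only the loop structure (score, take the argmax, stop on the end marker, otherwise append) mirrors the function above.

```python
GO, EOS = 2, 40  # start and end marker ids as used in this notebook

def toy_next_token_scores(input_ids):
    """Hypothetical scorer: favours copying the source, then ending the sequence."""
    go_pos = input_ids.index(GO)
    source, generated = input_ids[:go_pos], input_ids[go_pos + 1:]
    best = source[len(generated)] if len(generated) < len(source) else EOS
    scores = {token_id: 0.01 for token_id in (3, 4, 5, EOS)}
    scores[best] = 0.9
    return scores

def greedy_decode(next_token_scores, source_ids, max_len=20):
    input_ids = source_ids + [GO]
    decoded = []
    while len(input_ids) <= max_len:
        scores = next_token_scores(input_ids)
        next_id = max(scores, key=scores.get)  # greedy step: take the argmax
        if next_id == EOS:
            break
        # Feed the prediction back in as input for the next step
        decoded.append(next_id)
        input_ids.append(next_id)
    return decoded

result = greedy_decode(toy_next_token_scores, [3, 5, 4])
print(result)
```

Because every step commits to the single best token, an early mistake cannot be undone later; this is the limitation that beam search addresses.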



We test the greedy translation on example phrases.

In [33]:
phrases = [
    "un",
    "deux",
    "trois",
    "onze",
    "quinze",
    "cent trente deux",
    "cent mille douze",
    "sept mille huit cent cinquante neuf",
    "vingt et un",
    "vingt quatre",
    "quatre vingts",
    "quatre vingt onze mille",
    "quatre vingt onze mille deux cent deux",
]
for phrase in phrases:
    translation = greedy_translate(simple_seq2seq, phrase)
    print('{}: {}'.format(phrase, ''.join(translation)))

un: 0
deux: 20
trois: 3
onze: 100
quinze: 15
cent trente deux: 132
cent mille douze: 10102
sept mille huit cent cinquante neuf: 7859
vingt et un: 21
vingt quatre: 24
quatre vingts: 80
quatre vingt onze mille: 911
quatre vingt onze mille deux cent deux: 91202


## Model evaluation

Now we evaluate the quality of the model.

In [47]:
def phrase_accuracy(model, num_sequences, fr_sequences, n_samples=300,
                    decoder_func=greedy_translate):
    # Fraction of the first n_samples phrases whose decoded digits match exactly;
    # uses the model passed in rather than the global simple_seq2seq
    return np.mean([num_seq == decoder_func(model, fr_seq)
                    for _, num_seq, fr_seq in zip(range(n_samples), num_sequences, fr_sequences)])

In [49]:
phrase_accuracy(simple_seq2seq, num_tst, fr_tst, n_samples = 300)

0.94666666666666666
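
The metric is an exact-match rate: a phrase only counts as correct if every digit is right. A minimal sketch with a hypothetical lookup-table decoder in place of the network:

```python
def exact_match_accuracy(decoder, sources, targets):
    """Fraction of sources whose decoded string equals the target string exactly."""
    matches = [decoder(src) == tgt for src, tgt in zip(sources, targets)]
    return sum(matches) / len(matches)

# Hypothetical decoder that knows three phrases and gets one of them wrong
toy_decoder = {'un': '1', 'deux': '2', 'onze': '100'}.get

accuracy = exact_match_accuracy(
    toy_decoder,
    ['un', 'deux', 'onze', 'trois'],
    ['1', '2', '11', '3'])
print(accuracy)
```

Here two of the four phrases match, so a wrong translation and a missing one are penalised in the same way.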

Finally, we implement beam search.

In [50]:
# %load solutions/beam_search.py
def beam_translate(model, source_sequence,
                   word_level_source=True, word_level_target=True,
                   beam_size=10, return_ll=False):
    """Decode candidate translations with a beam search strategy
    
    If return_ll is False, only the best candidate string is returned.
    If return_ll is True, all the candidate strings and their loglikelihoods
    are returned.
    """

    # Prepare a fixed size numpy array that matches the expected input
    # shape for the model
    input_array = np.empty(shape=(beam_size, model.input_shape[1]),
                           dtype=np.int32)
    input_ids = tokenizer.texts_to_sequences([source_sequence])[0] + [2]
    
    
    # initialize loglikelihood, input token ids, decoded tokens for
    # each candidate in the beam
    candidates = [(0, input_ids[:], [], False)]
    
    while any([not done and (len(input_ids) < MAXLEN)
               for _, input_ids, _, done in candidates]):
        # Vectorize the list of input tokens and use zero padding.
        input_array.fill(0)
        for i, (_, input_ids, _, done) in enumerate(candidates):
            if not done:
                input_array[i, -len(input_ids):] = input_ids
        
        # Predict the next output in a single call to the model to amortize
        # the overhead and benefit from vector data parallelism on GPU.
        next_likelihood_batch = model.predict(input_array)
        
        # Build the new candidates list by summing the loglikelihood of the
        # next token with their parents for each new possible expansion.
        new_candidates = []
        for i, (ll, input_ids, decoded, done) in enumerate(candidates):
            if done:
                new_candidates.append((ll, input_ids, decoded, done))
            else:
                next_loglikelihoods = np.log(next_likelihood_batch[i, -1])
                for next_token_id, next_ll in enumerate(next_loglikelihoods):
                    new_ll = ll + next_ll
                    new_input_ids = input_ids[:]
                    new_input_ids.append(next_token_id)
                    new_decoded = decoded[:]
                    new_done = done
                    if next_token_id  in [0, 40]:
                        new_done = True
                    if not new_done:
                        new_decoded.append(next_token_id)
                    new_candidates.append(
                        (new_ll, new_input_ids, new_decoded, new_done))
        
        # Only keep a beam of the most promising candidates
        new_candidates.sort(reverse=True)
        candidates = new_candidates[:beam_size]

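    separator = " " if word_level_target else ""
    if return_ll:
        # Map token ids back to strings before joining; joining raw ints raises a TypeError
        return [(separator.join(idx2word[token_id] for token_id in decoded), ll)
                for ll, _, decoded, _ in candidates]
    else:
        _, _, decoded, done = candidates[0]
        return ''.join([idx2word[token_id] for token_id in decoded])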
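
To see what the beam buys over greedy decoding, the following self-contained sketch runs a stripped-down beam search over a hand-written toy distribution. `TOY_MODEL` and its probabilities are invented for illustration and have nothing to do with the trained network; the candidate bookkeeping (log-likelihood, decoded tokens, done flag) is a simplified version of the function above.

```python
import math

EOS = '_EOS'

# Hypothetical next-token distributions, keyed by the tokens decoded so far
TOY_MODEL = {
    (): {'A': 0.6, 'B': 0.4},
    ('A',): {'C': 0.55, 'D': 0.45},
    ('B',): {'C': 0.9, 'D': 0.1},
}

def next_token_probs(prefix):
    # Any prefix not listed above ends the sequence with certainty
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def beam_search(beam_size, max_len=5):
    # Each candidate: (log-likelihood, decoded tokens, done flag)
    candidates = [(0.0, [], False)]
    while any(not done and len(decoded) < max_len for _, decoded, done in candidates):
        new_candidates = []
        for ll, decoded, done in candidates:
            if done or len(decoded) >= max_len:
                new_candidates.append((ll, decoded, True))
                continue
            for token, p in next_token_probs(decoded).items():
                if token == EOS:
                    new_candidates.append((ll + math.log(p), decoded, True))
                else:
                    new_candidates.append((ll + math.log(p), decoded + [token], False))
        # Keep only the beam_size most likely candidates
        new_candidates.sort(key=lambda c: c[0], reverse=True)
        candidates = new_candidates[:beam_size]
    return candidates[0]

for k in (1, 2):
    ll, decoded, _ = beam_search(beam_size=k)
    print(k, decoded, round(math.exp(ll), 2))
```

With `beam_size=1` the search is greedy and returns `A C` with probability 0.33. With `beam_size=2` the less likely first token `B` is kept alive long enough for `B C` (probability 0.36) to overtake it.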

In [51]:
phrases = [
    "un",
    "deux",
    "trois",
    "onze",
    "quinze",
    "cent trente deux",
    "cent mille douze",
    "sept mille huit cent cinquante neuf",
    "vingt et un",
    "vingt quatre",
    "quatre vingts",
    "quatre vingt onze mille",
    "quatre vingt onze mille deux cent deux",
]
for phrase in phrases:
    translation = beam_translate(simple_seq2seq, phrase)
    print('{}: {}'.format(phrase, translation))

un: 
deux: 20
trois: 3
onze: 11
quinze: 15
cent trente deux: 132
cent mille douze: 10102
sept mille huit cent cinquante neuf: 7859
vingt et un: 21
vingt quatre: 24
quatre vingts: 80
quatre vingt onze mille: 911
quatre vingt onze mille deux cent deux: 91202
