## Exercise: Machine Translation

Build a sequence model for translation (from English to Italian). The model takes an English sequence as input and returns an Italian sequence as output:
1. Build the sequences and pad them for both languages (NB: the sequences must be padded to the maxlen across both languages)
2. Split the dataset into train and test sets with a test_size of 20%
3. Define a model with an embedding layer, at least two recurrent layers, and an output Dense layer whose number of neurons equals the vocabulary size of the target language (Italian)
4. NB: to improve the performance of the Dense layer, it helps to wrap it in a TimeDistributed layer, like this:
        TimeDistributed(Dense())
5. Train for at least 100 epochs

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("machine_translation.csv")

In [2]:
df.head()

Unnamed: 0,italian,english
0,tom portò i suoi.,tom brought his.
1,a te non piace il pesce?,don't you like fish?
2,non abbiamo mai riso.,we never laughed.
3,aspetti un momento.,hang on a moment.
4,quando è finito?,when did that end?


In [4]:
english_sentences = df.english.values
italian_sentences = df.italian.values

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(english_sentences,italian_sentences, test_size = 0.2, random_state=1)

In [6]:
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

eng_tokenizer = Tokenizer()
ita_tokenizer = Tokenizer()

eng_tokenizer.fit_on_texts(X_train)
ita_tokenizer.fit_on_texts(Y_train)


#sequences
eng_sequences_train = eng_tokenizer.texts_to_sequences(X_train)
ita_sequences_train = ita_tokenizer.texts_to_sequences(Y_train)
eng_sequences_test = eng_tokenizer.texts_to_sequences(X_test)
ita_sequences_test = ita_tokenizer.texts_to_sequences(Y_test)





In [7]:
maxlen_ita = len(max(ita_sequences_train,key=len))
maxlen_eng = len(max(eng_sequences_train,key=len))


#train padding
padded_eng_sentences_train = pad_sequences(eng_sequences_train, padding = 'pre', maxlen = maxlen_eng)
padded_ita_sentences_train = pad_sequences(ita_sequences_train, padding = 'pre', maxlen = maxlen_ita)


#test padding
padded_eng_sentences_test = pad_sequences(eng_sequences_test, padding = 'pre', maxlen = maxlen_eng)
padded_ita_sentences_test = pad_sequences(ita_sequences_test, padding = 'pre', maxlen = maxlen_ita)


In [8]:
italian_vocab_size = len(ita_tokenizer.word_index)+1
english_vocab_size = len(eng_tokenizer.word_index)+1
print("Max Italian sentence length: {}".format(padded_ita_sentences_train.shape[1]))
print("Max English sentence length: {}".format(padded_eng_sentences_train.shape[1]))
print("Italian vocabulary size: {}".format(italian_vocab_size))
print("English vocabulary size: {}".format(english_vocab_size))


Max Italian sentence length: 10
Max English sentence length: 6
Italian vocabulary size: 10005
English vocabulary size: 4932
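
The `+1` in the vocabulary sizes accounts for index 0, which the Keras `Tokenizer` never assigns to a word and which `pad_sequences` uses as its default padding value. The sketch below illustrates that indexing scheme without any dependencies. The helper `build_word_index` is hypothetical and only approximates the real tokenizer, which also filters punctuation.

```python
from collections import Counter

def build_word_index(sentences):
    """Assign indices 1..N to words, most frequent first; 0 stays free for padding."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_sentences = ["tom is awake", "tom is sick"]  # illustrative data, not from the dataset
word_index = build_word_index(toy_sentences)
vocab_size = len(word_index) + 1  # +1 for the padding index 0

print(word_index)
print(vocab_size)
```

The same reasoning explains why the `Embedding` layer and the output `Dense` layer are sized with `len(word_index) + 1` rather than `len(word_index)`.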


In [9]:
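# Re-pad the English inputs to the Italian length so that inputs and targets share the same number of time steps
tmp_x = pad_sequences(padded_eng_sentences_train, maxlen=maxlen_ita)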


In [10]:
tmp_x.shape

(40000, 10)
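
Cells 9 and 10 re-pad the English inputs from length 6 to length 10. The model emits one prediction per input time step, so inputs and targets need the same length. As a rough illustration of what `'pre'` padding does, here is a plain-Python sketch with a hypothetical helper `pad_pre`. It is not the Keras implementation, only a model of its default behavior.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`; keep the last `maxlen` items if longer."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # drop items from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

eng = [[4, 7], [1, 2, 3]]                  # toy token ids, not from the dataset
eng_len6 = pad_pre(eng, maxlen=6)          # first pass: English maxlen
eng_len10 = pad_pre(eng_len6, maxlen=10)   # second pass: Italian maxlen
print(eng_len10[0])
```

Padding twice on the left gives the same result as padding once to the final length, which is why re-padding the already padded array is harmless here.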

In [11]:
from keras.backend import clear_session
from keras.models import Sequential
from keras.layers import Embedding, GRU, LSTM, TimeDistributed, Dense, Bidirectional

clear_session()
model = Sequential()
model.add(Embedding(english_vocab_size, 128, input_length=maxlen_ita))
model.add(Bidirectional(LSTM(64, return_sequences=True, activation="tanh")))
model.add(Bidirectional(LSTM(64, return_sequences=True, activation="tanh")))
model.add(TimeDistributed(Dense(italian_vocab_size, activation="softmax")))
model.summary()





Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 10, 128)           631296    
                                                                 
 bidirectional (Bidirectiona  (None, 10, 128)          98816     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 10, 128)          98816     
 nal)                                                            
                                                                 
 time_distributed (TimeDistr  (None, 10, 10005)        1290645   
 ibuted)                                                         
                                                                 
Total params: 2,119,573
Trainable params: 2,119,573
Non-trainable params: 0
______________________________________________
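
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below works through the arithmetic with the standard formulas for embedding, LSTM, and dense layers. The variable names are illustrative.

```python
vocab_eng, vocab_ita = 4932, 10005   # vocabulary sizes printed earlier in the notebook
emb_dim, units = 128, 64

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

embedding = vocab_eng * emb_dim
bi_lstm_1 = 2 * lstm_params(emb_dim, units)      # forward + backward direction
bi_lstm_2 = 2 * lstm_params(2 * units, units)    # input is the concatenated 2*64 output
dense = (2 * units) * vocab_ita + vocab_ita      # weights + biases, shared across time steps

total = embedding + bi_lstm_1 + bi_lstm_2 + dense
print(embedding, bi_lstm_1, bi_lstm_2, dense, total)
```

The output layer accounts for more than half of the parameters, because its size scales with the Italian vocabulary.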

In [12]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])


In [13]:
model.fit(tmp_x, padded_ita_sentences_train, batch_size=512, epochs=180,validation_split=.20)

Epoch 1/180
...
Epoch 180/180


<keras.callbacks.History at 0x1353a2990>

In [14]:
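# Re-pad the English test inputs to the Italian length, as done for the training inputs
test_x = pad_sequences(padded_eng_sentences_test, maxlen=maxlen_ita)
# Returns [loss, accuracy] on the test set
model.evaluate(test_x, padded_ita_sentences_test)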



[0.9975460171699524, 0.8099499940872192]
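
Note that the accuracy of about 0.81 is averaged over all 10 time steps, including the padded positions, because the `Embedding` layer is not created with `mask_zero=True`. Since many target positions are padding, this figure likely overstates translation quality. The sketch below computes an accuracy that ignores padded targets. The helper `masked_accuracy` is hypothetical and works on plain nested sequences of token ids.

```python
def masked_accuracy(y_true, y_pred, pad_id=0):
    """Fraction of non-padding target tokens that were predicted correctly."""
    correct = total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t != pad_id:
                total += 1
                correct += int(t == p)
    return correct / total if total else 0.0

# Toy ids, not from the dataset: 6 of the 8 target positions are padding.
y_true = [[0, 0, 0, 5], [0, 0, 0, 9]]
y_pred = [[0, 0, 0, 5], [0, 0, 0, 2]]
plain = sum(t == p for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)) / 8
masked = masked_accuracy(y_true, y_pred)
print(plain, masked)
```

In this toy case the plain accuracy is 0.875 while only half of the real tokens are correct. With the notebook's arrays, one possible call would be `masked_accuracy(padded_ita_sentences_test, model.predict(test_x).argmax(-1))`.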

In [15]:
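emb_preds = model.predict(test_x)

def logits_to_text(logits, tokenizer):
    # Map the highest-scoring index at each time step back to its word (index 0 is padding).
    # The model ends in a softmax, so `logits` here are actually probabilities.
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])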

for i in range(0,20):
    print(X_test[i])
    print(logits_to_text(emb_preds[i],ita_tokenizer))
    print("----------------------------------------------------------\n")

tom is awake.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> tom è sveglio
----------------------------------------------------------

i just did it.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> io l'ho appena fatto
----------------------------------------------------------

tom is brilliant.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> tom è brillante
----------------------------------------------------------

allow me to help.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> deve aiutare
----------------------------------------------------------

we heard tom.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> sentito tom
----------------------------------------------------------

don't waste my time.
<PAD> <PAD> <PAD> <PAD> <PAD> si del i mio tempo
----------------------------------------------------------

you were sick.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> eri malata
----------------------------------------------------------

tom plays piano.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> tom a 