### This homework is adapted from a Notebook by Shaheel Khan (https://github.com/shaheelkhan/Word-Level-NMT/blob/main/eng_spa.ipynb). All credit goes to him!

Please do not worry if not all of this makes sense; just do your best and understand as much as you can -- wherever you start, you will learn something and that's great!

### Question 1 

Please read the Machine Translation chapter of our in-progress textbook and use bullet points to indicate 3 things that you learned and/or constructive suggestions.

### Your answer to Question 1

goes here

In [None]:
#import necessary libraries
import os
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import string
from string import digits

### Question 2

Please Google around and spend a few minutes reading about encoder/decoder models for machine translation, and indicate 3 things that you learned (it is okay if you are still confused).

### Your answer to Question 2

goes here

Download data here: http://www.manythings.org/anki/spa-eng.zip, unzip it, find the file "spa.txt" and put it in the same location as your Notebook.

In [14]:
#Data Path
data_path = "spa.txt"

#Number of samples to train
num_samples = 20000

lines = pd.read_table(data_path, names=['eng','spa',''])

In [15]:
lines.head()


Unnamed: 0,eng,spa,Unnamed: 3
0,Go.,Ve.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1,Go.,Vete.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
2,Go.,Vaya.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
3,Go.,Váyase.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
4,Hi.,Hola.,CC-BY 2.0 (France) Attribution: tatoeba.org #5...


In [16]:
lines.shape


(134736, 3)

In [17]:
def text_preprocess(text):

  #Remove apostrophes (single quotes)
  new_text = re.sub("'",'',text) #e.g. can't --> cant

  #Lower the cases
  new_text = new_text.lower()
  
  #Remove special characters
  nopunc = set(string.punctuation)
  new_text = [char for char in new_text if char not in nopunc]
  new_text = ''.join(new_text)

  #Remove numbers from the text
  removed_numbers = str.maketrans('','',digits)
  new_text = new_text.translate(removed_numbers)

  #Remove extra space
  new_text = new_text.strip()
  new_text = re.sub(' +',' ',new_text)

  return new_text

In [18]:
lines['eng'] = lines['eng'].apply(text_preprocess)
lines['spa'] = lines['spa'].apply(text_preprocess)
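To see what the cleaning step does, here is a small sketch that applies a condensed copy of `text_preprocess` to two made-up sentences (the sentences are illustrative and are not taken from the dataset). Note that `string.punctuation` only contains ASCII punctuation, so the Spanish `¿` is not removed.

```python
import re
import string
from string import digits

def text_preprocess(text):
    # Same steps as the notebook's function, condensed
    new_text = re.sub("'", '', text).lower()
    new_text = ''.join(ch for ch in new_text if ch not in set(string.punctuation))
    new_text = new_text.translate(str.maketrans('', '', digits))
    return re.sub(' +', ' ', new_text.strip())

print(text_preprocess("Can't stop, won't stop 24/7!"))  # cant stop wont stop
print(text_preprocess("¿Qué hora es?"))                  # ¿qué hora es
```

This is one reason a token such as `¿qué` can end up as a separate vocabulary entry from `qué`.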

In [19]:
#To indicate start of seq in target we'll use "START_ " and for end of the seq we'll use " _END"
lines['spa'] = lines['spa'].apply(lambda x: 'START_ '+ x +' _END')

In [20]:
lines.head()

Unnamed: 0,eng,spa,Unnamed: 3
0,go,START_ ve _END,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1,go,START_ vete _END,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
2,go,START_ vaya _END,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
3,go,START_ váyase _END,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
4,hi,START_ hola _END,CC-BY 2.0 (France) Attribution: tatoeba.org #5...


We'll create two sets, one for English (input) and another for Spanish (target). To these we'll add all the unique words from roughly the first 20,000 sentence pairs.



In [21]:
line_subset = lines[:num_samples+1]


In [22]:
#Get all the unique English words
eng_words = set()
for sentence in line_subset['eng']:
  for item in sentence.split():
    eng_words.add(item)

#Get all the unique Spanish words
spa_words = set()
for sentence in line_subset['spa']:
  for item in sentence.split():
    spa_words.add(item)

In [23]:
#Get the length (in words) of the longest sentence in each language
max_eng_length = max([len(txt) for txt in line_subset['eng'].apply(lambda x: x.split(' '))])
max_spa_length = max([len(txt) for txt in line_subset['spa'].apply(lambda x: x.split(' '))])
print(max_eng_length)
print(max_spa_length)

6
17


In [24]:
#Sort the input and target words and convert it to a list
input_words = sorted(list(eng_words))
target_words = sorted(list(spa_words))
num_encoder_tokens = len(eng_words)
num_decoder_tokens = len(spa_words)
print("Number of unique words in input(English):- ",num_encoder_tokens)
print("Number of unique words in target(Spanish):- ",num_decoder_tokens)

Number of unique words in input(English):-  3731
Number of unique words in target(Spanish):-  7839


### Question 3

Why do you think there might be more unique words in the Spanish version of our corpus compared to the English one?

### Your answer to Question 3

goes here

In [25]:
#For zero padding increase the count of decoder-token by 1
num_decoder_tokens += 1
num_decoder_tokens

7840

Vectorize the input and target words



In [26]:
#Create one dictionary for input words and one for target words, assigning each word an integer index (used later for one-hot encoding)
input_token_index = dict([(word,i) for i,word in enumerate(input_words)])
target_token_index = dict([(word,i) for i,word in enumerate(target_words)])
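As a rough illustration of how these lookup dictionaries behave, the sketch below builds one from a hypothetical four-word vocabulary (the names `toy_words`, `toy_index` and `toy_reverse` are made up for this example). Python sorts uppercase letters and the underscore before lowercase letters, so in this toy case `START_` gets index 0 and `_END` gets index 1.

```python
# Hypothetical four-word target vocabulary, for illustration only
toy_words = sorted({'hola', 've', 'START_', '_END'})
toy_index = {word: i for i, word in enumerate(toy_words)}

# The reverse dictionary maps an index back to its word
toy_reverse = {i: word for word, i in toy_index.items()}

print(toy_index)       # {'START_': 0, '_END': 1, 'hola': 2, 've': 3}
print(toy_reverse[2])  # hola
```

One thing to keep in mind: because indices start at 0, the word with index 0 shares its value with the zeros used for padding in the arrays created next.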

In [27]:
#Create 3 Numpy arrays:- encoder_input_data, decoder_input_data, decoder_target_data:
encoder_input_data = np.zeros((len(line_subset['eng']),max_eng_length),dtype='float32')
decoder_input_data = np.zeros((len(line_subset['spa']),max_spa_length),dtype='float32')
decoder_target_data = np.zeros((len(line_subset['spa']),max_spa_length,num_decoder_tokens),dtype='float32')
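The one-hot target array is much larger than the other two. Here is a back-of-the-envelope sketch of the memory each array needs, assuming `line_subset` has 20001 rows (`num_samples + 1`) and using the lengths and token counts printed above. The helper `array_bytes` is a name invented for this example.

```python
import math
import numpy as np

def array_bytes(shape, dtype='float32'):
    # Number of elements times bytes per element
    return math.prod(shape) * np.dtype(dtype).itemsize

# Assumed values: 20001 rows, longest sentences of 6 and 17 words,
# and 7840 decoder tokens (taken from the outputs printed earlier)
n_rows, max_eng, max_spa, n_dec = 20001, 6, 17, 7840

for name, shape in [('encoder_input_data', (n_rows, max_eng)),
                    ('decoder_input_data', (n_rows, max_spa)),
                    ('decoder_target_data', (n_rows, max_spa, n_dec))]:
    print(name, shape, round(array_bytes(shape) / 1e9, 3), 'GB')
```

Under these assumptions the target array comes to roughly 10.7 GB, which is worth knowing if your machine runs short of memory.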

In [49]:
encoder_input_data

array([[1.377e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [1.377e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [1.377e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [1.377e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [1.541e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [2.735e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [2.735e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [2.735e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [2.735e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [2.735e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [3.617e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [3.689e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [9.870e+02, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [1.212e+03, 0.000e

In [50]:
decoder_input_data

array([[0.000e+00, 6.983e+03, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [0.000e+00, 7.072e+03, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [0.000e+00, 6.977e+03, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [0.000e+00, 7.221e+03, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
       [0.000e+00, 3.592e+03, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
    

### Question 4

What are these arrays? What is in them? Why are they sized/shaped the way they are?

### Your answer to Question 4

goes here

In [28]:
for i, (input_text,target_text) in enumerate(zip(line_subset['eng'],line_subset['spa'])):
  
  for t, word in enumerate(input_text.split()):
    encoder_input_data[i,t] = input_token_index[word]
  
  for t, word in enumerate(target_text.split()):
    decoder_input_data[i,t] = target_token_index[word]

    if t > 0:
      decoder_target_data[i, t-1, target_token_index[word]] = 1.
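The `t-1` in the loop above means the target is the decoder input shifted one step to the left: at each position the model is asked to predict the next word. A minimal sketch with a hypothetical three-word vocabulary (all names here are illustrative):

```python
import numpy as np

# Hypothetical toy vocabulary and sizes, for illustration only
toy_index = {'START_': 0, '_END': 1, 've': 2}
max_len, vocab_size = 4, len(toy_index)

dec_in = np.zeros((1, max_len), dtype='float32')
dec_target = np.zeros((1, max_len, vocab_size), dtype='float32')

for t, word in enumerate('START_ ve _END'.split()):
    dec_in[0, t] = toy_index[word]
    if t > 0:
        # The target at step t-1 is the word the decoder reads at step t
        dec_target[0, t - 1, toy_index[word]] = 1.

# Recover the word indices stored in the one-hot rows (skip all-zero padding rows)
target_ids = [int(row.argmax()) for row in dec_target[0] if row.sum() > 0]
print(dec_in[0])    # [0. 2. 1. 0.]
print(target_ids)   # [2, 1]
```

So the decoder reads `START_ ve _END` and is trained to output `ve _END`.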

In [29]:
from keras.layers import Input,LSTM,Embedding,Dense
from keras.models import Model
embedding_size = 80


### Encoder model

In [30]:

encoder_inputs = Input(shape=(None,))
encoder_embed = Embedding(num_encoder_tokens,embedding_size)(encoder_inputs)
encoder_lstm = LSTM(embedding_size,return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embed)
encoder_states = [state_h,state_c]

### Decoder model

In [31]:
decoder_inputs = Input(shape=(None,))
decoder_embed_layer = Embedding(num_decoder_tokens,embedding_size)
decoder_embed = decoder_embed_layer(decoder_inputs)

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(units=embedding_size,return_sequences=True,return_state=True)
decoder_outputs,_,_ = decoder_lstm(decoder_embed,initial_state = encoder_states)
decoder_dense = Dense(num_decoder_tokens,activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [32]:
#Define the model and compile
model = Model([encoder_inputs,decoder_inputs],decoder_outputs)
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 80)     298480      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 80)     627200      input_2[0][0]                    
______________________________________________________________________________________________
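The `Param #` column can be checked by hand. The sketch below reproduces the two embedding counts shown in the summary and, using the standard formulas for LSTM and Dense layers, estimates the counts for the rows that are cut off above (those last two numbers are therefore not verified against this printout). The function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

print(embedding_params(3731, 80))  # 298480, matches `embedding`
print(embedding_params(7840, 80))  # 627200, matches `embedding_1`
print(lstm_params(80, 80))         # 51520 per LSTM layer
print(dense_params(80, 7840))      # 635040 for the softmax layer
```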

### You can see Shaheel's original notebook for a visualization of the model (https://github.com/shaheelkhan/Word-Level-NMT/blob/main/eng_spa.ipynb) -- please take a look!

### Question 5

What, in your best understanding, does the visual of the model (on Shaheel's github) represent?

### Your answer to Question 5

goes here

### Train the model

In [38]:

batch_size = 128
epochs = 100

### This code takes a long time to run! (an hour??) -- feel free to take a well-deserved break 

In [39]:
model.fit([encoder_input_data,decoder_input_data],decoder_target_data,batch_size=batch_size,epochs=epochs,validation_split=0.1)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x164d05700>

### Save the weights

In [40]:
model.save_weights('eng_spa_weights.h5')


In [124]:

#Inference model

encoder_model = Model(encoder_inputs,encoder_states)

decoder_state_input_h = Input(shape=(embedding_size,))
decoder_state_input_c = Input(shape=(embedding_size,))
decoder_state_inputs = [decoder_state_input_h,decoder_state_input_c]

# Get the embeddings of the decoder sequence
decoder_embed2 = decoder_embed_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(decoder_embed2,initial_state=decoder_state_inputs)
decoder_states2 = [state_h2,state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model([decoder_inputs]+decoder_state_inputs, [decoder_outputs2]+decoder_states2)

In [125]:
#Reverse-lookup token index to turn sequences back to words
reverse_input_token_index = dict((i,word) for word, i in input_token_index.items())
reverse_target_token_index = dict((i,word) for word, i in target_token_index.items())

Function (by LG not Shaheel) that turns text into a vector.


In [126]:
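def encode_text_to_seq(input_text):
    # One row of word indices, zero-padded to the longest English training sentence
    array = np.zeros(shape=(1,max_eng_length), dtype=float, order='F')
    # Words beyond max_eng_length are dropped so the index never runs past the array
    for t, word in enumerate(input_text.split()[:max_eng_length]):
        try:
            array[0,t] = input_token_index[word]
        except KeyError:
            # Unknown words are left as 0
            print("ERROR never seen this word")
    return array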

In [127]:
encode_text_to_seq("i love school")

array([[1639., 1966., 2791.,    0.,    0.,    0.]])

In [128]:
encode_text_to_seq("you are trying to learn")

array([[3715.,  156., 3409., 3334., 1871.,    0.]])

In [129]:
encode_text_to_seq("i love linguistics")

ERROR never seen this word


array([[1639., 1966.,    0.,    0.,    0.,    0.]])

### Question 6

What does this function do?

### Your answer to Question 6

goes here

In [130]:
def decode_sequence(input_text):
    
  input_seq = encode_text_to_seq(input_text)
  encoder_model = Model(encoder_inputs,encoder_states)

  #Encode the input as state vectors
  states_value = encoder_model.predict(input_seq)

  #Generate empty target sequence of length 1 with only the start token
  target_seq = np.zeros((1,1))
  target_seq[0,0] = target_token_index['START_']

  #Loop, producing one output word at a time, until we receive a stop sign
  stop_condition = False
  decoded_sentence = ""

  while not stop_condition:
    output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

    # Pick the next word greedily: np.argmax returns the index of the
    # highest-scoring token at the last time step
    sampled_token_index = np.argmax(output_tokens[0, -1, :])

    # Get the word belonging to the token
    sampled_word = reverse_target_token_index[sampled_token_index]

    # Append word to decoded sequence
    decoded_sentence += ' '+sampled_word

    # check for the exit condition: either hitting max length (counted in words)
    # or predicting the 'stop' token
    if (sampled_word == '_END') or len(decoded_sentence.split()) >= max_spa_length:
      stop_condition = True

    #Update the target sequence
    target_seq = np.zeros((1,1))
    target_seq[0,0] = sampled_token_index

    #Update the state vectors
    states_value = [h, c]
  
  return decoded_sentence
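The loop in `decode_sequence` is a form of greedy decoding: at each step it keeps only the single highest-scoring word and feeds it back in. The sketch below shows the same loop without Keras, using a hypothetical four-word vocabulary and a made-up score table in place of the trained decoder. Unlike the real model, this toy version has no LSTM state; the next word depends only on the current one.

```python
import numpy as np

# Hypothetical toy vocabulary and a made-up "model": row i holds the scores
# for the word that follows word i. A real decoder also carries LSTM state.
vocab = ['START_', '_END', 'hola', 'tom']
next_word_scores = np.array([
    [0.0, 0.1, 0.8, 0.1],    # after START_, "hola" scores highest
    [0.0, 1.0, 0.0, 0.0],    # after _END (never reached)
    [0.0, 0.2, 0.1, 0.7],    # after hola, "tom" scores highest
    [0.0, 0.9, 0.05, 0.05],  # after tom, "_END" scores highest
])

def greedy_decode(max_words=10):
    current = vocab.index('START_')
    words = []
    while True:
        # Keep only the highest-scoring next word
        current = int(np.argmax(next_word_scores[current]))
        word = vocab[current]
        # Stop on the end token or once max_words have been produced
        if word == '_END' or len(words) >= max_words:
            break
        words.append(word)
    return ' '.join(words)

print(greedy_decode())  # hola tom
```

Greedy decoding is simple, but it can never go back and revise an early word choice, which is one source of the mistakes you may see when you try your own sentences.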

In [132]:
sents = [
    "hi how are you",
    "my name is tom",
    "i love school"
]

for sent in sents:
    decoded_sentence = decode_sequence(sent)
    print('-')
    print('Input sentence:', sent)
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: hi how are you
Decoded sentence:  ¿qué _END
-
Input sentence: my name is tom
Decoded sentence:  mi me es tom _END
-
Input sentence: i love school
Decoded sentence:  me gusta la _END


### Question 7

Try out a few more sentences (by editing the list "sents") and explore this tool. How well does it work? Where does it make mistakes?

### Your answer to Question 7 

goes here

### Question 8

Please explain what has happened in this homework, as if you are talking to a non-expert.

### Your answer to Question 8

goes here