## Step 0 :- Importing the necessary libraries

In [1]:
import numpy as np
import pandas as pd
from random import shuffle
import os
import re
from keras.models import Sequential
from keras.layers import Bidirectional, Dense, CuDNNLSTM, RepeatVector, TimeDistributed, BatchNormalization, Embedding
from keras.optimizers import *
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Using TensorFlow backend.


## Step 1 :- Importing the dataset
<div class="alert alert-block alert-success">
    The dataset is a single txt file. We read it in a form that makes it easy to separate each individual English-to-Polish translation pair. <br><br>
<b>The dataset is downloaded from https://www.manythings.org/anki/ </b>
</div>


In [2]:
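input_file_path     = r'D:\kaggle_trials\polish_to_english'+'\\pol.txt'
# Read the whole file; the context manager closes it even if reading fails
with open(input_file_path, mode='rt', encoding='utf-8') as file:
    text_to_read    = file.read()
lines               = text_to_read.strip().split('\n')
# Each line holds tab-separated fields: the English sentence, then the Polish one
text_translate      = [line.split('\t') for line in lines]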
print(text_translate[1:10])

[['Hi.', 'Cześć.'], ['Run!', 'Uciekaj!'], ['Run.', 'Biegnij.'], ['Run.', 'Uciekaj.'], ['Who?', 'Kto?'], ['Wow!', 'O, dziamdzia zaprzała jej szadź!'], ['Wow!', 'Łał!'], ['Help!', 'Pomocy!'], ['Jump.', 'Skok.']]


## Step 2 :- Preprocessing of the text data
<div class="alert alert-block alert-success">
<b>Step 1:</b> Split each sentence into words and remove all special characters. <br>
<b>Step 2:</b> Remove digits from the text.<br>
<b>Step 3:</b> Lower-case all the text.<br><br>
Polish uses some letters that do not exist in English. The Polish alphabet has 32 letters: it leaves out the English letters Q, V and X and adds 9 letters with diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). The regular expression used below keeps only A-Z and a-z, so these 9 letters are deleted outright rather than mapped to their base letters (for example, 'Cześć' becomes 'cze'). A commented-out line can be uncommented to keep the lower-case diacritic letters instead.
</div>
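
If the goal is to drop the diacritic marks but keep the base letters, one option is Unicode decomposition. The sketch below is not part of the original pipeline: `strip_diacritics` and `clean_sentence` are hypothetical helpers, and the explicit replacement of ł is an assumption needed because that letter has no Unicode decomposition.

```python
import re
import unicodedata

def strip_diacritics(text):
    """Map Polish diacritic letters to their base ASCII letters."""
    # 'ł' / 'Ł' do not decompose under NFKD, so handle them explicitly
    text = text.replace('ł', 'l').replace('Ł', 'L')
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks that NFKD split off from the base letters
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_sentence(text):
    """Strip diacritics, remove non-letters, and lower-case a sentence."""
    words = [re.sub('[^A-Za-z]+', '', word) for word in strip_diacritics(text).split()]
    return ' '.join(word.lower() for word in words if word)

print(clean_sentence('Cześć.'))
print(clean_sentence('Łał!'))
```

With this variant 'Cześć.' is kept as a five-letter word instead of being cut down to 'cze', at the cost of merging letters such as 'ł' and 'l'.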



In [3]:
clean_text_translated     = []
for text_pair in text_translate:
    # The first field is the English sentence, the second the Polish one
    eng_text_split        = text_pair[0].split()
    pol_text_split        = text_pair[1].split()
    
    # Keep only ASCII letters; this also drops digits, punctuation and Polish diacritic letters
    eng_text_punc_removed = ' '.join([re.sub('[^A-Za-z]+', '', word) for word in eng_text_split])
    pol_text_punc_removed = ' '.join([re.sub('[^A-Za-z]+', '', word) for word in pol_text_split])
#     pol_text_punc_removed = ' '.join([re.sub('[^A-Za-zćąęńóśźżł]+', '', word) for word in pol_text_split])
    
    # Lower-case and drop words that became empty after filtering
    eng_text_nums_removed = ' '.join([word.lower() for word in eng_text_punc_removed.split() if word.isalpha()])
    pol_text_nums_removed = ' '.join([word.lower() for word in pol_text_punc_removed.split() if word.isalpha()])
    
    clean_text_translated.append([eng_text_nums_removed,pol_text_nums_removed])
clean_text_translated     = np.array(clean_text_translated)
print('Punctuation removed and all sentences converted to lower case.')

Punctuation removed and all sentences converted to lower case.


In [4]:
print('The clean text translations at indices 1 to 9 are shown below :-\n ',clean_text_translated[1:10])
print('\n')
print('The clean text translations for last 3 translations are shown below :- \n',clean_text_translated[-3:])
print('\n')

The clean text translations at indices 1 to 9 are shown below :-
  [['hi' 'cze']
 ['run' 'uciekaj']
 ['run' 'biegnij']
 ['run' 'uciekaj']
 ['who' 'kto']
 ['wow' 'o dziamdzia zaprzaa jej szad']
 ['wow' 'a']
 ['help' 'pomocy']
 ['jump' 'skok']]


The clean text translations for last 3 translations are shown below :- 
 [['since there are usually multiple websites on any given topic i usually just click the back button when i arrive on any webpage that has popup advertising i just go to the next page found by google and hope for something less irritating'
  'zwykle jest wiele stron internetowych na kady temat wic kiedy trafiam na stron z popupami najczciej wciskam guzik wstecz i id do nastpnej strony znalezionej przez googlea by trafi na co mniej denerwujcego']
 ['if you want to sound like a native speaker you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly a

## Step 3:- Train test split 
<div class="alert alert-block alert-success">
<b>Split :</b> We use 80% of the data for training and the remaining 20% for testing. Because of memory constraints on my system, we work with only the first 4000 translations of the dataset.
</div>
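
A reproducible way to sketch the 80/20 split is to shuffle row indices with a seeded NumPy generator and slice at the boundary. This is an illustration under assumptions, not the notebook's own cell: `shuffle_and_split`, `train_fraction` and `seed` are hypothetical names.

```python
import numpy as np

def shuffle_and_split(pairs, train_fraction=0.8, seed=0):
    """Shuffle the rows of a 2-D array and split them into train and test parts."""
    rng = np.random.RandomState(seed)
    # Permuting indices keeps every row intact and exactly once
    order = rng.permutation(len(pairs))
    shuffled = pairs[order]
    boundary = int(train_fraction * len(shuffled))
    # Slicing at the same boundary on both sides uses every row
    return shuffled[:boundary], shuffled[boundary:]

demo_pairs = np.array([['hi', 'cze'], ['run', 'uciekaj'], ['who', 'kto'],
                       ['help', 'pomocy'], ['jump', 'skok']])
demo_train, demo_test = shuffle_and_split(demo_pairs)
print(len(demo_train), len(demo_test))
```

Fixing the seed makes the split repeatable between runs, which helps when comparing models.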

In [5]:
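clean_text_translated = clean_text_translated[:4000]
# Use NumPy's shuffle: random.shuffle swaps rows through views and can
# duplicate or drop rows of a 2-D array
np.random.shuffle(clean_text_translated)
splitting_boundary   = int(0.8*len(clean_text_translated))
# Start the test slice at the boundary itself so that no row is skipped
# (the recorded run below sliced from splitting_boundary+1, hence its 799 validation samples)
train, test          = clean_text_translated[:splitting_boundary], clean_text_translated[splitting_boundary:]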

## Step 4 :- Tokenizing and Encoding 
<div class="alert alert-block alert-success">
<b>Tokenizers:</b> We tokenize the English and Polish text separately because the two languages have different vocabularies and structures.  <br><br>
    <b> Encoders: </b> We encode the Polish and English sentences as padded integer sequences using the fitted tokenizers. <br><br>
    <b> One hot encode : </b> We one-hot encode the target data. <br><br>
    Since we are translating Polish to English, the English sentences are the target data, and they are one-hot encoded over the English vocabulary.
    
   
</div>
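
To make the three stages concrete, here is a small NumPy-only sketch of what tokenizing, padding and one-hot encoding produce. It is an illustration, not the Keras implementation: `build_word_index`, `encode_and_pad` and `one_hot` are hypothetical helpers, and the tie-breaking between equally frequent words may differ from Keras.

```python
from collections import Counter
import numpy as np

def build_word_index(sentences):
    """Assign indices 1..N by descending word frequency; 0 is reserved for padding."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_and_pad(sentences, word_index, length):
    """Turn sentences into integer rows, zero-padded on the right to `length`."""
    encoded = np.zeros((len(sentences), length), dtype=int)
    for row, sentence in enumerate(sentences):
        ids = [word_index[w] for w in sentence.split() if w in word_index]
        # Keep the last `length` ids, like pad_sequences' default truncating='pre'
        ids = ids[-length:]
        encoded[row, :len(ids)] = ids
    return encoded

def one_hot(encoded, vocab_size):
    """Expand each integer into a one-hot vector of size vocab_size."""
    return np.eye(vocab_size, dtype=np.float32)[encoded]

sentences = ['i hate tom', 'i want to sleep']
word_index = build_word_index(sentences)
vocab_size = len(word_index) + 1          # +1 for the padding index 0
padded = encode_and_pad(sentences, word_index, length=5)
targets = one_hot(padded, vocab_size)
print(padded.shape, targets.shape)
```

The one-hot tensor has one extra axis of size `vocab_size`, which is why memory use grows quickly with vocabulary size and sentence count.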

In [6]:
eng_tokenizer       = Tokenizer()
eng_tokenizer.fit_on_texts(clean_text_translated[:,0])
# +1 because index 0 is reserved for padding
eng_vocab           = len(eng_tokenizer.word_index) + 1
eng_length          = max(len(sentence.split()) for sentence in clean_text_translated[:,0])
print('English vocabulary size (distinct words + 1 for the padding index):', eng_vocab)
print('Maximum English sentence length in words:', eng_length)

English vocabulary size (distinct words + 1 for the padding index): 1141
Maximum English sentence length in words: 5


In [7]:
pol_tokenizer       = Tokenizer()
pol_tokenizer.fit_on_texts(clean_text_translated[:,1])
# +1 because index 0 is reserved for padding
pol_vocab           = len(pol_tokenizer.word_index) + 1
pol_length          = max(len(sentence.split()) for sentence in clean_text_translated[:,1])
print('Polish vocabulary size (distinct words + 1 for the padding index):', pol_vocab)
print('Maximum Polish sentence length in words:', pol_length)

Polish vocabulary size (distinct words + 1 for the padding index): 1671
Maximum Polish sentence length in words: 8


In [8]:
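def encode_sequences(tokenizer, length, lines):
    # Map each sentence to a list of word indices
    X = tokenizer.texts_to_sequences(lines)
    # Right-pad with zeros so that every row has exactly `length` entries
    X = pad_sequences(X, maxlen=length, padding='post')
    return X
def encode_output(sequences, vocab_size):
    # One-hot encode every time step of every target sequence
    y_list = []
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        y_list.append(encoded)
    y = np.array(y_list)
    # Final shape: (number of sentences, sequence length, vocab_size)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y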

In [9]:
# Training data
train_X = encode_sequences(pol_tokenizer, pol_length, train[:, 1])
train_Y = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
train_Y = encode_output(train_Y, eng_vocab)

In [10]:
# Test data
test_X = encode_sequences(pol_tokenizer, pol_length, test[:, 1])
test_Y = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
test_Y = encode_output(test_Y, eng_vocab)

## Step 5 :- Building the model
<div class="alert alert-block alert-success">
<b>Bidirectional Encoder:</b> A bidirectional LSTM encoder reads the embedded Polish sentence and compresses it into a single fixed-length vector. <br>
    <b> RepeatVector: </b> We repeat the encoder's output vector once per English time step, so the decoder receives it as input at every step. <br>
    <b> Bidirectional Decoder: </b> A bidirectional LSTM decoder returns one hidden state per time step, and a time-distributed dense softmax layer maps each state to a probability distribution over the English vocabulary.
</div>
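
The repeat step is the bridge between encoder and decoder, so it helps to see its effect on shapes. The sketch below mimics it in plain NumPy; `repeat_vector` is a hypothetical stand-in for illustration, and the sizes (1200 features, 5 steps) are taken from the model in this notebook.

```python
import numpy as np

def repeat_vector(encoded, n_steps):
    """Copy each sentence vector n_steps times: (batch, features) -> (batch, n_steps, features)."""
    return np.repeat(encoded[:, np.newaxis, :], n_steps, axis=1)

# Two fake sentence vectors, as a bidirectional encoder with 2 x 600 units would emit
encoder_output = np.random.rand(2, 1200)
decoder_input = repeat_vector(encoder_output, n_steps=5)
print(decoder_input.shape)
```

Every decoder time step sees the same vector, so the order of the output words has to be produced entirely by the decoder's recurrent state.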

In [11]:
model  = Sequential()
# Encoder: embed the Polish word indices, then summarise the sentence in one vector
model.add(Embedding(pol_vocab, 20, input_length=pol_length))
model.add(BatchNormalization())
model.add(Bidirectional(CuDNNLSTM(600)))
model.add(BatchNormalization())
# Feed the sentence vector to the decoder at each of the eng_length output steps
model.add(RepeatVector(eng_length))
model.add(Bidirectional(CuDNNLSTM(600, return_sequences=True)))
model.add(BatchNormalization())
# Per-step softmax over the English vocabulary
model.add(TimeDistributed(Dense(eng_vocab, activation='softmax')))
model.compile(optimizer=Adam(lr=0.005), loss='categorical_crossentropy', metrics = ['accuracy'])
model.summary()

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 8, 20)             33420     
_________________________________________________________________
batch_normalization_1 (Batch (None, 8, 20)             80        
_________________________________________________________________
bidirectional_1 (Bidirection (None, 1200)              2985600   
_________________________________________________________________
batch_normalization_2 (Batch (None, 1200)              4800      
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 5, 1200)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 5, 1200)           8649600   
_________________________________________________________________
batc
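
The parameter counts in the summary can be cross-checked by hand. The sketch assumes the CuDNNLSTM layout of four gates, each with an input kernel, a recurrent kernel and two bias vectors; `cudnn_lstm_params` is a made-up helper name, and the sizes come from the cells above.

```python
def cudnn_lstm_params(input_dim, units):
    """Weights of one CuDNNLSTM direction: 4 gates x (input kernel + recurrent kernel + 2 biases)."""
    return 4 * units * (input_dim + units + 2)

pol_vocab, embedding_dim, units = 1671, 20, 600

embedding_params = pol_vocab * embedding_dim                  # one vector per Polish index
batch_norm_1_params = 4 * embedding_dim                       # gamma, beta, moving mean, moving variance
encoder_params = 2 * cudnn_lstm_params(embedding_dim, units)  # forward + backward direction
batch_norm_2_params = 4 * (2 * units)
decoder_params = 2 * cudnn_lstm_params(2 * units, units)      # decoder input is the 1200-wide vector

print(embedding_params, batch_norm_1_params, encoder_params,
      batch_norm_2_params, decoder_params)
```

These values agree with the 33420, 80, 2985600, 4800 and 8649600 reported in the summary rows that are visible above.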

## Step 6 :- Creating Checkpoints and getting outputs 

In [12]:
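# Note: the test set doubles as validation data here, so val_acc drives all three callbacks
reduce_lr  = ReduceLROnPlateau(monitor='val_acc', factor=0.02,verbose=1,
                              patience=5, min_lr=0.0001)
es         = EarlyStopping(monitor='val_acc', patience=15, verbose=1, mode='auto', baseline=None, 
                          restore_best_weights=True)
# Build the checkpoint path portably and make sure the folder exists
chkpt_dir  = os.path.join(os.getcwd(), 'chkpts')
os.makedirs(chkpt_dir, exist_ok=True)
filepath   = os.path.join(chkpt_dir, "weights-improvement-{epoch:02d}-{loss:.2f}.hdf5")
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='auto')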
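
With these settings the learning rate can effectively be reduced only once: 0.005 × 0.02 = 0.0001, which already equals `min_lr`. A minimal sketch of that arithmetic, assuming the reduction rule is `max(lr * factor, min_lr)`; `next_lr` is a hypothetical helper, not a Keras function.

```python
def next_lr(lr, factor=0.02, min_lr=0.0001):
    """Learning rate after one plateau, clipped from below at min_lr."""
    return max(lr * factor, min_lr)

lr_start = 0.005
lr_after_first_plateau = next_lr(lr_start)
lr_after_second_plateau = next_lr(lr_after_first_plateau)
print(lr_after_first_plateau, lr_after_second_plateau)
```

A larger factor (for example 0.2 to 0.5) would give several intermediate steps before the floor is reached; whether that helps here is untested.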

In [13]:
model.fit(train_X, train_Y, epochs=120, batch_size=128, validation_data=(test_X, test_Y), 
          callbacks        = [es,reduce_lr,checkpoint])

Instructions for updating:
Use tf.cast instead.
Train on 3200 samples, validate on 799 samples
Epoch 1/120

Epoch 00001: val_acc improved from -inf to 0.00025, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Machine translation Seq2Seq\chkpts\weights-improvement-01-4.54.hdf5
Epoch 2/120

Epoch 00002: val_acc improved from 0.00025 to 0.01176, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Machine translation Seq2Seq\chkpts\weights-improvement-02-1.65.hdf5
Epoch 3/120

Epoch 00003: val_acc improved from 0.01176 to 0.01927, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Machine translation Seq2Seq\chkpts\weights-improvement-03-0.90.hdf5
Epoch 4/120

Epoch 00004: val_acc improved from 0.01927 to 0.02003, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Machine translation Seq2Seq\chkpts\weights-improvement-04-0.60.hdf5
Epoch 5/120

Epoch 00005: val_acc improved from 0.02003 to 0.10013, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Machine translation Seq2Seq\chkpts\weight


Epoch 00030: val_acc did not improve from 0.73016
Epoch 31/120

Epoch 00031: val_acc did not improve from 0.73016
Epoch 32/120

Epoch 00032: val_acc did not improve from 0.73016
Epoch 33/120

Epoch 00033: val_acc did not improve from 0.73016
Epoch 34/120

Epoch 00034: val_acc did not improve from 0.73016
Epoch 35/120

Epoch 00035: val_acc did not improve from 0.73016
Epoch 36/120

Epoch 00036: val_acc did not improve from 0.73016
Epoch 37/120

Epoch 00037: val_acc did not improve from 0.73016
Epoch 38/120

Epoch 00038: val_acc did not improve from 0.73016
Epoch 39/120
Restoring model weights from the end of the best epoch

Epoch 00039: val_acc did not improve from 0.73016
Epoch 00039: early stopping


<keras.callbacks.History at 0x218820cf048>

## Step 7 :- Getting Predictions 
Finally, we use the trained model to generate predictions.

In [14]:
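def word_int(integer, tokenizer):
    # Reverse lookup: return the word stored under this index,
    # or None if there is none (e.g. the padding index 0)
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None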

In [15]:
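def predict_sequence(model, tokenizer, value):
    # Probabilities for one sentence: shape (eng_length, eng_vocab)
    prediction = model.predict(value, verbose=0)[0]
    # Most likely word index at each time step
    integers = [np.argmax(vector) for vector in prediction]
    target = []
    for i in integers:
        word = word_int(i, tokenizer)
        # Index 0 (padding) has no word, which marks the end of the sentence
        if word is None:
            break
        target.append(word)
    return ' '.join(target)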
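
The decoding logic can be tried without a trained model. The sketch below applies the same argmax-then-lookup idea to a hand-made probability matrix; `decode_prediction` and `index_word` are hypothetical names, and the reverse dictionary replaces the linear search with a constant-time lookup.

```python
import numpy as np

def decode_prediction(probabilities, index_word):
    """Pick the most likely index per time step and stop at the first padding index."""
    words = []
    for step in probabilities:
        index = int(np.argmax(step))
        if index not in index_word:   # index 0 (padding) has no word
            break
        words.append(index_word[index])
    return ' '.join(words)

# Reverse mapping built once from a toy word_index
toy_word_index = {'i': 1, 'hate': 2, 'tom': 3}
index_word = {index: word for word, index in toy_word_index.items()}

# Four time steps over a vocabulary of size 4 (index 0 = padding)
toy_probabilities = np.array([[0.1, 0.7, 0.1, 0.1],
                              [0.1, 0.1, 0.7, 0.1],
                              [0.1, 0.1, 0.1, 0.7],
                              [0.9, 0.05, 0.03, 0.02]])
print(decode_prediction(toy_probabilities, index_word))
```

Building the reverse dictionary once avoids scanning the whole vocabulary for every predicted word.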

In [17]:
value

array([[47, 31, 37,  0,  0,  0,  0,  0]])

In [21]:
test_X[798]

array([47, 31, 37,  0,  0,  0,  0,  0])

In [22]:
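source    = []
target    = []
predicted = []
# Translate the first 30 test sentences
for i in range(30):
    value = test_X[i]
    # Add a batch dimension: (pol_length,) -> (1, pol_length)
    value = value.reshape((1, value.shape[0]))
    translation = predict_sequence(model, eng_tokenizer, value)
    # Each row of test is [English, Polish]
    val1, val2 = test[i]
    target.append(val1)
    source.append(val2)
    predicted.append(translation)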

## Displaying the translation results 

In [30]:
finaldf = pd.DataFrame(columns=['Source','Target','Predicted'])
finaldf['Source'] = source
finaldf['Target'] = target
finaldf['Predicted'] = predicted
finaldf

| | Source | Target | Predicted |
|---|---|---|---|
| 0 | wrcie | youre back | youre back |
| 1 | nienawidz toma | i hate tom | i hate tom |
| 2 | chce mi si spa | i want to sleep | i feel face |
| 3 | to mj dom | this is my home | thats my my |
| 4 | dzwonie | did you call | did you call |
| 5 | to musi by tutaj | it must be here | it is tom |
| 6 | bed za wami tskni | i will miss you | i came for |
| 7 | poka mi | show me | show me |
| 8 | nie zamiecaj | dont litter | dont litter |
| 9 | zapaciem rachunki | i paid my bills | i paid bills |


<div class="alert alert-block alert-success">
<b>Comments:</b>  <br>
    <b> Accuracy:-</b> The model produces fairly decent Polish-to-English translations for short sentences. As the number of words grows, accuracy drops, and the prediction often fails to match the target phrase exactly. <br>
    <b> Alternatives :- </b> An attention mechanism could give better results. The Seq2Seq architecture works fairly well on short sequences, but for longer ones the model needs to focus on the words in the phrase that matter most at each step. <br>
    <b> Dataset size :- </b> Because Seq2Seq models need extensive computational infrastructure, I have limited the data size. The subset used contains far fewer long sentences than short ones, so the model has much less data to learn longer sequences from. Training on the entire dataset should give better results (if infrastructure permits).
</div>
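
The remark about exact matches can be quantified for the rows displayed above. A minimal sketch, using only the ten target/predicted pairs shown in the results table (so it says nothing about the full test set); `exact_match_rate` is a hypothetical helper.

```python
def exact_match_rate(targets, predictions):
    """Fraction of predictions that equal their target sentence exactly."""
    matches = sum(t == p for t, p in zip(targets, predictions))
    return matches / len(targets)

shown_targets = ['youre back', 'i hate tom', 'i want to sleep', 'this is my home',
                 'did you call', 'it must be here', 'i will miss you', 'show me',
                 'dont litter', 'i paid my bills']
shown_predictions = ['youre back', 'i hate tom', 'i feel face', 'thats my my',
                     'did you call', 'it is tom', 'i came for', 'show me',
                     'dont litter', 'i paid bills']

print(exact_match_rate(shown_targets, shown_predictions))  # → 0.5
```

In these ten rows every exact match is a sentence of at most three words, which is in line with the observation that longer sentences are harder for the model.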