# NLP - Machine Translation Pipeline

# Introduction & Loading of packages and data

## Problem Statement

Prepare a Python notebook (Google Colab is recommended) to build, train, and evaluate a deep neural network that functions as part of an end-to-end machine translation pipeline. The pipeline accepts English text as input and returns the Korean translation. Read the instructions carefully.

### Checking the availability of GPU

In [4]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sun Mar 13 08:56:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Installing required packages

In [71]:
! pip install kaggle
! pip install pandas
! pip install tabulate
! pip install keras
! pip install numpy



### Uploading & Unzipping the data

In [1]:
!unzip archive

Archive:  archive.zip
  inflating: multitarget-ted/en-ko/raw/ted_dev_en-ko.raw.en  
  inflating: multitarget-ted/en-ko/raw/ted_dev_en-ko.raw.ko  
  inflating: multitarget-ted/en-ko/raw/ted_test1_en-ko.raw.en  
  inflating: multitarget-ted/en-ko/raw/ted_test1_en-ko.raw.ko  
  inflating: multitarget-ted/en-ko/raw/ted_train_en-ko.raw.en  
  inflating: multitarget-ted/en-ko/raw/ted_train_en-ko.raw.ko  


### Downloading stopwords

In [1]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Importing the required packages

We use Keras and TensorFlow to build our own tokenizers, word embeddings, and RNN model.

In [2]:
import re
import collections
import numpy as np
import pandas as pd
import tensorflow as tf
from tabulate import tabulate
from nltk.corpus import stopwords
from keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.losses import sparse_categorical_crossentropy
from keras.layers import GRU, Input, Dense, TimeDistributed, Embedding

### Loading the dataset

In [3]:
with open("multitarget-ted/en-ko/raw/ted_train_en-ko.raw.en", "r") as english_train_file:
    english_sentences_train = english_train_file.readlines()

with open("multitarget-ted/en-ko/raw/ted_train_en-ko.raw.ko", "r") as korean_train_file:
    korean_sentences_train = korean_train_file.readlines()

with open("multitarget-ted/en-ko/raw/ted_test1_en-ko.raw.en", "r") as english_test_file:
    english_sentences_test = english_test_file.readlines()

with open("multitarget-ted/en-ko/raw/ted_test1_en-ko.raw.ko", "r") as korean_test_file:
    korean_sentences_test = korean_test_file.readlines()

train_data = pd.DataFrame(
    {"english_sentences_train": english_sentences_train,
     "korean_sentences_train": korean_sentences_train,
    })

print("Dataset Loaded")

Dataset Loaded


# Data Exploration

### Printing 5 rows as a sanity check

In [5]:
print(tabulate(train_data[:5], tablefmt="psql"))

+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| 0 | (Applause) David Gallo: This is Bill Lange. I'm Dave Gallo.                                                                                                           | (박수) 이쪽은 Bill Lange 이고, 저는 David Gallo입니다                                                                  |
| 1 | And we're going to tell you some stories from the sea here in video.                                                                                                  | 우리는 여러분에게 바닷속 이야기를 영상과 함께 들려주고자 합니다.                                                       |
| 2 | We've got some of the most incredible video of Titanic that's ever been seen, and we're not going to show you any of it.                             

### Checking the number of sentences in train & test data

In [7]:
print("Number of sentences in English train file - ", len(english_sentences_train)) 
print("Number of sentences in Korean train file - ", len(korean_sentences_train))

Number of sentences in English train file -  166215
Number of sentences in Korean train file -  166215


In [8]:
print("Number of sentences in English test file - ", len(english_sentences_test)) 
print("Number of sentences in Korean test file - ", len(korean_sentences_test))

Number of sentences in English test file -  1982
Number of sentences in Korean test file -  1982


### Exploration of English train file

The following explorations have been performed on the data:


1.   Total number of words
2.   Total number of unique words
3.   10 most common words
4.   10 most common words when stopwords are excluded
5.   Average number of words per sentence



In [9]:
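# Count lower-cased, whitespace-separated tokens across the training sentences.
english_words_counter = collections.Counter([word.lower() for sentence in english_sentences_train for word in sentence.split()])
total_number_of_tokens = sum(english_words_counter.values())
# NLTK's English stop word list (downloaded above); a set gives fast membership tests.
stop_words = set(stopwords.words("english"))
common_word_list = []
count = 0
# Note: iterating over a Counter visits words in first-seen order, not by frequency,
# so this loop collects the first 10 non-stop-words encountered in the data.
# Iterate over english_words_counter.most_common() instead to rank by frequency.
for word in english_words_counter:
    if word not in stop_words:
        common_word_list.append(word)
        count += 1
    if count == 10:
        break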
print("Total number of English tokens - {}".format(total_number_of_tokens))
print("Total unique tokens - {}".format(len(english_words_counter)))
print("10 Most common words in the English dataset:", '"' + '", "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print("10 Most common words in the English dataset without stop words:", '"' + '", "'.join(common_word_list) + '"')
print("Average number of tokens per sentence - {}".format(round(total_number_of_tokens / len(english_sentences_train))))

Total number of English tokens - 2910762
Total unique tokens - 105027
10 Most common words in the English dataset: "the", "and", "to", "of", "a", "that", "in", "i", "is", "you"
10 Most common words in the English dataset without stop words: "(applause)", "david", "gallo:", "bill", "lange.", "i'm", "dave", "gallo.", "we're", "going"
Average number of tokens per sentence - 18
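
Iterating over a `collections.Counter` yields words in the order they were first seen, so ranking by frequency needs `most_common()`. A minimal sketch of a stop-word-filtered top-N, assuming a tiny made-up corpus and stop word set in place of the TED data and the NLTK list (`top_content_words` is a hypothetical helper, not part of the notebook):

```python
import collections

def top_content_words(sentences, stop_words, n=10):
    """Return the n most frequent lower-cased tokens that are not stop words."""
    counter = collections.Counter(
        word.lower() for sentence in sentences for word in sentence.split()
    )
    # most_common() with no argument returns every (word, count) pair,
    # highest count first.
    ranked = [word for word, _ in counter.most_common() if word not in stop_words]
    return ranked[:n]

# Hypothetical toy data, for illustration only.
toy_sentences = ["the sea is deep", "the sea is blue", "we love the sea"]
toy_stop_words = {"the", "is", "we"}
print(top_content_words(toy_sentences, toy_stop_words, n=2))
```

On the toy data, the most frequent non-stop-word is "sea", which is the kind of result the "without stop words" line is meant to report.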


### Exploration of Korean train file

The following explorations have been performed on the data:


1.   Total number of words
2.   Total number of unique words
3.   10 most common words
4.   Average number of words per sentence

In [10]:
korean_words_counter = collections.Counter([word for sentence in korean_sentences_train for word in sentence.split()])
total_number_of_korean_tokens = len([word for sentence in korean_sentences_train for word in sentence.split()])
print("Total number of Korean tokens - {}".format(total_number_of_korean_tokens))
print("Total unique tokens - {}".format(len(korean_words_counter)))
print("10 Most common words in the Korean dataset:", '"' + '", "'.join(list(zip(*korean_words_counter.most_common(10)))[0]) + '"')
print("Average number of tokens per sentence - {}".format(round(total_number_of_korean_tokens / len(korean_sentences_train))))

Total number of Korean tokens - 2026438
Total unique tokens - 330791
10 Most common words in the Korean dataset: "수", "이", "그", "그리고", "있습니다.", "저는", "있는", "우리는", "제가", "우리가"
Average number of tokens per sentence - 12


# Data Preprocessing

### Preprocessing of text

During preprocessing, all special characters, newlines, and tabs are removed from the text. Stop words are kept because they are an integral part of a language (in this case English), and removing them would degrade translation quality.

In [11]:
import string

# Characters to delete: all ASCII punctuation plus carriage returns, newlines and tabs.
_REMOVE_TABLE = str.maketrans("", "", string.punctuation + "\r\n\t")

def preprocess_text(text_list):
    cleaned_list = []
    for sentence in text_list:
        # Delete the unwanted characters (they are removed, not replaced by spaces).
        sentence = sentence.translate(_REMOVE_TABLE)
        # Collapse runs of whitespace and strip both ends.
        sentence = ' '.join(sentence.split())
        cleaned_list.append(sentence)
    return cleaned_list

In [12]:
print(english_sentences_train[:5])
english_sentences_train = preprocess_text(english_sentences_train)
english_sentences_test = preprocess_text(english_sentences_test)
print(english_sentences_train[:5])

["(Applause) David Gallo: This is Bill Lange. I'm Dave Gallo. \n", "And we're going to tell you some stories from the sea here in video. \n", "We've got some of the most incredible video of Titanic that's ever been seen, and we're not going to show you any of it. \n", "(Laughter) The truth of the matter is that the Titanic -- even though it's breaking all sorts of box office records -- it's not the most exciting story from the sea. \n", 'And the problem, I think, is that we take the ocean for granted. \n']
['Applause David Gallo This is Bill Lange Im Dave Gallo', 'And were going to tell you some stories from the sea here in video', 'Weve got some of the most incredible video of Titanic thats ever been seen and were not going to show you any of it', 'Laughter The truth of the matter is that the Titanic even though its breaking all sorts of box office records its not the most exciting story from the sea', 'And the problem I think is that we take the ocean for granted']


In [13]:
print(korean_sentences_train[:5])
korean_sentences_train = preprocess_text(korean_sentences_train)
korean_sentences_test = preprocess_text(korean_sentences_test)
print(korean_sentences_train[:5])

['(박수) 이쪽은 Bill Lange 이고, 저는 David Gallo입니다\n', '우리는 여러분에게 바닷속 이야기를 영상과 함께 들려주고자 합니다.\n', '저희는 끝내주는 타이타닉 비디오도 있긴 합니다만 뭐..여기서는 눈꼽만큼도 보여줄 생각이없습니다.\n', '(웃음) 비록 타이타닉이 박스오피스에서 굉장한 실적을 거두긴 했지만 바다가 들려주는 이야기 중 가장 재밌는 것은 아닙니다.\n', '문제라면 우리는 우리가 바다를 이미 알고있다고 믿는거죠.\n']
['박수 이쪽은 Bill Lange 이고 저는 David Gallo입니다', '우리는 여러분에게 바닷속 이야기를 영상과 함께 들려주고자 합니다', '저희는 끝내주는 타이타닉 비디오도 있긴 합니다만 뭐여기서는 눈꼽만큼도 보여줄 생각이없습니다', '웃음 비록 타이타닉이 박스오피스에서 굉장한 실적을 거두긴 했지만 바다가 들려주는 이야기 중 가장 재밌는 것은 아닙니다', '문제라면 우리는 우리가 바다를 이미 알고있다고 믿는거죠']


### Tokenizing & Padding the text

In [14]:
def tokenize(x):
    x_tk = Tokenizer(char_level = False)
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

In [15]:
def pad(x, length=None):
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen = length, padding = 'post')
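
To make the effect of `padding = 'post'` concrete, here is a pure-Python sketch of the same idea. It is a simplified stand-in written for illustration, not the Keras implementation; `pad_post` is a hypothetical name. It appends zeros at the end, and when a sequence is longer than the target length it keeps the last items, which mirrors the Keras default of `truncating='pre'`.

```python
def pad_post(sequences, length=None, value=0):
    """Pad each sequence with `value` at the end so all share one length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Too long: drop items from the front, keeping the last `length` items.
        if len(seq) > length:
            seq = seq[len(seq) - length:]
        padded.append(seq + [value] * (length - len(seq)))
    return padded

print(pad_post([[5, 2], [7, 1, 4, 9]]))
print(pad_post([[5, 2, 8]], length=2))
```

The first call pads the shorter sequence with two trailing zeros; the second shows the front of an over-long sequence being cut.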


In [16]:
def preprocess(x, y):
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
    return preprocess_x, preprocess_y, x_tk, y_tk

In [17]:
preproc_english_sentences_train, preproc_korean_sentences_train, english_tokenizer_train, korean_tokenizer_train = preprocess(english_sentences_train, korean_sentences_train)
preproc_english_sentences_test, preproc_korean_sentences_test, english_tokenizer_test, korean_tokenizer_test = preprocess(english_sentences_test, korean_sentences_test)
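
Because `preprocess` fits a new `Tokenizer` on whatever text it receives, the integer ids for the test set come from a different vocabulary than the one the model is trained on. The sketch below uses a simplified, hypothetical stand-in for the tokenizer (frequency-ranked ids starting at 1) to show the mismatch and how encoding with the training vocabulary avoids it. It is not the Keras implementation.

```python
import collections

def fit_word_index(texts):
    """Assign ids 1..V by descending frequency (0 stays free for padding)."""
    counts = collections.Counter(w.lower() for t in texts for w in t.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    """Map words to ids, skipping words missing from the vocabulary."""
    return [
        [word_index[w.lower()] for w in t.split() if w.lower() in word_index]
        for t in texts
    ]

# Hypothetical toy data, for illustration only.
train_texts = ["the sea is deep", "the sea is blue"]
test_texts = ["blue sea"]

train_index = fit_word_index(train_texts)
test_index = fit_word_index(test_texts)  # refit: ids no longer match training

print(encode(test_texts, test_index))   # ids from the refit vocabulary
print(encode(test_texts, train_index))  # ids a model trained on train_texts expects
```

In the notebook, the corresponding approach would be to call `english_tokenizer_train.texts_to_sequences(english_sentences_test)` and pad the result to `max_english_sequence_length`, rather than fitting a second tokenizer.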

### Checking the maximum sentence length & vocabulary size of both languages

In [18]:
max_english_sequence_length = preproc_english_sentences_train.shape[1]
max_korean_sequence_length = preproc_korean_sentences_train.shape[1]
english_vocab_size = len(english_tokenizer_train.word_index)
korean_vocab_size = len(korean_tokenizer_train.word_index)

print('Data preprocessing completed')
print("Max English sentence length:", max_english_sequence_length)
print("Max Korean sentence length:", max_korean_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("Korean vocabulary size:", korean_vocab_size)

Data preprocessing completed
Max English sentence length: 114
Max Korean sentence length: 108
English vocabulary size: 53584
Korean vocabulary size: 292902


# Building & Training Model

The model is an RNN (a GRU layer) on top of embeddings built from the words present in the datasets. Because the dataset is large and many words repeat, training stabilizes relatively quickly: after 5 epochs, both training and validation accuracy stay near 89%.
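
When reading the accuracy figure, it may help to keep in mind that the `accuracy` metric here is averaged over every time step, padding included, since the model applies no masking. A rough sketch of how much padding can contribute, assuming the average of about 12 Korean tokens per sentence and the padded length of 114 reported earlier; `padding_share` and `masked_accuracy` are hypothetical helpers, not part of the notebook:

```python
def padding_share(avg_tokens, padded_length):
    """Fraction of target positions that are padding (id 0)."""
    return 1 - avg_tokens / padded_length

def masked_accuracy(y_true, y_pred, pad_id=0):
    """Accuracy over non-padding target positions only."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != pad_id]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)

print(round(padding_share(12, 114), 3))

# Toy example: both real tokens are wrong, every padding position is right.
y_true = [7, 3, 0, 0, 0, 0, 0, 0]
y_pred = [5, 5, 0, 0, 0, 0, 0, 0]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain, masked_accuracy(y_true, y_pred))
```

With close to nine in ten target positions being padding under these assumptions, an accuracy near 89% could be reached largely by predicting padding, so a masked metric is likely to give a more informative picture of translation quality.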

In [68]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, korean_vocab_size):
    learning_rate = 1e-3
    rnn = GRU(64, return_sequences=True, activation="tanh")
    
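    # The model reads English token ids, so the embedding table is sized by the
    # English vocabulary; +1 because id 0 is reserved for padding.
    embedding = Embedding(english_vocab_size + 1, 64, input_length=input_shape[1])
    # One output class per Korean token id, plus the padding id 0.
    logits = TimeDistributed(Dense(korean_vocab_size + 1, activation="softmax"))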
    
    model = Sequential()
    model.add(embedding)
    model.add(rnn)
    model.add(logits)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate), 
                  metrics=['accuracy'])
    return model

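# The model emits one output token per input time step, so the Korean targets
# are padded to the English sequence length as well.
tmp_x = pad(preproc_english_sentences_train, max_english_sequence_length)
tmp_y = pad(preproc_korean_sentences_train, max_english_sequence_length)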
embeded_model = embed_model(tmp_x.shape, max_english_sequence_length, english_vocab_size, korean_vocab_size)

In [21]:
embeded_model.fit(tmp_x, tmp_y, batch_size=24, epochs=5, validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f24e4c23690>

# Prediction

One way to evaluate a translation model is a round trip through a bidirectional system: a sentence is translated English -> Korean -> English, and the result is compared with the original. Building two models for this would be resource-intensive, so we used Google Translate to perform the translation and check the accuracy of the output.
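
A round-trip check needs some way to compare the original sentence with its back-translation. A minimal sketch using token-level F1 overlap, which is an assumption chosen here for simplicity (it is not BLEU, and not necessarily how the Google Translate comparison was judged); the back-translated string is a made-up example:

```python
import collections

def token_f1(reference, candidate):
    """Token-overlap F1 between two whitespace-tokenized, lower-cased sentences."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both sentences.
    overlap = sum((collections.Counter(ref) & collections.Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

original = "thank you so much"
back_translated = "thank you very much"  # hypothetical round-trip result
print(token_f1(original, back_translated))
```

A score of 1.0 means the round trip reproduced the same tokens; lower values indicate that words were lost or changed somewhere along the way.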

In [65]:
sentence_list = ["Thank you so much."]
cleaned_sentence_list = preprocess_text(sentence_list)
# Map each word to its integer id from the training tokenizer.
words_to_index = dict(english_tokenizer_train.word_index)
preprocessed_sentence_list = []
for sentence in cleaned_sentence_list:
    preprocessed_sentence = []
    for word in sentence.split():
        # Raises KeyError for words that never appeared in the training data.
        preprocessed_sentence.append(words_to_index[word.lower()])
    preprocessed_sentence_list.append(preprocessed_sentence)

def logits_to_text(logits, tokenizer):
    # Invert the tokenizer's word -> id mapping; id 0 is reserved for padding.
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    # Pick the highest-probability word at every time step.
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

# Note: the prediction below runs on tmp_x[:5] (the first five training sentences)
# and prints the first result; preprocessed_sentence_list is not used. To translate
# sentence_list instead, pass pad(preprocessed_sentence_list, max_english_sequence_length).
output = logits_to_text(embeded_model.predict(tmp_x[:5])[0], korean_tokenizer_train)
print(output.replace("<PAD>", ""))

박수 감사합니다                                                                                                                
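
`logits_to_text` performs greedy decoding: at each time step it takes the id with the highest probability and maps it back to a word. A pure-Python sketch of that step, using a made-up three-word output vocabulary and made-up probabilities (`decode_greedy` is a hypothetical name, and padding is dropped while decoding here instead of afterwards):

```python
def decode_greedy(probabilities, index_to_word, pad_id=0):
    """Pick the most probable id at each time step and join the non-padding words."""
    words = []
    for step in probabilities:
        # Index of the largest probability in this time step (argmax).
        best_id = max(range(len(step)), key=step.__getitem__)
        if best_id != pad_id:
            words.append(index_to_word[best_id])
    return " ".join(words)

# Hypothetical toy vocabulary and per-step probabilities, for illustration only.
toy_index_to_word = {1: "박수", 2: "감사합니다"}
toy_probs = [
    [0.10, 0.70, 0.20],  # step 1: id 1 wins
    [0.20, 0.10, 0.70],  # step 2: id 2 wins
    [0.90, 0.05, 0.05],  # step 3: padding wins and is skipped
]
print(decode_greedy(toy_probs, toy_index_to_word))
```

Greedy decoding considers each position independently, which matches what the notebook does; it makes no attempt to search over alternative word sequences.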
