# Neural Machine Translation
It might not seem like it, but machine translation is one of the earliest problems studied in Artificial Intelligence. Traditionally, it has been a challenging task that relied on large statistical models built with highly sophisticated linguistic knowledge. In neural machine translation (NMT), deep neural networks are trained on the problem instead: the network looks for patterns in the input (source) language and produces the corresponding target-language representation as output.

## Data Preparation
Like many machine learning tasks, this one starts with the data. In this tutorial, we'll use a dataset of English-to-Vietnamese phrases. Think of it as learning Vietnamese or English with flashcards. The dataset can be downloaded [here](https://www.kaggle.com/datasets/hungnm/englishvietnamese-translation). To prepare the dataset for modeling, we'll perform the following steps:

1. Read in the data and scan through it
2. Clean up punctuation
3. Normalize upper- and lowercase words
4. Process special characters
5. Handle duplicate phrases in English with different translations in Vietnamese
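
Step 4 deserves extra care for Vietnamese, because an accented letter can be stored either as a single precomposed code point or as a base letter followed by combining marks. The sketch below uses only the standard library to show how `unicodedata.normalize` with the `NFC` form composes such sequences, so that a check like `str.isalpha()` accepts the word. The sample word is illustrative only. Whether this dataset actually contains decomposed characters is an assumption, not something verified here.

```python
from unicodedata import normalize

# Illustrative sample: the word "đặt", forced into decomposed (NFD) form,
# where 'ặ' becomes 'a' followed by two combining marks
decomposed = normalize("NFD", "\u0111\u1eb7t")

# Combining marks are not letters, so isalpha() rejects the decomposed form
print(len(decomposed), decomposed.isalpha())

# NFC recomposes base letter + marks into single precomposed code points
composed = normalize("NFC", decomposed)
print(len(composed), composed.isalpha())
```

Normalizing every line to `NFC` before filtering tokens is one way to avoid silently dropping valid Vietnamese words.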

In [119]:
!pip install underthesea --quiet


In [121]:
# import python libraries
import re
import string
import math
import numpy as np
from unicodedata import normalize
from tqdm.notebook import tqdm

# NLP libraries
from gensim.models import KeyedVectors
from underthesea import word_tokenize  # Vietnamese NLP Toolkit

# NN Libraries
from transformers import AutoTokenizer

In [122]:
## Parameters
DATA_DIR = './data/'

In [123]:
## Helper Functions
def load_data(file_path: str) -> list:
  """
    Function to load data from a text file
    Read it line by line and return as a list of strings
    Inputs:
      - file_path {string}: path to the file to be read
    Outputs:
      - data {list}: list of strings, one per line
  """

  data = []
  # the with-block closes the file automatically, so no explicit close() is needed
  with open(file_path, 'rt', encoding='utf-8') as file:
    # read file line by line
    for line in file:
      # remove leading and trailing whitespace, then append to data list
      data.append(line.strip())

  return data


def to_pairs(doc1: list, doc2: list) -> list:
  """
    Function to join two aligned lists of strings into a list of pairs
    Inputs:
      - doc1 {list}: list of strings
      - doc2 {list}: list of strings, assumed to be the same length as doc1
    Outputs:
      - pairs {list}: list of [doc1[i], doc2[i]] pairs
  """
  # initialize list of pairs
  pairs = []
  for i in range(len(doc1)):
    # append the i-th string of each list as one pair
    pairs.append([doc1[i], doc2[i]])

  return pairs


# clean a list of sentence pairs
def clean_pairs(lines: list) -> np.ndarray:
  """
    Function to clean a list of pairs of strings
    Inputs:
      - lines {list}: list of pairs of strings
    Outputs:
      - cleaned {np.ndarray}: array of cleaned pairs of strings
  """

  # declare the output list and prepare the translation table
  # used for removing punctuation
  cleaned = list()
  table = str.maketrans('', '', string.punctuation)

  for pair in tqdm(lines):
    clean_pair = list()
    # for each pair, perform the following operations on both sentences
    # 1. tokenize on white space
    # 2. convert to lowercase
    # 3. remove punctuation from each token
    # 4. remove extra whitespace
    # 5. remove tokens that are not purely alphabetic (e.g. tokens with numbers in them)
    # 6. store as string
    for line in pair:
      line = line.split()
      line = [word.lower() for word in line]
      line = [word.translate(table) for word in line]
      line = [re.sub(r"\s+", " ", w) for w in line]
      line = [word for word in line if word.isalpha()]
      clean_pair.append(' '.join(line))
    # append once per pair, after both sentences have been cleaned
    cleaned.append(clean_pair)
  return np.array(cleaned)

In [124]:
# Read in the data
# From initial inspection, the English and Vietnamese sentences are aligned line by line
# So we can read them in as pairs
english_text = load_data(DATA_DIR + 'raw/en_sents.txt')
vietnamese_text = load_data(DATA_DIR + 'raw/vi_sents.txt')
print(english_text[:5])
print(vietnamese_text[:5])
len(english_text), len(vietnamese_text)

['Please put the dustpan in the broom closet', 'Be quiet for a moment.', 'Read this', 'Tom persuaded the store manager to give him back his money.', 'Friendship consists of mutual understanding']
['xin vui lòng đặt đồ hốt rác trong tủ chổi', 'im lặng một lát', 'đọc này', 'tom thuyết phục người quản lý cửa hàng trả lại tiền cho anh ta.', 'tình bạn bao gồm sự hiểu biết lẫn nhau']


(254090, 254090)
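
Matching line counts are a necessary condition for alignment, but not a sufficient one. As a rough extra sanity check, the sketch below flags positions where exactly one side of a pair is empty, which can hint at a shifted or missing line. `find_misaligned` and the toy lists are hypothetical names introduced for illustration. They are not part of the dataset or of any library.

```python
def find_misaligned(src_lines: list, tgt_lines: list) -> list:
  """
    Hypothetical helper: return the indices where exactly one side is empty
    Raises ValueError if the two lists differ in length
  """
  if len(src_lines) != len(tgt_lines):
    raise ValueError(f"line counts differ: {len(src_lines)} vs {len(tgt_lines)}")
  return [i for i, (s, t) in enumerate(zip(src_lines, tgt_lines))
          if bool(s.strip()) != bool(t.strip())]


# toy data standing in for english_text / vietnamese_text
toy_en = ['Read this', '', 'Be quiet for a moment.']
toy_vi = ['đọc này', 'im lặng một lát', '']
suspect = find_misaligned(toy_en, toy_vi)
print(suspect)
```

An empty result does not prove the corpus is aligned. It only rules out one common symptom of misalignment.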

In [125]:
# convert to pairs
sentence_pairs = to_pairs(english_text, vietnamese_text)
sentence_pairs[:5]

[['Please put the dustpan in the broom closet',
  'xin vui lòng đặt đồ hốt rác trong tủ chổi'],
 ['Be quiet for a moment.', 'im lặng một lát'],
 ['Read this', 'đọc này'],
 ['Tom persuaded the store manager to give him back his money.',
  'tom thuyết phục người quản lý cửa hàng trả lại tiền cho anh ta.'],
 ['Friendship consists of mutual understanding',
  'tình bạn bao gồm sự hiểu biết lẫn nhau']]

In [126]:
# preprocessed data pairs
cleaned_pairs = clean_pairs(sentence_pairs)
cleaned_pairs[:5]

  0%|          | 0/254090 [00:00<?, ?it/s]

array([['please put the dustpan in the broom closet',
        'xin vui lòng đặt đồ hốt rác trong tủ chổi'],
       ['be quiet for a moment', 'im lặng một lát'],
       ['read this', 'đọc này'],
       ['tom persuaded the store manager to give him back his money',
        'tom thuyết phục người quản lý cửa hàng trả lại tiền cho anh ta'],
       ['friendship consists of mutual understanding',
        'tình bạn bao gồm sự hiểu biết lẫn nhau']], dtype='<U265')
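
The cleaning code shown so far does not yet cover step 5 of the preparation list, handling English phrases that appear with more than one Vietnamese translation. One simple option, sketched below, is to keep only the first translation seen for each English phrase. `dedupe_by_source` is a hypothetical helper, and the second translation of `'read this'` in the toy data is made up for illustration.

```python
def dedupe_by_source(pairs) -> list:
  """
    Hypothetical helper: keep only the first translation seen for each
    source phrase, preserving the original order of the pairs
  """
  seen = set()
  unique = []
  for src, tgt in pairs:
    if src in seen:
      # this source phrase already has a translation, skip the extra one
      continue
    seen.add(src)
    unique.append([src, tgt])
  return unique


# toy pairs standing in for cleaned_pairs
toy_pairs = [
  ['read this', 'đọc này'],
  ['be quiet for a moment', 'im lặng một lát'],
  ['read this', 'đọc cái này'],
]
deduped = dedupe_by_source(toy_pairs)
print(deduped)
```

Keeping the first translation is a simplification. Depending on the model, it may be preferable to keep every variant as an additional training example instead.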

## Tokenizer
Now that we have prepared the data, it is time to tokenize it. Tokenization is the process of breaking a sentence down into individual words, called tokens, and then assigning a numerical value to each one. A vocabulary is also created in this process to keep track of the word-to-number conversion as well as the total number of unique words in our sample.
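
To make the idea concrete, here is a minimal sketch of a word-level vocabulary built with plain whitespace splitting. It is an illustration under stated assumptions, not the tokenizer used in this tutorial: `build_vocab`, `encode`, and the `<pad>` / `<unk>` special tokens are hypothetical names, and a real pipeline for Vietnamese would likely rely on a dedicated tokenizer such as the imported `word_tokenize`.

```python
def build_vocab(sentences: list, specials=('<pad>', '<unk>')) -> tuple:
  """
    Hypothetical helper: map each unique whitespace-separated token to an
    integer id, reserving the lowest ids for the special tokens
  """
  token_to_id = {tok: i for i, tok in enumerate(specials)}
  for sentence in sentences:
    for token in sentence.split():
      if token not in token_to_id:
        # next free id is the current vocabulary size
        token_to_id[token] = len(token_to_id)
  id_to_token = {i: tok for tok, i in token_to_id.items()}
  return token_to_id, id_to_token


def encode(sentence: str, token_to_id: dict) -> list:
  # words missing from the vocabulary fall back to the '<unk>' id
  return [token_to_id.get(tok, token_to_id['<unk>']) for tok in sentence.split()]


toy_sentences = ['read this', 'be quiet for a moment', 'read this for a moment']
token_to_id, id_to_token = build_vocab(toy_sentences)
ids = encode('read this book', token_to_id)
print(len(token_to_id), ids)
```

The two dictionaries are inverses of each other, so the vocabulary size is simply `len(token_to_id)`, and any id can be mapped back to its word through `id_to_token`.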

In [127]:
# Create source and target language
SRC_LANG = 'en'
TGT_LANG = 'vi'

# Declare word-to-number and number-to-word dictionaries
token_index = {}
index_token = {}
