# Neural Machine Translation
While it might not seem like it, machine translation is one of the earliest problems tackled in Artificial Intelligence. Traditionally, machine translation has been a challenging task that relied on large statistical models built with highly sophisticated linguistic knowledge. In neural machine translation (NMT), deep neural networks are trained on the problem instead: the network looks for patterns in the input (source) language and produces target-language representations as output.

## Data Preparation
Like many machine learning tasks, we need to start with the data. In this tutorial, we'll use a dataset of English-to-Vietnamese phrases. Think of it as learning Vietnamese or English with flashcards. The dataset can be downloaded [here](https://www.kaggle.com/datasets/hungnm/englishvietnamese-translation). To prepare the dataset for modeling, we'll perform the following steps:

1. Read in the data and scan through it
2. Clean up punctuation
3. Handle upper- and lowercase words
4. Process special characters
5. Handle duplicate English phrases that have different Vietnamese translations
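Steps 2 to 4 could be sketched as a single helper along the following lines. This is a minimal sketch, not the tutorial's final implementation: the name `clean_sentence` is hypothetical, and the choice of NFC normalization (which keeps Vietnamese diacritics intact rather than stripping them) and of removing only ASCII punctuation are assumptions.

```python
import string
from unicodedata import normalize

# translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_sentence(sentence: str) -> str:
  """Hypothetical helper: normalize, lowercase, and strip punctuation."""
  # NFC composes base letters and combining marks into single code points,
  # so the same Vietnamese word is always stored the same way
  sentence = normalize('NFC', sentence)
  # lowercase (also works for accented letters)
  sentence = sentence.lower()
  # remove ASCII punctuation (assumption: apostrophes are dropped too)
  sentence = sentence.translate(PUNCT_TABLE)
  # drop non-printable characters
  sentence = ''.join(ch for ch in sentence if ch.isprintable())
  # collapse repeated whitespace
  return ' '.join(sentence.split())


print(clean_sentence('Be quiet for a moment.'))
```

Applying the same function to both languages keeps the pairs consistent. Whether to keep or strip diacritics is a design choice; stripping them would make many distinct Vietnamese words identical, which is why this sketch keeps them.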

In [11]:
# import python libraries
import re
import numpy as np
from unicodedata import normalize

In [15]:
## Parameters
DATA_DIR = './data/'

In [21]:
## Helper Functions
def load_data(file_path:str) -> list:
  """
    Function to load data from a text file
    Read it line by line and return as a list of strings
    Inputs:
      - file_path {string}: path to the file to be read
    Outputs:
      - data {list}: list of strings
  """

  data = []
  with open(file_path, 'rt', encoding='utf-8') as file:
    # read file line by line
    for line in file:
      # remove leading and trailing whitespaces
      line = line.strip()
      # append to data list
      data.append(line)
  # the `with` block closes the file automatically on exit
  return data


def to_pairs(doc1: list, doc2: list) -> list:
  """
    Function to join two aligned lists of strings into a list of pairs
    Inputs:
      - doc1 {list}: list of strings
      - doc2 {list}: list of strings
    Outputs:
      - pairs {list}: list of [doc1[i], doc2[i]] pairs
  """
  # zip walks both lists in step and keeps every line, including the last one
  pairs = [[s1, s2] for s1, s2 in zip(doc1, doc2)]

  return pairs

In [20]:
# Read in the data
# From initial inspection, the English and Vietnamese files are aligned line by line,
# so we can read them in as pairs
english_text = load_data(DATA_DIR + 'raw/en_sents.txt')
vietnamese_text = load_data(DATA_DIR + 'raw/vi_sents.txt')
print(english_text[:5])
print(vietnamese_text[:5])
print(len(english_text), len(vietnamese_text))

['Please put the dustpan in the broom closet', 'Be quiet for a moment.', 'Read this', 'Tom persuaded the store manager to give him back his money.', 'Friendship consists of mutual understanding']
['xin vui lòng đặt đồ hốt rác trong tủ chổi', 'im lặng một lát', 'đọc này', 'tom thuyết phục người quản lý cửa hàng trả lại tiền cho anh ta.', 'tình bạn bao gồm sự hiểu biết lẫn nhau']
254090 254090

In [23]:
# convert to pairs
sentence_pairs = to_pairs(english_text, vietnamese_text)
sentence_pairs[:5]

[['Please put the dustpan in the broom closet',
  'xin vui lòng đặt đồ hốt rác trong tủ chổi'],
 ['Be quiet for a moment.', 'im lặng một lát'],
 ['Read this', 'đọc này'],
 ['Tom persuaded the store manager to give him back his money.',
  'tom thuyết phục người quản lý cửa hàng trả lại tiền cho anh ta.'],
 ['Friendship consists of mutual understanding',
  'tình bạn bao gồm sự hiểu biết lẫn nhau']]
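Once the sentences are paired, the last preparation step (duplicate English phrases with different Vietnamese translations) might be handled as sketched below. The helper name `deduplicate_pairs` is hypothetical, and keeping only the first translation seen is just one possible policy, assumed here for simplicity. The sample pairs are made up for illustration and are not taken from the dataset.

```python
def deduplicate_pairs(pairs: list) -> list:
  """Hypothetical helper: keep the first translation seen for each English phrase."""
  seen = set()
  unique_pairs = []
  for english, vietnamese in pairs:
    # skip English phrases that already have a translation
    if english not in seen:
      seen.add(english)
      unique_pairs.append([english, vietnamese])
  return unique_pairs


# made-up sample: 'Read this' appears twice with different translations
sample_pairs = [
  ['Read this', 'đọc này'],
  ['Read this', 'đọc cái này'],
  ['Be quiet', 'im lặng'],
]
print(deduplicate_pairs(sample_pairs))
```

Keeping one translation per phrase gives the model a single target for each input. If the alternative translations matter, they could instead be collected into a list per English phrase.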