<a href="https://colab.research.google.com/github/MinhDg00/en-vi-translation/blob/master/en_vi_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Translation Eng-Viet

This project uses the EVBCorpus, a collection of English-Vietnamese parallel texts (bitexts).

More information about the data can be found [here](https://github.com/qhungngo/EVBCorpus).

### Load Files and Create Dataset

In [1]:
!pwd

/content


In [0]:
# Packages for loading file
import os
from zipfile import ZipFile
from bs4 import BeautifulSoup
import re
import glob
import warnings
warnings.filterwarnings('ignore')

In [0]:
# Set seed
import numpy as np
import random
SEED = 46
np.random.seed(SEED)
random.seed(SEED)

In [0]:
with ZipFile('data.zip', 'r') as data:
    # Extract the full contents of the zip file into the current directory
    data.extractall()

In [0]:
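# Parse every SGML file in the extracted data directory
data = []
path = 'data/*.sgml'
files = glob.glob(path)
for file in files:
    # Use a context manager so each file handle is closed after parsing
    with open(file) as f:
        data.append(BeautifulSoup(f, 'lxml'))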

In [0]:
# Build two parallel lists of English and Vietnamese sentences.
# The <s> tags alternate: even positions are English, odd positions are Vietnamese.
english_sentences = []
vietnamese_sentences = []
for d in data:
    for i, text in enumerate(d.find_all('s')):
        if i % 2 == 0:
            english_sentences.append(text.text)
        else:
            vietnamese_sentences.append(text.text)

In [7]:
print(vietnamese_sentences[0])
print(english_sentences[0])

ADN của ông Dominique Strauss-Kahn " có dính líu đến cô phục vụ phòng "
Dominique Strauss-Kahn DNA " linked to maid "
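
The loop above relies on the `<s>` elements strictly alternating between English and Vietnamese. A minimal sketch of the same split on plain strings, with a check that every sentence has a counterpart, might look like this. The `segments` list and the `split_alternating` helper are illustrative names, not part of the notebook; the strings are taken from text that appears elsewhere in this notebook.

```python
# Stand-in for the text of consecutive <s> tags in one file:
# even positions hold English, odd positions hold the Vietnamese counterpart.
segments = [
    'Dominique Strauss-Kahn DNA " linked to maid "',
    'ADN của ông Dominique Strauss-Kahn " có dính líu đến cô phục vụ phòng "',
    'eye and ear',
    'mắt và tai',
]

def split_alternating(items):
    """Split an alternating sequence into (even-position, odd-position) lists."""
    evens = items[0::2]
    odds = items[1::2]
    if len(evens) != len(odds):
        raise ValueError("odd number of segments: a sentence has no counterpart")
    return evens, odds

english, vietnamese = split_alternating(segments)
pairs = list(zip(english, vietnamese))
print(pairs)
```

The length check is a cheap guard: if a file ever contained an unpaired `<s>` tag, every later pair in a simple even/odd split would be misaligned.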


### Vocabulary

In [0]:
# Create a counter
import collections

english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
vietnamese_words_counter = collections.Counter([word for sentence in vietnamese_sentences for word in sentence.split()])
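
The counters above map each whitespace-separated token to its frequency. As a small sketch on made-up sentences (not corpus data), the three quantities of interest are the total token count, the number of distinct tokens, and the most frequent tokens. Note that the number of distinct tokens is the length of the counter, which is unrelated to the number of sentences.

```python
import collections

# Toy sentences, already tokenized by whitespace like the corpus text.
sentences = ["the cat saw the dog ."]

tokens = [word for sentence in sentences for word in sentence.split()]
counter = collections.Counter(tokens)

total_tokens = len(tokens)      # every occurrence counts
unique_tokens = len(counter)    # one entry per distinct token
most_common_word, most_common_count = counter.most_common(1)[0]

print(total_tokens, unique_tokens, most_common_word, most_common_count)
```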

In [9]:
# Investigate the unique and most common words in both texts
print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')


880216 English words.
40903 unique English words.
10 Most common words in the English dataset:
"." "," "the" "to" "of" "and" "a" "in" """ "is"


In [10]:
print('{} Vietnamese words.'.format(len([word for sentence in vietnamese_sentences for word in sentence.split()])))
# len(vietnamese_sentences) is the sentence count; the number of unique words is len(vietnamese_words_counter)
print('{} Vietnamese sentences.'.format(len(vietnamese_sentences)))
print('10 Most common words in the Vietnamese dataset:')
print('"' + '" "'.join(list(zip(*vietnamese_words_counter.most_common(10)))[0]) + '"')

1201737 Vietnamese words.
45308 Vietnamese sentences.
10 Most common words in the Vietnamese dataset:
"." "," "và" "có" "của" "là" "một" """ "cho" "các"


### Sentiment Classification


In [0]:
# Install transformers and tensorflow if they are not already installed
!pip install transformers==2.3.0
!pip install tensorflow==2.1.0

In [0]:
from transformers import pipeline
import pandas as pd

In [0]:
sentiment_classifier = pipeline('sentiment-analysis')

We will only investigate the first 100 English sentences.

In [0]:
s = ([sentiment_classifier(sentence) for sentence in english_sentences[:100]])

In [0]:
sentiment_df = pd.DataFrame({'English_text':english_sentences[:100], 'Sentiment': s}, index = range(1,101))

In [67]:
english_sentences

['Dominique Strauss-Kahn DNA " linked to maid "',
 'Dominique Strauss-Kahn is being held under house arrest in New York',
 'DNA found on the clothes of a New York hotel maid who accused Dominique Strauss-Kahn of sexually assaulting her matches that of the former IMF chief , US media reports say .',
 'These unconfirmed reports cited sources close to the investigation .',
 'More tests from the room where the alleged attack took place are pending .',
 'Mr Strauss-Kahn denies the charges , and resigned as head of the International Monetary Fund last week to defend himself .',
 'He is under house arrest in a New York apartment , after a judge granted him a $ 1m ( £620,000 ) bail last week .',
 'Further tests',
 'Reports about the DNA samples came after authorities analysed the work clothes of the 32-year-old hotel maid who says she was assaulted in the New York Sofitel near Times Square on 14 May .',
 'Police and judicial spokespeople have declined to confirm the reports , carried by the As

In [16]:
sentiment_df.head()

| | English_text | Sentiment |
|---|---|---|
| 1 | Dominique Strauss-Kahn DNA " linked to maid " | [{'label': 'POSITIVE', 'score': 0.6252788}] |
| 2 | Dominique Strauss-Kahn is being held under hou... | [{'label': 'NEGATIVE', 'score': 0.9738129}] |
| 3 | DNA found on the clothes of a New York hotel m... | [{'label': 'NEGATIVE', 'score': 0.98521155}] |
| 4 | These unconfirmed reports cited sources close ... | [{'label': 'NEGATIVE', 'score': 0.8924382}] |
| 5 | More tests from the room where the alleged att... | [{'label': 'NEGATIVE', 'score': 0.99367356}] |
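
Each pipeline call returns a list holding one `{'label', 'score'}` dictionary, which is why the `Sentiment` column contains nested lists. A possible way to flatten these results into plain label and score values is sketched below. The `flatten` helper is a hypothetical name, and the two results are copied from the first rows of the output above.

```python
# Results copied from the first two rows of the output above.
results = [
    [{'label': 'POSITIVE', 'score': 0.6252788}],
    [{'label': 'NEGATIVE', 'score': 0.9738129}],
]

def flatten(results):
    """Turn [[{'label', 'score'}], ...] into a list of (label, score) tuples."""
    return [(r[0]['label'], r[0]['score']) for r in results]

rows = flatten(results)
labels = [label for label, _ in rows]
negative_share = labels.count('NEGATIVE') / len(labels)

print(rows)
print(negative_share)
```

The resulting tuples can be passed to `pd.DataFrame(rows, columns=['label', 'score'])` if separate columns are preferred over a column of nested lists.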


### Named Entity Recognition
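For the sake of performance, I will only investigate the first 100 English sentences. Note that `bert-base-cased` is a base checkpoint without a fine-tuned token-classification head, so the pipeline below attaches a freshly initialized one. The generic `LABEL_0`/`LABEL_1` tags and the scores close to 0.5 in the output therefore do not correspond to real entity types; a checkpoint fine-tuned for NER is needed for meaningful tags.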

In [0]:
speech_tagging = pipeline('ner', model= 'bert-base-cased')

In [0]:
tag = ([speech_tagging(sentence) for sentence in english_sentences[:100]])

In [20]:
print('The English text: {}'.format(english_sentences[0]))
print('\nNamed Entity recognized from the text:\n')
tag[0]

The English text: Dominique Strauss-Kahn DNA " linked to maid "

Named Entity recognized from the text:



[{'entity': 'LABEL_0', 'score': 0.5934121608734131, 'word': '[CLS]'},
 {'entity': 'LABEL_0', 'score': 0.5186386108398438, 'word': 'Dominique'},
 {'entity': 'LABEL_0', 'score': 0.684639036655426, 'word': 'Strauss'},
 {'entity': 'LABEL_1', 'score': 0.5115693211555481, 'word': '-'},
 {'entity': 'LABEL_0', 'score': 0.5856021642684937, 'word': 'Kahn'},
 {'entity': 'LABEL_0', 'score': 0.555425763130188, 'word': 'DNA'},
 {'entity': 'LABEL_1', 'score': 0.5249403715133667, 'word': '"'},
 {'entity': 'LABEL_1', 'score': 0.5365234017372131, 'word': 'linked'},
 {'entity': 'LABEL_1', 'score': 0.5083389282226562, 'word': 'to'},
 {'entity': 'LABEL_0', 'score': 0.5742794871330261, 'word': 'maid'},
 {'entity': 'LABEL_0', 'score': 0.539033055305481, 'word': '"'},
 {'entity': 'LABEL_1', 'score': 0.5839194655418396, 'word': '[SEP]'}]

In [0]:
tag_df = pd.DataFrame({'English_text':english_sentences[:100], 'Named Entity': tag}, index = range(1,101))

In [22]:
tag_df.head()

| | English_text | Named Entity |
|---|---|---|
| 1 | Dominique Strauss-Kahn DNA " linked to maid " | [{'word': '[CLS]', 'score': 0.5934121608734131... |
| 2 | Dominique Strauss-Kahn is being held under hou... | [{'word': '[CLS]', 'score': 0.6303530335426331... |
| 3 | DNA found on the clothes of a New York hotel m... | [{'word': '[CLS]', 'score': 0.6918258666992188... |
| 4 | These unconfirmed reports cited sources close ... | [{'word': '[CLS]', 'score': 0.6406530737876892... |
| 5 | More tests from the room where the alleged att... | [{'word': '[CLS]', 'score': 0.6268342137336731... |
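
The token-level output includes the special `[CLS]` and `[SEP]` markers, which are not words of the sentence. One way to post-process such output is sketched below: drop the special tokens and optionally keep only tokens above a score threshold. The `drop_special` helper and its `min_score` parameter are assumptions made for illustration; the entries are copied from the output shown earlier in this section.

```python
SPECIAL_TOKENS = {'[CLS]', '[SEP]', '[PAD]'}

# A few entries copied from the token-level output shown earlier.
tokens = [
    {'entity': 'LABEL_0', 'score': 0.5934121608734131, 'word': '[CLS]'},
    {'entity': 'LABEL_0', 'score': 0.5186386108398438, 'word': 'Dominique'},
    {'entity': 'LABEL_1', 'score': 0.5115693211555481, 'word': '-'},
    {'entity': 'LABEL_1', 'score': 0.5839194655418396, 'word': '[SEP]'},
]

def drop_special(tokens, min_score=0.0):
    """Keep real word pieces whose score is at least min_score."""
    return [t for t in tokens
            if t['word'] not in SPECIAL_TOKENS and t['score'] >= min_score]

kept = drop_special(tokens)
print([(t['word'], t['entity']) for t in kept])
```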


### Preprocess

In [23]:
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [0]:
def tokenize(x):
    x_tk = Tokenizer()
    x_tk.fit_on_texts(x)
    
    return x_tk.texts_to_sequences(x), x_tk

In [0]:
def pad(x, length = None):

    if length is None:
        length = max([len(sentence) for sentence in x])
    
    return pad_sequences(x, maxlen = length, padding = 'post')

In [0]:
def preprocess(x, y):
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk
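
To make the three steps above concrete (word-to-id mapping, post-padding, and the extra label dimension), here is a simplified pure-Python stand-in. It is not the Keras implementation: it only lowercases and splits on whitespace, whereas `Tokenizer` also filters punctuation. The helper names `fit_word_index`, `texts_to_ids`, and `pad_post` are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each lowercase token to an id starting at 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace every token with its integer id."""
    return [[word_index[w] for w in text.lower().split()] for text in texts]

def pad_post(sequences, length=None, value=0):
    """Right-pad with `value`; longer sequences keep their last `length` ids."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    return [seq[-length:] + [value] * (length - len(seq)) for seq in sequences]

texts = ["the cat sat", "the dog"]
word_index = fit_word_index(texts)
ids = texts_to_ids(texts, word_index)
padded = pad_post(ids)

# Add a trailing dimension of size 1, mirroring the reshape applied to the labels.
labels = [[[token_id] for token_id in seq] for seq in padded]

print(word_index)
print(padded)
```

Id 0 never appears in `word_index`, which is what lets it serve as the padding value.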

In [0]:
preproc_english_sentences, preproc_vietnamese_sentences, english_tokenizer, vietnamese_tokenizer =\
    preprocess(english_sentences, vietnamese_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_vietnamese_sequence_length = preproc_vietnamese_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
vietnamese_vocab_size = len(vietnamese_tokenizer.word_index)

In [28]:
print(max_english_sequence_length)
print(max_vietnamese_sequence_length)
print(english_vocab_size)
print(vietnamese_vocab_size)

128
170
30745
13955
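
The last two numbers count the entries of each `word_index`. Because those ids start at 1 and id 0 is used for padding, the ids that actually occur range from 0 up to and including the vocabulary size. Any lookup table or output layer indexed by id therefore needs one more slot than the printed count. A toy sketch of that off-by-one, using a made-up three-word index:

```python
# Toy 1-based word index, in the style of Tokenizer.word_index.
word_index = {'the': 1, 'cat': 2, 'sat': 3}

vocab_size = len(word_index)            # 3 known words
largest_id = max(word_index.values())   # also 3, because ids start at 1
num_ids = vocab_size + 1                # ids 0..3, with 0 used for padding

# A table with only `vocab_size` rows has no row for the largest id.
fits_small_table = largest_id < vocab_size
fits_full_table = largest_id < num_ids

print(fits_small_table, fits_full_table)
```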


### Machine Translation

In [0]:
# Import from the public tf.keras namespace rather than the private tensorflow.python package
from tensorflow.keras.models import Model
from tensorflow.keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from tensorflow.keras.layers import Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import sparse_categorical_crossentropy
tf.compat.v1.disable_eager_execution()


In [0]:
# Convert word ids to text
def logits_to_text(logits, tokenizer):
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
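
The decoding step picks, for every time step, the id with the highest score and looks up its word. A dependency-free sketch of the same idea follows, with the addition of stripping trailing `<PAD>` tokens so that short sentences print cleanly. The `argmax` and `decode` helpers and the probability rows are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of numbers."""
    return max(range(len(row)), key=row.__getitem__)

def decode(probabilities, index_to_word, pad_token='<PAD>'):
    """Pick the most likely id per time step and drop trailing padding."""
    words = [index_to_word[argmax(row)] for row in probabilities]
    while words and words[-1] == pad_token:
        words.pop()
    return ' '.join(words)

index_to_word = {0: '<PAD>', 1: 'mắt', 2: 'và', 3: 'tai'}

# One row per time step, one column per word id (illustrative values).
probabilities = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.2, 0.1, 0.1, 0.6],
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.1, 0.1, 0.0],
]

decoded = decode(probabilities, index_to_word)
print(decoded)
```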

In [0]:
def model(input_shape, output_sequence_length, english_vocab_size, vietnamese_vocab_size):
    """
    Build and compile a model that combines an embedding layer with a bidirectional GRU
    input_shape: Tuple of input shape
    output_sequence_length: Length of output sequence
    english_vocab_size: Number of unique English words in the dataset
    vietnamese_vocab_size: Number of unique Vietnamese words in the dataset
    """
    embedding_size = 64
    gru_dim = 128
    learning_rate = 0.01
    
    # Tokenizer ids start at 1 and id 0 is the padding value, so both the
    # embedding table and the output layer need vocab_size + 1 entries.
    input_seq = Input(shape = input_shape[1:])
    embedding = Embedding(input_dim = english_vocab_size + 1,
                         output_dim = embedding_size,
                         input_length = output_sequence_length)(input_seq)
    birnn = Bidirectional(GRU(gru_dim, return_sequences= True))(embedding)
    logits = TimeDistributed(Dense(units = vietnamese_vocab_size + 1))(birnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss = sparse_categorical_crossentropy,
                 optimizer = Adam(lr = learning_rate),
                 metrics = ['accuracy'])
    return model
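
The architecture above keeps one output per input time step, so the tensor shapes can be followed layer by layer. The sketch below computes those shapes and two parameter counts by hand, using the sizes printed earlier (170 time steps, 30745 English words, 13955 Vietnamese words) together with the hyperparameters from the function. The helper names are hypothetical, `target_vocab` stands for however many units the output layer has, and the recurrent layer's parameter count is left out on purpose.

```python
def layer_shapes(seq_len, embedding_size, gru_dim, target_vocab):
    """Per-layer output shapes (batch dimension written as None)."""
    return [
        ('input', (None, seq_len)),
        ('embedding', (None, seq_len, embedding_size)),
        # Bidirectional concatenates forward and backward outputs by default.
        ('bidirectional_gru', (None, seq_len, 2 * gru_dim)),
        ('time_distributed_dense', (None, seq_len, target_vocab)),
    ]

def embedding_params(vocab, dim):
    """One vector of size `dim` per vocabulary row."""
    return vocab * dim

def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return (inputs + 1) * units

shapes = layer_shapes(seq_len=170, embedding_size=64, gru_dim=128, target_vocab=13955)
for name, shape in shapes:
    print(name, shape)

print(embedding_params(30745, 64))
print(dense_params(2 * 128, 13955))
```

The output layer dominates the two counts shown because it maps each 256-dimensional state onto the whole target vocabulary.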



In [34]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())


[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 1411228472727408957
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14764367242000809935
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 10246933403474899074
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 15505902797
locality {
  bus_id: 1
  links {
  }
}
incarnation: 13607807847511005210
physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"
]


In [0]:
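# Pad the English ids to the Vietnamese sequence length so input and output time steps line up
x = pad(preproc_english_sentences, max_vietnamese_sequence_length)
# x is already 2-D (samples, time steps), so this reshape leaves its shape unchanged
x = x.reshape((-1, preproc_vietnamese_sentences.shape[-2])) 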


In [38]:
translation_model = model(
    x.shape,
    max_vietnamese_sequence_length,
    english_vocab_size,
    vietnamese_vocab_size)

translation_model.fit(x, preproc_vietnamese_sentences, batch_size= 128, epochs= 5, validation_split=0.2)

Train on 36246 samples, validate on 9062 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f297c61bc18>
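
The sample counts in the training log follow directly from `validation_split=0.2`: Keras holds out the last fraction of the data, before any shuffling, as the validation set. A short arithmetic sketch reproduces those counts and the number of batches per epoch for `batch_size=128`. The `split_counts` helper is a hypothetical name.

```python
import math

def split_counts(num_samples, validation_split, batch_size):
    """Return (train, validation, batches per epoch) for a fractional split."""
    train = int(num_samples * (1.0 - validation_split))
    validation = num_samples - train
    batches = math.ceil(train / batch_size)
    return train, validation, batches

train, validation, batches = split_counts(45308, 0.2, 128)
print(train, validation, batches)
```

Because the held-out part is taken from the end of the arrays, shuffling the sentence pairs beforehand is worth considering when the files are ordered by topic or source.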

In [81]:
y_id_to_word = {value: key for key, value in vietnamese_tokenizer.word_index.items()}
y_id_to_word[0] = '<PAD>'

sentence = 'eye and ear'
# Every word must already be in the tokenizer vocabulary; an unseen word raises KeyError
sentence = [english_tokenizer.word_index[word] for word in sentence.split()]
sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
sentences = np.array([sentence[0], x[0]])
predictions = translation_model.predict(sentences, batch_size=len(sentences))

print('Sample 1:')
# Use a loop variable that does not shadow the training array x
print(' '.join([y_id_to_word[np.argmax(step)] for step in predictions[0]]))

Sample 1:
mắt và tai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <