<a href="https://colab.research.google.com/github/AliKarimiENT/Machine_Translation_EN_FA/blob/main/Machine_Translation_EN_FA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Machine Translation Project

# Install libraries

In [1]:
!pip install -U -q PyDrive
!pip install tensorflow
!pip install -U numpy==1.21
!pip install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.21
  Downloading numpy-1.21.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.6
    Uninstalling numpy-1.21.6:
      Successfully uninstalled numpy-1.21.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed numpy-1.21.0


# Dataset




> The data are located in Google Drive. The TEP.en-fa.en file contains English sentences, and the TEP.en-fa.fa file contains their Farsi translations.

Load the English and Farsi data from these files by running the cell below.

In [2]:
import os

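def load_data(path):
    """
    Load a text file and return its lines as a list of strings
    """
    # Read as UTF-8 explicitly so the Farsi text decodes the same way on every platform
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()

    return data.split('\n')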

In [9]:
import helper
from keras.layers import GRU, LSTM, Input, Dense, TimeDistributed, Bidirectional
from keras.layers import Activation, Embedding, RepeatVector
from keras.models import Model
from keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam

# Connect to Google Drive to load data
from google.colab import drive
drive.mount('/content/drive')

# Load English data
english_sentences = load_data('/content/drive/MyDrive/University Tehran /TEP.en-fa.en')[0:13000]

# Load Farsi data 
farsi_sentences = load_data('/content/drive/MyDrive/University Tehran /TEP.en-fa.fa')[0:13000]

print('Dataset Loaded')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset Loaded


# Files

Each line of TEP.en-fa.en contains an English sentence, and the same line of TEP.en-fa.fa contains its Farsi translation. For example, here are the first two lines of each file:

In [10]:
for sample in range(2):
  print('TEP.en-fa.en Line {}:  {}'.format(sample + 1, english_sentences[sample]))
  print('TEP.en-fa.fa Line {}:  {}'.format(sample + 1, farsi_sentences[sample]))

TEP.en-fa.en Line 1:  raspy breathing .
TEP.en-fa.fa Line 1:  صداي خر خر .
TEP.en-fa.en Line 2:  dad .
TEP.en-fa.fa Line 2:  پدر .


# Complexity of the vocabulary

In [11]:
import collections
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
farsi_words_counter = collections.Counter([word for sentence in farsi_sentences for word in sentence.split()])

print('{} English words.'.format(len(english_words_counter)))
print('{} Farsi words.'.format(len(farsi_words_counter)))

7412 English words.
10816 Farsi words.
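
The counting step above can be tried on a toy corpus. This is a minimal sketch: the sentences are made up for illustration and are not taken from the TEP dataset. Because the sentences are split on whitespace only, punctuation such as `.` is counted as a word.

```python
import collections

# Toy corpus (hypothetical sentences, not taken from the TEP dataset)
toy_sentences = ["dad is here .", "dad is not here .", "hello dad ."]

# Count whitespace-separated tokens, the same way as in the cell above
toy_counter = collections.Counter(
    word for sentence in toy_sentences for word in sentence.split()
)

vocab_size = len(toy_counter)           # number of distinct tokens
top_three = toy_counter.most_common(3)  # most frequent tokens first

print(vocab_size)
print(top_three)
```

Calling `most_common` on the real counters is a quick way to check whether the vocabulary is dominated by a few very frequent words.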


# Tokenize

In [12]:
from keras.preprocessing.text import Tokenizer

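def tokenize(x):
  """
    x: List of sentences to tokenize
    It returns Tuple of (tokenized sentences, fitted tokenizer)
  """
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(x)
  sequences = tokenizer.texts_to_sequences(x)
  return sequences, tokenizer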

eng_text_tokenized , en_text_tokenizer = tokenize(english_sentences)
print(en_text_tokenizer.word_index)

fa_text_tokenized , fa_text_tokenizer = tokenize(farsi_sentences)
print(fa_text_tokenizer.word_index)

{'،': 1, 'را': 2, 'من': 3, 'به': 4, 'تو': 5, 'از': 6, 'که': 7, 'اين': 8, 'و': 9, 'يک': 10, 'اون': 11, 'ما': 12, 'كه': 13, 'در': 14, 'با': 15, 'نه': 16, '؛': 17, 'بايد': 18, 'براي': 19, 'چي': 20, 'تا': 21, 'بود': 22, 'هم': 23, 'اينجا': 24, 'باشه': 25, 'خوب': 26, 'همه': 27, 'خيلي': 28, 'کن': 29, 'اونا': 30, 'شما': 31, 'نيست': 32, 'فقط': 33, 'چيزي': 34, 'ديگه': 35, 'داره': 36, 'فکر': 37, 'دارم': 38, 'چه': 39, 'آره': 40, 'حالا': 41, 'ميشه': 42, 'کار': 43, 'شده': 44, 'اي': 45, 'ميکنم': 46, 'اما': 47, 'کنم': 48, 'داري': 49, 'هيچ': 50, 'چرا': 51, 'يه': 52, 'اگر': 53, 'ميخوام': 54, 'وقتي': 55, 'ـ': 56, 'هر': 57, 'منو': 58, 'بده': 59, 'توي': 60, 'هست': 61, 'الان': 62, 'آقاي': 63, 'شد': 64, 'هاي': 65, 'داريم': 66, 'پس': 67, 'يا': 68, 'بيرون': 69, 'هي': 70, 'انجام': 71, 'ميکني': 72, 'اوه': 73, 'اگه': 74, 'بهش': 75, 'مثل': 76, 'خوبه': 77, 'پيدا': 78, 'دوست': 79, 'اونجا': 80, 'کنيم': 81, 'راه': 82, 'بريم': 83, 'ولي': 84, 'کني': 85, 'همين': 86, 'ي': 87, 'خب': 88, 'دست': 89, 'بيا': 90, 'ميدوني': 91, 
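
To make the `word_index` printed above easier to read, here is a simplified pure-Python sketch of the idea behind it: words are ranked by frequency and numbered from 1, so the most frequent word gets ID 1. The helper names `build_word_index` and `to_sequences` are hypothetical and only approximate Keras' `Tokenizer`, which additionally strips punctuation by default.

```python
import collections

def build_word_index(sentences):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = collections.Counter(
        word for sentence in sentences for word in sentence.lower().split()
    )
    # ID 0 is left unused so it can later serve as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(sentences, word_index):
    """Replace every known word with its ID; unknown words are skipped."""
    return [
        [word_index[w] for w in sentence.lower().split() if w in word_index]
        for sentence in sentences
    ]

# Toy sentences (hypothetical, for illustration only)
toy_sentences = ["the cat sat", "the dog sat down", "the end"]
toy_index = build_word_index(toy_sentences)
toy_sequences = to_sequences(toy_sentences, toy_index)

print(toy_index)
print(toy_sequences)
```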

# Padding

When we batch sequences of word IDs together, every sequence needs to be the same length. We add padding (zeros) to the end of the shorter sequences so that they all match the length of the longest one.

In [13]:
import numpy as np 
from keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
  """
    x: List of sequences
    length: Length to pad the sequences to. If None, use the length of the longest sequence in x

    It returns the padded numpy array of sequences
  """
  if length is None:
    length = max(len(sequence) for sequence in x)
  return pad_sequences(x, maxlen=length, padding='post')



# Pad tokenized output
en_padded = pad(eng_text_tokenized)
fa_padded = pad(fa_text_tokenized)
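
As a small illustration of what post-padding does, here is a plain-Python sketch that works on lists instead of NumPy arrays. `pad_post` is a hypothetical helper, not part of Keras. One difference to keep in mind: when a sequence is longer than the target length, this sketch drops tokens from the end, whereas `pad_sequences` by default drops them from the start.

```python
def pad_post(sequences, length=None, value=0):
    """Right-pad (or right-truncate) each list of IDs to a common length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    # [value] * n is an empty list when n <= 0, so long sequences are only cut
    return [seq[:length] + [value] * (length - len(seq)) for seq in sequences]

# Toy ID sequences of different lengths (hypothetical values)
toy_sequences = [[1, 3, 2], [1, 4, 2, 5], [1, 6]]

padded = pad_post(toy_sequences)
shortened = pad_post(toy_sequences, length=2)

print(padded)
print(shortened)
```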


# Preprocess

In [14]:
def preprocess(x,y):
  """
    x: Feature list of sentences
    y: Label list of sentences

    It returns Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
  """

  preprocess_x , x_tk = tokenize(x)
  preprocess_y , y_tk = tokenize(y)

  preprocess_x = pad(preprocess_x)
  preprocess_y = pad(preprocess_y)

  # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
  preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
  return preprocess_x, preprocess_y, x_tk, y_tk
 
preproc_english_sentences, preproc_farsi_sentences, english_tokenizer, farsi_tokenizer = preprocess(english_sentences, farsi_sentences)
print('Data preprocess done !')

Data preprocess done !

In [15]:
print(preproc_english_sentences.shape)
print(preproc_farsi_sentences.shape)

(13000, 34)
(13000, 30, 1)
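
The trailing `1` in the Farsi shape comes from the `reshape` call in `preprocess`. A minimal NumPy sketch with made-up label IDs shows the effect: the values are unchanged, and each ID is simply wrapped in its own length-1 axis.

```python
import numpy as np

# Hypothetical padded label IDs: 2 sentences, 4 time steps each
toy_labels = np.array([[1, 3, 2, 0],
                       [1, 4, 2, 5]])

# Append a trailing axis of size 1, as preprocess() does
toy_labels_3d = toy_labels.reshape(*toy_labels.shape, 1)

print(toy_labels.shape)     # (2, 4)
print(toy_labels_3d.shape)  # (2, 4, 1)
```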


# 1.  Use **RNN** model 

A basic RNN model is a good baseline for sequence data.
We are going to build an RNN that translates English to Farsi.

In [16]:
def rnn_model(input_shape, farsi_vocab_size):
  """
    input_shape: Tuple of input shape
    farsi_vocab_size: Number of unique Farsi words in the dataset
  """
  learning_rate = 0.01
  inputs = Input(shape=input_shape[1:])
  x = GRU(512, return_sequences=True)(inputs)
  # Note: the ReLU clips negative scores before the softmax; a linear Dense layer is the more common choice
  x = TimeDistributed(Dense(farsi_vocab_size, activation='relu'))(x)
  predictions = Activation('softmax')(x)
  model = Model(inputs=inputs, outputs=predictions)
  model.compile(loss=sparse_categorical_crossentropy,
                optimizer=Adam(learning_rate),
                metrics=['accuracy'])
  return model


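# Reshape the input to work with a basic RNN.
# Padding to the Farsi length (30) also truncates English sequences longer than that,
# because pad_sequences cuts sequences down to the requested length.
temp_x = pad(preproc_english_sentences, preproc_farsi_sentences.shape[1])
temp_x = temp_x.reshape((-1, preproc_farsi_sentences.shape[-2], 1))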

# Train the neural network
simple_rnn_model = rnn_model(
    temp_x.shape,
    # preproc_farsi_sentences.shape[1],
    # len(english_tokenizer.word_index) + 1, # Add 1 because padding introduces 0
    len(farsi_tokenizer.word_index) + 1 # Add 1 because padding introduces 0
)

simple_rnn_model.summary()
simple_rnn_model.fit(temp_x, preproc_farsi_sentences, batch_size=1024, epochs=10, validation_split=0.2)


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 30, 1)]           0         
                                                                 
 gru (GRU)                   (None, 30, 512)           791040    
                                                                 
 time_distributed (TimeDistr  (None, 30, 10794)        5537322   
 ibuted)                                                         
                                                                 
 activation (Activation)     (None, 30, 10794)         0         
                                                                 
Total params: 6,328,362
Trainable params: 6,328,362
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f3c78925350>

In [17]:
# verbose=0 suppresses the progress output during evaluation.
# Note: this evaluates on the full dataset, including the samples used for training,
# and the accuracy is averaged over all time steps, including padded positions.
simple_rnn_model_scores = simple_rnn_model.evaluate(temp_x, preproc_farsi_sentences, verbose=0)
print("Model Accuracy: %.2f%%" % (simple_rnn_model_scores[1] * 100))

Model Accuracy: 77.33%
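
An accuracy figure alone says little about translation quality, so it helps to turn the model's output back into words. The sketch below shows one way to do that. `logits_to_text` is a hypothetical helper, and the vocabulary and scores are hand-made for illustration; they are not outputs of the model above.

```python
import numpy as np

def logits_to_text(probabilities, word_index, pad_token="<PAD>"):
    """Turn a (time_steps, vocab_size) array of scores into a sentence."""
    index_to_word = {idx: word for word, idx in word_index.items()}
    index_to_word[0] = pad_token  # ID 0 is the padding value
    best_ids = np.argmax(probabilities, axis=1)  # most likely word per time step
    return " ".join(index_to_word[int(i)] for i in best_ids)

# Hypothetical 3-word vocabulary and hand-made scores for 3 time steps
toy_word_index = {"hello": 1, "world": 2, "dad": 3}
toy_probs = np.array([
    [0.1, 0.7, 0.1, 0.1],    # step 1: highest score at ID 1
    [0.1, 0.1, 0.2, 0.6],    # step 2: highest score at ID 3
    [0.9, 0.0, 0.05, 0.05],  # step 3: highest score at ID 0 (padding)
])

decoded = logits_to_text(toy_probs, toy_word_index)
print(decoded)
```

With the trained model, the same helper could presumably be called as `logits_to_text(simple_rnn_model.predict(temp_x[:1])[0], farsi_tokenizer.word_index)` to inspect the first predicted sentence.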


# 2. Use **Embedding** Model

This model feeds word embeddings, rather than raw word IDs, into the RNN. An embedding is a vector representation of a word; after training, words that are used in similar ways end up close to each other in n-dimensional space, where n is the size of the embedding vectors.
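
Conceptually, an embedding layer is a lookup table with one row per word ID. The NumPy sketch below illustrates the lookup and one common way to measure closeness (cosine similarity). The sizes are hypothetical and the matrix is random; in a real model the values are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 6      # hypothetical: 5 words plus ID 0 for padding
embedding_dim = 4   # the "n" in n-dimensional space

# One row per word ID; random here, learned during training in a real model
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A padded sentence of word IDs, like the output of tokenize() and pad()
sentence_ids = np.array([1, 3, 2, 0])

# The lookup is plain row indexing: each ID is replaced by its vector
sentence_vectors = embedding_matrix[sentence_ids]

def cosine_similarity(a, b):
    """Closeness of two vectors: 1.0 means they point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(sentence_vectors.shape)  # (4, 4)
print(cosine_similarity(embedding_matrix[1], embedding_matrix[3]))
```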