<a href="https://colab.research.google.com/github/AliKarimiENT/Machine_Translation_EN_FA/blob/main/Machine_Translation_EN_FA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Machine Translation Project

# Install libraries

In [1]:
!pip install -U -q PyDrive
!pip install tensorflow
!pip install -U numpy==1.21
!pip install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.21
  Downloading numpy-1.21.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.6
    Uninstalling numpy-1.21.6:
      Successfully uninstalled numpy-1.21.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed numpy-1.21.0


# Dataset




> The data are located in Google Drive. The TEP.en-fa.en file contains English sentences, and the TEP.en-fa.fa file contains their Farsi translations.

Load the English and Farsi data from these files by running the cell below.

In [2]:
import os

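def load_data(path):
    """
    Load a dataset file and return its lines as a list of strings.
    """
    # Read as UTF-8 explicitly so the Farsi text decodes correctly
    # regardless of the platform's default encoding.
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()

    return data.split('\n')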

In [3]:
import helper
from keras.layers import GRU, LSTM, Input, Dense, TimeDistributed, Bidirectional
from keras.layers import Activation, RepeatVector, Embedding
from keras.models import Model
from keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam

# Connect to Google Drive to load data
from google.colab import drive
drive.mount('/content/drive')

# Load English data
english_sentences = load_data('/content/drive/MyDrive/University Tehran /TEP.en-fa.en')

# Load Farsi data 
farsi_sentences = load_data('/content/drive/MyDrive/University Tehran /TEP.en-fa.fa')

print('Dataset Loaded')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset Loaded


# Files

Each line of TEP.en-fa.en contains an English sentence, and the same line of TEP.en-fa.fa contains its Farsi translation. For example, here are the first two lines of each file:

In [4]:
for sample in range(2):
  print('TEP.en-fa.en Line {}:  {}'.format(sample + 1, english_sentences[sample]))
  print('TEP.en-fa.fa Line {}:  {}'.format(sample + 1, farsi_sentences[sample]))

TEP.en-fa.en Line 1:  raspy breathing .
TEP.en-fa.fa Line 1:  صداي خر خر .
TEP.en-fa.en Line 2:  dad .
TEP.en-fa.fa Line 2:  پدر .
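
Because the two files are aligned line by line, one quick sanity check before training is to confirm that both lists have the same length and that no pair has text on only one side. The sketch below is one possible check and is not part of the original notebook: `check_alignment` and the sample lists are hypothetical names standing in for the real `english_sentences` and `farsi_sentences`. The trailing empty string in the samples mirrors what `data.split('\n')` produces when a file ends with a newline.

```python
def check_alignment(source_lines, target_lines):
    """Return (same_length, indices of pairs where exactly one side is empty)."""
    same_length = len(source_lines) == len(target_lines)
    # zip stops at the shorter list, so only lines that have a partner are inspected.
    half_empty = [
        i
        for i, (src, tgt) in enumerate(zip(source_lines, target_lines))
        if bool(src.strip()) != bool(tgt.strip())
    ]
    return same_length, half_empty


# Tiny hypothetical sample mirroring the two lines printed above.
sample_en = ["raspy breathing .", "dad .", ""]
sample_fa = ["صداي خر خر .", "پدر .", ""]

same_length, half_empty = check_alignment(sample_en, sample_fa)
print(same_length, half_empty)
```

If `same_length` is false or `half_empty` is non-empty, the pairs are probably misaligned and the files are worth inspecting before tokenizing.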


# Complexity of the vocabulary

In [5]:
import collections
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
farsi_words_counter = collections.Counter([word for sentence in farsi_sentences for word in sentence.split()])

print('{} English words.'.format(len(english_words_counter)))
print('{} Farsi words.'.format(len(farsi_words_counter)))

108149 English words.
149727 Farsi words.
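
The numbers above count distinct whitespace-separated tokens, so they measure vocabulary size rather than corpus size. As a rough sketch of how the same `Counter` can also report the total token count and the most frequent words, the example below uses a hypothetical three-sentence corpus in place of `english_sentences`.

```python
import collections

# Hypothetical mini-corpus standing in for english_sentences.
sentences = ["dad .", "raspy breathing .", "dad is breathing ."]

counter = collections.Counter(
    word for sentence in sentences for word in sentence.split()
)

unique_words = len(counter)          # vocabulary size (what the cell above prints)
total_words = sum(counter.values())  # total number of tokens in the corpus
top_three = counter.most_common(3)   # (word, count) pairs, most frequent first

print(unique_words, total_words, top_three)
```

Comparing `unique_words` with `total_words` gives a sense of how often words repeat, which is one way to judge how hard the vocabulary will be to learn.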


# Tokenize

In [6]:
from keras.preprocessing.text import Tokenizer

def tokenize(x):
  """
    x: List of sentences to tokenize

    It returns Tuple of (tokenized x as lists of word ids, tokenizer fitted on x)
  """
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(x)
  sequences = tokenizer.texts_to_sequences(x)
  return sequences, tokenizer

eng_text_tokenized , en_text_tokenizer = tokenize(english_sentences)
fa_text_tokenized , fa_text_tokenizer = tokenize(farsi_sentences)

# Printing the full word_index exceeds the notebook's IOPub data rate limit,
# so print only the vocabulary sizes and the first few entries.
print('English vocabulary size:', len(en_text_tokenizer.word_index))
print(list(en_text_tokenizer.word_index.items())[:10])

print('Farsi vocabulary size:', len(fa_text_tokenizer.word_index))
print(list(fa_text_tokenizer.word_index.items())[:10])
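
With its default settings, the Keras `Tokenizer` lowercases the text, strips most punctuation, splits on spaces, and assigns integer ids by descending word frequency, starting at 1 so that 0 stays free for padding. The following is a simplified pure-Python sketch of that idea, not the Keras implementation. `split_words`, `build_word_index`, and `texts_to_ids` are hypothetical helper names, and the punctuation set is an assumption modeled on the Keras default filter.

```python
import collections

# Assumed punctuation set, modeled on the Keras Tokenizer default `filters`.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an id, most frequent first; ids start at 1 (0 is for padding)."""
    counts = collections.Counter(w for t in texts for w in split_words(t))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}


def texts_to_ids(texts, word_index):
    """Convert each text to a list of ids, skipping words missing from the index."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


texts = ["Dad is breathing .", "raspy breathing ."]
word_index = build_word_index(texts)
print(word_index)
print(texts_to_ids(texts, word_index))
```

Note that under this filter the standalone `.` tokens seen in the data are dropped, so they never receive an id.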



# Padding

When sequences of word ids are batched together, every sequence must have the same length. We add padding to the end of each sequence so that all of them match the length of the longest one.

In [9]:
import numpy as np 
from keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
  """
    x: List of sequences
    length: Length to pad the sequences to. If None, use the length of the longest sequence in x

    It returns the padded numpy array of sequences
  """
  if length is None:
    length = max(len(sequence) for sequence in x)
  return pad_sequences(x, maxlen=length, padding='post')



# Pad tokenized output
en_padded = pad(eng_text_tokenized)
fa_padded = pad(fa_text_tokenized)
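
To make the effect of post-padding concrete, here is a minimal pure-Python sketch of the same idea on plain lists. It is an illustration only, not how `pad_sequences` is implemented: `pad_post` is a hypothetical helper, it returns nested lists instead of a NumPy array, and it raises an error instead of truncating sequences that are longer than the requested length.

```python
def pad_post(sequences, length=None, value=0):
    """Right-pad each sequence with `value` up to `length` (default: the longest sequence)."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    if any(len(seq) > length for seq in sequences):
        # This sketch refuses to truncate; Keras pad_sequences would cut the sequence instead.
        raise ValueError("length is shorter than the longest sequence")
    return [list(seq) + [value] * (length - len(seq)) for seq in sequences]


# Hypothetical token ids for three sentences of different lengths.
token_ids = [[2, 3, 1], [4, 1], [5]]
print(pad_post(token_ids))
print(pad_post(token_ids, length=5))
```

The padding value 0 is safe here because the tokenizer never assigns id 0 to a word.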


# Preprocess

In [14]:
def preprocess(x,y):
  """
    x: Feature list of sentences
    y: Label list of sentences

    It returns Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
  """

  preprocess_x , x_tk = tokenize(x)
  preprocess_y , y_tk = tokenize(y)

  preprocess_x = pad(preprocess_x)
  preprocess_y = pad(preprocess_y)

  # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
  preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
  return preprocess_x, preprocess_y, x_tk, y_tk
 
preproc_english_sentences, preproc_farsi_sentences, english_tokenizer, farsi_tokenizer = preprocess(english_sentences, farsi_sentences)

print('Data preprocess done !')

Data preprocess done !

In [16]:
print(preproc_english_sentences.shape)
print(preproc_farsi_sentences.shape)

(612087, 34)
(612087, 33, 1)


# Use RNN model 
A basic RNN model is a good baseline for sequence data.
We are going to build an RNN that translates English to Farsi.