
In [2]:
# Mounting the GDrive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 1. Reading the collected datasets

## 1.1 Reading the Hindi-English Parallel Corpus

In [3]:
train_en_hi = []
train_hi_hi = []

# The encoding is given explicitly so that the Devanagari text is decoded the same way on every platform
with open('/content/drive/MyDrive/corpus/train/iitb_train.en.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped English sentences to the list
    train_en_hi.append(line.strip())

with open('/content/drive/MyDrive/corpus/train/iitb_train.hi.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped Hindi sentences to the list
    train_hi_hi.append(line.strip())


print(f'{len(train_en_hi)} english lines read from IITB_English_Hindi Corpus')
print(f'{len(train_hi_hi)} hindi lines read from IITB_English_Hindi Corpus')

1603080 english lines read from IITB_English_Hindi Corpus
1603080 hindi lines read from IITB_English_Hindi Corpus


## 1.2 Reading the Tamil-English Parallel corpus

In [4]:
train_en_ta = []
train_ta_ta = []

# The encoding is given explicitly so that the Tamil text is decoded the same way on every platform
with open('/content/drive/MyDrive/corpus/train/cvit_train.en.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped English sentences to the list
    train_en_ta.append(line.strip())

with open('/content/drive/MyDrive/corpus/train/cvit_train.ta.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped Tamil sentences to the list
    train_ta_ta.append(line.strip())

print(f'{len(train_en_ta)} english lines read from PIB_English_Tamil Corpus')
print(f'{len(train_ta_ta)} tamil lines read from PIB_English_Tamil Corpus')

115968 english lines read from PIB_English_Tamil Corpus
115968 tamil lines read from PIB_English_Tamil Corpus
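
Both loading cells above follow the same pattern, so it could be factored into one helper. The following is a minimal sketch and not part of the original notebook: `read_parallel` is a hypothetical name, and UTF-8 is an assumption about how the corpus files are encoded. It also checks that both sides have the same number of lines, which is the property the printed counts above confirm by eye.

```python
def read_parallel(src_path, tgt_path, encoding='utf-8'):
    """Read two line-aligned text files into lists of stripped sentences.

    Hypothetical helper; the UTF-8 default is an assumption about the files.
    """
    with open(src_path, 'r', encoding=encoding) as src_file:
        src = [line.strip() for line in src_file]
    with open(tgt_path, 'r', encoding=encoding) as tgt_file:
        tgt = [line.strip() for line in tgt_file]

    # A parallel corpus pairs line i of one file with line i of the other,
    # so differing line counts mean the alignment is broken.
    if len(src) != len(tgt):
        raise ValueError(f'Line count mismatch: {len(src)} vs {len(tgt)}')
    return src, tgt
```

Under these assumptions, the Hindi cell would reduce to a single call that passes the two `iitb_train` paths and unpacks the result into `train_en_hi` and `train_hi_hi`.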


# 2. Data preprocessing

Installing `tensorflow-text` and `einops` (Einstein-inspired notation)

In [6]:
# Tensorflow package for text related operations and modules
!pip install tensorflow-text

# Installing einops for writing deep learning code better and more efficiently
!pip install einops

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow-text
  Downloading tensorflow_text-2.11.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.11.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting einops
  Downloading einops-0.6.0-py3-none-any.whl (41 kB)
Installing collected packages: einops
Successfully installed einops-0.6.0


Importing the packages

In [7]:
import numpy as np

import einops
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import tensorflow as tf
import tensorflow_text as tf_text

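Every sentence should be handled as a `tf.string` tensor, because the goal is to export the finished model with `tf.saved_model`, and the text preprocessing then has to run as TensorFlow operations inside the exported model.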

## 2.1 Normalizing the sentences

Note: the model is initially trained without removing the bracketed words from the Hindi corpus.

### 2.1.1 Unicode Normalization

Unicode normalization ensures that canonically equivalent character sequences are stored as the same code points, so that identical-looking text is not treated as different words by the translation model. For both Hindi and Tamil we use `NFC` (Normalization Form C), which is canonical decomposition followed by canonical composition.

Reference: https://unicode.org/reports/tr15/

In [22]:
# Function that takes in a tensor and normalizes it according to NFC and returns the text

def unicode_normalize(text):
  text = tf_text.normalize_utf8(text, 'NFC')
  return text

# temp = tf.constant(train_hi_hi[1])
# temp = unicode_normalize(temp)
# print(temp)
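
To see what NFC does, the following minimal sketch uses Python's standard `unicodedata` module as a stand-in for the TensorFlow op, so it runs without TensorFlow. The helper name `nfc` and the sample syllable are illustrative choices, not taken from the corpus.

```python
import unicodedata

def nfc(text: str) -> str:
    """Return the NFC form of a Python string (stdlib, not the TensorFlow op)."""
    return unicodedata.normalize('NFC', text)

# The Tamil syllable 'ko' written two ways: with the single vowel sign O
# (U+0BCA), or with its canonical decomposition, the vowel signs
# E (U+0BC6) followed by AA (U+0BBE).
composed = '\u0b95\u0bca'
decomposed = '\u0b95\u0bc6\u0bbe'

# The raw strings hold different code points although they render alike.
print(composed == decomposed)

# After NFC both spellings are the same string.
print(nfc(composed) == nfc(decomposed))
```

Without this step, the two spellings would end up as separate vocabulary entries.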

### 2.1.2 Converting sentence to lowercase

Lowercasing removes case ambiguity. Since English is used as the intermediary language, the English corpus must be converted to lowercase so that the model treats 'Car' and 'car' as the same word.

In [24]:
# Function that takes in a tensor and converts it to lowercase

def lowercase(text):
  text = tf.strings.lower(text)
  return text

# temp = tf.constant(train_en_hi[1])
# temp = lowercase(temp)
# print(temp)
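
As a quick illustration in plain Python (a sketch using `str.lower` rather than `tf.strings.lower`, with made-up sample words), lowercasing merges the English case variants, while Hindi and Tamil text passes through unchanged because neither script has letter case:

```python
# Three spellings of the same English word collapse to one entry.
english_variants = ['Car', 'CAR', 'car']
unique_lowercased = {word.lower() for word in english_variants}
print(unique_lowercased)

# Devanagari and Tamil have no upper/lower case, so lower() is a no-op.
hindi_word = '\u0928\u092e\u0938\u094d\u0924\u0947'            # Hindi greeting
tamil_word = '\u0bb5\u0ba3\u0b95\u0bcd\u0b95\u0bae\u0bcd'      # Tamil greeting
print(hindi_word.lower() == hindi_word)
print(tamil_word.lower() == tamil_word)
```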

### 2.1.3 Replacing some special characters

The characters `?`, `!`, `.`, `,` and the space character are kept in the sentence; all other special characters are removed.

Finally, a space is inserted on both sides of each kept punctuation mark, so that punctuation is separated from the neighbouring words.

In [56]:
# Takes a string tensor, keeps lowercase letters, spaces and the punctuation
# marks `?`, `!`, `.`, `,`, and removes every other character.
# Each kept punctuation mark is then padded with a space on both sides.
# TODO: extend the character class to cover Hindi and Tamil; as written it
#       keeps only ASCII a-z and would strip all Hindi and Tamil text.
# TODO: collapse the runs of more than one space that the padding can create.

def punctuate(text):
  # Drop everything except spaces, a-z and the four punctuation marks
  text = tf.strings.regex_replace(text, '[^ a-z.?!,]', '')
  # Surround each punctuation mark with spaces (\0 is the whole match)
  text = tf.strings.regex_replace(text, '[.,?!]', r' \0 ')
  text = tf.strings.strip(text)
  return text
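
One possible way to cover Hindi and Tamil and to collapse repeated spaces is sketched below in plain Python with the standard `re` module, so the logic can be tried without TensorFlow. This is an assumption about how the open points might be handled, not the notebook's final implementation: `punctuate_sketch` is a hypothetical name, and the sketch simply keeps the whole Devanagari block (U+0900 to U+097F) and the whole Tamil block (U+0B80 to U+0BFF), which also keeps script-specific signs such as the danda.

```python
import re

# Characters to remove: anything that is not a space, a-z, one of . ? ! ,
# or a code point in the Devanagari or Tamil Unicode blocks.
REMOVE_PATTERN = re.compile(r'[^ a-z.?!,\u0900-\u097F\u0B80-\u0BFF]')
PUNCT_PATTERN = re.compile(r'([.,?!])')
SPACES_PATTERN = re.compile(r' +')

def punctuate_sketch(text: str) -> str:
    """Hypothetical plain-Python variant of the punctuate step."""
    text = REMOVE_PATTERN.sub('', text)
    # Put a space on both sides of each kept punctuation mark.
    text = PUNCT_PATTERN.sub(r' \1 ', text)
    # Collapse the runs of spaces created by the previous step.
    text = SPACES_PATTERN.sub(' ', text)
    return text.strip()

print(punctuate_sketch('hello,world!!'))
print(punctuate_sketch('\u0915\u094d\u092f\u093e \u0939\u093e\u0932 \u0939\u0948?'))
```

The same three steps should carry over to `tf.strings.regex_replace`, although the pattern syntax accepted there would need to be checked against the regex engine TensorFlow uses.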

### 2.1.4 Adding START and END tokens

START and END tokens tell the model where a sequence begins and where it ends. The model we will build operates on fixed-length sequences, so these tokens mark the boundaries of the actual sentence inside each sequence.

This is expected to improve model performance and translation quality.
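
A minimal plain-Python sketch of this step follows. The token strings `[START]`, `[END]` and `[PAD]`, and the helper names, are hypothetical choices for illustration; the notebook does not fix them. The padding helper is included only to show why the END token matters once sentences are brought to a fixed length.

```python
START_TOKEN = '[START]'  # hypothetical marker strings
END_TOKEN = '[END]'
PAD_TOKEN = '[PAD]'

def add_start_end(text: str) -> str:
    """Wrap a preprocessed sentence in START and END tokens."""
    return f'{START_TOKEN} {text.strip()} {END_TOKEN}'

def pad_tokens(tokens, length):
    """Right-pad a token list to `length`; longer lists are returned as-is."""
    return tokens + [PAD_TOKEN] * (length - len(tokens))

sentence = add_start_end('hello , world !')
tokens = pad_tokens(sentence.split(), 8)
print(tokens)
```

The tokens have to be added after the punctuation step, since the character filter used there would otherwise strip the square brackets and the uppercase letters. On `tf.string` tensors the same wrapping can be written with `tf.strings.join` and a space separator.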

# 3. Making the Model

# 4. Training the Model

# 5. Evaluating accuracy