<a href="https://colab.research.google.com/github/AlvinManojAlex/NLP_Tamil_Hindi/blob/main/machine_translation_model1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Mounting the GDrive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 1. Reading datasets collected

## 1.1 Reading the Hindi-English Parallel Corpus

In [2]:
train_en_hi = []
train_hi_hi = []

with open('/content/drive/MyDrive/corpus/train/iitb_train.en.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped English sentences into the list
    train_en_hi.append(line.strip())

with open('/content/drive/MyDrive/corpus/train/iitb_train.hi.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped Hindi sentences into the list
    train_hi_hi.append(line.strip())


print(f'{len(train_en_hi)} english lines read from IITB_English_Hindi Corpus')
print(f'{len(train_hi_hi)} hindi lines read from IITB_English_Hindi Corpus')

1603080 english lines read from IITB_English_Hindi Corpus
1603080 hindi lines read from IITB_English_Hindi Corpus


## 1.2 Reading the Tamil-English Parallel corpus

In [3]:
train_en_ta = []
train_ta_ta = []

with open('/content/drive/MyDrive/corpus/train/cvit_train.en.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped English sentences into the list
    train_en_ta.append(line.strip())

with open('/content/drive/MyDrive/corpus/train/cvit_train.ta.txt', 'r', encoding='utf-8') as file:
  for line in file:

    # Appending the stripped Tamil sentences into the list
    train_ta_ta.append(line.strip())

print(f'{len(train_en_ta)} english lines read from PIB_English_Tamil Corpus')
print(f'{len(train_ta_ta)} tamil lines read from PIB_English_Tamil Corpus')

115968 english lines read from PIB_English_Tamil Corpus
115968 tamil lines read from PIB_English_Tamil Corpus


In [4]:
print(train_hi_hi[0])


अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें


In [5]:
print(train_ta_ta[0])

முறையை அமல்படுத்துவதற்கு வசதியாக, பல்வேறு சரக்கு மற்றும் சேவைகளுக்கான மேல் வரி மற்றும் கூடுதல் வரியை நீக்கும் வகையில், சுங்கம் மற்றும் கலால் சட்டத்தில் திருத்தங்களைக் கொண்டுவர மத்திய அமைச்சரவை ஒப்புதல் பிரதமர் திரு.நரேந்திர மோடி தலைமையில் மத்திய அமைச்சரவைக் கூட்டம் நடைபெற்றது. இதில், கீழ்க்காணும் பரிந்துரைகளுக்கு அப்போது ஒப்புதல் அளிக்கப்பட்டது.


# 2. Data preprocessing

Installing `tensorflow-text` and `einops` (Einstein Inspired Notation)

In [6]:
# Tensorflow package for text related operations and modules
!pip install tensorflow-text

# Installing einops for writing deep learning code better and more efficiently
!pip install einops

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow-text
  Downloading tensorflow_text-2.11.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.11.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting einops
  Downloading einops-0.6.0-py3-none-any.whl (41 kB)
Installing collected packages: einops
Successfully installed einops-0.6.0


Importing the packages

In [7]:
import numpy as np

import einops
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import tensorflow as tf
import tensorflow_text as tf_text

Every sentence should be treated as a `tf.string`, since we intend to export this model as a `tf.saved_model`.

## 2.1 Normalizing the sentences

Note: the model is initially trained without removing the bracketed words from the Hindi corpus.

### 2.1.1 Unicode Normalization

Unicode normalization ensures that canonically equivalent character sequences are represented in one consistent way, which helps the accuracy and efficiency of a translation model. For both Hindi and Tamil we use `NFC` (Normalization Form C), which performs a canonical decomposition followed by a canonical composition.

<b>Reference:</b>

https://unicode.org/reports/tr15/
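
To illustrate why normalization matters, here is a minimal sketch using Python's standard `unicodedata` module rather than `tensorflow_text`, so that it runs without TensorFlow. The helper name `to_nfc` is hypothetical. It uses the Tamil vowel sign O, which can be encoded either as the single code point U+0BCA or as the pair U+0BC6 U+0BBE.

```python
import unicodedata

# Two encodings of the same Tamil vowel sign (O)
composed = "\u0bca"          # single precomposed code point
decomposed = "\u0bc6\u0bbe"  # vowel sign E followed by vowel sign AA

def to_nfc(text: str) -> str:
    """Hypothetical helper: return the NFC form of a Python string."""
    return unicodedata.normalize("NFC", text)

# The raw strings differ, but their NFC forms are identical
raw_strings_equal = composed == decomposed
nfc_strings_equal = to_nfc(composed) == to_nfc(decomposed)

print(raw_strings_equal, nfc_strings_equal)
```

Without normalization, a tokenizer could treat the two encodings as different tokens even though they render identically.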

In [8]:
# Function that takes in a tensor, normalizes it according to NFC and returns the text

def unicode_normalize(text):
  text = tf_text.normalize_utf8(text, 'NFC')
  return text

# temp = tf.constant(train_hi_hi[1])
# temp = unicode_normalize(temp)
# print(temp)

### 2.1.2 Converting sentence to lowercase

This is done to eliminate ambiguity. Since English is used as the intermediary language, it is necessary to convert the English corpus to lowercase so that the model treats 'Car' and 'car' as the same word.

In [9]:
# Function that takes in a tensor and converts it to lowercase

def lowercase(text):
  text = tf.strings.lower(text)
  return text

# temp = tf.constant(train_en_hi[1])
# temp = lowercase(temp)
# print(temp)

### 2.1.3 Replacing some special characters

The characters `?`, `!`, `.`, `,` and the space must be kept in the sentence, while all other special characters must be removed. This is done with a regex that filters out the unwanted characters. The regex therefore has to be written so that it does not filter out Hindi and Tamil characters.

<b>References:</b>

https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)#:~:text=Devanagari%20is%20a%20Unicode%20block,from%20the%201988%20ISCII%20standard

https://en.wikipedia.org/wiki/Tamil_(Unicode_block)

<br>

Finally, white space is inserted so that each retained punctuation mark is separated from the adjacent word.

In [10]:
# Defining the Hindi and Tamil characters using UNICODE and then including that in the regex

# Unicode code points for Hindi (Devanagari) characters are contiguous, so we use a loop to build our list of Hindi characters

hindi_characters = 128
hindi_unicode_shift = 0x0900

hindi_alphabets = []

for i in range(0, hindi_characters):
  hindi_alphabets.append('\\u0'+hex(hindi_unicode_shift+i)[2:])

# UNICODE for Tamil characters are not stored continuously since they have some reserved UNICODE characters in between, so we will manually add them to our list

tamil_alphabets = ['\\u0b82', '\\u0b83', '\\u0b85', '\\u0b86', '\\u0b87', '\\u0b88', '\\u0b89', '\\u0b8a', '\\u0b8e', '\\u0b8f', '\\u0b90', '\\u0b92', '\\u0b93', 
                   '\\u0b94', '\\u0b95', '\\u0b99', '\\u0b9a', '\\u0b9c', '\\u0b9e', '\\u0b9f', '\\u0ba3', '\\u0ba4', '\\u0ba8', '\\u0ba9', '\\u0baa', '\\u0bae',
                   '\\u0baf', '\\u0bb0', '\\u0bb1', '\\u0bb2', '\\u0bb3', '\\u0bb4', '\\u0bb5', '\\u0bb6', '\\u0bb7', '\\u0bb8', '\\u0bb9', '\\u0bbe', '\\u0bbf',
                   '\\u0bc0', '\\u0bc1', '\\u0bc2', '\\u0bc6', '\\u0bc7', '\\u0bc8', '\\u0bca', '\\u0bcb', '\\u0bcc', '\\u0bcd', '\\u0bd0', '\\u0bd7', '\\u0be6',
                   '\\u0be7', '\\u0be8', '\\u0be9', '\\u0bea', '\\u0beb', '\\u0bec', '\\u0bed', '\\u0bee', '\\u0bef', '\\u0bf0', '\\u0bf1', '\\u0bf2', '\\u0bf3',
                   '\\u0bf4', '\\u0bf5', '\\u0bf6', '\\u0bf7', '\\u0bf8', '\\u0bf9', '\\u0bfa']
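
As an alternative to typing the Tamil code points by hand, the list could be derived from the Unicode character database. The sketch below is an assumption, not part of the original pipeline: it uses the standard `unicodedata` module and treats every code point in the Tamil block (U+0B80 to U+0BFF) whose general category is not `Cn` (unassigned) as a Tamil character. The names `assigned_code_points`, `tamil_code_points`, `tamil_escapes` and `tamil_char_class` are hypothetical. The exact result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF

def assigned_code_points(start: int, end: int) -> list:
    """Return the code points in [start, end] that Unicode has assigned."""
    return [
        cp for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) != "Cn"  # "Cn" means unassigned/reserved
    ]

tamil_code_points = assigned_code_points(TAMIL_BLOCK_START, TAMIL_BLOCK_END)

# Escapes in the same '\\uXXXX' form as the hand-written list
tamil_escapes = [f"\\u{cp:04x}" for cp in tamil_code_points]

# Joined form, ready to be placed inside a regex character class
tamil_char_class = "".join(tamil_escapes)

print(len(tamil_escapes), tamil_escapes[:5])
```

Generating the list this way avoids transcription slips such as a mistyped escape or a missing comma between two entries.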

In [37]:
# Function that takes in a tensor and keeps `?`, `!`, `.`, `,`, ` ` as such and replaces the other special characters with ``
# After that a white space is kept between the 'chosen' punctuations
# Account for regex with Hindi and Tamil
# Also account for more than 1 whitespace being generated

import re

def hindi_punctuate(text):
  regex_pattern = r"[^,.?! \u0900-\u097F]+"
  processed_string = re.sub(regex_pattern, "", text)
  processed_string = re.sub('([,.?!])', r' \1', processed_string)
  return processed_string

def tamil_punctuate(text):
  regex_pattern = r"[^,.?! \u0b82\u0b83\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94\u0b95\u0b99\u0b9a\u0b9c\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0ba9\u0baa\u0bae\u0baf\u0bb0\u0bb1\u0bb2\u0bb3\u0bb4\u0bb5\u0bb6\u0bb7\u0bb8\u0bb9\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc\u0bcd\u0bd0\u0bd7\u0be6\u0be7\u0be8\u0be9\u0bea\u0beb\u0bec\u0bed\u0bee\u0bef\u0bf0\u0bf1\u0bf2\u0bf3\u0bf4\u0bf5\u0bf6\u0bf7\u0bf8\u0bf9\u0bfa]+"
  processed_string = re.sub(regex_pattern, "", text)
  processed_string = re.sub('([,.?!])', r' \1', processed_string)
  return processed_string

def english_punctuate(text):
  text = tf.strings.regex_replace(text, '[^ a-z.?!,]', '')
  text = tf.strings.regex_replace(text, '[.,?!]', r' \0 ')
  text = tf.strings.strip(text)
  return text

# text = train_hi_hi[1]+'?'
# print(text)
# text = hindi_punctuate(text)
# print(text)
# temp = tf.constant(text)
# print(temp)
# temp = unicode_normalize(temp)

# text = train_ta_ta[1]
# print(text)
# text = tamil_punctuate(text)
# print(text)
# temp = tf.constant(text)
# print(temp)
# temp = unicode_normalize(temp)
# print(temp)


மேற்கண்ட பரிந்துரைகளால், கீழ்க்காணும் பலன்கள் கிடைக்கும்: சுங்கங்கள் சட்டம் 1962-ல் 108ஏ, 108பி பிரிவுகள் சேர்க்கப்படுகின்றன.
மேற்கண்ட பரிந்துரைகளால் , கீழ்க்காணும் பலன்கள் கிடைக்கும் சுங்கங்கள் சட்டம் ல் ஏ , பி பிரிவுகள் சேர்க்கப்படுகின்றன .
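
The cell comments above mention accounting for more than one white space being generated. One way this might be handled is sketched below for the Hindi case, using the same Devanagari range as `hindi_punctuate`. The function name `hindi_punctuate_collapsed` and the compiled pattern names are hypothetical, and the extra collapsing step is an assumption about the intended behaviour rather than the notebook's final method.

```python
import re

# Everything except the retained punctuation, the space and the Devanagari block
UNWANTED = re.compile(r"[^,.?! \u0900-\u097F]+")
# The punctuation marks that are kept as separate tokens
PUNCTUATION = re.compile(r"([,.?!])")
# Runs of two or more spaces left behind by the steps above
EXTRA_SPACES = re.compile(r" {2,}")

def hindi_punctuate_collapsed(text: str) -> str:
    text = UNWANTED.sub("", text)         # drop unwanted characters
    text = PUNCTUATION.sub(r" \1", text)  # put a space before each kept mark
    text = EXTRA_SPACES.sub(" ", text)    # collapse repeated spaces into one
    return text.strip()                   # trim leading and trailing spaces

example = hindi_punctuate_collapsed("नमस्ते (hello), दुनिया!")
print(example)
```

Here the bracketed Latin text is removed, which would otherwise leave a double space before the comma. The collapsing step reduces it to a single space.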


### 2.1.4 Adding START and END tokens

START and END tokens tell the model where a sequence begins and ends. The model we will build operates on sequences of fixed length, so these tokens let it mark where each sentence starts and stops, which improves model performance and translation quality.
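
A minimal sketch of this step in plain Python follows. The token strings `[START]` and `[END]` and the helper name `add_start_end` are assumptions chosen for illustration, since the notebook does not yet define them. In the TensorFlow pipeline the same effect could likely be achieved with `tf.strings.join` on `tf.string` tensors. The tokens would have to be added after the punctuation filtering, because the English filter above would otherwise strip the square brackets.

```python
START_TOKEN = "[START]"  # hypothetical marker for the beginning of a sentence
END_TOKEN = "[END]"      # hypothetical marker for the end of a sentence

def add_start_end(sentence: str) -> str:
    """Wrap an already-normalized sentence in START and END tokens."""
    return f"{START_TOKEN} {sentence.strip()} {END_TOKEN}"

wrapped = add_start_end("how are you ?")
print(wrapped)
```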

# 3. Making the Model

# 4. Training the Model

# 5. Evaluating accuracy