
# Assignment 3

The goal of this assignment is fourfold:
- Learn how to model sequence-to-sequence learning problems using Recurrent Neural Networks
- Compare different cells such as vanilla RNN, LSTM and GRU
- Understand how attention networks overcome the limitations of vanilla seq2seq models
- Visualise the interactions between different components in an RNN-based model

In this assignment we experiment with the Dakshina dataset released by Google. This dataset contains pairs of the following form: 


| x | y |
|---|---|
| ajanabee | अजनबी |



i.e., a romanized word in the Latin script (the way we type while chatting with friends on WhatsApp, for example) and the corresponding word in the native script. Given many such pairs, the goal is to train a model that takes a romanized string (ghar) as input and produces the corresponding word in Devanagari (घर).



This is the problem of mapping a sequence of characters in one language to a sequence of characters in another language. Notice that this is a scaled-down version of the translation problem, where the goal is to map a sequence of words in one language to a sequence of words in another (as opposed to sequences of characters here).
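
Because the model works at the character level, it helps to see what a "character" is once the strings are loaded in Python. The short sketch below uses only the example pair from this notebook. It shows that iterating over a Devanagari string yields Unicode code points, so a vowel sign (matra) counts as a separate symbol, and the source and target sequences generally differ in length.

```python
# Character-level view of one (romanized, native) pair from the example above.
latin_word = "ajanabee"
native_word = "अजनबी"

latin_chars = list(latin_word)    # one entry per Latin letter
native_chars = list(native_word)  # one entry per Unicode code point

print(len(latin_chars), latin_chars)
print(len(native_chars), native_chars)
```

The vocabulary built later in this notebook iterates over strings in the same way, so its entries are code points rather than visually complete syllables.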






### Import required packages

In [32]:
import matplotlib.pyplot as plt 
import pandas as pd
import numpy as np
import random
np.random.seed(137) # Fix the NumPy seed so that random numbers are reproducible across runs
import warnings
warnings.filterwarnings("ignore")
!pip install --upgrade wandb
import wandb
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
from keras.models import Sequential
from keras_preprocessing.image import ImageDataGenerator
import os
from keras.datasets import fashion_mnist
from keras.layers.convolutional import Conv2D
from keras.layers import Dense, Flatten, InputLayer
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Activation
from wandb.keras import WandbCallback
from google.colab.patches import cv2_imshow
!pip install uniseg
import matplotlib.ticker as ticker
import unicodedata
import re
import io
import time
from matplotlib.font_manager import FontProperties
tf.random.set_seed(137)



In [None]:
wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## Downloading the data and reading train, test and validation files

In [33]:
!curl https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar --output daksh.tar
# Extract the downloaded tar file
!tar -xvf  'daksh.tar'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1915M  100 1915M    0     0   192M      0  0:00:09  0:00:09 --:--:--  196M
dakshina_dataset_v1.0/bn/
dakshina_dataset_v1.0/bn/lexicons/
dakshina_dataset_v1.0/bn/lexicons/bn.translit.sampled.test.tsv
dakshina_dataset_v1.0/bn/lexicons/bn.translit.sampled.train.tsv
dakshina_dataset_v1.0/bn/lexicons/bn.translit.sampled.dev.tsv
dakshina_dataset_v1.0/bn/native_script_wikipedia/
dakshina_dataset_v1.0/bn/native_script_wikipedia/bn.wiki-filt.valid.text.shuf.txt.gz
dakshina_dataset_v1.0/bn/native_script_wikipedia/bn.wiki-full.info.sorted.tsv.gz
dakshina_dataset_v1.0/bn/native_script_wikipedia/bn.wiki-filt.train.info.sorted.tsv.gz
dakshina_dataset_v1.0/bn/native_script_wikipedia/bn.wiki-filt.train.text.sorted.tsv.gz
dakshina_dataset_v1.0/bn/native_script_wikipedia/bn.wiki-filt.train.text.shuf.txt.gz
dakshina_dataset_v1.0/bn/native_script

In [59]:
# Read the Hindi train, validation (dev) and test lexicons.
# Each file is tab-separated with no header row.
lexicon_dir = "/content/dakshina_dataset_v1.0/hi/lexicons/"
column_names = ['hindi', 'english', 'attestations']
train_file = pd.read_csv(lexicon_dir + "hi.translit.sampled.train.tsv", sep='\t', header=None, names=column_names)
val_file = pd.read_csv(lexicon_dir + "hi.translit.sampled.dev.tsv", sep='\t', header=None, names=column_names)
test_file = pd.read_csv(lexicon_dir + "hi.translit.sampled.test.tsv", sep='\t', header=None, names=column_names)

## Data processing

In [60]:
train_file.english = train_file.english.astype(str)
train_file.hindi = train_file.hindi.astype(str)
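
One likely reason the cast to `str` is needed: by default `pd.read_csv` interprets tokens such as `nan` and `null` as missing values, so a romanized word spelled that way arrives as a float `NaN`. The sketch below demonstrates this on made-up in-memory rows (they are not taken from the dataset) and shows that passing `keep_default_na=False` keeps such tokens as literal strings. Whether the Hindi lexicon actually contains such words is an assumption here.

```python
import io
import pandas as pd

# Two made-up rows in the lexicon layout: native word, romanization, count.
tsv_text = "नान\tnan\t3\nघर\tghar\t2\n"
cols = ["hindi", "english", "attestations"]

# Default behaviour: the token "nan" is parsed as a missing value.
default_df = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=None, names=cols)

# With keep_default_na=False the token is kept as the string "nan".
literal_df = pd.read_csv(
    io.StringIO(tsv_text), sep="\t", header=None, names=cols,
    keep_default_na=False,
)

print(default_df["english"].isna().tolist())
print(literal_df["english"].tolist())
```

Casting with `astype(str)` turns every missing value into the string `'nan'`, which is only correct when the original token was itself `nan`; reading with `keep_default_na=False` avoids that ambiguity.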

In [61]:
# Lowercase all characters
train_file.english=train_file.english.apply(lambda x: x.lower())
train_file.hindi=train_file.hindi.apply(lambda x: x.lower())

In [62]:
# Remove quotes
import string
train_file.english=train_file.english.apply(lambda x: re.sub("'", '', x))
train_file.hindi=train_file.hindi.apply(lambda x: re.sub("'", '', x))
exclude = set(string.punctuation) # Set of ASCII punctuation characters, stripped in the next cell

In [63]:
# Remove all the special characters
train_file.english=train_file.english.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
train_file.hindi=train_file.hindi.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
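
The cleaning steps above are applied to `train_file` only; the validation and test frames would need identical treatment before they are vectorized. A minimal sketch that bundles the same steps into one reusable helper follows. The names `clean_word` and `clean_frame` are hypothetical, and the tiny frame is made-up stand-in data rather than rows from the dataset.

```python
import string
import pandas as pd

EXCLUDE = set(string.punctuation)  # ASCII punctuation only, includes quotes

def clean_word(word):
    """Cast to str, lowercase, and drop ASCII punctuation."""
    return "".join(ch for ch in str(word).lower() if ch not in EXCLUDE)

def clean_frame(frame):
    """Return a copy of `frame` with cleaned 'english' and 'hindi' columns."""
    out = frame.copy()
    out["english"] = out["english"].apply(clean_word)
    out["hindi"] = out["hindi"].apply(clean_word)
    return out

# Made-up stand-in rows for demonstration.
demo = pd.DataFrame({"hindi": ["घर", "अजनबी"], "english": ["Ghar", "ajan'abee"]})
cleaned = clean_frame(demo)
print(cleaned)
```

Under this sketch, `clean_frame(val_file)` and `clean_frame(test_file)` would give the other splits the same treatment. Note that `string.punctuation` covers ASCII symbols only, so Devanagari characters pass through unchanged.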

In [64]:
# Add start and end tokens to target sequences
train_file.hindi = train_file.hindi.apply(lambda x : '\t'+ x + '\n')

In [65]:
english_characters =set()
for word in train_file.english:
    for char in word:
        english_characters.add(char)

hindi_characters =set()
for word in train_file.hindi:
    for char in word:
        hindi_characters.add(char)

In [66]:
# Add the space character, which is used as the padding symbol
english_characters.add(' ')
hindi_characters.add(' ')

In [67]:
# Sorted vocabularies and their sizes
english_characters = sorted(english_characters)
hindi_characters = sorted(hindi_characters)
num_encoder_characters = len(english_characters)
num_decoder_characters = len(hindi_characters)

# Longest source and target words (target lengths include the start and end tokens)
max_encoder_seq_length = max(len(word) for word in train_file.english)
max_decoder_seq_length = max(len(word) for word in train_file.hindi)


In [68]:
english_char_index = dict([(char, i) for i, char in enumerate(english_characters)])
hindi_char_index = dict([(char, i) for i, char in enumerate(hindi_characters)])

reverse_english_char_index = dict((i, char) for char, i in english_char_index.items())
reverse_hindi_char_index = dict((i, char) for char, i in hindi_char_index.items())
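
To make the role of the forward and reverse dictionaries concrete, here is a small round-trip sketch on a toy target vocabulary. The helper names `encode` and `decode` are hypothetical; the dictionaries are built the same way as above, from two example words wrapped in the tab and newline tokens.

```python
# Toy target words, already wrapped in start ('\t') and end ('\n') tokens.
words = ["\tघर\n", "\tअजनबी\n"]

# Vocabulary: every character seen, plus ' ' for padding, in sorted order.
chars = sorted(set("".join(words)) | {" "})
char_index = {ch: i for i, ch in enumerate(chars)}
reverse_char_index = {i: ch for ch, i in char_index.items()}

def encode(word):
    """Map a string to a list of integer indices."""
    return [char_index[ch] for ch in word]

def decode(indices):
    """Map a list of integer indices back to a string."""
    return "".join(reverse_char_index[i] for i in indices)

ids = encode(words[0])
print(ids)
print(repr(decode(ids)))
```

Because tab and newline sort before the space character, the padding symbol gets index 2 in a vocabulary that contains the start and end tokens, whereas in a vocabulary without them it would get index 0. The padding index is therefore specific to each vocabulary.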

In [69]:
encoder_train_input_data = np.zeros((len(train_file.english), max_encoder_seq_length), dtype="float32")
decoder_train_input_data = np.zeros((len(train_file.english), max_decoder_seq_length), dtype="float32")
decoder_train_target_data = np.zeros((len(train_file.english), max_decoder_seq_length, num_decoder_characters ), dtype="float32")

In [70]:
for i, (input_word, target_word) in enumerate(zip(train_file.english, train_file.hindi)):
    # Encoder input: one integer index per source character
    for t, char in enumerate(input_word):
        encoder_train_input_data[i, t] = english_char_index[char]
    # Pad with the source-vocabulary index of ' ' (not the target-vocabulary one)
    encoder_train_input_data[i, t + 1 :] = english_char_index[' ']

    for t, char in enumerate(target_word):
        decoder_train_input_data[i, t] = hindi_char_index[char]
        if t > 0:
            # Decoder target (one-hot encoded) is the decoder input
            # shifted one timestep ahead, so it omits the start token
            decoder_train_target_data[i, t - 1, hindi_char_index[char]] = 1.0
    # Pad the remaining positions of both decoder arrays with ' '
    decoder_train_input_data[i, t + 1 :] = hindi_char_index[' ']
    decoder_train_target_data[i, t :, hindi_char_index[' ']] = 1.0
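
The one-step offset between decoder input and decoder target (teacher forcing) is easy to get wrong, so here is a toy sketch for a single word with a hypothetical five-symbol vocabulary. It follows the same layout as the loop above: integer indices for the decoder input, one-hot rows for the target shifted one step earlier, and the space character as padding.

```python
import numpy as np

vocab = ["\t", "\n", " ", "घ", "र"]  # toy target vocabulary (assumed)
index = {ch: i for i, ch in enumerate(vocab)}
max_len = 6

target_word = "\tघर\n"  # start token + word + end token

# Decoder input starts out fully padded; the target starts as all zeros.
decoder_input = np.full(max_len, index[" "], dtype="float32")
decoder_target = np.zeros((max_len, len(vocab)), dtype="float32")

for t, ch in enumerate(target_word):
    decoder_input[t] = index[ch]
    if t > 0:
        # The target at step t-1 is the input character at step t.
        decoder_target[t - 1, index[ch]] = 1.0

# Every step from the last real position onward is padding.
decoder_target[len(target_word) - 1:, index[" "]] = 1.0

print(decoder_input)
print(decoder_target.argmax(axis=1))
```

Reading the two printed rows side by side shows the shift: the input begins with the start token, while the target begins with the first real character and reaches the end token one step earlier.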

In [71]:
decoder_train_target_data

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 1., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0.

## Question 1 & 2

Build an RNN-based seq2seq model which contains the following layers: (i) an input layer for character embeddings, (ii) one encoder RNN which sequentially encodes the input character sequence (Latin), and (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).
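
As a reminder of what "sequentially encodes" means in (ii), the following NumPy sketch runs a single-layer vanilla RNN over an embedded character sequence and returns the last hidden state, which is what the decoder in (iii) would be initialised with. It is an illustration with randomly initialised, untrained weights and hypothetical sizes, not the Keras model the assignment asks for.

```python
import numpy as np

rng = np.random.default_rng(137)
vocab_size, embed_dim, hidden_dim = 27, 8, 16  # hypothetical sizes

# Untrained parameters, for illustration only.
embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def encode_sequence(char_ids):
    """Run a vanilla RNN over `char_ids` and return the final hidden state."""
    h = np.zeros(hidden_dim)
    for idx in char_ids:
        x = embedding[idx]                      # embedding lookup
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)  # recurrent update
    return h

final_state = encode_sequence([7, 8, 1, 18])  # toy index sequence
print(final_state.shape)
```

LSTM and GRU cells replace the single `tanh` update with gated updates but keep the same pattern of consuming one character per step and carrying a state forward.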



The code should be flexible enough that the dimension of the input character embeddings, the size of the hidden states of the encoder and decoder, the cell type (RNN, LSTM, GRU), and the number of layers in the encoder and decoder can all be changed.



Now train your model using any one language from the Dakshina dataset. Use the standard train, dev and test sets from the folder dakshina_dataset_v1.0/hi/lexicons/ (replace hi with the code of the language of your choice).



Using the sweep feature in wandb, find the best hyperparameter configuration.
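
A sweep is described by a configuration dictionary that names the search method, the metric to optimise, and the values to try for each hyperparameter. The sketch below is one possible configuration for the knobs listed in this question. The parameter names, the value lists, the metric name `val_accuracy`, and the `train` function mentioned in the comments are all assumptions, not part of the assignment.

```python
# One possible sweep configuration (all names and values are assumptions).
sweep_config = {
    "method": "bayes",  # other supported methods are "grid" and "random"
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "cell_type": {"values": ["RNN", "LSTM", "GRU"]},
        "embedding_dim": {"values": [16, 32, 64]},
        "hidden_dim": {"values": [64, 128, 256]},
        "num_layers": {"values": [1, 2, 3]},
    },
}

# With wandb installed and a user-defined train() that reads wandb.config:
#   sweep_id = wandb.sweep(sweep_config, project="Assignment_3")
#   wandb.agent(sweep_id, function=train, count=20)

print(sorted(sweep_config["parameters"]))
```

The metric name has to match whatever the training function logs to wandb; otherwise the Bayesian search has nothing to optimise.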





In [None]:
wandb.init(project="Assignment_3", name="Question_1&2")