# Day 1

1. re (Regular Expressions): Pattern matching, used here to clean text during preprocessing.
2. sklearn.utils.shuffle: Shuffles the rows of the dataset so that the original ordering does not bias training.
3. tensorflow.keras.layers:
    1. Input: Defines the input tensor of a model.
    2. LSTM: Long Short-Term Memory layer for sequential data (e.g., text).
    3. Embedding: Maps integer token indices to dense vector representations.
    4. Dense: Fully connected layer.
    5. Bidirectional: Wrapper that runs a recurrent layer such as LSTM over the input in both forward and backward directions.
4. tensorflow.keras.models.Model: Groups layers into a model that can be compiled and trained.
5. string & digits (from string module): Provide the punctuation and digit characters that are stripped during preprocessing.

In [48]:
import pandas as pd
import numpy as np
import string
from string import digits
import matplotlib.pyplot as plt
%matplotlib inline
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Input,LSTM,Embedding,Dense,Bidirectional
from tensorflow.keras.models import Model

In [49]:
data=pd.read_csv(r'C:\Users\User\Downloads\Dataset_English_Hindi.csv\Dataset_English_Hindi.csv')

In [50]:
data.head()

Unnamed: 0,English,Hindi
0,Help!,बचाओ!
1,Jump.,उछलो.
2,Jump.,कूदो.
3,Jump.,छलांग.
4,Hello!,नमस्ते।


# Let us analyze the data!

In [51]:
data['English'].isnull().value_counts()

English
False    130474
True          2
Name: count, dtype: int64

In [52]:
data['Hindi'].isnull().value_counts()

Hindi
False    130164
True        312
Name: count, dtype: int64

In [53]:
data.dropna(inplace=True)

# Preprocessing 

In [54]:
# Convert everything into LOWER case
data.English=data.English.apply(lambda x: x.lower())
data.Hindi=data.Hindi.apply(lambda x: x.lower())

# Remove Quotes
data.English=data.English.apply(lambda x: re.sub("'",'',x))
data.Hindi=data.Hindi.apply(lambda x: re.sub("'",'',x))

# Set of all special characters
exclude=set(string.punctuation)

# Remove all Special characters stored in the above set 'exclude'
data.English=data.English.apply(lambda x:''.join(ch for ch in x if ch not in exclude))
data.Hindi=data.Hindi.apply(lambda x:''.join(ch for ch in x if ch not in exclude))

#Removing Hindi punctuations explicitly
hindi_punctuation = '।॥“”‘’'
data.Hindi = data.Hindi.apply(lambda x: re.sub(f"[{hindi_punctuation}]", "", x))


# Remove all numbers from the text
# str.maketrans('', '', digits) builds a translation table that deletes the ASCII digits 0-9 when passed to .translate()
remove_digits=str.maketrans('','',digits)
data.English=data.English.apply(lambda x: x.translate(remove_digits))
# Remove Devanagari digits (०-९) from the Hindi text; ASCII digits in the Hindi column are left as they are
data.Hindi = data.Hindi.apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

# Remove all extra spaces
data.English=data.English.apply(lambda x: x.strip())
data.Hindi=data.Hindi.apply(lambda x: x.strip())
data.English=data.English.apply(lambda x : re.sub(' +',' ',x))
data.Hindi=data.Hindi.apply(lambda x : re.sub(' +',' ',x))

data.Hindi=data.Hindi.apply(lambda x: 'START_+ '+x+' _END')


Text Preprocessing Summary
1. Lowercasing – Converts all text to lowercase.
2. Remove Quotes & Special Characters – Eliminates single quotes, ASCII punctuation, and the Hindi punctuation marks । ॥ “ ” ‘ ’.
3. Remove Numbers – Deletes ASCII digits (0-9) from the English text and Devanagari digits (०-९) from the Hindi text.
4. Remove Extra Spaces – Strips leading/trailing spaces and replaces multiple spaces with a single one.
5. Add Start & End Tokens (Hindi) – Wraps Hindi text with "START_+ " and " _END" for sequence modeling.
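
The steps summarized above can be sketched as two plain functions that operate on a single string, which makes them easy to try on individual sentences. This is a minimal sketch, not the notebook's own code: the names `clean_english` and `clean_hindi` are hypothetical, and the functions are meant to mirror the pandas `apply` calls in the preprocessing cell.

```python
import re
import string

# Hypothetical helper names; the steps follow the preprocessing cell above.
_PUNCT = set(string.punctuation)
_ASCII_DIGITS = str.maketrans('', '', string.digits)

def clean_english(text):
    """Lowercase, drop quotes, punctuation and ASCII digits, collapse spaces."""
    text = text.lower().replace("'", "")
    text = ''.join(ch for ch in text if ch not in _PUNCT)
    text = text.translate(_ASCII_DIGITS)
    return re.sub(' +', ' ', text.strip())

def clean_hindi(text):
    """Same cleaning for Hindi, plus Hindi punctuation, Devanagari digits and the start/end tokens."""
    text = text.lower().replace("'", "")
    text = ''.join(ch for ch in text if ch not in _PUNCT)
    text = re.sub('[।॥“”‘’]', '', text)   # Hindi punctuation
    text = re.sub('[०-९]', '', text)       # Devanagari digits
    text = re.sub(' +', ' ', text.strip())
    return 'START_+ ' + text + ' _END'

print(clean_english("Hello,  World's 2 cats!"))
print(clean_hindi("नमस्ते।"))
```

Keeping the cleaning in functions like these would also make it possible to apply the same steps to new sentences at inference time.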

In [55]:
data.sample(5)

Unnamed: 0,English,Hindi
51215,history,START_+ इतिहास _END
13617,january nd,START_+ जनवरी में _END
78505,given below are the names of the speakers or p...,START_+ जो प्रतिष्ठित व्यक़्ति अध्यक्ष पद पर य...
42106,english govt has not approved the order,START_+ अंग्रेज़ सरकार ने यह मांग पूरी नहीं की...
99618,the hair of the body are jet black and lustrous,START_+ शरीर के बाल गहरे काले और चमकीले होते ह...


Build the full English and Hindi vocabularies present in the dataset.

In [56]:
# English Vocabulary
english_vocab = set()
for sentence in data.English:
    english_vocab.update(sentence.split())  # a set keeps each word only once

# Hindi Vocabulary
hindi_vocab = set()
for sentence in data.Hindi:
    hindi_vocab.update(sentence.split())  # a set keeps each word only once


In [57]:
# Max Length of Source sequence
length_list=[]
for l in data.English:
    length_list.append(len(l.split(' ')))
max_length_src=np.max(length_list)

# Max Length of Target sequence
length_list=[]
for l in data.Hindi:
    length_list.append(len(l.split(' ')))
max_length_tar=np.max(length_list)


In [58]:
max_length_tar

419

In [59]:
input_words=sorted(list(english_vocab))
target_words=sorted(list(hindi_vocab))

num_encoder_tokens=len(english_vocab)
num_decoder_tokens=len(hindi_vocab)
num_encoder_tokens,num_decoder_tokens

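# Word indices start at 1 and index 0 is reserved for padding,
# so both embedding tables need one extra slot
num_encoder_tokens+=1
num_decoder_tokens+=1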


In [60]:
num_decoder_tokens

84207
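
A vocabulary of this size matters for memory. The Day 2 batch generator allocates a one-hot decoder target of shape (batch_size, max_length_tar, num_decoder_tokens) in float32. The arithmetic below is a rough estimate of that array's size, using the values printed in this notebook and the generator's default batch size of 128.

```python
# Values taken from the notebook outputs and the generator's default batch size.
batch_size = 128
max_length_tar = 419
num_decoder_tokens = 84207
bytes_per_float32 = 4

# Size of one dense one-hot target array of shape
# (batch_size, max_length_tar, num_decoder_tokens)
target_bytes = batch_size * max_length_tar * num_decoder_tokens * bytes_per_float32
target_gib = target_bytes / 2**30

print(f"{target_bytes:,} bytes (~{target_gib:.1f} GiB) per batch")
```

An estimate of this kind suggests that shorter sequences, a smaller batch size, or a smaller vocabulary may be needed before training is practical on typical hardware.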

In [61]:
input_token_index=dict([word,i+1] for i,word in enumerate(input_words))
target_token_index=dict([word,i+1] for i,word in enumerate(target_words))

reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())
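
To illustrate how these lookup tables are used, here is a small sketch with a made-up four-word vocabulary. All `toy_*` names are hypothetical; the construction follows the cell above, with indices starting at 1 so that 0 stays free for padding.

```python
# Toy vocabulary, for illustration only
toy_vocab = {'i', 'am', 'learning', 'translation'}
toy_words = sorted(toy_vocab)

# word -> index (starting at 1) and index -> word
toy_token_index = {word: i + 1 for i, word in enumerate(toy_words)}
toy_reverse_index = {i: word for word, i in toy_token_index.items()}

# Encode a sentence to integer indices, then decode it back
encoded = [toy_token_index[w] for w in 'i am learning translation'.split()]
decoded = ' '.join(toy_reverse_index[i] for i in encoded)

print(encoded)
print(decoded)
```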

In [62]:
for i in ['hello','i','am','learning','translation']:
    print(input_token_index[i])
    

27053
28551
2279
35181
64814


In [63]:
data=shuffle(data)
data.head()

Unnamed: 0,English,Hindi
86628,the system of opening banks in villages by the...,START_+ ग्रमीण बैंकों के खुलने से इस उपराध से ...
28132,harvest baby teeth,START_+ दूध के दाँतों की फसल काट लो _END
115272,the northern face of the rock is roughly carve...,START_+ चट्टान के उत्तरी फलक पर एक बड़े बैठे ह...
109362,australia has received its first championship ...,START_+ ऑस्ट्रेलिया ने अपनी राष्ट्रीय प्रथम श्...
38569,lrb pradeep jain v union of india air sc scc rrb,START_+ प्रदीप जैन बनाम भारत संघ ए आई आर 1984 ...


# Preprocessing complete

In [64]:
import pickle

In [65]:
# Save as pickle
with open("preprocessed_data.pkl", "wb") as f:
    pickle.dump(data, f)

# Day 2

We load the preprocessed data that was saved as a pickle file on Day 1.

In [66]:
import pickle


# Load it back
with open("preprocessed_data.pkl", "rb") as f:
    data = pickle.load(f)

data.head()


Unnamed: 0,English,Hindi
86628,the system of opening banks in villages by the...,START_+ ग्रमीण बैंकों के खुलने से इस उपराध से ...
28132,harvest baby teeth,START_+ दूध के दाँतों की फसल काट लो _END
115272,the northern face of the rock is roughly carve...,START_+ चट्टान के उत्तरी फलक पर एक बड़े बैठे ह...
109362,australia has received its first championship ...,START_+ ऑस्ट्रेलिया ने अपनी राष्ट्रीय प्रथम श्...
38569,lrb pradeep jain v union of india air sc scc rrb,START_+ प्रदीप जैन बनाम भारत संघ ए आई आर 1984 ...


In [67]:
data['inp_len']=data.English.apply(lambda x:len(x.split()))
data['tar_len']=data.Hindi.apply(lambda x:len(x.split()))

In [68]:
data.sample(5)

Unnamed: 0,English,Hindi,inp_len,tar_len
76367,need to be accompanied by a mystical feeling,START_+ किसी आध्यात्मिक भावना से जोडा जाये _END,8,8
95640,thereafter the approximate age can be determin...,START_+ इसके बाद तो आयु का अनुमान दांतों के घि...,17,26
83403,the effect of the morphine injection lasts jus...,START_+ मॉर्फीन इंजेक्शन का असर चार घंटे रहता ...,10,10
42348,i nearly died was in a coma,START_+ मै लगभग मर गयी थी कोमा में थी _END,7,10
121535,halva type of sweet,START_+ लप्सी _END,4,3


In [69]:
X,y=data.English,data.Hindi
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
X_train.shape,X_test.shape

((91113,), (39049,))
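
These shapes can be checked by hand from the null counts reported on Day 1. The sketch below assumes that no row is missing both columns, which is consistent with the printed shapes, and relies on `train_test_split` rounding a fractional test size up to a whole number of rows.

```python
import math

# Row counts from the isnull() outputs on Day 1
total_rows = 130474 + 2          # non-null + null English rows
rows_after_dropna = total_rows - 2 - 312   # assumes no row has both columns missing

# A fractional test_size is rounded up; the rest goes to training
n_test = math.ceil(0.3 * rows_after_dropna)
n_train = rows_after_dropna - n_test

print(rows_after_dropna, n_train, n_test)
```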

In [70]:
X_train.to_pickle('X_train.pkl')
X_test.to_pickle('X_test.pkl')

In [71]:
def generate_batch(X=X_train,y=y_train,batch_size=128):
    '''Generate batches of encoder inputs, decoder inputs and one-hot decoder targets'''
    # Step through the data batch_size rows at a time (default 128)
    for j in range(0,len(X),batch_size):
        encoder_input_data=np.zeros((batch_size,max_length_src),dtype='float32')
        decoder_input_data=np.zeros((batch_size,max_length_tar),dtype='float32')
        decoder_target_data=np.zeros((batch_size,max_length_tar,num_decoder_tokens),dtype='float32')

        for i,(input_text,target_text) in enumerate(zip(X[j:j+batch_size],y[j:j+batch_size])):
            target_words=target_text.split()
            for t,word in enumerate(input_text.split()):
                encoder_input_data[i,t]=input_token_index[word]
            for t,word in enumerate(target_words):
                if t<len(target_words)-1:
                    # Decoder input for teacher forcing: every token except the last (_END)
                    decoder_input_data[i,t]=target_token_index[word]
                if t>0:
                    # Decoder target, one-hot encoded: it excludes the START token,
                    # so it is offset by one timestep from the decoder input
                    decoder_target_data[i,t-1,target_token_index[word]]=1
        yield([encoder_input_data, decoder_input_data],decoder_target_data)
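
The one-timestep offset between decoder input and decoder target is easier to see on a single sentence. The sketch below uses a made-up token index and a hypothetical helper, `teacher_forcing_pairs`, and returns plain index lists rather than padded or one-hot arrays.

```python
def teacher_forcing_pairs(target_text, token_index):
    """Return (decoder_input, decoder_target) index lists for one sentence."""
    tokens = target_text.split()
    # Decoder input: everything except the final _END token
    decoder_input = [token_index[w] for w in tokens[:-1]]
    # Decoder target: everything except the leading START_+ token
    decoder_target = [token_index[w] for w in tokens[1:]]
    return decoder_input, decoder_target

# Toy index, for illustration only
toy_index = {'START_+': 1, '_END': 2, 'दुनिया': 3, 'नमस्ते': 4}
dec_in, dec_out = teacher_forcing_pairs('START_+ नमस्ते दुनिया _END', toy_index)

print(dec_in)
print(dec_out)
```

At every position the target is the token that follows the corresponding input token, which is what the decoder is trained to predict.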



In [72]:
latent_dim=20

In [73]:
#Encoder
encoder_inputs=Input(shape=(None,))
enc_emb=Embedding(input_dim=num_encoder_tokens,output_dim=latent_dim,mask_zero=True)(encoder_inputs)
encoder_lstm=LSTM(units=latent_dim, return_state=True)
encoder_outputs,state_h,state_c=encoder_lstm(enc_emb)
#We only need the states of the encoder
encoder_states=[state_h,state_c]

In [78]:
# Set up the decoder using 'encoder_states' as initial state
decoder_inputs=Input(shape=(None,))
decoder_emb_layer=Embedding(input_dim=num_decoder_tokens,output_dim=latent_dim,mask_zero=True)
# With mask_zero=True, the padding index 0 is masked, so downstream layers skip the padded timesteps
dec_emb=decoder_emb_layer(decoder_inputs)
# The decoder returns the full output sequence as well as its internal states.
# The returned states are not used in the training model, but they are needed for inference.
decoder_lstm=LSTM(units=latent_dim,return_sequences=True,return_state=True)
decoder_outputs,_,_=decoder_lstm(dec_emb,initial_state=encoder_states)
decoder_dense=Dense(units=num_decoder_tokens,activation='softmax')
decoder_outputs=decoder_dense(decoder_outputs)

# Define the model that maps 'encoder_input_data' and
# 'decoder_input_data' to 'decoder_target_data'

model=Model([encoder_inputs,decoder_inputs],decoder_outputs)
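
The softmax layer outputs, for every timestep, a probability distribution over the target vocabulary. As a rough illustration of how such an output could be turned back into words, the NumPy sketch below applies a greedy argmax to a made-up probability matrix; the toy reverse index and the probabilities are assumptions for illustration, not model outputs.

```python
import numpy as np

# Toy index -> word table; index 0 is the padding slot
toy_reverse_index = {1: 'START_+', 2: '_END', 3: 'नमस्ते'}

# Made-up softmax output for one sentence: shape (timesteps, vocabulary size)
probs = np.array([
    [0.05, 0.05, 0.10, 0.80],
    [0.05, 0.05, 0.85, 0.05],
])

# Greedy decoding: pick the most probable token at each timestep
token_ids = probs.argmax(axis=-1)
words = [toy_reverse_index[int(i)] for i in token_ids]

print(words)
```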

In [79]:
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])