<a href="https://colab.research.google.com/github/Rehan6541/AI/blob/main/Language_translation_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,LSTM,Dense,Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

Step 2: Dataset Definition

This is a small toy dataset in which each tuple pairs a simple English phrase with its French translation. It is intended only for demonstration.

In [None]:
data = [
    ("hello", "bonjour"),
    ("how are you", "comment ça va"),
    ("thank you", "merci"),
    ("good morning", "bonjour"),
    ("good night", "bonne nuit"),
    ("see you later", "à plus tard"),
    ("I love you", "je t'aime"),
]

Step 3: Text Preparation

zip(*data): Unpacks the list of (English, French) pairs into two tuples: input_texts (English) and target_texts (French).

In [None]:
input_texts,target_texts = zip(*data)
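
As a quick illustration of the unpacking step, the sketch below applies zip(*...) to a two-pair sample. The name sample_pairs is made up for this example; the resulting tuples have the same form as input_texts and target_texts.

```python
# Hypothetical two-pair sample in the same format as `data`.
sample_pairs = [("hello", "bonjour"), ("thank you", "merci")]

# zip(*pairs) transposes the list of pairs: all first items, then all second items.
english, french = zip(*sample_pairs)

print(english)  # ('hello', 'thank you')
print(french)   # ('bonjour', 'merci')
```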

Step 4: Tokenization

Tokenizer(): Creates a tokenizer that will convert text into sequences of integers.

fit_on_texts(): Builds a vocabulary from the texts it is given and assigns a unique integer (starting at 1) to each word. A separate tokenizer is fitted on input_texts and on target_texts.

In [None]:
input_tokenizer = Tokenizer()
target_tokenizer=Tokenizer()

input_tokenizer.fit_on_texts(input_texts)
target_tokenizer.fit_on_texts(target_texts)

texts_to_sequences(): Converts each text (sentence) into a sequence of integers. Each word in the text is replaced by its corresponding integer from the vocabulary.

In [None]:
input_sequences = input_tokenizer.texts_to_sequences(input_texts)
target_sequences = target_tokenizer.texts_to_sequences(target_texts)
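
To make the two tokenizer steps concrete, here is a minimal pure-Python stand-in. It is a simplified sketch, not the Keras implementation: it only lowercases and splits on whitespace, and the helper names build_word_index and to_sequences are assumptions made up for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Assign integers starting at 1, most frequent word first (0 is left for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its integer; unknown words are skipped in this sketch."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

sample_texts = ["good morning", "good night"]
word_index = build_word_index(sample_texts)
sequences = to_sequences(sample_texts, word_index)

print(word_index)  # {'good': 1, 'morning': 2, 'night': 3}
print(sequences)   # [[1, 2], [1, 3]]
```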


Step 5: Vocabulary and Sequence Length Calculation

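word_index: This dictionary holds the integer mapping for each word. Tokenizer indices start at 1 and index 0 is reserved for padding, so we add 1 to the dictionary size to cover every index that can appear in a padded sequence.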

input_vocab_size and target_vocab_size: Store the size of the vocabulary for the input and target languages.

In [None]:
input_vocab_size = len(input_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1
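
The small check below shows why the extra 1 is needed. The word_index dictionary is a hypothetical example of what a fitted tokenizer might hold, not output from this notebook.

```python
# Hypothetical word index of the kind a fitted tokenizer produces (indices start at 1).
word_index = {"good": 1, "morning": 2, "night": 3}

vocab_size = len(word_index) + 1  # +1 for index 0, which is kept for padding

# Every index that can appear in a padded sequence must be a valid position
# in a vector of length vocab_size.
padded_sequence = [1, 3, 0]
assert all(0 <= idx < vocab_size for idx in padded_sequence)

# Without the +1, the largest word index would fall outside the valid range.
assert max(word_index.values()) >= len(word_index)

print(vocab_size)  # 4
```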

max_input_len and max_target_len: Store the maximum sequence length in the input and target languages, respectively. These values are used to pad the sequences to a uniform length.

In [None]:
max_input_len = max(len(seq) for seq in input_sequences)
max_target_len = max(len(seq) for seq in target_sequences)

Step 6: Padding Sequences

pad_sequences(): Pads each sequence so that all sequences have the same length. Padding is applied to the end of the sequences (padding="post").

In [None]:
encoder_input_data = pad_sequences(input_sequences,maxlen=max_input_len,padding="post")
decoder_input_data = pad_sequences(target_sequences,maxlen=max_target_len,padding="post")
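
A minimal sketch of post-padding in plain Python follows. The helper pad_post is a hypothetical name for this example, and it assumes that no sequence is longer than maxlen (it does not truncate).

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (no truncation in this sketch)."""
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

sample_sequences = [[4, 2, 7], [5], [1, 3]]
maxlen = max(len(seq) for seq in sample_sequences)
padded = pad_post(sample_sequences, maxlen)

print(padded)  # [[4, 2, 7], [5, 0, 0], [1, 3, 0]]
```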

Step 7: One-Hot Encoding Target Sequences

np.zeros(): Creates a zero-filled 3D array in which the first axis indexes sentences, the second axis indexes time steps in the sequence, and the third axis (depth) has the size of the target vocabulary (for one-hot encoding).

for loop: Loops over the target sequences and creates one-hot encoded vectors in which only the index corresponding to the word is 1. The shift by one means that the word at position t becomes the target for time step t-1, so the decoder learns to predict the next word, starting from the second word.

In [None]:
decoder_target_data = np.zeros((len(target_texts), max_target_len, target_vocab_size), dtype="float32")
for i, seq in enumerate(target_sequences):
    for t, word in enumerate(seq):
        if t > 0:  # Target sequence shifted by one
            decoder_target_data[i, t - 1, word] = 1.0
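
To see what the shifted one-hot array looks like, the sketch below encodes a single made-up sequence. The values in sample_sequence, max_len and vocab_size are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical target sequence (one sentence, three time steps) and vocabulary size.
sample_sequence = [4, 2, 7]
max_len, vocab_size = 3, 8

target = np.zeros((1, max_len, vocab_size), dtype="float32")
for t, word in enumerate(sample_sequence):
    if t > 0:
        # The word at position t becomes the target for time step t-1.
        target[0, t - 1, word] = 1.0

print(target[0].argmax(axis=-1))  # [2 7 0]
print(target[0].sum(axis=-1))     # [1. 1. 0.]
```

Note that the last time step stays all zeros: after the shift there is no following word left to predict.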

Step 8: Splitting the Data

train_test_split(): Splits the input data (encoder and decoder inputs) and the target data into training and testing sets. test_size=0.2 means 20% of the data is used for testing and 80% for training.

In [None]:
X_train,X_test,y_train,y_test,decoder_input_train,decoder_input_test=train_test_split(encoder_input_data,decoder_target_data,decoder_input_data,test_size=0.2)
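
The important property of the split is that all three arrays are shuffled with the same indices, so each encoder input stays paired with its decoder input and target. The sketch below illustrates that idea with the standard library only. The helper aligned_split is a hypothetical name, and rounding the test count up is an assumption about how a fractional test size is resolved.

```python
import math
import random

def aligned_split(arrays, test_size, seed=0):
    """Shuffle indices once and apply the same train/test split to every array."""
    n = len(arrays[0])
    n_test = math.ceil(test_size * n)  # assumption: the test count is rounded up
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    result = []
    for arr in arrays:
        result.append([arr[i] for i in train_idx])
        result.append([arr[i] for i in test_idx])
    return result

# Seven labelled placeholders, one per sentence pair in the toy dataset.
enc = [f"enc{i}" for i in range(7)]
tgt = [f"tgt{i}" for i in range(7)]
dec = [f"dec{i}" for i in range(7)]

enc_tr, enc_te, tgt_tr, tgt_te, dec_tr, dec_te = aligned_split([enc, tgt, dec], test_size=0.2)
print(len(enc_tr), len(enc_te))  # 5 2
```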

Step 9: Model Architecture

In [None]:
# Define hyperparameters
latent_dim = 128     # Number of units in the LSTM
embedding_dim = 128  # Size of the word vectors; 50, 100, or 300 are also typical values

Input(shape=(max_input_len,)): Defines the input shape for the encoder (the padded input sentence length).

Embedding(): Maps the input word indices to dense vectors of size embedding_dim.

LSTM(): The LSTM layer processes the input embeddings and, with return_state=True, returns two things in addition to its output: the final hidden state (state_h) and the final cell state (state_c).

In [None]:
encoder_inputs=Input(shape=(max_input_len,))
encoder_embedding=Embedding(input_vocab_size,embedding_dim)(encoder_inputs)
encoder_lstm=LSTM(latent_dim,return_state=True)
encoder_outputs,state_h,state_c=encoder_lstm(encoder_embedding)
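
For intuition about what the final hidden and cell states are, here is a minimal NumPy sketch of the standard LSTM recurrence run over one embedded sentence. It is not the Keras implementation: the weights are random, the gate ordering inside the weight matrices is an arbitrary choice for this example, and lstm_final_states is a hypothetical helper name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_states(inputs, W, U, b):
    """Run an LSTM over `inputs` (timesteps, input_dim) and return the final (h, c)."""
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in inputs:
        z = x_t @ W + h @ U + b        # pre-activations of all four gates
        i, f, g, o = np.split(z, 4)    # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # new cell state
        h = sigmoid(o) * np.tanh(c)                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
timesteps, input_dim, units = 3, 5, 4
embedded = rng.normal(size=(timesteps, input_dim))  # stands in for the embedded sentence
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

state_h, state_c = lstm_final_states(embedded, W, U, b)
print(state_h.shape, state_c.shape)  # (4,) (4,)
```

Whatever the sentence length, the two returned vectors have a fixed size (the number of units), which is what lets them serve as a summary of the input for the decoder.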

Similar to the encoder, the decoder has an embedding layer followed by an LSTM. The decoder LSTM receives the encoder's final states (state_h, state_c) as its initial states for the decoding process.

return_sequences=True ensures that the decoder produces a sequence of outputs (one per time step) rather than just the last output.

In [None]:
decoder_inputs = Input(shape=(max_target_len,))
decoder_embedding = Embedding(target_vocab_size, embedding_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])

Dense layer

Dense(): A fully connected layer that outputs a probability distribution over the target vocabulary (for each word in the sequence).

softmax: Ensures that the output is a probability distribution.

In [None]:
decoder_dense = Dense(target_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
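
As a small illustration of what the softmax activation does to the scores at each time step, the sketch below applies it to made-up values. The logits array is an assumption for this example and does not come from the model above.

```python
import numpy as np

def softmax(logits):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores: 2 time steps, vocabulary of 4 words.
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.0, 0.0, 0.0, 0.0]])
probs = softmax(logits)

print(probs.sum(axis=-1))     # each row sums to 1
print(probs.argmax(axis=-1))  # index of the most likely word per time step
```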

Step 10: Defining the model


In [None]:
model=Model([encoder_inputs,decoder_inputs],decoder_outputs)
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

In [None]:
#Train the model
model.fit([X_train,decoder_input_train],y_train,batch_size=32,epochs=100,validation_data=([X_test,decoder_input_test],y_test))

Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step - accuracy: 0.0667 - loss: 0.3415 - val_accuracy: 0.0000e+00 - val_loss: 1.7158
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 1s/step - accuracy: 0.1333 - loss: 0.3393 - val_accuracy: 0.0000e+00 - val_loss: 1.7170
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step - accuracy: 0.1333 - loss: 0.3370 - val_accuracy: 0.0000e+00 - val_loss: 1.7183
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 53ms/step - accuracy: 0.1333 - loss: 0.3347 - val_accuracy: 0.0000e+00 - val_loss: 1.7196
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 75ms/step - accuracy: 0.1333 - loss: 0.3323 - val_accuracy: 0.0000e+00 - val_loss: 1.7210
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 119ms/step - accuracy: 0.1333 - loss: 0.3299 - val_accuracy: 0.0000e+00 - val_loss: 1.7225
Epoch 7/100
1/1

<keras.src.callbacks.history.History at 0x7a0b2ec2a140>

In [None]:
# Purpose of the inference models
# After the model has been trained, we need to define the inference process that actually generates translations.
# During training, both the encoder and the decoder receive complete sequences.
# During inference (prediction), however, we only have the input sequence, and the decoder must generate the output one word at a time.
# We therefore create two separate models for inference.

# Encoder model: Converts the input sentence into internal states (hidden and cell states) that are passed to the decoder.

# Decoder model: Takes the internal states and generates the output sequence word by word.


# Define the inference models for translation
# Encoder model
encoder_model=Model(encoder_inputs,[state_h,state_c])

# Purpose: The encoder processes the input sequence and outputs its final internal states (hidden state state_h and cell state state_c).
# These states are passed to the decoder during inference.
# encoder_inputs: The (padded) input sequence for the encoder.
# [state_h,state_c]: The encoder's final states, which the decoder uses to start generating the output sequence.



# Decoder model
decoder_state_input_h=Input(shape=(latent_dim,))
decoder_state_input_c=Input(shape=(latent_dim,))
# decoder_state_input_h, decoder_state_input_c: State inputs to the decoder.
# They receive the hidden state (state_h) and cell state (state_c) produced by the encoder.
# During inference the decoder is a separate model that is not connected to the encoder,
# so the states are supplied explicitly as inputs.
decoder_lstm_outputs,decoder_state_h,decoder_state_c=decoder_lstm(decoder_embedding,initial_state=[decoder_state_input_h,decoder_state_input_c])
decoder_outputs=decoder_dense(decoder_lstm_outputs)
decoder_model=Model([decoder_inputs,decoder_state_input_h,decoder_state_input_c],
                    [decoder_outputs,decoder_state_h,decoder_state_c])

# The decoder LSTM takes the current word (embedded by the decoder embedding layer)
# along with the hidden and cell states (decoder_state_input_h and decoder_state_input_c)
# as initial states.
# decoder_lstm_outputs: The LSTM output for the current time step.
# decoder_dense converts it into probabilities for each word in the vocabulary (decoder_outputs).
# decoder_state_h, decoder_state_c: The updated hidden and cell states after
# processing the current word.
# These states are passed back to the LSTM for the next time step.

# Function to decode a sequence using the trained model
# The function takes an input sequence (from the source language)
# and uses the encoder-decoder model to generate a translated sequence (in the target language).
# It does this iteratively, predicting one word at a time,
# until it either predicts the end-of-sequence token or reaches a specified maximum length.
def decode_sequence(input_seq):
  # Encode the input as state vectors
  states_value=encoder_model.predict(input_seq)
  # input_seq: The sequence that you want to translate.
  # The encoder model processes the input sequence and returns the state values
  # (hidden and cell states) that represent the context learned from the input sequence.
  # These states are used as the initial states for the decoder.
  target_seq=np.zeros((1,1))
  # Generate an empty target sequence of length 1.
  # target_seq: This starts as an array of zeros because, at the beginning,
  # there is no input to the decoder. As the decoder predicts words,
  # this array holds the index of the word generated at the previous step.

  stop_condition=False
  decoded_sentence=""
  # stop_condition: A flag that indicates when the decoding process should stop.
  # decoded_sentence: An empty string that will hold the generated translation.
  while not stop_condition:
    # The loop continues until the translation is complete
    # (i.e. when the decoder generates the end token or exceeds the allowed length).
    output_tokens,h,c=decoder_model.predict([target_seq]+states_value) # states_value is the list [h, c], so three inputs are passed

    # decoder_model uses the current target sequence (target_seq)
    # and the current states (states_value, initially the encoder's final states) to predict the next word.
    # output_tokens: The predicted probabilities of the next word.
    # h,c: The updated hidden and cell states. These states are passed to the next iteration to ensure continuity in generating coherent sentences.
    sampled_token_index=np.argmax(output_tokens[0,-1,:])
    sampled_word=target_tokenizer.index_word.get(sampled_token_index,"")
    # output_tokens[0,-1,:]:
    # The output_tokens array contains the predicted probabilities for each possible word in the vocabulary.
    # The shape of output_tokens is typically (batch_size, sequence_length, vocab_size).
    # Here batch_size is 1 because we are decoding one sentence,
    # sequence_length is 1 because only one word is generated at each time step,
    # and vocab_size is the number of possible words in the target vocabulary.
    # output_tokens[0,-1,:] selects the predicted probabilities at the current (last) time step.
    # Illustration: Suppose the vocabulary has 5 words: {0:'hello',1:'world',2:'how',3:'are',4:'you'}
    # output_tokens might then look something like this:
    # output_tokens[0,-1,:]=[0.1,0.6,0.05,0.15,0.1]
    # sampled_token_index=np.argmax(output_tokens[0,-1,:]):
    # np.argmax() finds the index of the highest probability in that array.
    # In this case it selects index 1, because the highest probability (0.6)
    # corresponds to the word 'world'.
    # Using sampled_token_index=1:
    # sampled_word=target_tokenizer.index_word.get(1,"")
    # sampled_word="world"
    # Putting it all together:
    # np.argmax() selects the index of the most likely word (1 in this case),
    # and this index is then used to retrieve the corresponding word ('world' in this case)
    # from the tokenizer's dictionary.
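    decoded_sentence+=sampled_word+" "
    # The predicted word is appended to the decoded sentence string.
    # Decoding stops when the "<end>" token is predicted, when the padding index
    # (which has no word) is predicted, or when the number of generated words reaches max_target_len.
    # max_target_len is measured in words, so words are counted here rather than characters.
    # Note: the toy dataset above contains no "<end>" marker, so in practice the length limit ends the loop.
    if sampled_word in ("","<end>") or len(decoded_sentence.split())>=max_target_len:
      stop_condition=True
    # Update the target sequence for the next iteration: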
    target_seq=np.zeros((1,1))
    # This line creates a 2D NumPy array filled with zeros, with shape (1,1).
    # In sequence-to-sequence models (such as machine translation),
    # it holds the token (word index) that is fed into the decoder at the next time step.
    target_seq[0,0]=sampled_token_index
    # This line assigns sampled_token_index (the index of the word predicted by the decoder in the previous step) to target_seq.
    # The value is placed at position [0,0] because target_seq is a 1x1 array and [0,0]
    # refers to the only element in that array.
    # Example: with sampled_token_index=1 (from the previous word prediction step),
    # target_seq[0,0]=1 gives target_seq=[[1.]]
    # Purpose:
    # target_seq is the input to the decoder at the next time step.
    # At each decoding step the decoder must be fed the token (word) predicted
    # in the previous time step, so this array is updated with the index of the last
    # predicted word (sampled_token_index) and then passed to the decoder for the next prediction.
    states_value=[h,c]
    # The updated hidden and cell states (h and c) are passed back into the decoder
    # to maintain the flow of information across time steps.
  return decoded_sentence
# translate(sentence): This function translates a given sentence.
# input_tokenizer.texts_to_sequences([sentence]): Converts the input sentence into a sequence of integers.
# pad_sequences(): Pads the input sequence to the maximum length (since the model expects inputs of a fixed length).
# decode_sequence(): Calls the decoding function to generate the translation for the given input.
# Translate a sentence
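
The control flow of greedy decoding can be seen in isolation with a toy stand-in for the trained decoder. In the sketch below, the lookup table next_token_probs, the vocabulary index_word and the helper greedy_decode are all made up for illustration. The LSTM states are left out, so the next word depends only on the previous one.

```python
import numpy as np

# Hypothetical vocabulary; index 0 is the padding/start index and has no word.
index_word = {1: "bonne", 2: "nuit", 3: "<end>"}

# Stand-in for the trained decoder: row = previous token, columns = next-token probabilities.
next_token_probs = np.array([
    [0.0, 0.8, 0.1, 0.1],  # after start (0): most likely "bonne"
    [0.0, 0.1, 0.8, 0.1],  # after "bonne": most likely "nuit"
    [0.0, 0.1, 0.1, 0.8],  # after "nuit": most likely "<end>"
    [0.0, 0.0, 0.0, 1.0],  # after "<end>": stay at "<end>"
])

def greedy_decode(max_words):
    """Pick the most likely next word at each step and feed it back in."""
    previous_token = 0
    words = []
    while len(words) < max_words:
        probs = next_token_probs[previous_token]
        token = int(np.argmax(probs))   # greedy choice: highest probability
        word = index_word.get(token, "")
        if word in ("", "<end>"):       # stop on padding or end marker
            break
        words.append(word)
        previous_token = token          # the prediction becomes the next input
    return " ".join(words)

print(greedy_decode(max_words=5))  # bonne nuit
```

The loop has two exits, mirroring the stop condition described above: an end marker (or padding) is predicted, or the word limit is reached.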


In [None]:
def translate(sentence):
  sequence=input_tokenizer.texts_to_sequences([sentence])
  sequence=pad_sequences(sequence,maxlen=max_input_len,padding="post")
  translation=decode_sequence(sequence)
  return translation

#Example usage :
translated_sentence=translate("hello")
print("Translated Sentence:",translated_sentence)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 98ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 105ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Translated Sentence: va nuit 
