Nowadays, people generally use LLMs rather than LSTM models for language translation.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

## Step 2: Dataset Definition
This is the dataset: each tuple consists of a sample English text and its French translation. It is a small toy dataset for demonstration purposes.

In [None]:
data = [
         ("hello", "bonjour"),
         ("how are you", "comment ça va"),
         ("thank you", "merci"),
         ("good morning", "bonjour"),
         ("good night", "bonne nuit"),
         ("see you later", "à plus tard"),
         ("I love you", "je t'aime"),
]

## Step 3: Text Preparation
zip(*data): Separates the data tuples into two tuples: one for input_texts (English) and one for target_texts (French).

In [None]:
input_texts, target_texts = zip(*data)
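
To make the unpacking idiom concrete, here is a minimal sketch on a hypothetical three-pair subset of the data (the names `pairs`, `english`, and `french` are illustrative only). It shows that `zip(*...)` transposes a list of pairs, and that `zip()` reverses the operation.

```python
# Hypothetical miniature dataset in the same (English, French) tuple format.
pairs = [("hello", "bonjour"), ("thank you", "merci"), ("good night", "bonne nuit")]

# zip(*pairs) transposes the list of pairs into two tuples.
english, french = zip(*pairs)

print(english)
print(french)

# zip() pairs the two tuples up again, recovering the original list.
restored = list(zip(english, french))
print(restored == pairs)
```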

## Step 4: Tokenization
Tokenizer(): Creates a tokenizer that will convert text into sequences of integers.

fit_on_texts(): Builds a vocabulary from the texts it is given (input_texts or target_texts) and assigns a unique integer to each word.

In [None]:
input_tokenizer = Tokenizer()
target_tokenizer = Tokenizer()
input_tokenizer.fit_on_texts(input_texts)
target_tokenizer.fit_on_texts(target_texts)

texts_to_sequences(): Converts each text (sentence) into a sequence of integers. Each word in the text is replaced by its corresponding integer from the vocabulary.

In [None]:
input_sequences = input_tokenizer.texts_to_sequences(input_texts)
target_sequences = target_tokenizer.texts_to_sequences(target_texts)
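
The two tokenizer steps can be imitated in plain Python to see what they produce. This is only a rough sketch, not the Keras implementation: it assumes lowercasing and whitespace splitting, gives the most frequent word the id 1, and silently drops unknown words. The helper names `build_word_index` and `to_sequences` are made up for this illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Assign integer ids starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

texts = ["thank you", "how are you", "I love you"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)

print(word_index)
print(sequences)
```

Note that no word receives the id 0; that value stays free for padding.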

## Step 5: Vocabulary and Sequence Length Calculation
word_index: This dictionary holds the integer mapping for each word. Word indices start at 1, so we add 1 to the vocabulary size to account for index 0, which is reserved for padding. input_vocab_size and target_vocab_size: Store the size of the vocabulary for the input and target languages.

In [None]:
input_vocab_size = len(input_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1
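
A small NumPy sketch of why the extra 1 matters. An embedding layer behaves like a lookup table with one row per index, so the table needs a row for index 0 (padding) in addition to one row per word. The three-word vocabulary and the sizes below are hypothetical.

```python
import numpy as np

word_index = {"you": 1, "thank": 2, "how": 3}   # hypothetical 3-word vocabulary
vocab_size = len(word_index) + 1                 # +1 reserves row 0 for padding

# Lookup table with one row per index (0 = padding, 1..3 = words).
embedding_dim = 4
table = np.random.default_rng(0).normal(size=(vocab_size, embedding_dim))

padded_sequence = np.array([2, 1, 0])            # "thank you" followed by one pad
vectors = table[padded_sequence]                 # every index 0..3 is a valid row

print(vocab_size)
print(vectors.shape)
```

Without the extra row, the highest word index (3 here) would fall outside the table.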

max_input_len and max_target_len: Store the maximum length of sequences in the input and target languages, respectively. This helps with padding the sequences to a uniform length.

In [None]:
max_input_len = max(len(seq) for seq in input_sequences)
max_target_len = max(len(seq) for seq in target_sequences)

## Step 6: Padding Sequences
pad_sequences(): Pads each sequence to ensure that all sequences have the same length. Padding is applied to the end of the sequences (padding="post").

In [None]:
encoder_input_data = pad_sequences(input_sequences, maxlen=max_input_len, padding="post")
decoder_input_data = pad_sequences(target_sequences, maxlen=max_target_len, padding="post")
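
What post-padding does can be sketched in a few lines of NumPy. The helper `pad_post` is a hypothetical stand-in, not the Keras function; it also keeps only the first `maxlen` items of an over-long sequence, which is an assumption of this sketch and does not matter here because `maxlen` is the longest sequence length.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad integer sequences with `value` up to `maxlen` columns."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]               # keep at most maxlen items
        out[row, :len(trimmed)] = trimmed
    return out

sequences = [[3], [4, 5, 1], [2, 1]]         # hypothetical token ids
padded = pad_post(sequences, maxlen=3)

print(padded)
```

Every row now has the same length, with zeros filling the positions after the last real token.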

## Step 7: One-Hot Encoding Target Sequences

np.zeros(): Creates a zero array in which each row corresponds to a sentence and each column corresponds to a time step in the sequence.
The depth corresponds to the size of the vocabulary.
Loop: Iterates over the target sequences and creates one-hot encoded vectors in which only the index of the target word is set to 1.
The shift by one means that the target at time step t-1 is the word at position t, so at each step the model learns to predict the next word.

In [None]:
decoder_target_data = np.zeros((len(target_texts), max_target_len, target_vocab_size), dtype="float32")
for i, seq in enumerate(target_sequences):
    for t, word in enumerate(seq):
        if t > 0:  # target sequence is shifted by one time step
            decoder_target_data[i, t - 1, word] = 1.0
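
A worked example of the shifted one-hot encoding, using two hypothetical token-id sequences and a vocabulary of 6. The last lines turn the one-hot array back into ids so the shift is easy to read; time steps without a target show up as 0.

```python
import numpy as np

target_sequences = [[3, 5, 2], [4, 1]]     # hypothetical token ids
max_len, vocab_size = 3, 6

decoder_target = np.zeros((len(target_sequences), max_len, vocab_size), dtype="float32")
for i, seq in enumerate(target_sequences):
    for t, word in enumerate(seq):
        if t > 0:
            decoder_target[i, t - 1, word] = 1.0   # step t-1 must predict word t

# Recover the id per step; all-zero rows (no target) are reported as 0.
target_ids = decoder_target.argmax(axis=-1) * decoder_target.any(axis=-1)

print(target_ids)
```

The first word of each sequence (3 and 4 here) never appears as a target. This is why sequence-to-sequence setups commonly wrap each target sentence in start and end markers, so that the first real word is also predicted.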

## Step 8: Splitting the data
train_test_split(): Splits the encoder inputs, the target data, and the decoder inputs into training and testing sets, applying the same shuffle to all three so that the rows stay aligned.
test_size=0.2 means 20% of the data is used for testing and 80% for training.

In [None]:
X_train, X_test, y_train, y_test, decoder_input_train, decoder_input_test = train_test_split(
    encoder_input_data, decoder_target_data, decoder_input_data, test_size=0.2
)
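
The key property of the split is that all three arrays are cut with the same shuffled row order. Below is a minimal NumPy sketch of that idea, not the scikit-learn implementation; `split_together` is a hypothetical helper, and rounding the test count up is an assumption that I believe matches scikit-learn's behaviour.

```python
import numpy as np

def split_together(*arrays, test_size=0.2, seed=0):
    """Shuffle row indices once and cut every array at the same positions."""
    n = len(arrays[0])
    n_test = int(np.ceil(test_size * n))
    order = np.random.default_rng(seed).permutation(n)
    train_idx, test_idx = order[n_test:], order[:n_test]
    parts = []
    for a in arrays:
        parts.extend([a[train_idx], a[test_idx]])
    return parts

# Stand-ins for the three aligned arrays (7 rows, like the toy dataset).
enc = np.arange(7).reshape(7, 1)
tgt = np.arange(7).reshape(7, 1) * 10
dec = np.arange(7).reshape(7, 1) * 100

enc_tr, enc_te, tgt_tr, tgt_te, dec_tr, dec_te = split_together(enc, tgt, dec)

print(len(enc_tr), len(enc_te))
print((tgt_tr == enc_tr * 10).all(), (dec_te == enc_te * 100).all())
```

With only 7 sentence pairs, the held-out set under these assumptions contains just 2 examples, so the validation metrics are very noisy. Passing `random_state` to `train_test_split` makes the split reproducible.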


## Step 9: Model Architecture

In [None]:
# Define hyperparameters
latent_dim = 128     # number of units in the LSTM
embedding_dim = 128  # size of the word embeddings (any value works; typically 50, 100, or 300)

Input(shape=(max_input_len,)): Defines the input shape for the encoder (the padded input sentence length).
Embedding(): Maps the input word indices to dense vectors of size embedding_dim; mask_zero=True marks index 0 as padding so that later layers can skip it.
LSTM(): Processes the input embeddings. With return_state=True it returns its output together with the final hidden state (state_h) and cell state (state_c). These states will be passed to the decoder.

In [None]:
from tensorflow.keras.layers import Input, Embedding, LSTM
# Define the encoder input layer
encoder_inputs = Input(shape=(max_input_len,))

# Define the embedding layer with masking
encoder_embedding = Embedding(input_dim=input_vocab_size, output_dim=embedding_dim, mask_zero=True)(encoder_inputs)

# Define the LSTM layer for the encoder
encoder_lstm = LSTM(latent_dim, return_state=True)

# Get the encoder outputs and states
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
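
To show where state_h and state_c come from, here is a NumPy sketch of the standard LSTM update applied over a short sequence. It is an illustration under assumptions, not the Keras layer: the weights are random, the sizes are tiny, masking is omitted, and the gate ordering inside the weight matrices is an arbitrary choice of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = x @ W + h @ U + b                       # pre-activations of all four gates
    i, f, g, o = np.split(z, 4, axis=-1)        # input, forget, candidate, output
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
embedding_dim, latent_dim, steps = 8, 4, 3      # small sizes for illustration
W = rng.normal(scale=0.1, size=(embedding_dim, 4 * latent_dim))
U = rng.normal(scale=0.1, size=(latent_dim, 4 * latent_dim))
b = np.zeros(4 * latent_dim)

h = np.zeros((1, latent_dim))
c = np.zeros((1, latent_dim))
for x in rng.normal(size=(steps, 1, embedding_dim)):   # one embedded word per step
    h, c = lstm_step(x, h, c, W, U, b)

state_h, state_c = h, c                         # final states after the last word

print(state_h.shape, state_c.shape)
```

The two final arrays play the role of the states that the encoder hands to the decoder: a fixed-size summary of the whole input sentence, regardless of its length.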


Similar to the encoder, the decoder has an embedding layer followed by an LSTM. The LSTM receives the encoder's final states (state_h, state_c) as its initial states for the decoding process. return_sequences=True ensures that the decoder produces an output at every time step rather than just the last one.

In [None]:
decoder_inputs = Input(shape=(max_target_len,))
decoder_embedding = Embedding(target_vocab_size, embedding_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])

## Dense Layer
Dense(): A fully connected layer that outputs a probability distribution over the target vocabulary for each position in the sequence. softmax: Ensures that each output is a valid probability distribution (non-negative values that sum to 1).

In [None]:
from tensorflow.keras.layers import Dense
decoder_dense = Dense(target_vocab_size, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)
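
A small NumPy sketch of the softmax step applied independently at each time step. The logits below are made-up numbers for a batch of 1, 2 time steps, and a 5-word vocabulary; the `softmax` helper is written here for illustration.

```python
import numpy as np

def softmax(logits):
    """Softmax over the last axis (the vocabulary axis)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores: shape (batch=1, time steps=2, vocabulary=5).
logits = np.array([[[2.0, 1.0, 0.1, 0.0, -1.0],
                    [0.0, 0.0, 3.0, 0.0, 0.0]]])
probs = softmax(logits)

print(probs.shape)
print(probs.sum(axis=-1))       # each time step sums to 1
print(probs.argmax(axis=-1))    # most likely word index per time step
```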

## Step 10: Defining the model


In [None]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
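
For intuition about the loss named in compile(), here is a NumPy sketch of categorical cross-entropy for one-hot targets. It follows the textbook formula (minus the sum of target times log prediction) and is not the Keras code; the probabilities are invented, and the clipping constant `eps` is an assumption of this sketch.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-row cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],   # correct word has index 1
                   [0.0, 0.0, 0.0, 0.0, 0.0]])  # all-zero row: no target at this step
y_pred = np.array([[0.1, 0.6, 0.05, 0.15, 0.1],
                   [0.2, 0.2, 0.2, 0.2, 0.2]])
losses = categorical_crossentropy(y_true, y_pred)

print(losses)
```

Only the probability assigned to the correct word enters the loss. A time step whose target row is all zeros, as produced by the shifted one-hot encoding for padded positions, contributes nothing to the sum.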

In [None]:
# Train the model
model.fit(
    [X_train, decoder_input_train],
    y_train,
    batch_size=32,
    epochs=100,
    validation_data=([X_test, decoder_input_test], y_test),
)

Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 5s 5s/step - accuracy: 0.0000e+00 - loss: 0.8545 - val_accuracy: 0.0000e+00 - val_loss: 0.4275
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 676ms/step - accuracy: 0.2000 - loss: 0.8498 - val_accuracy: 0.0000e+00 - val_loss: 0.4275
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 154ms/step - accuracy: 0.2667 - loss: 0.8451 - val_accuracy: 0.0000e+00 - val_loss: 0.4274
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 142ms/step - accuracy: 0.2667 - loss: 0.8402 - val_accuracy: 0.0000e+00 - val_loss: 0.4273
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 126ms/step - accuracy: 0.2667 - loss: 0.8352 - val_accuracy: 0.0000e+00 - val_loss: 0.4273
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 122ms/step - accuracy: 0.2667 - loss: 0.8299 - val_accuracy: 0.0000e+00 - val_loss: 0.4272
Epoch 7/1

<keras.src.callbacks.history.History at 0x7d86d62360e0>

In [None]:
# Purpose of the inference models

# After the model has been trained, we need to define the inference process to actually generate translations.
# During training, both the encoder and the decoder receive complete sequences. During inference (prediction),
# we only have the input sentence, and the decoder must generate the output one word at a time.
# We therefore create two separate models for inference:
# Encoder model: converts the input sentence into internal states (hidden and cell states)
# that are passed to the decoder.
# Decoder model: takes the encoder's internal states and generates the output sequence word by word.
# Define the inference models for translation

#Encoder model
encoder_model = Model(encoder_inputs, [state_h, state_c])

#Purpose: The encoder processes the input sequence and outputs its final internal states
#(hidden state state_h and cell state state_c).
#These states will be passed to the decoder during inference.
#encoder_inputs: The input sequence for the encoder (which is padded).
#[state_h, state_c]: The encoder's final states that the decoder will use to start
#generating the output sequence.

# Decoder model
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))

# decoder_state_input_h and decoder_state_input_c: inputs to the decoder model.
# They carry the hidden state (state_h) and cell state (state_c): the encoder's
# final states on the first step, and the decoder's own updated states afterwards.
# At inference time these states are not wired in from the encoder graph,
# so they are declared as explicit inputs of the decoder model.

decoder_lstm_outputs, decoder_state_h, decoder_state_c = decoder_lstm(decoder_embedding, initial_state=[decoder_state_input_h, decoder_state_input_c])
decoder_outputs = decoder_dense(decoder_lstm_outputs)
decoder_model = Model([decoder_inputs, decoder_state_input_h, decoder_state_input_c], [decoder_outputs, decoder_state_h, decoder_state_c])

# The decoder LSTM takes in the current word (embedded by the decoder embedding layer)
# along with the hidden and cell states (decoder_state_input_h and decoder_state_input_c)
# as initial states.
# decoder_lstm_outputs: the LSTM output for the current time step. The dense softmax
# layer turns it into decoder_outputs, the probabilities for each word in the vocabulary.
# decoder_state_h, decoder_state_c: the updated hidden and cell states after
# processing the current word. These states are passed back into the LSTM for
# the next time step.

# Function to decode a sequence using the trained model.
# The function takes an input sequence (in the source language)
# and uses the encoder-decoder model to generate a translated sequence (in the target language).
# It does this iteratively, predicting one word at a time,
# until it either predicts the end-of-sequence token or reaches a specified maximum length.

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # input_seq: the sequence that you want to translate.
    # The encoder model processes the input sequence and returns states_value
    # (hidden and cell states), which represent the context learned from the input sequence.
    # These states are used as the initial state for the decoder.

    target_seq = np.zeros((1, 1))

    # target_seq: starts as an array of zeros because, at the beginning,
    # there is no input to the decoder. As the decoder predicts words,
    # this array holds the index of the word generated at the previous step.

    stop_condition = False
    decoded_sentence = ""
    # decoded_sentence: an empty string that will hold the generated translation.
    # stop_condition: a flag indicating when the decoding process should stop.

    while not stop_condition:
        # The loop continues until the translation is complete,
        # i.e., when the decoder generates an end token or exceeds the allowed length.
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # The decoder model uses the current target sequence (target_seq)
        # and the current states (states_value, initially the encoder's final states)
        # to predict the next word.
        # output_tokens: the predicted probabilities of the next word.
        # h, c: the updated hidden and cell states, to be passed to the next iteration.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        # sampled_token_index: the index of the predicted word.
        # NOTE: the next line assigns sample_word, but the code further down reads
        # sampled_word; this mismatch causes the NameError shown in the output.
        sample_word=target_tokenizer.index_word.get(sampled_token_index,"")
        # The word corresponding to the predicted index.

        # output_tokens[0, -1, :]:
        # The output_tokens array contains the predicted probabilities for each possible word in the vocabulary.
        # Its shape is (batch_size, sequence_length, vocabulary_size).
        # Here batch_size is 1 because we are decoding one sentence,
        # sequence_length is 1 because only one word is generated at each time step,
        # and vocabulary_size is the number of possible words in the target vocabulary.
        # output_tokens[0, -1, :] selects the predicted probabilities over the vocabulary
        # at the current time step.
        # Illustration: suppose the vocabulary has 5 words: {0: 'hello', 1: 'world', 2: 'how', 3: 'are', 4: 'you'}
        # and the output looks like this:
        # output_tokens[0, -1, :] = [0.1, 0.6, 0.05, 0.15, 0.1]
        #
        # sampled_token_index = np.argmax(output_tokens[0, -1, :]):
        # np.argmax() finds the index of the highest probability in the array.
        # Here it selects index 1, because the highest probability (0.6)
        # corresponds to the word 'world'.
        # That index is then used to look up the word in the tokenizer's dictionary:
        # target_tokenizer.index_word.get(1, "") returns "world".


        decoded_sentence += sampled_word + " "
        # The predicted word is appended to the decoded_sentence string.
        if sampled_word == "<end>" or len(decoded_sentence.split()) > max_target_len:
            stop_condition = True
        # Decoding stops when the <end> token is predicted
        # or when the sentence exceeds the maximum allowed length (max_target_len),
        # counted in words rather than characters.

        # Update the target sequence for the next iteration:
        target_seq = np.zeros((1, 1))
        # This creates a 2D NumPy array of zeros with shape (1, 1). It holds the
        # token (word index) that will be fed into the decoder at the next step.
        target_seq[0, 0] = sampled_token_index
        # This assigns sampled_token_index (the index of the word the decoder just
        # predicted) to the only element of the 1x1 array. For example, with
        # sampled_token_index = 1, target_seq becomes [[1.]].
        # At each decoding step, the decoder must be fed the token predicted at the
        # previous step, so this array is refreshed and passed back to the decoder.
        states_value = [h, c]
        # The updated hidden and cell states (h and c) are passed back into the decoder
        # to maintain the flow of information across time steps.
    return decoded_sentence

def translate(sentence):
    sequence = input_tokenizer.texts_to_sequences([sentence])
    sequence = pad_sequences(sequence, maxlen=max_input_len, padding="post")
    translation = decode_sequence(sequence)
    return translation


translated_sentence = translate("hello")
print("Translated Sentence:", translated_sentence)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 229ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 185ms/step


NameError: name 'sampled_word' is not defined
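
The decoding loop described in the comments above can be sketched independently of Keras. Everything in this block is hypothetical: the token ids, the two-word vocabulary, and `fake_decoder_step`, which stands in for the decoder model by returning next-token probabilities from a fixed table. The loop itself shows the technique: start from a start marker, take the argmax, stop at the end marker or after a fixed number of tokens, and carry the state forward.

```python
import numpy as np

START, END = 1, 2                       # hypothetical ids for start/end markers
index_word = {3: "bonne", 4: "nuit"}    # hypothetical target vocabulary

# Stand-in for the decoder: next-token probabilities given the previous token.
# Rows are the previous token id, columns the next token id.
transition = np.array([
    [1.0, 0.0, 0.0, 0.00, 0.00],   # 0 (padding): unused
    [0.0, 0.0, 0.1, 0.80, 0.10],   # after START -> "bonne"
    [0.0, 0.0, 1.0, 0.00, 0.00],   # after END: unused
    [0.0, 0.0, 0.1, 0.10, 0.80],   # after "bonne" -> "nuit"
    [0.0, 0.0, 0.9, 0.05, 0.05],   # after "nuit" -> END
])

def fake_decoder_step(token, state):
    """Return (probabilities, new_state); the state is passed through unchanged."""
    return transition[token], state

def greedy_decode(step_fn, state, max_len=10):
    token, words = START, []
    for _ in range(max_len):                  # cap on generated tokens, not characters
        probs, state = step_fn(token, state)  # carry the updated state forward
        token = int(np.argmax(probs))         # greedy choice: most likely next token
        if token == END:
            break
        words.append(index_word.get(token, ""))
    return " ".join(words)

result = greedy_decode(fake_decoder_step, state=None)
print(result)
```

Two details matter in a real loop as well: the variable that receives the looked-up word must be the same one that is appended to the sentence, and the updated states must be written back to the variable that the next prediction reads.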

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input

# Assume input_tokenizer, encoder_model, decoder_model, and other necessary components are defined

def decode_sequence(input_seq):
    # Encode the input as state vectors
    states_value = encoder_model.predict(input_seq)

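    # Seed the decoder with the <start> token. This lookup assumes the target
    # tokenizer was fitted on text containing a <start> token; otherwise it
    # raises the KeyError shown in the output.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_tokenizer.word_index['<start>']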

    # Create variables to store the translation and stop condition
    stop_condition = False
    decoded_sentence = ""

    while not stop_condition:
        # Get the output tokens and updated states from the decoder model
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Get the index of the most likely next word
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = target_tokenizer.index_word.get(sampled_token_index, '')

        # Append the word to the decoded sentence
        decoded_sentence += ' ' + sampled_word

        # Check for the end token or if the sentence exceeds max length
        if (sampled_word == "<end>" or len(decoded_sentence.split()) > max_target_len):
            stop_condition = True

        # Update the target sequence to the last predicted word's index for the next time step
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update the states for the next iteration
        states_value = [h, c]

    return decoded_sentence

def translate(sentence):
    # Convert the input sentence into a sequence
    sequence = input_tokenizer.texts_to_sequences([sentence])
    padded_sequence = pad_sequences(sequence, maxlen=max_input_len, padding="post")

    # Generate the translation using the decoder
    translation = decode_sequence(padded_sequence)

    return translation

# Example usage
translated_sentence = translate("hello")
print("Translated Sentence:", translated_sentence)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 329ms/step


KeyError: '<start>'
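
The KeyError occurs because the target tokenizer was fitted on sentences that contain no start marker. One way to address this, sketched below under assumptions, is to wrap every target sentence in marker words before tokenizing and then build the decoder input and target from the wrapped sequence. The marker words `startseq` and `endseq` are hypothetical choices; they avoid angle brackets because, as far as I know, the Keras Tokenizer's default filters strip `<` and `>`, which would turn `<start>` into `start`. A whitespace split stands in for the tokenizer here.

```python
target_texts = ["bonjour", "merci", "bonne nuit"]   # subset of the toy data

# Hypothetical marker words, chosen without punctuation so that a tokenizer's
# punctuation filtering cannot alter them.
START, END = "startseq", "endseq"
wrapped = [f"{START} {text} {END}" for text in target_texts]

# Whitespace-split stand-in for a fitted tokenizer's word_index (ids start at 1).
word_index = {}
for sentence in wrapped:
    for word in sentence.split():
        word_index.setdefault(word, len(word_index) + 1)

sequences = [[word_index[w] for w in s.split()] for s in wrapped]

# Teacher forcing: the decoder input drops the last token, the target drops the first.
decoder_inputs = [seq[:-1] for seq in sequences]
decoder_targets = [seq[1:] for seq in sequences]

print(word_index[START], word_index[END])
print(decoder_inputs[2], "->", decoder_targets[2])
```

With this layout the first real word is a prediction target, and the start marker exists in the vocabulary at inference time. The model would have to be retrained on the wrapped targets for the lookup to be meaningful.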

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input

# Assume input_tokenizer, encoder_model, decoder_model, and other necessary components are defined

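# Ensure the target data includes <start> and <end> tokens
target_sentences = [
    "<start> hello world <end>",
    "<start> how are you <end>",
    # Add more sentences...
]

# Fit the tokenizer on the modified target sentences.
# NOTE: the placeholder on the next line is not valid Python and causes the
# SyntaxError shown in the output; it must be replaced with a real tokenizer
# instance. With the Keras Tokenizer, the default filters strip "<" and ">",
# so the filters would need to be overridden (e.g. Tokenizer(filters=""))
# for "<start>" and "<end>" to survive as tokens.
target_tokenizer = <your tokenizer method>  # Define this appropriately
target_tokenizer.fit_on_texts(target_sentences)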

# Function to decode a sequence
def decode_sequence(input_seq):
    # Encode the input as state vectors
    states_value = encoder_model.predict(input_seq)

    # Create the target sequence with a start token
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_tokenizer.word_index['<start>']

    # Create variables to store the translation and stop condition
    stop_condition = False
    decoded_sentence = ""

    while not stop_condition:
        # Get the output tokens and updated states from the decoder model
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Get the index of the most likely next word
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = target_tokenizer.index_word.get(sampled_token_index, '')

        # Append the word to the decoded sentence
        decoded_sentence += ' ' + sampled_word

        # Check for the end token or if the sentence exceeds max length
        if (sampled_word == "<end>" or len(decoded_sentence.split()) > max_target_len):
            stop_condition = True

        # Update the target sequence to the last predicted word's index for the next time step
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update the states for the next iteration
        states_value = [h, c]

    return decoded_sentence

# Function to translate an input sentence
def translate(sentence):
    # Convert the input sentence into a sequence
    sequence = input_tokenizer.texts_to_sequences([sentence])
    padded_sequence = pad_sequences(sequence, maxlen=max_input_len, padding="post")

    # Generate the translation using the decoder
    translation = decode_sequence(padded_sequence)

    return translation

# Example usage
translated_sentence = translate("hello")
print("Translated Sentence:", translated_sentence)



SyntaxError: invalid syntax (<ipython-input-24-78e5de326418>, line 16)