## Inference and Imports

This section sets up inference with a pre-trained sequence-to-sequence model. It imports the required libraries, defines a function that loads the saved models and tokenizers, and defines a function that lemmatizes Sindhi words.

#### Steps and Explanation

1. **Import Libraries**:
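   - Import `numpy`, `pickle`, `train_test_split` from `sklearn`, and the tokenizer, padding, model, and layer components from `tensorflow.keras`. Of these, the inference code in this section only calls `numpy`, `pad_sequences`, `load_model`, and `pickle`.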

2. **Load Models and Tokenizers**:
   - Define a function `load_models_and_tokenizers` to load the pre-trained encoder and decoder models, input and target tokenizers, and the configuration dictionary from saved files.

3. **Lemmatize Function**:
   - Define a `lemmatize` function to convert input text into sequences, pad them, and predict the lemma using the loaded models. This function implements the inference logic of the sequence-to-sequence model.

4. **Test the Lemmatizer**:
   - Load the models and tokenizers using the `load_models_and_tokenizers` function.
   - Test the lemmatizer on a set of Sindhi words to verify its functionality.

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
import pickle

### Loading Models and Tokenizers

This function loads the trained encoder and decoder models along with their tokenizers and configuration settings, all of which are needed to run inference with the trained lemmatization model. Here's a breakdown of the code:

#### Code Steps and Explanation

1. **Loading Encoder and Decoder Models**:
   - `encoder_model = load_model('encoder_model.h5')`: Loads the saved encoder model.
   - `decoder_model = load_model('decoder_model.h5')`: Loads the saved decoder model.

2. **Loading Input Tokenizer**:
   - `with open('input_tokenizer.pickle', 'rb') as handle: input_tokenizer = pickle.load(handle)`: Opens and loads the input tokenizer from the saved pickle file.

3. **Loading Target Tokenizer**:
   - `with open('target_tokenizer.pickle', 'rb') as handle: target_tokenizer = pickle.load(handle)`: Opens and loads the target tokenizer from the saved pickle file.

4. **Loading Configuration**:
   - `with open('config.pickle', 'rb') as handle: config = pickle.load(handle)`: Opens and loads the configuration settings from the saved pickle file.

5. **Returning Loaded Objects**:
   - `return encoder_model, decoder_model, input_tokenizer, target_tokenizer, config`: Returns the loaded encoder model, decoder model, input tokenizer, target tokenizer, and configuration settings.
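
Because the loader expects five files in the working directory, it can help to check for them before calling it. The following is a minimal sketch, assuming the files sit in one directory; the file names are the ones used in this notebook, while the helper `missing_artifacts` is hypothetical and not part of the original code.

```python
from pathlib import Path

# File names as used in this notebook; the helper below is a hypothetical addition.
ARTIFACTS = [
    "encoder_model.h5",
    "decoder_model.h5",
    "input_tokenizer.pickle",
    "target_tokenizer.pickle",
    "config.pickle",
]


def missing_artifacts(directory="."):
    """Return the artifact file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in ARTIFACTS if not (base / name).is_file()]


# Report what is missing in the current working directory.
missing = missing_artifacts()
print("Missing files:", missing if missing else "none")
```

If the list is empty, `load_models_and_tokenizers()` should at least find every file it tries to open.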

In [3]:
def load_models_and_tokenizers():
    encoder_model = load_model('encoder_model.h5')
    decoder_model = load_model('decoder_model.h5')

    with open('input_tokenizer.pickle', 'rb') as handle:
        input_tokenizer = pickle.load(handle)

    with open('target_tokenizer.pickle', 'rb') as handle:
        target_tokenizer = pickle.load(handle)

    with open('config.pickle', 'rb') as handle:
        config = pickle.load(handle)

    return encoder_model, decoder_model, input_tokenizer, target_tokenizer, config
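
The training notebook that wrote these files is not shown here. As a rough illustration of the saving side, the sketch below round-trips a configuration dictionary through `pickle`. The keys `max_input_len` and `max_target_len` are the ones read by the inference code; the values are placeholders, and a temporary directory is used so nothing in the working directory is overwritten.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder values for illustration; the real ones come from training.
config = {"max_input_len": 12, "max_target_len": 12}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.pickle"

    # Save: this mirrors what a training script could do.
    with open(path, "wb") as handle:
        pickle.dump(config, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load: the same pattern used in load_models_and_tokenizers().
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == config)
```

Note that `pickle.load` can execute arbitrary code stored in the file, so it should only be used on files from a trusted source.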

### Lemmatization Function

This function lemmatizes an input text with the trained encoder-decoder models: the encoder turns the input sequence into state vectors, and the decoder generates the lemma token by token from those states.

#### Code Steps and Explanation

1. **Tokenizing and Padding Input Text**:
   - `input_seq = input_tokenizer.texts_to_sequences([input_text])`: Converts the input text into a sequence of integer tokens.
   - `input_seq = pad_sequences(input_seq, maxlen=config['max_input_len'], padding='post')`: Pads the token sequence to the maximum input length defined in the configuration.

2. **Encoding the Input**:
   - `states_value = encoder_model.predict(input_seq)`: Encodes the input sequence and retrieves the internal states from the encoder model.

3. **Initializing the Target Sequence**:
   - `target_seq = np.zeros((1, 1))`: Creates a target sequence of shape `(1, 1)`, that is, one sequence holding a single token.
   - `target_seq[0, 0] = 2`: Sets that token to the start token, which is hard-coded here as index 2. This assumes the start token has index 2 in the target tokenizer's vocabulary.

4. **Decoding the Output**:
   - A loop is used to generate the output sequence token by token:
     - `output_tokens, h, c = decoder_model.predict([target_seq] + states_value)`: Predicts the next token and updates the states.
     - `sampled_token_index = np.argmax(output_tokens[0, -1, :])`: Finds the index of the most probable token.
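     - `sampled_char = target_tokenizer.index_word.get(sampled_token_index, '')`: Maps the index back to its token (a character), or to an empty string if the index has no entry in `index_word`.
     - `decoded_sentence += sampled_char`: Appends the predicted token to the decoded sentence.
     - The loop stops once `sampled_char` is an empty string or `decoded_sentence` is longer than `config['max_target_len']`. The empty-string case occurs, for example, when the decoder predicts index 0, which the Keras `Tokenizer` reserves for padding and never assigns to a token.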

5. **Updating Target Sequence and States** (still inside the loop):
   - `target_seq = np.zeros((1, 1))`: Prepares the target sequence for the next iteration.
   - `target_seq[0, 0] = sampled_token_index`: Sets the target sequence to the sampled token index.
   - `states_value = [h, c]`: Updates the states for the next prediction.

6. **Returning the Decoded Sentence**:
   - `return decoded_sentence`: Returns the decoded lemma after the loop completes.
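
To make the padding and index-lookup steps concrete without loading TensorFlow, here is a small NumPy sketch. `pad_post` is a hypothetical stand-in that imitates `pad_sequences(..., padding='post')` for a single sequence, including the Keras default of truncating from the front, and `index_word` is a made-up vocabulary rather than the notebook's tokenizer.

```python
import numpy as np


def pad_post(seq, maxlen, value=0):
    """Right-pad one integer sequence to `maxlen`, like padding='post'.

    Sequences longer than `maxlen` keep their last `maxlen` items, which
    mirrors the default truncating='pre' of Keras `pad_sequences`.
    """
    seq = list(seq)[-maxlen:]
    padded = np.full((1, maxlen), value, dtype="int32")
    padded[0, :len(seq)] = seq
    return padded


# Hypothetical vocabulary: indices start at 1, index 0 is never assigned.
index_word = {1: "a", 2: "b"}

print(pad_post([3, 1, 2], maxlen=5))   # [[3 1 2 0 0]]
print(repr(index_word.get(0, "")))     # ''
```

The second line of output shows why a predicted index of 0 yields an empty string in the lookup step.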

In [4]:
def lemmatize(input_text, encoder_model, decoder_model, input_tokenizer, target_tokenizer, config):
    input_seq = input_tokenizer.texts_to_sequences([input_text])
    input_seq = pad_sequences(input_seq, maxlen=config['max_input_len'], padding='post')

    # Encode the input
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1
    target_seq = np.zeros((1, 1))
    # Populate the first character of target sequence with the start character
    target_seq[0, 0] = 2  # start token

    # Sampling loop for a single sequence (batch size 1)
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = target_tokenizer.index_word.get(sampled_token_index, '')
        decoded_sentence += sampled_char

        # Exit condition: either exceed the max length or sample a token
        # with no entry in index_word (for example the padding index 0)
        if sampled_char == '' or len(decoded_sentence) > config['max_target_len']:
            stop_condition = True

        # Update the target sequence (of length 1)
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence
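
The control flow of this loop can be tried out without the trained models. The sketch below is a toy version under stated assumptions: `stub_decoder` is a hypothetical stand-in for `decoder_model.predict` that follows a fixed script, the vocabulary is invented, and LSTM states are left out. It is also a slight variation on the function above, since it caps the output at exactly `max_target_len` tokens by using a bounded `for` loop.

```python
import numpy as np

# Hypothetical target vocabulary; index 0 (padding) has no entry.
index_word = {1: "a", 2: "<start>", 3: "b"}
VOCAB_SIZE = 4


def stub_decoder(token_index, step):
    """Stand-in for decoder_model.predict: scores of shape (1, 1, vocab)."""
    script = [1, 3, 0]  # emit "a", then "b", then the padding index
    scores = np.zeros((1, 1, VOCAB_SIZE))
    scores[0, 0, script[min(step, len(script) - 1)]] = 1.0
    return scores


def greedy_decode(max_target_len, start_index=2):
    """Greedy decoding: pick the argmax token at every step."""
    decoded = ""
    token_index = start_index
    for step in range(max_target_len):  # hard upper bound on output length
        scores = stub_decoder(token_index, step)
        token_index = int(np.argmax(scores[0, -1, :]))
        char = index_word.get(token_index, "")
        if char == "":  # index without a token: treat as end of sequence
            break
        decoded += char
    return decoded


print(greedy_decode(max_target_len=10))
```

With the scripted decoder the loop ends on the padding index; with a smaller `max_target_len` it ends on the length bound instead.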

### Testing the Lemmatizer

The following code demonstrates how to load the trained models and tokenizers, and use the lemmatization function to predict lemmas for a list of test words.

#### Steps and Explanation

1. **Load Models and Tokenizers**:
   - `encoder_model, decoder_model, input_tokenizer, target_tokenizer, config = load_models_and_tokenizers()`: Loads the pre-trained encoder and decoder models, as well as the input and target tokenizers, and the configuration dictionary.

2. **Test the Lemmatizer**:
   - A list of test words (`test_words = ['ڪرڻ', 'اڪثر', 'کائيندو']`) is defined for testing the lemmatizer.
   - For each word in the test list:
     - `lemma = lemmatize(word, encoder_model, decoder_model, input_tokenizer, target_tokenizer, config)`: The `lemmatize` function is called to predict the lemma of the word.
     - `print(f"Word: {word}, Lemma: {lemma}")`: The original word and its predicted lemma are printed.

In [5]:
encoder_model, decoder_model, input_tokenizer, target_tokenizer, config = load_models_and_tokenizers()
# Test the lemmatizer
test_words = ['ڪرڻ', 'اڪثر', 'کائيندو']
for word in test_words:
    lemma = lemmatize(word, encoder_model, decoder_model, input_tokenizer, target_tokenizer, config)
    print(f"Word: {word}, Lemma: {lemma}")



Word: ڪرڻ, Lemma: ڪر
Word: اڪثر, Lemma: اڪثر
Word: کائيندو, Lemma: ناه
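
A spot check on three words cannot show how often the model is right. One possible next step, sketched here under assumptions, is to score predictions against a list of gold lemmas. The helper `exact_match_accuracy`, the toy lemmatizer, and the English placeholder pairs are all hypothetical and only illustrate the bookkeeping; they are not results for the Sindhi model.

```python
def exact_match_accuracy(pairs, lemmatize_fn):
    """Fraction of (word, gold_lemma) pairs where lemmatize_fn(word) == gold_lemma."""
    if not pairs:
        return 0.0
    hits = sum(1 for word, gold in pairs if lemmatize_fn(word) == gold)
    return hits / len(pairs)


# Placeholder data and a toy lemmatizer, both hypothetical.
toy_pairs = [("walked", "walk"), ("walks", "walk"), ("ran", "run")]


def toy_lemmatizer(word):
    # Keeps the first four characters; only here to exercise the metric.
    return word[:4]


accuracy = exact_match_accuracy(toy_pairs, toy_lemmatizer)
print(f"Exact-match accuracy: {accuracy:.2f}")
```

With the notebook's objects, `lemmatize_fn` could be `lambda w: lemmatize(w, encoder_model, decoder_model, input_tokenizer, target_tokenizer, config)`, applied to held-out word and lemma pairs.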
