# Sindhi Language Lemmatizer using Sequence-to-Sequence Model

## Introduction

This notebook demonstrates the process of building and training a lemmatizer for the Sindhi language using a sequence-to-sequence (seq2seq) model with LSTM layers. Lemmatization, which reduces words to their base or dictionary form, is a common preprocessing step in many natural language processing tasks.

## Objective

Our goal is to create a model that can accurately lemmatize Sindhi words, handling the complexities and nuances of the language's morphology.

## Approach

We'll use the following steps:

1. Data Preparation: Load and preprocess the Sindhi POS dataset.
2. Model Architecture: Implement a seq2seq model with an encoder-decoder structure.
3. Training: Train the model on our prepared dataset.

## Dataset

We're using the SindhiPosDataset.csv, which contains Sindhi words along with their lemmas and other linguistic information.

## Libraries Used

- pandas: For data manipulation and analysis
- numpy: For numerical operations
- sklearn: For train-test split
- tensorflow.keras: For building and training the neural network model
- matplotlib: For visualizing the training process

## Note

This lemmatizer is designed specifically for the Sindhi language and may not generalize well to other languages without modifications. The performance of the model depends on the quality and quantity of the training data.

Let's begin by importing the necessary libraries and loading our dataset!

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
import pickle
import matplotlib.pyplot as plt

## Data Loading and Preprocessing

The `load_and_preprocess_data` function handles the crucial steps of preparing our Sindhi language data for the lemmatization model. Here's a breakdown of its operations:

1. **Data Loading**:
   - Reads the CSV file containing Sindhi words and their lemmas.

2. **Train-Test Split**:
   - Splits the data into training (80%) and testing (20%) sets using sklearn's `train_test_split`.

3. **Tokenization**:
   - Creates two separate tokenizers:
     - `input_tokenizer` for the word forms ('FORM' column)
     - `target_tokenizer` for the lemmas ('LEMMA' column)
   - Both tokenizers operate at the character level, which is crucial for handling the morphological complexity of Sindhi.

4. **Sequence Conversion**:
   - Converts the words and lemmas into numerical sequences using the fitted tokenizers.

5. **Sequence Length Determination**:
   - Calculates the maximum length for input (word) and target (lemma) sequences across both the training and testing sets.

6. **Padding**:
   - Pads all sequences to the maximum length to ensure uniform input to the neural network.
   - Uses post-padding (adds zeros at the end of sequences).

7. **Return Values**:
   - Returns preprocessed training and testing data.
   - Also returns the tokenizers and maximum lengths for later use in inference.

This function encapsulates the entire data preparation pipeline, ensuring our Sindhi language data is appropriately formatted for training our seq2seq lemmatization model.
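
As a rough illustration of steps 3 to 6, the following plain-Python sketch mimics character-level indexing and post-padding without Keras. The helper names (`build_char_index`, `encode`, `pad_post`) and the toy word list are assumptions made for this example and are not part of the notebook; the real `Tokenizer` and `pad_sequences` handle more cases (lowercasing, truncation, array dtypes).

```python
from collections import Counter


def build_char_index(texts):
    """Map each character to an integer, most frequent first, starting at 1.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter()
    for text in texts:
        counts.update(text)
    # most_common() orders by count and keeps first-seen order for ties
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_index):
    """Convert a string to a list of indices, skipping unseen characters."""
    return [char_index[ch] for ch in text if ch in char_index]


def pad_post(sequences, maxlen, value=0):
    """Right-pad every sequence with `value` up to `maxlen`."""
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


# Toy word forms (hypothetical, for illustration only)
forms = ["abca", "ab", "cab"]
char_index = build_char_index(forms)
sequences = [encode(word, char_index) for word in forms]
max_len = max(len(seq) for seq in sequences)
padded = pad_post(sequences, max_len)

print(char_index)  # {'a': 1, 'b': 2, 'c': 3}
print(padded)      # [[1, 2, 3, 1], [1, 2, 0, 0], [3, 1, 2, 0]]
```

Because index 0 is reserved for padding, the vocabulary size passed to an embedding layer is the number of distinct characters plus one.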

In [4]:
def load_and_preprocess_data(file_path):
    data = pd.read_csv(file_path)
    train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

    input_tokenizer = Tokenizer(char_level=True)
    input_tokenizer.fit_on_texts(train_df['FORM'])
    target_tokenizer = Tokenizer(char_level=True)
    target_tokenizer.fit_on_texts(train_df['LEMMA'])

    input_train = input_tokenizer.texts_to_sequences(train_df['FORM'])
    target_train = target_tokenizer.texts_to_sequences(train_df['LEMMA'])
    input_test = input_tokenizer.texts_to_sequences(test_df['FORM'])
    target_test = target_tokenizer.texts_to_sequences(test_df['LEMMA'])

    max_input_len = max(len(seq) for seq in input_train + input_test)
    max_target_len = max(len(seq) for seq in target_train + target_test)

    input_train = pad_sequences(input_train, maxlen=max_input_len, padding='post')
    target_train = pad_sequences(target_train, maxlen=max_target_len, padding='post')
    input_test = pad_sequences(input_test, maxlen=max_input_len, padding='post')
    target_test = pad_sequences(target_test, maxlen=max_target_len, padding='post')

    return (input_train, target_train, input_test, target_test,
            input_tokenizer, target_tokenizer, max_input_len, max_target_len)


## Model Architecture: Sequence-to-Sequence (Seq2Seq) for Sindhi Lemmatization

The `create_model` function constructs our seq2seq model for Sindhi lemmatization. This architecture is particularly suited for tasks where both input and output are sequences, like our word-to-lemma conversion.

### Model Components:

1. **Encoder**:
   - Input: Accepts variable-length sequences of word characters.
   - Embedding Layer: Converts character indices to dense vectors of fixed size.
   - LSTM Layer: Processes the embedded sequence, capturing contextual information.
   - Output: Produces the LSTM's final hidden state and cell state, two fixed-size vectors that together serve as the context passed to the decoder.

2. **Decoder**:
   - Input: Accepts variable-length sequences of lemma characters.
   - Embedding Layer: Similar to the encoder's embedding.
   - LSTM Layer: Generates the lemma sequence, initialized with the encoder's final state.
   - Dense Layer: Produces probability distribution over possible output characters.

### Key Features:

- **Shared Embedding Dimension**: Both encoder and decoder use the same embedding dimension for consistency.
- **LSTM Units**: Determine the size of the LSTM layers in both encoder and decoder.
- **State Transfer**: The encoder's final state initializes the decoder's LSTM, passing context.
- **Softmax Activation**: Used in the final dense layer for character-level prediction.

### Model Compilation:

- **Optimizer**: RMSprop, effective for recurrent neural networks.
- **Loss Function**: Sparse Categorical Crossentropy, suitable here because the targets are integer character indices rather than one-hot vectors.

This architecture allows the model to learn the complex mapping between Sindhi words and their lemmas, handling variable-length inputs and outputs effectively.
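
To give a feel for the size of this architecture, the sketch below counts trainable parameters per layer using the standard formulas for embedding, LSTM (with bias), and dense layers. The vocabulary sizes 50 and 45 are hypothetical placeholders, since the real values depend on the dataset; 128 and 256 are the embedding dimension and LSTM units this notebook sets further down.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary index
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim


def seq2seq_params(input_vocab_size, target_vocab_size, embedding_dim, lstm_units):
    counts = {
        "encoder_embedding": embedding_params(input_vocab_size, embedding_dim),
        "encoder_lstm": lstm_params(embedding_dim, lstm_units),
        "decoder_embedding": embedding_params(target_vocab_size, embedding_dim),
        "decoder_lstm": lstm_params(embedding_dim, lstm_units),
        "decoder_dense": dense_params(lstm_units, target_vocab_size),
    }
    counts["total"] = sum(counts.values())
    return counts


# Vocabulary sizes are assumed values for illustration
counts = seq2seq_params(input_vocab_size=50, target_vocab_size=45,
                        embedding_dim=128, lstm_units=256)
for name, value in counts.items():
    print(f"{name}: {value}")
```

With these assumed sizes, the two LSTM layers account for nearly all of the parameters, so `lstm_units` is the setting that most affects model size.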

In [5]:
def create_model(input_vocab_size, target_vocab_size, embedding_dim, lstm_units):
    encoder_inputs = Input(shape=(None,))
    encoder_embedding = Embedding(input_vocab_size, embedding_dim)(encoder_inputs)
    encoder_lstm = LSTM(lstm_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
    encoder_states = [state_h, state_c]

    decoder_inputs = Input(shape=(None,))
    decoder_embedding = Embedding(target_vocab_size, embedding_dim)
    decoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True)
    decoder_dense = Dense(target_vocab_size, activation='softmax')

    decoder_embedded = decoder_embedding(decoder_inputs)
    decoder_outputs, _, _ = decoder_lstm(decoder_embedded, initial_state=encoder_states)
    decoder_outputs = decoder_dense(decoder_outputs)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

    return model, encoder_inputs, encoder_states, decoder_inputs, decoder_embedding, decoder_lstm, decoder_dense

## Training the Sindhi Lemmatization Model

The `train_model` function is responsible for training our sequence-to-sequence model on the Sindhi lemmatization task. Here's a breakdown of its operation:

### Preparation of Decoder Input

1. **Decoder Input Creation**:
   - Creates a copy of the target training data, shifted by one time step.
   - This is a crucial step in sequence-to-sequence learning, implementing teacher forcing.

2. **Start Token**:
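   - Inserts a start token (value 2) at the beginning of each decoder input sequence.
   - This signals the beginning of decoding for each lemma.
   - Note that the Keras `Tokenizer` numbers characters from 1 in order of frequency, so index 2 is also the second most frequent lemma character. Adding a dedicated start symbol to the lemmas before tokenization would avoid this overlap.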
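
The shift described above can be sketched with plain lists. The function name `make_decoder_input` and the index values are hypothetical; only the start-token value 2 is taken from the notebook.

```python
def make_decoder_input(target_sequences, start_token):
    """Shift each padded target right by one step and prepend `start_token`."""
    return [[start_token] + seq[:-1] for seq in target_sequences]


# Two padded target sequences (made-up indices; 0 is padding)
targets = [[5, 7, 3, 0], [4, 6, 0, 0]]
decoder_inputs = make_decoder_input(targets, start_token=2)

for dec_in, target in zip(decoder_inputs, targets):
    print(dec_in, "->", target)
# [2, 5, 7, 3] -> [5, 7, 3, 0]
# [2, 4, 6, 0] -> [4, 6, 0, 0]
```

At each time step the decoder receives the true character from the previous step and is trained to predict the character at the current step, which is what teacher forcing means in practice.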

### Model Training

1. **Input Data**:
   - `input_train`: The encoded Sindhi words (encoder input).
   - `decoder_input_data`: The prepared decoder input sequences.
   - `target_train`: The true lemma sequences (decoder target).

2. **Training Parameters**:
   - `batch_size`: Number of samples per gradient update.
   - `epochs`: Number of times the model will cycle through the entire dataset.

3. **Validation Split**:
   - Reserves 20% of the training data for validation (Keras takes the last 20% of the samples passed to `fit`).
   - Helps monitor the model's performance on unseen data during training.

4. **Model Fitting**:
   - Uses Keras' `fit` method to train the model.
   - Returns a `history` object containing training metrics.
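
To make the validation split concrete, here is a small sketch of how the held-out portion is chosen. According to the Keras documentation, `validation_split` takes the last fraction of the samples before any shuffling; the exact rounding used below (`floor`) and the sample count of 1000 are assumptions for this example.

```python
import math


def validation_split_indices(n_samples, validation_split):
    """Return (train_range, val_range) with the tail of the data held out."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return range(0, split_at), range(split_at, n_samples)


# Hypothetical number of training samples
train_idx, val_idx = validation_split_indices(n_samples=1000, validation_split=0.2)
print(len(train_idx), len(val_idx))  # 800 200
print(val_idx[0])                    # 800
```

Since `train_test_split` has already shuffled the rows, the tail used for validation here is effectively a random sample of the training set.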

### Return Value

- The function returns the `history` object, which can be used to plot training and validation loss curves.


In [6]:
def train_model(model, input_train, target_train, batch_size, epochs):
    # Teacher forcing: the decoder input is the target shifted right by one step
    decoder_input_data = np.zeros_like(target_train)
    decoder_input_data[:, 1:] = target_train[:, :-1]
    decoder_input_data[:, 0] = 2  # start token

    # validation_split=0.2 holds out the last 20% of the samples for validation
    history = model.fit([input_train, decoder_input_data], target_train,
                        batch_size=batch_size, epochs=epochs, validation_split=0.2)
    return history

## Saving the Sindhi Lemmatization Model and Associated Data

The `save_model_and_tokenizers` function is crucial for preserving the trained model and its associated components. This allows for later reuse without needing to retrain the model. Here's a breakdown of what this function does:

### Model Saving

1. **Full Model**:
   - Saves the complete seq2seq model as 'full_model.h5'.
   - This includes both the encoder and decoder parts of the model.

2. **Encoder Model**:
   - Saves the encoder part separately as 'encoder_model.h5'.
   - Used for encoding input words during inference.

3. **Decoder Model**:
   - Saves the decoder part separately as 'decoder_model.h5'.
   - Used for generating lemmas during inference.

### Tokenizer Saving

1. **Input Tokenizer**:
   - Saves the tokenizer used for encoding Sindhi words.
   - Stored as 'input_tokenizer.pickle'.

2. **Target Tokenizer**:
   - Saves the tokenizer used for encoding Sindhi lemmas.
   - Stored as 'target_tokenizer.pickle'.

### Configuration Saving

- Saves important configuration parameters:
  - `max_input_len`: Maximum length of input sequences.
  - `max_target_len`: Maximum length of target sequences.
- Stored as 'config.pickle'.

### Technical Details

- Uses the HDF5 format (.h5) for saving Keras models.
- Utilizes Python's `pickle` module for serializing tokenizers and configuration.
- Employs the highest pickle protocol for efficient storage.

This comprehensive saving process ensures that all necessary components are preserved, allowing for easy model deployment and inference on new Sindhi words without requiring access to the original training data or retraining the model.
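
For completeness, a minimal sketch of the matching load step for the pickled configuration is shown below. The helper names `save_config` and `load_config`, the temporary directory, and the lengths 12 and 10 are assumptions for illustration; the notebook itself only defines the saving side.

```python
import os
import pickle
import tempfile


def save_config(path, max_input_len, max_target_len):
    """Pickle the sequence-length settings needed at inference time."""
    with open(path, "wb") as handle:
        pickle.dump({"max_input_len": max_input_len,
                     "max_target_len": max_target_len},
                    handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_config(path):
    """Read the settings back as a dictionary."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory so no files are left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.pickle")
    save_config(config_path, max_input_len=12, max_target_len=10)  # made-up lengths
    restored = load_config(config_path)

print(restored)  # {'max_input_len': 12, 'max_target_len': 10}
```

The tokenizer pickles can be read back with the same `pickle.load` pattern, and the `.h5` files with `load_model`, which the notebook already imports.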

In [7]:
def save_model_and_tokenizers(model, encoder_model, decoder_model, input_tokenizer, target_tokenizer, max_input_len, max_target_len):
    model.save('full_model.h5')
    encoder_model.save('encoder_model.h5')
    decoder_model.save('decoder_model.h5')

    with open('input_tokenizer.pickle', 'wb') as handle:
        pickle.dump(input_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    with open('target_tokenizer.pickle', 'wb') as handle:
        pickle.dump(target_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    with open('config.pickle', 'wb') as handle:
        pickle.dump({
            'max_input_len': max_input_len,
            'max_target_len': max_target_len
        }, handle, protocol=pickle.HIGHEST_PROTOCOL)

### Plotting the Loss Curve

In this cell, we define a function to visualize the training and validation loss of a model over epochs. This is a crucial step to understand the model's performance during training and to detect overfitting or underfitting.

#### Code Explanation

- **Function Definition**:
  - `plot_loss_curve(history)`: This function takes the training history object as input and plots the loss curves for both training and validation data.

- **Plotting Process**:
  - **Figure Setup**:
    - `plt.figure(figsize=(12, 6))`: Creates a new figure with a specified size of 12 inches by 6 inches.
  - **Plotting Training Loss**:
    - `plt.plot(history.history['loss'], label='Training Loss')`: Plots the training loss over epochs.
  - **Plotting Validation Loss**:
    - `plt.plot(history.history['val_loss'], label='Validation Loss')`: Plots the validation loss over epochs.
  - **Title and Labels**:
    - `plt.title('Model Loss Over Epochs')`: Sets the title of the plot.
    - `plt.xlabel('Epoch')`: Sets the label for the x-axis.
    - `plt.ylabel('Loss')`: Sets the label for the y-axis.
  - **Legend**:
    - `plt.legend()`: Adds a legend to the plot to distinguish between training and validation loss curves.
  - **Saving the Plot**:
    - `plt.savefig('loss_curve.png')`: Saves the plot as a PNG file named 'loss_curve.png'.
  - **Close the Plot**:
    - `plt.close()`: Closes the plot to free up memory.
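
Besides reading the plot by eye, the same `history.history` dictionary can be inspected numerically. The sketch below finds the epoch with the lowest validation loss and counts how many epochs have passed since then; the loss values are invented for illustration and the helper names are hypothetical.

```python
def best_epoch(val_losses):
    """Return (1-based epoch, loss) where the validation loss is lowest."""
    index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return index + 1, val_losses[index]


def epochs_since_improvement(val_losses):
    """Number of epochs trained after the best validation epoch."""
    epoch, _ = best_epoch(val_losses)
    return len(val_losses) - epoch


# Stand-in for history.history (made-up numbers)
history_dict = {
    "loss": [1.20, 0.80, 0.55, 0.40, 0.30, 0.22],
    "val_loss": [1.25, 0.90, 0.70, 0.66, 0.69, 0.75],
}

epoch, loss = best_epoch(history_dict["val_loss"])
stale = epochs_since_improvement(history_dict["val_loss"])
print(epoch, loss)  # 4 0.66
print(stale)        # 2
```

A training loss that keeps falling while the validation loss has not improved for several epochs is the usual sign of overfitting.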

In [8]:
def plot_loss_curve(history):
    plt.figure(figsize=(12, 6))
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Model Loss Over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.savefig('loss_curve.png')
    plt.close()

### Training and Saving the Lemmatization Model

In this series of steps, we load and preprocess the Sindhi POS dataset, create and train the lemmatization model, and finally save the trained model along with the tokenizers. Here's a breakdown of the code:

#### Code Steps and Explanation

1. **File Path Initialization**:
   - `file_path = "/content/SindhiPosDataset.csv"`: Sets the path to the Sindhi POS tagging dataset.

2. **Data Loading and Preprocessing**:
   - `load_and_preprocess_data(file_path)`: Loads and preprocesses the dataset, returning the necessary inputs and targets for training and testing, as well as tokenizers for input and target sequences, and the maximum sequence lengths.

3. **Vocabulary Sizes and Model Parameters**:
   - `input_vocab_size = len(input_tokenizer.word_index) + 1`: Computes the input vocabulary size.
   - `target_vocab_size = len(target_tokenizer.word_index) + 1`: Computes the target vocabulary size.
   - `embedding_dim = 128`: Sets the embedding dimension for the model.
   - `lstm_units = 256`: Sets the number of LSTM units in the model.

4. **Model Creation**:
   - `create_model(input_vocab_size, target_vocab_size, embedding_dim, lstm_units)`: Creates the encoder-decoder model for lemmatization.

5. **Model Training**:
   - `train_model(model, input_train, target_train, batch_size=64, epochs=100)`: Trains the model with a batch size of 64 for 100 epochs, returning the training history.

6. **Loss Curve Plotting**:
   - `plot_loss_curve(history)`: Plots the loss curve for the training process to visualize the model's performance over epochs.

7. **Creating Inference Models**:
   - **Encoder Model**:
     - `encoder_model = Model(encoder_inputs, encoder_states)`: Creates the encoder model for inference.
   - **Decoder Model**:
     - `decoder_state_input_h = Input(shape=(lstm_units,))`: Defines the input for the decoder's hidden state.
     - `decoder_state_input_c = Input(shape=(lstm_units,))`: Defines the input for the decoder's cell state.
     - `decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]`: Groups the decoder state inputs.
     - `decoder_embedded = decoder_embedding(decoder_inputs)`: Embeds the decoder inputs.
     - `decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedded, initial_state=decoder_states_inputs)`: Passes the embedded inputs through the LSTM layer with the initial state.
     - `decoder_states = [state_h, state_c]`: Groups the decoder states.
     - `decoder_outputs = decoder_dense(decoder_outputs)`: Applies the dense layer to the decoder outputs.
     - `decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)`: Creates the decoder model for inference.

8. **Saving the Models and Tokenizers**:
   - `save_model_and_tokenizers(model, encoder_model, decoder_model, input_tokenizer, target_tokenizer, max_input_len, max_target_len)`: Saves the trained model, encoder and decoder inference models, and tokenizers to disk.
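
The notebook stops at saving the inference models, so the following is only a sketch of how they might be used for greedy decoding. It is written against two hypothetical callables rather than Keras directly: `encode` would wrap something like `encoder_model.predict(...)` on a padded input word, and `decode_step` would wrap `decoder_model.predict(...)` for a single time step. Treating the padding index 0 as the stop signal is also an assumption, based on the targets being post-padded with zeros.

```python
def greedy_decode(encode, decode_step, start_token, max_target_len, pad_index=0):
    """Generate output indices one step at a time, always taking the argmax.

    encode()                  -> initial decoder state
    decode_step(token, state) -> (per-character probabilities, new state)
    """
    state = encode()
    token = start_token
    output = []
    for _ in range(max_target_len):
        probs, state = decode_step(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == pad_index:  # assumption: padding marks the end of the lemma
            break
        output.append(token)
    return output


# Stand-in "model" for demonstration: after 2 emit 3, after 3 emit 1, then stop
script = {2: 3, 3: 1, 1: 0}


def fake_encode():
    return None  # a real encoder would return [state_h, state_c]


def fake_step(token, state):
    probs = [0.0] * 4
    probs[script[token]] = 1.0
    return probs, state


decoded = greedy_decode(fake_encode, fake_step, start_token=2, max_target_len=5)
print(decoded)  # [3, 1]
```

With the real models, the resulting indices could be mapped back to characters through the target tokenizer's `index_word` dictionary and joined into the predicted lemma.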


In [10]:
file_path = "/content/SindhiPosDataset.csv"
(input_train, target_train, input_test, target_test,
input_tokenizer, target_tokenizer, max_input_len, max_target_len) = load_and_preprocess_data(file_path)

input_vocab_size = len(input_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1
embedding_dim = 128
lstm_units = 256

model, encoder_inputs, encoder_states, decoder_inputs, decoder_embedding, decoder_lstm, decoder_dense = create_model(
        input_vocab_size, target_vocab_size, embedding_dim, lstm_units)

history = train_model(model, input_train, target_train, batch_size=64, epochs=100)
plot_loss_curve(history)

# Create inference models
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(lstm_units,))
decoder_state_input_c = Input(shape=(lstm_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_embedded = decoder_embedding(decoder_inputs)
decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedded, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

save_model_and_tokenizers(model, encoder_model, decoder_model, input_tokenizer, target_tokenizer, max_input_len, max_target_len)

[Training output: progress lines for epochs 1 through 78 of 100. Per-epoch loss values were not captured, and the log is cut off at epoch 78.]
