## Step-by-Step Explanation of the Text Recognition Model with CTC Loss:

This model tackles text recognition from images by combining CNNs for feature extraction and Bidirectional LSTMs for sequence modeling. Here's a breakdown focusing on the data flow and CTC loss:

**1. Input:**

* The model takes a single-channel (grayscale) image as input. The input shape is built from `img_w` (image width, 350) and `img_h` (image height, 32), giving `(img_w, img_h, 1)` for `channels_last` data. Width comes first, so the width axis later becomes the time axis of the sequence.

**2. Feature Extraction with CNNs (VGG-like architecture):**

* The image goes through several convolutional layers with ReLU activation and batch normalization. 
* These layers extract features from the image, like edges, shapes, and textures, that are relevant for recognizing text characters.
* Max pooling layers are used for downsampling and reducing the dimensionality while retaining important information.
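The effect of the pooling layers on the feature-map shape can be traced by hand. The sketch below is plain Python, not Keras; the `stages` table and the `trace_shapes` helper are illustrative names, and the pool sizes and channel counts are copied from `get_Model` in the code cell further down. It assumes the `(width, height, channels)` layout and that `padding='same'` convolutions leave width and height unchanged.

```python
# Trace the feature-map shape through the pooling layers of the CNN.
# Layout is (width, height, channels), matching input_shape = (img_w, img_h, 1).
img_w, img_h = 350, 32

# (layer name, pool size along width, pool size along height, output channels)
stages = [
    ("max1", 2, 2, 64),
    ("max2", 2, 2, 128),
    ("max3", 1, 2, 256),
    ("max4", 1, 2, 512),
]

def trace_shapes(width, height):
    shapes = []
    for name, pool_w, pool_h, channels in stages:
        # Pooling with stride equal to the pool size floors the division.
        width, height = width // pool_w, height // pool_h
        shapes.append((name, (width, height, channels)))
    return shapes

shapes = trace_shapes(img_w, img_h)
for name, shape in shapes:
    print(name, shape)
```

The last entry is `(87, 2, 512)`, which is why the reshape step that follows can produce 87 time steps of 1024 features each.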

**3. Reshape for RNN:**

* The CNN output is reshaped into a sequence for the RNN. The width axis is kept as the time axis, and the height and channel axes are flattened into one feature vector per time step. In this model, the `(87, 2, 512)` feature map becomes `(87, 1024)`.
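As a rough illustration of that reshape, the sketch below flattens a tiny `(width, height, channels)` nested list in the same row-major order. `flatten_per_time_step` is a hypothetical helper written for this example; the real model uses the Keras `Reshape` layer on tensors.

```python
def flatten_per_time_step(feature_map):
    """Turn a (width, height, channels) nested list into (width, height * channels)."""
    return [
        [value for row in column for value in row]
        for column in feature_map
    ]

# Toy feature map: 3 time steps (width), 2 rows (height), 4 channels.
width, height, channels = 3, 2, 4
feature_map = [
    [[w * 100 + h * 10 + c for c in range(channels)] for h in range(height)]
    for w in range(width)
]

sequence = flatten_per_time_step(feature_map)
print(len(sequence), len(sequence[0]))  # 3 8
```

With the model's sizes, the same operation gives 87 vectors of length 2 * 512 = 1024.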

**4. Embedding with Dense Layer:**

* A dense layer with ReLU activation projects each 1024-dimensional feature vector down to 64 dimensions, giving a more compact representation of the features at every time step. This can be viewed as an embedding of the extracted features.

**5. Bidirectional LSTMs for Sequence Modeling:**

* Two bidirectional LSTM layers are stacked. Each bidirectional layer contains two LSTMs:
    * One LSTM processes the sequence in the forward direction.
    * The other LSTM processes the same sequence in the backward direction.
* This lets every time step use context from both sides of the sequence, which helps when a character can only be identified from its neighbors.
* The first layer sums the forward and backward outputs; the second concatenates them. The code shown here applies no batch normalization after the LSTM layers.
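To make the idea of two directions concrete, here is a deliberately simplified sketch. It is not an LSTM: the "state" is just a running sum, and `run_direction` is a hypothetical helper. It only shows which inputs each direction has seen by a given time step.

```python
def run_direction(sequence, reverse=False):
    """Toy recurrent pass: the state is a running sum of the inputs seen so far."""
    order = reversed(sequence) if reverse else sequence
    state, outputs = 0, []
    for x in order:
        state += x
        outputs.append(state)
    # Re-align the backward outputs with the original time order.
    return outputs[::-1] if reverse else outputs

sequence = [1, 2, 3, 4]
forward = run_direction(sequence)                 # [1, 3, 6, 10]
backward = run_direction(sequence, reverse=True)  # [10, 9, 7, 4]
print(forward, backward)
```

At index `t`, the forward output summarizes inputs `0..t` and the backward output summarizes inputs `t..end`, so together they cover the whole sequence.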

**6. Character Activations with Dense Layer:**

* The output of the second bidirectional LSTM layer is fed into a dense layer with `num_classes` units: one per character in the character set, plus one for the CTC blank.
* This layer produces one score per class at every time step of the sequence. The scores become probabilities only after the softmax in the next step.

**7. Softmax Activation for Output Probabilities:**

* A softmax activation function is applied to the output of the dense layer. 
* Softmax maps the scores at each time step to values between 0 and 1 that sum to 1, so they can be read as the probability of each class (characters plus blank) at that time step.
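For reference, softmax over the scores of a single time step can be sketched in plain Python as follows. The three input scores are made up for the example.

```python
import math

def softmax(scores):
    # Subtracting the maximum avoids overflow and does not change the result.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The ordering of the scores is preserved, and the outputs sum to 1 up to floating-point rounding.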

**8. CTC Loss with Lambda Layer (Training Only):**

* This step is only relevant during training.
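* A `Lambda` layer is used because a standard Keras loss function receives only `y_true` and `y_pred`, while CTC also needs the two sequence lengths. Computing the loss inside the graph lets all four tensors be passed in.
* The `Lambda` layer calls `ctc_loss_function` (the notebook also defines an identical, unused `ctc_lambda_func`), which takes the following arguments:
    * Predicted outputs (`y_pred`): The per-time-step class probabilities from the softmax layer.
    * Labels (`labels` / `y_true`): The ground truth text as a sequence of integer class indices, padded to `max_text_len`. `K.ctc_batch_cost` expects indices, not one-hot vectors.
    * Input sequence length (`input_length`): The number of time steps of `y_pred` that the loss should use for each sample.
    * Label length (`label_length`): The number of characters in each ground truth text, before padding.
* The function drops the first two time steps of `y_pred`. According to the code comments, these early RNN outputs tend to be unreliable, and including them gave much lower accuracy.
* It then computes the CTC (Connectionist Temporal Classification) loss between the predicted probabilities and the ground truth labels, taking the sequence lengths into account.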
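A minimal sketch of how a text label could be turned into the padded index sequence and `label_length` described above. `encode_label` is a hypothetical helper, the three-letter alphabet stands in for the notebook's `CHAR_VECTOR`, and padding with 0 is an assumption: the padding value should not matter as long as `label_length` marks where the real labels end.

```python
letters = list("abc")   # toy alphabet; the notebook uses the Malayalam CHAR_VECTOR
max_text_len = 6        # the notebook uses 60

def encode_label(text, letters, max_len):
    """Map text to integer class indices, padded to max_len."""
    indices = [letters.index(ch) for ch in text]
    if len(indices) > max_len:
        raise ValueError("text is longer than max_len")
    padded = indices + [0] * (max_len - len(indices))
    return padded, len(indices)

labels, label_length = encode_label("cab", letters, max_text_len)
print(labels, label_length)  # [2, 0, 1, 0, 0, 0] 3
```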

**9. Model Outputs:**

* With `stage='train'`, `get_Model` returns a tuple: the image input tensor, the softmax output `y_pred`, and a model whose single output is the CTC loss computed by the `Lambda` layer. This loss is what gets backpropagated to train the model parameters.
* With any other `stage`, it returns a model that outputs the per-time-step character probabilities from the softmax layer, which can then be decoded into the most likely character sequence for a given image.
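Turning the softmax output into text needs a decoding step that the notebook section shown here does not include. One common and simple option is greedy (best-path) decoding, sketched below in plain Python. `greedy_ctc_decode` is a hypothetical helper, the probabilities are made up, and the blank is assumed to be the last class index.

```python
def greedy_ctc_decode(probabilities, letters):
    blank = len(letters)  # assumption: the blank is the last class index
    # Pick the most probable class at every time step.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probabilities]
    decoded, previous = [], None
    for index in best_path:
        # Merge consecutive repeats, then skip blanks.
        if index != previous and index != blank:
            decoded.append(letters[index])
        previous = index
    return "".join(decoded)

letters = ["a", "b"]
# Each row is one time step: [P(a), P(b), P(blank)].
probabilities = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a again, merged with the previous step
    [0.1, 0.1, 0.8],  # blank
    [0.6, 0.3, 0.1],  # a, kept because a blank separates it
    [0.2, 0.7, 0.1],  # b
]
decoded_text = greedy_ctc_decode(probabilities, letters)
print(decoded_text)  # aab
```

Since training discards the first two time steps, it is probably sensible to drop them before decoding as well.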

**Overall, this model leverages CNNs to extract features from the image, uses Bidirectional LSTMs to capture sequential information, and employs CTC loss for training, making it suitable for text recognition tasks.**

**Connectionist Temporal Classification (CTC)**

CTC is a loss function for training neural networks, particularly recurrent networks such as LSTMs, on sequence problems where the alignment between input and output is unknown. Here's a breakdown of CTC and its role in the model above:

**Challenges in Sequence Labeling:**

* Traditional classification approaches assume a fixed-size output for each input. In sequence labeling tasks like text recognition, the output sequence (text) can have a variable length compared to the input image.
* Additionally, there might be inconsistencies between the timing of features in the input sequence and the corresponding labels. For example, a single spoken phoneme might be stretched over multiple time steps in audio data.

**What CTC Does:**

* CTC addresses these challenges by allowing the model to learn the alignment between the input sequence and the output labels during training. 
* It doesn't require a strict one-to-one correspondence between input features and output labels.
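* CTC adds a special "blank" class to the character set. A blank means "no character at this time step". To read the text off a predicted path, consecutive repeats are merged and blanks are then removed, so a blank is required between two identical neighboring characters. This lets the model handle characters of varying width and outputs of varying length.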
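The role of the blank is easiest to see by enumerating paths. The sketch below uses `-` as the blank and brute-forces every path of a given length; `collapse` and `alignments` are illustrative helpers, only practical for tiny examples.

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """Merge consecutive repeats, then drop blanks."""
    out, previous = [], None
    for symbol in path:
        if symbol != previous and symbol != BLANK:
            out.append(symbol)
        previous = symbol
    return "".join(out)

def alignments(target, alphabet, steps):
    """All paths of length `steps` over alphabet + blank that collapse to target."""
    symbols = list(alphabet) + [BLANK]
    return ["".join(p) for p in product(symbols, repeat=steps) if collapse(p) == target]

print(alignments("ab", "ab", 3))  # ['aab', 'abb', 'ab-', 'a-b', '-ab']
print(alignments("aa", "a", 3))   # ['a-a']
print(alignments("aa", "a", 2))   # []
```

The text "aa" needs at least three time steps, because without a blank in between the two characters would be merged into one.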

**How CTC Works:**

1. **Probabilities for Each Time Step:** The model outputs a probability distribution for each character (including a blank label) at every time step in the input sequence.
2. **Alignment Paths:** CTC considers all possible alignments between the predicted sequence (with blanks) and the ground truth labels. These alignments define how the predicted characters and blanks correspond to the actual characters in the label sequence.
3. **Marginal Probability:** CTC calculates a marginal probability by summing the probabilities of all valid alignments that lead to the ground truth label sequence.
4. **Loss Calculation:** The CTC loss is computed as the negative log-likelihood of the marginal probability. Minimizing this loss during training encourages the model to generate output sequences that are more likely to align with the correct labels.
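The four steps above can be sketched in plain Python. Summing over all alignments is usually done with a dynamic-programming forward pass rather than by enumeration; the version below follows that standard recursion and checks it against brute force on a tiny example. This illustrates the idea and is not the implementation behind `K.ctc_batch_cost`; the function names and probabilities are made up.

```python
import math
from itertools import product

def ctc_loss(probs, target, blank):
    """Negative log-likelihood of `target`; probs[t][k] = P(class k at step t)."""
    # Extended target with blanks around every label: [blank, l1, blank, l2, ..., blank]
    extended = [blank]
    for label in target:
        extended += [label, blank]
    size = len(extended)

    # alpha[s]: total probability of all path prefixes that end at extended[s]
    alpha = [0.0] * size
    alpha[0] = probs[0][extended[0]]
    if size > 1:
        alpha[1] = probs[0][extended[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank in between is allowed only for two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][extended[s]]
        alpha = new_alpha

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

def brute_force_likelihood(probs, target, blank):
    """Sum the probability of every path that collapses to target (tiny inputs only)."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        collapsed, previous = [], None
        for k in path:
            if k != previous and k != blank:
                collapsed.append(k)
            previous = k
        if collapsed == list(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total

# Three time steps, classes: 0 = 'a', 1 = 'b', 2 = blank.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
target = [0, 1]
loss = ctc_loss(probs, target, blank=2)
print(loss, -math.log(brute_force_likelihood(probs, target, blank=2)))
```

Both computations give the same value; the forward pass simply avoids enumerating every path, which would grow exponentially with the number of time steps.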

**Benefits of CTC:**

* Handles variable-length sequences without requiring pre-segmentation.
* Accounts for potential inconsistencies between input features and output labels.
* Tolerates variation in character width and spacing, because no fixed per-step alignment is imposed.

**In the context of the text recognition model:**

* CTC loss lets the model learn the correct character sequence for an image even when characters span different numbers of time steps (different widths) or vary slightly in appearance.


By incorporating CTC loss, the model can effectively learn to recognize text in images despite the challenges of variable sequence lengths and potential timing misalignments.


In [None]:
#further model and parameter explanation
import keras
import random
from keras import backend as K
import warnings
warnings.filterwarnings("ignore")
from keras.layers import Input, Conv2D, MaxPool2D, Dense,MaxPooling2D
from keras.layers import AveragePooling2D, Flatten, Activation, Bidirectional
from keras.layers import BatchNormalization, Dropout
from keras.layers import Concatenate, Add, Multiply, Lambda
from keras.layers import UpSampling2D, Reshape
from keras.layers import add,concatenate
from keras.layers import Reshape
from keras.models import Model
from keras.layers import LSTM,GRU
import tensorflow as tf
from parameter import *

MAL_VECTOR = 'ംഃഅആഇഈഉഊഋഌഎഏഐഒഓഔകഖഗഘങചഛജഝഞടഠഡഢണതഥദധനഩപഫബഭമയരറലളഴവശഷസഹാിീുൂൃെേൈൊോൌ്ൎൗൺൻർൽൾ.,'

ASCII_VECTOR = '-+=!@#$%^&*(){}[]|\'"\\/?<>;:0123456789'

ENG_VECTOR = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

CHAR_VECTOR = MAL_VECTOR

letters = [letter for letter in CHAR_VECTOR] # letter array

num_classes = len(letters) + 1               # total length of output chars + CTC separation char

img_w, img_h = 350, 32

# Network parameters
batch_size = 64
val_batch_size = 16

downsample_factor = 4
max_text_len = 60
def ctc_loss_function(args):
    """
    Unpack the tensors passed in by the Lambda layer and return the CTC loss
    computed with the Keras backend function ctc_batch_cost.
    """
    y_pred, y_true, input_length, label_length = args
    # The first couple of RNN outputs tend to be garbage, so they are discarded
    # (an approach taken from other CRNN implementations). Including them gave
    # very poor results and very low prediction accuracy.
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)


# Loss and train functions, network architecture
# NOTE: ctc_lambda_func duplicates ctc_loss_function and is not called anywhere below.
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)


def get_Model(stage,drop_out_rate=0.35):
    if K.image_data_format() == 'channels_first':
        input_shape = (1, img_w, img_h)
    else:
        input_shape = (img_w, img_h, 1)

    model_input=Input(shape=input_shape,name='the_input',dtype='float32')

    # Convolution layer
    model = Conv2D(64, (3, 3), padding='same', name='conv1', kernel_initializer='he_normal')(model_input)
    model = BatchNormalization()(model)
    model = Activation('relu')(model)
    model = MaxPooling2D(pool_size=(2, 2), name='max1')(model)

    model = Conv2D(128, (3, 3), padding='same', name='conv2', kernel_initializer='he_normal')(model)
    model = BatchNormalization()(model)
    model = Activation('relu')(model)
    model = MaxPooling2D(pool_size=(2, 2), name='max2')(model)

    model = Conv2D(256, (3, 3), padding='same', name='conv3', kernel_initializer='he_normal')(model)
    model = BatchNormalization()(model)
    model = Activation('relu')(model)
    model = Conv2D(256, (3, 3), padding='same', name='conv4', kernel_initializer='he_normal')(model)
    model=Dropout(drop_out_rate)(model)
    model = BatchNormalization()(model)
    model = Activation('relu')(model)
    model = MaxPooling2D(pool_size=(1, 2), name='max3')(model)

    model = Conv2D(512, (3, 3), padding='same', name='conv5', kernel_initializer='he_normal')(model)
    model = BatchNormalization()(model)
    model = Activation('relu')(model)
    model = Conv2D(512, (3, 3), padding='same', name='conv6')(model)
    model=Dropout(drop_out_rate)(model)
    model = BatchNormalization()(model)
    model = Activation('relu')(model)
    model = MaxPooling2D(pool_size=(1, 2), name='max4')(model)

    model = Conv2D(512, (2, 2), padding='same', kernel_initializer='he_normal', name='con7')(model)
    model=Dropout(0.25)(model)
    model = BatchNormalization()(model)
    model = Activation('relu')(model)

    # CNN to RNN
    model = Reshape(target_shape=((87, 1024)), name='reshape')(model)
    model = Dense(64, activation='relu', kernel_initializer='he_normal', name='dense1')(model)

    # RNN layer
    model=Bidirectional(LSTM(256, return_sequences=True, kernel_initializer='he_normal'), merge_mode='sum')(model)
    model=Bidirectional(LSTM(256, return_sequences=True, kernel_initializer='he_normal'), merge_mode='concat')(model)

    # transforms RNN output to character activations:
    model = Dense(num_classes, kernel_initializer='he_normal',name='dense2')(model)
    y_pred = Activation('softmax', name='softmax')(model)


    labels = Input(name='the_labels', shape=[max_text_len], dtype='float32')
    input_length = Input(name='input_length', shape=[1], dtype='int64')
    label_length = Input(name='label_length', shape=[1], dtype='int64')

    #CTC loss function
    loss_out = Lambda(ctc_loss_function, output_shape=(1,),name='ctc')([y_pred, labels, input_length, label_length]) #(None, 1)

    if stage=='train':
        return model_input,y_pred,Model(inputs=[model_input, labels, input_length, label_length], outputs=loss_out)
    else:
        return Model(inputs=[model_input], outputs=y_pred)

## Line-by-Line Breakdown of the Text Recognition Code with CTC Loss:

**Imports:**

1. `import keras`: Imports the core Keras library for building neural networks.
2. `import random`: Imports the `random` module, potentially for data augmentation or shuffling (not used in the provided code).
3. `from keras import backend as K`: Imports the Keras backend, providing access to low-level functionalities.
4. `import warnings`: Imports the `warnings` module to suppress warnings (not recommended practice).
5. `warnings.filterwarnings("ignore")`: Suppresses all warnings. It's generally better to address warnings rather than ignore them.
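6. **Keras Layer Imports:** Imports Keras layers for building the model. Only some are used by `get_Model`:
    * Used: `Input`, `Conv2D`, `MaxPooling2D`, `Dense`, `Activation`, `Bidirectional`, `BatchNormalization`, `Dropout`, `Lambda`, `Reshape`
    * Imported but unused: `MaxPool2D`, `AveragePooling2D`, `Flatten`, `Concatenate`, `Add`, `Multiply`, `UpSampling2D`, `add`, `concatenate`
    * `Reshape` is imported twice; the second import is redundant.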
7. `from keras.models import Model`: Imports the `Model` class for creating a Keras model.
8. `from keras.layers import LSTM, GRU`: Imports the LSTM and GRU recurrent layers. `LSTM` is used inside the `Bidirectional` wrappers; `GRU` is not used.
9. `import tensorflow as tf`: Imports TensorFlow directly. The code shown never references `tf`, so this import is redundant here.
10. `from parameter import *`: Imports every public name from a separate file (`parameter.py`, not provided). Any names it shares with the definitions below are overwritten by them.

**Character and Text Parameters:**

11. `MAL_VECTOR`: Defines a string containing the characters used in the Malayalam language.
12. `ASCII_VECTOR`: Defines a string containing ASCII characters. (Not used in the current model).
13. `ENG_VECTOR`: Defines a string containing English characters. (Not used in the current model).
14. `CHAR_VECTOR = MAL_VECTOR`: Selects the character set to be used for recognition (currently Malayalam).
15. `letters = [letter for letter in CHAR_VECTOR]`: Creates a list of individual characters from the chosen character set.
16. `num_classes = len(letters) + 1`: Calculates the total number of output classes: one per character, plus 1 for the CTC blank.
17. `img_w, img_h = 350, 32`: Defines the image width and height for the input data.

**Network Training Parameters:**

18. `batch_size = 64`: Sets the batch size for training, which is the number of images processed in one step.
19. `val_batch_size = 16`: Sets the batch size for validation, which might be smaller than training batch size.
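20. `downsample_factor = 4`: The factor by which the CNN shrinks the image width. The two 2x2 pooling layers each halve the width, so 350 pixels become 87 time steps. `get_Model` does not read this variable; it is presumably used by the data pipeline to compute `input_length`.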
21. `max_text_len = 60`: Defines the maximum length of the ground truth text sequence (number of characters).
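These parameters are linked by some simple arithmetic, sketched below. The variable `dropped_steps` and the way `input_length` is derived are assumptions about how a data generator would use them; the notebook section shown here does not include that code.

```python
img_w = 350
downsample_factor = 4   # two 2x2 poolings halve the width twice
max_text_len = 60
dropped_steps = 2       # ctc_loss_function slices y_pred[:, 2:, :]

time_steps = img_w // downsample_factor      # matches Reshape((87, 1024))
input_length = time_steps - dropped_steps    # steps actually seen by the loss

# CTC needs at least one time step per label character
# (and more when identical characters follow each other).
assert input_length >= max_text_len
print(time_steps, input_length)  # 87 85
```

With these values a 60-character label still fits, but with little slack if the text contains many doubled characters.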

**CTC Loss Function (ctc_loss_function):**

22. `def ctc_loss_function(args):`: Defines a function named `ctc_loss_function` that takes arguments.
23. `"""CTC loss function..."""`: Docstring explaining the function's purpose (CTC loss calculation).
24. `y_pred, y_true, input_length, label_length = args`: Unpacks the arguments passed to the function:
    * `y_pred`: Predicted probabilities for characters from the model.
    * `y_true`: Ground truth labels as integer class indices, padded to `max_text_len` (not one-hot encoded).
    * `input_length`: Number of time steps of `y_pred` used by the loss for each sample.
    * `label_length`: Length of the ground truth text sequence.
25. `y_pred = y_pred[:, 2:, :]`: Removes the first two outputs of the predicted probabilities (`y_pred`) as they are often unreliable in RNNs.
26. `return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)`: Returns the CTC loss calculated using the Keras backend function `ctc_batch_cost`.


**CTC Lambda Function (ctc_lambda_func):**

27. Identical to `ctc_loss_function` above, apart from the argument name `labels` in place of `y_true`.

The code defines two functions related to CTC loss:

1. `ctc_loss_function`: Calculates the CTC loss during training. This is the function the `Lambda` layer calls.
2. `ctc_lambda_func`: A duplicate of `ctc_loss_function` that is defined but never called.

**Note about ctc_lambda_func:**

`ctc_lambda_func` is never called: the `Lambda` layer in `get_Model` wraps `ctc_loss_function` instead. It is redundant and can be removed without changing the model.


In [2]:
#get_Model explanation

The `get_Model` function in the provided code defines the architecture of the text recognition model using Convolutional Neural Networks (CNNs) for feature extraction and Bidirectional LSTMs for sequence modeling with CTC loss for training. Here's a detailed breakdown:

**Function Arguments:**

1. `stage` (str): Selects which model is built. `'train'` returns the training model with the CTC loss output; any other value (for example `'test'`) returns the prediction model.
2. `drop_out_rate` (float, optional): This argument controls the dropout rate for regularization (default 0.35).

**Model Inputs:**

* `model_input`: The input layer. It takes a `float32` image of shape `(img_w, img_h, 1)`, or `(1, img_w, img_h)` when the backend uses `channels_first`.

**CNN Feature Extraction:**

The model goes through several convolutional layers with specific configurations:

* `Conv2D` layers: Extract features from the image.
* `BatchNormalization`: Normalizes the activations after each convolutional layer for stability.
* `Activation('relu')`: Applies the ReLU activation function for non-linearity.
* `MaxPooling2D` layers: Downsample the feature maps to reduce dimensionality.

**Detailed Breakdown of CNN Layers:**

1. `conv1`: Applies 64 filters of size 3x3 with `'same'` padding and he_normal initialization.
2. `max1`: Max pooling with pool size 2x2, halving both width and height.
3. `conv2`: Applies 128 filters of size 3x3 with `'same'` padding and he_normal initialization.
4. `max2`: Max pooling with pool size 2x2.
5. `conv3`: Applies 256 filters of size 3x3 with `'same'` padding and he_normal initialization.
6. `conv4`: Applies 256 filters of size 3x3 with `'same'` padding and he_normal initialization, followed by dropout (regularization).
7. `max3`: Max pooling with pool size 1x2. Because the input is laid out as (width, height), this halves only the height and leaves the number of time steps unchanged.
8. `conv5`: Applies 512 filters of size 3x3 with `'same'` padding and he_normal initialization.
9. `conv6`: Applies 512 filters of size 3x3 with `'same'` padding, followed by dropout. No `kernel_initializer` is passed, so it uses the Keras default (Glorot uniform) rather than he_normal.
10. `max4`: Max pooling with pool size 1x2, again halving only the height.
11. `con7`: Applies 512 filters of size 2x2 with `'same'` padding and he_normal initialization, followed by dropout with a fixed rate of 0.25.

**Reshape for RNN:**

* `reshape`: Reshapes the CNN output from `(87, 2, 512)` to `(87, 1024)`: 87 time steps along the width, each with a 1024-dimensional feature vector (height times channels). The target shape is hard-coded, so it must be updated if `img_w` or `img_h` changes.

**Dense Layer for Feature Embedding:**

* `dense1`: Applies a dense layer with 64 units and ReLU activation to project the CNN features to a lower-dimensional space, creating a more compact feature representation.


**Bidirectional LSTMs for Sequence Modeling:**

1. `Bidirectional(LSTM(256, return_sequences=True, kernel_initializer='he_normal'), merge_mode='sum')`: Wraps an LSTM with 256 units so that it runs once forward and once backward over the sequence.
    * `return_sequences=True`: Returns the output at every time step rather than only the last one, which CTC requires.
    * `kernel_initializer='he_normal'`: Initializes the LSTM input weights with he_normal initialization.
    * `merge_mode='sum'`: Combines the outputs from the forward and backward LSTMs by summing them element-wise at each time step, so the output keeps 256 features.
2. `Bidirectional(LSTM(256, return_sequences=True, kernel_initializer='he_normal'), merge_mode='concat')`: Stacks a second bidirectional LSTM layer, again with 256 units per direction, on top of the first.
    * `merge_mode='concat'`: Combines the outputs from the forward and backward LSTMs by concatenating them along the feature dimension, resulting in 512 features per time step.
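The difference between the two merge modes can be sketched with plain lists. `merge` is a hypothetical helper that mimics what the `Bidirectional` wrapper does to the two per-direction outputs; the numbers are made up and use three units per direction instead of 256.

```python
def merge(forward, backward, mode):
    """Combine per-time-step outputs of the forward and backward LSTM."""
    if mode == "sum":
        return [[f + b for f, b in zip(fs, bs)] for fs, bs in zip(forward, backward)]
    if mode == "concat":
        return [fs + bs for fs, bs in zip(forward, backward)]
    raise ValueError(f"unsupported merge mode: {mode}")

# Two time steps, three units per direction.
forward = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
backward = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]

summed = merge(forward, backward, "sum")
concatenated = merge(forward, backward, "concat")
print(len(summed[0]), len(concatenated[0]))  # 3 6
```

Summing keeps the feature size equal to the number of units, while concatenating doubles it.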

**Character Activations with Dense Layer:**

* `dense2`: Applies a dense layer with `num_classes` units (characters + 1 for the CTC blank) and he_normal initialization, producing one score per class at each time step of the sequence.
* `Activation('softmax')`: Applies the softmax activation function to normalize the output probabilities, ensuring they sum to 1 and represent the probability of each character class.

**Model Outputs:**

* `y_pred`: The final output of the model, representing the predicted probabilities for each character class at each position in the sequence (after applying softmax).

**Additional Layers for Training (if stage='train'):**

* `labels`: Input layer for the ground truth text sequence, given as integer class indices padded to `max_text_len`.
* `input_length`: Input layer for the length of the input