# BiLSTM CRNN - OCR Model
This model is designed to recognize handwritten names.

# Outline
1. [Import Packages](#1)
2. [Hyperparameters](#2)
3. [Helper Functions](#3)
4. [Model Architecture](#4)
   - [Convolutional Layers (CNN)](#4.1)
   - [Recurrent Layers (BiLSTM)](#4.2)
   - [CTC Loss](#4.3)
5. [Loading the Dataset](#5)
6. [Building the Model](#6)
   - [Defining CTC loss](#6.1)
   - [Compiling the Model](#6.2)
   - [Training](#6.3)
7. [Evaluation and Predictions](#7)

<a name="1"></a>
## 1 - Import Packages

The following packages are used:
- `numpy` for scientific computation in Python
- `tensorflow` (Keras) for defining and training the model
- `sklearn` for dataset-splitting utilities
- `os` and `pandas` for data manipulation
- `cv2` for image manipulation
- `matplotlib` for plotting the training history

In [None]:
import os
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Reshape, 
                                     Bidirectional, LSTM, Dense, Lambda, Activation, 
                                     BatchNormalization, Dropout)
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.backend import ctc_batch_cost, get_value, ctc_decode

<a name="2"></a>
## 2 - Hyperparameters

In [None]:
# Define parameters
IMAGE_HEIGHT = 64
IMAGE_WIDTH = 256
CHANNELS = 1  # Grayscale images
SEQ_LENGTH = 64
MAX_LABEL_LENGTH = 34

In [None]:
ALPHABET = "!\"#&'()*+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz "
CHAR_COUNT = len(ALPHABET) + 1  # Including CTC blank token
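
As a small illustration of how this alphabet turns into class indices (a sketch, not part of the original notebook): each character's index is its position in `ALPHABET`, and the one extra class is reserved for the CTC blank. The names `char_to_index` and `BLANK_INDEX` are hypothetical, and placing the blank at the last index is an assumption based on how the TensorFlow backend's CTC functions treat the final class.

```python
ALPHABET = "!\"#&'()*+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz "
CHAR_COUNT = len(ALPHABET) + 1  # Including CTC blank token

# Hypothetical lookup table: character -> class index
char_to_index = {ch: i for i, ch in enumerate(ALPHABET)}

# Assumption: the blank token occupies the last class index
BLANK_INDEX = CHAR_COUNT - 1

print("Alphabet size:", len(ALPHABET))
print("Number of output classes:", CHAR_COUNT)
print("Indices for 'ANNA':", [char_to_index[ch] for ch in "ANNA"])
```

Because every character appears exactly once in `ALPHABET`, the dictionary lookup and `ALPHABET.find` give the same index for any character that is in the alphabet.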

In [None]:
# Paths
TRAIN_CSV_PATH = 'data/written_name_train.csv'
VALID_CSV_PATH = 'data/written_name_valid.csv'
TRAIN_IMAGE_PATH = 'data/train/train/'
VALID_IMAGE_PATH = 'data/valid/valid/'

<a name="3"></a>
## 3 - Helper Functions

In [None]:
def load_dataset(csv_path):
    """Loads a labels CSV, drops missing and unreadable labels, and upper-cases the rest."""
    df = pd.read_csv(csv_path).dropna()
    df = df[df['IDENTITY'] != 'UNREADABLE'].copy()  # Copy to avoid pandas' SettingWithCopyWarning
    df['IDENTITY'] = df['IDENTITY'].str.upper()  # Standardize labels
    return df

In [None]:
def preprocess_image(img):
    """Converts an image to a fixed size with normalization."""
    if img is None:
        return None
    final_img = np.ones((IMAGE_HEIGHT, IMAGE_WIDTH)) * 255  # White background
    h, w = img.shape
    final_img[:min(h, IMAGE_HEIGHT), :min(w, IMAGE_WIDTH)] = img[:IMAGE_HEIGHT, :IMAGE_WIDTH]
    return cv2.rotate(final_img, cv2.ROTATE_90_CLOCKWISE) / 255.0  # Rotate & Normalize

def load_images(df, img_dir):
    """Loads and preprocesses images dynamically."""
    image_paths = df['FILENAME'].tolist()
    images = []
    
    for img_name in image_paths:
        img_path = os.path.join(img_dir, img_name)
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
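        # Note: unreadable files are skipped here, while encode_labels still encodes
        # every row of df, so a skipped file would misalign images and labels.
        if img is not None:
            images.append(preprocess_image(img))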
    
    return np.array(images).reshape(-1, IMAGE_WIDTH, IMAGE_HEIGHT, CHANNELS)
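
To make the effect of `preprocess_image` concrete, here is a NumPy-only sketch of the same pad, crop, and rotate logic. It is an illustration under assumptions: `pad_and_rotate` is a hypothetical name, and `np.rot90(..., k=-1)` stands in for `cv2.rotate(..., cv2.ROTATE_90_CLOCKWISE)` so the example runs without OpenCV.

```python
import numpy as np

IMAGE_HEIGHT, IMAGE_WIDTH = 64, 256

def pad_and_rotate(img):
    """NumPy-only stand-in for preprocess_image (hypothetical helper)."""
    canvas = np.ones((IMAGE_HEIGHT, IMAGE_WIDTH)) * 255  # White background
    h, w = img.shape
    # Small images are pasted into the top-left corner; large ones are cropped
    canvas[:min(h, IMAGE_HEIGHT), :min(w, IMAGE_WIDTH)] = img[:IMAGE_HEIGHT, :IMAGE_WIDTH]
    # k=-1 rotates 90 degrees clockwise, like cv2.ROTATE_90_CLOCKWISE
    return np.rot90(canvas, k=-1) / 255.0

small = np.zeros((40, 100))  # Black image smaller than the canvas
large = np.zeros((80, 300))  # Black image larger than the canvas

out_small = pad_and_rotate(small)
out_large = pad_and_rotate(large)
print(small.shape, "->", out_small.shape)
print(large.shape, "->", out_large.shape)
```

Both inputs come out with shape `(256, 64)`: after the rotation the image width is the first axis, which is the axis the model later treats as the sequence of time steps.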

In [None]:
def encode_labels(df):
    """Converts text labels into numerical sequences."""
    num_samples = len(df)
    label_sequences = np.ones((num_samples, MAX_LABEL_LENGTH)) * -1
    label_lengths = np.zeros((num_samples, 1))
    input_lengths = np.ones((num_samples, 1)) * (SEQ_LENGTH - 2)  # Adjusted for CTC loss

    for i, text in enumerate(df['IDENTITY']):
        label_lengths[i] = len(text)
        label_sequences[i, :len(text)] = [ALPHABET.find(ch) for ch in text]

    return label_sequences, label_lengths, input_lengths, np.zeros((num_samples,))
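
The encoding of a single label can be sketched as follows. This is a simplified illustration of what `encode_labels` does per row; `encode_one` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

ALPHABET = "!\"#&'()*+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz "
MAX_LABEL_LENGTH = 34

def encode_one(text):
    """Encodes one label as a fixed-length index vector padded with -1 (hypothetical helper)."""
    seq = np.full(MAX_LABEL_LENGTH, -1.0)
    seq[:len(text)] = [ALPHABET.find(ch) for ch in text]
    return seq, len(text)

seq, length = encode_one("ANNA")
print("First six entries:", seq[:6])
print("Label length:", length)

# str.find returns -1 for characters outside the alphabet,
# which is the same value used for padding
print("Index of '@':", ALPHABET.find("@"))
```

One consequence worth keeping in mind: a character that is missing from `ALPHABET` is encoded as `-1`, so it cannot be told apart from padding. Checking labels against the alphabet before training may be worthwhile.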

<a name="4"></a>
## 4 - Define a CRNN-BiLSTM model utilising CTC Loss
This architecture is well suited to OCR pipelines, where the output is a variable-length character sequence.

CRNN (Convolutional Recurrent Neural Network) combined with BiLSTM (Bidirectional Long Short-Term Memory) is a powerful architecture used for sequence-based tasks such as Optical Character Recognition (OCR), speech-to-text, and handwriting recognition.

It consists of three main components:
1.	Convolutional Layers (CNN): Extract spatial features.

2.	Recurrent Layers (BiLSTM): Capture sequence dependencies.

3.	CTC Loss (Connectionist Temporal Classification): Handles unsegmented sequence labeling.

<a name="4.1"></a>
### 4.1 - Convolutional Layers (CNN)
&emsp;The CNN module extracts spatial features from the input image. It applies multiple convolutional layers, followed by activation functions and pooling.

#### 4.1.1 Convolution:

$$
Z = W * X + B
$$

&emsp;where:
- $X$ = Input image (or feature map from the previous layer)
- $W$ = Convolution kernel (filter)
- $B$ = Bias
- $*$ = Convolution operation
- $Z$ = Output feature map

#### 4.1.2 Activation Function(ReLU):
$$
f(x) = \max(0, x)
$$

#### 4.1.3 Max Pooling:
$$
P(i, j) = \max_{(m, n) \in K} Z(i + m, j + n)
$$

&emsp;where $K$ is the set of offsets $(m, n)$ covered by the pooling window.
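
The three operations above can be sketched in plain NumPy on a tiny single-channel input. This is a minimal illustration under assumptions: the helper names are hypothetical, the convolution uses no padding (the model itself uses `padding='same'`), and pooling uses a stride equal to the window size. As in most deep learning libraries, the "convolution" is computed as a cross-correlation, that is, without flipping the kernel.

```python
import numpy as np

def conv2d_valid(x, w, b=0.0):
    """Single-channel convolution without padding: Z = W * X + B."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    z = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            z[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return z

def relu(z):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, z)

def max_pool(z, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = z.shape[0] // size, z.shape[1] // size
    return z[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(25, dtype=float).reshape(5, 5)   # Toy 5x5 "image"
w = np.array([[1.0, 0.0], [0.0, 1.0]])         # Toy 2x2 kernel
z = conv2d_valid(x, w, b=-20.0)                # 4x4 feature map
a = relu(z)                                    # Negative responses set to 0
p = max_pool(a)                                # 2x2 pooled map

print("Feature map shape:", z.shape)
print("Pooled map:\n", p)
```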

<a name="4.2"></a>
### 4.2 - Recurrent Layers (BiLSTM)
&emsp;After extracting features using CNN, a Bidirectional LSTM (BiLSTM) processes the sequence in both forward and backward directions.

<div style="text-align: center;">
    <img src="https://www.researchgate.net/publication/373875187/figure/fig3/AS:11431281188406384@1694606838077/Structure-of-a-BiLSTM-cell-with-its-gates.jpg" alt="BiLSTM Block" width="500"/>
</div>

#### 4.2.1 LSTM Cell Equations:

&emsp;$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$


&emsp;$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$


&emsp;$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$


&emsp;$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$


&emsp;$h_t = o_t \odot \tanh(c_t)$

&emsp;where:
- $f_t$ = Forget gate
- $i_t$ = Input gate
- $o_t$ = Output gate
- $c_t$ = Cell state
- $h_t$ = Hidden state
- $\sigma$ = Sigmoid function
- $\odot$ = Element-wise multiplication
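
A single LSTM time step following the equations above can be sketched in NumPy. This is an illustrative sketch, not the Keras implementation: `lstm_step`, the `params` layout, and the random toy weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params holds W, U, b for the gates f, i, o and the candidate c."""
    W, U, b = params["W"], params["U"], params["b"]
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # Forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # Input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # Output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # Candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # New cell state
    h_t = o_t * np.tanh(c_t)                                    # New hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # Toy sizes chosen for the example
params = {
    "W": {g: rng.normal(size=(hidden_dim, input_dim)) for g in "fioc"},
    "U": {g: rng.normal(size=(hidden_dim, hidden_dim)) for g in "fioc"},
    "b": {g: np.zeros(hidden_dim) for g in "fioc"},
}

# Run the cell over a toy sequence of 5 time steps
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, params)

print("Hidden state shape:", h.shape)
print("Cell state shape:", c.shape)
```

Since $h_t$ is a sigmoid output multiplied by a $\tanh$ output, every entry of the hidden state stays strictly between -1 and 1.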

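In a BiLSTM, two LSTMs process the sequence in opposite directions. Their hidden states are then merged at each time step; Keras' `Bidirectional` wrapper concatenates them by default, which is what the model below uses:

$$
h_t^{\text{BiLSTM}} = [h_t^{\text{forward}} ; h_t^{\text{backward}}]
$$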

In [None]:
def build_ocr_model(seq_length, char_count):
    input_tensor = Input(shape=(256, 64, 1), name='input')
    features = Conv2D(32, (3, 3), padding='same', kernel_initializer='he_normal')(input_tensor)
    features = BatchNormalization()(features)
    features = Activation('relu')(features)
    features = MaxPooling2D(pool_size=(2, 2))(features)
    
    features = Conv2D(64, (3, 3), padding='same', kernel_initializer='he_normal')(features)
    features = BatchNormalization()(features)
    features = Activation('relu')(features)
    features = MaxPooling2D(pool_size=(2, 2))(features)
    features = Dropout(0.3)(features)
    
    features = Conv2D(128, (3, 3), padding='same', kernel_initializer='he_normal')(features)
    features = BatchNormalization()(features)
    features = Activation('relu')(features)
    features = MaxPooling2D(pool_size=(1, 2))(features)
    features = Dropout(0.3)(features)
    
    features = Reshape(target_shape=(seq_length, 1024))(features)
    features = Dense(64, activation='relu', kernel_initializer='he_normal')(features)
    features = Bidirectional(LSTM(256, return_sequences=True))(features)
    features = Bidirectional(LSTM(256, return_sequences=True))(features)
    features = Dense(char_count, kernel_initializer='he_normal')(features)
    preds = Activation('softmax', name='softmax')(features)
    
    return Model(inputs=input_tensor, outputs=preds)
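
The `Reshape` target of `(seq_length, 1024)` follows from the pooling layers. The sketch below traces the feature-map shape through the three convolution blocks with plain arithmetic; `trace_shapes` is a hypothetical helper, and it assumes `padding='same'` convolutions (which keep the spatial size) and the default non-padded pooling.

```python
def trace_shapes(input_shape=(256, 64),
                 pools=((2, 2), (2, 2), (1, 2)),
                 filters=(32, 64, 128)):
    """Returns the (steps, features, channels) shape after each conv + pool block."""
    steps, feats = input_shape
    shapes = []
    for (pool_steps, pool_feats), n_filters in zip(pools, filters):
        # 'same' convolutions keep the size; pooling divides it
        steps, feats = steps // pool_steps, feats // pool_feats
        shapes.append((steps, feats, n_filters))
    return shapes

shapes = trace_shapes()
for block, shape in enumerate(shapes, start=1):
    print(f"After block {block}: {shape}")

time_steps, feats, channels = shapes[-1]
print("Time steps:", time_steps)
print("Features per time step:", feats * channels)
```

The last block leaves 64 positions along the first axis, each with 8 x 128 = 1024 values, which matches `SEQ_LENGTH = 64` and the `Reshape` target in the model.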

<a name="4.3"></a>
### 4.3 - CTC Loss (Connectionist Temporal Classification)
&emsp;CTC loss is used when the input and output sequences do not have a strict alignment (e.g., OCR tasks where the number of characters varies).

#### 4.3.1 Probability of Alignment Path:
$$
P(y|X) = \sum_{\pi \in \mathcal{A}(y)} P(\pi | X)
$$

&emsp;where:
- $X$ = Input sequence
- $y$ = Target sequence
- $\pi$ = Possible alignments
- $\mathcal{A}(y)$ = Set of all valid alignments

#### 4.3.2 CTC Loss Function
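$$
\mathcal{L}_{CTC} = -\log P(y | X) = -\log \sum_{\pi \in \mathcal{A}(y)} \prod_{t=1}^{T} P(\pi_t | X)
$$

&emsp;where $T$ is the number of time steps and $\pi_t$ is the symbol (character or blank) that alignment $\pi$ emits at step $t$.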

CTC allows the model to learn character sequences without explicit alignment between input images and text.
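
For a very small example, $P(y|X)$ can be computed by brute force: enumerate every alignment, keep those that collapse to the target, and add up their probabilities. The sketch below does this for a made-up 3-step output over a toy vocabulary; the vocabulary, the probabilities, and the helper names are all assumptions for illustration. Real implementations use dynamic programming instead, since the number of alignments grows exponentially with the sequence length.

```python
import itertools
import numpy as np

BLANK = 2  # Toy vocabulary (assumption): 0 -> 'a', 1 -> 'b', 2 -> blank

def collapse(path, blank=BLANK):
    """Maps an alignment to its label: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in itertools.groupby(path)]
    return tuple(s for s in merged if s != blank)

def ctc_probability(probs, target, blank=BLANK):
    """Sums P(pi | X) over every alignment pi that collapses to target."""
    num_steps, num_classes = probs.shape
    total = 0.0
    for path in itertools.product(range(num_classes), repeat=num_steps):
        if collapse(path, blank) == tuple(target):
            total += np.prod([probs[t, s] for t, s in enumerate(path)])
    return total

# Made-up per-step class probabilities (each row sums to 1)
probs = np.array([[0.6, 0.1, 0.3],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7]])

p = ctc_probability(probs, target=[0, 1])  # Probability of the label "ab"
loss = -np.log(p)                          # Negative log-likelihood that CTC training minimizes

print("P('ab' | X):", p)
print("Negative log-likelihood:", loss)
```

Five alignments collapse to "ab" here (`aab`, `abb`, `ab_`, `a_b`, `_ab`, with `_` as the blank), and the probability is the sum over those five paths.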

In [None]:
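def ctc_loss_layer(args):
    """Computes the per-sample CTC loss inside a Lambda layer."""
    y_pred, labels, input_length, label_length = args
    # Drop the first two time steps; encode_labels sets input_length to SEQ_LENGTH - 2 to match
    return ctc_batch_cost(labels, y_pred[:, 2:, :], input_length, label_length)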

<a name="5"></a>
## 5 - Loading the Dataset

In [None]:
train_data = load_dataset(TRAIN_CSV_PATH)
valid_data = load_dataset(VALID_CSV_PATH)

# Load images
train_images = load_images(train_data, TRAIN_IMAGE_PATH)
valid_images = load_images(valid_data, VALID_IMAGE_PATH)

In [None]:
# Encode labels
train_labels, train_label_len, train_input_len, train_output = encode_labels(train_data)
valid_labels, valid_label_len, valid_input_len, valid_output = encode_labels(valid_data)

In [None]:
print(f"Train Samples: {len(train_data)}, Validation Samples: {len(valid_data)}")
print(f"Image Shape: {train_images.shape}, Label Shape: {train_labels.shape}")

<a name="6"></a>
## 6 - Building the Model

In [None]:
# Build OCR model
pre_model = build_ocr_model(SEQ_LENGTH, CHAR_COUNT)
pre_model.summary()

<a name="6.1"></a>
### 6.1 - Defining the CTC Loss Model

In [None]:
ground_truth_labels = Input(name='gtruth_labels', shape=[MAX_LABEL_LENGTH], dtype='float32')
input_lengths = Input(name='input_length', shape=[1], dtype='int64')
label_lengths = Input(name='label_length', shape=[1], dtype='int64')

ctc_loss = Lambda(ctc_loss_layer, output_shape=(1,), name='ctc')(
    [pre_model.output, ground_truth_labels, input_lengths, label_lengths])

In [None]:
ocr_model = Model(inputs=[pre_model.input, ground_truth_labels, input_lengths, label_lengths], 
                       outputs=ctc_loss)

<a name="6.2"></a>
### 6.2 - Compile Model

In [None]:
best_model_file = "Best_OCR_Model.keras"
ocr_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=Adam(0.0001))
checkpoint = ModelCheckpoint(filepath=best_model_file, monitor='val_loss', save_best_only=True, mode='min')

<a name="6.3"></a>
### 6.3 - Training the Model

In [None]:
history = ocr_model.fit(
    x=[train_images, train_labels, train_input_len, train_label_len],
    y=train_output,
    validation_data=([valid_images, valid_labels, valid_input_len, valid_label_len], valid_output),
    epochs=60, 
    batch_size=128, 
    shuffle=True,
    callbacks=[checkpoint]
)

##### Plot the loss graph of the model

In [None]:
# Plot training history
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(['Train', 'Validation'])
plt.show()

<a name = "7"></a>
## 7 - Evaluation and Predictions

In [None]:
ocr_model.load_weights('Best_OCR_Model.keras')

In [None]:
def num_to_label(num_seq, alphabet):
    """Converts a decoded index sequence to text, skipping the -1 padding from ctc_decode."""
    text = []
    for num in num_seq:
        if num != -1:
            text.append(alphabet[num])
    return ''.join(text)
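
Conceptually, greedy CTC decoding takes the most likely class at every time step, merges repeated classes, and removes blanks. The NumPy sketch below illustrates that idea on a toy example; it is not the Keras `ctc_decode` implementation, and `greedy_ctc_decode`, the two-letter alphabet, and the probabilities are assumptions made for the example (with the blank assumed to be the last class).

```python
import itertools
import numpy as np

def greedy_ctc_decode(probs, alphabet):
    """probs has shape (time_steps, len(alphabet) + 1); the last class is the blank."""
    blank = len(alphabet)
    best_path = probs.argmax(axis=1)                             # Most likely class per step
    merged = [int(k) for k, _ in itertools.groupby(best_path)]   # Merge repeats
    return "".join(alphabet[k] for k in merged if k != blank)    # Drop blanks

toy_alphabet = "AN"  # Toy alphabet (assumption): 0 -> 'A', 1 -> 'N', 2 -> blank

# Build softmax-like outputs whose best path is A A _ N N _ N A
path = [0, 0, 2, 1, 1, 2, 1, 0]
probs = np.full((len(path), 3), 0.1)
probs[np.arange(len(path)), path] = 0.8

decoded_text = greedy_ctc_decode(probs, toy_alphabet)
print("Decoded text:", decoded_text)
```

The blank between the two `N` runs is what allows a doubled letter to survive the merge step; without it, `N N N` would collapse to a single `N`.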

In [None]:
# Path to test images
test_dir = 'data/test/test/'

# Get list of test images
test_images = sorted(os.listdir(test_dir))

In [None]:
# Store predictions
submission_data = []

In [None]:
# Counter
total_images = len(test_images)
print(f"Total images to process: {total_images}")

### Run Prediction on the Test Set

In [None]:
# Generate predictions for test images
for idx, img_name in enumerate(test_images, start=1):
    img_path = os.path.join(test_dir, img_name)
    image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    
    if image is not None:
        image = preprocess_image(image)  # Preprocess the image
        pred = pre_model.predict(image.reshape(1, 256, 64, 1), verbose=0)
        decoded = get_value(ctc_decode(pred, input_length=np.ones(pred.shape[0]) * pred.shape[1], greedy=True)[0][0])
        pred_text = num_to_label(decoded[0], ALPHABET)
    else:
        pred_text = "MISSING_LABEL"  # If the image cannot be read, record a placeholder prediction

    submission_data.append([img_name, pred_text])

    # Show progress
    print(f"Processed {idx}/{total_images} images", end='\r')

In [None]:
# Convert to DataFrame and save as CSV
submission_df = pd.DataFrame(submission_data, columns=['Id', 'Predicted'])
submission_df.to_csv('written_test.csv', index=False)

print("Submission file 'written_test.csv' has been created successfully!")

In [None]:
# Load submission CSV
submission_df = pd.read_csv("written_test.csv")

# Test image directory
test_dir = "data/test/test/"

In [None]:

def display_prediction(image_index):
    """ Function to display image and its predicted label. """
    
    # Get the image filename and predicted label from submission file
    img_name = submission_df.loc[image_index, 'Id']
    predicted_label = submission_df.loc[image_index, 'Predicted']
    img_path = os.path.join(test_dir, img_name)

    # Load and preprocess image
    image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        print(f"Error loading image: {img_name}")
        return
    
    processed_image = preprocess_image(image)

    # Model prediction
    pred = pre_model.predict(processed_image.reshape(1, 256, 64, 1))
    decoded = get_value(ctc_decode(pred, input_length=np.ones(pred.shape[0]) * pred.shape[1], greedy=True)[0][0])
    model_predicted_text = num_to_label(decoded[0], ALPHABET)

    # Display image
    plt.imshow(image, cmap='gray')
    plt.axis('off')
    plt.title(f"Submission Prediction: {predicted_label}\nModel Prediction: {model_predicted_text}", fontsize=10)
    plt.show()

### Manually load image and prediction for error checking

In [None]:
try:
    img_idx = int(input(f"Enter an image index (0 to {len(submission_df)-1}), or -1 to exit: "))
    if img_idx == -1:
        print("Exiting.")
    elif 0 <= img_idx < len(submission_df):
        display_prediction(img_idx)
    else:
        print("Invalid index. Please enter a valid number.")
except ValueError:
    print("Please enter a valid integer.")