# UberEats Menu Extraction - CRNN implementation

We will build our text recognition model in the following steps:
1. Collecting Dataset
2. Preprocessing Data
3. Creating Network Architecture
4. Defining Loss function
5. Training model
6. Decoding outputs from prediction

In [1]:
# Imports
import os
import fnmatch
import cv2
import numpy as np
import string
import time

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, LSTM, Reshape, BatchNormalization, Input, Conv2D
from keras.layers import MaxPool2D, Lambda, Bidirectional
from keras.models import Model
from keras.activations import relu, sigmoid, softmax
import keras.backend as K
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint

import tensorflow as tf
from tensorflow.python.client import device_lib

Using TensorFlow backend.


## Initial TF Setup

In [2]:
# ignore warnings in the output
tf.logging.set_verbosity(tf.logging.ERROR)

In [3]:
# Check all available devices if GPU is available
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 9667735059869621268
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 14427974410158030307
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 4897243441956212156
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7896458855
locality {
  bus_id: 1
  links {
    link {
      device_id: 1
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 17205179779777069436
physical_device_desc: "device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 7896458855
locality {
  bus_id: 1
  links {
    link {
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 7684767529933889767
physical_d

In [4]:
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

## Global Variables

In [5]:

# Data Path
path = 'data/mnt/ramdisk/max/90kDICT32px'
filepath="models/best_model.hdf5"
# lists for training dataset
training_img = []
training_txt = []
train_input_length = []
train_label_length = []
orig_txt = []
 
#lists for validation dataset
valid_img = []
valid_txt = []
valid_input_length = []
valid_label_length = []
valid_orig_txt = []
 
max_label_len = 0
 
i =1 
flag = 0

In [6]:
# Global Variables
char_list = string.ascii_letters + string.digits
print("Character List: ", char_list)

Character List:  abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789


In [7]:
# Training Variables
batch_size = 256
epochs = 10

## Data Ingestion 

We will use the MJSynth synthetic word dataset provided by the Visual Geometry Group. It is a large dataset, about 10 GB of images, and can be downloaded with:

wget https://www.robots.ox.ac.uk/~vgg/data/text/mjsynth.tar.gz

## Data Preprocessing
To preprocess the input images we will:
1. Read the image and convert it to grayscale.
2. Pad each image to a height of 32 and a width of 128.
3. Expand the image dimensions to (32,128,1) to make it compatible with the input shape of the architecture.
4. Normalize the pixel values by dividing them by 255.
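
The image steps above can be sketched on a synthetic array, without reading a file from disk. This is a minimal sketch: `pad_to_canvas` is a hypothetical helper name, the white (255) padding value mirrors what the loading loop in this notebook uses, and images larger than the canvas are assumed to be skipped beforehand.

```python
import numpy as np

def pad_to_canvas(gray, target_h=32, target_w=128):
    """Pad a 2-D grayscale image with white pixels to (target_h, target_w),
    add a channel axis, and scale pixel values to [0, 1]."""
    h, w = gray.shape  # NumPy image arrays are (rows, cols) = (height, width)
    if h > target_h or w > target_w:
        raise ValueError("image is larger than the target canvas")
    canvas = np.full((target_h, target_w), 255.0)
    canvas[:h, :w] = gray             # original sits in the top-left corner
    canvas = canvas[..., np.newaxis]  # shape becomes (target_h, target_w, 1)
    return canvas / 255.0

# A synthetic 20x100 mid-gray array stands in for a real word crop.
sample = np.full((20, 100), 128, dtype=np.uint8)
processed = pad_to_canvas(sample)
print(processed.shape)
```

The padding is added below and to the right of the original pixels, which matches the order of the `np.concatenate` calls used later in this notebook.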

To preprocess the output labels we will:
1. Read the text from the image file name, since the file name contains the text written inside the image.
2. Encode each character of a word as a numerical value using a function (‘a’:0, ‘b’:1, …, ‘z’:25, etc.). For example, the word ‘abab’ is encoded as [0,1,0,1].
3. Compute the maximum word length and pad every output label to that length. This makes the labels compatible with the output shape of our RNN architecture.
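
As a small illustration of the first two label steps, the sketch below pulls the word out of a file name and encodes it. The file name is made up and only assumed to follow an `<id>_<word>_<id>.jpg` pattern, which is what `split('_')[1]` in the loading loop relies on. `label_from_filename` and `encode_to_indices` are hypothetical names, not the notebook's own functions.

```python
import string

char_list = string.ascii_letters + string.digits  # same alphabet as the notebook

def label_from_filename(f_name):
    """Return the word embedded between the first two underscores."""
    return f_name.split('_')[1]

def encode_to_indices(text):
    """Map each character to its position in char_list, skipping unknown ones."""
    return [char_list.index(ch) for ch in text if ch in char_list]

# Hypothetical file name following the <id>_<word>_<id>.jpg pattern.
word = label_from_filename("1_abab_123.jpg")
print(word, encode_to_indices(word))

# Lowercase letters come first, then uppercase, then digits.
print(encode_to_indices("zA9"))
```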

**Note**:
In the preprocessing step we also need to create two other lists:
1. label length and
2. input length to our RNN.

These two lists are needed by the CTC loss. **Label length** is the length of each output text label. **Input length** is the number of time steps fed to the LSTM layer, which is the same for every image: 31 in our architecture.
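
The relation between the two lengths matters: CTC can only emit a label if there are enough time steps, and it needs at least one blank between two identical neighbouring characters. A minimal sketch of that check follows, with `min_ctc_steps` as a hypothetical helper name and 31 taken from the text above.

```python
def min_ctc_steps(label):
    """Smallest number of time steps CTC needs to emit `label`:
    one step per character plus one blank between each pair of equal neighbours."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

TIME_STEPS = 31  # sequence length fed to the LSTM in this notebook

for word in ["hello", "aaa", "interpreting"]:
    needed = min_ctc_steps(word)
    print(word, needed, needed <= TIME_STEPS)
```

A label that needs more than 31 steps cannot be produced by any alignment, so such samples would have to be filtered out before training.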


In [8]:
"""
"""
def encode_to_labels(text):
    """Encode a word as a list of indices into char_list."""
    digit_list = []
    for character in text:
        try:
            digit_list.append(char_list.index(character))
        except ValueError:
            # character is not in char_list (e.g. punctuation), so it is skipped
            print("Error in finding index for character ", character)
    #End For
    return digit_list

In [9]:
"""
"""
i =1
flag = 0
for root, dirnames, filenames in os.walk(path):
    for f_name in fnmatch.filter(filenames, '*.jpg'):
        # read input image and convert to grayscale
        img = cv2.cvtColor(cv2.imread(os.path.join(root, f_name)), cv2.COLOR_BGR2GRAY)

        # convert each image to shape (32, 128, 1)
        # img.shape is (rows, cols), i.e. (height, width)
        h, w = img.shape

        # skip images that do not fit into the 32x128 canvas
        if w > 128 or h > 32:
            continue
        # endif

        # Pad with white pixels (255) below and to the right to reach (32, 128)
        if h < 32:
            add_white = np.ones((32-h, w))*255
            img = np.concatenate((img, add_white))
        # endif
        if w < 128:
            add_white = np.ones((32, 128-w))*255
            img = np.concatenate((img, add_white), axis=1)
        # endif    

        img = np.expand_dims(img , axis = 2)

        # Normalise the image
        img = img/255.

        # Get the text for the image
        txt = f_name.split('_')[1]

        # compute maximum length of the text
        if len(txt) > max_label_len:
            max_label_len = len(txt)

        # split the 150000 data into validation and training dataset as 10% and 90% respectively
        if i%10 == 0:     
            valid_orig_txt.append(txt)   
            valid_label_length.append(len(txt))
            valid_input_length.append(31)
            valid_img.append(img)
            valid_txt.append(encode_to_labels(txt))
        else:
            orig_txt.append(txt)   
            train_label_length.append(len(txt))
            train_input_length.append(31)
            training_img.append(img)
            training_txt.append(encode_to_labels(txt))
        # endif
        # break the loop if total data is 150000
        if i == 150000:
            flag = 1
            break
        # endif

        i+=1
    # endfor-2
    if flag == 1:
        break
# endfor-1

In [10]:
# pad each output label to maximum text length
train_padded_txt = pad_sequences(training_txt, maxlen=max_label_len, padding='post', value = len(char_list))
valid_padded_txt = pad_sequences(valid_txt, maxlen=max_label_len, padding='post', value = len(char_list))
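
To make the effect of `padding='post'` concrete, here is a plain-Python sketch of the same right-padding on two short encoded labels. `pad_post` is a hypothetical stand-in, not the Keras function. The value 62 is `len(char_list)`, one past the last real character index. As far as I can tell, the loss does not read the padded entries as long as the label lengths are correct.

```python
def pad_post(sequences, maxlen, value):
    """Right-pad each sequence with `value` up to `maxlen`.
    No truncation is needed here because maxlen is the longest label."""
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]

PAD = 62  # len(char_list) in the notebook
encoded = [[0, 1, 0, 1], [2, 0, 19]]   # 'abab' and 'cat'
max_len = max(len(seq) for seq in encoded)
padded = pad_post(encoded, max_len, PAD)
print(padded)
```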

## Network Architecture
This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

1. The input is a grayscale image of height 32 and width 128.
2. We use seven convolution layers, six with kernel size (3,3) and the last with kernel size (2,2). The number of filters increases from 64 to 512 across the layers.
3. Two max-pooling layers of size (2,2) are followed by two max-pooling layers of size (2,1). The latter halve only the height, so the feature map keeps more positions along the width, which helps when predicting long texts.
4. Batch normalization layers after the fifth and sixth convolution layers help accelerate training.
5. A lambda function squeezes the height dimension out of the convolutional output to make it compatible with the LSTM layer.
6. Two bidirectional LSTM layers with 128 units each follow. A final dense softmax layer then gives an output of size (batch_size, 31, 63), where 63 is the total number of output classes including the blank character.
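
The sequence length of 31 follows from the layer sizes listed above. The arithmetic can be traced with a short sketch. It assumes, as in the cell below, that the (3,3) convolutions use 'same' padding and so leave height and width unchanged, that each pooling layer strides by its pool size, and that the final (2,2) convolution uses no padding.

```python
def pool_out(size, pool, stride):
    """Output length of a max-pool along one axis with 'valid' padding."""
    return (size - pool) // stride + 1

def conv_valid_out(size, kernel):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

h, w = 32, 128
# (pool_h, pool_w) for the four pooling layers, applied in order.
for pool_h, pool_w in [(2, 2), (2, 2), (2, 1), (2, 1)]:
    h, w = pool_out(h, pool_h, pool_h), pool_out(w, pool_w, pool_w)

# Final 2x2 convolution without padding.
h, w = conv_valid_out(h, 2), conv_valid_out(w, 2)
print(h, w)
```

The height collapses to 1, which is why it can be squeezed away, and the remaining width of 31 becomes the number of time steps seen by the LSTM layers.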

In [11]:
# input with shape of height=32 and width=128 
inputs = Input(shape=(32,128,1))
 
# convolution layer with kernel size (3,3)
conv_1 = Conv2D(64, (3,3), activation = 'relu', padding='same')(inputs)
# pooling layer with kernel size (2,2)
pool_1 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_1)
 
conv_2 = Conv2D(128, (3,3), activation = 'relu', padding='same')(pool_1)
pool_2 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_2)
 
conv_3 = Conv2D(256, (3,3), activation = 'relu', padding='same')(pool_2)
 
conv_4 = Conv2D(256, (3,3), activation = 'relu', padding='same')(conv_3)
# pooling layer with kernel size (2,1)
pool_4 = MaxPool2D(pool_size=(2, 1))(conv_4)
 
conv_5 = Conv2D(512, (3,3), activation = 'relu', padding='same')(pool_4)
# Batch normalization layer
batch_norm_5 = BatchNormalization()(conv_5)
 
conv_6 = Conv2D(512, (3,3), activation = 'relu', padding='same')(batch_norm_5)
batch_norm_6 = BatchNormalization()(conv_6)
pool_6 = MaxPool2D(pool_size=(2, 1))(batch_norm_6)
 
conv_7 = Conv2D(512, (2,2), activation = 'relu')(pool_6)
 
squeezed = Lambda(lambda x: K.squeeze(x, 1))(conv_7)
 
# bidirectional LSTM layers with units=128
blstm_1 = Bidirectional(LSTM(128, return_sequences=True, dropout = 0.2))(squeezed)
blstm_2 = Bidirectional(LSTM(128, return_sequences=True, dropout = 0.2))(blstm_1)
 
outputs = Dense(len(char_list)+1, activation = 'softmax')(blstm_2)

# model to be used at test time
act_model = Model(inputs, outputs)

In [12]:
act_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 32, 128, 1)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 32, 128, 64)       640       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 16, 64, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 64, 128)       73856     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 8, 32, 128)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 32, 256)        295168    
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 8, 32, 256)        590080    
__________

## Loss Function
Here we use the CTC (Connectionist Temporal Classification) loss function, which is well suited to text recognition problems. It removes the need to annotate the character at every time step, and it handles the case where a single character spans multiple time steps, which would otherwise need further processing. If you want to know more about CTC, please follow this blog.
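
To illustrate the idea, the sketch below implements the CTC collapse rule (merge consecutive repeats, then drop blanks) and computes the loss by brute force for a tiny example: the negative log of the summed probability of every path that collapses to the target. This is only a teaching sketch with hypothetical names and a made-up two-symbol alphabet. Real implementations use dynamic programming instead of enumerating all paths.

```python
import itertools
import math

BLANK = "-"

def collapse(path):
    """CTC collapse rule: merge consecutive repeats, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)

def ctc_loss_bruteforce(probs, symbols, target):
    """Negative log of the total probability of all paths collapsing to target.
    probs[t][k] is the probability of symbols[k] at time step t."""
    total = 0.0
    for path in itertools.product(range(len(symbols)), repeat=len(probs)):
        if collapse([symbols[k] for k in path]) == target:
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return -math.log(total)

# A character spanning several steps collapses to one; blanks separate repeats.
print(collapse("hh-e-ll-lo"))

# Two time steps, two classes ('a' and blank), uniform probabilities.
symbols = ["a", BLANK]
probs = [[0.5, 0.5], [0.5, 0.5]]
print(round(ctc_loss_bruteforce(probs, symbols, "a"), 4))
```

In the toy example three of the four possible paths ("aa", "a-", "-a") collapse to "a", so the summed probability is 0.75.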

A CTC loss function requires four arguments to compute the loss: the predicted outputs, the ground truth labels, the input sequence length to the LSTM, and the ground truth label length. To supply them, we create a custom loss function and pass it to the model. To make this compatible with Keras, we build a second model that takes the image, the labels, and the two lengths as inputs and outputs the loss. This model is used for training. For testing we use the model created earlier, “act_model”. Let’s see the code:

In [13]:
labels = Input(name='the_labels', shape=[max_label_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

In [14]:
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
 
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

In [15]:
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([outputs, labels, input_length, label_length])

#model to be used at training time
model = Model(inputs=[inputs, labels, input_length, label_length], outputs=loss_out)

## Train the Model
To train the model we will use the Adam optimizer. We also use the Keras callbacks functionality to save the best model on the basis of validation loss. In model.compile(), you can see that the loss simply returns y_pred and ignores y_true. This is because the labels were already passed to the model as an input earlier, so the output of the 'ctc' layer is itself the loss value.

Now train your model on 135000 training images and 15000 validation images.
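
The 135000/15000 figures follow from the loading loop, which sends every tenth processed image to the validation lists and stops at 150000. A quick arithmetic check, with the counter starting at 1 as in the loop:

```python
TOTAL = 150000  # the loop stops once this many images have been processed

# Every image whose running index is a multiple of 10 goes to validation.
valid_count = sum(1 for i in range(1, TOTAL + 1) if i % 10 == 0)
train_count = TOTAL - valid_count
print(train_count, valid_count)
```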

In [16]:
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer = 'adam')

In [17]:
checkpoint = ModelCheckpoint(filepath=filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

In [18]:
callbacks_list = [checkpoint]

In [19]:
training_img = np.array(training_img)
train_input_length = np.array(train_input_length)
train_label_length = np.array(train_label_length)

In [20]:
valid_img = np.array(valid_img)
valid_input_length = np.array(valid_input_length)
valid_label_length = np.array(valid_label_length)

In [21]:
model.fit(x=[training_img, train_padded_txt, train_input_length, train_label_length], 
          y=np.zeros(len(training_img)), batch_size=batch_size,
          epochs = epochs, 
          validation_data = ([valid_img, valid_padded_txt, valid_input_length, valid_label_length], 
                             [np.zeros(len(valid_img))]), 
          verbose = 1, callbacks = callbacks_list)

Train on 135000 samples, validate on 15000 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 26.11899, saving model to models/best_model.hdf5
Epoch 2/10

Epoch 00002: val_loss improved from 26.11899 to 10.89314, saving model to models/best_model.hdf5
Epoch 3/10

Epoch 00003: val_loss improved from 10.89314 to 4.69746, saving model to models/best_model.hdf5
Epoch 4/10

Epoch 00004: val_loss improved from 4.69746 to 3.95999, saving model to models/best_model.hdf5
Epoch 5/10

Epoch 00005: val_loss improved from 3.95999 to 3.38159, saving model to models/best_model.hdf5
Epoch 6/10

Epoch 00006: val_loss improved from 3.38159 to 3.20071, saving model to models/best_model.hdf5
Epoch 7/10

Epoch 00007: val_loss improved from 3.20071 to 3.17905, saving model to models/best_model.hdf5
Epoch 8/10

Epoch 00008: val_loss improved from 3.17905 to 3.06030, saving model to models/best_model.hdf5
Epoch 9/10

Epoch 00009: val_loss did not improve from 3.06030
Epoch 10/10

Epoch 00010: val_

<keras.callbacks.History at 0x7f903a6c6b00>

In [22]:
act_model.save(filepath)

## Test the Model
Our model is now trained on 135000 images, so it is time to test it. We cannot use the training model, because it requires labels as input and labels are not available at test time. Instead we use “act_model”, created earlier, which takes only one input: the test images.

As our model predicts the probability of each class at each time step, we need a transcription function to convert these probabilities into actual text. Here we use the CTC decoder to get the output text.
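
What greedy (best-path) decoding does can be sketched in plain Python: pick the most likely class at each time step, merge consecutive repeats, and drop the blank. This is an illustrative stand-in with a hypothetical function name and a toy two-letter alphabet, not the Keras decoder itself. The blank is assumed to be the last class index, as it is for the 63-class output here.

```python
def greedy_ctc_decode(probs, alphabet):
    """Best-path decoding: take the most likely class at each time step,
    merge consecutive repeats, then drop the blank (index len(alphabet))."""
    blank = len(alphabet)
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    chars = []
    prev = None
    for k in best:
        if k != prev and k != blank:
            chars.append(alphabet[k])
        prev = k
    return "".join(chars)

alphabet = "ab"          # toy alphabet; the notebook uses 62 characters
probs = [
    [0.8, 0.1, 0.1],     # 'a'
    [0.7, 0.2, 0.1],     # 'a' again, merged with the previous step
    [0.1, 0.1, 0.8],     # blank
    [0.6, 0.3, 0.1],     # 'a' after a blank, kept as a new character
    [0.2, 0.7, 0.1],     # 'b'
]
print(greedy_ctc_decode(probs, alphabet))
```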

In [23]:
from keras.models import load_model
# load the model saved in the previous cell (act_model.save wrote to the same path as the checkpoint)
act_model = load_model('models/best_model.hdf5')



In [24]:
valid_img[:10].shape

(10, 32, 128, 1)

In [25]:
# predict outputs on validation images
prediction = act_model.predict(valid_img[:10])

In [26]:
# use CTC decoder
out = K.get_value(K.ctc_decode(prediction, input_length=np.ones(prediction.shape[0])*prediction.shape[1],
                         greedy=True)[0][0])

In [27]:
# see the results
i = 0
for x in out:
    print("original_text =  ", valid_orig_txt[i])
    print("predicted text = ", end = '')
    for p in x:  
        if int(p) != -1:
            print(char_list[int(p)], end = '')       
    print('\n')
    i+=1

original_text =   IMPORTS
predicted text = IMPORTS

original_text =   scherzo
predicted text = scherzo

original_text =   interlude
predicted text = interlude

original_text =   interpreting
predicted text = Interpreting

original_text =   Barbuda
predicted text = Barbuda

original_text =   garroter
predicted text = garroter

original_text =   Chat
predicted text = Chat

original_text =   Detect
predicted text = Detect

original_text =   WHORL
predicted text = EWHORL

original_text =   MARIAN
predicted text = MARIAN



# END