<a href="https://colab.research.google.com/github/Mritunjaysri01/Colab-Notebook/blob/master/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/arthurflor23/handwritten-text-recognition/blob/master/doc/image/header.png?raw=true" />

# Handwritten Text Recognition using TensorFlow 2.x

This tutorial shows how to use the [Handwritten Text Recognition](https://github.com/arthurflor23/handwritten-text-recognition) project in Google Colab.



## 1 Localhost Environment

We'll make sure you have the project in your Google Drive with the datasets in HDF5. If you already have structured files in the cloud, skip this step.

### 1.1 Datasets

The datasets that you can use:

a. [Bentham](http://transcriptorium.eu/datasets/bentham-collection/)

b. [IAM](http://www.fki.inf.unibe.ch/databases/iam-handwriting-database)

c. [Rimes](http://www.a2ialab.com/doku.php?id=rimes_database:start)

d. [Saint Gall](http://www.fki.inf.unibe.ch/databases/iam-historical-document-database/saint-gall-database)

e. [Washington](http://www.fki.inf.unibe.ch/databases/iam-historical-document-database/washington-database)

### 1.2 Raw folder

On localhost, download the project code from GitHub and extract the chosen dataset (or all of them, if you prefer) into the **raw** folder. Don't change the structure of any dataset, since the scripts were written for the **original structure** of each one. Your project directory will look like this:

```
.
├── raw
│   ├── bentham
│   │   ├── BenthamDatasetR0-GT
│   │   └── BenthamDatasetR0-Images
│   ├── iam
│   │   ├── ascii
│   │   ├── forms
│   │   ├── largeWriterIndependentTextLineRecognitionTask
│   │   ├── lines
│   │   └── xml
│   ├── rimes
│   │   ├── eval_2011
│   │   ├── eval_2011_annotated.xml
│   │   ├── training_2011
│   │   └── training_2011.xml
│   ├── saintgall
│   │   ├── data
│   │   ├── ground_truth
│   │   ├── README.txt
│   │   └── sets
│   └── washington
│       ├── data
│       ├── ground_truth
│       ├── README.txt
│       └── sets
└── src
    ├── data
    │   ├── evaluation.py
    │   ├── generator.py
    │   ├── preproc.py
    │   ├── reader.py
    │   ├── similar_error_analysis.py
    ├── main.py
    ├── network
    │   ├── architecture.py
    │   ├── layers.py
    │   ├── model.py
    └── tutorial.ipynb

```

After that, create a virtual environment and install the dependencies with Python 3 and pip:

> ```python -m venv .venv && source .venv/bin/activate```

> ```pip install -r requirements.txt```

### 1.3 HDF5 files

Now run the *transform* function from **main.py**. To do so, execute the following in the **src** folder:

> ```python main.py --source=<DATASET_NAME> --transform```

Your data will be preprocessed and encoded, then saved in the **data** folder. Your project directory will now look like this:


```
.
├── data
│   ├── bentham.hdf5
│   ├── iam.hdf5
│   ├── rimes.hdf5
│   ├── saintgall.hdf5
│   └── washington.hdf5
├── raw
│   ├── bentham
│   │   ├── BenthamDatasetR0-GT
│   │   └── BenthamDatasetR0-Images
│   ├── iam
│   │   ├── ascii
│   │   ├── forms
│   │   ├── largeWriterIndependentTextLineRecognitionTask
│   │   ├── lines
│   │   └── xml
│   ├── rimes
│   │   ├── eval_2011
│   │   ├── eval_2011_annotated.xml
│   │   ├── training_2011
│   │   └── training_2011.xml
│   ├── saintgall
│   │   ├── data
│   │   ├── ground_truth
│   │   ├── README.txt
│   │   └── sets
│   └── washington
│       ├── data
│       ├── ground_truth
│       ├── README.txt
│       └── sets
└── src
    ├── data
    │   ├── evaluation.py
    │   ├── generator.py
    │   ├── preproc.py
    │   ├── reader.py
    │   ├── similar_error_analysis.py
    ├── main.py
    ├── network
    │   ├── architecture.py
    │   ├── layers.py
    │   ├── model.py
    └── tutorial.ipynb

```

Then upload the **data** and **src** folders to the same directory in your Google Drive.
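Before uploading, it can help to confirm that the files the notebook expects are actually in place. The following is a minimal sketch, not part of the project: the helper name `missing_paths` is hypothetical, and it only assumes the layout shown in the directory tree above (`src/main.py` plus one `data/<dataset>.hdf5` file per transformed dataset).

```python
from pathlib import Path


def missing_paths(root, datasets):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    # main.py and one HDF5 file per dataset, as in the directory tree above
    expected = [root / "src" / "main.py"]
    expected += [root / "data" / f"{name}.hdf5" for name in datasets]
    return [str(p.relative_to(root)) for p in expected if not p.exists()]


if __name__ == "__main__":
    # "." is assumed to be the project root that contains data/ and src/
    for path in missing_paths(".", ["bentham"]):
        print("missing:", path)
```

An empty result means the two folders are ready to be uploaded for the listed datasets.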

## 2 Google Drive Environment


### 2.1 TensorFlow 2.x

Make sure the Jupyter notebook is running in GPU mode.

In [None]:
!nvidia-smi

Mon Jan 18 17:35:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Now, we'll install and switch to TensorFlow 2.x.

In [None]:
!pip install -q tensorflow-gpu

%tensorflow_version 2.x

In [None]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()

if device_name != "/device:GPU:0":
    raise SystemError("GPU device not found")

print(f"Found GPU at: {device_name}")

### 2.2 Google Drive

Mount your Google Drive partition.

**Note:** *"Colab Notebooks/handwritten-text-recognition/src/"* is the directory where you put the project folders, specifically the **src** folder.

In [None]:
from google.colab import drive

drive.mount("./gdrive", force_remount=True)

%cd "./gdrive/My Drive/Colab Notebooks/handwritten-text-recognition/src/"
!ls -l

After mounting, you can see the list of files in the project folder.

## 3 Set Python Classes

### 3.1 Environment

First, let's define our environment variables.

Set the main configuration parameters, such as input size, batch size, number of epochs and list of characters. This keeps the Jupyter notebook compatible with **main.py**:

* **dataset**: "bentham", "iam", "rimes", "saintgall", "washington"

* **arch**: network to run: "bluche", "puigcerver", "flor"

* **epochs**: number of epochs

* **batch_size**: number of samples per batch

In [None]:
import os        # filesystem paths and directory creation
import datetime  # timing the training and prediction runs
import string    # source of the printable ASCII characters used as the charset

# define parameters
source = "bentham"  # manuscripts written by the English philosopher Jeremy Bentham; the text has many punctuation marks
arch = "flor"       # combines puigcerver and bluche; faster and more compact than the other two
epochs = 1000       # one epoch is one full pass over the training set
batch_size = 16     # number of samples per batch

# define paths
source_path = os.path.join("..", "data", f"{source}.hdf5")          # ../data/bentham.hdf5
output_path = os.path.join("..", "output", source, arch)            # ../output/bentham/flor
target_path = os.path.join(output_path, "checkpoint_weights.hdf5")  # ../output/bentham/flor/checkpoint_weights.hdf5
os.makedirs(output_path, exist_ok=True)                             # create any missing directories in the path

# define input size, maximum number of characters per line and list of valid characters
input_size = (1024, 128, 1)  # image dimensions plus channel depth (1 = grayscale, 3 = RGB)
max_text_length = 128
charset_base = string.printable[:95]

print("source:", source_path)
print("output:", output_path)
print("target:", target_path)
print("charset:", charset_base)
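To make the `string.printable[:95]` slice less opaque, the sketch below spells out what those 95 characters are using only the standard library. The helper `unsupported_chars` is hypothetical and simply reports characters that fall outside the charset; how the project itself treats such characters is not covered here.

```python
import string

charset_base = string.printable[:95]

# string.printable is digits + ascii_letters + punctuation + whitespace,
# so the first 95 characters end with the space character and exclude
# tabs and newlines.
expected = string.digits + string.ascii_letters + string.punctuation + " "
print(charset_base == expected)
print(len(string.digits), len(string.ascii_letters), len(string.punctuation))


def unsupported_chars(text, charset=charset_base):
    """Return the sorted characters of `text` that are outside `charset`."""
    return sorted(set(text) - set(charset))


print(unsupported_chars("naïve café\t"))
```

Accented letters and tabs show up as unsupported, which is worth keeping in mind for datasets whose transcriptions are not plain ASCII.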

### 3.2 DataGenerator Class

The second class, **DataGenerator()**, is responsible for:

* Loading the dataset partitions (train, valid, test);

* Managing batches for the train/validation/test process.

In [None]:
# build the data generator, which loads the train, validation and test partitions
# from the HDF5 file and serves them in batches; then print the size of each partition

from data.generator import DataGenerator

dtgen = DataGenerator(source=source_path,               # path to the HDF5 dataset file
                      batch_size=batch_size,            # 16
                      charset=charset_base,             # the 95 valid characters
                      max_text_length=max_text_length)  # 128

print(f"Train images: {dtgen.size['train']}")
print(f"Validation images: {dtgen.size['valid']}")
print(f"Test images: {dtgen.size['test']}")
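The partition sizes and the batch size together determine how many batches make up one epoch. The sketch below illustrates that relationship with plain Python. It is an assumption about how a generator of this kind typically works, not the project's `DataGenerator` code, and the names `steps_per_epoch` and `batch_forever` are hypothetical.

```python
import math


def steps_per_epoch(num_items, batch_size):
    """Batches needed to see every item once (the last batch may be smaller)."""
    return math.ceil(num_items / batch_size)


def batch_forever(items, batch_size):
    """Yield consecutive batches endlessly, restarting after the last one."""
    while True:
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]


items = list(range(40))  # stand-in for 40 images
gen = batch_forever(items, 16)
steps = steps_per_epoch(len(items), 16)
epoch = [next(gen) for _ in range(steps)]
print(steps, [len(batch) for batch in epoch])  # prints: 3 [16, 16, 8]
```

Because such a generator never ends on its own, the consumer has to be told how many batches count as one epoch.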

### 3.3 HTRModel Class

The third class, **HTRModel()**, was developed to be easy to use and to abstract the complicated flow of an HTR system. It's responsible for:

* Creating the model with the Handwritten Text Recognition flow, which calculates the loss function with CTC and decodes the output to calculate the HTR metrics: Character Error Rate (CER), Word Error Rate (WER) and Sequence Error Rate (SER, also called sentence error rate);

* Saving and loading the model;

* Loading weights into the model (train/infer);

* Running the train/predict process using a *generator*.

To keep HTRModel flexible, it takes the *architecture*, *input_size* and *vocab_size* as parameters.

In [None]:
from network.model import HTRModel

# create and compile HTRModel
model = HTRModel(architecture=arch,                      # "flor"
                 input_size=input_size,                  # (1024, 128, 1)
                 vocab_size=dtgen.tokenizer.vocab_size,  # number of tokens in the tokenizer vocabulary
                 beam_width=10,                          # candidate sequences kept at each timestep during beam search decoding
                 stop_tolerance=20,                      # assumed: epochs without improvement before training stops early
                 reduce_tolerance=15)                    # assumed: epochs without improvement before the learning rate is reduced

model.compile(learning_rate=0.001)  # learning rate of the optimizer
model.summary(output_path, "summary.txt")

# get default callbacks and load checkpoint weights file (HDF5) if it exists
model.load_checkpoint(target=target_path)

# callbacks are functions applied at given stages of the training procedure,
# for example saving the model as a checkpoint after an epoch
callbacks = model.get_callbacks(logdir=output_path, checkpoint=target_path, verbose=1)
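The model decodes its CTC output with beam search (`beam_width=10`). As an illustration of the underlying CTC rule only, the sketch below uses the simpler greedy (best-path) variant rather than beam search: take the most likely label per frame, collapse consecutive repeats, then drop the blank label. The function name and the `"-"` blank symbol are assumptions made for this example and are not taken from the project.

```python
def ctc_greedy_decode(frame_labels, blank):
    """Collapse consecutive repeated labels, then drop the blank label."""
    decoded = []
    previous = None
    for label in frame_labels:
        # keep a label only when it differs from the previous frame
        # and is not the blank symbol
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Hypothetical per-frame best labels, with "-" as the blank symbol.
frames = list("hh-e-ll-ll-oo")
print("".join(ctc_greedy_decode(frames, blank="-")))  # prints: hello
```

The blank between the two `l` runs is what allows a genuine double letter to survive the collapse step.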

## 4 Training

The training process is similar to Keras *fit()*. After training, the information (epochs and minimum loss) is saved.

In [None]:
# to calculate total and average time per epoch
start_time = datetime.datetime.now()

# fit() trains the model on batches of size batch_size,
# repeatedly iterating over the entire dataset for the given number of epochs

h = model.fit(x=dtgen.next_train_batch(),                # generator that yields the training batches
              epochs=epochs,                             # number of epochs (1000)
              steps_per_epoch=dtgen.steps['train'],      # number of steps before one epoch is finished and the next starts
              validation_data=dtgen.next_valid_batch(),  # data used to evaluate the loss and metrics at the end of each epoch
              validation_steps=dtgen.steps['valid'],     # number of steps for each validation run
              callbacks=callbacks,                       # default callbacks (logging and checkpointing)
              shuffle=True,                              # whether to shuffle the training data each epoch
              verbose=1)                                 # 0 = silent, 1 = progress bar, 2 = one line per epoch

total_time = datetime.datetime.now() - start_time

loss = h.history['loss']          # training loss per epoch
val_loss = h.history['val_loss']  # validation loss per epoch

min_val_loss = min(val_loss)                   # lowest validation loss
min_val_loss_i = val_loss.index(min_val_loss)  # index of the epoch that reached it

time_epoch = (total_time / len(loss))                     # average time per epoch
total_item = (dtgen.size['train'] + dtgen.size['valid'])  # number of train and validation images

t_corpus = "\n".join([
    f"Total train images:      {dtgen.size['train']}",
    f"Total validation images: {dtgen.size['valid']}",
    f"Batch:                   {dtgen.batch_size}\n",
    f"Total time:              {total_time}",
    f"Time per epoch:          {time_epoch}",
    f"Time per item:           {time_epoch / total_item}\n",
    f"Total epochs:            {len(loss)}",
    f"Best epoch               {min_val_loss_i + 1}\n",
    f"Training loss:           {loss[min_val_loss_i]:.8f}", 
    f"Validation loss:         {min_val_loss:.8f}"
])
# A validation loss far above the training loss suggests overfitting;
# a validation loss far below the training loss may suggest underfitting.
with open(os.path.join(output_path, "train.txt"), "w") as lg:
    lg.write(t_corpus)
    print(t_corpus)
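The report uses `len(loss)` for the total number of epochs, which can be lower than the configured `epochs` when a callback stops training early. The sketch below shows the usual patience-based rule in plain Python. It assumes that `stop_tolerance` acts as an early-stopping patience, which this tutorial does not confirm, and the function name `early_stop_epoch` is hypothetical.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss; if that never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses, for illustration only.
history = [3.0, 2.1, 1.8, 1.9, 1.85, 1.7, 1.75, 1.72, 1.71]
print(early_stop_epoch(history, patience=3))  # prints: (9, 6)
```

In this made-up run the best epoch is 6 and training stops at epoch 9, mirroring the "Best epoch" and "Total epochs" lines of the report.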

## 5 Predict

The predict process is similar to Keras *predict()*:

In [None]:
from data import preproc as pp               # the project's preprocessing module (src/data/preproc.py)
from google.colab.patches import cv2_imshow  # displays an image inline in Colab

start_time = datetime.datetime.now()

# predict() returns the predictions along with their probabilities
predicts, _ = model.predict(x=dtgen.next_test_batch(),  # generator that yields the test batches
                            steps=dtgen.steps['test'],  # number of steps before prediction finishes
                            ctc_decode=True,            # decode the output with CTC (connectionist temporal classification)
                            verbose=1)                  # display a progress bar

# decode to string
predicts = [dtgen.tokenizer.decode(x[0]) for x in predicts]       # decode the first candidate's tokens into text
ground_truth = [x.decode() for x in dtgen.dataset['test']['gt']]  # reference transcriptions, decoded from bytes to str

total_time = datetime.datetime.now() - start_time

# mount predict corpus file
with open(os.path.join(output_path, "predict.txt"), "w") as lg:  # write each label (TE_L) and prediction (TE_P) pair
    for pd, gt in zip(predicts, ground_truth):
        lg.write(f"TE_L {gt}\nTE_P {pd}\n")

for i, item in enumerate(dtgen.dataset['test']['dt'][:10]):  # show the first 10 test images with their label and prediction
    print("=" * 1024, "\n")
    cv2_imshow(pp.adjust_to_see(item))
    print(ground_truth[i])
    print(predicts[i], "\n")
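The `predict.txt` file written above alternates `TE_L` (label) and `TE_P` (prediction) lines, so it can be read back later without rerunning the model. The following is a minimal sketch of such a reader. The function name `read_predict_corpus` is hypothetical, and the sample text is made up for illustration.

```python
def read_predict_corpus(lines):
    """Pair each TE_L (label) line with the TE_P (prediction) line that follows."""
    pairs = []
    label = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("TE_L "):
            label = line[len("TE_L "):]
        elif line.startswith("TE_P ") and label is not None:
            pairs.append((label, line[len("TE_P "):]))
            label = None
    return pairs


# Made-up content in the same format as predict.txt.
sample = "TE_L the quick fox\nTE_P the quick f0x\nTE_L hello\nTE_P hello\n"
for gt, pd in read_predict_corpus(sample.splitlines()):
    print(gt == pd, "|", gt, "|", pd)
```

With a real file, passing the open file object (for example `open(path)`) as `lines` works the same way, since the function strips the trailing newline itself.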

## 6 Evaluate

Evaluation is a more manual process. Here we use `ocr_metrics`, but feel free to implement other metrics instead. The function takes four parameters:

* predicts
* ground_truth
* norm_accentuation (calculation with/without accentuation)
* norm_punctuation (calculation with/without punctuation marks)

In [None]:
# compute the OCR metrics (CER, WER and SER) from the predictions and the ground truth

from data import evaluation

evaluate = evaluation.ocr_metrics(predicts, ground_truth)  # character, word and sequence error rates, in that order

e_corpus = "\n".join([
    f"Total test images:    {dtgen.size['test']}",
    f"Total time:           {total_time}",
    f"Time per item:        {total_time / dtgen.size['test']}\n",
    f"Metrics:",
    f"Character Error Rate: {evaluate[0]:.8f}",#CER
    f"Word Error Rate:      {evaluate[1]:.8f}",#WER
    f"Sequence Error Rate:  {evaluate[2]:.8f}"#SER
])

with open(os.path.join(output_path, "evaluate.txt"), "w") as lg:
    lg.write(e_corpus)
    print(e_corpus)
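For readers who want to implement their own metrics, the sketch below shows one common way to compute the three error rates with a plain Levenshtein distance. It is an illustration, not the project's `ocr_metrics`: the function names are hypothetical, the accentuation and punctuation options are left out, and dividing by the longer of the two sequences is an assumption (many definitions divide by the reference length instead).

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def error_rates(predicts, ground_truth):
    """Return the mean (CER, WER, SER) over paired prediction/reference lines."""
    cer, wer, ser = [], [], []
    for pd, gt in zip(predicts, ground_truth):
        cer.append(edit_distance(pd, gt) / max(len(pd), len(gt), 1))
        pd_w, gt_w = pd.split(), gt.split()
        wer.append(edit_distance(pd_w, gt_w) / max(len(pd_w), len(gt_w), 1))
        ser.append(float(pd != gt))  # 1.0 when the whole line is wrong
    n = max(len(cer), 1)
    return sum(cer) / n, sum(wer) / n, sum(ser) / n


# Made-up predictions and references, for illustration only.
cer, wer, ser = error_rates(["the quick f0x", "hello"],
                            ["the quick fox", "hello"])
print(f"CER {cer:.4f}  WER {wer:.4f}  SER {ser:.4f}")
```

A single wrong character costs little in CER, more in WER, and the full line in SER, which is why the three numbers are usually reported together.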