# Optical Character Recognition using TAO OCRNet

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is re-trained for use on a different task.

Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

<img align="center" src="https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png" width="1080">

## Learning Objectives
In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Take a pretrained OCRNet model and fine-tune it on the IAMDATA handwriting dataset

## Table of Contents

This notebook shows an example use case of OCRNet using Train Adapt Optimize (TAO) Toolkit.

0. [Set up env variables and map drives](#head-0)
1. [Installing the TAO launcher](#head-1)
2. [Prepare dataset and pre-trained model](#head-2) <br>
    2.1 [Download pre-trained model](#head-2-1) <br>
3. [Provide training specification](#head-3)
4. [Run TAO training](#head-4)
5. [Evaluate trained models](#head-5)


## 0. Set up env variables and map drives <a class="anchor" id="head-0"></a>

When using the purpose-built pretrained models from NGC, make sure to set the `$KEY` environment variable to the key mentioned in the model overview. Failing to do so can lead to errors when loading them as pretrained models.

The TAO launcher uses docker containers under the hood, and **for our data and results directories to be visible to the docker container, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options such as environment variables and the amount of shared memory available to the TAO launcher. <br>

`IMPORTANT NOTE:` The code below creates a sample `~/.tao_mounts.json` file. Here, we map the directories in which we save the data, specs, results and cache. Configure it for your specific setup so that these directories are correctly visible to the docker container.


In [None]:
import os

# Define the local project directory that will be mapped into the TAO docker session.
# The path below is only an example from the author's machine; replace it with your own,
# e.g. %env LOCAL_PROJECT_DIR=/path/to/local/tao-experiments
%env LOCAL_PROJECT_DIR=/hdd_10t/tylerz/ocrnet/dev_blog/project

os.environ["HOST_DATA_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "data", "ocrnet")
os.environ["HOST_RESULTS_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "ocrnet")

# Set this path if you don't run the notebook from the samples directory.
# %env NOTEBOOK_ROOT=/path/to/local/tao-experiments/ocrnet
# The sample spec files are present in the same path as the downloaded samples.
os.environ["HOST_SPECS_DIR"] = os.path.join(
    os.getenv("NOTEBOOK_ROOT", os.getcwd()),
    "specs"
)


In [None]:
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR

In [None]:
# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
       # Mapping the project, data, specs and results directories
       {
           "source": os.environ["LOCAL_PROJECT_DIR"],
           "destination": "/workspace/tao-experiments"
       },
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
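
Because the launcher can only see directories listed under `Mounts`, it can help to confirm that every `source` path actually exists on the host before launching a job. The following is a minimal sketch of such a check. The helper name `missing_mount_sources` is hypothetical and not part of TAO, and the demo runs against a throwaway file rather than your real `~/.tao_mounts.json`.

```python
import json
import os
import tempfile


def missing_mount_sources(mounts_path):
    """Return the "source" paths in a mounts file that are not directories on the host."""
    with open(mounts_path) as f:
        config = json.load(f)
    return [
        mount["source"]
        for mount in config.get("Mounts", [])
        if not os.path.isdir(mount["source"])
    ]


# Self-contained demo on a temporary mounts file (hypothetical paths).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tao_mounts.json")
    demo_config = {
        "Mounts": [
            {"source": tmp, "destination": "/data"},
            {"source": os.path.join(tmp, "does_not_exist"), "destination": "/results"},
        ]
    }
    with open(demo_path, "w") as f:
        json.dump(demo_config, f)
    demo_missing = missing_mount_sources(demo_path)
    print(demo_missing)
```

To check the real configuration, you could call `missing_mount_sources(os.path.expanduser("~/.tao_mounts.json"))` and expect an empty list.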

## 1. Installing the TAO launcher <a class="anchor" id="head-1"></a>
The TAO launcher is a Python package distributed as a wheel on PyPI. You may install the launcher by executing the install cell below.

Please note that TAO Toolkit recommends running the TAO launcher in a virtual environment with Python 3.6.9. You may follow the instructions on this [page](https://virtualenvwrapper.readthedocs.io/en/latest/install.html) to set up a Python virtual environment using the `virtualenv` and `virtualenvwrapper` packages. Once you have set up virtualenvwrapper, set the version of Python to be used in the virtual environment with the `VIRTUALENVWRAPPER_PYTHON` variable. You may do so by running

```sh
export VIRTUALENVWRAPPER_PYTHON=/path/to/bin/python3.x
```
where x >= 6 and <= 8

We recommend performing this step first and then launching the notebook from the virtual environment. In addition to installing the TAO Python package, please make sure the following software requirements are met:
* python >=3.6.9 < 3.8.x
* docker-ce > 19.03.5
* docker-API 1.40
* nvidia-container-toolkit > 1.3.0-1
* nvidia-container-runtime > 3.4.0-1
* nvidia-docker2 > 2.5.0-1
* nvidia-driver > 455+
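
As a quick sanity check of the list above, the sketch below reports the running Python version and whether a few commands are on `PATH`. It assumes that `docker` and `nvidia-smi` are the command names installed by the docker and driver packages, and it only tests for presence, not for the minimum versions listed. `find_missing_tools` is a hypothetical helper, not a TAO function.

```python
import shutil
import sys


def find_missing_tools(tools):
    """Return the command names from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Commands assumed to be provided by the packages listed above.
required_tools = ["docker", "nvidia-smi"]
missing = find_missing_tools(required_tools)

print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Missing tools:", missing if missing else "none")
```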

Once you have installed the prerequisites, log in to the docker registry nvcr.io by running the command below

```sh
docker login nvcr.io
```

You will be prompted to enter a username and password. The username is `$oauthtoken` and the password is the API key generated from `ngc.nvidia.com`. Please follow the instructions in the [NGC setup guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#generating-api-key) to generate your own API key.


In [None]:
# SKIP this step IF you have already installed the TAO launcher.
!pip3 install nvidia-tao

In [None]:
# View the versions of the TAO launcher
!tao info

## 2. Prepare dataset and pre-trained model <a class="anchor" id="head-2"></a>

We will be using the IAM handwriting dataset. For more details, please visit
https://fki.tic.heia-fr.ch/databases/iam-handwriting-database. Please download the IAMDATA archive from that page to `$HOST_DATA_DIR/`

In [None]:
# Check the dataset is present
!if [ ! -f $HOST_DATA_DIR/iamdata.zip ]; then echo 'IAMDATA zip file not found, please download.'; else echo 'Found IAMDATA zip file.';fi

In [None]:
# unpack 
!unzip -u $HOST_DATA_DIR/iamdata.zip -d $HOST_DATA_DIR/

In [None]:
# verify
!ls -l $HOST_DATA_DIR/iamdata/test

In [None]:
# Convert the IAMDATA train split to TAO Toolkit OCRNet format
!python preprocess_data.py --images_dir=$HOST_DATA_DIR/iamdata/train/images \
                           --labels_dir=$HOST_DATA_DIR/iamdata/train/gt \
                           --output_images_dir=$HOST_DATA_DIR/iamdata/train/processed \
                           --gt_file_path=$HOST_DATA_DIR/iamdata/train/gt.txt \
                           --character_list_path=$HOST_DATA_DIR/iamdata/train/character_list.txt


In [None]:
# Convert the IAMDATA test split to TAO Toolkit OCRNet format
!python preprocess_data.py --images_dir=$HOST_DATA_DIR/iamdata/test/images \
                           --labels_dir=$HOST_DATA_DIR/iamdata/test/gt \
                           --output_images_dir=$HOST_DATA_DIR/iamdata/test/processed \
                           --gt_file_path=$HOST_DATA_DIR/iamdata/test/gt.txt \
                           --character_list_path=$HOST_DATA_DIR/iamdata/test/character_list.txt

In [None]:
# Set the path from the perspective of the TAO docker container
%env DATA_DIR = /data
%env SPECS_DIR = /specs
%env RESULTS_DIR = /results

Next, we convert the raw dataset (images plus the label list) to LMDB format. LMDB is a memory-mapped key-value database, so packing the dataset into it generally gives better data I/O bandwidth than reading many small image files. If you are working on a remote file system shared by several users at the same time, skip the following steps and use OCRNet's raw dataset loader instead.

In [None]:
# Convert the raw train dataset to lmdb
print("Converting the training set to LMDB.")
!tao model ocrnet dataset_convert -e $SPECS_DIR/experiment.yaml \
                            dataset_convert.input_img_dir=/ \
                            dataset_convert.gt_file=$DATA_DIR/iamdata/train/gt.txt \
                            dataset_convert.results_dir=$DATA_DIR/iamdata/train/lmdb

In [None]:
# Convert the raw test dataset to lmdb
print("Converting the testing set to LMDB.")
!tao model ocrnet dataset_convert -e $SPECS_DIR/experiment.yaml \
                            dataset_convert.input_img_dir=/ \
                            dataset_convert.gt_file=$DATA_DIR/iamdata/test/gt.txt \
                            dataset_convert.results_dir=$DATA_DIR/iamdata/test/lmdb

In [None]:
!ls -rlt $HOST_DATA_DIR/iamdata/train/lmdb

Additionally, if you already have your own dataset in a volume (or folder), you can mount the volume on `HOST_DATA_DIR` (or create a soft link). For example:
```bash
# if your dataset is in /dev/sdc1
mount /dev/sdc1 $HOST_DATA_DIR

# if your dataset is in folder /var/dataset
ln -sf /var/dataset $HOST_DATA_DIR
```

### 2.1 Download pre-trained model <a class="anchor" id="head-2-1"></a>

We will use the NGC CLI to get the pre-trained models. For more details, go to [ngc.nvidia.com](https://ngc.nvidia.com) and click SETUP in the navigation bar.

In [None]:
# Installing NGC CLI on the local machine.
## Download and install
%env CLI=ngccli_cat_linux.zip
!mkdir -p $HOST_RESULTS_DIR/ngccli

# Remove any previously existing CLI installations
!rm -rf $HOST_RESULTS_DIR/ngccli/*
!wget "https://ngc.nvidia.com/downloads/$CLI" -P $HOST_RESULTS_DIR/ngccli
!unzip -u "$HOST_RESULTS_DIR/ngccli/$CLI" -d $HOST_RESULTS_DIR/ngccli/
!rm $HOST_RESULTS_DIR/ngccli/*.zip 
os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("HOST_RESULTS_DIR", ""), os.getenv("PATH", ""))

In [None]:
!ngc registry model list nvidia/tao/ocrnet:*

In [None]:
!mkdir -p $HOST_RESULTS_DIR/pretrained_ocrnet/

In [None]:
# Pull pretrained model from NGC
!ngc registry model download-version nvidia/tao/ocrnet:trainable_v1.0 --dest $HOST_RESULTS_DIR/pretrained_ocrnet

In [None]:
print("Check that model is downloaded into dir.")
!ls -l $HOST_RESULTS_DIR/pretrained_ocrnet/ocrnet_vtrainable_v1.0

## 3. Provide training specification <a class="anchor" id="head-3"></a>
* Dataset settings for training
    * In order to use the newly generated dataset, update the dataset_config parameter in the spec file at `$HOST_SPECS_DIR/experiment.yaml`
    * You also need to prepare the new `character_list_file`.
* Other training (hyper-)parameters such as batch size, number of epochs, learning rate, etc.
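
The `preprocess_data.py` step earlier in this notebook already writes a character list for IAMDATA through `--character_list_path`. For a custom dataset, a character list could be derived from the labels roughly as in the sketch below. It assumes the file holds one character per line and that the model is case-sensitive. `build_character_list` and the sample labels are hypothetical.

```python
def build_character_list(labels):
    """Return the sorted unique non-whitespace characters found in the labels."""
    characters = set()
    for label in labels:
        characters.update(ch for ch in label if not ch.isspace())
    return sorted(characters)


# Hypothetical labels; in practice these would come from your ground-truth file.
sample_labels = ["Move", "to", "stop", "Mr."]
character_list = build_character_list(sample_labels)

# One character per line (assumed layout of the character list file).
character_list_text = "\n".join(character_list)
print(character_list_text)
```

If you train a case-insensitive model, you would likely lowercase the labels before collecting characters.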

In [None]:
!cat $HOST_SPECS_DIR/experiment.yaml

## 4. Run TAO training <a class="anchor" id="head-4"></a>
* Provide the sample spec file and the output directory location for models
* WARNING: training can take from several hours to a full day to complete

In [None]:
!mkdir -p $HOST_RESULTS_DIR/experiment_dir_unpruned

In [None]:
!tao model ocrnet train -e $SPECS_DIR/experiment.yaml \
              train.results_dir=$RESULTS_DIR/experiment_dir_unpruned \
              train.pretrained_model_path=$RESULTS_DIR/pretrained_ocrnet/ocrnet_vtrainable_v1.0/ocrnet_resnet50.tlt \
              train.num_epochs=20 \
              train.optim.lr=1.0 \
              dataset.train_dataset_dir=[$DATA_DIR/iamdata/train/lmdb] \
              dataset.val_dataset_dir=$DATA_DIR/iamdata/test/lmdb \
              dataset.character_list_file=$DATA_DIR/iamdata/train/character_list.txt
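
The `key=value` arguments after `-e` appear to be dotted overrides applied on top of `experiment.yaml`, so `train.optim.lr=1.0` would target the `lr` field nested under `train` and `optim`. The sketch below illustrates that reading on a plain dictionary. It is an assumption about the semantics, not the launcher's actual parser, and `apply_override` and the spec fragment are hypothetical.

```python
import copy


def apply_override(config, override):
    """Return a copy of `config` with one `dotted.key=value` override applied."""
    dotted_key, value = override.split("=", 1)
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the nesting, creating intermediate sections if needed.
        node = node.setdefault(key, {})
    # Values are kept as strings here; a real parser would convert types.
    node[leaf] = value
    return result


# Hypothetical fragment of a spec, standing in for experiment.yaml.
spec = {"train": {"num_epochs": 10, "optim": {"lr": 0.1}}}

for item in ["train.num_epochs=20", "train.optim.lr=1.0"]:
    spec = apply_override(spec, item)

print(spec)
```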

## 5. Evaluate trained models <a class="anchor" id="head-5"></a>

In [None]:
print('Trained:')
print('---------------------')
!ls -ltrh $HOST_RESULTS_DIR/experiment_dir_unpruned/

In [None]:
!tao model ocrnet evaluate -e $SPECS_DIR/experiment.yaml \
                 evaluate.results_dir=$RESULTS_DIR/experiment_dir_unpruned \
                 evaluate.checkpoint=$RESULTS_DIR/experiment_dir_unpruned/best_accuracy.pth \
                 evaluate.test_dataset_dir=$DATA_DIR/iamdata/test/lmdb \
                 dataset.character_list_file=$DATA_DIR/iamdata/train/character_list.txt
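
This notebook does not state how the evaluation metric is computed. One common choice for text recognition is exact-match accuracy, where a prediction counts as correct only if the whole string equals the label. The sketch below shows that metric under this assumption. `exact_match_accuracy` and the sample strings are hypothetical and are not output from `tao model ocrnet evaluate`.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


# Hypothetical recognized strings and ground-truth labels.
predicted = ["Move", "to", "stcp", "Mr."]
expected = ["Move", "to", "stop", "Mr."]
accuracy = exact_match_accuracy(predicted, expected)
print(f"accuracy = {accuracy:.2f}")
```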