<a href="https://colab.research.google.com/github/GruAna/VU/blob/master/train_EasyOCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EasyOCR

## Packages

In [1]:
import os
import cv2 as cv

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
!cp /content/drive/MyDrive/Colab_Notebooks/VU/utils.py /content
!mv /content/utils.py /content/utilities.py

In [4]:
from utilities import *

## Setup

## Dataset loading (training data)

**CTW1500 dataset**

Get the images. Update the image directory path below **manually**.

In [5]:
# path to image directory, get full path to all files
imgs_dir = '/content/drive/MyDrive/Colab_Notebooks/VU/FewImages/images/'
(_, _, filenames) = next(os.walk(imgs_dir))
filenames.sort()
list_img_paths = [os.path.join(imgs_dir, file) for file in filenames]
n_imgs = len(list_img_paths)

In [6]:
# load images (cv.imread returns None for files it cannot read)
train_images = [cv.imread(file) for file in list_img_paths]

Get the paths to the label files. Update the label directory path below **manually**.

In [7]:
# path to label directory, get full path to all files
labels_dir = '/content/drive/MyDrive/Colab_Notebooks/VU/FewImages/labelsxml/'
(_, _, xml_files) = next(os.walk(labels_dir))
xml_files.sort()
list_xml_paths = [os.path.join(labels_dir, file) for file in xml_files]
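
Images and label files are matched purely by position in the two sorted lists, so a missing or extra file in either folder would silently shift the pairing. A minimal sketch of matching by base name instead is shown below; the helper `pair_by_stem` and the sample file names are hypothetical and not part of the original notebook.

```python
import os

def pair_by_stem(image_names, label_names):
    """Return (image, label) pairs whose base names match, plus unmatched images."""
    # map the base name (without extension) of every label file to its full name
    labels = {os.path.splitext(n)[0]: n for n in label_names}
    pairs, unmatched = [], []
    for img in sorted(image_names):
        stem = os.path.splitext(img)[0]
        if stem in labels:
            pairs.append((img, labels[stem]))
        else:
            unmatched.append(img)
    return pairs, unmatched

# hypothetical file names, for illustration only
pairs, unmatched = pair_by_stem(['0002.jpg', '0001.jpg', '0003.jpg'],
                                ['0001.xml', '0002.xml'])
print(pairs)
print(unmatched)
```

Any name that lands in `unmatched` points at an image without a label file, which is worth resolving before cropping.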

Get the ground truth from every XML file.

In [8]:
# one entry per image, in the same (sorted) order as list_xml_paths
ground_truth = [get_labels_xml(file) for file in list_xml_paths]

# ground_truth is a list with one entry per image; each entry is a list of tuples,
# where the first element is the gt word and the second is an array of
# top-left and bottom-right coordinates (x first, then y)
# format: ('text', [[tl,tl],[br,br]])
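
`get_labels_xml` comes from the author's `utils.py`, which is not shown in this notebook. Purely as an illustration, a parser that produces the same tuple format could look like the sketch below. The function name `parse_boxes_sketch`, the XML layout (`box` elements with `left`, `top`, `width`, `height` attributes and a `label` child), and the use of plain lists instead of NumPy arrays are all assumptions.

```python
import xml.etree.ElementTree as ET

def parse_boxes_sketch(xml_text):
    """Hypothetical stand-in for get_labels_xml that works on an XML string."""
    root = ET.fromstring(xml_text)
    result = []
    for box in root.iter('box'):
        left = int(box.get('left'))
        top = int(box.get('top'))
        width = int(box.get('width'))
        height = int(box.get('height'))
        label = box.findtext('label', default='')
        # ('text', [[x_top_left, y_top_left], [x_bottom_right, y_bottom_right]])
        result.append((label, [[left, top], [left + width, top + height]]))
    return result

# made-up sample annotation in the assumed layout
sample = """<Annotations><image file="0001.jpg">
<box height="20" left="10" top="5" width="40"><label>HELLO</label></box>
</image></Annotations>"""
boxes = parse_boxes_sketch(sample)
print(boxes)
```

The cropping function further down indexes the box as `bbox[0,1]`, so the real helper presumably returns NumPy arrays; wrapping the coordinate list in `np.array` would be the matching change.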

Crop each image so that one output image contains exactly one word, save the crops, and create the corresponding label file: either a single `gt.txt` for all crops or one `.txt` file per crop.

In [9]:
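def image_text_crop(images, ground_truth, one_file=True, result_folder='./results'):
    """
    Crops and saves images based on the bounding box ground truth for each text region.
    Creates text file(s) with the corresponding annotation.

    Parameters:
    - images: loaded images, in the same order as the global `filenames` list
    - ground_truth: per-image list of gt tuples: first the text annotation, second
    an np.array of top-left and bottom-right coordinates, format: ('text', [[tl,tl],[br,br]])
    - one_file: if True, collect all annotations in a single gt.txt; otherwise
    write one .txt file per cropped image
    - result_folder: directory for the cropped images and annotation files
    """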

    # if there are more images than ground-truth entries, drop the extra images;
    # otherwise ground_truth[i] below would raise an IndexError
    gt_length = len(ground_truth)
    if len(images) > gt_length:
        images = images[:gt_length]

    # create the output folder (including missing parent folders) if needed
    os.makedirs(result_folder, exist_ok=True)
    
    all_texts = []
    for i, img in enumerate(images):
        name, ext = os.path.splitext(filenames[i])

        # count regions in one image - used for file naming purposes
        region = 1
        
        for text, bbox in ground_truth[i]:
            # select image within coordinates (bbox)
            cropped = img[bbox[0,1]:bbox[1,1], bbox[0,0]:bbox[1,0]]

            # create image file:
            # name in format "original-NNN.ext" (region number zero-padded to 3 digits)
            new_name = name + '-' + str(region).zfill(3)
            cv.imwrite(os.path.join(result_folder, new_name + ext), cropped)

            # create text annotation file(s)
            if one_file:
                all_texts.append(new_name + ext + '\t' + text)
            else:
                # one file for each image with word
                with open(os.path.join(result_folder, new_name + '.txt'), 'w') as f:
                    f.write(text)
            region += 1
    
    # write the combined annotation file only when it was requested
    if one_file:
        with open(os.path.join(result_folder, 'gt.txt'), 'w') as f:
            for line in all_texts:
                f.write(line + '\n')


In [10]:
image_text_crop(train_images, ground_truth)
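
The crop relies on NumPy slicing, where a negative index counts from the end of the axis, so a bounding box that extends past the image border can produce a wrong or empty crop, and writing an empty crop with `cv.imwrite` typically fails. One possible safeguard, sketched here with plain Python lists, is to clamp each box to the image size first. The helpers `clamp_bbox` and `has_area` are hypothetical additions, not part of the original function.

```python
def clamp_bbox(bbox, img_height, img_width):
    """Clamp [[x1, y1], [x2, y2]] to the image bounds and order the corners."""
    (x1, y1), (x2, y2) = bbox
    x1, x2 = sorted((min(max(x1, 0), img_width), min(max(x2, 0), img_width)))
    y1, y2 = sorted((min(max(y1, 0), img_height), min(max(y2, 0), img_height)))
    return [[x1, y1], [x2, y2]]

def has_area(bbox):
    """True if the box covers at least one pixel."""
    (x1, y1), (x2, y2) = bbox
    return x2 > x1 and y2 > y1

# a box sticking out of a 100 x 50 (width x height) image on both sides
clamped = clamp_bbox([[-5, 10], [120, 40]], img_height=50, img_width=100)
print(clamped, has_area(clamped))

# a box that lies completely below the image collapses to zero height
outside = clamp_bbox([[30, 60], [40, 80]], img_height=50, img_width=100)
print(outside, has_area(outside))
```

Inside the cropping loop, a box for which `has_area` returns `False` could simply be skipped.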

In [34]:
import csv

# convert the tab-separated gt.txt into the labels.csv read by the trainer
with open('./results/gt.txt', 'r') as infile, open('./results/labels.csv', 'w', newline='') as outfile:
    stripped = (line.strip() for line in infile)
    lines = (line.split("\t") for line in stripped if line)
    writer = csv.writer(outfile)
    writer.writerow(['filename', 'words'])
    writer.writerows(lines)
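
To see what the conversion produces without touching any files, the same idea can be run in memory. This is a sketch with made-up sample lines; the helper `tsv_lines_to_csv` is hypothetical. It differs from the cell above in one respect: it splits on the first tab only, so a label that itself contains a tab does not create an extra column.

```python
import csv
import io

def tsv_lines_to_csv(lines):
    """Convert 'filename<TAB>words' lines to CSV text with a header row."""
    out = io.StringIO(newline='')
    writer = csv.writer(out)
    writer.writerow(['filename', 'words'])
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        filename, _, words = line.partition('\t')
        writer.writerow([filename, words])
    return out.getvalue()

# made-up sample lines, including a label with a comma and a blank line
csv_text = tsv_lines_to_csv(['0001-001.jpg\tHELLO\n', '0001-002.jpg\tA,B\n', '\n'])
rows = list(csv.reader(io.StringIO(csv_text, newline='')))
print(rows)
```

`csv.writer` quotes the label that contains a comma; whether the trainer's `pd.read_csv` call, with its custom separator, handles such quoted labels as intended is not checked here.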

In [35]:
! cp ./results/labels.csv /content/all_data/en_val
! cp ./results/labels.csv /content/all_data/en_train_filtered

## Training
### Training setup

Clone the EasyOCR repository, which contains the training code in its
[EasyOCR trainer](https://github.com/JaidedAI/EasyOCR/tree/master/trainer) directory.

In [12]:
!git clone https://github.com/JaidedAI/EasyOCR.git 

fatal: destination path 'EasyOCR' already exists and is not an empty directory.


In [13]:
% cd /content/

/content


In [14]:
! cp -RT /content/EasyOCR/trainer/ /content/

Train a custom model. The following three cells are the content of
[trainer.ipynb](https://github.com/JaidedAI/EasyOCR/blob/master/trainer/trainer.ipynb).

In [15]:
import torch.backends.cudnn as cudnn
import yaml
from train import train
from utils import AttrDict
import pandas as pd

In [16]:
cudnn.benchmark = True
cudnn.deterministic = False

In [17]:
def get_config(file_path):
    with open(file_path, 'r', encoding="utf8") as stream:
        opt = yaml.safe_load(stream)
    opt = AttrDict(opt)
    if opt.lang_char == 'None':
        characters = ''
        for data in opt['select_data'].split('-'):
            csv_path = os.path.join(opt['train_data'], data, 'labels.csv')
            df = pd.read_csv(csv_path, sep='^([^,]+),', engine='python', usecols=['filename', 'words'], keep_default_na=False)
            all_char = ''.join(df['words'])
            characters += ''.join(set(all_char))
        characters = sorted(set(characters))
        opt.character = ''.join(characters)
    else:
        opt.character = opt.number + opt.symbol + opt.lang_char
    os.makedirs(f'./saved_models/{opt.experiment_name}', exist_ok=True)
    return opt
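
When `lang_char` is `'None'`, `get_config` derives the character set from the labels themselves: it joins all words, removes duplicates, and sorts the result. The core of that step can be sketched in isolation; the helper name `build_charset` and the sample words are made up for illustration.

```python
def build_charset(words):
    """Return the sorted set of distinct characters that occur in the labels."""
    return ''.join(sorted(set(''.join(words))))

# made-up labels; note that upper and lower case count as different characters
charset = build_charset(['STOP', 'Bus 12', 'stop'])
print(repr(charset))
```

Sorting follows code point order, so the space and digits come first, then upper-case letters, then lower-case letters.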
 

In [37]:
opt = get_config("config_files/en_filtered_config.yaml")
train(opt, amp=False)

Filtering the images containing characters which are not in opt.character
Filtering the images whose label is longer than opt.batch_max_length
--------------------------------------------------------------------------------
dataset_root: all_data
opt.select_data: ['en_train_filtered']
opt.batch_ratio: ['1']
--------------------------------------------------------------------------------
dataset_root:    all_data	 dataset: en_train_filtered
all_data/en_train_filtered/.ipynb_checkpoints
sub-directory:	/en_train_filtered/.ipynb_checkpoints	 num samples: 178
num total samples of en_train_filtered: 178 x 1.0 (total_data_usage_ratio) = 178
num samples of en_train_filtered per batch: 32 x 1.0 (batch_ratio) = 32


  cpuset_checked))


--------------------------------------------------------------------------------
Total_batch_size: 32 = 32
--------------------------------------------------------------------------------
dataset_root:    all_data/en_val	 dataset: /
all_data/en_val/
sub-directory:	/.	 num samples: 178
--------------------------------------------------------------------------------
No Transformation module specified
model input parameters 64 600 20 1 256 256 97 34 None VGG BiLSTM CTC
Model:
DataParallel(
  (module): Model(
    (FeatureExtraction): VGG_FeatureExtractor(
      (ConvNet): Sequential(
        (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU(inplace=True)
        (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        (3): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): ReLU(inplace=True)
        (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        (6): Con

FileNotFoundError: ignored

In [36]:
! cp ./results/labels.csv /content/all_data/en_train_filtered/.ipynb_checkpoints
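
The training log above suggests that the trainer treated the hidden `.ipynb_checkpoints` folder as a dataset sub-directory. Instead of copying `labels.csv` into it, one possible alternative is to delete such folders before training. The sketch below uses only the standard library; the helper `remove_checkpoint_dirs` is hypothetical, and the demo runs in a temporary folder rather than in `all_data`.

```python
import os
import shutil
import tempfile

def remove_checkpoint_dirs(root):
    """Delete every .ipynb_checkpoints directory below root; return the removed paths."""
    removed = []
    for dirpath, dirnames, _ in os.walk(root):
        if '.ipynb_checkpoints' in dirnames:
            target = os.path.join(dirpath, '.ipynb_checkpoints')
            shutil.rmtree(target)
            # prune the entry so os.walk does not try to descend into it
            dirnames.remove('.ipynb_checkpoints')
            removed.append(target)
    return removed

# demo on a throwaway directory tree
demo_root = tempfile.mkdtemp()
checkpoint_dir = os.path.join(demo_root, 'en_train_filtered', '.ipynb_checkpoints')
os.makedirs(checkpoint_dir)
removed = remove_checkpoint_dirs(demo_root)
print(len(removed), os.path.exists(checkpoint_dir))
```

In this notebook the call would be `remove_checkpoint_dirs('all_data')`, run from `/content` right before `train(opt, amp=False)`.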

In [28]:
! ls -a all_data/en_train_filtered/

.	      0006-002.jpg  0009-008.jpg  0012-012.jpg	0018-010.jpg
..	      0006-003.jpg  0009-009.jpg  0012-013.jpg	0018-011.jpg
0001-001.jpg  0007-001.jpg  0009-010.jpg  0012-014.jpg	0018-012.jpg
0001-002.jpg  0007-002.jpg  0010-001.jpg  0012-015.jpg	0018-013.jpg
0001-003.jpg  0007-003.jpg  0010-002.jpg  0012-016.jpg	0018-014.jpg
0001-004.jpg  0007-004.jpg  0010-003.jpg  0012-017.jpg	0018-015.jpg
0001-005.jpg  0007-005.jpg  0010-004.jpg  0012-018.jpg	0018-016.jpg
0001-006.jpg  0007-006.jpg  0010-005.jpg  0012-019.jpg	0018-017.jpg
0001-007.jpg  0007-007.jpg  0010-006.jpg  0012-020.jpg	0018-018.jpg
0002-001.jpg  0007-008.jpg  0010-007.jpg  0012-021.jpg	0019-001.jpg
0002-002.jpg  0007-009.jpg  0010-008.jpg  0012-022.jpg	0019-002.jpg
0002-003.jpg  0007-010.jpg  0010-009.jpg  0012-023.jpg	0019-003.jpg
0002-004.jpg  0007-011.jpg  0010-010.jpg  0012-024.jpg	0019-004.jpg
0003-001.jpg  0007-012.jpg  0010-011.jpg  0012-025.jpg	0019-005.jpg
0003-002.jpg  0007-013.jpg  0011-001.jpg  0012-026.jpg	0020

Move training images to directory `/content/EasyOCR/trainer/all_data`

In [None]:
% mkdir ./all_data/en_train_filtered

In [None]:
% mkdir ./all_data/en_val

In [None]:
% cp -RT /content/results/ /content/EasyOCR/trainer/all_data/en_train_filtered/

In [None]:
% cp -RT ./all_data/en_train_filtered/ ./all_data/en_val
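
Before starting a training run it can help to confirm that each dataset folder is complete, meaning that every file listed in its `labels.csv` is actually present next to it. A minimal check, assuming the `filename`/`words` header written earlier in this notebook, might look like this; the helper `missing_images` is hypothetical and the demo uses a temporary folder.

```python
import csv
import os
import tempfile

def missing_images(dataset_dir):
    """Return file names listed in labels.csv that are absent from dataset_dir."""
    csv_path = os.path.join(dataset_dir, 'labels.csv')
    with open(csv_path, newline='', encoding='utf8') as f:
        reader = csv.DictReader(f)
        return [row['filename'] for row in reader
                if not os.path.isfile(os.path.join(dataset_dir, row['filename']))]

# demo: two labelled crops, only one of which exists on disk
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'labels.csv'), 'w', newline='', encoding='utf8') as f:
    csv.writer(f).writerows([['filename', 'words'],
                             ['a-001.jpg', 'HELLO'],
                             ['a-002.jpg', 'WORLD']])
open(os.path.join(demo_dir, 'a-001.jpg'), 'wb').close()
missing = missing_images(demo_dir)
print(missing)
```

An empty list for both `all_data/en_train_filtered` and `all_data/en_val` would indicate that the copy steps above completed as intended.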

In [None]:
opt = get_config("/content/EasyOCR/trainer/config_files/en_filtered_config.yaml")
train(opt, amp=False)