# Optical Character Recognition (OCR) - Kindai-OCR

## 1 Validate environment

This notebook was created in Google Colab. Change the runtime type to GPU before running it.

### 1.1 Validate CUDA version

We need to check whether the environment provides a GPU and a CUDA version.

In [1]:
!nvidia-smi

Sun Nov 29 19:36:52 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
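
The CUDA version can also be read programmatically from the `nvidia-smi` header instead of by eye. The following is a minimal sketch using only the standard library; the helper names `parse_cuda_version` and `detect_cuda_version` are illustrative and not part of Kindai-OCR, and the header format is assumed to match the output shown above.

```python
import re
import shutil
import subprocess

def parse_cuda_version(smi_output):
    """Return the CUDA version string found in nvidia-smi text, or None."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return match.group(1) if match else None

def detect_cuda_version():
    """Run nvidia-smi if it is on PATH and parse its header (hypothetical helper)."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi not found, so the runtime is probably not a GPU one
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return parse_cuda_version(result.stdout)

# Header line copied from the output above.
header = "| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |"
print(parse_cuda_version(header))
```

If `detect_cuda_version()` returns None in Colab, the runtime type is most likely still set to CPU.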

### 1.2 Verify CUDA and Torch

The following code checks whether CUDA is available to PyTorch. It should return True before continuing.

In [2]:
import torch
torch.cuda.is_available()
# Returns True if PyTorch can use the GPU, otherwise False.

True

This code checks the GPU device name reported by TensorFlow. In this case it should be '/device:GPU:0'.

In [3]:
import tensorflow as tf
tf.test.gpu_device_name()
# Expected result: '/device:GPU:0'

'/device:GPU:0'

## 2 Running test.py

Following the instructions given by the Kindai-OCR developers, we put the image(s) to process in the test folder and run text detection with the code in test.py. The %run magic below executes a .py file from within the notebook.

In [1]:
%run test

./pretrain/vgg16_bn-6c64b313.pth
Loading text detection model from checkpoint ./pretrain/synweights_4600.pth
total words/phones 5748
./data/test/00-sample_pages_3_sec_3.png

infer/postproc time : 59.043/0.092
save image
./data/test/00-sample_pages_3_sec_2.png

infer/postproc time : 57.668/0.051
save image
./data/test/00-sample_pages_3_sec_1.png

infer/postproc time : 57.458/0.048
save image
./data/test/00-sample_pages_3_sec_4.png

infer/postproc time : 57.850/0.043
save image
elapsed time : 298.41661858558655s
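
To see where the elapsed time goes, the per-image timings can be pulled out of the log. The sketch below assumes the log lines keep the `infer/postproc time : A/B` shape shown above; the function name `summarize_timings` is illustrative.

```python
import re

# Matches lines such as "infer/postproc time : 59.043/0.092".
TIMING = re.compile(r"infer/postproc time\s*:\s*([\d.]+)/([\d.]+)")

def summarize_timings(log_text):
    """Return (image count, total inference seconds, total post-processing seconds)."""
    pairs = [(float(a), float(b)) for a, b in TIMING.findall(log_text)]
    infer = sum(a for a, _ in pairs)
    postproc = sum(b for _, b in pairs)
    return len(pairs), infer, postproc

# Timing lines copied from the run above.
log = """\
infer/postproc time : 59.043/0.092
infer/postproc time : 57.668/0.051
infer/postproc time : 57.458/0.048
infer/postproc time : 57.850/0.043
"""
pages, infer, postproc = summarize_timings(log)
print(f"{pages} images, {infer:.3f}s inference, {postproc:.3f}s post-processing")
```

For this run the four inference times add up to roughly 232 s of the 298 s total, so inference dominates and post-processing is negligible.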


This code extracts the information from the XML file and creates a dataframe.

In [110]:
import pandas as pd
import xml.etree.ElementTree as ET

# Parse the result file as UTF-8 so the Japanese text is decoded correctly.
xmlp = ET.XMLParser(encoding="utf-8")
xtree = ET.parse("result.xml", parser=xmlp)
xroot = xtree.getroot()

columns = ['tag','attributes','text','dpi','file','number','width','height','x','y']
# Collect one row per XML element (DataFrame.append was removed in pandas 2.0).
rows = [{'tag': elem.tag, 'attributes': elem.attrib, 'text': elem.text}
        for elem in xroot.iter()]
# dtype=object keeps the still-empty attribute columns ready for string values.
df = pd.DataFrame(rows, columns=columns, dtype=object)
df.head()

Unnamed: 0,tag,attributes,text,dpi,file,number,width,height,x,y
0,{http://codh.rois.ac.jp/modern-magazine/}paper,{},,,,,,,,
1,{http://codh.rois.ac.jp/modern-magazine/}page,"{'dpi': '100', 'file': '00-sample_pages_3_sec_...",,,,,,,,
2,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '905', 'width': '72', 'x': '332', '...",あいくるし發路事、愛くるし（形、,,,,,,,
3,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '1026', 'width': '76', 'x': '466', ...",あいくち島（合巳。名。あれせも。ものよ,,,,,,,
4,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '1026', 'width': '69', 'x': '592', ...",あいくすりし（（きき」名その人の爲に信身あ,,,,,,,
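
Each tag in the output above is prefixed with the XML namespace in braces, which is how ElementTree reports namespaced elements. If only the short element name is wanted, a small helper can strip the prefix. This is a sketch; `strip_namespace` is an illustrative name, not something provided by Kindai-OCR.

```python
def strip_namespace(tag):
    """Turn '{namespace}name' into 'name'; plain tags are returned unchanged."""
    return tag.rpartition('}')[2]

print(strip_namespace('{http://codh.rois.ac.jp/modern-magazine/}line'))
```

Applied to the dataframe, this would look like `df['tag'] = df['tag'].map(strip_namespace)`.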


After creating the dataframe, each recognized line must be associated with the page (image file) it came from. Moreover, the 'attributes' column holds one dictionary per row. The following code unpacks those dictionaries into the new columns of the dataframe and carries the most recent value of each attribute forward, so that line rows inherit dpi, file and number from the page row that precedes them.

In [112]:
# Attributes found on the page and line elements.
attribute_names = ['dpi', 'file', 'number', 'width', 'height', 'x', 'y']

# Most recent value seen for each attribute; rows that lack an attribute
# reuse it, which is how line rows pick up the file of their page.
last_seen = {name: "" for name in attribute_names}
filled = {name: [] for name in attribute_names}

for attributes in df['attributes']:
    for name in attribute_names:
        if attributes.get(name) is not None:
            last_seen[name] = attributes[name]
        filled[name].append(last_seen[name])

# Assign whole columns at once instead of chained df[col].iloc[i] writes.
for name in attribute_names:
    df[name] = filled[name]

df

Unnamed: 0,tag,attributes,text,dpi,file,number,width,height,x,y
0,{http://codh.rois.ac.jp/modern-magazine/}paper,{},,,,,,,,
1,{http://codh.rois.ac.jp/modern-magazine/}page,"{'dpi': '100', 'file': '00-sample_pages_3_sec_...",,100,00-sample_pages_3_sec_3.png,1,2430,1040,,
2,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '905', 'width': '72', 'x': '332', '...",あいくるし發路事、愛くるし（形、,100,00-sample_pages_3_sec_3.png,1,72,905,332,0
3,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '1026', 'width': '76', 'x': '466', ...",あいくち島（合巳。名。あれせも。ものよ,100,00-sample_pages_3_sec_3.png,1,76,1026,466,0
4,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '1026', 'width': '69', 'x': '592', ...",あいくすりし（（きき」名その人の爲に信身あ,100,00-sample_pages_3_sec_3.png,1,69,1026,592,0
...,...,...,...,...,...,...,...,...,...,...
234,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '269', 'width': '54', 'x': '729', '...",錢つつ頃け入,100,00-sample_pages_3_sec_4.png,1,54,269,729,767
235,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '83', 'width': '40', 'x': '937', 'y...",てこと。,100,00-sample_pages_3_sec_4.png,1,40,83,937,860
236,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '118', 'width': '65', 'x': '660', '...",ふじ,100,00-sample_pages_3_sec_4.png,1,65,118,660,923
237,{http://codh.rois.ac.jp/modern-magazine/}line,"{'height': '89', 'width': '45', 'x': '1059', '...",「と。,100,00-sample_pages_3_sec_4.png,1,45,89,1059,947
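
An alternative to carrying values forward is to walk the XML hierarchy directly, pairing every line with its enclosing page. The sketch below assumes that `line` elements are nested inside `page` elements and carry the attributes seen in the table above. The sample document, the file name `sample.png`, and the function name `lines_with_page` are hypothetical.

```python
import xml.etree.ElementTree as ET

NAMESPACE = 'http://codh.rois.ac.jp/modern-magazine/'

# Hypothetical sample that mimics the structure seen in result.xml.
SAMPLE = """<paper xmlns="http://codh.rois.ac.jp/modern-magazine/">
  <page dpi="100" file="sample.png" number="1" width="2430" height="1040">
    <line x="332" y="0" width="72" height="905">first line</line>
    <line x="466" y="0" width="76" height="1026">second line</line>
  </page>
</paper>"""

def lines_with_page(root):
    """Return one dict per line, tagged with the file of its enclosing page."""
    rows = []
    for page in root.iter('{%s}page' % NAMESPACE):
        for line in page.iter('{%s}line' % NAMESPACE):
            rows.append({
                'file': page.get('file'),
                'text': line.text,
                # Convert the geometry to integers so it can be sorted or compared.
                'x': int(line.get('x')),
                'y': int(line.get('y')),
                'width': int(line.get('width')),
                'height': int(line.get('height')),
            })
    return rows

rows = lines_with_page(ET.fromstring(SAMPLE))
for row in rows:
    print(row['file'], row['x'], row['y'], row['text'])
```

The resulting list can be passed to `pd.DataFrame(rows)`. It contains only line rows, and a page never inherits x or y values from a line on an earlier page.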


This code prints each character with the image name from which it was extracted.

In [134]:
for i in df.index:
    print('new line')
    text = df.text.iloc[i]
    # Skip elements without text (None or NaN), such as paper and page.
    if isinstance(text, str):
        for char in text:
            print(char, " | ", df.file.iloc[i])

new line
new line
new line
あ  |  00-sample_pages_3_sec_3.png
い  |  00-sample_pages_3_sec_3.png
く  |  00-sample_pages_3_sec_3.png
る  |  00-sample_pages_3_sec_3.png
し  |  00-sample_pages_3_sec_3.png
發  |  00-sample_pages_3_sec_3.png
路  |  00-sample_pages_3_sec_3.png
事  |  00-sample_pages_3_sec_3.png
、  |  00-sample_pages_3_sec_3.png
愛  |  00-sample_pages_3_sec_3.png
く  |  00-sample_pages_3_sec_3.png
る  |  00-sample_pages_3_sec_3.png
し  |  00-sample_pages_3_sec_3.png
（  |  00-sample_pages_3_sec_3.png
形  |  00-sample_pages_3_sec_3.png
、  |  00-sample_pages_3_sec_3.png
new line
あ  |  00-sample_pages_3_sec_3.png
い  |  00-sample_pages_3_sec_3.png
く  |  00-sample_pages_3_sec_3.png
ち  |  00-sample_pages_3_sec_3.png
島  |  00-sample_pages_3_sec_3.png
（  |  00-sample_pages_3_sec_3.png
合  |  00-sample_pages_3_sec_3.png
巳  |  00-sample_pages_3_sec_3.png
。  |  00-sample_pages_3_sec_3.png
名  |  00-sample_pages_3_sec_3.png
。  |  00-sample_pages_3_sec_3.png
あ  |  00-sample_pages_3_sec_3.png
れ  |  00-sam