## Deep Learning That Can Read Text

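### [ Table of Contents ]
<br>
STEP1. Prepare the validation dataset

STEP2. Run tests with keras-ocr and Tesseract (Google OCR API optional)

STEP3. Summarize the test results

STEP4. Analyze the results and present a conclusion

[ Retrospective ]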

In [None]:
%%capture
!pip install keras-ocr

In [None]:
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm
from glob import glob

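#### STEP1. Prepare the validation dataset
   * Prepared 20 cover images from a variety of math books
   * Chose covers with varied background patterns
   * Included covers with slanted text, vertical text, and large amounts of text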

In [None]:
images = sorted(glob('/content/drive/MyDrive/LMS/E_12/images/*.png'))
image_names = [x.split("/")[-1] for x in images]
image_names

['img_01.png',
 'img_02.png',
 'img_03.png',
 'img_04.png',
 'img_05.png',
 'img_06.png',
 'img_07.png',
 'img_08.png',
 'img_09.png',
 'img_10.png',
 'img_11.png',
 'img_12.png',
 'img_13.png',
 'img_14.png',
 'img_15.png',
 'img_16.png',
 'img_17.png',
 'img_18.png',
 'img_19.png',
 'img_20.png']

In [None]:
# Preview the 20 images used for recognition
plt.figure(figsize=(18, 24))

for i, (img, name) in tqdm(enumerate(zip(images, image_names))):
    plt.subplot(5, 4, i+1)
    plt.title(f'{name}')

    img_bgr = cv2.imread(img)
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    plt.imshow(img_rgb)
    
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)

    plt.tight_layout()

plt.show()



#### STEP2. Run the tests
  * Test with keras-ocr and Tesseract
  * (The Google OCR API is optional)

### keras_ocr

In [None]:
import keras_ocr

# keras-ocr automatically downloads the pretrained detector and recognizer models.
pipeline = keras_ocr.pipeline.Pipeline()

Looking for /root/.keras-ocr/craft_mlt_25k.h5
Downloading /root/.keras-ocr/craft_mlt_25k.h5
Looking for /root/.keras-ocr/crnn_kurapan.h5
Downloading /root/.keras-ocr/crnn_kurapan.h5


In [None]:
# Elapsed time: 14 min 44 s

keras_images = [keras_ocr.tools.read(img) for img in images]
# Pass the arrays that were already loaded instead of the file paths.
prediction_groups = [pipeline.recognize([img]) for img in tqdm(keras_images)]

100%|██████████| 20/20 [09:19<00:00, 27.99s/it]


In [None]:
def draw_keras_ocr(images, prediction_groups):
  # Plot the predictions
  fig, axs = plt.subplots(nrows=len(images), figsize=(8, 80))
  fig.tight_layout()
  for idx, ax in enumerate(axs):
      keras_ocr.tools.drawAnnotations(image=images[idx], 
                                      predictions=prediction_groups[idx][0], ax=ax)

In [None]:
draw_keras_ocr(keras_images, prediction_groups)


In [None]:
# Inspect prediction_groups: the recognized text and its position in the image

print(type(prediction_groups))
print(prediction_groups[0][0][0])
print(prediction_groups[0][0][0][0])
print(prediction_groups[0][0][0][1])

<class 'list'>
('pearson', array([[176., 284.],
       [279., 284.],
       [279., 303.],
       [176., 303.]], dtype=float32))
pearson
[[176. 284.]
 [279. 284.]
 [279. 303.]
 [176. 303.]]
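
Each prediction is a `(text, box)` pair with four `(x, y)` corners, and the word lists printed further down come back in detection order rather than reading order (for example `'first', 'course', 'a', 'in'`). The sketch below shows one possible way to sort words top-to-bottom and then left-to-right. `sort_reading_order`, `make_box`, and `line_tol` are hypothetical names, the line-grouping heuristic is an assumption and not part of keras-ocr, and the sample boxes are made up for illustration.

```python
def sort_reading_order(predictions, line_tol=0.6):
    """Sort (text, box) pairs top-to-bottom, then left-to-right.

    predictions: list of (text, box); box holds four (x, y) corner points,
                 the same layout as the keras-ocr output printed above.
    line_tol:    fraction of a word's height within which two words count
                 as being on the same line (hypothetical heuristic).
    """
    items = []
    for text, box in predictions:
        xs = [float(p[0]) for p in box]
        ys = [float(p[1]) for p in box]
        y_center = sum(ys) / len(ys)
        items.append((y_center, min(xs), max(ys) - min(ys), text))

    items.sort(key=lambda item: item[0])  # top to bottom

    # Group words whose vertical centers are close to the first word of a line.
    lines = []
    for item in items:
        y_center, _, height, _ = item
        if lines and abs(y_center - lines[-1]["y"]) <= line_tol * max(height, lines[-1]["h"]):
            lines[-1]["words"].append(item)
        else:
            lines.append({"y": y_center, "h": height, "words": [item]})

    # Within each line, order words by their left edge.
    ordered = []
    for line in lines:
        for _, _, _, text in sorted(line["words"], key=lambda w: w[1]):
            ordered.append(text)
    return ordered


def make_box(x, y, w, h):
    """Axis-aligned box as four corners (made-up sample data only)."""
    return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]


sample = [
    ("first", make_box(60, 10, 50, 20)),
    ("course", make_box(120, 12, 60, 20)),
    ("a", make_box(30, 11, 15, 20)),
    ("in", make_box(30, 40, 20, 20)),
    ("probability", make_box(60, 41, 110, 20)),
]
print(sort_reading_order(sample))
```

On real output this would be called as `sort_reading_order(prediction_groups[idx][0])`. The heuristic assumes roughly horizontal lines, so vertical or strongly slanted text would need a different grouping rule.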


In [None]:
# Text recognized by keras-ocr
keras_ocr_dict = {}
keras_ocr_text_list = []

for idx, img in enumerate(images):
  keras_ocr_dict[img] = prediction_groups[idx][0]
  # Keep only the text of each (text, box) pair; an image with no detections yields ().
  keras_ocr_text_list.append([idx, tuple(text for text, _ in keras_ocr_dict[img])])

In [None]:
# List of words recognized by keras-ocr

keras_ocr_text_list

[[0,
  ('pearson',
   'new',
   'international',
   'edition',
   'algebra',
   'michael',
   'artin',
   'second',
   'edition',
   'pearson',
   'always',
   'learning')],
 [1,
  ('first',
   'course',
   'a',
   'in',
   'differential',
   'equations',
   'modeling',
   'with',
   'applications',
   '11e')],
 [2, ('first', 'course', 'a', 'in', 'probability', 'tenth', 'edition')],
 [3, ('abstract', 'algebra', 'a', 'first', 'course', 'second', 'edition')],
 [4,
  ('andabpled',
   'pure',
   'sall',
   '5',
   'the',
   'undergraduate',
   'texts',
   '5',
   'series',
   '',
   'advanced',
   'calculus',
   'intended',
   'for',
   'furnish',
   'thie',
   'back',
   'is',
   'text',
   'that',
   'as',
   'a',
   'courses',
   'bone',
   'of',
   'the',
   'students',
   'undergraduate',
   'educat',
   'tion',
   'in',
   'mathematical',
   'analysis',
   'the',
   'goal',
   'is',
   'rigorously',
   'the',
   'fundamental',
   'within',
   'the',
   'of',
   'to',
   'present',
  

### tesseract-ocr

In [None]:
%%capture
!sudo apt install -y tesseract-ocr
!sudo apt install -y libtesseract-dev

In [None]:
%%capture
!pip install pytesseract

In [None]:
import os
import pytesseract
from PIL import Image
from pytesseract import Output
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# OCR engine modes (--oem):
# 0 - Legacy engine only.
# 1 - Neural nets LSTM engine only.
# 2 - Legacy + LSTM engines.
# 3 - Default, based on what is available.

# Page segmentation modes (--psm):
# 0 - Orientation and script detection (OSD) only.
# 1 - Automatic page segmentation with OSD.
# 2 - Automatic page segmentation, but no OSD, or OCR.
# 3 - Fully automatic page segmentation, but no OSD. (Default)
# 4 - Assume a single column of text of variable sizes.
# 5 - Assume a single uniform block of vertically aligned text.
# 6 - Assume a single uniform block of text.
# 7 - Treat the image as a single text line.
# 8 - Treat the image as a single word.
# 9 - Treat the image as a single word in a circle.
# 10 - Treat the image as a single character.
# 11 - Sparse text. Find as much text as possible in no particular order.
# 12 - Sparse text with OSD.
# 13 - Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

def crop_word_regions(image_path='./images/sample.png', output_path='./output'):
    # Create the output directory (and any missing parents) if needed.
    os.makedirs(output_path, exist_ok=True)
    custom_oem_psm_config = r'--oem 3 --psm 3'
    image = Image.open(image_path)

    recognized_data = pytesseract.image_to_data(
        image, lang='eng',    # use lang='kor' for Korean
        config=custom_oem_psm_config,
        output_type=Output.DICT
    )
    
    top_level = max(recognized_data['level'])
    index = 0
    cropped_image_dict = {}
    for i in range(len(recognized_data['level'])):
        level = recognized_data['level'][i]
    
        if level == top_level:
            left = recognized_data['left'][i]
            top = recognized_data['top'][i]
            width = recognized_data['width'][i]
            height = recognized_data['height'][i]
            
            # NOTE: file names restart at 0000 on every call, so crops from a
            # previous image saved to the same output_path are overwritten.
            output_img_path = os.path.join(output_path, f"{str(index).zfill(4)}.png")
            print(output_img_path)
            cropped_image = image.crop((
                left,
                top,
                left+width,
                top+height
            ))
            cropped_image.save(output_img_path)
            cropped_image_dict[output_img_path] = [(left, top), (left+width, top), (left+width, top+height), (left, top+height)]
            index += 1
    return cropped_image_dict


In [None]:
def recognize_images(cropped_image_path_list):
    custom_oem_psm_config = r'--oem 3 --psm 7'
    
    text_list = []
    for image_path in cropped_image_path_list:
        image = Image.open(image_path)
        recognized_data = pytesseract.image_to_string(
            image, lang='eng',    # use lang='kor' for Korean
            config=custom_oem_psm_config,
            output_type=Output.DICT
        )
        text = recognized_data['text']
        print(text)
        text_list.append(text)
    print("Done")

    return text_list
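
The two functions above crop every word to a file and then run OCR a second time on each crop. `pytesseract.image_to_data` already returns the recognized text and a confidence value next to each box, so a single pass may be enough for some uses. The sketch below shows how that dictionary could be filtered. `words_from_data` and `min_conf` are hypothetical names, and `fake_data` is made-up data that mimics the `Output.DICT` layout so the example runs without Tesseract installed.

```python
def words_from_data(data, min_conf=0.0):
    """Return [(text, (left, top, right, bottom), conf), ...] for word entries.

    data: dictionary shaped like pytesseract.image_to_data(..., output_type=Output.DICT)
    min_conf: drop words whose confidence is below this value (hypothetical parameter)
    """
    words = []
    for i, text in enumerate(data["text"]):
        # conf can be a str, int or float depending on the pytesseract version;
        # rows that are not words carry a confidence of -1.
        conf = float(data["conf"][i])
        text = text.strip()
        if not text or conf < min_conf:
            continue
        left, top = data["left"][i], data["top"][i]
        box = (left, top, left + data["width"][i], top + data["height"][i])
        words.append((text, box, conf))
    return words


# Made-up dictionary for illustration only (one page row, three word rows).
fake_data = {
    "level": [1, 5, 5, 5],
    "left": [0, 10, 80, 150],
    "top": [0, 20, 20, 22],
    "width": [300, 60, 55, 20],
    "height": [400, 18, 18, 15],
    "conf": ["-1", "96", "91", "12"],
    "text": ["", "Algebra", "Artin", "ce"],
}
print(words_from_data(fake_data, min_conf=50))
```

With Tesseract available, `data` would come from the same `pytesseract.image_to_data(image, lang='eng', config=..., output_type=Output.DICT)` call used in `crop_word_regions`. A confidence threshold is one simple way to drop fragments such as `ce`, though a suitable value would have to be tuned on the actual covers.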


In [None]:
!pip install Pillow==9.0.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Pillow==9.0.0
  Downloading Pillow-9.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
     |████████████████████████████████| 4.3 MB 7.4 MB/s 
Installing collected packages: Pillow
  Attempting uninstall: Pillow
    Found existing installation: Pillow 9.1.1
    Uninstalling Pillow-9.1.1:
      Successfully uninstalled Pillow-9.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed Pillow-9.0.0


In [None]:
work_dir = '/content/drive/MyDrive/LMS/E_12/images'
output_dir = work_dir + '/tesseract_output'

tesseract_ocr_dict = {}
for img_path, img_file in zip(images, image_names):
    print(f'---------------{img_file}---------------')
    cropped_image_dict = crop_word_regions(img_path, output_dir)
    
    # Recognize the cropped word-region files prepared above and print the resulting text.
    text_list = recognize_images(cropped_image_dict.keys())
    tesseract_ocr_dict[img_file] = {'text_list': text_list, 'cropped_image_dict': cropped_image_dict}

---------------img_01.png---------------
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0000.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0001.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0002.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0003.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0004.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0005.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0006.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0007.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0008.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0009.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0010.png
/content/drive/MyDrive/LMS/E_12/images/tesseract_output/0011.png
PEARSON

NEW

INTERNAT.

| EDITION

Algebra

Vichael

Artin

Secon

d Edition

AYS LEARWNIWN

ce

PEARSON

Done
---------------img_02.png---------------
/content/drive

In [None]:
tesseract_ocr_text_pos_list = []
tesseract_ocr_text_list = []
for i, (img_path, tesseract_ocr_result) in enumerate(tesseract_ocr_dict.items()):
  tesseract_ocr_text_list.append([i, tesseract_ocr_result['text_list']])
  tesseract_ocr_text_pos_list.append([i, list(tesseract_ocr_result['cropped_image_dict'].values())])

In [None]:
plt.figure(figsize=(15,50))

for i, image in enumerate(images):
    plt.subplot(10, 2, i+1)
    plt.title(f'index: {i}')
    
    text_pos_list = tesseract_ocr_text_pos_list[i][1]
    text_list = tesseract_ocr_text_list[i][1]
    text = ', '.join(text_list)

    for text_pos in text_pos_list:
        w = text_pos[2][0]-text_pos[0][0]
        h = text_pos[2][1]-text_pos[0][1]
        rect = patches.Rectangle(
            (text_pos[0]),
            w, h,
            linewidth=2,
            edgecolor='red',
            fill=False
        )
        plt.gca().add_patch(rect)

    img_bgr = cv2.imread(image)
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    
    plt.imshow(img_rgb)

    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)

    plt.tight_layout()

plt.show()


### STEP3. Summary of test results

##### Results of testing with keras-ocr and Tesseract
<br>

 * Text recognition: keras-ocr detects more words and also recognizes them more accurately

|Image|keras-ocr|Tesseract|
|------|---|---|
|image01|'pearson','new','international','edition','algebra','michael','artin',<br> 'second','edition','pearson','always','learning'|PEARSON NEW INTERNAT. EDITION Algebra Vichael Artin Secon d <br> Edition AYS LEARWNIWN ce PEARSON|
|image02|'first','course','a','in','differential','equations','modeling',<br> 'with','applications','11e'|i ae A First Cour ma DSS Ne FOUATIONS with Modeling Applications ‘a >. ae a s 3 i|
|image03|'first','course','a','in','probability','tenth','edition'|A Hirst (Course in Probability . i r . 1 ,|
|image04|'abstract','algebra','a','first','course','second','edition'|ABSTRACT 2S \   A First    af S ty ta eee|
|image05|'andabpled','pure','sall','5','the','undergraduate','texts','5',<br> 'series','','advanced','calculus','intended','for','furnish','thie','back',<br> 'is','text',...|advanced ( aiculiu SS iS intended a a Text for Cott Aes. that furnish the hack- hon ...|
|image06|'introduction','an','to','mathematics','abstract','robert','bond',<br> 'ja','william','keane','j'|No Introduction ce Abstract Mathematics ee) re Robert iF Bond TRI iF Keane|
|image07|'texts','and','readings','trim','37','mathematics','in','e','l',<br> 'analysis','s','terence','tao','se','qa','300','t325','2000','val','hindustan',<br> 'ho','agency','book','l','lun','l'|& = Analysis I Terence Tao HINDUSTAN BOOK AGENCY|
|image08|'edition','sixth','applied','combinatorics','alan','tucker'|APPLIED COMBINATORICS ALAN F\  TUC (44%|
|image09|'calculus','edition','fourth','spivak','michael'|the CALCULUS Fourth Ree Michael Spivak|
|image10|'james','stewart','calculus','eighth','edition','early','transcendentals'|AM JAMES STEWART eer sl oe Del 1ON [}————— Early Transcenadadentals|
|image11|'calcises','made','easy','themeson','macmiltan'|el ES ete spe es A PRPs a SS ee, en dee hk RO OOO ee ee eee nee|
|image12|'ninth','edition','complex','variables','applications','and',<br> 'zaxtiy','d','x','ward','brown','james','ruel','churchill','va'|ATTA, OF ERAT), Complex Variables and Ap plications rh Soy James<br> Ward Brown Ruel VA (*hurchill|
|image13|'gallian','joseph','a','contemporary','abstract','algebra','ninth', 'edition'|Joseph aN Gallian Sa  COIN TEMPORARY ABS \| RAC 1 ALGEBRA Ninth Edition EEE OO OO EEE OOOO OOOO) ae|
|image14|'i','discrete','mathematical','structures','sixth','edition',<br> 'kolman','busby','ross'|oi DISCRETE MATHEMATICAL STRUCTURES ib aia) Edition KOLLMAN BUSBY ROSS|
|image15|'cengage','discrete','mathematics','applications','with','fifth',<br> 'edition','susanna','s','epp'|CENGAGE y € os aN  se) alce Susanna S. Mele|
|image16|'undergraduate','texts','mathematics','in','utm','kenneth','ross',<br> 'a','elementary','analysis','of','the','theory','calculus','edition','second',<br> 'e1', 'ed'|yore Ay ROSS rlementary Analysis Tne Tneor y Of Calculus  Secon d Edition|
|image17|'11th','edition','nl','lnea','age','anton','chris','howard','rorres',<br> 'wiley','applications','version'|eau edition  Howard Anton  ars RHorres APPLICATIONS VERSION WILEY|
|image18|'elementary','tenth','edition','linear','algebra','version', 'applications'|a  aerate Elementary et Linear al leleleyge: , APPLICATIONS VERSION|
|image19|'oreilly','essential','math','science','data','for','of',<br> 'fundamental','take','control','your','with','data','calculus','probability',<br> 'linear','algebra','statistics','8','early','release','raw','s',<br> 'unedited','hadrien','jean'|O'REILLY Essantiqd\| Matn for Lata Science Take Control] ot Your ata with Fundamenta! Calculus.<br> linear Algebra, Probability & Statistics  ,  7 eee a » Early Ralease er aes BSR eg48. maarien Jean|
|image20|'finitedimensional','vector','spaces','second','edition','paul',<br> 'halmos','r'|FINITE-DIMEN SIONAL _ ‘ VECTOR SPACES en \f 4|

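 * Slanted text: in the third image, keras-ocr detects the slanted text along its tilt, whereas Tesseract does not recognize slanted text
 * Vertical text: keras-ocr recognized almost all vertical text, whereas Tesseract recognized some of it and missed the rest
 * Non-text regions: Tesseract often misreads complex background patterns as text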
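
The comparison above is qualitative. One way to put a number on it is word-level recall against a hand-written reference. The sketch below does this for image01 only, using the two outputs from the table. The `reference` word list is an assumption read off the keras-ocr result, not a verified transcription of the cover, and `tokenize` and `word_recall` are hypothetical helpers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recall(reference_words, ocr_words):
    """Fraction of reference words (counting repeats) found in the OCR output."""
    ref = Counter(reference_words)
    hyp = Counter(ocr_words)
    matched = sum((ref & hyp).values())  # multiset intersection
    return matched / sum(ref.values())


# Assumed reference for image01 (hypothetical, not checked against the cover).
reference = tokenize(
    "Pearson New International Edition Algebra Michael Artin "
    "Second Edition Pearson Always Learning"
)

# Outputs copied from the image01 row of the table.
keras_words = ['pearson', 'new', 'international', 'edition', 'algebra', 'michael',
               'artin', 'second', 'edition', 'pearson', 'always', 'learning']
tesseract_words = tokenize(
    "PEARSON NEW INTERNAT. EDITION Algebra Vichael Artin "
    "Secon d Edition AYS LEARWNIWN ce PEARSON"
)

print(f"keras-ocr recall: {word_recall(reference, keras_words):.2f}")
print(f"Tesseract recall: {word_recall(reference, tesseract_words):.2f}")
```

Under that assumed reference, keras-ocr matches all 12 words and Tesseract matches 7 of them. Exact word matching is strict, so near misses such as `Vichael` count as failures.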

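### STEP4. Result analysis and conclusion
<br>

##### 1. Purpose and scope of the service: <br>       
The service locates and recognizes text on the covers of various math books in order to build a catalog of titles, authors, and similar details.

##### 2. Criteria for judging whether an OCR model fits the service: <br>       
  * Text detection: the number of words detected
  * Text recognition: whether the detected words are read correctly
  * False detection of non-text: how often background patterns and other non-text regions are detected as text
  * Vertical text: whether both horizontal and vertical text are recognized
  * Skewed text: whether slanted or curved text is recognized correctly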
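
The recognition criterion can also be measured at character level, which gives partial credit for near misses such as `vichael` for `michael`. A minimal sketch of character error rate (CER) based on Levenshtein distance follows. `levenshtein` and `cer` are hypothetical helpers written for illustration and are not part of keras-ocr or pytesseract; the word pairs are lowercased forms of entries in the results table.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(cer("michael", "vichael"))
print(cer("international", "internat"))
```

A lower CER is better, and 0 means an exact match. Averaging CER over matched word pairs would give a per-image recognition score, assuming a reference transcription is available for each cover.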

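##### 3. Test results against the criteria and model selection: <br>       

|Criterion|keras-ocr|Tesseract|
|------|---|---|
|1. Text detection|Detects almost every region that contains text|Misses some text, depending on factors such as background color|
|2. Text recognition|Occasionally splits a word in two, but reads most words correctly|Recognition accuracy on detected text is considerably lower|
|3. False detection of non-text|Almost none|Sometimes misreads background shapes as text|
|4. Vertical text|Recognizes vertical text|Recognizes vertical text in some cases and misses it in others|
|5. Skewed text|Recognizes slanted text|Does not recognize slanted text|
<br>

##### Based on these criteria, keras-ocr is selected as the better-suited model: <br>   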

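##### [ Retrospective ]
 * Learned that recognizing text in an image first requires detecting where the text is
 * As the task gets more complex, I find myself relying more on ready-made APIs. I hope the culture of sharing good APIs freely with the public keeps spreading
 * In terms of conveying meaning, text detection may prove more useful in everyday life than object detection in images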

<br>




