Image recognition: simply recognizing Chinese text in an image (Tesseract) #71

Open
Qingquan-Li opened this issue Oct 28, 2017 · 0 comments
Goal: recognize the Chinese text in an image.
Implementation: Python, with the help of the open-source library Tesseract.

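Tesseract is an open-source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract typed, handwritten, or printed text from images. It supports a wide variety of languages.
References: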
https://github.com/tesseract-ocr/tesseract/wiki
https://en.wikipedia.org/wiki/Tesseract_(software)
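
For the "used directly" case, the engine is driven from the command line. Below is a minimal sketch of that call wrapped in Python. The helper names `build_tesseract_command` and `run_tesseract_cli` are hypothetical; `stdout` as the output base and `-l` for the language are regular tesseract CLI arguments. It only works once tesseract and the language data from the steps below are installed.

```python
import subprocess


def build_tesseract_command(image_path, lang="chi_sim"):
    """Return the argument list for a direct tesseract CLI call.

    Passing `stdout` as the output base makes tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract_cli(image_path, lang="chi_sim"):
    """Run the CLI and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

This is equivalent to typing `tesseract <image> stdout -l chi_sim` in a terminal.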

Development environment:

  • macOS
  • Python 3.6
  • brew

1. Install tesseract

brew install tesseract
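
To confirm the install worked, it can help to check that the `tesseract` executable is reachable from Python. This is a small sketch, not part of the original post; the function names are hypothetical, and it assumes Homebrew put the binary on `PATH`.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some releases print the version on stderr
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH")
else:
    print(path, "->", tesseract_version(path))
```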

2. Install the corresponding Python package

pip3 install pytesseract

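
A quick way to verify the packages are importable without actually importing them is sketched below. The helper name `missing_modules` is hypothetical. `PIL` is the import name of Pillow, which pytesseract lists as a dependency, so it is normally installed by the same `pip3` command.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages used in this post are available.
print(missing_modules(["pytesseract", "PIL"]))
```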

3. Download the Chinese training data

tesseract supports many languages: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages

Download the Simplified Chinese data file chi_sim.traineddata from https://github.com/tesseract-ocr/tessdata into the /usr/local/Cellar/tesseract/3.05.01/share/tessdata directory.

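
Whether the file landed in the right place can be checked with a few lines of Python. This is a sketch under the assumption that the tessdata directory is the one named above (it changes with the installed tesseract version); the helper name `has_language_data` is hypothetical.

```python
import os


def has_language_data(tessdata_dir, lang="chi_sim"):
    """Return True if <lang>.traineddata exists in the given tessdata directory."""
    return os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))


# Path taken from the step above; adjust it to the installed version.
TESSDATA_DIR = "/usr/local/Cellar/tesseract/3.05.01/share/tessdata"
print(has_language_data(TESSDATA_DIR))
```

Running `tesseract --list-langs` in a terminal gives the same information from the engine's point of view.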

4. Show the code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

try:
    import Image  # legacy standalone PIL layout
except ImportError:
    from PIL import Image  # Pillow
import pytesseract

# Open the image and run OCR with the Simplified Chinese model (chi_sim)
image = Image.open('/Users/fatli/Desktop/dufu.png')
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)

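
For reuse, the same call can be wrapped in a function that fails early on a wrong path. This is a sketch rather than the original author's code; the name `recognize_text` is hypothetical, and it assumes pytesseract, Pillow, and the chi_sim data are installed as in the steps above.

```python
import os


def recognize_text(image_path, lang="chi_sim"):
    """OCR one image file and return the recognized text."""
    if not os.path.isfile(image_path):
        raise FileNotFoundError("image not found: " + image_path)
    # Imported lazily so that a wrong path is reported before the OCR stack loads.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


# Example call (path from the post above):
# print(recognize_text('/Users/fatli/Desktop/dufu.png'))
```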

Appendix: English recognition
[Screenshot: English text recognition result]
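
Switching to English only changes the `lang` argument. The sketch below builds that argument; the helper name `build_lang` is hypothetical, and it assumes `eng.traineddata` is present in the tessdata directory. Tesseract accepts several language codes joined with `+`, which is useful for images that mix Chinese and English.

```python
def build_lang(*codes):
    """Join tesseract language codes with '+', the separator the engine expects."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


english = build_lang("eng")            # English only
mixed = build_lang("chi_sim", "eng")   # Simplified Chinese plus English

# With an image opened as in the code above:
# pytesseract.image_to_string(image, lang=english)
# pytesseract.image_to_string(image, lang=mixed)
```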

Qingquan-Li changed the title from "Image recognition: simply recognizing Chinese text in an image" to "Image recognition: simply recognizing Chinese text in an image (Tesseract)" on Feb 21, 2018