Skip to content
A curated list of promising OCR resources
Branch: master
Clone or download
Pull request Compare This branch is 18 commits behind wanghaisheng:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
papers
.gitignore
3754个常用汉字列表.txt
6039.txt
LICENSE
README.md
chinese_5039.txt
cid2code.txt
special-character.txt
汉语拼音码表-一级汉字拼音对照表
汉语拼音码表-二级汉字拼音对照表

README.md

awesome-ocr

A curated list of promising OCR resources

Librarys

有2个api
都支持图片
百度自家的 :基本可以放弃
化验单识别:也只能提取化验单上三个字段的一个
OcrKing 从哪来?

OcrKing 源自2009年初 Aven 在数据挖掘中的自用项目,在对技术的执着和爱好的驱动下积累已近七载经多年的积累和迭代,如今已经进化为云架构的集多层神经网络与深度学习于一体的OCR识别系统2010年初为方便更多用户使用,特制作web版文字OCR识别,从始至今 OcrKing一直提供免费识别服务及开发接口,今后将继续提供免费云OCR识别服务。OcrKing从未做过推广,

但也确确实实默默地存在,因为他相信有需求的朋友肯定能找得到。欢迎把 OcrKing 在线识别介绍给您身边有类似需求的朋友!希望这个工具对你有用,谢谢各位的支持!

OcrKing 能做什么?

OcrKing 是一个免费的快速易用的在线云OCR平台,可以将PDF及图片中的内容识别出来,生成一个内容可编辑的文档。支持多种文件格式输入及输出,支持多语种(简体中文,繁体中文,英语,日语,韩语,德语,法语等)识别,支持多种识别方式, 支持多种系统平台, 支持多形式API调用!
Connectionist Temporal Classification is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition, which is how we have been using it at Baidu's Silicon Valley AI Lab.

Warp-CTC是一个可以应用在CPU和GPU上高效并行的CTC代码库 (library) 介绍 CTCConnectionist Temporal Classification作为一个损失函数,用于在序列数据上进行监督式学习,不需要对齐输入数据及标签。比如,CTC可以被用来训练端对端的语音识别系统,这正是我们在百度硅谷试验室所使用的方法。 端到端 系统 语音识别

Papers

Building on recent advances in image caption generation and optical character recognition (OCR), we present a general-purpose, deep learning-based system to decompile an image into presentational markup. While this task is a well-studied problem in OCR, our method takes an inherently different, data-driven approach. Our model does not require any knowledge of the underlying markup language, and is simply trained end-to-end on real-world example data. The model employs a convolutional network for text and layout recognition in tandem with an attention-based neural machine translation system. To train and evaluate the model, we introduce a new dataset of real-world rendered mathematical expressions paired with LaTeX markup, as well as a synthetic dataset of web pages paired with HTML snippets. Experimental results show that the system is surprisingly effective at generating accurate markup for both datasets. While a standard domain-specific LaTeX OCR system achieves around 25% accuracy, our model reproduces the exact rendered image on 75% of examples. 

We present recursive recurrent neural networks with attention modeling (R2AM) for lexicon-free optical character recognition in natural scene images. The primary advantages of the proposed method are: (1) use of recursive convolutional neural networks (CNNs), which allow for parametrically efficient and effective image feature extraction; (2) an implicitly learned character-level language model, embodied in a recurrent neural network which avoids the need to use N-grams; and (3) the use of a soft-attention mechanism, allowing the model to selectively exploit image features in a coordinated way, and allowing for end-to-end training within a standard backpropagation framework. We validate our method with state-of-the-art performance on challenging benchmark datasets: Street View Text, IIIT5k, ICDAR and Synth90k.

Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods

Blogs

特征描述的完整过程 http://dataunion.org/wp-content/uploads/2015/05/640.webp_2.jpg

Presentations

Projects

Commercial products

作者:chenqin
链接:https://www.zhihu.com/question/19593313/answer/18795396
来源:知乎
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。

1,识别率极高。我使用过现在的答案总结里提到的所有软件,但遇到下面这样的表格,除了ABBYY还能保持95%以上的识别率之外(包括秦皇岛三个字),其他所有的软件全部歇菜,数字认错也就罢了,中文也认不出。血泪的教训。
![](https://pic3.zhimg.com/a1b8009516c105556d2a2df319c72d72_b.jpg)
2,自由度高。可以在同一页面手动划分不同的区块,每一个区块也可以分别设置表格或文字;简体繁体英文数字。而此时大部分软件还只能对一个页面设置一种识别方案,要么表格,要么文字。
3,批量操作方便。对于版式雷同的年鉴,将一页的版式设计好,便可以应用到其他页,省去大量重复操作。
4,可以保持原有表格格式,省去二次编辑。跨页识别表格时,选择“识别为EXCEL”,ABBYY可以将表格连在一起,产出的是一整个excel文件,分析起来就方便多了。
5,包括梯形校正,歪斜校正之类的许多图片校正方式,即使扫描得歪了,或者因为书本太厚而导致靠近书脊的部分文字扭曲,都可以校正回来。
  • IRIS
 真正能把中文OCR做得比较专业的,一共也没几家,国内2家,国外2家。国内是文通和汉王,国外是ABBYY和IRIS(台湾原来有2家丹青和蒙恬,这两年没什么动静了)。像大家提到的紫光OCR、CAJViewer、MS Office、清华OCR、包括慧视小灵鼠,这些都是文通的产品或者使用文通的识别引擎,尚书则是汉王的产品,和中晶扫描仪捆绑销售的。这两家的中文识别率都是非常不错的。而国外的2家,主要特点是西方语言的识别率很好,而且支持多种西欧语言,产品化程度也很高,不过中文方面速度和识别率还是有差距的,当然这两年人家也是在不断进步。Google的开源项目,至少在中文方面,和这些家相比,各项性能指标水平差距还蛮大的呢。 

作者:张岩
链接:https://www.zhihu.com/question/19593313/answer/14199596
来源:知乎
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。
https://github.com/cisocrgroup

OCR Databases

OTHERS

You can’t perform that action at this time.