add new lang info
zdenop committed Jun 28, 2015
1 parent e8b6d6f commit dcc457c
Showing 4 changed files with 335 additions and 158 deletions.
10 changes: 5 additions & 5 deletions doc/tesseract.1
@@ -2,12 +2,12 @@
.\" Title: tesseract
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Date: 06/28/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "TESSERACT" "1" "06/12/2015" "\ \&" "\ \&"
.TH "TESSERACT" "1" "06/28/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@@ -158,9 +158,9 @@ print tesseract parameters to the stdout\&.
.RE
.SH "LANGUAGES"
.sp
There are currently language packs available for the following languages:
There are currently language packs available for the following languages (in \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tessdata\fR\m[]):
.sp
\fBara\fR (Arabic), \fBaze\fR (Azerbauijani), \fBbul\fR (Bulgarian), \fBcat\fR (Catalan), \fBces\fR (Czech), \fBchi_sim\fR (Simplified Chinese), \fBchi_tra\fR (Traditional Chinese), \fBchr\fR (Cherokee), \fBdan\fR (Danish), \fBdan\-frak\fR (Danish (Fraktur)), \fBdeu\fR (German), \fBell\fR (Greek), \fBeng\fR (English), \fBenm\fR (Old English), \fBepo\fR (Esperanto), \fBest\fR (Estonian), \fBfin\fR (Finnish), \fBfra\fR (French), \fBfrm\fR (Old French), \fBglg\fR (Galician), \fBheb\fR (Hebrew), \fBhin\fR (Hindi), \fBhrv\fR (Croation), \fBhun\fR (Hungarian), \fBind\fR (Indonesian), \fBita\fR (Italian), \fBjpn\fR (Japanese), \fBkor\fR (Korean), \fBlav\fR (Latvian), \fBlit\fR (Lithuanian), \fBnld\fR (Dutch), \fBnor\fR (Norwegian), \fBpol\fR (Polish), \fBpor\fR (Portuguese), \fBron\fR (Romanian), \fBrus\fR (Russian), \fBslk\fR (Slovakian), \fBslv\fR (Slovenian), \fBsqi\fR (Albanian), \fBspa\fR (Spanish), \fBsrp\fR (Serbian), \fBswe\fR (Swedish), \fBtam\fR (Tamil), \fBtel\fR (Telugu), \fBtgl\fR (Tagalog), \fBtha\fR (Thai), \fBtur\fR (Turkish), \fBukr\fR (Ukrainian), \fBvie\fR (Vietnamese)
\fBafr\fR (Afrikaans) \fBamh\fR (Amharic) \fBara\fR (Arabic) \fBasm\fR (Assamese) \fBaze\fR (Azerbaijani) \fBaze_cyrl\fR (Azerbaijani \- Cyrilic) \fBbel\fR (Belarusian) \fBben\fR (Bengali) \fBbod\fR (Tibetan) \fBbos\fR (Bosnian) \fBbul\fR (Bulgarian) \fBcat\fR (Catalan; Valencian) \fBceb\fR (Cebuano) \fBces\fR (Czech) \fBchi_sim\fR (Chinese \- Simplified) \fBchi_tra\fR (Chinese \- Traditional) \fBchr\fR (Cherokee) \fBcym\fR (Welsh) \fBdan\fR (Danish) \fBdan_frak\fR (Danish \- Fraktur) \fBdeu\fR (German) \fBdeu_frak\fR (German \- Fraktur) \fBdzo\fR (Dzongkha) \fBell\fR (Greek, Modern (1453\-)) \fBeng\fR (English) \fBenm\fR (English, Middle (1100\-1500)) \fBepo\fR (Esperanto) \fBequ\fR (Math / equation detection module) \fBest\fR (Estonian) \fBeus\fR (Basque) \fBfas\fR (Persian) \fBfin\fR (Finnish) \fBfra\fR (French) \fBfrk\fR (Frankish) \fBfrm\fR (French, Middle (ca\&.1400\-1600)) \fBgle\fR (Irish) \fBglg\fR (Galician) \fBgrc\fR (Greek, Ancient (to 1453)) \fBguj\fR (Gujarati) \fBhat\fR (Haitian; Haitian Creole) \fBheb\fR (Hebrew) \fBhin\fR (Hindi) \fBhrv\fR (Croatian) \fBhun\fR (Hungarian) \fBiku\fR (Inuktitut) \fBind\fR (Indonesian) \fBisl\fR (Icelandic) \fBita\fR (Italian) \fBita_old\fR (Italian \- Old) \fBjav\fR (Javanese) \fBjpn\fR (Japanese) \fBkan\fR (Kannada) \fBkat\fR (Georgian) \fBkat_old\fR (Georgian \- Old) \fBkaz\fR (Kazakh) \fBkhm\fR (Central Khmer) \fBkir\fR (Kirghiz; Kyrgyz) \fBkor\fR (Korean) \fBkur\fR (Kurdish) \fBlao\fR (Lao) \fBlat\fR (Latin) \fBlav\fR (Latvian) \fBlit\fR (Lithuanian) \fBmal\fR (Malayalam) \fBmar\fR (Marathi) \fBmkd\fR (Macedonian) \fBmlt\fR (Maltese) \fBmsa\fR (Malay) \fBmya\fR (Burmese) \fBnep\fR (Nepali) \fBnld\fR (Dutch; Flemish) \fBnor\fR (Norwegian) \fBori\fR (Oriya) \fBosd\fR (Orientation and script detection module) \fBpan\fR (Panjabi; Punjabi) \fBpol\fR (Polish) \fBpor\fR (Portuguese) \fBpus\fR (Pushto; Pashto) \fBron\fR (Romanian; Moldavian; Moldovan) \fBrus\fR (Russian) \fBsan\fR (Sanskrit) \fBsin\fR (Sinhala; 
Sinhalese) \fBslk\fR (Slovak) \fBslk_frak\fR (Slovak \- Fraktur) \fBslv\fR (Slovenian) \fBspa\fR (Spanish; Castilian) \fBspa_old\fR (Spanish; Castilian \- Old) \fBsqi\fR (Albanian) \fBsrp\fR (Serbian) \fBsrp_latn\fR (Serbian \- Latin) \fBswa\fR (Swahili) \fBswe\fR (Swedish) \fBsyr\fR (Syriac) \fBtam\fR (Tamil) \fBtel\fR (Telugu) \fBtgk\fR (Tajik) \fBtgl\fR (Tagalog) \fBtha\fR (Thai) \fBtir\fR (Tigrinya) \fBtur\fR (Turkish) \fBuig\fR (Uighur; Uyghur) \fBukr\fR (Ukrainian) \fBurd\fR (Urdu) \fBuzb\fR (Uzbek) \fBuzb_cyrl\fR (Uzbek \- Cyrilic) \fBvie\fR (Vietnamese) \fByid\fR (Yiddish)
.sp
To use a non\-standard language pack named \fBfoo\&.traineddata\fR, set the \fBTESSDATA_PREFIX\fR environment variable so the file can be found at \fBTESSDATA_PREFIX\fR/tessdata/\fBfoo\fR\&.traineddata and give Tesseract the argument \fI\-l foo\fR\&.
.SH "CONFIG FILES AND AUGMENTING WITH USER DATA"
@@ -224,7 +224,7 @@ The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett
.sp
Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&.
.sp
Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/docs/blob/master/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
.sp
Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&.
.sp
159 changes: 109 additions & 50 deletions doc/tesseract.1.asc
@@ -98,57 +98,116 @@ SINGLE OPTIONS
LANGUAGES
---------

There are currently language packs available for the following languages:

*ara* (Arabic),
*aze* (Azerbauijani),
*bul* (Bulgarian),
*cat* (Catalan),
*ces* (Czech),
*chi_sim* (Simplified Chinese),
*chi_tra* (Traditional Chinese),
*chr* (Cherokee),
*dan* (Danish),
*dan-frak* (Danish (Fraktur)),
*deu* (German),
*ell* (Greek),
*eng* (English),
*enm* (Old English),
*epo* (Esperanto),
*est* (Estonian),
*fin* (Finnish),
*fra* (French),
*frm* (Old French),
*glg* (Galician),
*heb* (Hebrew),
*hin* (Hindi),
*hrv* (Croation),
*hun* (Hungarian),
*ind* (Indonesian),
*ita* (Italian),
*jpn* (Japanese),
*kor* (Korean),
*lav* (Latvian),
*lit* (Lithuanian),
*nld* (Dutch),
*nor* (Norwegian),
*pol* (Polish),
*por* (Portuguese),
*ron* (Romanian),
*rus* (Russian),
*slk* (Slovakian),
*slv* (Slovenian),
*sqi* (Albanian),
*spa* (Spanish),
*srp* (Serbian),
*swe* (Swedish),
*tam* (Tamil),
*tel* (Telugu),
*tgl* (Tagalog),
*tha* (Thai),
*tur* (Turkish),
*ukr* (Ukrainian),
There are currently language packs available for the following languages
(in https://github.com/tesseract-ocr/tessdata):

*afr* (Afrikaans)
*amh* (Amharic)
*ara* (Arabic)
*asm* (Assamese)
*aze* (Azerbaijani)
*aze_cyrl* (Azerbaijani - Cyrilic)
*bel* (Belarusian)
*ben* (Bengali)
*bod* (Tibetan)
*bos* (Bosnian)
*bul* (Bulgarian)
*cat* (Catalan; Valencian)
*ceb* (Cebuano)
*ces* (Czech)
*chi_sim* (Chinese - Simplified)
*chi_tra* (Chinese - Traditional)
*chr* (Cherokee)
*cym* (Welsh)
*dan* (Danish)
*dan_frak* (Danish - Fraktur)
*deu* (German)
*deu_frak* (German - Fraktur)
*dzo* (Dzongkha)
*ell* (Greek, Modern (1453-))
*eng* (English)
*enm* (English, Middle (1100-1500))
*epo* (Esperanto)
*equ* (Math / equation detection module)
*est* (Estonian)
*eus* (Basque)
*fas* (Persian)
*fin* (Finnish)
*fra* (French)
*frk* (Frankish)
*frm* (French, Middle (ca.1400-1600))
*gle* (Irish)
*glg* (Galician)
*grc* (Greek, Ancient (to 1453))
*guj* (Gujarati)
*hat* (Haitian; Haitian Creole)
*heb* (Hebrew)
*hin* (Hindi)
*hrv* (Croatian)
*hun* (Hungarian)
*iku* (Inuktitut)
*ind* (Indonesian)
*isl* (Icelandic)
*ita* (Italian)
*ita_old* (Italian - Old)
*jav* (Javanese)
*jpn* (Japanese)
*kan* (Kannada)
*kat* (Georgian)
*kat_old* (Georgian - Old)
*kaz* (Kazakh)
*khm* (Central Khmer)
*kir* (Kirghiz; Kyrgyz)
*kor* (Korean)
*kur* (Kurdish)
*lao* (Lao)
*lat* (Latin)
*lav* (Latvian)
*lit* (Lithuanian)
*mal* (Malayalam)
*mar* (Marathi)
*mkd* (Macedonian)
*mlt* (Maltese)
*msa* (Malay)
*mya* (Burmese)
*nep* (Nepali)
*nld* (Dutch; Flemish)
*nor* (Norwegian)
*ori* (Oriya)
*osd* (Orientation and script detection module)
*pan* (Panjabi; Punjabi)
*pol* (Polish)
*por* (Portuguese)
*pus* (Pushto; Pashto)
*ron* (Romanian; Moldavian; Moldovan)
*rus* (Russian)
*san* (Sanskrit)
*sin* (Sinhala; Sinhalese)
*slk* (Slovak)
*slk_frak* (Slovak - Fraktur)
*slv* (Slovenian)
*spa* (Spanish; Castilian)
*spa_old* (Spanish; Castilian - Old)
*sqi* (Albanian)
*srp* (Serbian)
*srp_latn* (Serbian - Latin)
*swa* (Swahili)
*swe* (Swedish)
*syr* (Syriac)
*tam* (Tamil)
*tel* (Telugu)
*tgk* (Tajik)
*tgl* (Tagalog)
*tha* (Thai)
*tir* (Tigrinya)
*tur* (Turkish)
*uig* (Uighur; Uyghur)
*ukr* (Ukrainian)
*urd* (Urdu)
*uzb* (Uzbek)
*uzb_cyrl* (Uzbek - Cyrilic)
*vie* (Vietnamese)
*yid* (Yiddish)

To use a non-standard language pack named *foo.traineddata*, set the
*TESSDATA_PREFIX* environment variable so the file can be found at
163 changes: 111 additions & 52 deletions doc/tesseract.1.html
@@ -931,56 +931,115 @@ <h2 id="_single_options">SINGLE OPTIONS</h2>
<div class="sect1">
<h2 id="_languages">LANGUAGES</h2>
<div class="sectionbody">
<div class="paragraph"><p>There are currently language packs available for the following languages:</p></div>
<div class="paragraph"><p><strong>ara</strong> (Arabic),
<strong>aze</strong> (Azerbauijani),
<strong>bul</strong> (Bulgarian),
<strong>cat</strong> (Catalan),
<strong>ces</strong> (Czech),
<strong>chi_sim</strong> (Simplified Chinese),
<strong>chi_tra</strong> (Traditional Chinese),
<strong>chr</strong> (Cherokee),
<strong>dan</strong> (Danish),
<strong>dan-frak</strong> (Danish (Fraktur)),
<strong>deu</strong> (German),
<strong>ell</strong> (Greek),
<strong>eng</strong> (English),
<strong>enm</strong> (Old English),
<strong>epo</strong> (Esperanto),
<strong>est</strong> (Estonian),
<strong>fin</strong> (Finnish),
<strong>fra</strong> (French),
<strong>frm</strong> (Old French),
<strong>glg</strong> (Galician),
<strong>heb</strong> (Hebrew),
<strong>hin</strong> (Hindi),
<strong>hrv</strong> (Croation),
<strong>hun</strong> (Hungarian),
<strong>ind</strong> (Indonesian),
<strong>ita</strong> (Italian),
<strong>jpn</strong> (Japanese),
<strong>kor</strong> (Korean),
<strong>lav</strong> (Latvian),
<strong>lit</strong> (Lithuanian),
<strong>nld</strong> (Dutch),
<strong>nor</strong> (Norwegian),
<strong>pol</strong> (Polish),
<strong>por</strong> (Portuguese),
<strong>ron</strong> (Romanian),
<strong>rus</strong> (Russian),
<strong>slk</strong> (Slovakian),
<strong>slv</strong> (Slovenian),
<strong>sqi</strong> (Albanian),
<strong>spa</strong> (Spanish),
<strong>srp</strong> (Serbian),
<strong>swe</strong> (Swedish),
<strong>tam</strong> (Tamil),
<strong>tel</strong> (Telugu),
<strong>tgl</strong> (Tagalog),
<strong>tha</strong> (Thai),
<strong>tur</strong> (Turkish),
<strong>ukr</strong> (Ukrainian),
<strong>vie</strong> (Vietnamese)</p></div>
<div class="paragraph"><p>There are currently language packs available for the following languages
(in <a href="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</a>):</p></div>
<div class="paragraph"><p><strong>afr</strong> (Afrikaans)
<strong>amh</strong> (Amharic)
<strong>ara</strong> (Arabic)
<strong>asm</strong> (Assamese)
<strong>aze</strong> (Azerbaijani)
<strong>aze_cyrl</strong> (Azerbaijani - Cyrilic)
<strong>bel</strong> (Belarusian)
<strong>ben</strong> (Bengali)
<strong>bod</strong> (Tibetan)
<strong>bos</strong> (Bosnian)
<strong>bul</strong> (Bulgarian)
<strong>cat</strong> (Catalan; Valencian)
<strong>ceb</strong> (Cebuano)
<strong>ces</strong> (Czech)
<strong>chi_sim</strong> (Chinese - Simplified)
<strong>chi_tra</strong> (Chinese - Traditional)
<strong>chr</strong> (Cherokee)
<strong>cym</strong> (Welsh)
<strong>dan</strong> (Danish)
<strong>dan_frak</strong> (Danish - Fraktur)
<strong>deu</strong> (German)
<strong>deu_frak</strong> (German - Fraktur)
<strong>dzo</strong> (Dzongkha)
<strong>ell</strong> (Greek, Modern (1453-))
<strong>eng</strong> (English)
<strong>enm</strong> (English, Middle (1100-1500))
<strong>epo</strong> (Esperanto)
<strong>equ</strong> (Math / equation detection module)
<strong>est</strong> (Estonian)
<strong>eus</strong> (Basque)
<strong>fas</strong> (Persian)
<strong>fin</strong> (Finnish)
<strong>fra</strong> (French)
<strong>frk</strong> (Frankish)
<strong>frm</strong> (French, Middle (ca.1400-1600))
<strong>gle</strong> (Irish)
<strong>glg</strong> (Galician)
<strong>grc</strong> (Greek, Ancient (to 1453))
<strong>guj</strong> (Gujarati)
<strong>hat</strong> (Haitian; Haitian Creole)
<strong>heb</strong> (Hebrew)
<strong>hin</strong> (Hindi)
<strong>hrv</strong> (Croatian)
<strong>hun</strong> (Hungarian)
<strong>iku</strong> (Inuktitut)
<strong>ind</strong> (Indonesian)
<strong>isl</strong> (Icelandic)
<strong>ita</strong> (Italian)
<strong>ita_old</strong> (Italian - Old)
<strong>jav</strong> (Javanese)
<strong>jpn</strong> (Japanese)
<strong>kan</strong> (Kannada)
<strong>kat</strong> (Georgian)
<strong>kat_old</strong> (Georgian - Old)
<strong>kaz</strong> (Kazakh)
<strong>khm</strong> (Central Khmer)
<strong>kir</strong> (Kirghiz; Kyrgyz)
<strong>kor</strong> (Korean)
<strong>kur</strong> (Kurdish)
<strong>lao</strong> (Lao)
<strong>lat</strong> (Latin)
<strong>lav</strong> (Latvian)
<strong>lit</strong> (Lithuanian)
<strong>mal</strong> (Malayalam)
<strong>mar</strong> (Marathi)
<strong>mkd</strong> (Macedonian)
<strong>mlt</strong> (Maltese)
<strong>msa</strong> (Malay)
<strong>mya</strong> (Burmese)
<strong>nep</strong> (Nepali)
<strong>nld</strong> (Dutch; Flemish)
<strong>nor</strong> (Norwegian)
<strong>ori</strong> (Oriya)
<strong>osd</strong> (Orientation and script detection module)
<strong>pan</strong> (Panjabi; Punjabi)
<strong>pol</strong> (Polish)
<strong>por</strong> (Portuguese)
<strong>pus</strong> (Pushto; Pashto)
<strong>ron</strong> (Romanian; Moldavian; Moldovan)
<strong>rus</strong> (Russian)
<strong>san</strong> (Sanskrit)
<strong>sin</strong> (Sinhala; Sinhalese)
<strong>slk</strong> (Slovak)
<strong>slk_frak</strong> (Slovak - Fraktur)
<strong>slv</strong> (Slovenian)
<strong>spa</strong> (Spanish; Castilian)
<strong>spa_old</strong> (Spanish; Castilian - Old)
<strong>sqi</strong> (Albanian)
<strong>srp</strong> (Serbian)
<strong>srp_latn</strong> (Serbian - Latin)
<strong>swa</strong> (Swahili)
<strong>swe</strong> (Swedish)
<strong>syr</strong> (Syriac)
<strong>tam</strong> (Tamil)
<strong>tel</strong> (Telugu)
<strong>tgk</strong> (Tajik)
<strong>tgl</strong> (Tagalog)
<strong>tha</strong> (Thai)
<strong>tir</strong> (Tigrinya)
<strong>tur</strong> (Turkish)
<strong>uig</strong> (Uighur; Uyghur)
<strong>ukr</strong> (Ukrainian)
<strong>urd</strong> (Urdu)
<strong>uzb</strong> (Uzbek)
<strong>uzb_cyrl</strong> (Uzbek - Cyrilic)
<strong>vie</strong> (Vietnamese)
<strong>yid</strong> (Yiddish)</p></div>
<div class="paragraph"><p>To use a non-standard language pack named <strong>foo.traineddata</strong>, set the
<strong>TESSDATA_PREFIX</strong> environment variable so the file can be found at
<strong>TESSDATA_PREFIX</strong>/tessdata/<strong>foo</strong>.traineddata and give Tesseract the
@@ -1047,7 +1106,7 @@ <h2 id="_history">HISTORY</h2>
<div class="paragraph"><p>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</p></div>
<div class="paragraph"><p>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <a href="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</a>. With Tesseract 2.00,
See <a href="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</a>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests.
See <a href="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</a> for more
details.</p></div>
@@ -1097,7 +1156,7 @@ <h2 id="_copying">COPYING</h2>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:49:44 CEST
Last updated 2015-06-28 22:23:47 CEST
</div>
</div>
</body>

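Note on the LANGUAGES text touched by this commit: the man page says that a non-standard pack foo.traineddata is found at TESSDATA_PREFIX/tessdata/foo.traineddata and selected with -l foo. A minimal Python sketch of that rule follows. It is not part of the commit; the helper name tesseract_invocation and the example paths are hypothetical, and it assumes the usual "tesseract imagename outputbase -l lang" command form.

```python
import os
from pathlib import Path


def tesseract_invocation(traineddata, image, output_base):
    """Build (argv, env) for running tesseract with a custom language pack.

    Hypothetical helper: it only derives the -l argument and the
    TESSDATA_PREFIX value from the pack's location; it does not run anything.
    """
    traineddata = Path(traineddata)
    # The man page expects the layout <prefix>/tessdata/<lang>.traineddata.
    if traineddata.suffix != ".traineddata" or traineddata.parent.name != "tessdata":
        raise ValueError("expected <prefix>/tessdata/<lang>.traineddata")

    lang = traineddata.stem                  # "foo" for foo.traineddata
    prefix = traineddata.parent.parent       # directory that contains tessdata/

    # Copy the current environment and add TESSDATA_PREFIX for the child process.
    env = dict(os.environ, TESSDATA_PREFIX=str(prefix))
    argv = ["tesseract", str(image), str(output_base), "-l", lang]
    return argv, env
```

With a pack stored at the made-up location /opt/ocr/tessdata/foo.traineddata, calling tesseract_invocation("/opt/ocr/tessdata/foo.traineddata", "page.png", "out") yields an argument list ending in "-l", "foo" and an environment with TESSDATA_PREFIX set to /opt/ocr; the pair can then be passed to subprocess.run(argv, env=env, check=True).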