In [None]:
!mkdir number_plate

In [None]:
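import os
# Each "!" command runs in its own subshell, so "!export" would not persist;
# keep the starting directory in a Python variable instead.
root_dir = os.getcwd()
print(root_dir)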

In [None]:
import os
os.chdir("./number_plate")


In [None]:
pwd

# Training Tesseract 3.04
- Tesseract is an optical character recognition engine for various operating systems.
- Tesseract 3.x is based on traditional computer vision algorithms.
- Tesseract 4 still ships the legacy OCR engine of Tesseract 3.

# Requirements for text input files
Text input files need to meet these criteria:
- ASCII or UTF-8 encoding without BOM
- Unix end-of-line marker ('\n')
- The last character must be an end of line marker ('\n').
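
The three criteria above can be checked before training. The following is a minimal sketch, not part of the original notebook; the function name `check_training_text` is hypothetical.

```python
def check_training_text(data):
    """Return a list of problems found in the raw bytes of a training text file."""
    problems = []
    # A UTF-8 byte order mark is the three bytes EF BB BF at the start of the file.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    # ASCII is a subset of UTF-8, so one decode attempt covers both encodings.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    # Unix line endings use '\n' only, so any '\r' is suspicious.
    if b"\r" in data:
        problems.append("file contains carriage returns (non-Unix line endings)")
    if not data.endswith(b"\n"):
        problems.append("last character is not a newline")
    return problems

# Usage (assumes training_text.txt exists in the working directory):
# with open("training_text.txt", "rb") as f:
#     print(check_training_text(f.read()))
```

An empty list means the file meets all three requirements.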

# Python Code for Generating text file

- Run the Python code below to generate the training_text.txt file.


In [None]:
# Indian number plate generator (general format: state code, 2 digits, 2 letters, 4 digits)
import random
import string

random.seed(0)  # fixed seed so the generated text is reproducible
state_code = ['AP','AR','AS','BR','CG','GA','GJ','HR','HP','JK','JH','KA','KL',
              'MP','MH','MN','ML','MZ','NL','OD','PB','RJ','SK','TN','TS','TR','UK',
              'UP','WB','AN','CH','DN','DD','DL','LD','PY']
punc = ['-', '.', ' ']  # separators placed between the plate fields

def randomStringL(stringLength):
    # Pick one state code; stringLength is unused because every code has two letters.
    return random.choice(state_code)

def randomStringL2(stringLength):
    # string.uppercase exists only in Python 2; ascii_uppercase works in Python 2 and 3.
    letters = string.ascii_uppercase
    return ''.join(random.choice(letters) for i in range(stringLength))

def randomStringD(stringLength):
    letters = string.digits
    return ''.join(random.choice(letters) for i in range(stringLength))

def randomBet():
    # Pick the separator used throughout one plate.
    return random.choice(punc)

output = ''
for i in range(1000):
    rnd = randomBet()
    output = output + (randomStringL(2)+rnd+randomStringD(2)+rnd+randomStringL2(2)+rnd+randomStringD(4)) + '\n'

# Every line, including the last one, ends with '\n' as the training tools require.
with open("training_text.txt", "w") as f:
    f.write(output)
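
As a quick sanity check on the generated file, each line can be matched against the format the generator produces. This is only a sketch: the names `PLATE_RE` and `is_plate_line` are hypothetical, and the pattern accepts any two uppercase letters rather than checking against the `state_code` list.

```python
import re

# Two letters, separator, two digits, the same separator, two letters,
# the same separator, four digits. The separator is one of '-', '.' or ' '.
PLATE_RE = re.compile(r"(?P<state>[A-Z]{2})(?P<sep>[-. ])\d{2}(?P=sep)[A-Z]{2}(?P=sep)\d{4}")

def is_plate_line(line):
    """Return True if the line (with or without a trailing newline) has the plate format."""
    return bool(PLATE_RE.fullmatch(line.rstrip("\n")))

# Usage (assumes training_text.txt exists in the working directory):
# with open("training_text.txt") as f:
#     bad = [line for line in f if not is_plate_line(line)]
# print(len(bad), "malformed lines")
```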

# Training Procedure

### 1. Generate Training Images and Box Files
##### 1.1 Automated Method
- Run the following command for each font in turn to create a matching tif/box file pair.
- Input: a UTF-8 text file (training_text.txt) containing our training text.
 ```sh
 $ text2image --text=training_text.txt --outputbase=[lang].[fontname].exp0 --font='Font Name' --fonts_dir=/path/to/our/fonts
 ```
- Copy the fonts from the "/usr/share/fonts" directory to our data directory, or set the `--fonts_dir` path accordingly.
- Three files (eng.FreeSerifBold.exp0.box, eng.FreeSerifBold.exp0.tif, lang.unicharset) are created in this step.
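
When training on several fonts, the command line can be assembled per font. The sketch below only builds and prints the commands and uses the same flags as the command above. The helper `text2image_command` is hypothetical, the second font in the list is an assumed example, and dropping spaces from the font name mirrors how 'FreeSerif Bold' becomes `FreeSerifBold` in this notebook.

```python
def text2image_command(lang, font_name, text_file, fonts_dir, exp=0):
    """Build the text2image argument list for one font."""
    # The output base follows the [lang].[fontname].exp[num] naming scheme.
    base = "{}.{}.exp{}".format(lang, font_name.replace(" ", ""), exp)
    return [
        "text2image",
        "--text=" + text_file,
        "--outputbase=" + base,
        "--font=" + font_name,
        "--fonts_dir=" + fonts_dir,
    ]

# "FreeSans" is only an assumed second font for illustration.
for font in ["FreeSerif Bold", "FreeSans"]:
    cmd = text2image_command("eng", font, "training_text.txt", "../data/fonts")
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```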

In [None]:
!text2image --text=training_text.txt --outputbase=eng.FreeSerifBold.exp0 --font='FreeSerif Bold' --fonts_dir=../data/fonts 

### 2. Run Tesseract for Training
For each training image/box file pair, run Tesseract in training mode:
```sh
$ tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train
```
- The output of this step is a .tr file, which contains the features of each character on the training page.
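
With more than one font there is one tif/box pair per font, and the same command has to be run for each. A minimal sketch of that pairing follows, under the assumption that the file names follow the naming scheme above; the helper `box_train_commands` is hypothetical.

```python
def box_train_commands(filenames):
    """Return one 'tesseract ... box.train' command per .tif that has a matching .box file."""
    names = set(filenames)
    commands = []
    for name in sorted(names):
        if not name.endswith(".tif"):
            continue
        base = name[:-len(".tif")]  # e.g. eng.FreeSerifBold.exp0
        # Training mode needs the box file next to the image, so skip unpaired images.
        if base + ".box" in names:
            commands.append(["tesseract", name, base, "box.train"])
    return commands

# Usage: import os; print(box_train_commands(os.listdir(".")))
```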

In [None]:
!tesseract eng.FreeSerifBold.exp0.tif eng.FreeSerifBold.exp0 box.train

### 3. Generate the unicharset file
- Tesseract’s unicharset file contains information on each symbol the Tesseract OCR engine is trained to recognize.
- Currently, generating the unicharset file is done in two steps using these commands: unicharset_extractor and set_unicharset_properties.

##### 3.1 unicharset_extractor
Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, use the unicharset_extractor program on the box files generated above:
```sh
$ unicharset_extractor lang.fontname.exp0.box
```
- A unicharset file will be created.
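
To see which symbols the box file contributes, the distinct characters can be listed directly. This sketch only illustrates the idea and is not how `unicharset_extractor` is implemented. It assumes the Tesseract 3 box format, in which each line reads `symbol left bottom right top page`; the helper name is hypothetical.

```python
def symbols_in_box_lines(lines):
    """Collect the distinct symbols named in the lines of a box file."""
    symbols = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from the right: the last five fields are the coordinates and page number.
        parts = line.rsplit(" ", 5)
        if len(parts) == 6:
            symbols.add(parts[0])
    return sorted(symbols)

# Usage (assumes the box file from step 1 exists):
# with open("eng.FreeSerifBold.exp0.box", encoding="utf-8") as f:
#     print(symbols_in_box_lines(f))
```

For the number plate text, the result should contain only uppercase letters, digits and the separators that the renderer emits as boxes.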

In [None]:
!unicharset_extractor eng.FreeSerifBold.exp0.box

##### 3.2 set_unicharset_properties
- This step allows extra properties to be added to the unicharset.
```sh
$ set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=langdata
```
- output_unicharset file will be created.

In [None]:
!set_unicharset_properties --script_dir=../data/langdata -U unicharset -O output_unicharset

For further details on incomplete unicharset properties, see [this issue](https://github.com/tesseract-ocr/tesseract/issues/318).

### 4. The font_properties file
- Create a font_properties text file. The purpose of this file is to provide font style information that will appear in the output when the font is recognized.
- Each line of the font_properties file is formatted as follows: `fontname italic bold fixed serif fraktur`
- Use the default [font_properties](https://raw.githubusercontent.com/tesseract-ocr/langdata/master/font_properties) file.
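
If a font is missing from the default file, a line can be written by hand. In the sketch below each style field is a 0 or 1 flag and the font name contains no spaces, matching the fontname part of the training file names. The helper `font_properties_line` is hypothetical.

```python
def font_properties_line(fontname, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Format one font_properties line: fontname italic bold fixed serif fraktur."""
    flags = [italic, bold, fixed, serif, fraktur]
    return fontname + " " + " ".join(str(int(flag)) for flag in flags)

# FreeSerif Bold is a bold serif font, so only those two flags are set.
print(font_properties_line("FreeSerifBold", bold=True, serif=True))
```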

### 5. Clustering
- When the character features of all the training pages have been extracted, we need to cluster them to create the prototypes.
- The character shape features can be clustered using the shapeclustering, mftraining and cntraining programs:
##### 5.1 mftraining
- mftraining outputs two data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character).
- mftraining also produces a shapetable file, because we did not run shapeclustering.
```sh
$ mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr
```

In [None]:
!mftraining -F ../data/font_properties -U unicharset -O eng.unicharset eng.FreeSerifBold.exp0.tr

##### 5.2 cntraining
- This will output the normproto data file (the character normalization sensitivity prototypes).
```sh 
$ cntraining lang.fontname.exp0.tr
```

In [None]:
!cntraining eng.FreeSerifBold.exp0.tr

In [None]:
!mv inttemp eng.inttemp
!mv pffmtable eng.pffmtable
!mv shapetable eng.shapetable
!rm eng.unicharset
!mv output_unicharset eng.unicharset
!mv normproto eng.normproto

### 6. Putting it all together
- Now collect all the files (shapetable, normproto, inttemp, pffmtable, unicharset) and rename them with a `lang.` prefix (for example `eng.`).
- run combine_tessdata on them as follows:
```sh
$ combine_tessdata lang.
```
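
Before running `combine_tessdata`, it can help to confirm that all five renamed files are present. The following is a minimal sketch: the suffix list comes from the files named above, and the helper `missing_components` is hypothetical.

```python
import os

# The five component files listed above, without the language prefix.
REQUIRED_SUFFIXES = ["shapetable", "normproto", "inttemp", "pffmtable", "unicharset"]

def missing_components(lang, directory="."):
    """Return the names of the prefixed component files that are not in the directory."""
    expected = [lang + "." + suffix for suffix in REQUIRED_SUFFIXES]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]

# Usage: an empty list means every component is in place.
# print(missing_components("eng"))
```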


In [None]:
!combine_tessdata eng.

In [None]:
!mkdir ../Trained_Data

In [None]:
!mv eng.traineddata ../Trained_Data

In [None]:
%cd ..
!rm -rf number_plate

- The resulting eng.traineddata goes in our Trained_Data directory. 
- Tesseract can now recognize text in our language (in theory) with the following:
```sh
$ tesseract image.tif output -l lang
```

In [None]:
!mkdir output

In [None]:
import os
os.environ['TESSDATA_PREFIX'] = "./Trained_Data"

In [None]:
!tesseract images/image4.png output/output_0 -l eng

In [None]:
!tesseract images/image3.1.png output/output_1 -l eng

In [None]:
!tesseract images/image-4.png output/output_2 -l eng