## **Go Tesseract**

`tesseract` : Version `4.1.1`

`languages` : `eng`, `nep`

We're using `WSL` to run Tesseract OCR with Go. Below are the steps to set up and use Tesseract OCR in a `Go` project.

**Go Version**: `1.18`

**GoTesseract Docs**

[Docs](https://pkg.go.dev/github.com/otiai10/gosseract#section-readme)

**GoTesseract GitHub Repository**

[GitHub Repo](https://github.com/otiai10/gosseract)

<hr>
<hr>
<hr>


## **Task**

### **Image Types**

**Machine PDF** : Rajpatra

**Machine But Scanned** : Bar Association

**Scanned But Low Quality** : Find the Image

**Handwritten** : Handwritten Note

**Handwritten Bad Quality** : Find the Image

For each image, run the analysis and document the results below.

### **Prepare Report**

- Some words are misspelled. We have to post-process the `Misspelled Words` using a `Dictionary` or a `Language Model`.

**Average Confidence** : Take the confidence of each word and calculate the average confidence of the entire text.
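The averaging step can be sketched as below. This is a minimal sketch that assumes the per-word confidences are already collected into a slice (the numbers in `main` are made up). With gosseract, the per-word values should be available from `client.GetBoundingBoxes(gosseract.RIL_WORD)`, whose results carry a `Confidence` field. That call is left out here so the sketch runs without Tesseract installed, and it should be checked against the docs linked above.

```go
package main

import "fmt"

// averageConfidence returns the arithmetic mean of per-word confidence
// scores (0-100). It returns 0 for an empty slice to avoid dividing by zero.
func averageConfidence(confs []float64) float64 {
	if len(confs) == 0 {
		return 0
	}
	sum := 0.0
	for _, c := range confs {
		sum += c
	}
	return sum / float64(len(confs))
}

func main() {
	// Hypothetical word confidences for one page.
	confs := []float64{96, 88, 41, 75}
	fmt.Printf("average confidence: %.2f\n", averageConfidence(confs)) // 75.00
}
```

A single low-confidence word (41 here) is hidden by the average, so it may be worth reporting the minimum alongside it.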

**Identify Common Pitfalls** : Identify common pitfalls or errors that Tesseract makes with each type of image.

<hr>
<hr>
<hr>

# **Working of Tesseract**

Before Tesseract reads a single letter, it must understand the "geometry" of the page. This is handled by a library called `Leptonica` and Tesseract's internal layout engine.

The modern `Tesseract` OCR engine (i.e. `V4` or `V5`) combines `Old School Computer Vision` techniques with `LSTM` based `Deep Learning` models.

## **Phase 1: Pre-Processing and Layout Analysis**

### **Adaptive Thresholding / Binarization**

First, the image is pre-processed using various techniques like `Binarization`, `Noise Removal`, `Skew Correction`, etc. to enhance the quality of the image for better text recognition.

Tesseract does not read color or grayscale images directly. It converts everything to Binary (Black and White).

**How it works:**

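It doesn't just say "Anything darker than 50% is black." By default, Tesseract `4.x` picks the cut-off from the image itself using `Otsu's Method`, which gives one threshold for the whole image. Tesseract `5` additionally offers `Local Adaptive Thresholding` methods (selected through the `thresholding_method` parameter). These look at a small window of pixels: if a pixel is significantly darker than its neighbors, it becomes black (text).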
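The local-window idea can be sketched in Go as follows. This is an illustration only, not Tesseract's or Leptonica's actual code: the function name, the mean-of-window rule, and the `radius` and `offset` parameters are assumptions chosen for clarity.

```go
package main

import "fmt"

// adaptiveThreshold binarizes a grayscale image (0 = black, 255 = white).
// A pixel becomes text (true) when it is darker than the mean of its
// (2*radius+1)-wide neighbourhood by more than offset.
func adaptiveThreshold(gray [][]uint8, radius, offset int) [][]bool {
	h := len(gray)
	out := make([][]bool, h)
	for y := 0; y < h; y++ {
		out[y] = make([]bool, len(gray[y]))
		for x := 0; x < len(gray[y]); x++ {
			sum, n := 0, 0
			for dy := -radius; dy <= radius; dy++ {
				for dx := -radius; dx <= radius; dx++ {
					yy, xx := y+dy, x+dx
					if yy < 0 || yy >= h || xx < 0 || xx >= len(gray[yy]) {
						continue // skip neighbours outside the image
					}
					sum += int(gray[yy][xx])
					n++
				}
			}
			mean := sum / n // n >= 1 because the pixel itself is counted
			out[y][x] = int(gray[y][x]) < mean-offset
		}
	}
	return out
}

func main() {
	// One row with uneven lighting: bright on the left, shadowed on the right.
	// A fixed cut-off at 128 would also mark the shadowed background (110).
	row := [][]uint8{{220, 120, 220, 110, 40, 110}}
	out := adaptiveThreshold(row, 1, 20)
	fmt.Println(out[0]) // [false true false false true false]
}
```

In this made-up row only the two strokes (120 on the bright side, 40 in the shadow) are marked as text, while the shadowed background stays white.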

### **Connected Component Analysis (Blob Finding)**

Once the image is `Binarized`, Tesseract looks for groups of connected black pixels called `Blobs`. Each blob could be a part of a letter, a whole letter, or even multiple letters stuck together.

- A `Blob` is a group of connected pixels that are all the same color (in this case, black).

- In `English`, the letter `i` might be split into two blobs: the dot and the stem. In `Devanagari`, the `Top Line (Shirorekha)` connects multiple letters into a single blob.
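Blob finding can be illustrated with a small flood-fill labeller. This is a sketch under our own assumptions (8-connectivity, a `[][]bool` image, and the helper names `countBlobs` and `parse`), not how Leptonica implements it.

```go
package main

import "fmt"

// countBlobs labels 8-connected groups of black (true) pixels with an
// iterative flood fill and returns how many groups it found.
func countBlobs(img [][]bool) int {
	h := len(img)
	seen := make([][]bool, h)
	for y := range seen {
		seen[y] = make([]bool, len(img[y]))
	}
	blobs := 0
	for y := 0; y < h; y++ {
		for x := 0; x < len(img[y]); x++ {
			if !img[y][x] || seen[y][x] {
				continue
			}
			blobs++ // new, unvisited black pixel: start a new blob
			stack := [][2]int{{y, x}}
			seen[y][x] = true
			for len(stack) > 0 {
				p := stack[len(stack)-1]
				stack = stack[:len(stack)-1]
				for dy := -1; dy <= 1; dy++ {
					for dx := -1; dx <= 1; dx++ {
						ny, nx := p[0]+dy, p[1]+dx
						if ny < 0 || ny >= h || nx < 0 || nx >= len(img[ny]) {
							continue
						}
						if img[ny][nx] && !seen[ny][nx] {
							seen[ny][nx] = true
							stack = append(stack, [2]int{ny, nx})
						}
					}
				}
			}
		}
	}
	return blobs
}

// parse turns rows of '#' (black) and '.' (white) into a binary image.
func parse(rows []string) [][]bool {
	img := make([][]bool, len(rows))
	for y, row := range rows {
		img[y] = make([]bool, len(row))
		for x := 0; x < len(row); x++ {
			img[y][x] = row[x] == '#'
		}
	}
	return img
}

func main() {
	letterI := parse([]string{".#.", "...", ".#.", ".#."})
	topLine := parse([]string{"#####", ".#.#.", ".#.#."})
	fmt.Println(countBlobs(letterI)) // 2: the dot and the stem
	fmt.Println(countBlobs(topLine)) // 1: the top line joins both stems
}
```

The second toy image mimics the `Shirorekha` case: two stems that would be separate blobs on their own become one blob once a top line connects them.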

### **Page Layout Analysis (PLA)**

This is the most complex non-AI part. Tesseract tries to find the structure.

- `Tab Stop Detection`: It looks for vertical alignments of text to identify columns and tables.

- `Gutter Detection`: It looks for wide strips of white space (`Gutters`) between blocks of text to separate columns and paragraphs.

`Tesseract` has hard-coded thresholds for how much white space constitutes a `Paragraph Break` or a `Column Break`. So if `Nepali` text has a large font size or a smaller line height, it might misinterpret the layout.

In such cases, we have to manually calculate the `Gaps` and set the `Page Segmentation Mode (PSM)` accordingly.
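One possible way to measure those gaps ourselves is a horizontal projection profile: scan the binarized page row by row and record how many empty rows sit between rows that contain ink. The sketch below assumes a `[][]bool` image (true = black). The function name `rowGaps` is hypothetical, and how the measured gaps map to a particular PSM is left open.

```go
package main

import "fmt"

// rowGaps returns the height (in pixel rows) of every blank band that lies
// between two rows containing ink. Top and bottom margins are ignored.
func rowGaps(img [][]bool) []int {
	var gaps []int
	run := 0          // length of the current blank band
	seenText := false // becomes true after the first inked row
	for _, row := range img {
		hasInk := false
		for _, px := range row {
			if px {
				hasInk = true
				break
			}
		}
		if hasInk {
			if seenText && run > 0 {
				gaps = append(gaps, run)
			}
			seenText = true
			run = 0
		} else if seenText {
			run++
		}
	}
	return gaps
}

func main() {
	ink := []bool{false, true, true, false}
	blank := []bool{false, false, false, false}
	// Two text rows, a 1-row gap, a text row, a 3-row gap, a text row, margin.
	page := [][]bool{ink, ink, blank, ink, blank, blank, blank, ink, blank}
	fmt.Println(rowGaps(page)) // [1 3]
}
```

A gap that is much larger than the typical one (3 versus 1 here) is a candidate paragraph break. In gosseract the mode itself should be settable with `client.SetPageSegMode(...)`; check the docs linked at the top for the available constants.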


### **Line Finding and Baseline Estimation**

`Tesseract` groups blobs into lines of text. It estimates the `Baseline` for each line, which is the imaginary line on which most letters sit.

- It fits a mathematical curve (a `Spline`) to the bottom of the blobs in a line to estimate the baseline.

- This helps in recognizing letters that have parts going below the baseline, like `g`, `j`, `p` in English or the below-base vowel signs in `कु`, `कू`, `कृ` in Nepali.
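To make baseline estimation concrete, the sketch below fits a straight line through the bottom points of the blobs using ordinary least squares. Note the substitution: Tesseract fits a spline, while this example uses a plain line because it is shorter and already captures skew. The `point` type and `fitBaseline` name are our own.

```go
package main

import "fmt"

// point is the bottom-centre of one blob, in pixel coordinates.
type point struct{ X, Y float64 }

// fitBaseline returns the slope and intercept of the least-squares line
// y = slope*x + intercept through the given blob bottoms.
func fitBaseline(bottoms []point) (slope, intercept float64) {
	n := float64(len(bottoms))
	if n == 0 {
		return 0, 0
	}
	var sx, sy, sxx, sxy float64
	for _, p := range bottoms {
		sx += p.X
		sy += p.Y
		sxx += p.X * p.X
		sxy += p.X * p.Y
	}
	denom := n*sxx - sx*sx
	if denom == 0 {
		// All points share one x: fall back to a flat line at the mean y.
		return 0, sy / n
	}
	slope = (n*sxy - sx*sy) / denom
	intercept = (sy - slope*sx) / n
	return slope, intercept
}

func main() {
	// Hypothetical blob bottoms of a slightly skewed line of text.
	bottoms := []point{{0, 100}, {10, 101}, {20, 102}, {30, 103}}
	slope, intercept := fitBaseline(bottoms)
	fmt.Printf("slope=%.3f intercept=%.1f\n", slope, intercept)
}
```

A blob whose bottom lies clearly below the fitted line is then a candidate descender. A plain least-squares fit is pulled down by such blobs, so a more careful version would need to exclude or down-weight them.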

## **Phase 2: Recognizing Characters with LSTM**

Once `Tesseract` has identified the lines of text and their baselines (i.e. the `Image Segmentation` for each line), it moves to the `LSTM` based recognition phase.

### **Sliding Window Approach**

Tesseract uses a `Sliding Window` approach to recognize characters. It takes small vertical slices of the line image and feeds them into the `LSTM` network.

- Each slice is typically a few pixels wide and spans the full height of the line.

- The `LSTM` processes these slices sequentially, maintaining a memory of previous slices to understand context.
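The slicing itself can be pictured with the sketch below. It is an illustration of the idea only: the real network works on a normalized line image and its internals differ. The function name, the `width` and `stride` parameters, and the rectangular `[][]uint8` input are assumptions.

```go
package main

import "fmt"

// columnSlices cuts a text-line image into vertical strips that are width
// pixels wide and span the full line height, moving stride pixels per step.
// The image is assumed to be rectangular (all rows the same length).
func columnSlices(line [][]uint8, width, stride int) [][][]uint8 {
	if len(line) == 0 || width <= 0 || stride <= 0 {
		return nil
	}
	var out [][][]uint8
	lineWidth := len(line[0])
	for x := 0; x+width <= lineWidth; x += stride {
		strip := make([][]uint8, len(line))
		for y := range line {
			strip[y] = line[y][x : x+width] // shares memory with the input
		}
		out = append(out, strip)
	}
	return out
}

func main() {
	// A toy line image: 2 pixels high, 6 pixels wide.
	line := [][]uint8{
		{1, 2, 3, 4, 5, 6},
		{7, 8, 9, 10, 11, 12},
	}
	slices := columnSlices(line, 2, 2)
	fmt.Println(len(slices)) // 3
	fmt.Println(slices[1])   // [[3 4] [9 10]]
}
```

Each strip would be fed to the network in left-to-right order, which is what lets the `LSTM` carry context from one slice to the next.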

### **The Probability Output**

For each slice, the `LSTM` outputs a probability distribution over all possible characters (including a special "blank" character).

- `Slice 1`: 70% chance it's 'H', 20% chance it's 'A', 10% chance it's blank.

- `Slice 2`: 40% chance it's 'क', 30% chance it's 'ब'.

- `Slice 3`: 80% chance it's 'क'.
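The simplest way to read such an output is to take the most likely label in every slice (greedy decoding). The sketch below does that for the three slices listed above, using only the probabilities given there, so the distributions are partial. The `candidate` type and `best` function are our own names, and real decoders may use a beam search instead.

```go
package main

import "fmt"

// candidate is one label with the probability the network gave it.
type candidate struct {
	Label string
	Prob  float64
}

// best returns the highest-probability candidate of one slice.
// On a tie the earlier candidate wins; an empty slice gives a zero value.
func best(slice []candidate) candidate {
	if len(slice) == 0 {
		return candidate{}
	}
	top := slice[0]
	for _, c := range slice[1:] {
		if c.Prob > top.Prob {
			top = c
		}
	}
	return top
}

func main() {
	slices := [][]candidate{
		{{"H", 0.70}, {"A", 0.20}, {"_", 0.10}}, // "_" stands for blank
		{{"क", 0.40}, {"ब", 0.30}},
		{{"क", 0.80}},
	}
	for i, s := range slices {
		b := best(s)
		fmt.Printf("slice %d: %s (%.2f)\n", i+1, b.Label, b.Prob)
	}
}
```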

### **CTC Decoding (Connectionist Temporal Classification)**

The raw output is a messy stream like `H H A _ _ क क ब क _ _`, where `_` represents the blank character.

The `CTC` decoder cleans this up in two steps, resulting in the final recognized text `H A क ब क`.

- Consecutive repeats are merged first: `H H A` becomes `H A`.

- Blanks are then dropped: `H A _ _` becomes `H A`. A blank between two identical characters is what lets a genuine double letter survive the merge.
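The two rules can be written as a small greedy CTC collapse. This is a minimal sketch that works on already-chosen labels (one per slice); the function name and the use of `"_"` as the blank symbol follow the example above and are otherwise our own choices.

```go
package main

import (
	"fmt"
	"strings"
)

// ctcCollapse applies the greedy CTC rules to one label per slice:
// first skip a label that repeats its predecessor, then drop blanks.
func ctcCollapse(labels []string, blank string) []string {
	var out []string
	for i, l := range labels {
		if i > 0 && l == labels[i-1] {
			continue // consecutive repeat of the same label
		}
		if l == blank {
			continue // blank carries no character
		}
		out = append(out, l)
	}
	return out
}

func main() {
	raw := strings.Fields("H H A _ _ क क ब क _ _")
	fmt.Println(strings.Join(ctcCollapse(raw, "_"), " ")) // H A क ब क
}
```

Because repeats are compared against the previous raw label, `l _ l` stays `l l` while `l l` becomes `l`.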

### **Confidence Scoring**

Tesseract assigns a confidence score to each recognized character and the overall line based on the probabilities output by the `LSTM`.

- Higher confidence scores indicate more reliable recognition.

## **Phase 3: Post-Processing**

After the `Neural Network` has recognized the text, Tesseract performs several post-processing steps to improve accuracy.

`Tesseract` checks its work using the `Language Data Files` (`.traineddata` files) that contain dictionaries and language models.

### **Dictionary Lookup**

Tesseract compares recognized words against its internal dictionaries to correct common OCR errors.

- `Tesseract` uses a `Directed Acyclic Word Graph (DAWG)` to efficiently store and look up words. It checks if the word `तपाईं` exists in the Nepali dictionary.

- If the `LSTM` was 49% sure it saw `तपाईं` and 51% sure it saw `तपाई`, the dictionary match can tip the decision to `तपाईं`.
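The lookup idea can be sketched with a plain trie, which is a simplified stand-in for a `DAWG` (a DAWG additionally shares common suffixes to save space). The scoring rule, a fixed `bonus` added to hypotheses found in the dictionary, is an assumption for illustration and not Tesseract's actual formula. The scores in `main` are made up so that the non-dictionary spelling is slightly ahead.

```go
package main

import "fmt"

// node is one trie node; children are keyed by rune so Devanagari works.
type node struct {
	children map[rune]*node
	isWord   bool
}

func newNode() *node { return &node{children: map[rune]*node{}} }

// insert adds a word to the trie.
func (n *node) insert(word string) {
	cur := n
	for _, r := range word {
		next, ok := cur.children[r]
		if !ok {
			next = newNode()
			cur.children[r] = next
		}
		cur = next
	}
	cur.isWord = true
}

// contains reports whether the exact word was inserted.
func (n *node) contains(word string) bool {
	cur := n
	for _, r := range word {
		next, ok := cur.children[r]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.isWord
}

// hypothesis is one candidate reading with its recognition probability.
type hypothesis struct {
	Text string
	Prob float64
}

// pick returns the hypothesis with the best score, where dictionary words
// get bonus added to their probability. Returns "" for no hypotheses.
func pick(dict *node, hyps []hypothesis, bonus float64) string {
	best, bestScore := "", -1.0
	for _, h := range hyps {
		score := h.Prob
		if dict.contains(h.Text) {
			score += bonus
		}
		if score > bestScore {
			best, bestScore = h.Text, score
		}
	}
	return best
}

func main() {
	dict := newNode()
	dict.insert("तपाईं")
	hyps := []hypothesis{{"तपाई", 0.51}, {"तपाईं", 0.49}}
	fmt.Println(pick(dict, hyps, 0.10)) // dictionary word wins
	fmt.Println(pick(dict, hyps, 0.00)) // raw probability wins
}
```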

### **N-Grams and Bigrams**

It looks at word pairs. If `नेपाल सरकार` is a common bigram in Nepali, and the OCR output is `नेपाल ससरकार`, Tesseract might correct it to `नेपाल सरकार`.
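A bigram check of this kind can be sketched as below. It assumes we already have a table of bigram counts and a short list of candidate readings for the current word. Both the counts and the function name `pickByBigram` are hypothetical, and this is not a claim about how Tesseract implements it internally.

```go
package main

import "fmt"

// pickByBigram returns the candidate that most often follows prev according
// to counts. If no candidate has a higher count, the first candidate (the
// OCR engine's own top choice) is kept. Returns "" for no candidates.
func pickByBigram(prev string, candidates []string, counts map[[2]string]int) string {
	if len(candidates) == 0 {
		return ""
	}
	best := candidates[0]
	bestCount := counts[[2]string{prev, best}]
	for _, c := range candidates[1:] {
		if n := counts[[2]string{prev, c}]; n > bestCount {
			best, bestCount = c, n
		}
	}
	return best
}

func main() {
	// Hypothetical bigram count; a real table would come from a text corpus.
	counts := map[[2]string]int{
		{"नेपाल", "सरकार"}: 120,
	}
	candidates := []string{"ससरकार", "सरकार"} // OCR top choice first
	fmt.Println(pickByBigram("नेपाल", candidates, counts))
}
```

The same function can serve as a post-processing step of our own when preparing the report, independent of whatever Tesseract does internally.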