## Overview
This document outlines the key steps for improving OCR accuracy through image preprocessing. Text is read from an image with Tesseract, and each preprocessing step is kept or discarded depending on whether it raises the OCR confidence score. The methods come from both research and practical experimentation, with the confidence scoring refined through numerous tests on clear and noisy images. The flowchart below summarizes the process.

## Flowchart

Processes in order:

1. Scale (the only step named explicitly in the flowchart)
1. Remove lines
1. Denoise
1. Sharpen
1. Erode
1. Threshold

```{mermaid}
%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%

flowchart TD
    start([Start]) -->read_file[/Read file/] --> convert_to_image[Convert pdf file to image] --> get_confidence["Record confidence level (tesseract)"]
    convert_to_image --> try_scale[Try scaling] --> get_confidence
    
    try_scale --> compare_improve{Did confidence improve?} -->|Yes|above_threshold{Above threshold?}
    compare_improve -->|No|remove_last_process["Remove last process\n(pop confidence)"] --> try_next_process
    above_threshold -->|Yes|End([End])
    above_threshold -->|No|try_next_process[Try next process] --> get_confidence
    try_next_process --> compare_improve
    
```
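
The "Record confidence level (tesseract)" node can be sketched as follows. This is a minimal example, not the exact scoring used in the tests: it assumes the page confidence is the average of Tesseract's word-level confidences, and it assumes the input dictionary has the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`. The function name `mean_confidence` is hypothetical.

```python
from statistics import mean


def mean_confidence(data: dict) -> float:
    """Average word-level confidence from a Tesseract result dictionary.

    Assumption: `data` holds parallel lists under the keys "text" and
    "conf", as produced by pytesseract.image_to_data with Output.DICT.
    """
    scores = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        # Tesseract reports -1 for layout entries that are not words,
        # and whitespace-only entries carry no readable text.
        if conf >= 0 and str(text).strip():
            scores.append(conf)
    # An image with no recognized words scores 0.
    return mean(scores) if scores else 0.0
```

With pytesseract installed, the dictionary from `image_to_data` can be passed straight to this function. Averaging per word is only one option; weighting by word length would be another.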

::: {.callout-note icon=false}
# Notes
* Confidence levels are assumed to be stored in a stack, so the score of a rejected process can be popped.
* I've set the threshold at 85, as this represents a high confidence level. However, this can be adjusted if needed.
:::
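
The loop in the flowchart can be sketched roughly as below. This is an illustration under stated assumptions rather than the project's actual code: `select_steps`, `steps`, and `score` are hypothetical names, and the caller is expected to supply the real preprocessing functions and a scoring function that runs Tesseract. As in the flowchart, the threshold is only checked after a step improves the score.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # whatever type the caller uses to represent an image


def select_steps(
    image: T,
    steps: Sequence[Tuple[str, Callable[[T], T]]],
    score: Callable[[T], float],
    threshold: float = 85.0,
) -> Tuple[T, List[str], float]:
    """Keep each preprocessing step only if it raises the confidence."""
    confidences = [score(image)]  # stack; the top is the best score so far
    kept: List[str] = []
    current = image
    for name, apply_step in steps:
        candidate = apply_step(current)
        confidences.append(score(candidate))
        if confidences[-1] > confidences[-2]:
            # Confidence improved: keep the step and its result.
            current = candidate
            kept.append(name)
            if confidences[-1] > threshold:
                break  # above threshold, stop early
        else:
            # No improvement: remove the last process (pop confidence).
            confidences.pop()
    return current, kept, confidences[-1]
```

The function returns the processed image, the names of the steps that were kept, and the final confidence, which makes it easy to log which combination worked for a given file.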

## Research
This section discusses my findings for each of the ACORD files I used for testing. Each file's section includes an example image of that file.

::: {.callout-note}
Click on each image to enlarge.
:::

### Acord 1
![Acord 1](docs/imgs/acord_1.png){width=40% fig-align="left" .lightbox}

When processing this ACORD, I found that scaling alone gives the best result.

| Preprocessing Type | Confidence Score |
|------|--------|
|Base|86.25|
|<mark>Scaled</mark>|<mark>88.19</mark>|
|Denoised|86.43|

: Comparison of ACORD 1 scores {.striped}

### Acord 2
![Acord 2](docs/imgs/acord_2.png){width=40% fig-align="left" .lightbox}

This ACORD performed very similarly to the first one. Scaling still had the largest impact, although denoising helped slightly more here than it did on ACORD 1.

| Preprocessing Type | Confidence Score |
|------|--------|
|Base|86.97|
|<mark>Scaled</mark>|<mark>88.83</mark>|
|Denoised|87.19|

: Comparison of ACORD 2 scores {.striped}

### Acord 3
::: {#images ncol-layout=2}
![Acord 3](docs/imgs/acord_3.png){width=40% fig-align="left" .lightbox}
![Acord 3 without lines](docs/imgs/acord_3_lr.png){width=40% fig-align="left" .lightbox}
:::

This ACORD proved to be the most difficult of the 7 files examined. The line on the left side of the file obscured many of the words, so they could not be read properly.
This file was also the noisiest image, which had to be dealt with as well. My first approach was to remove the lines from the image, leaving just the words.
With the lines removed, Tesseract's confidence score increases.

| Preprocessing Type | Confidence Score |
|------|--------|
|Base|78.54|
|Lines removed|81.77|

: {.striped}

<br>

The image is still very noisy, however, so next I removed the noise. The table below compares denoising the image with the lines still present against denoising it after the lines were removed.


| Preprocessing Type | Confidence Score |
|------|--------|
|Denoised with lines|80.05|
|Denoised without lines|82.18|

: {.striped}

<br>

Combining the two approaches improved the score slightly. The table below lists both orderings of line removal and denoising to show how the order of the processes affects the result.

| Preprocessing Type | Confidence Score |
|------|--------|
|Denoise then remove lines|81.16|
|Remove lines then denoise|82.18|

: {.striped}

<br>
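
An order comparison like the one above can be automated. The sketch below is a hypothetical helper (`rank_orderings` is not part of the project) that applies every ordering of a set of steps and ranks them by score. It assumes the caller provides the step functions and a `score` function that returns the Tesseract confidence.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")  # whatever type the caller uses to represent an image


def rank_orderings(
    image: T,
    steps: Dict[str, Callable[[T], T]],
    score: Callable[[T], float],
) -> List[Tuple[str, float]]:
    """Score every ordering of `steps` and return the results best-first."""
    results = []
    for ordering in permutations(steps.items()):
        processed = image
        for _, apply_step in ordering:
            processed = apply_step(processed)
        label = " then ".join(name for name, _ in ordering)
        results.append((label, score(processed)))
    # Highest confidence first.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The number of orderings grows factorially, so this is only practical for a handful of steps, such as the two compared here.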

Finally, I scaled the image now that it was in a cleaner state.

| Preprocessing Type | Confidence Score |
|------|--------|
|Scaled after removal and denoise|83.12|

: {.striped}

### Acord 4
![Acord 4](docs/imgs/acord_4.png){width=40% fig-align="left" .lightbox}

### Acord 5
![Acord 5](docs/imgs/acord_5.png){width=40% fig-align="left" .lightbox}

### Acord 6
![Acord 6](docs/imgs/acord_6.png){width=40% fig-align="left" .lightbox}

### Acord 7
![Acord 7](docs/imgs/acord_7.png){width=40% fig-align="left" .lightbox}
