# OCR - From images to text

In this notebook, we're going to see how we can extract text from images using the ```pytesseract``` library. Along the way, we'll draw on a lot of the skills we've learned this semester, including ideas from Language Analytics!

In [None]:
# basic python tools
import re, os, sys
sys.path.append("..")

# OCR tools
import cv2
import pytesseract

# util functions
from utils.imutils import jimshow as show
from utils.imutils import jimshow_channel as show_channel

# data processing tools
import numpy as np 
import pandas as pd 

# readymade spellchecker
from autocorrect import Speller

def clean_string(string):
    """Strips newlines and stray OCR symbols, then collapses whitespace"""
    processed = string.replace("\n", " ")\
                      .replace("__", " ")\
                      .replace(" - ", " ")\
                      .replace('-""', " ")\
                      .replace("|", "")\
                      .replace("!", "")
    # split() with no arguments splits on any run of whitespace,
    # so this also trims both ends and collapses repeated spaces
    return " ".join(processed.split())
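
One detail worth knowing about chained calls like the ones above: `str.replace` matches literal text, not regular-expression patterns, so a pattern such as `\s\s` is searched for as a backslash followed by the letter `s`. Here is a minimal sketch of the difference, using a made-up sample string (not real OCR output). Whitespace can be collapsed either with `split`/`join` or with `re.sub`:

```python
import re

# Made-up sample with a double space, a newline and a tab (not real OCR output)
sample = "We  hold these\ntruths\t to be"

# str.replace looks for the literal characters backslash + "s",
# finds none, and returns the string unchanged
literal = sample.replace(r"\s\s", " ")

# re.sub treats \s+ as "one or more whitespace characters"
with_regex = re.sub(r"\s+", " ", sample).strip()

# split() with no arguments splits on any run of whitespace
with_split = " ".join(sample.split())

print(literal == sample)
print(with_regex)
print(with_split)
```

Both whitespace approaches give the same result here; `split`/`join` has the small advantage of not needing a regular expression at all.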

## OCR using ```Tesseract```

Tesseract/pytesseract is quite a rich library with lots of different functionality and small tweaks and tricks that can improve your OCR. Check out the documentation for more info:

**Pytesseract:** [Github](https://github.com/h/pytesseract) <br><br>
**Tesseract:** [Github](https://github.com/tesseract-ocr/tesseract); [Documentation](https://tesseract-ocr.github.io/)

In [None]:
filepath = os.path.join("..", 
                        "..",
                        "cds-viz-data",
                        "data", 
                        "img", 
                        "jefferson.jpg")

The simplest way of using ```pytesseract``` is to call the ```image_to_string()``` function. As the name suggests, this returns a single string with all of the text content found in the image:

In [None]:
text = pytesseract.image_to_string(filepath)
print(text)

The library also has a function that returns the results as a dataframe, with detailed information about each prediction, such as its position and confidence:

In [None]:
df = pytesseract.image_to_data(filepath, 
                               output_type='data.frame')

In [None]:
df
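
One thing you might do with this table is drop words that the model is unsure about. The sketch below uses a handful of hypothetical rows in the shape of `df.to_dict("records")`, showing only the `conf` and `text` columns; the values are invented for illustration, and the cut-off of 60 is an arbitrary assumption. Rows that describe a block or line rather than a word typically carry a `conf` of -1 and no text.

```python
# Hypothetical rows in the shape of df.to_dict("records"); values are invented
# for illustration and only the "conf" and "text" columns are shown
rows = [
    {"conf": -1, "text": float("nan")},   # structural row (block/line), no word
    {"conf": 96, "text": "When"},
    {"conf": 91, "text": "in"},
    {"conf": 35, "text": "tbe"},          # low confidence, likely misread
    {"conf": 88, "text": "Course"},
]

MIN_CONF = 60  # arbitrary cut-off chosen for this sketch


def confident_words(rows, min_conf=MIN_CONF):
    """Keep only the text of rows whose confidence is at least min_conf."""
    return [row["text"] for row in rows if row["conf"] >= min_conf]


words = confident_words(rows)
print(" ".join(words))
```

On the dataframe itself, the same filter would be `df[df["conf"] >= 60]`. Note that simply dropping words loses information, so this is more useful for inspecting where the OCR struggles than for producing a final text.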

## Preprocess with OpenCV

Note that the Tesseract documentation on GitHub gives a bunch of tips for how best to preprocess images to improve performance. 

You should have the skills to actually do all of these things using OpenCV: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling

In [None]:
image = cv2.imread(filepath)

In [None]:
show(image)

__Crop__

The first thing we want to do is to crop this around the center of the image to keep only the main text.

In [None]:
(cX, cY) = (image.shape[1]//2, image.shape[0]//2)
cropped = image[cY-750:cY+1150, cX-750:cX+700]
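
The offsets above are hard-coded for this particular image. With NumPy slicing, a negative start index counts from the end of the axis, so on a smaller image an expression like `cY-750` can go negative and silently give a much smaller crop than intended. A minimal sketch of one way to guard against this, using a hypothetical helper called `center_crop` (not part of OpenCV or `utils.imutils`) and a dummy array in place of a real image:

```python
import numpy as np


def center_crop(image, up, down, left, right):
    """Crop around the image centre, clamping the box to the image bounds.

    Hypothetical helper for illustration only.
    """
    height, width = image.shape[:2]
    cX, cY = width // 2, height // 2
    # max/min keep the indices inside the array, so a negative start
    # never wraps around to the other end of the axis
    y0, y1 = max(cY - up, 0), min(cY + down, height)
    x0, x1 = max(cX - left, 0), min(cX + right, width)
    return image[y0:y1, x0:x1]


# Dummy 1000x800 array with 3 channels standing in for a real photo
dummy = np.zeros((1000, 800, 3), dtype=np.uint8)

# Same offsets as above, applied naively to the smaller dummy image
naive = dummy[500 - 750:500 + 1150, 400 - 750:400 + 700]
safe = center_crop(dummy, up=750, down=1150, left=750, right=700)

print(naive.shape)
print(safe.shape)
```

The naive slice keeps only a corner of the dummy image, while the clamped version falls back to the full image when the requested box is larger than the image.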

__Greyscale__

Next, we convert the image to greyscale to try to remove extra noise.

In [None]:
grey = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)

In [None]:
show_channel(grey)
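
To get a feel for what the conversion does, here is a NumPy sketch of the weighted sum that is commonly used for greyscale conversion (roughly 0.299 R + 0.587 G + 0.114 B). The function name `bgr_to_grey` is hypothetical, and the result may differ from OpenCV's by small rounding differences, so treat it as an approximation rather than a drop-in replacement:

```python
import numpy as np


def bgr_to_grey(image):
    """Approximate BGR -> greyscale conversion using standard luma weights."""
    # OpenCV stores colour channels in B, G, R order
    weights = np.array([0.114, 0.587, 0.299])
    grey = image.astype(np.float64) @ weights
    return np.rint(grey).astype(np.uint8)


# Four test pixels in BGR order: white, black, pure blue, pure red
pixels = np.array([[[255, 255, 255], [0, 0, 0]],
                   [[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

grey_pixels = bgr_to_grey(pixels)
print(grey_pixels.shape)
print(grey_pixels)
```

The three colour channels collapse into one, which is why the output has two dimensions instead of three. Pure blue ends up much darker than pure red because of the different weights.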

__OCR again__

Let's see how these simple steps improve the performance of the OCR model.

In [None]:
text = pytesseract.image_to_string(grey)

In [None]:
print(text)

__Thresholding__

Way back when we worked more with OpenCV, we learned that we could also *binarize* images using thresholding to make everything black or white (like when we created *masks*).

In [None]:
# threshold
(T, thres) = cv2.threshold(grey, 110, 255, cv2.THRESH_BINARY)

In [None]:
show_channel(thres)

In [None]:
text = pytesseract.image_to_string(thres)

In [None]:
print(text)
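
As a reminder of what binary thresholding does to the pixel values, here is a small NumPy sketch of the rule behind `cv2.THRESH_BINARY`: pixels strictly above the threshold become the maximum value, and everything else becomes 0. The function name `binarize` and the tiny example array are made up for illustration:

```python
import numpy as np


def binarize(grey, thresh=110, maxval=255):
    """Pixels strictly above thresh become maxval, all others become 0."""
    return np.where(grey > thresh, maxval, 0).astype(np.uint8)


# Tiny made-up greyscale "image" with values around the threshold
sample = np.array([[12, 110, 111],
                   [200, 90, 255]], dtype=np.uint8)

binary = binarize(sample)
print(binary)
```

The threshold of 110 was picked by hand for this image, so it is worth trying other values on other images. OpenCV can also choose a threshold automatically with Otsu's method (`cv2.THRESH_OTSU`), which may be worth experimenting with.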

## Quick and cheap spell checking

One of the main issues we seem to see is single-character errors which give misspelled words. So, let's see how far we can get by doing some simple spell checking and correction with the ```autocorrect``` library:

__Initialize speller__

In [None]:
spell = Speller(only_replacements=True)

In [None]:
cleaned = clean_string(text)

In [None]:
spell(cleaned.lower())
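
To see the general idea behind correcting single-character errors, here is a sketch that uses `difflib` from the standard library instead of `autocorrect`. This is a swapped-in technique for illustration, not how `autocorrect` works internally, and the vocabulary, the noisy sentence, and the `correct_word` helper are all made up:

```python
import difflib

# Tiny hand-picked vocabulary for illustration; a real checker needs a full word list
VOCAB = ["when", "in", "the", "course", "of", "human", "events"]


def correct_word(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Made-up sentence with typical single-character OCR errors
noisy = "when in tbe couse of humen events"
fixed = " ".join(correct_word(w) for w in noisy.split())
print(fixed)
```

The `cutoff` controls how similar a word must be before it gets replaced. Set it too low and correct but unusual words (names, old spellings) get "fixed" into something else, which is a real risk with historical documents.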

## Tasks

__Spell checking with generative LLMs__
- Head over to HuggingChat and check out how some of the newest LLMs perform on this task. Test all of the available models and ask the following questions

__Some test images__

- I've attached some links to culturally significant images below. How well does the OCR pipeline work on these images? What do you need to do to get it to work? What does this suggest about the challenges or limitations of OCR?
    - [Image 1](https://www.techsmith.com/blog/wp-content/uploads/2021/09/Make-a-meme-butterfly.png)
    - [Image 2](https://datasciencedojo.com/wp-content/uploads/52.jpg)
    - [Image 3](https://datasciencedojo.com/wp-content/uploads/36.png)
    - [Image 4 (an actually serious example)](https://upload.wikimedia.org/wikipedia/commons/7/7e/King_James_Bible-Isaiah_26.jpg)
    - [Image 5](https://imgs.xkcd.com/comics/git.png)