# Assessing word embeddings for improving OCR accuracy

I will work on this project with Adnan Fazlinovic. 

Our idea is to compare how different embeddings affect OCR accuracy, similar to the approach taken in this article: https://medium.com/states-title/using-nlp-bert-to-improve-ocr-accuracy-385c98ae174c

We would like to try some embeddings we have learned about in the course, such as CBoW, word2vec, FastText, ELMo and BERT. If we have more time and find other interesting representations online, we will evaluate them as well. The idea is to find a baseline OCR accuracy and then explore whether, and by how much, this accuracy can be improved by applying word embeddings to the incorrectly scanned words.

- Probably use a synthetic dataset (read in europarl corpus and corrupt it a bit)
- Richard suggests corrupting it in ways that often happen in OCR, like rn <-> m, i <-> l, cl <-> d 
    - (https://scribenet.com/articles/2016/03/04/how-to-get-the-most-out-of-ocr)
- I think he means we should use a spellchecker to find misspelled words, and check with NER that they are not a name
- Then maybe we don't throw out the misspelled word, it can be useful

It feels like we're mostly making a spell/grammar checking model - that's fine

- Richard thinks that character level embeddings can be useful here
- Evaluation of performance: see what people usually use within OCR. Some ideas:

    - https://www.aclweb.org/anthology/I17-1101.pdf
    - https://loicbarrault.github.io/papers/afli_cicling2015.pdf
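One metric commonly used for OCR output is the character or word error rate: the edit distance between the OCR text and the ground truth, divided by the length of the ground truth. A minimal sketch of that idea follows; it is not taken from either paper above, and the function names are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


truth = "I smilingly scorn this little barn"
ocr = "I smihngly scom this little bam"

cer = error_rate(truth, ocr)                  # character error rate
wer = error_rate(truth.split(), ocr.split())  # word error rate
print(f"CER = {cer:.3f}, WER = {wer:.3f}")
```

Passing strings gives a character-level rate and passing token lists gives a word-level rate, so the same helper could be applied before and after correction to see whether a method moves the text closer to the ground truth.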

In [32]:
import re
import spacy
import numpy as np
from enchant.checker import SpellChecker

spell = SpellChecker("en-UK")
nlp = spacy.load('en_core_web_sm')

#import nltk
#from nltk.corpus import stopwords
#stop = stopwords.words('english')

## Try out spellchecker

In [34]:
spell.check("cat's")

True

In [35]:
spell.suggest("spel")

['spiel',
 'spelt',
 'spell',
 'seel',
 'spec',
 'sped',
 'spew',
 'Opel',
 'sp el',
 'sp-el']

# Read in and corrupt the corpus

Character elision is when letter pairs and individual letters are confused by the software. These types of errors occur any time pairs of letters are shaped similarly to other letters. Six common pairs are:

rn <-> m, cl <-> d, vv <-> w, ol <-> d, li <-> h, nn <-> m

In [36]:
elisionArray = []
elisionArray.append(['rn', 'm'])
elisionArray.append(['ol', 'd'])
elisionArray.append(['cl', 'd'])
elisionArray.append(['vv', 'w'])
elisionArray.append(['li', 'h'])
elisionArray.append(['nn', 'm'])
elisionArray = np.array(elisionArray)

## An example of how the character elision will be handled

In [108]:
# A test sentence that contains a lot of 'rn' and 'li'
line = "I smilingly scorn this little barn with my lilac yarn and delicious fern"
print(line)

new_line = line
elision_prob = 0.5

# Randomize search order, since some letter pairs overlap 
elisionArray = np.random.permutation(elisionArray)
for pair in elisionArray:
    
    n_errors = 0
    
    for m in re.finditer(pair[0], new_line):
        
        rd = np.random.rand(1)
        if rd < elision_prob:
            
            # Note that the position can't be used later since the line changes length!
            print("--> Replaced ", pair[0], " at position ", m.start(), "")
            
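            # Splice in the replacement; each earlier replacement of this pair
            # shortened the line by one character, hence the offset
            start, end = m.start() - n_errors, m.end() - n_errors
            new_line = new_line[:start] + str(pair[1]) + new_line[end:]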
            
            print(new_line)

            # count number of replacements
            n_errors += 1
            
print('-'*80)

# Save location and correct spelling for the corrupted words

line = list(line.split())
new_line = list(new_line.split())

total_errors = 0
ground_truth = []

# This will be the index of a document in the corpus
doc_num = 0     

# This contains the ground truth for each document 
tmp = []

for j in range(len(line)):
    if line[j] != new_line[j]:
        
        total_errors += 1
        tmp.append((j, line[j]))
        
ground_truth.append((doc_num, tmp))
print("Total errors: ", total_errors)
print(ground_truth)

I smilingly scorn this little barn with my lilac yarn and delicious fern
--> Replaced  rn  at position  15 
I smilingly scom this little barn with my lilac yarn and delicious fern
--> Replaced  rn  at position  32 
I smilingly scom this little bam with my lilac yarn and delicious fern
--> Replaced  rn  at position  51 
I smilingly scom this little bam with my lilac yam and delicious fern
--> Replaced  rn  at position  70 
I smilingly scom this little bam with my lilac yam and delicious fem
--> Replaced  li  at position  5 
I smihngly scom this little bam with my lilac yam and delicious fem
--------------------------------------------------------------------------------
Total errors:  5
[(0, [(1, 'smilingly'), (2, 'scorn'), (5, 'barn'), (9, 'yarn'), (12, 'fern')])]


## Apply character elision to the Europarl data

### Function definitions

In [4]:
def check_line_errors(line, new_line, total_errors):
    
    """
    Check for errors between gold standard document and "OCR" document.
    """

    line = list(line.split())
    new_line = list(new_line.split()) 
    truth = []

    for j in range(len(line)):
        if line[j] != new_line[j]:

            total_errors += 1
            truth.append((j, line[j]))
    
    return truth, total_errors

In [5]:
def read_data(corpus_file, corpus_encoding, max_lines, elisionArray, elision_prob):
    
    """
    Read in ground truth version of the text, and a (synthetic) corrupted OCR dataset.
    """
    
    total_errors = 0
    ground_truth = []
    doc_errors = []
    corrupted_data = []
    
    with open(corpus_file, encoding = corpus_encoding) as f:
        
        for d, line in enumerate(f):
        
            if d == max_lines:
                break
        
            ground_truth.append(line) 
            new_line = line
            
            # Randomize search order, since some letter pairs overlap 
            elisionArray = np.random.permutation(elisionArray)
            for pair in elisionArray:
                
                # Count number of times each letter pair has been corrupted (since this changes the line length)
                n_errors = 0 
                
                for m in re.finditer(pair[0], new_line):
                    
                    rd = np.random.rand(1)
                    if rd < elision_prob:
                        
                        # Replace the letter pair; each earlier replacement of this pair
                        # shortened the line by one character, hence the offset
                        start, end = m.start() - n_errors, m.end() - n_errors
                        new_line = new_line[:start] + str(pair[1]) + new_line[end:]

                        # count number of replacements
                        n_errors += 1
                     
            corrupted_data.append(new_line)
            line_truth, total_errors = check_line_errors(line, new_line, total_errors)
            if len(line_truth) > 0:
                doc_errors.append((d, line_truth))
            
    return ground_truth, corrupted_data, doc_errors, total_errors

In [113]:
def identify_spelling_errors(data, ignore):

    """
    Given a dataset and a list of characters to ignore, find misspelled words that are _not_ names.
    """
    
    n_misspelled = 0

    for line in data:

        # Run the NER pipeline at most once per line, and only if it is needed
        entities = None

        words = line.split()
        for word in words:

            if word not in ignore and not spell.check(word):

                # Apply nlp pipeline, check if this "misspelled word" is a name
                if entities is None:
                    entities = nlp(line, disable = ['tagger', 'parser']).ents
                is_name = False

                for entity in entities:
                    if entity.label_ in ["PERSON", "NORP", "GPE", "ORG"] and word in entity.text:
                        is_name = True

                if not is_name:

                    # TODO: note down the line, word and location - this is what we will try to correct!
                    print(word)

                    n_misspelled += 1

    return n_misspelled

### Run the code

In [114]:
file = r"C:\Users\saran\OneDrive\Dokument\GitHub\NLP\Project5\europarl.txt"
encoding = "utf-8"

# Probability that a letter pair from our list will be confused if it is seen in the text
elision_prob = 0.1
max_lines = 100

np.random.seed(0)
ground_truth, data, doc_errors, n_errors = read_data(file, encoding, max_lines, elisionArray, elision_prob)

In [116]:
# Words and characters for the spellchecker to ignore
ignore = [",", ".", '"', "(", ")", "-", "'", "!", "?", ":", ";", "/", "n't", "'s", "'m", "%", "--", "___LANGCODE___"]
n_misspelled = identify_spelling_errors(data, ignore)

famihes
selfemployed
apphcation
renationalise
ultra-hberal
pohcy
160_000
black-marketeering
concems
hke
Sdbes
trialogue


In [117]:
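# read_data returns the Europarl error count as n_errors; total_errors only holds
# the count from the single-sentence demo above (the 5 in the recorded output below)
print("Number of identified misspelled words: ", n_misspelled)
print("Number of synthetic errors: ", n_errors)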

Number of identified misspelled words:  12
Number of synthetic errors:  5
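Some of the flagged words above (for example "renationalise" and "trialogue") appear to be real words that the dictionary does not know, rather than synthetic corruptions. One way to quantify this would be precision and recall of the detection step against `doc_errors`. The sketch below assumes the flagged words are recorded as (document index, word index) pairs, which `identify_spelling_errors` does not do yet; `detection_scores` and the toy data are hypothetical.

```python
def detection_scores(flagged, doc_errors):
    """
    flagged:    iterable of (doc index, word index) pairs reported by the
                spellchecker (assumed format, not produced by the code above)
    doc_errors: list of (doc index, [(word index, correct word), ...]),
                the format returned by read_data
    """
    flagged = set(flagged)
    true_errors = {(d, j) for d, errors in doc_errors for j, _ in errors}

    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


# Toy example: one document with two synthetic errors; the checker flags
# one of them plus one word that was never corrupted
toy_doc_errors = [(0, [(1, 'smilingly'), (2, 'scorn')])]
toy_flagged = [(0, 2), (0, 7)]

precision, recall = detection_scores(toy_flagged, toy_doc_errors)
print("Precision:", precision, "Recall:", recall)
```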


Two approaches: either throw away the misspelled word and try to fill it in from context, or use character-level embeddings.

# Compute baseline accuracy using spellcheck suggestions

Baseline: use the top suggested word from the enchant spellchecker.

So I guess we create a list of misspelled words just like doc_errors, apply the top suggestion to each, and see whether we get closer to the ground truth or not. Will have to read his articles to work out how to evaluate the performance:

- https://www.aclweb.org/anthology/I17-1101.pdf
- https://loicbarrault.github.io/papers/afli_cicling2015.pdf
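A rough sketch of how the baseline could be scored, assuming the error locations are taken from `doc_errors` (so detection is treated as perfect) and that accuracy means "top suggestion equals the ground truth word". The function name and the `suggest` argument are hypothetical; in the notebook `suggest` would be `spell.suggest`, here it is a stub so the example runs on its own.

```python
def baseline_accuracy(corrupted_data, doc_errors, suggest):
    """
    Replace each known corrupted word by the top suggestion and count how
    often that restores the ground truth word.

    suggest: callable mapping a word to a ranked list of candidates
    """
    n_correct = 0
    n_total = 0

    for d, errors in doc_errors:
        words = corrupted_data[d].split()
        for j, true_word in errors:
            candidates = suggest(words[j])
            n_total += 1
            if candidates and candidates[0] == true_word:
                n_correct += 1

    return n_correct / n_total if n_total else 0.0


# Toy example with a stub in place of the real spellchecker
toy_suggestions = {"scom": ["scorn", "scum"], "bam": ["bum", "barn"]}
toy_data = ["I scom this little bam"]
toy_errors = [(0, [(1, "scorn"), (4, "barn")])]

acc = baseline_accuracy(toy_data, toy_errors, lambda w: toy_suggestions.get(w, []))
print("Baseline accuracy:", acc)
```

A looser variant could count a hit when the ground truth word appears anywhere in the top few suggestions, which would show how much a re-ranking step could gain over the baseline.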

# Try to fill in missing words using BERT 
(like in the article)
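A possible shape for this step, as a sketch only: mask the misspelled word, ask a masked language model for candidates, then keep the candidate that looks most like the misspelled word so that the corrupted spelling is still used. The helper names are hypothetical, the choice of the Hugging Face `transformers` fill-mask pipeline with `bert-base-uncased` is an assumption, and this is not necessarily how the article does it.

```python
from difflib import SequenceMatcher


def mask_word(line, index, mask_token="[MASK]"):
    """Replace the whitespace-separated word at `index` with the mask token."""
    words = line.split()
    words[index] = mask_token
    return " ".join(words)


def pick_candidate(predictions, corrupted_word):
    """Among the model's proposals, pick the one most similar to the corrupted word."""
    return max(predictions,
               key=lambda w: SequenceMatcher(None, w, corrupted_word).ratio())


def bert_predictions(masked_line, top_k=50):
    """Assumed setup: transformers fill-mask pipeline with bert-base-uncased."""
    from transformers import pipeline  # lazy import; the helpers above do not need it
    fill = pipeline("fill-mask", model="bert-base-uncased")  # in practice, create this once
    return [p["token_str"] for p in fill(masked_line, top_k=top_k)]


masked = mask_word("I scom this little barn", 1)
print(masked)

# With the model available one would call bert_predictions(masked);
# here a hand-written candidate list stands in for the model output
best = pick_candidate(["love", "scorn", "like", "see"], "scom")
print(best)
```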

# Try some other word embeddings

# Try some character level embeddings
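As a first intuition for why character-level information could help: FastText-style models represent a word through its character n-grams, so a corrupted word still shares most of its n-grams with the original. The sketch below only compares raw n-gram sets with Jaccard overlap, as a stand-in for the similarity of learned vectors; the function names and the n-gram range are assumptions.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word padded with the boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b):
    """Jaccard overlap of the two words' n-gram sets."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


# Compare a corrupted word with a few candidate corrections
for candidate in ["scorn", "scam", "this"]:
    print(candidate, round(ngram_similarity("scom", candidate), 3))
```

With trained embeddings the set overlap would be replaced by a similarity between the summed n-gram vectors, but the ranking idea stays the same.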