# Assessing word embeddings for improving OCR accuracy

I will work on this project with Adnan Fazlinovic. 

Our idea is to compare how different embeddings affect OCR accuracy, similar to the approach taken in this article: https://medium.com/states-title/using-nlp-bert-to-improve-ocr-accuracy-385c98ae174c

We would like to try some of the embeddings we have learned about in the course, such as CBoW, word2vec, FastText, ELMo and BERT. If we have more time and find other interesting representations online, we will evaluate them as well. The idea is to establish a baseline OCR accuracy and then explore whether, and by how much, this accuracy can be improved by applying word embeddings to the incorrectly scanned words. As datasets we want to use images of machine-written English text (suggestions for good datasets would be appreciated). The data should be of consistent quality; otherwise the focus will be too much on the image scanning part and not on the NLP application.

If we don't find a suitable image dataset, we could use a text dataset from the course and randomly corrupt a subset of the words in each text.

A preliminary title is "Assessing word embeddings for improving OCR accuracy"

- Probably use a synthetic dataset (read in europarl corpus and corrupt it a bit)
- Richard suggests corrupting it in ways that often happen in OCR, like rn <-> m, i <-> l, cl <-> d 
    - (https://scribenet.com/articles/2016/03/04/how-to-get-the-most-out-of-ocr)
- I think he means we should use a spellchecker to find misspelled words, and check with NER that they are not a name
- Then maybe we don't throw out the misspelled word, it can be useful

It feels like we're mostly making a spell/grammar checking model - that's fine

- Richard thinks that character level embeddings can be useful here
- Evaluation of performance: see what people usually use within OCR. Some ideas:

    - https://www.aclweb.org/anthology/I17-1101.pdf
    - https://loicbarrault.github.io/papers/afli_cicling2015.pdf
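
Two metrics that are commonly reported for OCR output are the character error rate (CER) and the word error rate (WER), both based on edit distance. The following is a minimal sketch of how they could be computed for one gold/OCR pair; the function names `edit_distance` and `error_rate` are our own and not taken from the papers above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


gold = "I scorn this little barn"
ocr = "I scom this little barn"

cer = error_rate(gold, ocr)                  # character error rate
wer = error_rate(gold.split(), ocr.split())  # word error rate
print(f"CER = {cer:.3f}, WER = {wer:.3f}")
```

Passing strings gives a character-level rate, passing token lists gives a word-level rate, so the same function covers both.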

## Try out spellchecker

In [None]:
from enchant.checker import SpellChecker

# British English dictionary; enchant uses tags such as "en_GB" or "en_US"
spell = SpellChecker("en_GB")

In [None]:
words = line.split()

for word in words:
    if not spell.check(word):
        print(word)
        
# what do we do with 's and similar words? they are correct. maybe just merge them with previous word, or remove them? 

In [None]:
spell.check("cat's")

In [None]:
spell.suggest("spel")
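
One option for tokens such as `'s`, as raised in the comment above, is to merge them with the previous word before spellchecking. A minimal sketch, assuming whitespace-tokenised text where clitics are separate tokens; the `CLITICS` set and the `merge_clitics` name are our own assumptions and would need extending.

```python
# Assumed list of clitic tokens; extend as needed for the corpus
CLITICS = {"'s", "n't", "'m", "'re", "'ve", "'ll", "'d"}


def merge_clitics(tokens):
    """Attach clitic tokens such as 's to the preceding token."""
    merged = []
    for token in tokens:
        if token in CLITICS and merged:
            merged[-1] += token
        else:
            merged.append(token)
    return merged


tokens = "the European Parliament 's recommendations".split()
print(merge_clitics(tokens))
```

Merging changes the token indices, so it would have to be applied to both the gold and the corrupted text before any word positions are recorded.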

# Read in and corrupt the Europarl dataset

Character elision is when letter pairs and individual letters are confused by the software. These types of errors occur any time pairs of letters are shaped similarly to other letters. Six common pairs are:

rn <-> m, cl <-> d, vv <-> w, ol <-> d, li <-> h, nn <-> m

In [336]:
import numpy as np
import re

In [337]:
elisionArray = []
elisionArray.append(['rn', 'm'])
elisionArray.append(['ol', 'd'])
elisionArray.append(['cl', 'd'])
elisionArray.append(['vv', 'w'])
elisionArray.append(['li', 'h'])
elisionArray.append(['nn', 'm'])
elisionArray = np.array(elisionArray)

In [413]:
# Read in gold standard text, and a (synthetic) corrupted OCR dataset

def read_data(corpus_file, corpus_encoding, max_lines, elisionArray, elision_prob):
    
    total_errors = 0 
    gold_standard = []
    corrupted_data = []
    
    with open(corpus_file, encoding = corpus_encoding) as f:
        
        for d, line in enumerate(f):
        
            if d == max_lines:
                break
        
            gold_standard.append(line) 
            new_line = line
            
            # Randomize search order, since some letter pairs overlap 
            elisionArray = np.random.permutation(elisionArray)
            for pair in elisionArray:
                
                # Count number of times each letter pair has been corrupted (since this changes the line length)
                n_errors = 0 
                
                for m in re.finditer(pair[0], new_line):
                    
                    rd = np.random.rand(1)
                    if rd < elision_prob:
                        
                        print("Line ", d, " - found ", pair[0], " at position ", m.start())
                        
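                        # Replace the letter pair directly. Each earlier replacement in
                        # this pass shortened the line by one character, so shift the
                        # match position (which refers to the line before this pass).
                        start = m.start() - n_errors
                        new_line = new_line[:start] + pair[1] + new_line[start + len(pair[0]):]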

                        # count number of replacements
                        n_errors += 1
                     
            corrupted_data.append(new_line)

    return gold_standard, corrupted_data

In [414]:
file = r"C:\Users\saran\OneDrive\Dokument\GitHub\NLP\Project5\europarl.txt"
encoding = "utf-8"

# Probability that a letter pair from our list will be confused if it is seen in the text
elision_prob = 0.5

#np.random.seed(None)
original, corrupted = read_data(file, encoding, 6, elisionArray, elision_prob)

Line  3  - found  ol  at position  59
Line  5  - found  cl  at position  59


In [415]:
original

["I therefore agree with the European Parliament 's recommendations to the World Bank in this area .\n",
 'The Commissioner responsible , Jacques Barrot , has promised to present an informative report by the end of July , and our group was keen to wait for this .\n',
 'Resumption of the session\n',
 'I am pleased that , in dialogue with the institutions , a solution has successfully been found that can satisfy everybody , or at least I hope it can , and I would thank you for your constructive work in this process .\n',
 'It has only done so when faced with intense pressure from dairy producers , the European Parliament and 21 Member States .\n',
 'These measures have also been relevant for the textile and clothing industry : for instance , the Globalisation Fund support has been used to reintegrate workers laid off in mostly small and medium-sized enterprises of the sector in Italy , Malta , Spain , Portugal , Lithuania and Belgium .\n']

In [416]:
corrupted

["I therefore agree with the European Parliament 's recommendations to the World Bank in this area .\n",
 'The Commissioner responsible , Jacques Barrot , has promised to present an informative report by the end of July , and our group was keen to wait for this .\n',
 'Resumption of the session\n',
 'I am pleased that , in dialogue with the institutions , a sdution has successfully been found that can satisfy everybody , or at least I hope it can , and I would thank you for your constructive work in this process .\n',
 'It has only done so when faced with intense pressure from dairy producers , the European Parliament and 21 Member States .\n',
 'These measures have also been relevant for the textile and dothing industry : for instance , the Globalisation Fund support has been used to reintegrate workers laid off in mostly small and medium-sized enterprises of the sector in Italy , Malta , Spain , Portugal , Lithuania and Belgium .\n']

In [476]:
# A test sentence that contains a lot of 'rn' and 'li'
line = "I scorn this little barn with my lilac yarn and smiling fern"
print(line)

new_line = line
elision_prob = 0.5

# Randomize search order, since some letter pairs overlap 
elisionArray = np.random.permutation(elisionArray)
for pair in elisionArray:
    
    n_errors = 0
    
    for m in re.finditer(pair[0], new_line):
        
        rd = np.random.rand(1)
        if rd < elision_prob:
            
            # Note that the position can't be used later since the line changes length!
            print("--> Replaced ", pair[0], " at position ", m.start(), "")
            
            # Shift the match position by the number of earlier replacements,
            # since each one shortened the line by one character
            start = m.start() - n_errors
            new_line = new_line[:start] + pair[1] + new_line[start + len(pair[0]):]
            
            print(new_line)

            # count number of replacements
            n_errors += 1
            
print('-'*80)

# Save location and correct spelling for the corrupted words

line = list(line.split())
new_line = list(new_line.split())

total_errors = 0
ground_truth = []

# This will be the index of a document in the corpus
doc_num = 0     

# This contains the ground truth for each document 
tmp = []

for j in range(len(line)):
    if line[j] != new_line[j]:
        
        total_errors += 1
        tmp.append((j, line[j]))
        
ground_truth.append((doc_num, tmp))
print("Total errors: ", total_errors)
print(ground_truth)

I scorn this little barn with my lilac yarn and smiling fern
--> Replaced  rn  at position  5 
I scom this little barn with my lilac yarn and smiling fern
--> Replaced  rn  at position  41 
I scom this little barn with my lilac yam and smiling fern
--> Replaced  rn  at position  58 
I scom this little barn with my lilac yam and smiling fem
--> Replaced  li  at position  32 
I scom this little barn with my hlac yam and smiling fem
--> Replaced  li  at position  49 
I scom this little barn with my hlac yam and smihng fem
--------------------------------------------------------------------------------
Total errors:  5
[(0, [(1, 'scorn'), (7, 'lilac'), (8, 'yarn'), (10, 'smiling'), (11, 'fern')])]


In [None]:
ignore = [",", ".", '"', "(", ")", "-", "'", "!", "?", ":", ";", "/", "n't", "'s", "'m"]

In [396]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# `line` must be a plain string here (the test cell above rebinds it to a token list)
words = line.split()

for word in words:
    if word not in ignore and not spell.check(word):
        
        # also check that the 'misspelled' word is not someone's name - check with NER on the line
        # (should probably not remove parts of the line before running the NER)
        
        print(word)

In [None]:
#import nltk
#from nltk.corpus import stopwords
#stop = stopwords.words('english')

We should save the locations of the corrupted words, along with their original spelling.

Two approaches: either throw away the misspelled word and try to fill it in from context, or use character-level embeddings.

# Compute baseline accuracy using spellcheck suggestions

Baseline: use the top suggested word from the enchant spellchecker.
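
A minimal sketch of how this baseline could be scored, assuming the `ground_truth` layout from the test cell above, i.e. `[(doc_num, [(word_index, correct_word), ...])]`. The function `baseline_accuracy` is our own name, and `toy_suggest` with its tiny `VOCAB` is a made-up stand-in for `spell.suggest` so that the sketch runs without enchant.

```python
import difflib


def baseline_accuracy(corrupted_docs, ground_truth, suggest):
    """Fraction of corrupted words restored by the top suggestion.

    corrupted_docs: list of token lists, one per document
    ground_truth:   [(doc_num, [(word_index, correct_word), ...]), ...]
    suggest:        callable mapping a word to a ranked list of candidates
    """
    correct = total = 0
    for doc_num, errors in ground_truth:
        for index, gold_word in errors:
            candidates = suggest(corrupted_docs[doc_num][index])
            if candidates and candidates[0] == gold_word:
                correct += 1
            total += 1
    return correct / total if total else 0.0


# Hypothetical stand-in for spell.suggest: rank words from a tiny,
# made-up vocabulary by string similarity.
VOCAB = ["scorn", "lilac", "yarn", "smiling", "fern", "barn", "little"]


def toy_suggest(word):
    return difflib.get_close_matches(word, VOCAB, n=3, cutoff=0.0)


corrupted_docs = ["I scom this little barn with my hlac yam and smihng fem".split()]
ground_truth = [(0, [(1, "scorn"), (7, "lilac"), (8, "yarn"), (10, "smiling"), (11, "fern")])]
print(baseline_accuracy(corrupted_docs, ground_truth, toy_suggest))
```

With enchant installed, `spell.suggest` could be passed in place of `toy_suggest`. This only scores words already known to be corrupted; detection errors would need a separate measure.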

# Try to fill in missing words using BERT 
(like in the article)
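
As a first step towards this, each flagged word can be replaced by BERT's `[MASK]` token so that a masked language model can predict it from context. The sketch below only builds the masked sentences; the helper name `mask_one_at_a_time` is our own, and the model call itself (for example through a fill-mask pipeline) is left out.

```python
MASK = "[MASK]"  # mask token used by BERT's tokenizer


def mask_one_at_a_time(tokens, error_positions, mask=MASK):
    """Yield (position, sentence) pairs with exactly one flagged word masked."""
    for position in error_positions:
        masked = list(tokens)
        masked[position] = mask
        yield position, " ".join(masked)


tokens = "I scom this little barn with my hlac yam".split()
for position, sentence in mask_one_at_a_time(tokens, [1, 7, 8]):
    print(position, sentence)
```

Masking one word at a time keeps each input to a single mask, but the other corrupted words stay in the context, which may hurt the predictions.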

# Try some other word embeddings

# Try some character level embeddings
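
To get a feeling for why character-level information could help, the sketch below splits words into character n-grams with boundary markers `<` and `>`, in the style of FastText subwords, and compares the n-gram sets of two words. It does not train any embeddings; `char_ngrams` and `ngram_overlap` are our own helper names, and the n-gram range 3 to 6 is an assumed default.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word padded with boundary markers < and >."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(ngram_overlap("scorn", "scom"))  # corrupted form of the same word
print(ngram_overlap("scorn", "fern"))  # a different word
```

A corrupted word still shares several n-grams with its original, which is the property a character-level model could exploit without throwing the misspelled word away.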