# Assessing word embeddings for improving OCR accuracy

I will work on this project with Adnan Fazlinovic. 

Our idea is to compare how different embeddings affect OCR accuracy, similar to the approach taken in this article: https://medium.com/states-title/using-nlp-bert-to-improve-ocr-accuracy-385c98ae174c

We would like to try some of the embeddings covered in the course, such as CBoW, word2vec, FastText, ELMo, and BERT. If time allows and we find other interesting representations online, we will evaluate them as well. The idea is to establish a baseline OCR accuracy and then explore whether, and by how much, it can be improved by applying word embeddings to the incorrectly scanned words. As datasets we want to use images of machine-written English text (suggestions for good datasets would be appreciated). The image quality should be consistent; if it is not, the focus will shift too much toward the image-scanning part and away from the NLP application.

If we don't find any good image resource dataset, we could use a dataset from the course and randomly corrupt a subset of the words in each text. 

A preliminary title is "Assessing word embeddings for improving OCR accuracy"

- Probably use a synthetic dataset (read in europarl corpus and corrupt it a bit)
- Richard suggests corrupting it in ways that often happen in OCR, like rn <-> m, i <-> l, cl <-> d 
    - (https://scribenet.com/articles/2016/03/04/how-to-get-the-most-out-of-ocr)
- I think he means we should use a spellchecker to find misspelled words, and check with NER that they are not a name
- Then maybe we don't throw out the misspelled word, it can be useful

It feels like we're mostly making a spell/grammar checking model - that's fine

- Richard thinks that character level embeddings can be useful here
- Evaluation of performance: see what people usually use within OCR. Some ideas:

    - https://www.aclweb.org/anthology/I17-1101.pdf
    - https://loicbarrault.github.io/papers/afli_cicling2015.pdf
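
One metric that is commonly reported for OCR output is the character error rate (CER): the edit distance between the reference text and the OCR output, divided by the reference length. Whether the two papers above use exactly this definition has not been checked here, so the following is only a minimal sketch under that assumption. The function names `edit_distance` and `char_error_rate` are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

The same routine gives a word error rate if it is applied to lists of tokens instead of strings, since it only relies on indexing and equality.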

## Try out spellchecker

In [88]:
from enchant.checker import SpellChecker

spell = SpellChecker("en-UK")

In [89]:
words = line.split()

for word in words:
    if not spell.check(word):
        print(word)
        
# what do we do with 's and similar words? they are correct. maybe just merge them with previous word, or remove them? 

's


In [90]:
spell.check("cat's")

True

In [91]:
spell.suggest("spel")

['spiel',
 'spelt',
 'spell',
 'seel',
 'spec',
 'sped',
 'spew',
 'Opel',
 'sp el',
 'sp-el']

# Read in and corrupt the Europarl dataset

Character elision is when letter pairs and individual letters are confused by the software. These types of errors occur any time pairs of letters are shaped similarly to other letters. Six common pairs are:

rn <-> m, cl <-> d, vv <-> w, ol <-> d, li <-> h, nn <-> m

(could consider i <-> l as well)

In [109]:
# probability that an observed string pair will be confused when reading the text 
confusion_prob = 1

In [93]:
import numpy as np

# each row is [original string, string it gets misread as]
confusionMatrix = []
confusionMatrix.append(['rn', 'm'])
confusionMatrix.append(['ol', 'd'])
confusionMatrix.append(['cl', 'd'])
confusionMatrix.append(['vv', 'w'])
confusionMatrix.append(['li', 'h'])
confusionMatrix.append(['nn', 'm'])
confusionMatrix = np.array(confusionMatrix)

In [94]:
file = r"C:\Users\saran\OneDrive\Dokument\GitHub\NLP\Project5\europarl.txt"
encoding = "utf-8"

with open(file, encoding = encoding) as f:
    for i, line in enumerate(f):
        print(line)
        
        if i == 5:
            break

I therefore agree with the European Parliament 's recommendations to the World Bank in this area .

The Commissioner responsible , Jacques Barrot , has promised to present an informative report by the end of July , and our group was keen to wait for this .

Resumption of the session

I am pleased that , in dialogue with the institutions , a solution has successfully been found that can satisfy everybody , or at least I hope it can , and I would thank you for your constructive work in this process .

It has only done so when faced with intense pressure from dairy producers , the European Parliament and 21 Member States .

These measures have also been relevant for the textile and clothing industry : for instance , the Globalisation Fund support has been used to reintegrate workers laid off in mostly small and medium-sized enterprises of the sector in Italy , Malta , Spain , Portugal , Lithuania and Belgium .



In [95]:
print(line)

These measures have also been relevant for the textile and clothing industry : for instance , the Globalisation Fund support has been used to reintegrate workers laid off in mostly small and medium-sized enterprises of the sector in Italy , Malta , Spain , Portugal , Lithuania and Belgium .



In [96]:
for pair in confusionMatrix:
    line = line.replace(pair[0], pair[1])

In [97]:
print(line)

These measures have also been relevant for the textile and dothing industry : for instance , the Globahsation Fund support has been used to reintegrate workers laid off in mostly small and medium-sized enterprises of the sector in Italy , Malta , Spain , Portugal , Lithuania and Belgium .
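
The replacement loop in cell 96 rewrites every occurrence of each pair and does not use `confusion_prob`. A possible variant that corrupts each occurrence independently with a given probability, and optionally in both directions, is sketched below. The names `CONFUSION_PAIRS`, `corrupt`, `prob` and `both_directions` are assumptions for this sketch, not part of the notebook.

```python
import random
import re

# Confusion pairs from the list above: (original string, misread string).
CONFUSION_PAIRS = [("rn", "m"), ("ol", "d"), ("cl", "d"),
                   ("vv", "w"), ("li", "h"), ("nn", "m")]


def corrupt(text, pairs=CONFUSION_PAIRS, prob=0.5, both_directions=False, rng=None):
    """Replace each occurrence of a confusable string with probability `prob`."""
    rng = rng or random.Random()

    # Map every source string to the strings it may be misread as.
    targets = {}
    for a, b in pairs:
        targets.setdefault(a, []).append(b)
        if both_directions:
            targets.setdefault(b, []).append(a)

    # Longer sources first, so two-letter strings win over single letters.
    pattern = re.compile("|".join(
        re.escape(s) for s in sorted(targets, key=len, reverse=True)))

    def swap(match):
        source = match.group(0)
        if rng.random() < prob:
            return rng.choice(targets[source])
        return source

    # Single left-to-right pass, so a replacement is never corrupted again.
    return pattern.sub(swap, text)
```

With `prob=1.0` and the default direction this gives the same result as the loop on the example line, for instance `corrupt("clothing", prob=1.0)` returns `"dothing"`. Passing a seeded `random.Random` as `rng` makes a corrupted corpus reproducible.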



In [100]:
ignore = [",", ".", '"', "(", ")", "-", "'", "!", "?", ":", ";", "/", "n't", "'s", "'m"]

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [104]:
words = line.split()

for word in words:
    if word not in ignore and not spell.check(word):
        
        # also check that the 'misspelled' word is not someone's name - check with NER on the line
        # (should probably not remove parts of the line before running the NER)
        
        print(word)

dothing
Globahsation


In [99]:
#import nltk
#from nltk.corpus import stopwords
#stop = stopwords.words('english')

LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - 'C:\\Users\\saran/nltk_data'
    - 'C:\\Users\\saran\\anaconda3\\nltk_data'
    - 'C:\\Users\\saran\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\saran\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\saran\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


We should save the locations of the corrupted words, and what they were before corruption.
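
One simple way to keep this record is to compare the original and corrupted line token by token. The sketch below assumes that corruption never adds or removes whitespace (true for the letter-pair confusions listed earlier), so tokens can be aligned by position. The function name `corruption_log` is hypothetical.

```python
def corruption_log(original_line, corrupted_line):
    """Return (token_index, original_token, corrupted_token) for every
    token that differs between the two lines."""
    original = original_line.split()
    corrupted = corrupted_line.split()
    if len(original) != len(corrupted):
        # Positional alignment only works when the token counts match.
        raise ValueError("token counts differ; lines cannot be aligned by position")
    return [(i, o, c)
            for i, (o, c) in enumerate(zip(original, corrupted))
            if o != c]
```

The resulting triples can serve both as the list of positions to correct and as the ground truth when scoring a correction method.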

Two approaches: either throw away the misspelled word and try to fill it in from context, or keep it and use character-level embeddings.

# Compute baseline accuracy using spellcheck suggestions

Baseline: use the top suggested word from the enchant spellchecker.
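
A minimal sketch of how this baseline could be scored, assuming we have a list of (original word, corrupted word) pairs from the corruption step. The name `baseline_accuracy` is hypothetical, and `suggest` stands for any callable that returns a ranked list of suggestions, such as the `spell.suggest` method used earlier.

```python
def baseline_accuracy(pairs, suggest):
    """Fraction of corrupted words whose top suggestion equals the original.

    pairs   -- iterable of (original_word, corrupted_word)
    suggest -- callable mapping a word to a ranked list of suggestions
    """
    total = 0
    correct = 0
    for original, corrupted in pairs:
        total += 1
        suggestions = suggest(corrupted)
        # A word with no suggestions counts as a miss.
        if suggestions and suggestions[0] == original:
            correct += 1
    return correct / total if total else 0.0
```

Passing the suggester in as an argument keeps the scoring code independent of the spellchecker, so the same function can later score the embedding-based methods.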

# Try to fill in missing words using BERT 
(like in the article)
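
A rough sketch of the idea: replace the suspect word with BERT's `[MASK]` token, ask a masked language model for candidates, and then prefer the candidate that looks most like the corrupted word, so the misspelling is not thrown away entirely. Everything here is an assumption for illustration: `predict` is a hypothetical callable that maps a masked sentence to a list of candidate words (it could wrap a BERT fill-mask model), and re-ranking by `difflib` similarity is one possible choice, not necessarily what the article does.

```python
from difflib import SequenceMatcher


def mask_token(tokens, index, mask="[MASK]"):
    """Return the sentence with the token at `index` replaced by the mask."""
    return " ".join(mask if i == index else t for i, t in enumerate(tokens))


def rerank(corrupted, candidates):
    """Order candidates by string similarity to the corrupted word,
    most similar first."""
    return sorted(candidates,
                  key=lambda c: SequenceMatcher(None, corrupted, c).ratio(),
                  reverse=True)


def fill_from_context(tokens, index, predict):
    """Pick a replacement for tokens[index].

    predict -- hypothetical callable: masked sentence -> list of candidate words
    """
    candidates = predict(mask_token(tokens, index))
    ranked = rerank(tokens[index], candidates)
    # Fall back to the corrupted word if the model proposes nothing.
    return ranked[0] if ranked else tokens[index]
```

Keeping the model behind `predict` means the masking and re-ranking logic can be tried out with a stub before any model is downloaded.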

# Try some word embeddings