# Post OCR

First task: distil a lexicon of good words from the corpus.

Intuitition: make a list of the bi- and trigrams of letters, select the most freqent of these, and weed out the ones that 
cannot be part of words and are clearly ocr mistakes.

Then find all words in the corpus that consist of such bi- and trigrams.

We will miss rare words of which no correct form exists in the corpus.

We may try to correct such words by replacing their faulty bi- or trigrams by corrected ones.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import collections
import re
from tf.app import use

In [3]:
POST_DIR = os.path.expanduser("~/github/Dans-labs/clariah-dr/postocr")
if not os.path.exists(POST_DIR):
    os.makedirs(POST_DIR, exist_ok=True)

In [64]:
A = use("Dans-labs/clariah-dr/tf/daghregister/004/0.1:clone", checkout="clone", hoist=globals())

This is Text-Fabric 9.1.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

13 features found and 0 ignored


Generic function to show a frequency distribution of data.

Data is a dict where the keys are frequencies and the values are the amounts of items that have that frequency.

In [65]:
def showDistribution(data, itemLabel, amountLabel):
    buckets = collections.Counter()
    for (freq, nItems) in data.items():
        bucket = freq
        for n in range(7, 0, -1):
            limit = 10 ** n
            if freq >= limit:
                bucket = int(freq / limit) * limit
                break
        buckets[bucket] += nItems
    for (bucket, nItems) in sorted(buckets.items(), key=lambda x: -x[0]):
        plural = " " if nItems == 1 else "s"
        print(f"{nItems:>7} {itemLabel}{plural} with {amountLabel} >= {bucket:>7}")

We walk through the corpus and harvest the 2- and 3- lettergrams.
For each gram we store information about the forms they occur in and how often they occur overall.

More precisely:

* `GRAM[n]["form"]` gives per n-gram a dict keyed by word forms that contain it and how often

In [66]:
WORD_OCCS = collections.defaultdict(list)
GRAM = {
    2: collections.defaultdict(list),
    3: collections.defaultdict(list),
}

GRAM_INDEX = {}

CHARS = collections.Counter()

In [67]:
def getGrams():
    CHARS.clear()
    WORD_OCCS.clear()
    GRAM_INDEX.clear()
    for (n, grams) in GRAM.items():
        grams.clear()
            
    ns = list(GRAM)
    
    allWords = F.otype.s("word")
    with open(f"{POST_DIR}/forms.tsv", "w") as fh:
        for w in allWords:
            letters = F.letters.v(w)
            for c in letters:
                CHARS[c] += 1
            WORD_OCCS[letters].append(w)
            (volume, page, line) = T.sectionFromNode(w)
            fh.write(f"{page}\t{line}\t{letters}\n")
            lower = f" {letters.lower()} "
            if letters in GRAM_INDEX:
                continue
            index = collections.defaultdict(list)
            GRAM_INDEX[letters] = index
            myGrams = {n: set() for n in GRAM}
            for (i, c) in enumerate(letters):
                for n in ns:
                    first = i - n + 1
                    if first >= 0:
                        gram = letters[first:i + 1]
                        myGrams[n].add(gram)

            for (n, grams) in myGrams.items():
                for gram in grams:
                    GRAM[n][gram].append(letters)
                    index[n].append(gram)
            
    print("CHARACTERS:")
    for (c, freq) in sorted(CHARS.items(), key=lambda x: (-x[1], x[0])):
        print(f"{c} {freq:>7}")
    charRep = "".join(sorted(CHARS))
    print(charRep)
    print(f"{len(allWords)} words in {len(WORD_OCCS)} distinct forms")
    for (n, grams) in GRAM.items():
        print(f"{len(grams)} {n}-grams")

In [68]:
getGrams()

CHARACTERS:
e  245074
n  129421
a   85020
t   74078
o   72160
d   70758
r   69171
s   56507
l   40368
c   37938
i   37431
g   33067
v   29973
h   29719
m   27402
y   22528
u   19452
p   16292
w   14298
b   14233
k   12423
f    7391
'    3866
0    3297
j    3203
C    2974
1    2866
2    2694
M    2161
G    2066
E    1994
S    1910
P    1472
4    1460
D    1451
B    1379
x    1379
5    1368
J    1367
A    1352
3    1289
q    1199
H    1190
T    1111
6    1101
8    1029
7     873
N     746
O     725
L     718
z     657
9     605
U     598
W     596
V     579
I     566
(     541
R     509
^     439
K     310
*     273
F     265
.     231
Q     168
ü     158
é     154
ë      98
)      94
:      88
Z      70
-      68
Y      57
/      53
ê      52
è      46
ï      40
Ë      35
ó      28
ö      26
Ü      22
[      19
\      18
«      18
X      16
"      15
;      15
<      15
>      15
!      14
|       8
£       6
$       4
&       4
?       4
]       4
Ö       4
»       3
È       3
Ó       

We have the bi- and trigrams now.

# OCR key

I compute the OCR keys of all forms, in order to see whether illegal words have counterparts with the same OCR key that are legal.
If so, we con choose the one with the minimum edit distance as a correction.

In [72]:
CHAR_CLASSES = """
*0 •™_~"[
i1 fijklrtBDEFIJKLPRT1!ïÈÉËÏ£|!\
i2 nhuHNUüÜ«°]
i3 mM
o1 abdgopqOQ690óöÓÖ()»}#&><^
c1 ecCGèéêë€*®?
v1 vxyVXY
v2 ww
s1 sS5$§
z1 zZ
21 2%
a1 A
"""

OCR_KEY = {}

for line in CHAR_CLASSES.strip().split("\n"):
    (clsCard, chars) = line.split(" ", 1)
    (cls, card) = clsCard
    for c in chars:
        OCR_KEY[c] = (cls, card)
        
        
def getOcrKey(letters):
    clses = []
    for c in letters:
        (cls, card) = OCR_KEY.get(c, (c, 1))
        if clses and clses[-1][0] == cls:
            clses[-1][1] += card
        else:
            clses.append([cls, card])
    return "".join(f"{cls}{card}" for (cls, card) in clses)

getOcrKey("amw")

'o1i3v2'

Let's make an index of the word forms by ocr key.

In [75]:
WORD_OCR = collections.defaultdict(list)

def makeOcrIndex():
    WORD_OCR.clear()
    
    for word in WORD_OCCS:
        WORD_OCR[getOcrKey(word)].append(word)
        
    print(f"{len(WORD_OCCS)} words clustered into {len(WORD_OCR)} ocr keys") 

In [76]:
makeOcrIndex()

24717 words clustered into 16827 ocr keys


Let's the amount of word froms they occur in.
We show it in two ways:

* only looking at the amount of distinct words the n-grams occur in
* taking into account the frequencies of the words the n-grams occur in

In [32]:
def occFreq(n, gram):
    forms = GRAM[n][gram]
    return sum(len(WORD_OCCS[form]) for form in forms)
               
               
def distFreq():
    for (n, grams) in GRAM.items():
        fileName = f"{n}-gram-info.tsv"
        itemLabel = f"{n}-gram"
        amountLabel = f"frequency of word occurrences"
        print(f"{len(grams)} letter {itemLabel}s by {amountLabel}")
        distribution = collections.Counter()
        with open(f"{POST_DIR}/{fileName}", "w") as fh:
            for (gr, forms) in sorted(grams.items(), key=lambda x: -occFreq(n, x[0])):
                examples = list(forms)[0:3]
                exampleRep = "\t".join(examples)
                nForms = len(forms)
                nOccs = occFreq(n, gr)
                distribution[nOccs] += 1
                fh.write(f"{gr}\t{nForms}\t{nOccs}\t{exampleRep}\n")
        showDistribution(distribution, itemLabel, amountLabel)

In [33]:
distFreq()

1746 letter 2-grams by frequency of word occurrences
      2 2-grams with frequency of word occurrences >=   60000
      1 2-gram  with frequency of word occurrences >=   50000
      1 2-gram  with frequency of word occurrences >=   40000
      2 2-grams with frequency of word occurrences >=   30000
      4 2-grams with frequency of word occurrences >=   20000
     24 2-grams with frequency of word occurrences >=   10000
      5 2-grams with frequency of word occurrences >=    9000
      6 2-grams with frequency of word occurrences >=    8000
      5 2-grams with frequency of word occurrences >=    7000
      6 2-grams with frequency of word occurrences >=    6000
     18 2-grams with frequency of word occurrences >=    5000
     15 2-grams with frequency of word occurrences >=    4000
     27 2-grams with frequency of word occurrences >=    3000
     28 2-grams with frequency of word occurrences >=    2000
     69 2-grams with frequency of word occurrences >=    1000
     12 2-grams w

# Legal grams

We try to weed out grams that cannot occur in real words.

We leave out grams that have a too low frequency and grams that have illegal characters in them.

We might leave out legal grams in this process!

In [34]:
LIMIT = {
    2: 10,
    3: 10,
}

In [35]:
LEGAL_GRAM = {
    2: set(),
    3: set(),
}

In [36]:
impureRe = re.compile(r"""[\\/(^0-9.,*:<>()•«!\[\]"]""")

In [37]:
def getLegals():
    for n in GRAM:
        LEGAL_GRAM[n] = set()
        
    for (n, grams) in GRAM.items():
        limit = LIMIT[n]
        legals = LEGAL_GRAM[n]
        for (gram, forms) in grams.items():
            freq = occFreq(n, gram)
            if freq >= limit and not impureRe.search(gram):
                legals.add(gram)
        
    for (n, grams) in LEGAL_GRAM.items():
        print(f"{len(grams)} legal {n}-grams")
        with open(f"{POST_DIR}/legal-{n}-grams.tsv", "w") as fh:
            distribution = collections.Counter()
            for gram in sorted(grams, key=lambda x: occFreq(n, x)):
                freq = occFreq(n, gram)
                distribution[freq] += 1
                fh.write(f"{gram}\t{freq}\n")
            showDistribution(distribution, f"legal {n}-gram", "occurrences")

In [38]:
getLegals()

570 legal 2-grams
      2 legal 2-grams with occurrences >=   60000
      1 legal 2-gram  with occurrences >=   50000
      1 legal 2-gram  with occurrences >=   40000
      2 legal 2-grams with occurrences >=   30000
      4 legal 2-grams with occurrences >=   20000
     24 legal 2-grams with occurrences >=   10000
      5 legal 2-grams with occurrences >=    9000
      6 legal 2-grams with occurrences >=    8000
      5 legal 2-grams with occurrences >=    7000
      6 legal 2-grams with occurrences >=    6000
     18 legal 2-grams with occurrences >=    5000
     15 legal 2-grams with occurrences >=    4000
     27 legal 2-grams with occurrences >=    3000
     26 legal 2-grams with occurrences >=    2000
     67 legal 2-grams with occurrences >=    1000
     12 legal 2-grams with occurrences >=     900
      7 legal 2-grams with occurrences >=     800
     13 legal 2-grams with occurrences >=     700
     12 legal 2-grams with occurrences >=     600
     14 legal 2-grams with occur

# Legal words

We now distil the words that are legal, by selecting the words whose bi- and trigrams are all legal.

In fact, we compute something slightly more general: for each word we compute its legality.

The legality of a word is the percentage of legal grams with respect to the total number of grams in it.

In [39]:
LEGAL_FORM = {}

In [40]:
def getLegality():
    LEGAL_FORM.clear()
    
    for (form, info) in GRAM_INDEX.items():
        legal = 0
        for (n, grams) in info.items():
            legal += int(round(100 * sum(1 for gram in grams if gram in LEGAL_GRAM[n]) / len(grams)))
        legal = int(round(legal / 2))
        LEGAL_FORM[form] = legal
        
    print(f"{len(GRAM_INDEX)} word forms with legality distributed as:")
    with open(f"{POST_DIR}/legality.tsv", "w") as fh:
        distribution = collections.Counter()
        for (form, leg) in sorted(LEGAL_FORM.items(), key=lambda x: (-x[1], x[0])):
            fh.write(f"{form}\t{leg}\n")
            distribution[leg] += 1
        showDistribution(distribution, f"word form", "legality")

In [41]:
getLegality()

23995 word forms with legality distributed as:
  16442 word forms with legality >=     100
   3267 word forms with legality >=      90
   1493 word forms with legality >=      80
    714 word forms with legality >=      70
    396 word forms with legality >=      60
    238 word forms with legality >=      50
    120 word forms with legality >=      40
     39 word forms with legality >=      30
     68 word forms with legality >=      20
    111 word forms with legality >=      10
      2 word forms with legality >=       8
      3 word forms with legality >=       7
      2 word forms with legality >=       6
   1100 word forms with legality >=       0
