# PDF-text extraction and OCR validation

This Jupyter notebook extracts text from PDF files and validates the quality of their Optical Character Recognition (OCR). It relies on two groups of libraries:

1. `PyMuPDF`: reads the OCR-ed PDF files and retrieves their text content, which can then be processed or analyzed.
2. `sklearn` and `textdistance`: validate the quality of the OCR by comparing the text extracted from the OCR-ed PDF files with a master version of the text (the ground truth). The comparison exposes discrepancies and errors introduced by the OCR process.

The result is a pandas DataFrame with one row per master transcription, holding the OCR validation metrics for the matching PDF page. The DataFrame is then saved as a pickled Python object in a `.pkl` file for later use or analysis.
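To make the two metrics reported later (`cer` and `cosine_similarity`) more concrete, here is a minimal, standard-library-only sketch of how such scores can be computed. It is not the implementation behind `calc_error_rate`, which is not shown in this notebook and presumably delegates to `sklearn` and `textdistance`. The function names, the whitespace tokenization, and the choice to express the character error rate as a percentage are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance relative to the reference length, as a percentage (assumed scale)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100 * levenshtein(reference, hypothesis) / len(reference)


def cosine_similarity(reference: str, hypothesis: str) -> float:
    """Cosine of the angle between the word-count vectors of both texts."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    dot = sum(ref_counts[token] * hyp_counts[token] for token in ref_counts)
    norm = sqrt(sum(v * v for v in ref_counts.values())) * sqrt(
        sum(v * v for v in hyp_counts.values())
    )
    return dot / norm if norm else 0.0
```

The character error rate penalizes every wrong character, while the cosine similarity only looks at which words occur how often, so a page with shuffled reading order can still score high on the latter.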

In [11]:
import pandas as pd

from ssrq_retro_lab.config import PROJECT_ROOT
from ssrq_retro_lab.repository import reader
from ssrq_retro_lab.validate.ocr import calc_error_rate, ErrorRate

In [12]:
import json

txt_pdf_conversion_table = json.loads(
    reader.TextReader((PROJECT_ROOT / "data/ZG/txt_to_pdf.json")).read()
)
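The conversion table maps each master transcription to the PDF page it was transcribed from. The sketch below shows the nesting that the lookup in the main loop implies (volume name, then transcription name, then page). The excerpt is hypothetical: its two entries reuse values from the results table of this notebook, the helper `lookup_page` is invented for illustration, and whether the real file stores pages as numbers or as strings is not known, which is why the value is passed through `int()`.

```python
import json

# Hypothetical excerpt of txt_to_pdf.json; the structure is inferred, not copied.
sample_json = """
{
  "ZG_1_1": {
    "ZG_1_1_6": 126,
    "ZG_1_1_7": 380
  }
}
"""

sample_table = json.loads(sample_json)


def lookup_page(table: dict, volume_name: str, transcription_name: str) -> int:
    """Return the PDF page recorded for a transcription of the given volume."""
    return int(table[volume_name][transcription_name])


page = lookup_page(sample_table, "ZG_1_1", "ZG_1_1_6")
```

A missing volume or transcription raises a `KeyError`, which surfaces gaps in the table early instead of silently skipping pages.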

In [13]:
from fitz_new import Document

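def get_page_text_from_pdf(pdf: Document, page: int) -> str:
    """Return the text of one page; `page` is the 0-based index expected by `load_page`."""
    # sort=True orders the extracted text by vertical, then horizontal position.
    return pdf.load_page(page).get_textpage().extractText(sort=True)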


In [14]:
from collections import namedtuple

volumes = list((PROJECT_ROOT / "data/ZG/pdf").glob("*.pdf"))
master_transcriptions = list((PROJECT_ROOT / "data/ZG/master").glob("*[0-9].txt"))

TranscriptInfo = namedtuple("TranscriptInfo", ["name", "page_number", "volume", "cer", "cosine_similarity"])

results: list[TranscriptInfo] = []

for volume in volumes:
    doc = reader.PDFReader(volume).read()
    volume_name = volume.name.removesuffix(".pdf").replace(".", "_")
    transcriptions = [
        transcription
        for transcription in master_transcriptions
        if transcription.name.startswith(volume_name)
    ]

    for transcription in transcriptions:
        page_number = int(
            txt_pdf_conversion_table[volume_name][
                transcription.name.removesuffix(".txt")
            ]
        )
        page_text = get_page_text_from_pdf(
            doc,
            page_number,
        )
        master_text = reader.TextReader(transcription).read()
        results.append(
            TranscriptInfo(
                transcription.name.removesuffix(".txt"),
                page_number,
                volume_name,
                *calc_error_rate(master_text, page_text)
            )
        )

df = pd.DataFrame(results)
df

index,name,page_number,volume,cer,cosine_similarity
0,ZG_1_1_10,387,ZG_1_1,2.41566,0.998088
1,ZG_1_1_11,388,ZG_1_1,2.830734,0.982301
2,ZG_1_1_13,437,ZG_1_1,2.335709,0.993726
3,ZG_1_1_12,418,ZG_1_1,3.188839,0.985316
4,ZG_1_1_8,385,ZG_1_1,2.182453,0.994351
5,ZG_1_1_9,386,ZG_1_1,2.552491,0.998155
6,ZG_1_1_15,606,ZG_1_1,2.816901,0.990826
7,ZG_1_1_14,466,ZG_1_1,2.302632,0.996708
8,ZG_1_1_7,380,ZG_1_1,2.636204,0.994389
9,ZG_1_1_6,126,ZG_1_1,4.043752,0.981815


In [15]:
df.to_pickle('./pkl_cache/ocr_quality.pkl')