# Equation Search
## Problem description
1. PyMuPDF4LLM does not handle text-based equation extraction
2. Most equations are located within the level 2 text

## Approach
1. Detect equations on the page
2. Identify the replacement block
3. On success, convert the image of the equation to its LaTeX representation

(1.) is done through a heuristic approach.
(2.) is done by the outline of the following ... where blocks of the equations in level 2
(3.) is done by using a multimodal LLM to convert an image of the page into a LaTeX equation

## Alternative approaches
* Using visual models such as:
    * Docling equation-parser
    * Nougat
    * unstructured.io for formula detection
* Using proprietary services
    * MathPix

In [1]:
import pymupdf

print(pymupdf.__doc__)

PyMuPDF 1.26.3: Python bindings for the MuPDF 1.26.3 library (rebased implementation).
Python 3.13 running on win32 (64-bit).



# Problem
* Based on https://github.com/pymupdf/PyMuPDF/discussions/763
* Start by looking at the PyMuPDF PDF engine and its capabilities
* No built-in equation detection, which is especially problematic for text-based equations

# Pipeline
1. Equation detection
    * (option) verify detection
2. Equation conversion to LaTeX
    * (option) verify conversion
3. Replacing the original equation with the LaTeX equation
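
The three steps and their optional checks can be sketched as one function with pluggable parts. This is a minimal sketch, not the code of the main pipeline: `Detection`, `run_pipeline`, the toy detector and converter, and the `$$...$$` delimiters are all hypothetical names and choices made for illustration. The real converter works on an image of the equation, while the toy one here works on the text fragment.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    """Character span of a suspected equation (hypothetical helper type)."""
    start: int
    end: int


def run_pipeline(
    text: str,
    detect: Callable[[str], list[Detection]],
    convert: Callable[[str], str],
    verify_detection: Optional[Callable[[str], bool]] = None,
    verify_conversion: Optional[Callable[[str], bool]] = None,
) -> str:
    # Step 1: detection. Work from the last span backwards so that earlier
    # offsets stay valid while the text is being modified.
    for det in sorted(detect(text), key=lambda d: d.start, reverse=True):
        original = text[det.start:det.end]
        if verify_detection is not None and not verify_detection(original):
            continue  # optional check rejected the detection
        # Step 2: conversion to LaTeX.
        latex = convert(original)
        if verify_conversion is not None and not verify_conversion(latex):
            continue  # optional check rejected the conversion, keep original
        # Step 3: replacement (the $$ delimiters are an assumption).
        text = text[:det.start] + f"$${latex}$$" + text[det.end:]
    return text


# Toy stand-ins for the real detector and converter.
def toy_detect(text: str) -> list[Detection]:
    start = text.find("a ¼ b")
    return [Detection(start, start + len("a ¼ b"))] if start >= 0 else []


def toy_convert(fragment: str) -> str:
    return fragment.replace("¼", "=")


print(run_pipeline("where a ¼ b holds", toy_detect, toy_convert))
# prints: where $$a = b$$ holds
```

Keeping the verification hooks optional mirrors the "(option)" items in the list: when a check fails, the original text is left untouched.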


## Equation detection
### 1. Heuristic-based equation detection
In our case code examples are not a problem:

Yes, exactly!
In PDF, text is just text. The PDF specification contains nothing to sub-divide different kinds of text. Equations are also text and can be coded in any font, can be italic or normal, mono-spaced or proportional, serifed or sans-serifed.
Also note that the equation symbol appears in program code listings a lot - PyMuPDF.pdf is full of such examples.

So I would say, that you have to develop your own way of recognizing equations ... and whatever you will develop, may not work with the next PDF example.


In [1]:
# [INFO]: MOVED TO MAIN DOCUMENT PIPELINE, NO LONGER MAINTAINED
# all code moved to main pipeline

# Research on regex patterns to find equation on page

In [139]:
snippet_1 = """
1. The capital requirement for earthquake risk shall be equal to the following:


ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
*SCR* *earthquake* ¼ ðX *CorrEQ* ð *r,s* Þ � *SCR* ð *earthquake,r* Þ � *SCR* ð *earthquake,s* Þ Þ þ *SCR* [2] ð *earthquake,other* Þ


s


*CorrEQ* ð *r,s* Þ � *SCR* ð *earthquake,r* Þ � *SCR* ð *earthquake,s* Þ Þ þ *SCR* [2] ð *earthquake,other* Þ


ð *r,s* Þ


where:
"""
snippet_2 = """
following amount:


ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
*L* ð *earthquake,r* Þ ¼ *Q* ð *earthquake,r* Þ � X *Corr* ð *earthquake,r,i,j* Þ � *WSI* ð *earthquake,r,i* Þ � *WSI* ð *earthquake,r,j* Þ


s


*Corr* ð *earthquake,r,i,j* Þ � *WSI* ð *earthquake,r,i* Þ � *WSI* ð *earthquake,r,j* Þ


ð *i,j* Þ


where:

"""
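
The two snippets suggest a pair of cheap signals: a long run of `ffi` (which appears to be what the bar of a square root is extracted as) and a cluster of mis-decoded glyphs such as `¼`, `ð`, `Þ`, `þ` and the replacement character, which appear to stand in for `=`, `(`, `)`, `+` and a multiplication sign. A minimal regex sketch under those assumptions follows. The function names and thresholds are hypothetical and are not the patterns used in the main pipeline.

```python
import re

# A long run of "ffi" is assumed to be the extracted bar of a square root.
RADICAL_BAR = re.compile(r"(?:ffi){5,}")
# Mis-decoded math glyphs seen in the snippets: ¼, ð, Þ, þ and U+FFFD.
MATH_DEBRIS = re.compile("[\u00bc\u00f0\u00de\u00fe\ufffd]")


def looks_like_equation(line: str, min_debris: int = 2) -> bool:
    """Flag a line that carries a radical bar or several debris glyphs."""
    if RADICAL_BAR.search(line):
        return True
    return len(MATH_DEBRIS.findall(line)) >= min_debris


def find_equation_lines(text: str) -> list[int]:
    """Return the 0-based indices of lines that look like equation residue."""
    return [i for i, line in enumerate(text.splitlines()) if looks_like_equation(line)]


sample = (
    "following amount:\n"
    + "ffi" * 20 + "\n"
    + "*L* ð *earthquake,r* Þ ¼ *Q* ð *earthquake,r* Þ\n"
    + "where:\n"
)
print(find_equation_lines(sample))
# prints: [1, 2]
```

As the quoted discussion warns, such patterns are tied to the fonts of this particular PDF and may not carry over to the next document.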

In [None]:
import os

import pymupdf

# `file_name` (PDF name without extension) is expected to be defined in an earlier cell.
pdf_path = os.path.join("data", "raw", "solvency-II-files", f"{file_name}.pdf")

with pymupdf.open(pdf_path) as original_doc:  # detection on original pdf
    for page in [original_doc[272]]:  # 0-based index
        images = page.get_images(full=True)
        # Write every image referenced by the page to disk.
        for image in images:
            xref = image[0]
            image_data = original_doc.extract_image(xref)
            with open(f"image-page-{page.number + 1}-{xref}.{image_data['ext']}", "wb") as img_file:
                img_file.write(image_data["image"])
        print(images)
        print(page.get_xobjects())

        # Write the plain-text extraction of the same page for comparison.
        page_text = page.get_text()
        with open(f"text-page-{page.number + 1}.txt", "w", encoding="utf-8") as text_file:
            text_file.write(page_text)
        # print(page.get_image_info(0)[0])



[(556, 0, 1235, 165, 1, '', '', 'I0', 'CCITTFaxDecode', 0), (1721, 0, 39, 48, 1, '', '', 'I1', 'CCITTFaxDecode', 0), (1720, 0, 34, 64, 1, '', '', 'I2', 'CCITTFaxDecode', 0), (1719, 0, 31, 62, 1, '', '', 'I3', 'CCITTFaxDecode', 0), (1718, 0, 1155, 388, 1, '', '', 'I4', 'CCITTFaxDecode', 0), (1720, 0, 34, 64, 1, '', '', 'I5', 'CCITTFaxDecode', 0), (1719, 0, 31, 62, 1, '', '', 'I6', 'CCITTFaxDecode', 0), (557, 0, 1020, 267, 1, '', '', 'I7', 'CCITTFaxDecode', 0), (1720, 0, 34, 64, 1, '', '', 'I8', 'CCITTFaxDecode', 0), (1719, 0, 31, 62, 1, '', '', 'I9', 'CCITTFaxDecode', 0)]
[]
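
The image list above mixes a few wide images with many small ones that repeat across the page. Assuming the wide 1-bit images are the rendered equations and the small ones are individual symbols, the candidates can be picked out by size. The sketch below works directly on the tuples printed above. `PageImage`, `candidate_equation_xrefs` and both thresholds are hypothetical. The field order follows the tuples returned by `page.get_images(full=True)`.

```python
from typing import NamedTuple


class PageImage(NamedTuple):
    """Field order of the tuples returned by page.get_images(full=True)."""
    xref: int
    smask: int
    width: int
    height: int
    bpc: int
    colorspace: str
    alt_colorspace: str
    name: str
    filter: str
    referencer: int


def candidate_equation_xrefs(raw_images, min_width=200, min_aspect=2.0):
    """Return xrefs of wide images, in page order and without duplicates."""
    xrefs = []
    for raw in raw_images:
        img = PageImage(*raw)
        is_wide = img.width >= min_width and img.width / img.height >= min_aspect
        if is_wide and img.xref not in xrefs:
            xrefs.append(img.xref)
    return xrefs


# Subset of the output above.
page_images = [
    (556, 0, 1235, 165, 1, '', '', 'I0', 'CCITTFaxDecode', 0),
    (1721, 0, 39, 48, 1, '', '', 'I1', 'CCITTFaxDecode', 0),
    (1720, 0, 34, 64, 1, '', '', 'I2', 'CCITTFaxDecode', 0),
    (1718, 0, 1155, 388, 1, '', '', 'I4', 'CCITTFaxDecode', 0),
    (557, 0, 1020, 267, 1, '', '', 'I7', 'CCITTFaxDecode', 0),
]
print(candidate_equation_xrefs(page_images))
# prints: [556, 1718, 557]
```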


# Equation detection through Docling

In [1]:
# from langchain_docling import DoclingLoader

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.do_formula_enrichment = True

converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})

result = converter.convert(r"..\data\raw\test-data\equation-examples\solvency II - level 1 - v2 - equation with noise 68-88.pdf")
doc = result.document

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


OSError: [WinError 1314] A required privilege is not held by the client: '..\\..\\blobs\\bb5f1e798ca9fe0e12f0dad2118ea4e4b9065e0e' -> 'C:\\Users\\bvbraak\\.cache\\huggingface\\hub\\models--ds4sd--docling-layout-old\\snapshots\\b5b4bd59ad2b69aab715e9b1f1dfd74394c45fd4\\README.md'