## The code snippet below uses the Mistral AI OCR model to convert a PDF to Markdown while retaining as much of the PDF's context and structure as possible.

### NOTE: OCR calls are billed according to usage, so the cost grows with the size of the PDF you process. Keep an eye on your API costs.

### Recommendation: Convert each PDF to Markdown once and store the resulting file. Passing the same PDF again will yield the same result but will still cost you.
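
One way to follow this recommendation is to look for a saved Markdown file before calling the OCR API at all. The sketch below is only an illustration and not part of the original script: `load_or_convert` and its `convert` argument are hypothetical names, `convert` stands in for whatever function performs the paid OCR call, and storing the cache next to the PDF with a `.md` suffix is an assumed convention.

```python
from pathlib import Path
from typing import Callable


def load_or_convert(pdf_path: Path, convert: Callable[[Path], str]) -> str:
    """Return cached markdown for pdf_path, calling convert only on a cache miss.

    Hypothetical helper: the cache file sits next to the PDF with a .md suffix.
    """
    cache_path = pdf_path.with_suffix(".md")
    if cache_path.is_file():
        # Cache hit: reuse the stored markdown, so no API call is made.
        return cache_path.read_text(encoding="utf-8")
    # Cache miss: run the (paid) conversion once and store the result.
    markdown = convert(pdf_path)
    cache_path.write_text(markdown, encoding="utf-8")
    return markdown
```

With this in place, something like `load_or_convert(PDF_PATH, process_pdf_to_markdown)` would reach the OCR API only the first time it runs for a given file. The check is by file name only, so it would not notice a PDF whose contents changed under the same name.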

In [12]:
"""
Complete script to:
  1. Process a PDF using Mistral OCR to extract markdown text.
  2. Send the combined OCR markdown to an OpenAI agent that extracts glossary terms
     and definitions—using only the OCR-provided text (word-for-word)—and outputs CSV.
  
Requirements:
  - mistralai, openai, and any other dependencies must be installed.
  - Set MISTRAL_API_KEY and OPENAI_API_KEY below to your actual API keys.
"""

import os
import json
from pathlib import Path
import base64
from mistralai import Mistral
from openai import OpenAI

# ------------------------------------------------------------------------------
# API Keys 
# ------------------------------------------------------------------------------
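# Read the keys from environment variables, or paste them in place of the "" defaults
MISTRAL_API_KEY = os.environ.get("MISTRAL_API_KEY", "")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")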

# File path to your PDF document
PDF_PATH = Path("test.pdf")
assert PDF_PATH.is_file(), f"PDF file not found: {PDF_PATH}"

# Mistral client for OCR processing
mistral_client = Mistral(api_key=MISTRAL_API_KEY)

# OpenAI client for the o3 mini agent
openai_client = OpenAI(api_key=OPENAI_API_KEY)

# ------------------------------------------------------------------------------
# Step 1: Process PDF to Markdown using Mistral OCR
# ------------------------------------------------------------------------------
def process_pdf_to_markdown(pdf_path: Path) -> str:
    # --- Upload the PDF ---
    uploaded_pdf = mistral_client.files.upload(
        file={
            "file_name": pdf_path.name,
            "content": pdf_path.read_bytes(),
        },
        purpose="ocr"
    )
    
    # --- Retrieve a signed URL for the uploaded PDF ---
    signed_url = mistral_client.files.get_signed_url(file_id=uploaded_pdf.id)
    
    # --- Process OCR using the "mistral-ocr-latest" model ---
    ocr_response = mistral_client.ocr.process(
        model="mistral-ocr-latest",
        document={
            "type": "document_url",
            "document_url": signed_url.url,
        },
        include_image_base64=False  # We only need the markdown text output
    )
    
    # --- Extract and combine markdown text from all pages ---
    response_dict = json.loads(ocr_response.model_dump_json())
    pages = response_dict.get("pages", [])
    combined_markdown = "\n\n".join(page.get("markdown", "") for page in pages)
    
    # --- Write the combined markdown to a file so later cells can reuse it ---
    with open("output.md", "w", encoding="utf-8") as f:
        f.write(combined_markdown)
    
    print("OCR complete. Markdown output saved to output.md")
    return combined_markdown


# ------------------------------------------------------------------------------
# Main Execution
# ------------------------------------------------------------------------------

# Process the PDF to extract OCR markdown text
ocr_markdown = process_pdf_to_markdown(PDF_PATH)

OCR complete. Markdown output saved to output.md


### I recommend the o3-mini model because, in my testing, it gave better results while also being less compute-intensive. Adjust the prompt if you think a different one would give you better results.

In [2]:
def extract_glossary_csv_from_local_md() -> str:
    import os
    from openai import OpenAI

    # Read the OCR markdown from the local "output.md" file
    with open("output.md", "r", encoding="utf-8") as f:
        ocr_markdown = f.read()
    
    # Build the prompt for CSV extraction
    prompt = f"""Below is the OCR output from a PDF document containing glossary terms and definitions.
Your task is to extract all glossary terms and their exact corresponding definitions using only the text provided.
DO NOT generate or add any definitions from your own knowledge—use only the text from the OCR output word-for-word. 
Note: Some entries may be misaligned, split across lines, or arranged in columns. Reconstruct them accurately as they appear.
Return the result strictly in CSV format with two columns: "Term" and "Definition". 
Do not include any additional commentary.
OCR output:
<BEGIN_OCR>
{ocr_markdown}
<END_OCR>"""

    # Initialize the OpenAI client with the OPENAI_API_KEY defined in the first cell
    openai_api_key = OPENAI_API_KEY
    if not openai_api_key:
        raise ValueError("OPENAI_API_KEY is empty; set it in the first cell.")
    client = OpenAI(api_key=openai_api_key)
    
    # Call the openai model via the chat completions endpoint
    response = client.chat.completions.create(
        model="o3-mini-2025-01-31",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "text"},
        temperature=1,
        max_completion_tokens=20000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    
    # Access the CSV output using attribute access instead of subscripting
    csv_output = response.choices[0].message.content
    
    # Write the CSV output to "glossary.csv"
    with open("glossary.csv", "w", encoding="utf-8") as f:
        f.write(csv_output)
    
    print("Glossary extraction complete. CSV output saved to glossary.csv")
    return csv_output


In [3]:
# Let us create the required CSV file
extract_glossary_csv_from_local_md()    

Glossary extraction complete. CSV output saved to glossary.csv


'"Term","Definition"\n"Aliquoted","To divide (a whole) into equal parts."\n"Alkalinity","A measure of the acid-neutralizing capacity of water."\n"Anode","The electrode that oxidizes during an electrochemical process."\n"Backend","The treatment elements of the non-sewered sanitation system. The backend generates an output for safe reuse or disposal (e.g., outputs include liquid or solid fertilizers)."\n"Biosorption","The passive removal/binding of toxic substances from aqueous solution by biological material."\n"Blackwater","A mixture of water with feces and urine."\n"Calorific value","A standard that measures the total energy content produced in the form of heat when a substance is combusted completely with air or oxygen."\n"Cathode","The electrode where the reduction reaction occurs during an electrochemical process."\n"Chemical oxygen demand (COD)","A commonly used indirect measurement of the amount of organic matter in wastewater, COD is the amount of oxygen required to oxidize solu
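
Because the model returns the CSV as plain text, it may be worth checking that the output really parses as two columns before relying on `glossary.csv`. The sketch below is a minimal example using Python's standard `csv` module. `parse_glossary_csv` is a hypothetical helper, not part of the original notebook, and the sample rows are copied from the output above.

```python
import csv
import io


def parse_glossary_csv(csv_text: str) -> list[tuple[str, str]]:
    """Parse two-column glossary CSV text into (term, definition) pairs.

    Raises ValueError if the header or any row does not have the expected shape.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["Term", "Definition"]:
        raise ValueError(f"Unexpected header row: {header!r}")
    entries = []
    for row_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"Row {row_number} has {len(row)} columns, expected 2")
        entries.append((row[0], row[1]))
    return entries


sample = (
    '"Term","Definition"\n'
    '"Alkalinity","A measure of the acid-neutralizing capacity of water."\n'
    '"Blackwater","A mixture of water with feces and urine."\n'
)
entries = parse_glossary_csv(sample)
print(f"Parsed {len(entries)} glossary entries")
```

A `ValueError` here would suggest the model added commentary, wrapped the output in extra markup, or split a definition across columns, in which case adjusting the prompt and rerunning is probably the simplest fix.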

### Optional code: I used this to count tokens so I could check the input against the model's context length and estimate compute requirements.

In [4]:
import tiktoken

# Replace with your actual markdown file path
markdown_file = "output.md"

# Read the content of the markdown file
with open(markdown_file, "r", encoding="utf-8") as f:
    markdown_text = f.read()

# Choose an encoding; "cl100k_base" gives a reasonable estimate, but newer OpenAI models
# (including the o-series) use "o200k_base", so the exact count may differ slightly.
encoding = tiktoken.get_encoding("cl100k_base")

# Encode the text and count tokens
tokens = encoding.encode(markdown_text)
token_count = len(tokens)

print(f"Token count: {token_count}")


Token count: 3785
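
To turn the token count into a context-length check, the arithmetic can be sketched as follows. `context_headroom` is a hypothetical helper, and both the `context_window` value and the `instruction_overhead` allowance are assumptions made for illustration; look up the actual limit for your model. The other two numbers come from this notebook: the token count printed above and the `max_completion_tokens` value passed to the chat completion call.

```python
def context_headroom(prompt_tokens: int, max_completion_tokens: int, context_window: int) -> int:
    """Return how many tokens remain after reserving room for the prompt and the completion.

    A negative result means the request would not fit in the context window.
    """
    return context_window - prompt_tokens - max_completion_tokens


ocr_tokens = 3785              # token count printed above
instruction_overhead = 200     # assumed allowance for the instruction text around the OCR output
max_completion_tokens = 20000  # value passed to the chat completion call
context_window = 200_000       # assumed limit; check your model's documentation

headroom = context_headroom(
    prompt_tokens=ocr_tokens + instruction_overhead,
    max_completion_tokens=max_completion_tokens,
    context_window=context_window,
)
print(f"Remaining headroom: {headroom} tokens")
```

Under these assumed numbers the request fits comfortably. A much longer PDF could push the headroom negative, at which point the Markdown would need to be split into chunks.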
