<a href="https://colab.research.google.com/github/royam0820/LLM_OCR/blob/main/amr_2_mistral_ocr_pdfs_batch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook demonstrates how to use the **Mistral AI API** for performing OCR (Optical Character Recognition) on a batch of PDF documents.
The main steps include:
- Authenticating with the Mistral API
- Uploading PDF files from a local input folder
- Extracting text and images using Mistral's OCR model (`mistral-ocr-latest`)
- Saving the output as JSON and Markdown, with images exported as separate files

⚠️ **Note:** This notebook is designed for use in **Google Colab** and requires a Colab secret named `MISTRAL_API_KEY`, which is read through `google.colab.userdata`.

Let's get started!


# Setup

In [1]:
# Make sure you have the dependencies installed
!pip install -q mistralai

In [2]:
import os
from google.colab import userdata
from mistralai import Mistral

# Fetch the API key securely
api_key = userdata.get('MISTRAL_API_KEY')

# Set it in the environment without exposing it
if api_key:
    os.environ["MISTRAL_API_KEY"] = api_key
else:
    raise ValueError("MISTRAL_API_KEY not found. Make sure it is set in Colab.")

# Initialize Mistral client
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
print("Mistral API Key loaded successfully.")


Mistral API Key loaded successfully.
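
The setup cell depends on Colab's `userdata` secrets store. As a minimal sketch, assuming you also want to run the notebook outside Colab, the helper below tries `userdata` first and falls back to the process environment. `load_api_key` is a hypothetical name and is not part of the original notebook.

```python
import os


def load_api_key(name: str = "MISTRAL_API_KEY") -> str:
    """Return the API key from Colab secrets, else from the environment.

    Hypothetical helper, not part of the original notebook.
    """
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook
        key = None
    key = key or os.environ.get(name)
    if not key:
        raise ValueError(f"{name} not found in Colab secrets or the environment.")
    return key
```

With this in place, the client could be created as `Mistral(api_key=load_api_key())`.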


In [3]:
import json       # For reading/writing JSON data (e.g., saving OCR results)
import base64     # For encoding/decoding file contents as base64 (used in API requests)
import shutil     # For file operations like moving files and creating ZIP archives
from pathlib import Path  # For handling filesystem paths in an object-oriented way

# Mistral: main client for interacting with the Mistral API
# DocumentURLChunk: helper class for structuring document inputs by URL
from mistralai import Mistral, DocumentURLChunk

# OCRResponse: data model representing the structured response from an OCR request
from mistralai.models import OCRResponse

In [4]:
# Path configuration
INPUT_DIR = Path("pdfs_to_process")   # Folder where the user places the PDFs to be processed
DONE_DIR = Path("pdfs-done")            # Folder where processed PDFs will be moved
OUTPUT_ROOT_DIR = Path("ocr_output")    # Root folder for conversion results

# Ensure directories exist
INPUT_DIR.mkdir(exist_ok=True)
DONE_DIR.mkdir(exist_ok=True)
OUTPUT_ROOT_DIR.mkdir(exist_ok=True)

In [5]:
INPUT_DIR

PosixPath('pdfs_to_process')

# Functions for file processing

- `replace_images_in_markdown`: replaces image references in the markdown with inline base64 data URLs
- `get_combined_markdown`: combines the markdown content from all pages of an OCR response into a single markdown string


In [6]:
# Replace image references in the markdown with inline base64 data URLs.
# Returns the modified markdown string
# (i.e., ![img1](data:image/png;base64,...))

def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """
    Replace each image reference `![name](name)` in the markdown with an inline
    base64 data URL taken from `images_dict` (image name -> base64 string),
    so the markdown is self-contained and needs no external image files.
    """
    for img_name, base64_str in images_dict.items():
        markdown_str = markdown_str.replace(f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})")
    return markdown_str


# Combine the markdown content from all pages of an OCR response into a single markdown string.
def get_combined_markdown(ocr_response: OCRResponse) -> str:
    """
    Take the OCRResponse object returned by the Mistral API and return a single
    string with the combined markdown of all pages, with images embedded as base64.
    """
    markdowns: list[str] = []
    # Iterate through each page in the OCR response
    for page in ocr_response.pages:
        image_data = {}
        # Build a dictionary of image ID to base64 string for each image on the page
        for img in page.images:
            image_data[img.id] = img.image_base64
        # Replace image references with actual base64 strings and append to list
        markdowns.append(replace_images_in_markdown(page.markdown, image_data))
    # Join all page markdowns with spacing between them
    return "\n\n".join(markdowns)
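
To check these two helpers without spending an API call, here is a small sketch that uses stand-in objects. `SimpleNamespace` mimics only the attributes the functions read (`pages`, `markdown`, `images`, `id`, `image_base64`), and the image payload is a made-up placeholder. The helpers are repeated in compact form only so the block runs on its own.

```python
from types import SimpleNamespace


def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    # Same logic as the cell above: swap each image reference for its data URL
    for img_name, base64_str in images_dict.items():
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
        )
    return markdown_str


def get_combined_markdown(ocr_response) -> str:
    # Same logic as the cell above, minus the OCRResponse type hint
    pages = []
    for page in ocr_response.pages:
        image_data = {img.id: img.image_base64 for img in page.images}
        pages.append(replace_images_in_markdown(page.markdown, image_data))
    return "\n\n".join(pages)


# Stand-in for an OCR response: two pages, one image on the first page
fake_image = SimpleNamespace(id="img-0.jpeg", image_base64="data:image/jpeg;base64,AAAA")
fake_response = SimpleNamespace(
    pages=[
        SimpleNamespace(markdown="# Title\n\n![img-0.jpeg](img-0.jpeg)", images=[fake_image]),
        SimpleNamespace(markdown="Second page text.", images=[]),
    ]
)

combined = get_combined_markdown(fake_response)
print(combined)
```

The printed markdown should show the image reference on the first page rewritten to the data URL, followed by the second page after a blank line.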

In [7]:
# Process a single PDF using the Mistral OCR API

def process_pdf(pdf_path: Path):
    # Run OCR on one PDF and write its JSON, Markdown, and images under OUTPUT_ROOT_DIR.
    # - Be careful with the number of PDFs you process, as the Mistral API has usage limits
    #   and exceeding them can cause errors.

    # Extract the base name of the PDF (without extension) to use for folder naming
    pdf_base = pdf_path.stem
    print(f"Processing {pdf_path.name} ...")

    # Create an output directory specific to this PDF
    output_dir = OUTPUT_ROOT_DIR / pdf_base
    output_dir.mkdir(exist_ok=True)
    # Create a subdirectory to store extracted images from the OCR output
    images_dir = output_dir / "images"
    images_dir.mkdir(exist_ok=True)

    # Read the PDF file as binary
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()

    # Upload the PDF to Mistral's server for OCR processing
    uploaded_file = client.files.upload(
        file={
            "file_name": pdf_path.name,
            "content": pdf_bytes,
        },
        purpose="ocr"   # Declare the purpose to get a temporary OCR-usable file ID
    )

    # Get a signed URL to access the uploaded file (expiry is given in hours, so 1 hour here)
    signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)

    # Send the document for OCR processing using the signed URL
    ocr_response = client.ocr.process(
        document=DocumentURLChunk(document_url=signed_url.url),   # Mistral expects a document URL
        model="mistral-ocr-latest",                               # Use the latest available OCR model
        include_image_base64=True                                 # Include images as base64 in the result
    )

    # Save OCR result as JSON
    # This can serve as a backup or for future analysis,
    # (in case something fails it could be reused, but it is not used in the rest of the code)
    ocr_json_path = output_dir / "ocr_response.json"
    with open(ocr_json_path, "w", encoding="utf-8") as json_file:
        json.dump(ocr_response.model_dump(), json_file, indent=4, ensure_ascii=False)
    print(f"OCR response saved in {ocr_json_path}")

    # Prepare Markdown output with image links suitable for Obsidian or other apps
    # - Saves each extracted image to the images subdirectory
    # - Replaces image references with wikilinks (![[name]]) to the saved files

    global_counter = 1            # Used to create unique image filenames
    updated_markdown_pages = []   # Stores the cleaned markdown content from each page

    for page in ocr_response.pages:
        updated_markdown = page.markdown    # Start with original markdown

        for image_obj in page.images:
            # Extract and decode base64 image
            base64_str = image_obj.image_base64
            if base64_str.startswith("data:"):
                # Strip off data URL prefix if present
                base64_str = base64_str.split(",", 1)[1]
            image_bytes = base64.b64decode(base64_str)

            # Determine file extension (fallback to .png if none found)
            ext = Path(image_obj.id).suffix if Path(image_obj.id).suffix else ".png"
            new_image_name = f"{pdf_base}_img_{global_counter}{ext}"
            global_counter += 1

            # Save decoded image to images folder
            image_output_path = images_dir / new_image_name
            with open(image_output_path, "wb") as f:
                f.write(image_bytes)

            # Update markdown to use Obsidian-style wikilinks for local image references - ![[image.png]]
            updated_markdown = updated_markdown.replace(
                f"![{image_obj.id}]({image_obj.id})",
                f"![[{new_image_name}]]"
            )
        # Add the updated markdown for this page to the list
        updated_markdown_pages.append(updated_markdown)

    # Combine all page markdowns and save to a single .md file
    final_markdown = "\n\n".join(updated_markdown_pages)
    output_markdown_path = output_dir / "output.md"
    with open(output_markdown_path, "w", encoding="utf-8") as md_file:
        md_file.write(final_markdown)
    print(f"Markdown generated in {output_markdown_path}")
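
The comments in `process_pdf` note that the saved `ocr_response.json` could be reused if something fails. As a sketch of that idea, the hypothetical helper below re-extracts the images from a saved response without another API call, using the same data-URL stripping and `.png` fallback as the function above. It assumes the JSON layout produced by `model_dump()`, that is, a `pages` list whose entries hold an `images` list with `id` and `image_base64` keys.

```python
import base64
import json
from pathlib import Path


def export_images_from_json(json_path: Path, images_dir: Path) -> list[Path]:
    """Decode every image stored in a saved OCR response and write it to disk.

    Hypothetical helper; assumes the JSON layout written by process_pdf():
    {"pages": [{"images": [{"id": ..., "image_base64": ...}, ...]}, ...]}
    """
    data = json.loads(json_path.read_text(encoding="utf-8"))
    images_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page in data.get("pages", []):
        for image in page.get("images", []):
            payload = image.get("image_base64")
            if not payload:
                continue  # Image was stored without base64 data
            if payload.startswith("data:"):
                # Drop the "data:image/...;base64," prefix, keep the payload
                payload = payload.split(",", 1)[1]
            # Fall back to .png when the image id carries no extension
            ext = Path(image["id"]).suffix or ".png"
            out_path = images_dir / f"{Path(image['id']).stem}{ext}"
            out_path.write_bytes(base64.b64decode(payload))
            written.append(out_path)
    return written
```

A call such as `export_images_from_json(Path("ocr_output/<pdf name>/ocr_response.json"), Path("recovered_images"))` would then rebuild the image files; the destination folder name is only an example.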

# Running the file processing

In [8]:
# ⚠️ Important: Mistral API has usage limits (e.g., requests per minute or daily caps),
#    so processing too many PDFs at once may lead to errors or rate limiting.

# The PDF documents to be processed are in this directory: /content/pdfs_to_process

pdf_files = list(INPUT_DIR.glob("*.pdf"))   # Find all PDF files in the input directory
if not pdf_files:
    print("No PDFs to process.")            # The loop below is then skipped

# Iterate through each PDF and process it one by one
for pdf_file in pdf_files:
    try:
        process_pdf(pdf_file)     # Run OCR and markdown/image export on the PDF
        # Move the processed PDF to the DONE_DIR to avoid reprocessing
        shutil.move(str(pdf_file), DONE_DIR / pdf_file.name)
        print(f"{pdf_file.name} moved to {DONE_DIR}")
    except Exception as e:
        print(f"Error processing {pdf_file.name}: {e}")


Processing 1804.07821v1.pdf ...
OCR response saved in ocr_output/1804.07821v1/ocr_response.json
Markdown generated in ocr_output/1804.07821v1/output.md
1804.07821v1.pdf moved to pdfs-done
Processing 2402.03300v3.pdf ...
OCR response saved in ocr_output/2402.03300v3/ocr_response.json
Markdown generated in ocr_output/2402.03300v3/output.md
2402.03300v3.pdf moved to pdfs-done
Processing 2101.03961v3.pdf ...
OCR response saved in ocr_output/2101.03961v3/ocr_response.json
Markdown generated in ocr_output/2101.03961v3/output.md
2101.03961v3.pdf moved to pdfs-done
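
Since the loop above can hit rate limits on larger batches, one option is to retry a failed call after a pause. The sketch below is a generic exponential-backoff wrapper, not something the Mistral client provides. `call_with_retries`, `attempts`, and `base_delay` are hypothetical names, and the values are illustrative, not limits documented by Mistral.

```python
import time


def call_with_retries(func, *args, attempts: int = 3, base_delay: float = 2.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff when it raises.

    Hypothetical helper: `attempts` and `base_delay` are illustrative values,
    not limits documented by Mistral.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == attempts:
                raise  # Out of retries: let the caller's except block report it
            delay = base_delay * 2 ** (attempt - 1)  # 2s, 4s, 8s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

In the loop, `process_pdf(pdf_file)` could become `call_with_retries(process_pdf, pdf_file)`. This version retries on every exception and re-uploads the PDF on each attempt; a stricter variant would retry only on rate-limit errors.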


# Backing up the OCR output to Google Drive

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
import shutil
import os
from google.colab import files

# Define the folder to zip
folder_to_zip = '/content/ocr_output'

# Define the archive base name (make_archive appends ".zip" itself)
archive_base = 'ocr_output'

# Create ocr_output.zip in the current working directory;
# make_archive returns the path of the archive it created
zip_path = shutil.make_archive(os.path.join(os.getcwd(), archive_base), 'zip', folder_to_zip)

# Download the zip file using that path
files.download(zip_path)
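
The cell above downloads the archive through the browser; the Drive mounted earlier is not written to. As a sketch, assuming you also want a copy of the archive in Drive, the hypothetical helper below zips a folder straight into a destination directory. The Drive path in the commented call is an assumption about where you want the backup to live.

```python
import shutil
from pathlib import Path


def backup_folder(folder: Path, dest_dir: Path, archive_name: str = "ocr_output") -> Path:
    """Zip `folder` and place the archive in `dest_dir`; return the archive path.

    Hypothetical helper, not part of the original notebook.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends ".zip" itself and returns the final path as a string
    archive = shutil.make_archive(str(dest_dir / archive_name), "zip", root_dir=str(folder))
    return Path(archive)


# In Colab, with Drive mounted (the destination folder is an assumption):
# backup_folder(Path("/content/ocr_output"), Path("/content/drive/MyDrive/ocr_backups"))
```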


## Maintenance Scripts - Files

In [11]:
# maintenance script to delete the subfolders and files in the directory /content/ocr_output

# Define the directory to clean
output_root = Path("ocr_output")

# Safety check: make sure the directory exists
if output_root.exists() and output_root.is_dir():
    # Loop through all subfolders and files
    for item in output_root.iterdir():
        try:
            if item.is_dir():
                shutil.rmtree(item)  # Delete entire subdirectory
                print(f"Deleted folder: {item}")
            else:
                item.unlink()  # Delete file
                print(f"Deleted file: {item}")
        except Exception as e:
            print(f"Failed to delete {item}: {e}")
else:
    print(f"Directory {output_root} does not exist.")

Deleted folder: ocr_output/2101.03961v3
Deleted folder: ocr_output/2402.03300v3
Deleted folder: ocr_output/1804.07821v1


In [12]:
# maintenance script to delete the subfolders and files in the directory /content/pdfs-done

# Define the directory to clean
output_root = Path("pdfs-done")

# Safety check: make sure the directory exists
if output_root.exists() and output_root.is_dir():
    # Loop through all subfolders and files
    for item in output_root.iterdir():
        try:
            if item.is_dir():
                shutil.rmtree(item)  # Delete entire subdirectory
                print(f"Deleted folder: {item}")
            else:
                item.unlink()  # Delete file
                print(f"Deleted file: {item}")
        except Exception as e:
            print(f"Failed to delete {item}: {e}")
else:
    print(f"Directory {output_root} does not exist.")

Deleted file: pdfs-done/1804.07821v1.pdf
Deleted file: pdfs-done/2402.03300v3.pdf
Deleted file: pdfs-done/2101.03961v3.pdf
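
The two maintenance cells differ only in the directory they clean. A minimal sketch of folding them into one function, under the assumption that the same behavior is wanted for both folders (`clear_directory` is a hypothetical name):

```python
import shutil
from pathlib import Path


def clear_directory(root: Path) -> int:
    """Delete everything inside `root` but keep `root` itself; return the count removed.

    Hypothetical helper that folds the two maintenance cells into one function.
    """
    if not root.is_dir():
        print(f"Directory {root} does not exist.")
        return 0
    removed = 0
    for item in root.iterdir():
        try:
            if item.is_dir() and not item.is_symlink():
                shutil.rmtree(item)  # Delete entire subdirectory
            else:
                item.unlink()  # Delete file (or symlink)
            removed += 1
        except OSError as e:
            print(f"Failed to delete {item}: {e}")
    return removed


# Possible usage for the two folders cleaned above:
# for name in ("ocr_output", "pdfs-done"):
#     clear_directory(Path(name))
```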


# ✅ Conclusion

This notebook walked through a complete pipeline for batch OCR processing of PDF files using the **Mistral AI API** in Google Colab.  
You learned how to:
- Authenticate with the API securely
- Process multiple PDFs from a specified folder
- Save and format the OCR output in both JSON and Markdown
- Export the results in a zipped archive

This setup is a starting point for automating batch document-processing tasks, within the API's usage limits.
You can now customize the workflow or extend it with translation or other post-processing as needed.

