# DeepSeek-OCR Testing Notebook (with PDF Support)

**Setup Guide**: This notebook tests DeepSeek-OCR on Google Colab Pro.

## Requirements
- Google Colab Pro (for guaranteed GPU access)
- ~15-20 minutes first run (model download)
- ~2 minutes subsequent runs (cached)

## What This Does
1. ✅ Checks GPU availability
2. ✅ Installs dependencies (CUDA-enabled PyTorch, flash-attention)
3. ✅ Downloads DeepSeek-OCR model (~8GB)
4. ✅ Tests with sample images
5. ✅ Processes your own images
6. ✅ **NEW: Processes PDF documents** 📄

## Instructions
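1. **Runtime → Change runtime type → L4 or A100 GPU** (FlashAttention 2, which Step 4 enables, requires an Ampere-or-newer GPU, so the T4 will not work with these settings)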
2. **Runtime → Run all**
3. Wait for completion (progress bars will show)
4. Upload your images/PDFs in the processing cells

---
## Step 1: Check GPU

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\n🔥 PyTorch version: {torch.__version__}")
print(f"🎮 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🚀 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("⚠️  No GPU found! Go to Runtime → Change runtime type → Select GPU")
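
Step 3 builds FlashAttention 2, which targets Ampere, Ada, and Hopper GPUs (CUDA compute capability 8.0 or higher); the T4 is a Turing card at 7.5. A check along these lines can flag an unsupported GPU before the long build. The helper name `supports_flash_attention_2` is hypothetical, and the 8.0 threshold comes from the flash-attn project's stated requirements, not from this notebook.

```python
def supports_flash_attention_2(capability):
    """Return True if a (major, minor) CUDA compute capability is Ampere or newer."""
    major, _minor = capability
    return major >= 8

# Reference values: T4 = (7, 5), A100 = (8, 0), L4 = (8, 9)
try:
    import torch
except ImportError:
    torch = None  # the helper itself does not need PyTorch

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {capability[0]}.{capability[1]}")
    print(f"FlashAttention 2 supported: {supports_flash_attention_2(capability)}")
    print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
```

If the check reports an unsupported GPU, switching the runtime type before Step 3 avoids a wasted build.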

---
## Step 2: Clone DeepSeek-OCR Repository

In [None]:
# Clone repository (includes sample images)
!git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
%cd DeepSeek-OCR
!ls -la assets/  # Show sample images

---
## Step 3: Install Dependencies

This will take ~5-10 minutes. Installing:
- PyTorch 2.6.0 with CUDA 11.8
- Flash Attention 2.7.3
- Transformers, PyMuPDF, etc.

In [None]:
# Install PyTorch with CUDA 11.8
!pip install -q torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118

In [None]:
# Install other requirements
!pip install -q transformers==4.46.3 tokenizers==0.20.3 PyMuPDF img2pdf einops easydict addict Pillow numpy

In [None]:
# Install Flash Attention (takes ~3-5 minutes to build)
print("⚙️  Building Flash Attention... This takes 3-5 minutes")
!pip install -q flash-attn==2.7.3 --no-build-isolation
print("✅ Flash Attention installed!")
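
Because `pip install -q` hides most output, it can help to confirm that the pinned versions actually landed. The sketch below uses only the standard library; `EXPECTED` and `find_mismatches` are illustrative names, and the pins are copied from the install cells above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cells above
EXPECTED = {
    "torch": "2.6.0",
    "transformers": "4.46.3",
    "tokenizers": "0.20.3",
    "flash-attn": "2.7.3",
}

def find_mismatches(expected):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            mismatches[name] = None
            continue
        # Compare only the public part so a local tag such as "+cu118" is ignored
        if installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches

for name, installed in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {EXPECTED[name]}, found {installed or 'not installed'}")
```

No output means every pin matched.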

---
## Step 4: Load DeepSeek-OCR Model

**First run**: Downloads ~8GB model (5-10 minutes)

**Subsequent runs**: Uses cached model (30 seconds)

In [None]:
from transformers import AutoModel, AutoTokenizer
import torch
import os

print("📥 Loading DeepSeek-OCR model...")
print("⏳ First run: ~5-10 min download | Cached runs: ~30 sec")

model_name = 'deepseek-ai/DeepSeek-OCR'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True
)

# Load model
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)

# Move to GPU and set to eval mode
model = model.eval().cuda().to(torch.bfloat16)

print("✅ Model loaded successfully!")
print(f"🎯 Model is on: {next(model.parameters()).device}")
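
To tell whether the next run will hit the cache or trigger a fresh download, one option is to look for the model folder in the Hugging Face hub cache. This is a rough sketch: `hub_cache_dir` and `is_model_cached` are hypothetical helpers, they assume the default cache layout (`models--<org>--<name>` under `$HF_HOME/hub`), they ignore `XDG_CACHE_HOME`, and a folder that exists may still hold an incomplete download.

```python
import os

def hub_cache_dir():
    """Resolve the Hugging Face hub cache directory from the environment."""
    if os.environ.get("HF_HUB_CACHE"):
        return os.environ["HF_HUB_CACHE"]
    default_home = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    return os.path.join(os.environ.get("HF_HOME", default_home), "hub")

def is_model_cached(repo_id, cache_dir=None):
    """Return True if a cache folder for repo_id exists (it may still be incomplete)."""
    folder = "models--" + repo_id.replace("/", "--")
    return os.path.isdir(os.path.join(cache_dir or hub_cache_dir(), folder))

print("Cached:", is_model_cached("deepseek-ai/DeepSeek-OCR"))
```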

---
## Step 5: Test with Sample Images

Run OCR on the included sample images to verify everything works.

In [None]:
# Test with sample image
from PIL import Image
import matplotlib.pyplot as plt

# Display sample image
image_path = 'assets/show1.jpg'
img = Image.open(image_path)
plt.figure(figsize=(10, 8))
plt.imshow(img)
plt.axis('off')
plt.title('Sample Image')
plt.show()

print(f"📐 Image size: {img.size}")

### Test 1: Convert Document to Markdown

In [None]:
print("🔍 Running OCR with Markdown conversion...\n")

prompt = "<image>\n<|grounding|>Convert the document to markdown."
output_path = './output'
os.makedirs(output_path, exist_ok=True)

result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_path,
    output_path=output_path,
    base_size=1024,      # Base resolution
    image_size=640,      # Crop size
    crop_mode=True,      # Dynamic resolution (Gundam mode)
    save_results=True,
    test_compress=True
)

print("\n" + "="*50)
print("📄 RESULT:")
print("="*50)
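# infer() may return None in some model revisions (reportedly unless eval_mode=True; verify against the model card), so guard before slicing
print((result or "")[:500])  # Show first 500 chars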
print("="*50)

---
## Step 6: Process Your Own Images 🖼️

In [None]:
# Upload images
from google.colab import files
import io

print("📤 Click 'Choose Files' to upload your images...")
uploaded = files.upload()

print(f"\n✅ Uploaded {len(uploaded)} file(s)")
for filename in uploaded.keys():
    print(f"  - {filename}")

In [None]:
# Process uploaded images
from PIL import Image

# Configuration
PROMPT = "<image>\n<|grounding|>Convert the document to markdown."  # Change this!
OUTPUT_DIR = './user_outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)

for filename in uploaded.keys():
    print(f"\n{'='*60}")
    print(f"📄 Processing: {filename}")
    print(f"{'='*60}\n")
    
    # Display image
    img = Image.open(io.BytesIO(uploaded[filename]))
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title(filename)
    plt.show()
    
    # Save temporarily
    temp_path = f'/tmp/{filename}'
    with open(temp_path, 'wb') as f:
        f.write(uploaded[filename])
    
    # Run inference
    result = model.infer(
        tokenizer,
        prompt=PROMPT,
        image_file=temp_path,
        output_path=OUTPUT_DIR,
        base_size=1024,
        image_size=640,
        crop_mode=True,
        save_results=True,
        test_compress=True
    )
    
    # Save result (infer() may return None in some model revisions; fall back to an empty string)
    result_text = result or ""
    result_file = os.path.join(OUTPUT_DIR, f"{filename}_result.md")
    with open(result_file, 'w', encoding='utf-8') as f:
        f.write(result_text)
    
    print(f"\n✅ Saved to: {result_file}")
    print(f"\n📄 Result preview:\n{'-'*60}")
    print(result_text[:1000])  # Show first 1000 chars

print(f"\n\n🎉 All images processed!")
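
The loop above hands every uploaded file to `Image.open`, so a PDF or text file uploaded by mistake would raise an error. One way to guard against that is to sort the filenames by extension first. This is a sketch only: `split_uploads` and the `IMAGE_EXTENSIONS` set are illustrative choices, not part of DeepSeek-OCR.

```python
import os

# Extensions treated as images here; adjust to taste
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}

def split_uploads(filenames):
    """Split filenames into (images, pdfs, other) by extension, case-insensitively."""
    images, pdfs, other = [], [], []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext in IMAGE_EXTENSIONS:
            images.append(name)
        elif ext == ".pdf":
            pdfs.append(name)
        else:
            other.append(name)
    return images, pdfs, other

images, pdfs, other = split_uploads(["scan.PNG", "report.pdf", "notes.txt"])
print(images, pdfs, other)
```

In the notebook, calling `split_uploads(uploaded.keys())` and looping over `images` only would leave PDFs for Step 7.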

---
## Step 7: Process PDF Documents 📄

**NEW!** Upload PDFs and process all pages automatically.

This will:
1. Convert each PDF page to an image
2. Process all pages with OCR
3. Compile results into a single markdown file
4. Add page separators
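
Steps 3 and 4 can be sketched as a small pure function, independent of the model. `compile_pages` is a hypothetical helper; it marks a failed page with `None` rather than searching the text for the word "Error", so a page whose OCR output legitimately contains that word is still counted as processed.

```python
def compile_pages(title, page_texts):
    """Join per-page OCR text into one markdown string.

    page_texts holds one entry per page; None marks a page that failed.
    """
    succeeded = sum(text is not None for text in page_texts)
    parts = [
        f"# {title}\n\n",
        f"**Total Pages**: {len(page_texts)}\n\n",
        f"**Successfully Processed**: {succeeded} pages\n\n",
        "---\n\n",
    ]
    for number, text in enumerate(page_texts, start=1):
        body = text if text is not None else "*Error processing this page*"
        parts.append(f"\n## Page {number}\n\n{body}\n\n---\n")
    return "".join(parts)

print(compile_pages("demo.pdf", ["First page", None, "Error codes table"]))
```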

In [None]:
# Upload PDFs
import fitz  # PyMuPDF

print("📤 Click 'Choose Files' to upload your PDF(s)...")
uploaded_pdfs = files.upload()

print(f"\n✅ Uploaded {len(uploaded_pdfs)} PDF(s)")
for filename in uploaded_pdfs.keys():
    print(f"  - {filename}")

In [None]:
# Helper function to convert PDF to images
def pdf_to_images(pdf_bytes, dpi=144):
    """
    Convert PDF to list of PIL Images
    
    Args:
        pdf_bytes: PDF content as bytes or a file-like object such as io.BytesIO
        dpi: Resolution (144 is good balance, 72=fast/low quality, 300=slow/high quality)
    
    Returns:
        List of PIL Images
    """
    images = []
    pdf_document = fitz.open(stream=pdf_bytes, filetype="pdf")
    
    zoom = dpi / 72.0
    matrix = fitz.Matrix(zoom, zoom)
    
    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]
        pixmap = page.get_pixmap(matrix=matrix, alpha=False)
        
        # Convert to PIL Image
        img_data = pixmap.tobytes("png")
        img = Image.open(io.BytesIO(img_data))
        images.append(img)
    
    pdf_document.close()
    return images

print("✅ PDF conversion function ready!")

In [None]:
# Process PDFs
from tqdm.notebook import tqdm

# Configuration
PDF_PROMPT = "<image>\n<|grounding|>Convert the document to markdown."
PDF_OUTPUT_DIR = './pdf_outputs'
os.makedirs(PDF_OUTPUT_DIR, exist_ok=True)

# Process each PDF
for pdf_filename in uploaded_pdfs.keys():
    print(f"\n{'='*70}")
    print(f"📄 Processing PDF: {pdf_filename}")
    print(f"{'='*70}\n")
    
    # Convert PDF to images
    print("🔄 Converting PDF to images...")
    pdf_bytes = io.BytesIO(uploaded_pdfs[pdf_filename])
    page_images = pdf_to_images(pdf_bytes, dpi=144)  # Lower DPI=faster, higher=better quality
    print(f"✅ Converted {len(page_images)} pages\n")
    
    # Process each page
    all_results = []
    
    for page_num, page_img in enumerate(tqdm(page_images, desc="Processing pages")):
        # Show preview of first page and every 5th page
        if page_num == 0 or (page_num + 1) % 5 == 0:
            plt.figure(figsize=(8, 10))
            plt.imshow(page_img)
            plt.axis('off')
            plt.title(f'Page {page_num + 1}')
            plt.show()
        
        # Save page temporarily
        temp_page_path = f'/tmp/page_{page_num}.png'
        page_img.save(temp_page_path)
        
        # Run OCR
        try:
            result = model.infer(
                tokenizer,
                prompt=PDF_PROMPT,
                image_file=temp_page_path,
                output_path=PDF_OUTPUT_DIR,
                base_size=1024,
                image_size=640,
                crop_mode=True,
                save_results=False,
                test_compress=True
            )
            
            # Add page header and separator (an empty body if infer() returned None)
            page_result = f"\n## Page {page_num + 1}\n\n{result or ''}\n\n---\n"
            all_results.append(page_result)
            
        except Exception as e:
            print(f"⚠️  Error on page {page_num + 1}: {e}")
            all_results.append(f"\n## Page {page_num + 1}\n\n*Error processing this page*\n\n---\n")
    
    # Compile all pages into single markdown file
    pdf_base_name = os.path.splitext(pdf_filename)[0]  # Strips the extension only, regardless of case
    output_file = os.path.join(PDF_OUTPUT_DIR, f"{pdf_base_name}_complete.md")
    
    with open(output_file, 'w', encoding='utf-8') as f:
        # Write header
        f.write(f"# {pdf_filename}\n\n")
        f.write(f"**Total Pages**: {len(page_images)}\n\n")
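        # Count pages by the exact failure marker so OCR text containing "Error" is not miscounted
        f.write(f"**Successfully Processed**: {sum('*Error processing this page*' not in r for r in all_results)} pages\n\n")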
        f.write(f"**Generated**: {os.popen('date').read().strip()}\n\n")
        f.write("---\n\n")
        # Write all pages
        f.writelines(all_results)
    
    print("\n✅ PDF processed!")
    print(f"📁 Saved to: {output_file}")
    print(f"📊 Total pages: {len(page_images)}")
    print(f"📝 File size: {os.path.getsize(output_file) / 1024:.1f} KB")

print(f"\n\n🎉 All PDFs processed! Results in: {PDF_OUTPUT_DIR}")

### Preview PDF Results

In [None]:
# Show preview of each PDF result
import glob

pdf_results = glob.glob(os.path.join(PDF_OUTPUT_DIR, '*_complete.md'))

for result_file in pdf_results:
    print(f"\n{'='*70}")
    print(f"📄 {os.path.basename(result_file)}")
    print(f"{'='*70}\n")
    
    with open(result_file, 'r', encoding='utf-8') as f:
        content = f.read()
        # Show first 2000 characters
        print(content[:2000])
        if len(content) > 2000:
            print(f"\n... (showing first 2000 chars out of {len(content)} total)")
            print("\n💡 Download the full file to see all content!")

---
## Step 8: Download All Results 📦

In [None]:
# Zip and download all results (images + PDFs)
import shutil

# Collect all output directories
output_dirs = []
if os.path.exists(OUTPUT_DIR):
    output_dirs.append(OUTPUT_DIR)
if os.path.exists(PDF_OUTPUT_DIR):
    output_dirs.append(PDF_OUTPUT_DIR)

if output_dirs:
    # Create combined directory
    combined_dir = './all_results'
    os.makedirs(combined_dir, exist_ok=True)
    
    # Copy all results
    for output_dir in output_dirs:
        dir_name = os.path.basename(output_dir)
        target_dir = os.path.join(combined_dir, dir_name)
        if os.path.exists(target_dir):
            shutil.rmtree(target_dir)
        shutil.copytree(output_dir, target_dir)
    
    # Create zip file
    print("📦 Creating zip file...")
    shutil.make_archive('deepseek_ocr_results', 'zip', combined_dir)
    
    # Download
    print("⬇️  Downloading results...")
    files.download('deepseek_ocr_results.zip')
    print("✅ Downloaded!")
    
    # Show summary
    print("\n📊 Summary:")
    if os.path.exists(OUTPUT_DIR):
        img_files = len([f for f in os.listdir(OUTPUT_DIR) if f.endswith('.md')])
        print(f"  📸 Image results: {img_files}")
    if os.path.exists(PDF_OUTPUT_DIR):
        pdf_files = len([f for f in os.listdir(PDF_OUTPUT_DIR) if f.endswith('.md')])
        print(f"  📄 PDF results: {pdf_files}")
else:
    print("⚠️  No results to download yet. Process some images or PDFs first!")

---
## 🎯 Tips & Configuration

### Resolution Modes

Adjust quality vs speed by changing these parameters:

```python
# Tiny: Fastest, lowest quality (64 vision tokens)
base_size=512, image_size=512, crop_mode=False

# Small: Fast, good quality (100 vision tokens)
base_size=640, image_size=640, crop_mode=False

# Base: Balanced (256 vision tokens)
base_size=1024, image_size=1024, crop_mode=False

# Large: Slow, best quality (400 vision tokens)
base_size=1280, image_size=1280, crop_mode=False

# Gundam: Dynamic, best for documents (256 + n×100 tokens)
base_size=1024, image_size=640, crop_mode=True  # ← DEFAULT
```
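
To switch modes without retyping three parameters, the presets above can be kept in a lookup table. The sketch below copies its values from the list above; `MODES`, `infer_kwargs`, and `vision_tokens` are illustrative names, and `n_crops` stands for the `n` in the Gundam formula.

```python
# Preset values copied from the list above
MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False, "tokens": 64},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False, "tokens": 100},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False, "tokens": 256},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False, "tokens": 400},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True,  "tokens": 256},
}
TOKENS_PER_CROP = 100  # each extra crop in Gundam mode adds 100 vision tokens

def infer_kwargs(mode):
    """Return the three resolution keyword arguments for a named mode."""
    preset = MODES[mode]
    return {key: preset[key] for key in ("base_size", "image_size", "crop_mode")}

def vision_tokens(mode, n_crops=0):
    """Vision-token count for a mode; n_crops only matters when crop_mode is on."""
    preset = MODES[mode]
    if preset["crop_mode"]:
        return preset["tokens"] + n_crops * TOKENS_PER_CROP
    return preset["tokens"]

print(infer_kwargs("small"))
print(vision_tokens("gundam", n_crops=4))
```

The result of `infer_kwargs("small")` can be unpacked into the existing call, for example `model.infer(tokenizer, prompt=PROMPT, image_file=temp_path, output_path=OUTPUT_DIR, **infer_kwargs("small"))`.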

### PDF Quality Settings

In the `pdf_to_images()` function, adjust DPI:

```python
page_images = pdf_to_images(pdf_bytes, dpi=144)  # ← Change this

# dpi=72:  Fast, lower quality
# dpi=144: Balanced (default)
# dpi=300: Slow, best quality
```
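
PDF coordinates are measured in points (1/72 inch), which is why `pdf_to_images()` computes its zoom factor as `dpi / 72`. The arithmetic below shows roughly how large a rendered page becomes; `rendered_size` is a hypothetical helper, and the actual pixmap may differ by a pixel because of rounding inside PyMuPDF.

```python
def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page rendered at the given DPI."""
    zoom = dpi / 72.0  # same zoom factor as pdf_to_images()
    return round(width_pt * zoom), round(height_pt * zoom)

# US Letter is 612 x 792 points
for dpi in (72, 144, 300):
    print(dpi, rendered_size(612, 792, dpi))
```

Pixel count grows with the square of the DPI, so a 300 DPI page holds a little over four times as many pixels as the same page at 144 DPI.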

### Common Prompts

```python
# Documents to markdown (best for PDFs)
"<image>\n<|grounding|>Convert the document to markdown."

# General OCR
"<image>\n<|grounding|>OCR this image."

# Text only (no layout)
"<image>\nFree OCR."

# Parse figures/charts
"<image>\nParse the figure."

# General description
"<image>\nDescribe this image in detail."
```
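
If the prompt changes often, a small lookup keeps the exact strings in one place and catches typos early. The task keys and the `get_prompt` helper below are made up for illustration; the prompt strings themselves are the ones listed above.

```python
# Prompt strings copied from the list above; the keys are arbitrary labels
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr": "<image>\n<|grounding|>OCR this image.",
    "free_ocr": "<image>\nFree OCR.",
    "figure": "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
}

def get_prompt(task):
    """Return the prompt for a task label, or raise ValueError for an unknown label."""
    try:
        return PROMPTS[task]
    except KeyError:
        raise ValueError(f"Unknown task {task!r}; choose from {sorted(PROMPTS)}") from None

print(repr(get_prompt("free_ocr")))
```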

### Performance

| Task | Approx. Time (per image or whole PDF) | GPU Memory |
|------|---------------|------------|
| Image (Tiny) | ~5 sec | ~8 GB |
| Image (Gundam) | ~10 sec | ~12 GB |
| Image (Large) | ~15 sec | ~14 GB |
| PDF (10 pages) | ~2 min | ~12 GB |
| PDF (50 pages) | ~10 min | ~12 GB |
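
Both PDF rows in the table work out to about 12 seconds per page (2 min / 10 pages and 10 min / 50 pages), which gives a rough way to budget a longer job. The helper below is a back-of-the-envelope sketch; `estimate_pdf_minutes` is a hypothetical name, and real throughput depends on the GPU, the DPI, and page complexity.

```python
SECONDS_PER_PAGE = 12  # derived from the two PDF rows in the table above

def estimate_pdf_minutes(n_pages, seconds_per_page=SECONDS_PER_PAGE):
    """Rough processing time in minutes for a PDF with n_pages pages."""
    return n_pages * seconds_per_page / 60

for pages in (10, 25, 50):
    print(f"{pages} pages: ~{estimate_pdf_minutes(pages):.0f} min")
```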

---

## 📚 Resources

- [DeepSeek-OCR Paper](https://arxiv.org/abs/2510.18234)
- [Model on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- [GitHub Repository](https://github.com/deepseek-ai/DeepSeek-OCR)

---

**Created by**: Carlos Lorenzo Santos

**Date**: October 2025

**Model**: DeepSeek-OCR