# DeepSeek-OCR Testing Notebook

**Setup Guide**: This notebook tests DeepSeek-OCR on Google Colab Pro.

## Requirements
- Google Colab Pro (for guaranteed GPU access)
- ~15-20 minutes for the first run (dependency build and model download)
- ~2-3 minutes for subsequent runs while the same runtime is still alive (cached)

## What This Does
1. ✅ Checks GPU availability
2. ✅ Installs dependencies (CUDA-enabled PyTorch, flash-attention)
3. ✅ Downloads DeepSeek-OCR model (~8GB)
4. ✅ Tests with sample images
5. ✅ Processes your own images

## Instructions
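1. **Runtime → Change runtime type → GPU**: pick an Ampere-or-newer GPU such as L4 or A100. FlashAttention-2 (installed in Step 3) does not support the T4's Turing architecture.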
2. **Runtime → Run all**
3. Wait for completion (progress bars will show)
4. Upload your images when the upload cell in Step 6 prompts you

---
## Step 1: Check GPU

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\n🔥 PyTorch version: {torch.__version__}")
print(f"🎮 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🚀 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("⚠️  No GPU found! Go to Runtime → Change runtime type → Select GPU")
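
Later steps rely on FlashAttention-2, which needs a recent GPU architecture. The sketch below checks the CUDA compute capability as well. The 8.0 (Ampere) threshold reflects FlashAttention-2's documented hardware support, and `supports_flash_attn2` is a helper name made up for this notebook.

```python
def supports_flash_attn2(capability):
    """Return True if a (major, minor) compute capability is Ampere (8.0) or newer."""
    major, _minor = capability
    return major >= 8

try:
    import torch
except ImportError:
    torch = None  # Lets the cell run even where PyTorch is missing

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {cap[0]}.{cap[1]}")
    if not supports_flash_attn2(cap):
        print("This GPU predates Ampere; flash-attn 2.x will likely not run on it.")
else:
    print("No CUDA device visible; skipping capability check.")
```

A T4 reports 7.5 and an A100 reports 8.0, so only the latter passes this check.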

---
## Step 2: Clone DeepSeek-OCR Repository

In [None]:
# Clone repository (includes sample images)
!git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
%cd DeepSeek-OCR
!ls -la assets/  # Show sample images

---
## Step 3: Install Dependencies

This will take ~5-10 minutes. Installing:
- PyTorch 2.6.0 with CUDA 11.8
- Flash Attention 2.7.3
- Transformers, PyMuPDF, etc.

In [None]:
# Install PyTorch with CUDA 11.8
!pip install -q torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118

In [None]:
# Install other requirements
!pip install -q transformers==4.46.3 tokenizers==0.20.3 PyMuPDF img2pdf einops easydict addict Pillow numpy

In [None]:
# Install Flash Attention (takes ~3-5 minutes to build)
print("⚙️  Building Flash Attention... This takes 3-5 minutes")
!pip install -q flash-attn==2.7.3 --no-build-isolation
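import importlib.util  # Confirm the build produced an importable package before reporting success
if importlib.util.find_spec("flash_attn"):
    print("✅ Flash Attention installed!")
else:
    print("❌ flash-attn is not importable; check the pip output above")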
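
Because `pip install -q` hides most output, it can help to confirm that the pinned versions are the ones actually installed. This is a minimal sketch using the standard library's `importlib.metadata`. The `PINS` table copies the versions from the install cells, and `find_mismatches` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cells above.
PINS = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "transformers": "4.46.3",
    "tokenizers": "0.20.3",
    "flash-attn": "2.7.3",
}

def find_mismatches(pins):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # CUDA wheels report versions such as "2.6.0+cu118", so compare the part before "+".
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems

for name, (expected, found) in find_mismatches(PINS).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

If the loop prints nothing, every pinned package is present at the expected version.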

---
## Step 4: Load DeepSeek-OCR Model

**First run**: Downloads ~8GB model (5-10 minutes)

**Subsequent runs**: Uses cached model (30 seconds)

In [None]:
from transformers import AutoModel, AutoTokenizer
import torch
import os

print("📥 Loading DeepSeek-OCR model...")
print("⏳ First run: ~5-10 min download | Cached runs: ~30 sec")

model_name = 'deepseek-ai/DeepSeek-OCR'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True
)

# Load model
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)

# Move to GPU and set to eval mode
model = model.eval().cuda().to(torch.bfloat16)

print("✅ Model loaded successfully!")
print(f"🎯 Model is on: {next(model.parameters()).device}")
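
To tell whether the next load will download the model or reuse the cache, you can inspect the Hugging Face cache on disk. The sketch below assumes the default hub cache layout (`~/.cache/huggingface/hub/models--<org>--<name>`, overridable through `HF_HUB_CACHE` or `HF_HOME`). Both helper names are hypothetical.

```python
import os

def hub_cache_dir():
    """Resolve the hub cache directory from the usual environment variables (assumed layout)."""
    if os.environ.get("HF_HUB_CACHE"):
        return os.environ["HF_HUB_CACHE"]
    default_home = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    return os.path.join(os.environ.get("HF_HOME", default_home), "hub")

def cached_model_size_gb(repo_id, cache_dir=None):
    """Return the on-disk size of a cached repo in GB, or None if it is not cached."""
    cache_dir = cache_dir or hub_cache_dir()
    repo_dir = os.path.join(cache_dir, "models--" + repo_id.replace("/", "--"))
    if not os.path.isdir(repo_dir):
        return None
    total = 0
    for root, _dirs, names in os.walk(repo_dir):
        for name in names:
            path = os.path.join(root, name)
            # Snapshot entries are usually symlinks into blobs/, so skip links to avoid double counting.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total / 1024**3

size = cached_model_size_gb("deepseek-ai/DeepSeek-OCR")
print("Model not cached yet" if size is None else f"Cached model size: {size:.2f} GB")
```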

---
## Step 5: Test with Sample Images

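Run OCR on an included sample image to verify everything works. `model.infer` prints its output while generating; depending on the version of the model's remote code it may return `None` rather than a string, in which case the `RESULT` block in the test cells prints `None`.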

In [None]:
# Test with sample image
from PIL import Image
import matplotlib.pyplot as plt

# Display sample image
image_path = 'assets/show1.jpg'
img = Image.open(image_path)
plt.figure(figsize=(10, 8))
plt.imshow(img)
plt.axis('off')
plt.title('Sample Image')
plt.show()

print(f"📐 Image size: {img.size}")

### Test 1: Convert Document to Markdown

In [None]:
print("🔍 Running OCR with Markdown conversion...\n")

prompt = "<image>\n<|grounding|>Convert the document to markdown."
output_path = './output'
os.makedirs(output_path, exist_ok=True)

result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_path,
    output_path=output_path,
    base_size=1024,      # Base resolution
    image_size=640,      # Crop size
    crop_mode=True,      # Dynamic resolution (Gundam mode)
    save_results=True,
    test_compress=True
)

print("\n" + "="*50)
print("📄 RESULT:")
print("="*50)
print(result)
print("="*50)

### Test 2: Free OCR (Text Only, No Layout)

In [None]:
print("🔍 Running Free OCR (text only)...\n")

prompt = "<image>\nFree OCR."

result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_path,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=False,
    test_compress=True
)

print("\n" + "="*50)
print("📄 RESULT:")
print("="*50)
print(result)
print("="*50)

### Test 3: Describe Image in Detail

In [None]:
print("🔍 Describing image...\n")

prompt = "<image>\nDescribe this image in detail."

result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_path,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=False,
    test_compress=True
)

print("\n" + "="*50)
print("📄 RESULT:")
print("="*50)
print(result)
print("="*50)
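
The three tests above differ only in the prompt and in `save_results`, so the shared arguments can be factored into one function. This is a sketch, not part of DeepSeek-OCR. `run_ocr` is a hypothetical wrapper that passes exactly the keyword arguments used in the cells above.

```python
import os

def run_ocr(model, tokenizer, image_path, prompt, output_path="./output", save_results=False):
    """Call model.infer with the Gundam-mode settings used in the tests above."""
    os.makedirs(output_path, exist_ok=True)
    return model.infer(
        tokenizer,
        prompt=prompt,
        image_file=image_path,
        output_path=output_path,
        base_size=1024,      # Base resolution
        image_size=640,      # Crop size
        crop_mode=True,      # Dynamic resolution (Gundam mode)
        save_results=save_results,
        test_compress=True,
    )

# Example call, once `model`, `tokenizer`, and `image_path` exist:
# result = run_ocr(model, tokenizer, image_path, "<image>\nFree OCR.")
```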

---
## Step 6: Process Your Own Images

Upload your images here and process them.

In [None]:
# Upload images
from google.colab import files
import io

print("📤 Click 'Choose Files' to upload your images...")
uploaded = files.upload()

print(f"\n✅ Uploaded {len(uploaded)} file(s)")
for filename in uploaded.keys():
    print(f"  - {filename}")
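
The processing cell opens every upload with PIL, so a non-image file would stop the loop with an error. One option is to separate image files from everything else by extension first. The extension set and the `split_uploads` helper are assumptions made for this sketch.

```python
import os

# Assumed set of raster formats; adjust to what your PIL build can open.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp", ".tif", ".tiff"}

def split_uploads(filenames):
    """Partition filenames into (images, skipped) by extension, case-insensitively."""
    images, skipped = [], []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        (images if ext in IMAGE_EXTENSIONS else skipped).append(name)
    return images, skipped

# Example call with the dict returned by files.upload():
# images, skipped = split_uploads(uploaded.keys())
```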

In [None]:
# Process uploaded images
import os
from PIL import Image

# Configuration
PROMPT = "<image>\n<|grounding|>Convert the document to markdown."  # Change this!
OUTPUT_DIR = './user_outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Available prompts:
# "<image>\n<|grounding|>Convert the document to markdown."
# "<image>\n<|grounding|>OCR this image."
# "<image>\nFree OCR."
# "<image>\nParse the figure."
# "<image>\nDescribe this image in detail."

for filename in uploaded.keys():
    print(f"\n{'='*60}")
    print(f"📄 Processing: {filename}")
    print(f"{'='*60}\n")
    
    # Display image
    img = Image.open(io.BytesIO(uploaded[filename]))
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title(filename)
    plt.show()
    
    # Save temporarily
    temp_path = f'/tmp/{filename}'
    with open(temp_path, 'wb') as f:
        f.write(uploaded[filename])
    
    # Run inference
    result = model.infer(
        tokenizer,
        prompt=PROMPT,
        image_file=temp_path,
        output_path=OUTPUT_DIR,
        base_size=1024,
        image_size=640,
        crop_mode=True,
        save_results=True,
        test_compress=True
    )
    
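    # Save result. infer() may stream its output and return None instead of a
    # string (this depends on the model's remote code), so guard the write.
    result_file = os.path.join(OUTPUT_DIR, f"{filename}_result.md")
    if isinstance(result, str):
        with open(result_file, 'w', encoding='utf-8') as f:
            f.write(result)
        print(f"\n✅ Saved to: {result_file}")
        print(f"\n📄 RESULT:\n{'-'*60}")
        print(result[:1000])  # Show first 1000 chars
        if len(result) > 1000:
            print(f"\n... (truncated, see full result in {result_file})")
    else:
        print(f"\nℹ️  infer() returned no text; check the files it wrote to {OUTPUT_DIR}")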

print(f"\n\n🎉 All images processed! Results in: {OUTPUT_DIR}")
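
When `save_results=True` and every image shares one `OUTPUT_DIR`, any files that `infer` saves under fixed names would overwrite each other. Whether it does so is an assumption worth checking against the files in your output folder. If it does, giving each image its own subdirectory avoids the collision. `per_image_paths` is a hypothetical helper, not part of DeepSeek-OCR.

```python
import os

def per_image_paths(filename, output_dir):
    """Return (image_dir, result_file) for one upload, keyed by its name without extension."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    image_dir = os.path.join(output_dir, stem)
    return image_dir, os.path.join(image_dir, f"{stem}.md")

# Example use inside the processing loop:
# image_dir, result_file = per_image_paths(filename, OUTPUT_DIR)
# os.makedirs(image_dir, exist_ok=True)
# ...then pass output_path=image_dir to model.infer
```

Two uploads that share a stem, such as `a.jpg` and `a.png`, would still map to the same folder, so include the extension in the stem if that can happen.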

---
## Step 7: Download Results

In [None]:
# Zip and download all results
import shutil

if os.path.exists(OUTPUT_DIR):
    # Create zip file
    shutil.make_archive('deepseek_ocr_results', 'zip', OUTPUT_DIR)
    
    # Download
    print("📦 Downloading results...")
    files.download('deepseek_ocr_results.zip')
    print("✅ Downloaded!")
else:
    print("⚠️  No results to download yet. Run the processing cell above first.")
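
Before downloading, it can be useful to confirm what ended up in the archive. The sketch below wraps the same `shutil.make_archive` call and lists the archive members with the standard `zipfile` module. `zip_results` is a hypothetical helper name.

```python
import shutil
import zipfile

def zip_results(output_dir, archive_stem="deepseek_ocr_results"):
    """Zip output_dir and return (archive_path, sorted file names inside the archive)."""
    archive_path = shutil.make_archive(archive_stem, "zip", output_dir)
    with zipfile.ZipFile(archive_path) as zf:
        # Drop directory entries, which end with "/".
        members = sorted(n for n in zf.namelist() if not n.endswith("/"))
    return archive_path, members

# Example use in place of the bare make_archive call:
# archive_path, members = zip_results(OUTPUT_DIR)
# print(f"{len(members)} file(s) in {archive_path}")
```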

---
## 🎯 Resolution Modes

You can adjust quality vs speed by changing these parameters:

```python
# Tiny: Fastest, lowest quality (64 vision tokens)
base_size=512, image_size=512, crop_mode=False

# Small: Fast, good quality (100 vision tokens)
base_size=640, image_size=640, crop_mode=False

# Base: Balanced (256 vision tokens)
base_size=1024, image_size=1024, crop_mode=False

# Large: Slow, best quality (400 vision tokens)
base_size=1280, image_size=1280, crop_mode=False

# Gundam: Dynamic, best for documents (256 + n×100 tokens)
base_size=1024, image_size=640, crop_mode=True  # ← Default
```
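
The presets above can be kept in one table and unpacked into `model.infer`. The token counts listed are consistent with one vision token per 64×64-pixel region, that is (size / 64)². This is an observation from the numbers above, not a formula taken from the model code, and `vision_tokens` is a hypothetical helper.

```python
RESOLUTION_MODES = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),
}

def vision_tokens(mode, n_crops=0):
    """Estimate vision tokens for a mode; n_crops only matters when crop_mode is True."""
    cfg = RESOLUTION_MODES[mode]
    global_tokens = (cfg["base_size"] // 64) ** 2
    if not cfg["crop_mode"]:
        return global_tokens
    # Gundam: one global view plus n crops at image_size.
    return global_tokens + n_crops * (cfg["image_size"] // 64) ** 2

for name in RESOLUTION_MODES:
    print(name, vision_tokens(name, n_crops=4))

# Example call:
# model.infer(tokenizer, prompt=prompt, image_file=image_path,
#             output_path=output_path, **RESOLUTION_MODES["small"])
```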

---

## 📚 Common Prompts

```python
# Documents
"<image>\n<|grounding|>Convert the document to markdown."

# General OCR
"<image>\n<|grounding|>OCR this image."

# Text only (no layout)
"<image>\nFree OCR."

# Figures in documents
"<image>\nParse the figure."

# General description
"<image>\nDescribe this image in detail."

# Locate specific text
"<image>\nLocate <|ref|>xxxx<|/ref|> in the image."
```
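
The `xxxx` in the locate prompt is a placeholder for the text you want to find. A small helper can fill it in and keep the other prompts in one place. The dictionary keys and the `locate_prompt` name are illustrative choices, not part of DeepSeek-OCR.

```python
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr":      "<image>\n<|grounding|>OCR this image.",
    "free_ocr": "<image>\nFree OCR.",
    "figure":   "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
}

def locate_prompt(text):
    """Build the 'locate' prompt for a specific piece of text."""
    return f"<image>\nLocate <|ref|>{text}<|/ref|> in the image."

print(locate_prompt("Total"))
```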

---

## ⚡ Tips

- **First run**: Takes 15-20 min (model download)
- **Subsequent runs**: 2-3 min (cached)
- **GPU memory**: Uses ~12-16 GB
- **Best results**: Use `crop_mode=True` for documents
- **Faster inference**: Use a smaller resolution mode (lower `base_size` and `image_size` with `crop_mode=False`)

---

**Created by**: Carlos Lorenzo Santos

**Date**: October 2025

**Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)