MLX-VLM: Apple Silicon OCR Solution


Native OCR for M1/M2/M3/M4 Macs - A high-performance alternative to DeepSeek-OCR optimized for Apple Silicon.

🎯 What is This?

This is a complete OCR (Optical Character Recognition) solution for Apple Silicon Macs, built on MLX-VLM, a vision-language package that runs on MLX, Apple's machine learning framework. Instead of requiring NVIDIA GPUs (CUDA), it runs natively on the Metal GPU of any M1/M2/M3/M4 Mac.

Why MLX-VLM instead of DeepSeek-OCR?

| Feature | DeepSeek-OCR | MLX-VLM (This Project) |
|---------|--------------|------------------------|
| GPU | NVIDIA only | ✅ Apple M1/M2/M3/M4 |
| Setup | Complex (CUDA, vLLM, flash-attention) | Ready to use! |
| Performance on M4 | ❌ Won't run | ✅ Native & Fast |
| Memory | High VRAM required | ✅ Optimized for unified memory |
| Quality | Excellent | ✅ Excellent |

🚀 Quick Start (3 Steps)

1. Initialize Environment

cd /Users/w/AI/28_Deepseek-OCR
source init

This activates the conda environment and defines the helper commands (demo, mlx-ocr, mlx-quick, mlx-ui, mlx-server) used throughout this README.

2. Run Demo

demo

The first run downloads a ~1.5GB model; subsequent runs skip the download and load it from the local cache.

3. Try Your Own Images

mlx-ocr your_image.jpg

That's it! 🎉


📚 What's Included

✅ Pre-configured Environment

  • Conda environment: deepseek-ocr-mlx
  • MLX 0.29.3: Apple's ML framework
  • MLX-VLM 0.3.4: Vision-Language Models
  • All dependencies installed

📜 Scripts

  • init - Initialize environment with helpful commands
  • quick_demo.py - Quick test with sample images
  • test_mlx_ocr.py - Comprehensive test suite

📖 Documentation

  • README.md - This file
  • QUICK_START.md - Command cheat sheet
  • MLX_VLM_GUIDE.md - Complete documentation

🎯 Sample Images

  • DeepSeek-OCR/assets/ - Sample images for testing

🎮 Usage Examples

After running source init, you have access to these commands:

Quick Commands

# Run demo
demo

# Full OCR test
mlx-ocr DeepSeek-OCR/assets/show1.jpg

# Quick OCR extraction
mlx-quick your_image.jpg

# Convert to markdown
mlx-ocr document.png --mode markdown

# Interactive UI
mlx-ui

# Start API server
mlx-server

Command Line (Direct)

# Basic OCR
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --image your_image.jpg \
  --prompt "Extract all text"

# Document to Markdown
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --image document.png \
  --prompt "Convert to markdown with proper formatting"

Python API

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load model
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
config = load_config("mlx-community/Qwen2-VL-2B-Instruct-4bit")

# OCR
image = ["your_image.jpg"]
prompt = "Extract all text from this image."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

output = generate(
    model, processor, formatted_prompt, image,
    max_tokens=1000, temperature=0.3
)

print(output)

🎯 Available Models

Models are automatically downloaded on first use and cached locally.

Recommended for Most Users:

mlx-community/Qwen2-VL-2B-Instruct-4bit
  • Size: ~1.5GB
  • Speed: Very Fast ⚡
  • Quality: Good ✅
  • Best for: Quick tests, real-time OCR

Better Quality:

mlx-community/Qwen2-VL-7B-Instruct-4bit
  • Size: ~4GB
  • Speed: Fast ⚡
  • Quality: Excellent ✨
  • Best for: Production use, complex documents

Best Quality:

mlx-community/Qwen2.5-VL-7B-Instruct-4bit
  • Size: ~4GB
  • Speed: Fast ⚡
  • Quality: State-of-the-art 🏆
  • Best for: Maximum accuracy, research

To use a different model:

mlx-ocr image.jpg --model mlx-community/Qwen2-VL-7B-Instruct-4bit

🌟 Features

✅ OCR Capabilities

  • Text extraction from images
  • Document scanning
  • Handwriting recognition
  • Table extraction
  • Mathematical equations (LaTeX)
  • Multi-language support

✅ Output Formats

  • Plain text
  • Markdown
  • JSON (structured data) - see the example after this list
  • LaTeX (equations)
  • Custom formats via prompts
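
The JSON format is worth a concrete example: ask for JSON in the prompt and parse the reply yourself. This is a minimal sketch reusing the Python API shown above; the prompt wording, the key names, and the fallback handling are illustrative assumptions, since the model can wrap the JSON in extra text.

import json

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

MODEL = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(MODEL)
config = load_config(MODEL)

image = ["your_image.jpg"]
# Hypothetical key names -- adapt them to your documents.
prompt = "Extract the text as JSON with keys 'title' and 'body'. Return only the JSON."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)
output = generate(
    model, processor, formatted_prompt, image,
    max_tokens=1000, temperature=0.3
)

# Parse defensively: the model may still add prose around the JSON.
try:
    data = json.loads(output)
except json.JSONDecodeError:
    data = None  # fall back to the raw text
    print(output)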

✅ Advanced Features

  • Interactive web UI (Gradio)
  • REST API server (FastAPI)
  • Batch processing
  • Video understanding
  • Audio transcription (with audio models)

📁 Project Structure

28_Deepseek-OCR/
├── init                    # 🚀 Start here! Environment setup
├── README.md              # This file
├── QUICK_START.md         # Command reference
├── MLX_VLM_GUIDE.md       # Complete documentation
│
├── quick_demo.py          # Quick demo script
├── test_mlx_ocr.py        # Full test suite
│
└── DeepSeek-OCR/          # Original repo (reference)
    └── assets/            # Sample images
        ├── show1.jpg
        ├── show2.jpg
        ├── show3.jpg
        └── show4.jpg

⚙️ Configuration

Environment Variables (Optional)

# Use HuggingFace mirror for faster downloads (China)
export HF_ENDPOINT=https://hf-mirror.com

# Cache directory for models
export HF_HOME=~/.cache/huggingface

Performance Tuning

# Temperature (OCR accuracy)
temperature = 0.3  # More deterministic (recommended for OCR)
temperature = 0.7  # Balanced
temperature = 1.0  # More creative

# Max tokens
max_tokens = 500   # Short text
max_tokens = 1000  # Medium documents
max_tokens = 2000  # Long documents
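
For reference, here is where both knobs are passed, reusing the model, processor, formatted prompt, and image from the Python API example above (the values shown are just the long-document settings):

output = generate(
    model, processor, formatted_prompt, image,
    max_tokens=2000,   # long documents
    temperature=0.3    # deterministic output, recommended for OCR
)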

🔧 Advanced Usage

Interactive Web UI

source init
mlx-ui

Opens a web interface at http://localhost:7860 where you can:

  • Upload images via drag & drop
  • Chat with the vision model
  • Download results
  • Adjust parameters

REST API Server

source init
mlx-server

Then use with curl:

curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "image": ["/path/to/image.jpg"],
    "prompt": "Extract all text",
    "max_tokens": 1000
  }'
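
The same request can be sent from Python. This is a sketch using the requests library, assuming the server accepts the payload shown in the curl example; the response is printed as raw text because the exact response schema is not documented here.

import requests

payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "image": ["/path/to/image.jpg"],
    "prompt": "Extract all text",
    "max_tokens": 1000,
}

# Same endpoint as the curl example; generation can take a while, hence the long timeout.
response = requests.post("http://localhost:8000/generate", json=payload, timeout=300)
response.raise_for_status()
print(response.text)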

Batch Processing

import os
from pathlib import Path
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load model once
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
config = load_config("mlx-community/Qwen2-VL-2B-Instruct-4bit")

# Process multiple images
image_dir = Path("path/to/images")
for image_path in image_dir.glob("*.jpg"):
    image = [str(image_path)]
    prompt = "Extract all text"

    formatted_prompt = apply_chat_template(
        processor, config, prompt, num_images=len(image)
    )

    output = generate(model, processor, formatted_prompt, image,
                     max_tokens=1000, temperature=0.3, verbose=False)

    # Save results
    output_path = image_path.with_suffix('.txt')
    output_path.write_text(output)
    print(f"✅ {image_path.name} -> {output_path.name}")

💡 Tips & Best Practices

For Best OCR Results:

  1. Use clear, high-resolution images
  2. Set temperature to 0.3 for accuracy
  3. Use 7B model for complex documents
  4. Be specific in prompts: "Extract text in reading order" vs "Extract text"

For Markdown Conversion:

  1. Include structure in prompt: "Convert to markdown with proper headings"
  2. Use temperature 0.3-0.5
  3. Increase max_tokens for long documents

For Tables:

  1. Prompt: "Extract table data as markdown table"
  2. Alternative: "Extract table as JSON"
  3. Use 7B model for complex tables

For Math Equations:

  1. Prompt: "Extract mathematical equations in LaTeX format"
  2. Use temperature 0.3
  3. 7B model recommended
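
One convenient way to apply these tips is to keep the task-specific prompts in one place and plug them into the generate pattern shown earlier. The wording below is only a suggestion, not a fixed API:

# Suggested prompts per task -- tune the wording for your own documents.
PROMPTS = {
    "text":     "Extract all text from this image in reading order.",
    "markdown": "Convert this document to markdown with proper headings.",
    "table":    "Extract the table data as a markdown table.",
    "latex":    "Extract the mathematical equations in LaTeX format.",
}

prompt = PROMPTS["markdown"]  # then format and generate as in the Python API example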

📊 Performance

On Apple M4 Mac:

| Model | Size | Speed | Quality | RAM Usage |
|-------|------|-------|---------|-----------|
| 2B-4bit | 1.5GB | ⚡⚡⚡ Very Fast | Good | ~3GB |
| 7B-4bit | 4GB | ⚡⚡ Fast | Excellent | ~6GB |
| 12B-4bit | 7GB | ⚡ Medium | Best | ~10GB |

Note: The first run downloads the model; subsequent runs use the locally cached copy, so startup is fast.


🆚 Comparison with DeepSeek-OCR

Similarities:

  • ✅ High-quality OCR
  • ✅ Document understanding
  • ✅ Markdown/JSON export
  • ✅ Multi-modal capabilities
  • ✅ Vision-language models

Advantages of MLX-VLM:

  • Works on M4 (DeepSeek-OCR requires NVIDIA)
  • Simple setup (already done!)
  • Lower memory usage (4-bit quantization)
  • Native performance (Metal GPU)
  • Unified memory (efficient on Apple Silicon)

DeepSeek-OCR Advantages:

  • Higher token compression (fewer vision tokens)
  • Specifically trained for OCR tasks
  • Larger base models available

Bottom Line:

On Apple Silicon Macs, where DeepSeek-OCR cannot run, MLX-VLM is the practical choice and delivers excellent results.


🐛 Troubleshooting

Model download is slow

# Use mirror (if in China)
export HF_ENDPOINT=https://hf-mirror.com
source init

Out of memory

# Use smaller model
mlx-ocr image.jpg --model mlx-community/Qwen2-VL-2B-Instruct-4bit

Want better quality

# Use larger model
mlx-ocr image.jpg --model mlx-community/Qwen2-VL-7B-Instruct-4bit

Environment issues

# Recreate environment
conda deactivate
conda remove -n deepseek-ocr-mlx --all
conda create -n deepseek-ocr-mlx python=3.12 -y
conda activate deepseek-ocr-mlx
pip install mlx mlx-vlm
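
After reinstalling, a quick sanity check from Python confirms that MLX sees the GPU and that mlx-vlm imports cleanly (the exact device string may differ on your machine):

import mlx.core as mx
import mlx_vlm

# On Apple Silicon this should report a GPU device, e.g. Device(gpu, 0)
print(mx.default_device())
print("mlx-vlm imported OK")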

📚 Documentation

  • QUICK_START.md - Command cheat sheet
  • MLX_VLM_GUIDE.md - Complete documentation

🎓 Learning Resources

Understanding MLX:

  • MLX is Apple's ML framework (like PyTorch for Apple Silicon)
  • Uses Metal for GPU acceleration
  • Optimized for unified memory architecture
  • Lazy evaluation for efficiency
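
Lazy evaluation means MLX records the computation and only runs it when the result is actually needed. A tiny illustration (the numbers are arbitrary):

import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = (a * 2).sum()   # nothing is computed yet; b is a lazy array
mx.eval(b)          # forces the computation on the GPU
print(b)            # array(12, dtype=float32)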

Understanding VLMs:

  • Vision-Language Models combine vision + text understanding
  • Can "see" images and generate text descriptions
  • OCR is just one application
  • Can also: describe images, answer questions, extract data

🤝 Contributing

This is a local setup, but feel free to:

  • Test different models
  • Create custom prompts
  • Share results
  • Report issues to MLX-VLM upstream

📝 License

  • MLX: Apache 2.0
  • MLX-VLM: MIT
  • DeepSeek-OCR: Check original repo
  • This setup: Free to use

🌟 What's Next?

  1. Try the demo: source init && demo
  2. Test your images: mlx-ocr your_image.jpg
  3. Explore models: Try 7B for better quality
  4. Read the guide: MLX_VLM_GUIDE.md
  5. Build something cool: Integrate into your workflow!

📞 Support

Questions or issues with MLX-VLM itself are best reported upstream to the MLX-VLM project; for this local setup, see the documentation files listed above.

🚀 Ready to start? Run: source init 🚀

Built with ❤️ for Apple Silicon
