MLX-VLM: Apple Silicon OCR Solution


Native OCR for M1/M2/M3/M4 Macs - A high-performance alternative to DeepSeek-OCR optimized for Apple Silicon.

🎯 What is This?

This is a complete OCR (Optical Character Recognition) solution for Apple Silicon Macs, built on MLX-VLM, a vision-language package that runs on MLX, Apple's machine learning framework. Instead of requiring NVIDIA GPUs (CUDA), it runs natively on the Metal GPU of any M1/M2/M3/M4 Mac.

Why MLX-VLM instead of DeepSeek-OCR?

| Feature | DeepSeek-OCR | MLX-VLM (This Project) |
|---------|--------------|------------------------|
| GPU | NVIDIA only | ✅ Apple M1/M2/M3/M4 |
| Setup | Complex (CUDA, vLLM, flash-attention) | Ready to use! |
| Performance on M4 | ❌ Won't run | ✅ Native & Fast |
| Memory | High VRAM required | ✅ Optimized for unified memory |
| Quality | Excellent | ✅ Excellent |

🚀 Quick Start (3 Steps)

1. Initialize Environment

cd /Users/w/AI/28_Deepseek-OCR
source init

This activates the conda environment and defines the helper commands (demo, mlx-ocr, mlx-quick, mlx-ui, mlx-server) used throughout this README.

2. Run Demo

demo

The first run downloads a ~1.5GB model; subsequent runs skip the download and load it from the local cache.

3. Try Your Own Images

mlx-ocr your_image.jpg

That's it! 🎉


📚 What's Included

✅ Pre-configured Environment

  • Conda environment: deepseek-ocr-mlx
  • MLX 0.29.3: Apple's ML framework
  • MLX-VLM 0.3.4: Vision-Language Models
  • All dependencies installed

📜 Scripts

  • init - Initialize environment with helpful commands
  • quick_demo.py - Quick test with sample images
  • test_mlx_ocr.py - Comprehensive test suite

📖 Documentation

  • README.md - This file
  • QUICK_START.md - Command cheat sheet
  • MLX_VLM_GUIDE.md - Complete documentation

🎯 Sample Images

  • DeepSeek-OCR/assets/ - Sample images for testing

🎮 Usage Examples

After running source init, you have access to these commands:

Quick Commands

# Run demo
demo

# Full OCR test
mlx-ocr DeepSeek-OCR/assets/show1.jpg

# Quick OCR extraction
mlx-quick your_image.jpg

# Convert to markdown
mlx-ocr document.png --mode markdown

# Interactive UI
mlx-ui

# Start API server
mlx-server

Command Line (Direct)

# Basic OCR
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --image your_image.jpg \
  --prompt "Extract all text"

# Document to Markdown
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --image document.png \
  --prompt "Convert to markdown with proper formatting"

Python API

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load model
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
config = load_config("mlx-community/Qwen2-VL-2B-Instruct-4bit")

# OCR
image = ["your_image.jpg"]
prompt = "Extract all text from this image."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

output = generate(
    model, processor, formatted_prompt, image,
    max_tokens=1000, temperature=0.3
)

print(output)

🎯 Available Models

Models are automatically downloaded on first use and cached locally.

Recommended for Most Users:

mlx-community/Qwen2-VL-2B-Instruct-4bit
  • Size: ~1.5GB
  • Speed: Very Fast ⚡
  • Quality: Good ✅
  • Best for: Quick tests, real-time OCR

Better Quality:

mlx-community/Qwen2-VL-7B-Instruct-4bit
  • Size: ~4GB
  • Speed: Fast ⚡
  • Quality: Excellent ✨
  • Best for: Production use, complex documents

Best Quality:

mlx-community/Qwen2.5-VL-7B-Instruct-4bit
  • Size: ~4GB
  • Speed: Fast ⚡
  • Quality: State-of-the-art 🏆
  • Best for: Maximum accuracy, research

To use a different model:

mlx-ocr image.jpg --model mlx-community/Qwen2-VL-7B-Instruct-4bit

🌟 Features

✅ OCR Capabilities

  • Text extraction from images
  • Document scanning
  • Handwriting recognition
  • Table extraction
  • Mathematical equations (LaTeX)
  • Multi-language support

✅ Output Formats

  • Plain text
  • Markdown
  • JSON (structured data) - see the example after this list
  • LaTeX (equations)
  • Custom formats via prompts
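
The JSON format is worth a concrete example: ask for JSON in the prompt and parse the reply yourself. This is a minimal sketch reusing the Python API shown above; the prompt wording, the key names, and the fallback handling are illustrative assumptions, since the model can wrap the JSON in extra text.

import json

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

MODEL = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(MODEL)
config = load_config(MODEL)

image = ["your_image.jpg"]
# Hypothetical key names -- adapt them to your documents.
prompt = "Extract the text as JSON with keys 'title' and 'body'. Return only the JSON."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)
output = generate(
    model, processor, formatted_prompt, image,
    max_tokens=1000, temperature=0.3
)

# Parse defensively: the model may still add prose around the JSON.
try:
    data = json.loads(output)
except json.JSONDecodeError:
    data = None  # fall back to the raw text
    print(output)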

✅ Advanced Features

  • Interactive web UI (Gradio)
  • REST API server (FastAPI)
  • Batch processing
  • Video understanding
  • Audio transcription (with audio models)

📁 Project Structure

28_Deepseek-OCR/
├── init                    # 🚀 Start here! Environment setup
├── README.md              # This file
├── QUICK_START.md         # Command reference
├── MLX_VLM_GUIDE.md       # Complete documentation
│
├── quick_demo.py          # Quick demo script
├── test_mlx_ocr.py        # Full test suite
│
└── DeepSeek-OCR/          # Original repo (reference)
    └── assets/            # Sample images
        ├── show1.jpg
        ├── show2.jpg
        ├── show3.jpg
        └── show4.jpg

⚙️ Configuration

Environment Variables (Optional)

# Use HuggingFace mirror for faster downloads (China)
export HF_ENDPOINT=https://hf-mirror.com

# Cache directory for models
export HF_HOME=~/.cache/huggingface

Performance Tuning

# Temperature (OCR accuracy)
temperature = 0.3  # More deterministic (recommended for OCR)
temperature = 0.7  # Balanced
temperature = 1.0  # More creative

# Max tokens
max_tokens = 500   # Short text
max_tokens = 1000  # Medium documents
max_tokens = 2000  # Long documents
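
For reference, here is where both knobs are passed, reusing the model, processor, formatted prompt, and image from the Python API example above (the values shown are just the long-document settings):

output = generate(
    model, processor, formatted_prompt, image,
    max_tokens=2000,   # long documents
    temperature=0.3    # deterministic output, recommended for OCR
)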

🔧 Advanced Usage

Interactive Web UI

source init
mlx-ui

Opens a web interface at http://localhost:7860 where you can:

  • Upload images via drag & drop
  • Chat with the vision model
  • Download results
  • Adjust parameters

REST API Server

source init
mlx-server

Then use with curl:

curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "image": ["/path/to/image.jpg"],
    "prompt": "Extract all text",
    "max_tokens": 1000
  }'
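
The same request can be sent from Python. This is a sketch using the requests library, assuming the server accepts the payload shown in the curl example; the response is printed as raw text because the exact response schema is not documented here.

import requests

payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "image": ["/path/to/image.jpg"],
    "prompt": "Extract all text",
    "max_tokens": 1000,
}

# Same endpoint as the curl example; generation can take a while, hence the long timeout.
response = requests.post("http://localhost:8000/generate", json=payload, timeout=300)
response.raise_for_status()
print(response.text)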

Batch Processing

import os
from pathlib import Path
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load model once
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
config = load_config("mlx-community/Qwen2-VL-2B-Instruct-4bit")

# Process multiple images
image_dir = Path("path/to/images")
for image_path in image_dir.glob("*.jpg"):
    image = [str(image_path)]
    prompt = "Extract all text"

    formatted_prompt = apply_chat_template(
        processor, config, prompt, num_images=len(image)
    )

    output = generate(model, processor, formatted_prompt, image,
                     max_tokens=1000, temperature=0.3, verbose=False)

    # Save results
    output_path = image_path.with_suffix('.txt')
    output_path.write_text(output)
    print(f"✅ {image_path.name} -> {output_path.name}")

💡 Tips & Best Practices

For Best OCR Results:

  1. Use clear, high-resolution images
  2. Set temperature to 0.3 for accuracy
  3. Use 7B model for complex documents
  4. Be specific in prompts: "Extract text in reading order" vs "Extract text"

For Markdown Conversion:

  1. Include structure in prompt: "Convert to markdown with proper headings"
  2. Use temperature 0.3-0.5
  3. Increase max_tokens for long documents

For Tables:

  1. Prompt: "Extract table data as markdown table"
  2. Alternative: "Extract table as JSON"
  3. Use 7B model for complex tables

For Math Equations:

  1. Prompt: "Extract mathematical equations in LaTeX format"
  2. Use temperature 0.3
  3. 7B model recommended
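
One convenient way to apply these tips is to keep the task-specific prompts in one place and plug them into the generate pattern shown earlier. The wording below is only a suggestion, not a fixed API:

# Suggested prompts per task -- tune the wording for your own documents.
PROMPTS = {
    "text":     "Extract all text from this image in reading order.",
    "markdown": "Convert this document to markdown with proper headings.",
    "table":    "Extract the table data as a markdown table.",
    "latex":    "Extract the mathematical equations in LaTeX format.",
}

prompt = PROMPTS["markdown"]  # then format and generate as in the Python API example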

📊 Performance

On Apple M4 Mac:

| Model | Size | Speed | Quality | RAM Usage |
|-------|------|-------|---------|-----------|
| 2B-4bit | 1.5GB | ⚡⚡⚡ Very Fast | Good | ~3GB |
| 7B-4bit | 4GB | ⚡⚡ Fast | Excellent | ~6GB |
| 12B-4bit | 7GB | ⚡ Medium | Best | ~10GB |

Note: The first run downloads the model; subsequent runs use the locally cached copy, so startup is fast.


🆚 Comparison with DeepSeek-OCR

Similarities:

  • ✅ High-quality OCR
  • ✅ Document understanding
  • ✅ Markdown/JSON export
  • ✅ Multi-modal capabilities
  • ✅ Vision-language models

Advantages of MLX-VLM:

  • Works on M4 (DeepSeek-OCR requires NVIDIA)
  • Simple setup (already done!)
  • Lower memory usage (4-bit quantization)
  • Native performance (Metal GPU)
  • Unified memory (efficient on Apple Silicon)

DeepSeek-OCR Advantages:

  • Higher token compression (fewer vision tokens)
  • Specifically trained for OCR tasks
  • Larger base models available

Bottom Line:

On Apple Silicon Macs, where DeepSeek-OCR cannot run, MLX-VLM is the practical choice and delivers excellent results.


🐛 Troubleshooting

Model download is slow

# Use mirror (if in China)
export HF_ENDPOINT=https://hf-mirror.com
source init

Out of memory

# Use smaller model
mlx-ocr image.jpg --model mlx-community/Qwen2-VL-2B-Instruct-4bit

Want better quality

# Use larger model
mlx-ocr image.jpg --model mlx-community/Qwen2-VL-7B-Instruct-4bit

Environment issues

# Recreate environment
conda deactivate
conda remove -n deepseek-ocr-mlx --all
conda create -n deepseek-ocr-mlx python=3.12 -y
conda activate deepseek-ocr-mlx
pip install mlx mlx-vlm
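
After reinstalling, a quick sanity check from Python confirms that MLX sees the GPU and that mlx-vlm imports cleanly (the exact device string may differ on your machine):

import mlx.core as mx
import mlx_vlm

# On Apple Silicon this should report a GPU device, e.g. Device(gpu, 0)
print(mx.default_device())
print("mlx-vlm imported OK")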

📚 Documentation

  • QUICK_START.md - Command cheat sheet
  • MLX_VLM_GUIDE.md - Complete documentation

🎓 Learning Resources

Understanding MLX:

  • MLX is Apple's ML framework (like PyTorch for Apple Silicon)
  • Uses Metal for GPU acceleration
  • Optimized for unified memory architecture
  • Lazy evaluation for efficiency
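
Lazy evaluation means MLX records the computation and only runs it when the result is actually needed. A tiny illustration (the numbers are arbitrary):

import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = (a * 2).sum()   # nothing is computed yet; b is a lazy array
mx.eval(b)          # forces the computation on the GPU
print(b)            # array(12, dtype=float32)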

Understanding VLMs:

  • Vision-Language Models combine vision + text understanding
  • Can "see" images and generate text descriptions
  • OCR is just one application
  • Can also: describe images, answer questions, extract data

🤝 Contributing

This is a local setup, but feel free to:

  • Test different models
  • Create custom prompts
  • Share results
  • Report issues to MLX-VLM upstream

📝 License

  • MLX: Apache 2.0
  • MLX-VLM: MIT
  • DeepSeek-OCR: Check original repo
  • This setup: Free to use

🌟 What's Next?

  1. Try the demo: source init && demo
  2. Test your images: mlx-ocr your_image.jpg
  3. Explore models: Try 7B for better quality
  4. Read the guide: MLX_VLM_GUIDE.md
  5. Build something cool: Integrate into your workflow!

📞 Support

Questions or issues with MLX-VLM itself are best reported upstream to the MLX-VLM project; for this local setup, see the documentation files listed above.

🚀 Ready to start? Run: source init 🚀

Built with ❤️ for Apple Silicon
