A comprehensive demonstration of the DeepSeek-OCR model for Optical Character Recognition (OCR) and document conversion tasks.
DeepSeek-OCR is a 3B parameter vision-language model designed for high-performance OCR and structured document conversion. It can:
- Convert documents to Markdown format with preserved structure
- Perform general OCR on any text-containing images
- Handle complex layouts including tables and forms
- Compress text by up to 10x while maintaining 97% accuracy
- Process financial charts into structured data
This demo provides:
- Command-line interface for batch processing
- Gradio web interface for interactive use
- Python API for integration into your projects
- Support for both document conversion and general OCR tasks
- GPU acceleration with Flash Attention 2 support
Requirements:
- Python 3.8+
- CUDA-capable GPU (recommended, but CPU also supported)
- 8GB+ GPU memory for optimal performance
git clone <your-repo-url>
cd demo
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Note: Flash Attention installation may require specific CUDA versions. If it fails, the model will fall back to standard attention.
Process a single image:
python deepseek_ocr_demo.py --image path/to/image.png --task markdown
Options:
- `--image`: Path to the input image (required)
- `--task`: Task type, `markdown` or `ocr` (default: `markdown`)
- `--output`: Save result to file (optional)
- `--max-tokens`: Maximum tokens to generate (default: 2048)
- `--model`: Model name or path (default: `deepseek-ai/DeepSeek-OCR`)
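The options above map naturally onto `argparse`. The following is a minimal sketch of how such a parser could be declared, not the actual source of `deepseek_ocr_demo.py`; the function name `build_parser` and the help strings are assumptions, while the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(description="DeepSeek-OCR demo CLI (sketch)")
    parser.add_argument("--image", required=True, help="Path to the input image")
    parser.add_argument(
        "--task",
        choices=["markdown", "ocr"],
        default="markdown",
        help="Task type",
    )
    parser.add_argument("--output", default=None, help="Save result to file")
    parser.add_argument(
        "--max-tokens",
        type=int,
        default=2048,
        help="Maximum tokens to generate",
    )
    parser.add_argument(
        "--model",
        default="deepseek-ai/DeepSeek-OCR",
        help="Model name or path",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--image", "document.jpg", "--task", "ocr"])
print(args.image, args.task, args.max_tokens, args.model)
```

Note that `argparse` exposes `--max-tokens` as the attribute `args.max_tokens`, and restricting `--task` with `choices` rejects unknown task names before the model is loaded.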
Convert a document to Markdown:
python deepseek_ocr_demo.py --image document.jpg --task markdown --output result.md
Perform general OCR:
python deepseek_ocr_demo.py --image screenshot.png --task ocr
Launch the interactive web interface:
python gradio_app.py
Then open your browser to http://localhost:7860
Features:
- Upload images via drag-and-drop or file browser
- Choose between Markdown conversion and general OCR
- Adjust max token length
- Copy results with one click
Use DeepSeek OCR in your own Python code:
from deepseek_ocr_demo import DeepSeekOCR
# Initialize model
ocr = DeepSeekOCR()
# Process single image
result = ocr.process_image("document.jpg", task="markdown")
print(result)
# Batch process multiple images
results = ocr.batch_process(
["image1.jpg", "image2.jpg", "image3.jpg"],
task="ocr"
)
- Convert scanned documents to editable Markdown
- Extract structured data from PDFs
- Digitize paper forms and reports
- Extract text from screenshots
- Read text from photos
- Process receipts and invoices
- Extract data from charts and graphs
- Convert financial charts to tables
- Parse complex document layouts
- Extract information from forms
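For the digitization use cases above, the list of paths given to `batch_process` can be built from a folder of scans. This is a small sketch using only the standard library; the helper name `collect_images`, the set of accepted suffixes, and the folder name in the comment are assumptions for illustration, not part of the demo.

```python
from pathlib import Path
from typing import List

# Assumed set of image suffixes; extend it to match your own files.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_images(folder: str) -> List[str]:
    """Return the image files directly inside `folder`, sorted by path."""
    root = Path(folder)
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Hypothetical usage with the API shown earlier (requires the model):
# results = ocr.batch_process(collect_images("scans"), task="markdown")
```

Sorting keeps the order of results predictable, which helps when matching each output back to its source file.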
- Model Name: deepseek-ai/DeepSeek-OCR
- Parameters: 3 billion
- Architecture: Vision-Language Model (VLM)
- Input: Images (documents, screenshots, photos)
- Output: Text, Markdown, structured data
- Compression Ratio: Up to 10x text compression
- Accuracy: Maintains 97% of original information
- Speed: Real-time processing on modern GPUs
- Python: 3.12.9
- CUDA: 11.8+
- PyTorch: 2.6.0+
- Transformers: 4.46.3+
- Flash Attention: 2.7.3+
The model uses the following prompt formats:
Document to Markdown:
<image>
<|grounding|>Convert the document to markdown.
General OCR:
<image>
<|grounding|>OCR this image.
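The two prompt formats can be kept in a small lookup table keyed by task name. This is a sketch only: the names `PROMPTS` and `build_prompt` are hypothetical, and joining the `<image>` placeholder and the instruction with a single newline is an assumption based on how the prompts are laid out above.

```python
# Prompt text copied from the formats above; the "\n" separator is an assumption.
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr": "<image>\n<|grounding|>OCR this image.",
}


def build_prompt(task: str) -> str:
    """Return the prompt for a task name, rejecting unknown tasks early."""
    try:
        return PROMPTS[task]
    except KeyError:
        expected = ", ".join(sorted(PROMPTS))
        raise ValueError(f"Unknown task {task!r}; expected one of: {expected}") from None


print(build_prompt("markdown"))
```

Keeping the prompts in one place means the CLI, the web interface, and the Python API can share the same task names.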
If you encounter GPU memory errors:
- Reduce the `--max-tokens` value
- Process smaller images
- Use CPU instead: The model will automatically fall back to CPU if CUDA is unavailable
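One way to "process smaller images" is to downscale each input so its longest side stays under a limit while keeping the aspect ratio. The sketch below only computes the target size; the helper name `fit_within` and the default limit of 1024 pixels are assumptions, not values taken from the model documentation.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) down so the longest side is at most `max_side`.

    Images that already fit are returned unchanged; nothing is ever upscaled.
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Guard against a dimension rounding down to zero on extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))
```

The returned size can be passed to an image library's resize call (for example Pillow's `Image.resize`) before handing the image to the model. Very small text may become unreadable after heavy downscaling, so the limit is worth tuning per document type.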
Flash Attention requires specific CUDA versions. If installation fails:
- The model will automatically use standard attention
- Performance may be slightly reduced but functionality remains intact
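The fallback described above can be approximated by checking whether the `flash_attn` package is importable and choosing an attention backend accordingly. This is a sketch of the idea, not the demo's actual logic; the function name is hypothetical, and the returned strings are the values the Transformers `attn_implementation` loading argument accepts for Flash Attention 2 and standard attention.

```python
import importlib.util


def pick_attention_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is installed, else "eager"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


print(pick_attention_implementation())
```

The check only confirms that the package is installed. A build that does not match the local CUDA version can still fail at load time, so wrapping model loading in a retry with `"eager"` may also be worthwhile.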
On first run, the model will be downloaded from Hugging Face (approximately 6GB). This may take some time depending on your internet connection.
See the examples/ directory for sample images and usage scripts.
This demo uses the DeepSeek-OCR model. Please refer to the official model page for license information.
Contributions are welcome! Please feel free to submit issues or pull requests.
This demo is built on top of the DeepSeek-OCR model developed by DeepSeek AI.