DeepSeek OCR Demo

A comprehensive demonstration of the DeepSeek-OCR model for Optical Character Recognition (OCR) and document conversion tasks.

Overview

DeepSeek-OCR is a 3B-parameter vision-language model designed for high-performance OCR and structured document conversion. It can:

Convert documents to Markdown format with preserved structure
Perform general OCR on any text-containing images
Handle complex layouts including tables and forms
Represent text as vision tokens at compression ratios up to about 10x while keeping roughly 97% OCR decoding accuracy
Process financial charts into structured data

Features

Command-line interface for batch processing
Gradio web interface for interactive use
Python API for integration into your projects
Support for both document conversion and general OCR tasks
GPU acceleration with Flash Attention 2 support

Requirements

Python 3.8+
CUDA-capable GPU (recommended, but CPU also supported)
8GB+ GPU memory for optimal performance

Installation

1. Clone the repository

git clone <your-repo-url>
cd demo

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

Note: Flash Attention installation may require specific CUDA versions. If it fails, the model will fall back to standard attention.

Usage

Command-Line Interface

Process a single image:

python deepseek_ocr_demo.py --image path/to/image.png --task markdown

Options:

--image: Path to the input image (required)
--task: Task type - markdown or ocr (default: markdown)
--output: Save result to file (optional)
--max-tokens: Maximum tokens to generate (default: 2048)
--model: Model name or path (default: deepseek-ai/DeepSeek-OCR)
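
The options above map naturally onto an argparse parser. The following is a minimal sketch of such a parser, not the actual contents of deepseek_ocr_demo.py; the function name build_parser and the help strings are assumptions made for illustration, while the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(
        description="DeepSeek-OCR demo (illustrative sketch)"
    )
    parser.add_argument("--image", required=True, help="Path to the input image")
    parser.add_argument("--task", choices=["markdown", "ocr"], default="markdown",
                        help="Task type")
    parser.add_argument("--output", default=None, help="Save result to this file")
    parser.add_argument("--max-tokens", type=int, default=2048,
                        help="Maximum tokens to generate")
    parser.add_argument("--model", default="deepseek-ai/DeepSeek-OCR",
                        help="Model name or path")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--image", "document.jpg", "--task", "ocr"])
print(args.image, args.task, args.max_tokens, args.model)
```

Note that argparse exposes --max-tokens as args.max_tokens, with the hyphen converted to an underscore.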

Examples

Convert a document to Markdown:

python deepseek_ocr_demo.py --image document.jpg --task markdown --output result.md

Perform general OCR:

python deepseek_ocr_demo.py --image screenshot.png --task ocr
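
Because the documented command handles one image per invocation, processing a whole folder needs a small wrapper. The sketch below only builds the command lines, one per image, using the flags listed under Options; the helper name build_commands, the set of accepted file extensions, and the output naming scheme are assumptions for illustration.

```python
import sys
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed set of input formats


def build_commands(image_dir, out_dir, task="markdown"):
    """Return one argv list per image, using only the documented flags."""
    suffix = ".md" if task == "markdown" else ".txt"
    commands = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        target = Path(out_dir) / (image.stem + suffix)
        commands.append([
            sys.executable, "deepseek_ocr_demo.py",
            "--image", str(image),
            "--task", task,
            "--output", str(target),
        ])
    return commands


# Demonstrate on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.png", "a.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_commands = build_commands(tmp, Path(tmp) / "out")

for cmd in demo_commands:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run to execute it. Every invocation would presumably load the model again, so for many images the Python API described below is likely the better fit.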

Gradio Web Interface

Launch the interactive web interface:

python gradio_app.py

Then open your browser to http://localhost:7860

Features:

Upload images via drag-and-drop or file browser
Choose between Markdown conversion and general OCR
Adjust max token length
Copy results with one click

Python API

Use DeepSeek OCR in your own Python code:

from deepseek_ocr_demo import DeepSeekOCR

# Initialize model
ocr = DeepSeekOCR()

# Process single image
result = ocr.process_image("document.jpg", task="markdown")
print(result)

# Batch process multiple images
results = ocr.batch_process(
    ["image1.jpg", "image2.jpg", "image3.jpg"],
    task="ocr"
)
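
A common next step after batch processing is writing each result to its own file. The sketch below assumes that batch_process returns a list of strings in the same order as the input paths, which this README does not confirm; save_results is a hypothetical helper, and the demonstration uses stand-in strings so that it runs without loading the model.

```python
import tempfile
from pathlib import Path


def save_results(image_paths, results, out_dir, suffix=".md"):
    """Write one text file per image, named after the image's stem."""
    if len(image_paths) != len(results):
        raise ValueError("expected one result per image")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path, text in zip(image_paths, results):
        target = out_dir / (Path(image_path).stem + suffix)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Stand-in strings; in practice these would come from ocr.batch_process(...).
images = ["image1.jpg", "image2.jpg"]
fake_results = ["# Page 1", "# Page 2"]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_results(images, fake_results, tmp)
    names = [p.name for p in paths]
    first_text = paths[0].read_text(encoding="utf-8")
print(names)
```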

Use Cases

Document Conversion

Convert scanned documents to editable Markdown
Extract structured data from PDFs
Digitize paper forms and reports

General OCR

Extract text from screenshots
Read text from photos
Process receipts and invoices
Extract data from charts and graphs

Structured Data Extraction

Convert financial charts to tables
Parse complex document layouts
Extract information from forms

Model Information

Model Name: deepseek-ai/DeepSeek-OCR
Parameters: 3 billion
Architecture: Vision-Language Model (VLM)
Input: Images (documents, screenshots, photos)
Output: Text, Markdown, structured data

Performance

Compression Ratio: Up to about 10x (text tokens per vision token)
Accuracy: Roughly 97% OCR decoding accuracy at compression ratios up to about 10x
Speed: Real-time processing on modern GPUs
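
To make the compression figure concrete: the ratio compares how many text tokens a page would need with how many vision tokens the model uses to encode the same page. The sketch below is plain arithmetic; the token counts are made-up illustrative numbers, not measurements from this demo.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Illustrative numbers: a page worth 1000 text tokens encoded as 100 vision tokens.
ratio = compression_ratio(1000, 100)
print(f"{ratio:.1f}x")
```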

Technical Details

Tested Configuration

Python: 3.12.9
CUDA: 11.8+
PyTorch: 2.6.0+
Transformers: 4.46.3+
Flash Attention: 2.7.3+

Prompt Format

The model uses the following prompt formats:

Document to Markdown:

<image>
<|grounding|>Convert the document to markdown.

General OCR:

<image>
<|grounding|>OCR this image.
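
The two formats above can be kept in a small lookup keyed by task name. This is a sketch; PROMPTS and build_prompt are hypothetical names and may not match how deepseek_ocr_demo.py stores its prompts. The prompt strings themselves are copied from this section.

```python
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr": "<image>\n<|grounding|>OCR this image.",
}


def build_prompt(task: str) -> str:
    """Return the prompt text for a task name, rejecting unknown tasks."""
    try:
        return PROMPTS[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(PROMPTS)}"
        ) from None


print(build_prompt("markdown"))
```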

Troubleshooting

CUDA Out of Memory

If you encounter GPU memory errors:

Reduce --max-tokens value
Process smaller images
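Run on CPU instead: the model falls back to CPU automatically when CUDA is unavailable, so hiding the GPU (for example, by setting CUDA_VISIBLE_DEVICES to an empty string) forces CPU execution, at the cost of much slower processing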
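
For the "process smaller images" option, one approach is to downscale large inputs before handing them to the model. The sketch below only computes the target size while preserving the aspect ratio; fit_within is a hypothetical helper and the 1024-pixel limit is an arbitrary illustrative value, not a documented requirement of the model.

```python
def fit_within(width: int, height: int, max_side: int = 1024):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))
print(fit_within(800, 600))
```

With Pillow installed, the returned tuple can be passed to Image.resize to produce the smaller image.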

Flash Attention Installation Issues

Flash Attention requires specific CUDA versions. If installation fails:

The model will automatically use standard attention
Performance may be slightly reduced but functionality remains intact
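
A fallback of this kind can be sketched as a check for whether the flash_attn package is importable. This is an assumption about how such a fallback might work, not a description of the demo's code; pick_attn_implementation is a hypothetical helper.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "eager"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"  # standard attention


attn = pick_attn_implementation()
print(attn)
```

The returned string matches the values accepted by the attn_implementation argument of from_pretrained in Transformers. Being importable does not guarantee that the installed GPU supports Flash Attention, so a real loader may still need to catch errors at load time.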

Model Download

On first run, the model will be downloaded from Hugging Face (approximately 6GB). This may take some time depending on your internet connection.

Examples

See the examples/ directory for sample images and usage scripts.

License

This demo uses the DeepSeek-OCR model. Please refer to the official model page for license information.
