DeepSeek OCR Demo

A comprehensive demonstration of the DeepSeek-OCR model for Optical Character Recognition (OCR) and document conversion tasks.

Overview

DeepSeek-OCR is a 3B parameter vision-language model designed for high-performance OCR and structured document conversion. It can:

  • Convert documents to Markdown format with preserved structure
  • Perform general OCR on any text-containing images
  • Handle complex layouts including tables and forms
  • Compress text by up to 10x while maintaining 97% accuracy
  • Process financial charts into structured data

Features

  • Command-line interface for batch processing
  • Gradio web interface for interactive use
  • Python API for integration into your projects
  • Support for both document conversion and general OCR tasks
  • GPU acceleration with Flash Attention 2 support

Requirements

  • Python 3.8+
  • CUDA-capable GPU (recommended, but CPU also supported)
  • 8GB+ GPU memory for optimal performance

Installation

1. Clone the repository

git clone <your-repo-url>
cd demo

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

Note: Flash Attention installation may require specific CUDA versions. If it fails, the model will fall back to standard attention.
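The fallback can also be made explicit when loading the model. A minimal sketch of detecting whether the optional flash-attn package is importable (the `attn_implementation` argument is the standard Transformers switch for this; the demo script itself may handle it differently):

```python
# Hedged sketch: pick an attention implementation based on whether the
# optional flash-attn package is installed.
import importlib.util

def choose_attn_implementation() -> str:
    # "flash_attn" is the import name of the Flash Attention 2 package
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"  # standard attention fallback
```

The result can be passed as `attn_implementation=` to `AutoModel.from_pretrained` when loading the model.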

Usage

Command-Line Interface

Process a single image:

python deepseek_ocr_demo.py --image path/to/image.png --task markdown

Options:

  • --image: Path to the input image (required)
  • --task: Task type - markdown or ocr (default: markdown)
  • --output: Save result to file (optional)
  • --max-tokens: Maximum tokens to generate (default: 2048)
  • --model: Model name or path (default: deepseek-ai/DeepSeek-OCR)

Examples

Convert a document to Markdown:

python deepseek_ocr_demo.py --image document.jpg --task markdown --output result.md

Perform general OCR:

python deepseek_ocr_demo.py --image screenshot.png --task ocr

Gradio Web Interface

Launch the interactive web interface:

python gradio_app.py

Then open your browser to http://localhost:7860

Features:

  • Upload images via drag-and-drop or file browser
  • Choose between Markdown conversion and general OCR
  • Adjust max token length
  • Copy results with one click

Python API

Use DeepSeek OCR in your own Python code:

from deepseek_ocr_demo import DeepSeekOCR

# Initialize model
ocr = DeepSeekOCR()

# Process single image
result = ocr.process_image("document.jpg", task="markdown")
print(result)

# Batch process multiple images
results = ocr.batch_process(
    ["image1.jpg", "image2.jpg", "image3.jpg"],
    task="ocr"
)
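If you want the batch results on disk, a small helper can pair each input image with an output file. This is an illustrative sketch, not part of the demo's API; it assumes `batch_process` returns one string per image, in input order:

```python
from pathlib import Path

def save_results(image_paths, results, out_dir="ocr_output"):
    """Write each OCR result to a .md file named after its source image."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path, text in zip(image_paths, results):
        dest = out / (Path(image_path).stem + ".md")
        dest.write_text(text, encoding="utf-8")
        written.append(str(dest))
    return written
```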

Use Cases

Document Conversion

  • Convert scanned documents to editable Markdown
  • Extract structured data from PDFs
  • Digitize paper forms and reports

General OCR

  • Extract text from screenshots
  • Read text from photos
  • Process receipts and invoices
  • Extract data from charts and graphs

Structured Data Extraction

  • Convert financial charts to tables
  • Parse complex document layouts
  • Extract information from forms

Model Information

  • Model Name: deepseek-ai/DeepSeek-OCR
  • Parameters: 3 billion
  • Architecture: Vision-Language Model (VLM)
  • Input: Images (documents, screenshots, photos)
  • Output: Text, Markdown, structured data

Performance

  • Compression Ratio: Up to 10x text compression
  • Accuracy: Maintains 97% of original information
  • Speed: Real-time processing on modern GPUs

Technical Details

Tested Configuration

  • Python: 3.12.9
  • CUDA: 11.8+
  • PyTorch: 2.6.0+
  • Transformers: 4.46.3+
  • Flash Attention: 2.7.3+
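A requirements file matching the tested configuration above might look like the following. This is only a sketch; the `requirements.txt` shipped in the repo is authoritative, and flash-attn is optional (see the Installation note):

```
torch>=2.6.0
transformers>=4.46.3
flash-attn>=2.7.3  # optional; falls back to standard attention if absent
gradio             # only needed for the web interface
```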

Prompt Format

The model uses the following prompt formats:

Document to Markdown:

<image>
<|grounding|>Convert the document to markdown.

General OCR:

<image>
<|grounding|>OCR this image.
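In code, the two prompts above can live in a small lookup so the CLI's `--task` flag maps directly to a prompt string (a sketch; the demo script may organize this differently):

```python
# Prompt templates for the two supported tasks, as documented above.
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr": "<image>\n<|grounding|>OCR this image.",
}

def build_prompt(task: str = "markdown") -> str:
    """Return the prompt for a task, rejecting unknown task names early."""
    if task not in PROMPTS:
        raise ValueError(f"task must be one of {sorted(PROMPTS)}, got {task!r}")
    return PROMPTS[task]
```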

Troubleshooting

CUDA Out of Memory

If you encounter GPU memory errors:

  • Reduce --max-tokens value
  • Process smaller images
  • Run on CPU instead: the model falls back to CPU automatically when CUDA is unavailable
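Device selection with a CPU fallback can also be done up front rather than waiting for an error. A minimal sketch, which assumes PyTorch but degrades gracefully if torch itself is missing:

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-capable GPU is visible, else "cpu"."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```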

Flash Attention Installation Issues

Flash Attention requires specific CUDA versions. If installation fails:

  • The model will automatically use standard attention
  • Performance may be slightly reduced but functionality remains intact

Model Download

On first run, the model will be downloaded from Hugging Face (approximately 6GB). This may take some time depending on your internet connection.

Examples

See the examples/ directory for sample images and usage scripts.

License

This demo uses the DeepSeek-OCR model. Please refer to the official model page for license information.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Acknowledgments

This demo is built on top of the DeepSeek-OCR model developed by DeepSeek AI.
