Skip to content

ProjectBroadside/BattleshipGraphicsProjects

Repository files navigation

Florence-2 Warship Extractor

A robust system for extracting warship illustrations from historical PDF documents using Microsoft's Florence-2 vision-language model.

Overview

This project implements a production-ready extraction pipeline specifically designed for processing Jane's Fighting Ships historical documents and similar naval archives. It uses advanced computer vision techniques to identify, extract, and catalog warship illustrations with high accuracy.

Key Features

  • Florence-2 Integration: Uses Microsoft's state-of-the-art vision-language model for precise object detection
  • Multi-Prompt Strategy: Employs diverse prompts to catch different types of warship illustrations
  • High-Resolution Processing: Converts PDFs at 300+ DPI for optimal detection accuracy
  • Intelligent Filtering: Non-Maximum Suppression to remove duplicate detections
  • Memory Optimization: Dynamic batch sizing and CUDA memory management
  • Comprehensive Logging: Detailed progress tracking and performance monitoring
  • CLI Interface: Command-line tools for extraction, batch processing, and reporting

Installation

Prerequisites

  • Python 3.9-3.11
  • CUDA-compatible GPU (recommended)
  • Poetry for dependency management

Setup

  1. Clone the repository:
git clone <repository-url>
cd warship-extractor
  1. Install dependencies:
poetry install
  1. Activate the virtual environment:
poetry shell

Usage

Command Line Interface

Extract warships from a single PDF:

warship-extract extract input.pdf --output-dir results/

Process multiple PDFs:

warship-extract batch pdf_folder/ --output-dir results/

Generate analysis report:

warship-extract report results/ --format html

Get system information:

warship-extract info

Python API

from warship_extractor.pipeline import ExtractionPipeline
from warship_extractor.config import Settings

# Initialize pipeline
settings = Settings()
pipeline = ExtractionPipeline(settings)

# Extract warships from PDF
results = pipeline.extract_from_pdf("janes_1900.pdf")

# Access results
for detection in results['detections']:
    print(f"Found {detection['label']} with confidence {detection['confidence']:.2f}")

Configuration

The system can be configured through environment variables or the settings file:

# Model settings
FLORENCE_MODEL_NAME="microsoft/Florence-2-large"
FLORENCE_DEVICE="cuda"

# Processing settings
PDF_DPI=300
DETECTION_CONFIDENCE_THRESHOLD=0.7

# Output settings
OUTPUT_DIR="./extracted_warships"
SAVE_ANNOTATED_IMAGES=true

Architecture

The system is built with a modular architecture:

  • Model Manager: Florence-2 model loading and caching
  • PDF Processor: High-resolution PDF to image conversion
  • Detection Engine: Coordinated detection with prompt strategies
  • Post-Processing: NMS filtering and image enhancement
  • Pipeline: Main orchestration and workflow management
  • Utilities: Logging, visualization, and reporting tools

Testing

Run the comprehensive test suite:

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=src/warship_extractor

# Run specific test module
poetry run pytest tests/unit/test_detector.py

Performance

Typical performance on Jane's Fighting Ships documents:

  • Processing Speed: 2-5 pages per minute (GPU)
  • Detection Accuracy: 85-95% for clear illustrations
  • Memory Usage: 2-4GB GPU memory for large documents
  • False Positive Rate: <10% with proper filtering

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Microsoft Research for the Florence-2 model
  • Jane's Information Group for historical naval documentation
  • PyTorch and Transformers communities

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors