A FastAPI-based service for processing medical documents and extracting patient information.
- Document OCR text reconstruction from bounding boxes
- Patient name extraction from medical documents
- Configurable settings via environment variables
- Custom exception handling
- Comprehensive test coverage (94%)
You can set up the project in two ways: using uv (recommended for local development) or using Docker.
# Clone the repository
git clone https://github.com/LavalAlexandre/docAPI.git
cd docAPI
- Python 3.13+
- uv package manager
# Install dependencies
uv sync
# Optional: Create .env file for custom configuration
cp .env.example .env
# Optional: Install pre-commit hooks
uv run pre-commit install
# Development mode (with hot reload)
uv run fastapi dev
# Production mode
uv run fastapi run src/main.py --host 0.0.0.0 --port 8000
The API will be available at http://localhost:8000
- Interactive API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc
- Docker installed on your system
# Build the image
docker build -t docapi-api .
# Development mode (with hot reload)
docker run -p 8000:8000 \
-v $(pwd)/src:/app/src \
docapi-api \
fastapi dev src/main.py --host 0.0.0.0 --port 8000
# Production mode
docker run -p 8000:8000 docapi-api
# With custom environment variables
docker run -p 8000:8000 \
-e DOCAPI_ENABLE_FEMININE_TITLES=true \
docapi-api
The API will be available at http://localhost:8000
- Interactive API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc
The application can be configured via environment variables. All variables are prefixed with DOCAPI_.
See .env.example for available configuration options:
- DOCAPI_OCR_Y_THRESHOLD: Y-axis threshold for word grouping (default: 0.01)
- DOCAPI_ENABLE_FEMININE_TITLES: Support feminine medical titles (default: false)
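As an illustration of how the two variables above could be read, here is a minimal standard-library sketch. It is not the project's configuration code (that lives in src/config.py); the `Settings` class, the `load_settings` helper, and the set of accepted boolean spellings are assumptions made for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the two documented options."""
    ocr_y_threshold: float = 0.01
    enable_feminine_titles: bool = False


def load_settings(env=os.environ, prefix="DOCAPI_"):
    """Read DOCAPI_-prefixed variables, falling back to the documented defaults."""
    raw_bool = env.get(prefix + "ENABLE_FEMININE_TITLES", "false")
    return Settings(
        ocr_y_threshold=float(env.get(prefix + "OCR_Y_THRESHOLD", "0.01")),
        # Assumption: these spellings count as "true"; anything else is false.
        enable_feminine_titles=raw_bool.strip().lower() in {"1", "true", "yes", "on"},
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.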
# Run all tests with uv
uv run pytest
# With coverage report
uv run pytest --cov=src --cov-report=term
# With Docker
docker run --rm docapi-api pytest -v --cov=src --cov-report=term
# Run linter
uv run ruff check .
# Fix linting issues automatically
uv run ruff check --fix .
# Format code
uv run ruff format .
# Type checking
uv run mypy src
# Run all checks with validation script
./validate.sh
# Install pre-commit hooks
uv run pre-commit install
# Run manually on all files
uv run pre-commit run --all-files
src/
├── __init__.py
├── main.py               # FastAPI application entry point
├── config.py             # Application settings and configuration
├── exceptions.py         # Custom exception classes
├── data/                 # Data access layer
│   ├── documents.py      # Fake document database
├── models/               # Pydantic models
│   ├── document.py       # Document, Page, Word, BoundingBox models
├── operations/           # Business logic layer
│   ├── documents.py      # Document processing operations
└── routers/              # API routes
    ├── documents.py      # Document endpoints
tests/                    # Test suite
├── test_config.py
├── test_operations.py
└── test_routers.py
Health check endpoint.
List all documents.
Get a specific document by ID.
Get the extracted patient name from a document.
The core feature of this API is extracting patient names from medical documents. The extraction uses a heuristic-based approach that identifies capitalized words while filtering out medical titles, honorifics, and sentence-starting words.
First, the document's OCR words (with bounding boxes) are reconstructed into an ordered list:
- Words are grouped into lines based on vertical position (y-coordinate)
- Within each line, words are sorted left-to-right (x-coordinate)
- This produces a reading-order list of words
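The grouping and sorting steps above can be sketched as follows. This is a simplified illustration, not the project's implementation: the `Word` dataclass with a single `x` and `y` per word, the `reading_order` name, and the choice to compare each word against the first word of the current line are all assumptions. The default threshold matches the documented default of DOCAPI_OCR_Y_THRESHOLD.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical OCR word; x and y are the box's top-left corner (assumed normalized)."""
    text: str
    x: float
    y: float


def reading_order(words: List[Word], y_threshold: float = 0.01) -> List[str]:
    """Group words into lines by y, then sort each line left-to-right by x."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Same line if the vertical distance to the line's first word is
        # within the threshold; otherwise start a new line.
        if lines and abs(word.y - lines[-1][0].y) <= y_threshold:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [w.text for line in lines for w in sorted(line, key=lambda w: w.x)]
```

A larger threshold tolerates skewed scans at the cost of merging tightly spaced lines, which is presumably why the value is configurable.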
The algorithm scans through the ordered words and applies the following filtering rules.
A name candidate must:
- Start with an uppercase letter (e.g., "Jean", "DUPONT")
- Be 1-2 words long (e.g., "Jean DUPONT" or "Martin")
- Not be the first word of the document
A name candidate must not:
- Follow sentence-ending punctuation (".", "!", or "?")
  - Example: "Consultation terminée. Jean" → "Jean" is rejected (starts a new sentence)
- Follow a medical title (doctor, professor, specialist, etc.)
  - Example: "Docteur Nicolas JACQUES" → "Nicolas JACQUES" is rejected
  - Supports 35+ French medical titles (e.g., docteur, chirurgien, cardiologue)
  - Optional feminine title support (e.g., doctoresse, chirurgienne, réanimatrice)
  - When a title is detected, the algorithm skips the word after the rejected one as well, so both "Nicolas" and "JACQUES" are skipped
- Be a medical title itself
  - Example: "Chirurgien Martin" → "Chirurgien" is rejected
- Be an honorific (Monsieur, Madame, etc.)
  - Example: "Madame Clara Martin" → "Madame" is skipped, but "Clara Martin" is valid
- If a valid name candidate is found, the algorithm checks if the next word is also capitalized
  - If yes, it returns both words as a full name: "Jean DUPONT"
  - If no, it returns just the single word: "Martin"
Input document text:
"J'ai bien revu en consultation Monsieur Jean DUPONT pour une douleur à la hanche droite. Docteur Nicolas JACQUES"
Processing:
| Word | Rule Check | Result |
|---|---|---|
| J'ai | First word | ❌ Skip |
| bien | Lowercase | ❌ Skip |
| revu | Lowercase | ❌ Skip |
| en | Lowercase | ❌ Skip |
| consultation | Lowercase | ❌ Skip |
| Monsieur | Honorific | ❌ Skip |
| Jean | ✅ Capitalized, not after title/punctuation | Check next word... |
| DUPONT | ✅ Also capitalized | ✅ Return "Jean DUPONT" |
Output: "Jean DUPONT"
Why "Nicolas JACQUES" is NOT extracted:
- "Nicolas" follows "Docteur" (a medical title)
- The algorithm skips both "Nicolas" and "JACQUES"
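The filtering rules and the worked example can be condensed into a short sketch. It is a simplified stand-in for the project's extract_patient_name_from_words(), not a copy of it: the `extract_name` function, the tiny `TITLES` and `HONORIFICS` sets, and the choice to skip exactly two words after a title are assumptions made for illustration.

```python
from typing import List, Optional

# Illustrative subsets only; the real title list is described as 35+ entries.
TITLES = {"docteur", "professeur", "chirurgien", "cardiologue"}
HONORIFICS = {"monsieur", "madame"}
SENTENCE_END = (".", "!", "?")


def extract_name(words: List[str]) -> Optional[str]:
    """Return the first 1-2 word name candidate, or None if nothing qualifies."""
    skip = 0  # number of upcoming words to ignore after a medical title
    for i, word in enumerate(words):
        bare = word.strip(".,;:!?").lower()
        if bare in TITLES:
            skip = 2  # assumption: ignore the two words following a title
            continue
        if skip:
            skip -= 1
            continue
        if i == 0 or not word[:1].isupper():
            continue  # first word of the document, or not capitalized
        if bare in HONORIFICS:
            continue  # "Monsieur"/"Madame" is skipped, the next word may qualify
        if words[i - 1].endswith(SENTENCE_END):
            continue  # capitalized only because it starts a sentence
        nxt = words[i + 1] if i + 1 < len(words) else ""
        return f"{word} {nxt}" if nxt[:1].isupper() else word
    return None
```

Applied to the whitespace-split example sentence above, the sketch stops at "Jean", sees that "DUPONT" is capitalized, and returns both words before it ever reaches "Docteur".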
The extraction behavior can be customized via environment variables:
# Enable feminine medical titles (doctoresse, chirurgienne, etc.)
DOCAPI_ENABLE_FEMININE_TITLES=true
Known limitations:
- Language-specific: Designed for French medical documents
- Case-sensitive: Requires proper capitalization of names
- Simple heuristic: May fail with:
  - All-caps documents
  - Unusual name formats (e.g., "Marie-Claire", hyphenated names)
  - Names at the start of sentences after punctuation
  - Documents with multiple patient names (returns first match only)
Future improvements:
- Support for hyphenated names
- Context-aware extraction (patient vs. doctor identification)
- Additional API routes
- Implement a database
- Algorithm implementation: src/operations/documents.py::extract_patient_name_from_words()
- Configuration: src/config.py::PatientNameExtractionConfig
- Tests: tests/test_operations.py
The project follows a clean architecture pattern:
- Routers (routers/): Handle HTTP requests/responses, validate input, convert exceptions to HTTP errors
- Operations (operations/): Contain business logic, use domain exceptions
- Models (models/): Define data structures using Pydantic
- Data (data/): Data access layer (currently in-memory, can be replaced with a database)
- Config (config.py): Centralized configuration management
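The separation between layers can be illustrated without any web framework. The sketch below is a framework-free approximation of the idea: the names `DocumentNotFoundError`, `FAKE_DB`, `get_document_text`, and `get_document_route` are hypothetical, and in the actual service the router step would be a FastAPI endpoint rather than a function returning a status code and body.

```python
from typing import Dict, Tuple


class DocumentNotFoundError(Exception):
    """Domain exception raised by the operations layer (hypothetical name)."""


# Data layer: an in-memory stand-in that a real database could replace.
FAKE_DB: Dict[int, str] = {1: "Monsieur Jean DUPONT"}


def get_document_text(doc_id: int) -> str:
    """Operations layer: business logic that knows nothing about HTTP."""
    try:
        return FAKE_DB[doc_id]
    except KeyError:
        raise DocumentNotFoundError(f"Document {doc_id} not found") from None


def get_document_route(doc_id: int) -> Tuple[int, dict]:
    """Router layer: translates domain outcomes into HTTP-style responses."""
    try:
        return 200, {"id": doc_id, "text": get_document_text(doc_id)}
    except DocumentNotFoundError as exc:
        return 404, {"detail": str(exc)}
```

Because only the router layer mentions status codes, the operations layer can be tested directly and the data layer swapped out without touching HTTP handling.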
- Add logging throughout the application
- Implement a real database layer
- Add authentication/authorization
- Add metrics and monitoring
- Add more exception handling and implement additional custom exceptions as needed