A production-ready, professionally structured PyTorch project template with comprehensive utilities, logging, checkpointing, and best practices baked in.
- Professional Architecture: Modular design with clear separation of concerns
- Configuration Management: Centralized dataclass-based configuration
- Advanced Logging: Structured logging with TensorBoard integration
- Checkpoint Management: Automated model versioning and best model tracking
- Metrics Tracking: Built-in accuracy, loss, and custom metrics
- Type Hints: Fully type-hinted codebase for better IDE support
- Comprehensive Documentation: Docstrings throughout, zero code comments
- Unit Tests: Test suite for all major components
- Docker Support: Both GPU (CUDA) and CPU containers
- Examples Included: Custom datasets, transfer learning, and more
This project has been significantly enhanced with professional features:
- Configuration System: Centralized dataclass-based config management
- Advanced Utilities: Checkpoint management, metrics tracking, visualization
- Enhanced Training: Early stopping, LR scheduling, comprehensive logging
- Code Quality: Zero comments, full docstrings, complete type hints
- Testing: Unit tests for all major components
- Documentation: Architecture guide, usage examples, API docs
- Examples: Custom datasets, transfer learning demonstrations
- Docker installed
- Docker Compose installed
- For GPU support: NVIDIA Docker Runtime
- Python 3.8+ (for local development)
.\docker-commands.ps1 build
.\docker-commands.ps1 run
.\docker-commands.ps1 shell

Once in the container:
cd /workspace/src
python train.py

With Docker Compose:
docker-compose up -d pytorch-gpu
docker exec -it pytorch-gpu bash

For local development:
pip install -e .
cd src
python train.py

Pytorch/
├── src/                         # Main source code
│   ├── config.py                # Configuration management
│   ├── logger.py                # Logging utilities
│   ├── train.py                 # Training script with early stopping
│   ├── inference.py             # Inference script
│   ├── models/                  # Model definitions
│   │   ├── simple_nn.py         # Simple neural network
│   │   └── __init__.py
│   └── utils/                   # Utility modules
│       ├── checkpoint.py        # Checkpoint management
│       ├── metrics.py           # Metrics calculation
│       ├── model.py             # Model utilities
│       ├── data.py              # Data utilities
│       ├── visualization.py     # Visualization tools
│       └── __init__.py
├── examples/                    # Usage examples
│   ├── custom_dataset.py        # Custom dataset integration
│   └── transfer_learning.py     # Transfer learning example
├── tests/                       # Unit tests
│   ├── test_config.py
│   ├── test_utils.py
│   └── test_models.py
├── docs/                        # Documentation
│   ├── ARCHITECTURE.md          # System architecture
│   └── USAGE.md                 # Detailed usage guide
├── data/                        # Dataset directory
├── models/                      # Saved models
├── outputs/                     # Training outputs
│   ├── logs/                    # TensorBoard logs
│   └── checkpoints/             # Model checkpoints
├── notebooks/                   # Jupyter notebooks
├── pyproject.toml               # Package configuration, deps, tool config
├── .flake8                      # flake8 config (not read from pyproject)
├── requirements.txt             # Docker dependencies (no torch)
├── docker-compose.yml           # Docker Compose config
├── .github/workflows/ci.yml     # Lint, type-check, and test on CI
└── README.md                    # This file

from config import get_config
from models import SimpleModel
from logger import setup_logger
import torch.nn as nn
config = get_config()
logger = setup_logger(log_dir=config.paths.logs_dir)
model = SimpleModel(
    input_size=config.model.input_size,
    hidden_size=config.model.hidden_size,
    num_classes=config.model.num_classes
).to(config.device.device)

Run training:
cd src
python train.py

Run inference:
cd src
python inference.py

from config import Config, ModelConfig, TrainingConfig
config = Config(
    model=ModelConfig(hidden_size=256, num_classes=20),
    training=TrainingConfig(batch_size=128, num_epochs=50)
)

Efficiency features are opt-in and safe to enable on CPU. The exceptions are TF32 and cudnn.benchmark, which turn on automatically on CUDA:
from config import Config, TrainingConfig
config = Config(
    training=TrainingConfig(
        use_amp=True,        # mixed precision (CUDA only)
        compile_model=True,  # torch.compile, falls back to eager on failure
        gradient_clip=1.0,   # max gradient norm (applied during training)
        drop_last=True,      # drop the last partial training batch
    ),
    deterministic=True,      # reproducible runs (disables TF32/benchmark)
)

Centralized configuration using dataclasses:
from config import get_config
config = get_config()
config.training.batch_size = 128
config.model.hidden_size = 256

Automatic model versioning:
from utils import CheckpointManager
checkpoint_manager = CheckpointManager(config.paths.checkpoints_dir)
checkpoint_manager.save(model, optimizer, epoch, metrics, is_best=True)

Built-in metrics calculation:
from utils import AverageMeter, calculate_accuracy
loss_meter = AverageMeter('Loss')
acc = calculate_accuracy(outputs, targets)

Structured logging with TensorBoard:
from logger import setup_logger, MetricsLogger
logger = setup_logger(log_dir=config.paths.logs_dir)
metrics_logger = MetricsLogger(logger)
metrics_logger.log_epoch(epoch, metrics)

- Architecture Guide (docs/ARCHITECTURE.md): System design and module descriptions
- Usage Guide (docs/USAGE.md): Detailed usage examples and best practices

Tests insert ../src onto sys.path, so they run from any directory:
pytest -v # run the suite
pytest --cov=src --cov-report=term-missing # with coverage

Lint, format, and type-check (matches CI):
black src tests
isort src tests
flake8 src tests
mypy src

python examples/custom_dataset.py

python examples/transfer_learning.py

.\docker-commands.ps1 help # Show all commands
.\docker-commands.ps1 build # Build GPU image
.\docker-commands.ps1 build-cpu # Build CPU image
.\docker-commands.ps1 run # Run GPU container
.\docker-commands.ps1 run-cpu # Run CPU container
.\docker-commands.ps1 shell # Open bash shell
.\docker-commands.ps1 jupyter # Start Jupyter notebook
.\docker-commands.ps1 tensorboard # Start TensorBoard
.\docker-commands.ps1 stop # Stop containers
.\docker-commands.ps1 clean # Remove containers/images

make help # Show all commands
make build # Build GPU image
make run # Run GPU container
make shell # Open bash shell
make jupyter # Start Jupyter
make tensorboard # Start TensorBoard
make stop # Stop containers
make clean # Cleanup

Start TensorBoard to visualize training:
.\docker-commands.ps1 tensorboard

Then open: http://localhost:6006

Early stopping: training terminates automatically when validation performance plateaus.
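
The plateau check described above can be sketched as a small patience counter. This is a minimal illustration, not the code in train.py: the class name `EarlyStopping`, the parameters `patience` and `min_delta`, and the helper `epochs_run` are all assumptions made for this example.

```python
class EarlyStopping:
    """Hypothetical helper: signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss and return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember the new best value and reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            # No improvement this epoch.
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_run(val_losses, patience=2):
    """Return how many epochs would run before early stopping triggers."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)
```

In a real training loop, `step` would be called once per epoch with the validation loss, and the loop would break when it returns True.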
Learning rate scheduling: the learning rate is adjusted dynamically based on validation metrics.
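
One common way to adjust the learning rate from validation metrics is to reduce it when the validation loss stops improving. PyTorch ships this as `torch.optim.lr_scheduler.ReduceLROnPlateau`; whether train.py uses that scheduler is an assumption. The sketch below reproduces the idea in plain Python so it runs without PyTorch. The class name `PlateauScheduler` and its parameters are hypothetical.

```python
class PlateauScheduler:
    """Hypothetical sketch: shrink the learning rate when validation loss plateaus."""

    def __init__(self, initial_lr: float, factor: float = 0.5,
                 patience: int = 2, min_lr: float = 1e-6) -> None:
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Update internal state with this epoch's loss and return the current rate."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            # Reduce only after more than `patience` epochs without improvement.
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

With `patience=1` and `factor=0.5`, a loss that stays flat for two epochs after its best value halves the rate once.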
Model utilities:
- Parameter counting
- Weight initialization strategies
- Model freezing/unfreezing for transfer learning
- Layer-wise learning rate decay
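
As an illustration of the last item, layer-wise learning rate decay gives the layer closest to the output the full learning rate and scales each earlier layer by an extra decay factor. The function name `layerwise_learning_rates` and its signature are assumptions for this sketch; the helper in utils/model.py may be shaped differently.

```python
from typing import Dict, List


def layerwise_learning_rates(layer_names: List[str], base_lr: float,
                             decay: float) -> Dict[str, float]:
    """Map each layer name to a learning rate, ordered from input to output.

    The last layer receives base_lr; each earlier layer is multiplied by
    one additional factor of `decay`.
    """
    num_layers = len(layer_names)
    return {
        name: base_lr * decay ** (num_layers - 1 - index)
        for index, name in enumerate(layer_names)
    }


# Hypothetical layer names, listed from input to output.
rates = layerwise_learning_rates(["embedding", "hidden", "classifier"],
                                 base_lr=1.0, decay=0.5)
```

With PyTorch, such a mapping would typically be turned into optimizer parameter groups, one dict per layer with its own `params` and `lr` entries.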
Data utilities:
- Custom dataset classes
- Train/val split utilities
- Data normalization helpers
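
A train/val split like the one listed above can be sketched with a seeded shuffle. The function name `train_val_split` and its parameters are hypothetical and shown in plain Python; the project's own helper in utils/data.py may differ, and PyTorch offers `torch.utils.data.random_split` for the same purpose.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_val_split(items: Sequence[T], val_fraction: float = 0.2,
                    seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle indices with a fixed seed and split them into train and val parts."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    indices = list(range(len(items)))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    val_size = int(len(items) * val_fraction)
    val_indices = indices[:val_size]
    train_indices = indices[val_size:]
    return [items[i] for i in train_indices], [items[i] for i in val_indices]
```

Calling the function twice with the same seed returns the same split, which keeps validation results comparable across runs.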
Visualization:
- Training curve plotting
- Confusion matrix visualization
- Learning rate schedule plots
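
Behind a confusion matrix plot sits a simple table of counts. The sketch below computes only that table and leaves plotting out; the function name `confusion_matrix` is an assumption and is not taken from utils/visualization.py.

```python
from typing import List, Sequence


def confusion_matrix(targets: Sequence[int], predictions: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count samples per (true class, predicted class) pair.

    Rows are true classes and columns are predicted classes.
    """
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for target, predicted in zip(targets, predictions):
        matrix[target][predicted] += 1
    return matrix


matrix = confusion_matrix(targets=[0, 0, 1, 1, 2],
                          predictions=[0, 1, 1, 1, 0],
                          num_classes=3)
# matrix == [[1, 1, 0], [0, 2, 0], [1, 0, 0]]
```

Correct predictions fall on the diagonal, so off-diagonal entries show which classes are confused with each other.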
- Zero Comments: Self-documenting code with clear naming
- Type Hints: Full type annotations for IDE support
- Docstrings: Google-style docstrings for all functions
- PEP 8: Follows Python style guidelines
- Modular: Clear separation of concerns
- Tested: Unit tests for critical components
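
To make these conventions concrete, here is what a function in this style might look like: full type hints, a Google-style docstring, and no inline comments. The function `accuracy_from_labels` is a hypothetical example written for this README and is not the project's `calculate_accuracy`.

```python
from typing import Sequence


def accuracy_from_labels(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """Compute the fraction of predictions that match their targets.

    Args:
        predictions: Predicted class indices, one per sample.
        targets: Ground-truth class indices, same length as predictions.

    Returns:
        The accuracy as a float between 0.0 and 1.0.

    Raises:
        ValueError: If the inputs are empty or differ in length.
    """
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    matches = sum(1 for predicted, target in zip(predictions, targets) if predicted == target)
    return matches / len(targets)
```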
Edit requirements.txt and rebuild:
.\docker-commands.ps1 build

Copy and edit .env:
Copy-Item .env.example .env

Modify docker-compose.yml:
environment:
  - CUDA_VISIBLE_DEVICES=0,1 # Use specific GPUs

- Ensure NVIDIA Docker runtime is installed
- Check: nvidia-smi
- Verify: docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Change port mapping in docker-compose.yml:
ports:
  - "8889:8888" # Use different port

To add a new model, create a new file in src/models/ and add it to src/models/__init__.py. See simple_nn.py for reference.

Check out examples/custom_dataset.py for a complete example of integrating custom datasets.
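
For orientation, a map-style dataset only needs `__len__` and `__getitem__`. The sketch below shows that shape in plain Python so it runs without PyTorch. The class `InMemoryDataset` is hypothetical and is not the code in examples/custom_dataset.py; a real implementation would normally subclass `torch.utils.data.Dataset` and return tensors.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class InMemoryDataset:
    """Hypothetical map-style dataset holding feature rows and integer labels."""

    def __init__(self, features: Sequence[Sequence[float]], labels: Sequence[int],
                 transform: Optional[Callable[[List[float]], List[float]]] = None) -> None:
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = [list(row) for row in features]
        self.labels = list(labels)
        self.transform = transform

    def __len__(self) -> int:
        # Number of samples; a DataLoader uses this to build batches.
        return len(self.labels)

    def __getitem__(self, index: int) -> Tuple[List[float], int]:
        # Return one (sample, label) pair, applying the optional transform.
        sample = self.features[index]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, self.labels[index]


dataset = InMemoryDataset(
    features=[[1.0, 2.0], [3.0, 4.0]],
    labels=[0, 1],
    transform=lambda row: [value * 2 for value in row],
)
```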
Models are saved in:
- outputs/checkpoints/: All checkpoints
- outputs/checkpoints/best_model.pth: Best performing model

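
The layout above can be produced by writing one file per epoch and copying it to best_model.pth when the caller marks it as best. This is a rough sketch under stated assumptions: the function `save_checkpoint` and the per-epoch file name `checkpoint_epoch_N.pth` are hypothetical, and `pickle` stands in for `torch.save` so the example runs without PyTorch.

```python
import pickle
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_checkpoint(state: Dict[str, Any], checkpoint_dir: str, epoch: int,
                    is_best: bool = False) -> Path:
    """Write a per-epoch checkpoint and optionally refresh best_model.pth."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme for per-epoch files.
    path = directory / f"checkpoint_epoch_{epoch}.pth"
    with open(path, "wb") as handle:
        # pickle is a stand-in for torch.save in this sketch.
        pickle.dump(state, handle)
    if is_best:
        # Keep a stable file name that always points at the best model so far.
        shutil.copyfile(path, directory / "best_model.pth")
    return path


# Usage with a temporary directory instead of outputs/checkpoints/.
demo_dir = tempfile.mkdtemp()
save_checkpoint({"epoch": 1, "val_loss": 0.9}, demo_dir, epoch=1, is_best=True)
save_checkpoint({"epoch": 2, "val_loss": 1.1}, demo_dir, epoch=2, is_best=False)
```

After these two calls, best_model.pth still holds the epoch 1 state, because the second checkpoint was not marked as best.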
Use the checkpoint utilities to load a previous checkpoint:
from utils.checkpoint import load_checkpoint
checkpoint = load_checkpoint('outputs/checkpoints/best_model.pth', model, optimizer)
start_epoch = checkpoint['epoch'] + 1

Yes! Install with pip install -e . and run scripts directly.
Use TensorBoard: .\docker-commands.ps1 tensorboard then open http://localhost:6006
See LICENSE file for details.
Built with best practices for production ML projects