A production-ready FastAPI service for deploying machine learning models. This project demonstrates best practices for ML model serving with structured logging, batch inference, Prometheus metrics, rate limiting, and comprehensive testing.
Tech Stack: FastAPI • scikit-learn • Docker • Prometheus • Pydantic v2
✨ Single & Batch Predictions — Serve one prediction or hundreds at once
📊 Prometheus Metrics — Monitor API usage and performance
🔐 Rate Limiting — Protect your API (10 req/min per endpoint)
📝 Structured Logging — JSON logs with request ID tracking for debugging
✅ Comprehensive Tests — 20 tests covering all endpoints and edge cases
🚀 CI/CD Pipeline — Automated testing and Docker image publishing
⚡ Load Testing — Locust scaffold for performance validation
- WSL 2 with Python 3.11+ (or any Linux environment)
- Docker Desktop
- Git
# Clone and enter the project
git clone https://github.com/LedioZefi/ml-api.git
cd ml-api
# Install dependencies
make dev-install
# Run tests (should see 20/20 passing)
make test
# Build Docker image
make build
# Start the API
make run
# In another terminal, make a prediction
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"sepal_length":6.1,"sepal_width":2.8,"petal_length":4.7,"petal_width":1.2}'
# Check metrics
curl http://localhost:8000/metrics
# Stop the container
make stop
Quick health check. Tells you if the model is loaded and ready.

curl http://localhost:8000/health

Make a single prediction. Send iris measurements, get back the predicted class and confidence.
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"sepal_length":6.1,"sepal_width":2.8,"petal_length":4.7,"petal_width":1.2}'Response:
{
"predicted_class": "versicolor",
"class_index": 1,
"confidence": 0.92,
"probabilities": {"setosa": 0.01, "versicolor": 0.92, "virginica": 0.07}
}
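The same call from Python, assuming the API is running on localhost:8000 and the requests package is installed (it is not a project dependency):

```python
# Minimal Python client for the /predict endpoint.
import requests

features = {"sepal_length": 6.1, "sepal_width": 2.8, "petal_length": 4.7, "petal_width": 1.2}
resp = requests.post("http://localhost:8000/predict", json=features, timeout=5)
resp.raise_for_status()
result = resp.json()
print(result["predicted_class"], result["confidence"])
```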
Make multiple predictions in one request. Great for batch processing.

curl -X POST http://localhost:8000/predict-batch \
-H "Content-Type: application/json" \
-d '{"items": [
{"sepal_length":5.1,"sepal_width":3.5,"petal_length":1.4,"petal_width":0.2},
{"sepal_length":7.0,"sepal_width":3.2,"petal_length":4.7,"petal_width":1.4}
]}'

Response:
{
"items": [
{"predicted_class": "setosa", "class_index": 0, "confidence": 0.98, ...},
{"predicted_class": "versicolor", "class_index": 1, "confidence": 0.85, ...}
],
"count": 2
}
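A sketch of the same batch call from Python, again assuming requests is installed and the API is up:

```python
# Send many rows in one request instead of one call per row.
import requests

rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4},
]
resp = requests.post("http://localhost:8000/predict-batch", json={"items": rows}, timeout=10)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["predicted_class"], item["confidence"])
```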
Prometheus metrics for monitoring. Scrape this endpoint with Prometheus.

curl http://localhost:8000/metrics
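Prometheus normally scrapes this endpoint directly, but for a quick look you can also parse the exposition text in Python (assumes the requests and prometheus_client packages are installed):

```python
# Fetch and parse the Prometheus exposition format served at /metrics.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        print(sample.name, sample.labels, sample.value)
```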
Project structure:

ml-api/
├── .github/workflows/ci.yml # Automated testing & Docker publishing
├── app/
│ ├── main.py # FastAPI app (logging, metrics, rate limiting)
│ ├── logging_config.py # JSON logging & request ID tracking
│ ├── requirements.txt # Dependencies
│ ├── Dockerfile # Multi-stage build with model training
│ ├── model/ # Trained model (auto-generated)
│ └── schemas/predict_schema.py # Pydantic v2 validation
├── tests/
│ ├── test_health.py # Health endpoint tests
│ ├── test_predict.py # Prediction endpoint tests (20 total)
│ └── conftest.py # Pytest fixtures
├── load_test/
│ ├── locustfile.py # Load testing scenarios
│ └── README.md # Load testing guide
├── Makefile # Quick commands
├── pyproject.toml # Project config & ruff linting
├── predict_demo.py # Demo script
└── README.md # This file
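For orientation, the request schemas in app/schemas/predict_schema.py might look roughly like the sketch below; the field names match the API examples above, but the class names and constraints here are assumptions:

```python
# Hypothetical sketch of the Pydantic v2 request schemas; see
# app/schemas/predict_schema.py for the real definitions.
from pydantic import BaseModel, Field

class IrisFeatures(BaseModel):
    sepal_length: float = Field(gt=0, description="Sepal length in cm")
    sepal_width: float = Field(gt=0, description="Sepal width in cm")
    petal_length: float = Field(gt=0, description="Petal length in cm")
    petal_width: float = Field(gt=0, description="Petal width in cm")

class BatchPredictRequest(BaseModel):
    items: list[IrisFeatures]
```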
Common Makefile commands:

make help # Show all commands
make install # Install dependencies
make dev-install # Install + dev tools (pytest, ruff)
make build # Build Docker image
make run # Run container on port 8000
make stop # Stop and remove container
make lint # Run ruff linter
make fmt # Format code with ruff
make test # Run pytest tests
make clean # Remove cache files

Run the tests locally:

source .venv/bin/activate
pytest tests/ -v
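The tests use pytest with FastAPI's TestClient; as a flavour of what they cover, a health-check test might look like the sketch below (the import path is an assumption; the real tests live in tests/):

```python
# Illustrative test sketch; the actual tests live in tests/test_health.py
# and tests/test_predict.py. Assumes the FastAPI app object is app.main:app.
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)

def test_health_returns_ok():
    resp = client.get("/health")
    assert resp.status_code == 200
```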
Lint and format the code:

ruff check app/ tests/ predict_demo.py
ruff format app/ tests/ predict_demo.py

Build and run the Docker image directly:

docker build -t ml-api:latest ./app -f app/Dockerfile
docker run -d -p 8000:8000 --name ml-api ml-api:latest
docker logs -f ml-api

Tag a release and push to automatically build and publish:
git tag v1.0.0
git push origin v1.0.0

GitHub Actions will automatically:
- Run tests
- Build Docker image
- Push to ghcr.io/LedioZefi/ml-api:v1.0.0 and the latest tag
Pull the image:
docker pull ghcr.io/LedioZefi/ml-api:v1.0.0

Use Locust to simulate realistic load:
pip install locust
locust -f load_test/locustfile.py --host=http://localhost:8000
# Open http://localhost:8089 in your browser
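The repository ships its scenarios in load_test/locustfile.py; as a minimal illustration, a Locust user class for this API could look like:

```python
# Illustrative only; the real load-test scenarios live in load_test/locustfile.py.
from locust import HttpUser, between, task

class PredictUser(HttpUser):
    wait_time = between(0.5, 2)  # pause between simulated user actions, in seconds

    @task
    def predict(self):
        self.client.post(
            "/predict",
            json={"sepal_length": 6.1, "sepal_width": 2.8,
                  "petal_length": 4.7, "petal_width": 1.2},
        )
```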
Rate limiting protects the prediction endpoints (see the sketch after this list):

- Limit: 10 requests per minute per client IP
- Applies to: /predict and /predict-batch
- Exceeding: Returns 429 Too Many Requests
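For illustration, here is one common way to enforce a 10 requests/minute limit in FastAPI using the slowapi library; the project's actual wiring lives in app/main.py and may differ:

```python
# Hypothetical sketch using slowapi; the real configuration is in app/main.py.
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # keyed on the client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("10/minute")  # going over this returns 429 Too Many Requests
async def predict(request: Request):
    ...
```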
All requests get a unique ID for tracing:
curl -X POST http://localhost:8000/predict \
-H "X-Request-ID: my-custom-id" \
-H "Content-Type: application/json" \
-d '{"sepal_length":6.1,"sepal_width":2.8,"petal_length":4.7,"petal_width":1.2}'Response includes the ID:
X-Request-ID: my-custom-id
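The same round trip can be checked end to end from Python (requests assumed installed):

```python
# Send a custom request ID and confirm the API echoes it back for tracing.
import uuid

import requests

request_id = str(uuid.uuid4())
resp = requests.post(
    "http://localhost:8000/predict",
    headers={"X-Request-ID": request_id},
    json={"sepal_length": 6.1, "sepal_width": 2.8, "petal_length": 4.7, "petal_width": 1.2},
)
assert resp.headers["X-Request-ID"] == request_id
```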
Logs are JSON-formatted for easy parsing:
{"timestamp": "2024-01-15 10:30:45,123", "level": "INFO", "logger": "app.request", "message": "POST /predict 200"}Found a bug or have an idea? Feel free to open an issue or submit a PR!
Found a bug or have an idea? Feel free to open an issue or submit a PR!

MIT License — see LICENSE file for details
Check out the PRODUCTION_FEATURES_SUMMARY.md for detailed implementation notes.