A production-grade machine learning pipeline that predicts truck shipment delays, built for deployment on Lightning.ai with a Flask REST API.
```
MySQL DB ──┐
           ├─→ ETL Pipeline ─→ Feature Engineering ─→ Model Training ─→ Flask API
Postgres ──┘                                         (RF / XGB / LGBM)
```
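The flow in the diagram can be pictured as a chain of stage functions. The following is a minimal sketch rather than the project's actual code: `resolve_mock`, `extract`, `transform`, `run` and the derived feature `km_per_stop` are hypothetical names, and only the mock-data branch is shown. It assumes mock mode is requested either with the `--mock` flag or with `MOCK_DATA=true`.

```python
import argparse
import random


def resolve_mock(argv, env):
    """Mock mode is on if --mock is passed or MOCK_DATA=true is set (assumed rule)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mock", action="store_true")
    args = parser.parse_args(argv)
    return args.mock or env.get("MOCK_DATA", "").lower() == "true"


def extract(mock, n_rows=5, seed=0):
    """Return raw shipment rows; only the synthetic branch is sketched here."""
    if not mock:
        raise NotImplementedError("real MySQL/PostgreSQL extraction is not part of this sketch")
    rng = random.Random(seed)  # seeded so repeated runs produce the same rows
    return [
        {"distance_km": rng.randint(50, 1500), "num_stops": rng.randint(0, 6)}
        for _ in range(n_rows)
    ]


def transform(rows):
    """Feature engineering step: add a derived column without mutating the input."""
    return [{**r, "km_per_stop": r["distance_km"] / (r["num_stops"] + 1)} for r in rows]


def run(argv, env):
    """Chain the stages: resolve the mode, extract, then transform."""
    use_mock = resolve_mock(argv, env)
    return transform(extract(use_mock))
```

Calling `run(["--mock"], {})` returns five synthetic feature rows. In a real entry point the arguments would likely be `sys.argv[1:]` and `os.environ`, and model training would follow the transform step.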
```
truck_delay_ml/
├── config.yaml              # Central config (no hardcoded values)
├── run_pipeline.py          # One command: ETL + Training
├── requirements.txt
├── .env.example             # Secret template (never commit .env!)
│
├── ml_pipeline/
│   ├── etl/
│   │   ├── db_connector.py  # MySQL + PostgreSQL connections + mock data
│   │   ├── extractor.py     # Extract & merge from both DBs
│   │   ├── transformer.py   # Feature engineering & cleaning
│   │   └── loader.py        # Save/load parquet files
│   ├── modeling/
│   │   └── trainer.py       # Multi-model training + MLflow tracking
│   └── utils/
│       ├── config_loader.py # YAML + env var loader
│       └── logger.py        # Rotating file + console logger
│
├── deployment/
│   └── flask_app.py         # REST API with /predict and /predict/batch
│
└── tests/
    └── test_pipeline.py     # pytest unit tests
```
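The tree describes `logger.py` as a rotating file plus console logger. A minimal sketch of that idea with the standard library could look like the following; the function name `get_logger`, the default file name, the size limit and the format string are all assumptions, not values taken from the project.

```python
import logging
import sys
from logging.handlers import RotatingFileHandler


def get_logger(name, log_file="pipeline.log", max_bytes=1_000_000, backup_count=3):
    """Return a logger that writes to the console and to a size-rotated file."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured: avoid duplicate log lines
        return logger

    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")

    console = logging.StreamHandler(sys.stdout)
    console.setFormatter(formatter)
    logger.addHandler(console)

    # delay=True postpones creating the file until the first record is written
    rotating = RotatingFileHandler(
        log_file, maxBytes=max_bytes, backupCount=backup_count, delay=True
    )
    rotating.setFormatter(formatter)
    logger.addHandler(rotating)

    logger.propagate = False  # keep records from also reaching the root logger
    return logger
```

Once the file grows past `max_bytes`, it is renamed to `pipeline.log.1` and a fresh file is started, keeping at most `backup_count` old files.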
```bash
git clone https://github.com/YOUR_USERNAME/truck_delay_ml.git
cd truck_delay_ml
pip install -r requirements.txt
```

```bash
cp .env.example .env
# Edit .env with your DB credentials
# Or set MOCK_DATA=true to skip DB and use synthetic data
```

```bash
# With mock data (no database needed):
python run_pipeline.py --mock
# With real databases:
python run_pipeline.py
```

```bash
python deployment/flask_app.py
```

```bash
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "distance_km": 850,
    "truck_type": "Large",
    "truck_age_years": 9,
    "driver_experience": 2,
    "cargo_weight_kg": 15000,
    "weather_condition": "Rain",
    "route_type": "Rural",
    "traffic_index": 0.85,
    "road_quality": "Poor",
    "num_stops": 4
  }'
```

Expected response:
```json
{
  "prediction": 1,
  "label": "Delayed",
  "probability": 0.8231,
  "confidence": "82.3%",
  "risk_level": "High"
}
```

```bash
pytest tests/ -v
```

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check |
| POST | `/predict` | Single prediction |
| POST | `/predict/batch` | Batch predictions (max 1000) |
| GET | `/model/info` | Feature list & model type |
| POST | `/reload` | Hot-reload model after retraining |
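Because `/predict/batch` accepts at most 1000 records, a client with a larger dataset has to split it first. The sketch below shows one way to do that with the standard library. The helper names are hypothetical, and the `{"records": [...]}` body shape is an assumption that should be checked against `flask_app.py`.

```python
import json
import urllib.request

MAX_BATCH = 1000  # batch limit stated in the endpoint table


def chunk(records, size=MAX_BATCH):
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def build_batch_request(records, base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for /predict/batch.

    The {"records": [...]} body shape is an assumption, not the confirmed API.
    """
    body = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict/batch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_all(records, base_url="http://localhost:5000"):
    """Return one request per chunk, each within the batch limit."""
    return [build_batch_request(part, base_url) for part in chunk(records)]
```

With the API running, each request can be sent with `urllib.request.urlopen(request)` and the JSON body of the response decoded with `json.load`.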
| Model | CV F1 | Notes |
|---|---|---|
| Random Forest | ~0.84 | Robust, good baseline |
| XGBoost | ~0.86 | Fast, handles missing values |
| LightGBM | ~0.87 | Best — used in production |
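The "CV F1" column is the F1 score averaged over cross-validation folds. The project presumably computes it with scikit-learn; the pure-Python sketch below only illustrates the metric and the selection step. It assumes class `1` ("Delayed") is the positive class, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:  # no true positives: precision and recall are both 0 (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_cv_f1(folds):
    """Average F1 over (y_true, y_pred) pairs, one pair per validation fold."""
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in folds]
    return sum(scores) / len(scores)


def pick_best(cv_scores):
    """Return the model name with the highest mean CV F1."""
    return max(cv_scores, key=cv_scores.get)


# Approximate scores copied from the table above
table_scores = {"Random Forest": 0.84, "XGBoost": 0.86, "LightGBM": 0.87}
best_model = pick_best(table_scores)  # "LightGBM"
```

F1 is a reasonable selection metric when delayed shipments are the minority class, since it ignores true negatives and so does not reward a model for simply predicting "on time" everywhere.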
- No hardcoded values: everything lives in `config.yaml`
- MLflow experiment tracking: compare all runs visually
- Mock data mode: test the full pipeline without any database
- Production Flask API: `/predict` and `/predict/batch` endpoints
- Automatic logging: predictions logged to CSV for monitoring
- Unit tests: `pytest` coverage for all pipeline stages
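As an illustration of the automatic logging feature, the sketch below builds a response in the shape of the example shown earlier and appends it to a CSV file. Several details are assumptions rather than the project's actual behaviour: the 0.5 decision threshold, the 0.4 and 0.7 risk cut-offs, the "On Time" label, the file name, and the choice to report the delay probability itself as the confidence.

```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ["timestamp", "prediction", "label", "probability", "confidence", "risk_level"]


def build_response(probability, threshold=0.5):
    """Turn a delay probability into a response dict (thresholds are assumed)."""
    prediction = int(probability >= threshold)
    if probability >= 0.7:
        risk = "High"
    elif probability >= 0.4:
        risk = "Medium"
    else:
        risk = "Low"
    return {
        "prediction": prediction,
        "label": "Delayed" if prediction else "On Time",  # "On Time" is an assumed label
        "probability": round(probability, 4),
        "confidence": f"{probability:.1%}",
        "risk_level": risk,
    }


def log_prediction(response, path="predictions_log.csv"):
    """Append one prediction to a CSV file, writing the header on first use."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {"timestamp": datetime.now(timezone.utc).isoformat(), **response}
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending one row per request keeps the log cheap to write and easy to load later for drift or accuracy monitoring.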
Python · scikit-learn · XGBoost · LightGBM · MLflow · Flask · SQLAlchemy · pandas · pytest