# Machine Learning System for Seasonal Respiratory Disease Forecasting

Predicting state-level influenza-like illness (ILI) rates 1-4 weeks in advance (test R² = 0.993) to support pharmaceutical inventory optimization and healthcare resource allocation.
RespiForecast predicts weekly ILI rates 1-4 weeks in advance for 8 key U.S. states, using a comprehensive machine learning approach that combines temporal features, seasonal patterns, and state-specific characteristics.
Target States: California, New York, Texas, Florida, Illinois, Washington, Massachusetts, and New Jersey (the analyst's home state, included for personal context)
Business Value:
- Pharmaceutical inventory optimization
- Hospital resource allocation
- Healthcare staffing decisions
- Regional distribution planning
Our XGBoost model achieved production-grade accuracy:
| Metric | Naive Baseline | XGBoost | Improvement |
|---|---|---|---|
| R² Score | 0.916 | 0.993 | +7.7 pts |
| MAE | 0.378% | 0.083% | -78% |
| MAPE | 10.8% | 2.2% | -80% |
Key Finding: XGBoost with 37 engineered features explains 99.3% of ILI rate variance, with an average prediction error of only 0.083%.
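For reference, the three headline metrics can be computed with scikit-learn plus a few lines of NumPy. This is a minimal sketch with made-up toy values, not the project's actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def evaluate(y_true, y_pred):
    """Compute the report's three metrics: R², MAE, and MAPE."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        # MAPE as a percentage; assumes no zero ILI rates in y_true
        "mape": float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
    }

# Toy example: ILI rates in percent
metrics = evaluate([2.0, 3.5, 1.2, 4.0], [2.1, 3.4, 1.3, 3.9])
print(metrics)
```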
We systematically evaluated multiple approaches:

1. **Naive Baseline (Lag-1)** → R² = 0.916 (strong baseline)
2. **Moving Average (4-week)** → R² = 0.762 (underperforms the baseline)
3. **Prophet (time series)** → R² = -1.13 (failed: data mismatch)
4. **XGBoost (37 features)** → R² = 0.993 (best performer)
Why Prophet Failed:
- Designed for daily data, not weekly
- Over-smoothing eliminated rapid ILI fluctuations
- Strong autocorrelation makes simple lag features superior
- Documented as valuable learning in project report
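For contrast, the strong naive baseline amounts to a per-state one-week shift in pandas. The data below is illustrative, not CDC data, and the column names are assumptions:

```python
import pandas as pd

# Toy weekly ILI data for two states (illustrative values, not CDC data)
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "NY", "NY"],
    "ili_rate": [1.0, 1.5, 2.0, 3.0, 2.5, 2.0],
})

# Naive forecast: next week's rate = this week's rate, grouped by state
# so the shift never crosses a state boundary
df["naive_pred"] = df.groupby("state")["ili_rate"].shift(1)
print(df)
```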
### Prerequisites

- Python 3.12+
- Poetry (dependency management)
### Installation

```bash
# Clone the repository
git clone https://github.com/CrezC/RespiForecast.git
cd RespiForecast

# Install dependencies using Poetry
poetry install

# Activate virtual environment
poetry shell
```

### Run the Notebooks

```bash
# 1. Explore the data
jupyter notebook notebooks/01_data_exploration.ipynb

# 2. Feature engineering
jupyter notebook notebooks/02_feature_engineering.ipynb

# 3. Baseline models and Prophet
jupyter notebook notebooks/03_baseline_and_prophet.ipynb

# 4. XGBoost model (best performer)
jupyter notebook notebooks/04_xgboost_model.ipynb
```

### Test the XGBoost Model

```bash
cd finalproject/models
python xgboost_model.py

# Expected output:
# Test Results:
#   MAE:  0.0675
#   RMSE: 0.1949
#   R²:   0.9912
```

### Project Structure

    final-project-CrezC/
    ├── data/
    │   ├── raw/                          # Original CDC ILINet data
    │   │   └── ILINet_States.csv         # State-level ILI data (2018-2025)
    │   └── processed/                    # Cleaned and engineered data
    │       ├── ili_data_states_clean.csv
    │       └── ili_data_with_features.csv   # 37 features
    │
    ├── finalproject/
    │   ├── config/
    │   │   └── config.yaml               # Project configuration
    │   ├── data/
    │   │   └── processors.py             # Data loading and cleaning
    │   ├── features/
    │   │   └── build_features.py         # Feature engineering (37 features)
    │   ├── models/
    │   │   ├── baseline.py               # Naive and Moving Average
    │   │   ├── prophet_model.py          # Prophet (time series)
    │   │   └── xgboost_model.py          # XGBoost (best model)
    │   └── evaluation/
    │       └── metrics.py                # Performance evaluation
    │
    ├── notebooks/
    │   ├── 01_data_exploration.ipynb     # EDA and visualization
    │   ├── 02_feature_engineering.ipynb  # Feature creation analysis
    │   ├── 03_baseline_and_prophet.ipynb # Initial models
    │   ├── 03b_prophet_debugging.ipynb   # Prophet failure analysis
    │   ├── 04_xgboost_model.ipynb        # Final XGBoost model
    │   └── 05_final_results.ipynb        # Final report
    │
    ├── models/                           # Saved model files
    ├── pyproject.toml                    # Poetry dependencies
    ├── README.md                         # This file
    ├── DATA_CARD.md                      # Data documentation
    ├── MODEL_CARD.md                     # Model documentation
    └── FINAL_REPORT.md                   # Final report
---
## Methodology
### Data Collection
- **Source**: CDC ILINet (Influenza-Like Illness Surveillance Network)
- **Geography**: 8 U.S. states
- **Time Range**: October 2018 - September 2025 (7 years)
- **Frequency**: Weekly data
- **Total Records**: 2,920 (365 weeks × 8 states)
### Feature Engineering (37 Features)
**1. Lag Features (3)**
- `ili_rate_lag_1`, `ili_rate_lag_2`, `ili_rate_lag_4`
**2. Rolling Window Statistics (12)**
- 2-week, 4-week, 8-week windows
- Mean, Std Dev, Min, Max for each window
**3. Seasonal Features (11)**
- Cyclical encoding: `month_sin`, `month_cos`, `week_sin`, `week_cos`
- Season indicators: winter, spring, summer, fall
- Flu season indicator
**4. Trend Features (3)**
- `ili_rate_diff_1`, `ili_rate_diff_2`
- `ili_rate_momentum`
**5. State Features (8)**
- One-hot encoding for 8 states
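The five feature groups above can be sketched roughly as follows. Column names mirror those listed above, but the exact window handling (e.g., whether rolling windows exclude the current week, shown here as a one-week shift) is an assumption, and only a subset of the 37 features is built:

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the feature groups; a subset of the full 37 features."""
    g = df.groupby("state")["ili_rate"]

    # 1. Lag features (grouped by state so lags never cross states)
    for lag in (1, 2, 4):
        df[f"ili_rate_lag_{lag}"] = g.shift(lag)

    # 2. Rolling statistics over past weeks (shifted one week so the
    #    window never includes the value being predicted)
    for w in (2, 4, 8):
        for stat in ("mean", "std", "min", "max"):
            df[f"ili_rate_rolling_{stat}_{w}"] = g.transform(
                lambda s, w=w, stat=stat: s.shift(1).rolling(w).agg(stat)
            )

    # 3. Cyclical encoding of the epidemiological week
    df["week_sin"] = np.sin(2 * np.pi * df["week"] / 52)
    df["week_cos"] = np.cos(2 * np.pi * df["week"] / 52)

    # 4. Trend features
    df["ili_rate_diff_1"] = g.diff(1)
    df["ili_rate_diff_2"] = g.diff(2)

    # 5. One-hot state indicators
    return pd.get_dummies(df, columns=["state"], prefix="state")

# Tiny single-state demo (illustrative values)
toy = pd.DataFrame({"state": ["CA"] * 6, "week": list(range(1, 7)),
                    "ili_rate": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
feats = add_features(toy)
print(feats.filter(like="lag").tail(1))
```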
### Model Development
**Phase 1: Baseline Models**
- Naive (Lag-1): Simple but strong (R² = 0.916)
- Moving Average: Underperformed baseline
**Phase 2: Time Series Models**
- Prophet: Comprehensive testing revealed data mismatch
- *Key Learning*: Not all sophisticated models work better
**Phase 3: Gradient Boosting**
- XGBoost: Leveraged all 37 features
- Result: **99.3% variance explained**
---
## Feature Importance
Top 5 most important features (by gain):
| Rank | Feature | Importance | Category |
|------|---------|------------|----------|
| 1 | `ili_rate_lag_1` | 63.62 | Autocorrelation |
| 2 | `ili_rate_rolling_min_2` | 12.45 | Rolling Window |
| 3 | `ili_rate_momentum` | 3.04 | Trend |
| 4 | `ili_rate_rolling_max_2` | 2.75 | Rolling Window |
| 5 | `ili_rate_rolling_mean_2` | 2.23 | Rolling Window |
**Insight**: While lag-1 dominates at roughly 64% of total gain, the remaining 36 features contribute the other ~36% of predictive power, demonstrating the value of comprehensive feature engineering.
---
## Reproducibility
### Data Split Strategy
Train: 70% (2018-10 to 2023-08) - 2,021 samples
Validation: 15% (2023-08 to 2024-09) - 433 samples
Test: 15% (2024-09 to 2025-09) - 434 samples
**Important**: Temporal split (no shuffling) to prevent data leakage in time series.
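The temporal split above can be sketched as a simple chronological cut. The fractions match the project's 70/15/15; the `date` column name and toy data are assumptions:

```python
import pandas as pd

def temporal_split(df, train_frac=0.70, val_frac=0.15):
    """Chronological split: no shuffling, so training never sees the future."""
    df = df.sort_values("date").reset_index(drop=True)
    n = len(df)
    i, j = int(n * train_frac), int(n * (train_frac + val_frac))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

# Toy weekly series (100 weeks)
toy = pd.DataFrame({"date": pd.date_range("2018-10-01", periods=100, freq="W"),
                    "ili_rate": range(100)})
train, val, test = temporal_split(toy)
print(len(train), len(val), len(test))  # 70 15 15
```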
### Reproduce Results
```bash
# 1. Install dependencies
poetry install
# 2. Run feature engineering
cd finalproject/features
python build_features.py
# 3. Train and evaluate XGBoost
cd ../models
python xgboost_model.py
# Expected Test R²: 0.991 ± 0.002
```
Key visualizations in the notebooks:

1. **State-by-State ILI Trends** (`01_data_exploration.ipynb`)
   - 7-year time series for all 8 states
   - Seasonal patterns and geographic differences
2. **Feature Correlation Heatmap** (`02_feature_engineering.ipynb`)
   - Lag-1 correlation with the target: 0.96
   - Feature relationships
3. **Model Predictions** (`04_xgboost_model.ipynb`)
   - Actual vs. predicted (all states)
   - Residual analysis
4. **Feature Importance** (`04_xgboost_model.ipynb`)
   - Bar chart of top 15 features
   - Category breakdown
**Geographic Variation**
- Highest ILI: New Jersey (avg 3.65%)
- Lowest ILI: Washington (avg 1.76%)
- Range: 1.90 percentage points between states

**Seasonal Patterns**
- Winter average: 3.97%
- Summer average: 1.47%
- Seasonal ratio: 2.71× (winter rates run roughly 2.7 times higher than summer)

**Modeling Insights**
- Strong autocorrelation: lag-1 correlates at 0.96 with the target (naive R² = 0.916)
- Feature synergy: the remaining 36 features lift R² from 0.916 to 0.993
- Short-term patterns: 2-week rolling windows are the most predictive
- **Data Card**: See `DATA_CARD.md` for detailed data documentation
- **Model Card**: See `MODEL_CARD.md` for XGBoost model details
- **Final Report**: Complete analysis and findings in `FINAL_REPORT.md`
Author: Zhicheng Chen (Kenny Chen)
Personal Context: New Jersey is included as the analyst's home state, providing both geographic diversity and personal familiarity with local healthcare infrastructure.
Potential improvements identified:

1. **External Data Integration**
   - Weather data (temperature, humidity)
   - Search trends (Google Flu Trends)
   - Mobility data
2. **Model Enhancements**
   - LSTM for sequence modeling
   - Ensemble methods (XGBoost + others)
   - State-specific models
3. **Operational Deployment**
   - Real-time data pipeline
   - Automated weekly updates
   - Interactive dashboard
4. **Extended Horizons**
   - Multi-week forecasting (4-8 weeks)
   - Confidence intervals
   - Scenario analysis
This project is licensed under the MIT License - see the LICENSE file for details.
- CDC ILINet for providing public ILI surveillance data
- Professor for project guidance and template
- Facebook Prophet team (the model underperformed here, but diagnosing why was a valuable learning exercise)
- XGBoost developers for excellent ML library
**Contact**: Kenny Chen
- Email: zhichengchen12@gmail.com
- GitHub: https://github.com/CrezC
Built with ❤️ using Python, XGBoost, Prophet, and Poetry
Last Updated: December 2024