# RespiForecast 🌡️


**Machine Learning System for Seasonal Respiratory Disease Forecasting**

Predicting state-level influenza-like illness (ILI) rates 1-4 weeks in advance (test R² = 0.993) to support pharmaceutical inventory optimization and healthcare resource allocation.



## 🎯 Project Overview

RespiForecast predicts weekly ILI rates 1-4 weeks in advance for 8 key U.S. states, using a comprehensive machine learning approach that combines temporal features, seasonal patterns, and state-specific characteristics.

**Target States**: California, New York, Texas, Florida, Illinois, Washington, Massachusetts, and New Jersey (the analyst's home state)

**Business Value**:

- 📦 Pharmaceutical inventory optimization
- 🏥 Hospital resource allocation
- 👥 Healthcare staffing decisions
- 📊 Regional distribution planning

๐Ÿ† Key Results

Our XGBoost model achieved production-grade accuracy:

| Metric | Naive Baseline | XGBoost | Improvement |
|--------|----------------|---------|-------------|
| R² Score | 0.916 | 0.993 | +7.7% |
| MAE | 0.378% | 0.083% | -78% |
| MAPE | 10.8% | 2.2% | -80% |

Key Finding: XGBoost with 37 engineered features explains 99.3% of ILI rate variance, with an average prediction error of only 0.083%.
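
For reference, a minimal NumPy sketch of the three headline metrics (illustrative only; this is not the code in `finalproject/evaluation/metrics.py`):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, MAPE (in percent), and R² for a set of ILI-rate predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MAPE": mape, "R2": 1 - ss_res / ss_tot}

# Toy example with made-up weekly ILI rates (percent)
metrics = evaluate([2.0, 3.5, 1.8, 4.2], [2.1, 3.4, 1.9, 4.0])
```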


## 📊 Model Comparison

We systematically evaluated multiple approaches:

```
✅ Naive Baseline (Lag-1)      → R² = 0.916  (Strong baseline)
✅ Moving Average (4-week)     → R² = 0.762  (Underperforms)
❌ Prophet (Time Series)       → R² = -1.13  (Failed - data mismatch)
🏆 XGBoost (37 features)       → R² = 0.993  (Best performer)
```

**Why Prophet Failed**:

- Designed for daily data, not weekly
- Over-smoothing eliminated rapid ILI fluctuations
- Strong autocorrelation makes simple lag features superior
- Documented as a valuable learning in the project report
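
For contrast, the lag-1 baseline that Prophet failed to beat is only a few lines. A minimal sketch, assuming a long-format table with `state`, `week`, and `ili_rate` columns (these column names are illustrative, not taken from the repo's `baseline.py`):

```python
import pandas as pd

def naive_lag1_forecast(df):
    """Predict each week's ILI rate as the previous week's observed rate,
    computed per state so states don't leak into each other."""
    df = df.sort_values(["state", "week"]).copy()
    df["ili_rate_pred"] = df.groupby("state")["ili_rate"].shift(1)
    return df.dropna(subset=["ili_rate_pred"])

# Toy example: two states, three weeks each
obs = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NJ", "NJ", "NJ"],
    "week": [1, 2, 3, 1, 2, 3],
    "ili_rate": [1.2, 1.5, 1.9, 3.0, 3.4, 3.1],
})
forecast = naive_lag1_forecast(obs)
```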

## 🚀 Quick Start

### Prerequisites

- Python 3.12+
- Poetry (dependency management)

### Installation

```bash
# Clone the repository
git clone https://github.com/CrezC/RespiForecast.git
cd RespiForecast

# Install dependencies using Poetry
poetry install

# Activate the virtual environment
poetry shell
```

### Running the Models

```bash
# 1. Explore the data
jupyter notebook notebooks/01_data_exploration.ipynb

# 2. Feature engineering
jupyter notebook notebooks/02_feature_engineering.ipynb

# 3. Baseline models and Prophet
jupyter notebook notebooks/03_baseline_and_prophet.ipynb

# 4. XGBoost model (best performer)
jupyter notebook notebooks/04_xgboost_model.ipynb
```

### Quick Model Test

```bash
# Test the XGBoost model
cd finalproject/models
python xgboost_model.py

# Expected output:
# Test Results:
#   MAE:  0.0675
#   RMSE: 0.1949
#   R²:   0.9912
```

๐Ÿ“ Project Structure

```
final-project-CrezC/
├── data/
│   ├── raw/                          # Original CDC ILINet data
│   │   └── ILINet_States.csv         # State-level ILI data (2018-2025)
│   └── processed/                    # Cleaned and engineered data
│       ├── ili_data_states_clean.csv
│       └── ili_data_with_features.csv  # 37 features
│
├── finalproject/
│   ├── config/
│   │   └── config.yaml               # Project configuration
│   ├── data/
│   │   └── processors.py             # Data loading and cleaning
│   ├── features/
│   │   └── build_features.py         # Feature engineering (37 features)
│   ├── models/
│   │   ├── baseline.py               # Naive and Moving Average
│   │   ├── prophet_model.py          # Prophet (time series)
│   │   └── xgboost_model.py          # XGBoost (best model)
│   └── evaluation/
│       └── metrics.py                # Performance evaluation
│
├── notebooks/
│   ├── 01_data_exploration.ipynb     # EDA and visualization
│   ├── 02_feature_engineering.ipynb  # Feature creation analysis
│   ├── 03_baseline_and_prophet.ipynb # Initial models
│   ├── 03b_prophet_debugging.ipynb   # Prophet failure analysis
│   ├── 04_xgboost_model.ipynb        # Final XGBoost model
│   └── 05_final_results.ipynb        # Final report notebook
│
├── models/                           # Saved model files
├── pyproject.toml                    # Poetry dependencies
├── README.md                         # This file
├── DATA_CARD.md                      # Data documentation
├── MODEL_CARD.md                     # Model documentation
└── FINAL_REPORT.md                   # Final report
```

---

## ๐Ÿ”ฌ Methodology

### Data Collection
- **Source**: CDC ILINet (Influenza-Like Illness Surveillance Network)
- **Geography**: 8 U.S. states
- **Time Range**: October 2018 - September 2025 (7 years)
- **Frequency**: Weekly data
- **Total Records**: 2,920 (365 weeks × 8 states)

### Feature Engineering (37 Features)

**1. Lag Features (3)**
- `ili_rate_lag_1`, `ili_rate_lag_2`, `ili_rate_lag_4`

**2. Rolling Window Statistics (12)**
- 2-week, 4-week, 8-week windows
- Mean, Std Dev, Min, Max for each window

**3. Seasonal Features (11)**
- Cyclical encoding: `month_sin`, `month_cos`, `week_sin`, `week_cos`
- Season indicators: winter, spring, summer, fall
- Flu season indicator

**4. Trend Features (3)**
- `ili_rate_diff_1`, `ili_rate_diff_2`
- `ili_rate_momentum`

**5. State Features (8)**
- One-hot encoding for 8 states
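
The lag, rolling-window, cyclical, trend, and one-hot features above can be sketched in pandas. This is an illustrative subset, not the repo's `build_features.py`, and the column names (`state`, `week`, `ili_rate`) are assumptions:

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Illustrative subset of the 37 features, computed per state."""
    df = df.sort_values(["state", "week"]).copy()
    g = df.groupby("state")["ili_rate"]
    for k in (1, 2, 4):                               # 1. lag features
        df[f"ili_rate_lag_{k}"] = g.shift(k)
    for w in (2, 4, 8):                               # 2. rolling-window stats
        # shift(1) first so each window sees only past weeks (no leakage)
        df[f"ili_rate_rolling_mean_{w}"] = g.transform(
            lambda s, w=w: s.shift(1).rolling(w).mean())
    df["ili_rate_diff_1"] = g.diff(1)                 # 4. trend feature
    # 3. cyclical week-of-year encoding so week 52 sits next to week 1
    df["week_sin"] = np.sin(2 * np.pi * df["week"] / 52)
    df["week_cos"] = np.cos(2 * np.pi * df["week"] / 52)
    return pd.get_dummies(df, columns=["state"])      # 5. state one-hot
```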

### Model Development

**Phase 1: Baseline Models**
- Naive (Lag-1): Simple but strong (R² = 0.916)
- Moving Average: Underperformed baseline

**Phase 2: Time Series Models**
- Prophet: Comprehensive testing revealed data mismatch
- *Key Learning*: Not all sophisticated models work better

**Phase 3: Gradient Boosting**
- XGBoost: Leveraged all 37 features
- Result: **99.3% variance explained**

---

## ๐Ÿ“ˆ Feature Importance

Top 5 most important features (by gain):

| Rank | Feature | Importance | Category |
|------|---------|------------|----------|
| 1 | `ili_rate_lag_1` | 63.62 | Autocorrelation |
| 2 | `ili_rate_rolling_min_2` | 12.45 | Rolling Window |
| 3 | `ili_rate_momentum` | 3.04 | Trend |
| 4 | `ili_rate_rolling_max_2` | 2.75 | Rolling Window |
| 5 | `ili_rate_rolling_mean_2` | 2.23 | Rolling Window |

**Insight**: While lag-1 dominates (~64% of total gain), the remaining 36 features account for the other ~36%, underscoring the value of comprehensive feature engineering.

---

## ๐Ÿ”„ Reproducibility

### Data Split Strategy

```
Train:      70%  (2018-10 to 2023-08)  2,021 samples
Validation: 15%  (2023-08 to 2024-09)    433 samples
Test:       15%  (2024-09 to 2025-09)    434 samples
```


**Important**: Temporal split (no shuffling) to prevent data leakage in time series.
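
The split above can be sketched as a simple chronological slice (a minimal illustration; the `week_start` column name and the fraction arguments are assumptions, not the repo's code):

```python
import pandas as pd

def temporal_split(df, train_frac=0.70, val_frac=0.15):
    """Chronological split with no shuffling: every training week precedes
    every validation week, which precedes every test week."""
    df = df.sort_values("week_start").reset_index(drop=True)
    n = len(df)
    i = round(n * train_frac)
    j = round(n * (train_frac + val_frac))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

weeks = pd.DataFrame(
    {"week_start": pd.date_range("2018-10-01", periods=100, freq="W")})
train, val, test = temporal_split(weeks)
```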

### Reproduce Results
```bash
# 1. Install dependencies
poetry install

# 2. Run feature engineering
cd finalproject/features
python build_features.py

# 3. Train and evaluate XGBoost
cd ../models
python xgboost_model.py

# Expected Test R²: 0.991 ± 0.002
```

## 📊 Visualizations

Key visualizations in notebooks:

1. **State-by-State ILI Trends** (`01_data_exploration.ipynb`)
   - 7-year time series for all 8 states
   - Seasonal patterns and geographic differences
2. **Feature Correlation Heatmap** (`02_feature_engineering.ipynb`)
   - Lag-1 correlation with the target: 0.96
   - Feature relationships
3. **Model Predictions** (`04_xgboost_model.ipynb`)
   - Actual vs. predicted (all states)
   - Residual analysis
4. **Feature Importance** (`04_xgboost_model.ipynb`)
   - Bar chart of the top 15 features
   - Category breakdown

## 💡 Key Insights

### Geographic Patterns

- **Highest ILI**: New Jersey (avg 3.65%)
- **Lowest ILI**: Washington (avg 1.76%)
- **Range**: roughly 1.9 percentage points between the highest and lowest states

### Seasonality

- Winter average: 3.97%
- Summer average: 1.47%
- Seasonal ratio: 2.71× (the winter average is 2.71 times the summer average)

### Predictability

- **Strong autocorrelation**: last week's value alone achieves R² = 0.916 (the naive baseline), with a lag-1 correlation of 0.96
- **Feature synergy**: the additional 36 features add a further +7.7% R² improvement
- **Short-term patterns**: 2-week rolling windows are the most predictive

## 📚 Documentation

- **Data Card**: see `DATA_CARD.md` for detailed data documentation
- **Model Card**: see `MODEL_CARD.md` for XGBoost model details
- **Final Report**: complete analysis and findings in `FINAL_REPORT.md`

## 🎓 Project Context

**Author**: Zhicheng Chen (Kenny Chen)

**Course**: Rutgers AI course (final project)

**Personal Context**: New Jersey is included as the analyst's home state, providing both geographic diversity and personal familiarity with local healthcare infrastructure.


## 🔮 Future Work

Potential improvements identified:

1. **External Data Integration**
   - Weather data (temperature, humidity)
   - Search trends (Google Flu Trends)
   - Mobility data
2. **Model Enhancements**
   - LSTM for sequence modeling
   - Ensemble methods (XGBoost + others)
   - State-specific models
3. **Operational Deployment**
   - Real-time data pipeline
   - Automated weekly updates
   - Interactive dashboard
4. **Extended Horizons**
   - Multi-week forecasting (4-8 weeks)
   - Confidence intervals
   - Scenario analysis

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

- **CDC ILINet** for providing public ILI surveillance data
- **Professor** for project guidance and the template
- **Facebook Prophet team** (the model underperformed on this data, but exploring it was a valuable learning exercise)
- **XGBoost developers** for an excellent ML library

## 📧 Contact

**Kenny Chen**

For questions or collaboration:
- Email: zhichengchen12@gmail.com
- GitHub: https://github.com/CrezC


Built with ❤️ using Python, XGBoost, Prophet, and Poetry

Last Updated: December 2024
