# RespiForecast 🌡️


**Machine Learning System for Seasonal Respiratory Disease Forecasting**

Predicting state-level influenza-like illness (ILI) rates 1-4 weeks in advance (test R² = 0.993) to support pharmaceutical inventory optimization and healthcare resource allocation.



## 🎯 Project Overview

RespiForecast predicts weekly ILI rates 1-4 weeks in advance for 8 key U.S. states, using a comprehensive machine learning approach that combines temporal features, seasonal patterns, and state-specific characteristics.

**Target States**: California, New York, Texas, Florida, Illinois, Washington, Massachusetts, and New Jersey (the analyst's home state)

**Business Value**:

- 📦 Pharmaceutical inventory optimization
- 🏥 Hospital resource allocation
- 👥 Healthcare staffing decisions
- 📊 Regional distribution planning

๐Ÿ† Key Results

Our XGBoost model achieved production-grade accuracy:

| Metric | Naive Baseline | XGBoost | Improvement |
|--------|----------------|---------|-------------|
| R² Score | 0.916 | 0.993 | +7.7% |
| MAE | 0.378% | 0.083% | -78% |
| MAPE | 10.8% | 2.2% | -80% |

Key Finding: XGBoost with 37 engineered features explains 99.3% of ILI rate variance, with an average prediction error of only 0.083%.
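
For reference, a minimal NumPy sketch of the three headline metrics (illustrative only; this is not the code in `finalproject/evaluation/metrics.py`):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, MAPE (in percent), and R² for a set of ILI-rate predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MAPE": mape, "R2": 1 - ss_res / ss_tot}

# Toy example with made-up weekly ILI rates (percent)
metrics = evaluate([2.0, 3.5, 1.8, 4.2], [2.1, 3.4, 1.9, 4.0])
```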


## 📊 Model Comparison

We systematically evaluated multiple approaches:

```
✅ Naive Baseline (Lag-1)      → R² = 0.916  (Strong baseline)
✅ Moving Average (4-week)     → R² = 0.762  (Underperforms)
❌ Prophet (Time Series)       → R² = -1.13  (Failed - data mismatch)
🏆 XGBoost (37 features)       → R² = 0.993  (Best performer)
```

**Why Prophet Failed**:

- Designed for daily data, not weekly
- Over-smoothing eliminated rapid ILI fluctuations
- Strong autocorrelation makes simple lag features superior
- Documented as a valuable learning in the project report
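
For contrast, the lag-1 baseline that Prophet failed to beat is only a few lines. A minimal sketch, assuming a long-format table with `state`, `week`, and `ili_rate` columns (these column names are illustrative, not taken from the repo's `baseline.py`):

```python
import pandas as pd

def naive_lag1_forecast(df):
    """Predict each week's ILI rate as the previous week's observed rate,
    computed per state so states don't leak into each other."""
    df = df.sort_values(["state", "week"]).copy()
    df["ili_rate_pred"] = df.groupby("state")["ili_rate"].shift(1)
    return df.dropna(subset=["ili_rate_pred"])

# Toy example: two states, three weeks each
obs = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NJ", "NJ", "NJ"],
    "week": [1, 2, 3, 1, 2, 3],
    "ili_rate": [1.2, 1.5, 1.9, 3.0, 3.4, 3.1],
})
forecast = naive_lag1_forecast(obs)
```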

## 🚀 Quick Start

### Prerequisites

- Python 3.12+
- Poetry (dependency management)

### Installation

```bash
# Clone the repository
git clone https://github.com/CrezC/RespiForecast.git
cd RespiForecast

# Install dependencies using Poetry
poetry install

# Activate the virtual environment
poetry shell
```

### Running the Models

```bash
# 1. Explore the data
jupyter notebook notebooks/01_data_exploration.ipynb

# 2. Feature engineering
jupyter notebook notebooks/02_feature_engineering.ipynb

# 3. Baseline models and Prophet
jupyter notebook notebooks/03_baseline_and_prophet.ipynb

# 4. XGBoost model (best performer)
jupyter notebook notebooks/04_xgboost_model.ipynb
```

### Quick Model Test

```bash
# Test the XGBoost model
cd finalproject/models
python xgboost_model.py

# Expected output:
# Test Results:
#   MAE:  0.0675
#   RMSE: 0.1949
#   R²:   0.9912
```

๐Ÿ“ Project Structure

```
final-project-CrezC/
├── data/
│   ├── raw/                          # Original CDC ILINet data
│   │   └── ILINet_States.csv         # State-level ILI data (2018-2025)
│   └── processed/                    # Cleaned and engineered data
│       ├── ili_data_states_clean.csv
│       └── ili_data_with_features.csv  # 37 features
│
├── finalproject/
│   ├── config/
│   │   └── config.yaml               # Project configuration
│   ├── data/
│   │   └── processors.py             # Data loading and cleaning
│   ├── features/
│   │   └── build_features.py         # Feature engineering (37 features)
│   ├── models/
│   │   ├── baseline.py               # Naive and Moving Average
│   │   ├── prophet_model.py          # Prophet (time series)
│   │   └── xgboost_model.py          # XGBoost (best model)
│   └── evaluation/
│       └── metrics.py                # Performance evaluation
│
├── notebooks/
│   ├── 01_data_exploration.ipynb     # EDA and visualization
│   ├── 02_feature_engineering.ipynb  # Feature creation analysis
│   ├── 03_baseline_and_prophet.ipynb # Initial models
│   ├── 03b_prophet_debugging.ipynb   # Prophet failure analysis
│   ├── 04_xgboost_model.ipynb        # Final XGBoost model
│   └── 05_final_results.ipynb        # Final report notebook
│
├── models/                           # Saved model files
├── pyproject.toml                    # Poetry dependencies
├── README.md                         # This file
├── DATA_CARD.md                      # Data documentation
├── MODEL_CARD.md                     # Model documentation
└── FINAL_REPORT.md                   # Final report
```

---

## ๐Ÿ”ฌ Methodology

### Data Collection
- **Source**: CDC ILINet (Influenza-Like Illness Surveillance Network)
- **Geography**: 8 U.S. states
- **Time Range**: October 2018 - September 2025 (7 years)
- **Frequency**: Weekly data
- **Total Records**: 2,920 (365 weeks × 8 states)

### Feature Engineering (37 Features)

**1. Lag Features (3)**
- `ili_rate_lag_1`, `ili_rate_lag_2`, `ili_rate_lag_4`

**2. Rolling Window Statistics (12)**
- 2-week, 4-week, 8-week windows
- Mean, Std Dev, Min, Max for each window

**3. Seasonal Features (11)**
- Cyclical encoding: `month_sin`, `month_cos`, `week_sin`, `week_cos`
- Season indicators: winter, spring, summer, fall
- Flu season indicator

**4. Trend Features (3)**
- `ili_rate_diff_1`, `ili_rate_diff_2`
- `ili_rate_momentum`

**5. State Features (8)**
- One-hot encoding for 8 states
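
The lag, rolling-window, cyclical, trend, and one-hot features above can be sketched in pandas. This is an illustrative subset, not the repo's `build_features.py`, and the column names (`state`, `week`, `ili_rate`) are assumptions:

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Illustrative subset of the 37 features, computed per state."""
    df = df.sort_values(["state", "week"]).copy()
    g = df.groupby("state")["ili_rate"]
    for k in (1, 2, 4):                               # 1. lag features
        df[f"ili_rate_lag_{k}"] = g.shift(k)
    for w in (2, 4, 8):                               # 2. rolling-window stats
        # shift(1) first so each window sees only past weeks (no leakage)
        df[f"ili_rate_rolling_mean_{w}"] = g.transform(
            lambda s, w=w: s.shift(1).rolling(w).mean())
    df["ili_rate_diff_1"] = g.diff(1)                 # 4. trend feature
    # 3. cyclical week-of-year encoding so week 52 sits next to week 1
    df["week_sin"] = np.sin(2 * np.pi * df["week"] / 52)
    df["week_cos"] = np.cos(2 * np.pi * df["week"] / 52)
    return pd.get_dummies(df, columns=["state"])      # 5. state one-hot
```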

### Model Development

**Phase 1: Baseline Models**
- Naive (Lag-1): Simple but strong (R² = 0.916)
- Moving Average: Underperformed baseline

**Phase 2: Time Series Models**
- Prophet: Comprehensive testing revealed data mismatch
- *Key Learning*: Not all sophisticated models work better

**Phase 3: Gradient Boosting**
- XGBoost: Leveraged all 37 features
- Result: **99.3% variance explained**

---

## ๐Ÿ“ˆ Feature Importance

Top 5 most important features (by gain):

| Rank | Feature | Importance | Category |
|------|---------|------------|----------|
| 1 | `ili_rate_lag_1` | 63.62 | Autocorrelation |
| 2 | `ili_rate_rolling_min_2` | 12.45 | Rolling Window |
| 3 | `ili_rate_momentum` | 3.04 | Trend |
| 4 | `ili_rate_rolling_max_2` | 2.75 | Rolling Window |
| 5 | `ili_rate_rolling_mean_2` | 2.23 | Rolling Window |

**Insight**: While lag-1 dominates (~64% of total gain), the remaining 36 features account for the other ~36%, underscoring the value of comprehensive feature engineering.

---

## ๐Ÿ”„ Reproducibility

### Data Split Strategy

```
Train:      70%  (2018-10 to 2023-08)  2,021 samples
Validation: 15%  (2023-08 to 2024-09)    433 samples
Test:       15%  (2024-09 to 2025-09)    434 samples
```


**Important**: Temporal split (no shuffling) to prevent data leakage in time series.
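
The split above can be sketched as a simple chronological slice (a minimal illustration; the `week_start` column name and the fraction arguments are assumptions, not the repo's code):

```python
import pandas as pd

def temporal_split(df, train_frac=0.70, val_frac=0.15):
    """Chronological split with no shuffling: every training week precedes
    every validation week, which precedes every test week."""
    df = df.sort_values("week_start").reset_index(drop=True)
    n = len(df)
    i = round(n * train_frac)
    j = round(n * (train_frac + val_frac))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

weeks = pd.DataFrame(
    {"week_start": pd.date_range("2018-10-01", periods=100, freq="W")})
train, val, test = temporal_split(weeks)
```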

### Reproduce Results
```bash
# 1. Install dependencies
poetry install

# 2. Run feature engineering
cd finalproject/features
python build_features.py

# 3. Train and evaluate XGBoost
cd ../models
python xgboost_model.py

# Expected Test R²: 0.991 ± 0.002
```

## 📊 Visualizations

Key visualizations in notebooks:

1. **State-by-State ILI Trends** (`01_data_exploration.ipynb`)
   - 7-year time series for all 8 states
   - Seasonal patterns and geographic differences
2. **Feature Correlation Heatmap** (`02_feature_engineering.ipynb`)
   - Lag-1 correlation with the target: 0.96
   - Feature relationships
3. **Model Predictions** (`04_xgboost_model.ipynb`)
   - Actual vs. predicted (all states)
   - Residual analysis
4. **Feature Importance** (`04_xgboost_model.ipynb`)
   - Bar chart of the top 15 features
   - Category breakdown

## 💡 Key Insights

### Geographic Patterns

- **Highest ILI**: New Jersey (avg 3.65%)
- **Lowest ILI**: Washington (avg 1.76%)
- **Range**: roughly 1.9 percentage points between the highest and lowest states

### Seasonality

- Winter average: 3.97%
- Summer average: 1.47%
- Seasonal ratio: 2.71× (the winter average is 2.71 times the summer average)

### Predictability

- **Strong autocorrelation**: last week's value alone achieves R² = 0.916 (the naive baseline), with a lag-1 correlation of 0.96
- **Feature synergy**: the additional 36 features add a further +7.7% R² improvement
- **Short-term patterns**: 2-week rolling windows are the most predictive

## 📚 Documentation

- **Data Card**: see `DATA_CARD.md` for detailed data documentation
- **Model Card**: see `MODEL_CARD.md` for XGBoost model details
- **Final Report**: complete analysis and findings in `FINAL_REPORT.md`

## 🎓 Project Context

**Author**: Zhicheng Chen (Kenny Chen)

**Course**: Rutgers AI course (final project)

**Personal Context**: New Jersey is included as the analyst's home state, providing both geographic diversity and personal familiarity with local healthcare infrastructure.


## 🔮 Future Work

Potential improvements identified:

1. **External Data Integration**
   - Weather data (temperature, humidity)
   - Search trends (Google Flu Trends)
   - Mobility data
2. **Model Enhancements**
   - LSTM for sequence modeling
   - Ensemble methods (XGBoost + others)
   - State-specific models
3. **Operational Deployment**
   - Real-time data pipeline
   - Automated weekly updates
   - Interactive dashboard
4. **Extended Horizons**
   - Multi-week forecasting (4-8 weeks)
   - Confidence intervals
   - Scenario analysis

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

- **CDC ILINet** for providing public ILI surveillance data
- **Professor** for project guidance and the template
- **Facebook Prophet team** (the model underperformed on this data, but exploring it was a valuable learning exercise)
- **XGBoost developers** for an excellent ML library

## 📧 Contact

**Kenny Chen**

For questions or collaboration:
- Email: zhichengchen12@gmail.com
- GitHub: https://github.com/CrezC


Built with ❤️ using Python, XGBoost, Prophet, and Poetry

Last Updated: December 2024
