# Forecasting Regional Influenza Activity Using Global Panel Models
>Time Series Forecasting Project

**Objective**

In this project we build short-term forecasting models for the weekly percentage of outpatient visits due to influenza-like illness (ILI%) in all ten U.S. Department of Health and Human Services (HHS) regions.

The idea is to move from a reactive approach (“we wait and see what happens”) to a more predictive one. Using historical surveillance data, we want to forecast flu activity by region and compare different modelling families:

- simple baselines,
- classical time-series models (SARIMA),
- and a modern global machine-learning model (Random Forest).

The final goal is to see which approach holds up best when normal seasonal patterns are disrupted, for example by the COVID-19 pandemic and by shifts in the timing and intensity of flu seasons.

## Problem Statement

Seasonal influenza is a recurring pressure on the healthcare system. It affects:

- hospital admissions and bed capacity,
- staff availability,
- vaccine demand and public communication.

Our main question is:

> Can we forecast next-week ILI% for each U.S. HHS region using historical flu surveillance data and a few key exogenous drivers?

We treat this as a **panel forecasting problem**:  
the target is weekly regional ILI% (ILI_PERCENT) and we use:

- past ILI%,
- lab positivity (PERCENT_POSITIVE),
- and time indicators (week of year, year, seasonal encodings)

to predict **one week ahead**.
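As a rough illustration of this setup, the sketch below builds per-region lag features, a cyclical week-of-year encoding, and a next-week target with pandas. The function name, the lag count, the output column names (`ILI_LAG_1`, `POS_LAG_1`, `WEEK_SIN`, `WEEK_COS`, `TARGET_NEXT_WEEK`) and the demo values are assumptions made for this example, not the project's actual feature set.

```python
import numpy as np
import pandas as pd


def add_one_week_ahead_features(df: pd.DataFrame, n_lags: int = 2) -> pd.DataFrame:
    """Add per-region lags, cyclical week encodings and a next-week target."""
    out = df.sort_values(["REGION", "DATE"]).copy()

    # Lags are computed inside each region so values never leak across regions.
    for lag in range(1, n_lags + 1):
        out[f"ILI_LAG_{lag}"] = out.groupby("REGION")["ILI_PERCENT"].shift(lag)
    out["POS_LAG_1"] = out.groupby("REGION")["PERCENT_POSITIVE"].shift(1)

    # Cyclical encoding so late-December weeks sit next to early-January weeks.
    # Dividing by 52 is an approximation; some years have a week 53.
    angle = 2 * np.pi * out["WEEKOFYEAR"] / 52.0
    out["WEEK_SIN"] = np.sin(angle)
    out["WEEK_COS"] = np.cos(angle)

    # Target: ILI% one week after the current row, within the same region.
    out["TARGET_NEXT_WEEK"] = out.groupby("REGION")["ILI_PERCENT"].shift(-1)
    return out


# Made-up demo values, only to show the shape of the result.
demo = pd.DataFrame({
    "REGION": ["Region 1"] * 4 + ["Region 2"] * 4,
    "DATE": list(pd.date_range("2020-01-06", periods=4, freq="7D")) * 2,
    "WEEKOFYEAR": [2, 3, 4, 5] * 2,
    "ILI_PERCENT": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
    "PERCENT_POSITIVE": [10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 24.0, 26.0],
})
features = add_one_week_ahead_features(demo)
print(features[["REGION", "DATE", "ILI_PERCENT", "ILI_LAG_1", "TARGET_NEXT_WEEK"]])
```

Rows at the start of each region have missing lags and the last row of each region has no target, so they would typically be dropped before fitting a model.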

## Dataset Description

### Dataset Overview

The dataset is a merged panel of weekly influenza surveillance for:

- **10 U.S. HHS regions**
- Weekly frequency
- From **October 1997 to November 2025**

Each row is one region–week pair. After cleaning and feature creation we have a long, balanced panel covering nearly three decades of flu seasons per region.

Key variables:

- **ILI_PERCENT** – percentage of outpatient visits for influenza-like illness  
- **PERCENT_POSITIVE**, **TOTAL_SPECIMENS**, **TOTAL_A**, **TOTAL_B** – lab-confirmed activity  
- **DATE**, **YEAR**, **WEEKOFYEAR** – calendar structure  
- **REGION**, **REGION_ID** – region labels and numeric ID for ML

Overall, the data give us a rich history of flu seasons across regions, suitable for both per-region time-series models and global models trained on all regions together.

---

## Data Quality and Missing Values

Before modelling we checked missingness and basic data quality.

- Age-specific ILI variables had high and uneven missingness across years and regions, so we dropped them and worked with aggregate indicators instead.
- For lab variables such as **PERCENT_POSITIVE**, missingness was moderate and mostly in early years. We interpolated these values within each region to keep the time series smooth.
- Core variables for modelling (ILI_PERCENT, specimen counts, patient totals) are complete after these steps.
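The within-region interpolation described above could look roughly like the following. This is a minimal sketch that assumes linear interpolation and fills only gaps that have observed values on both sides; the function name and the demo values are hypothetical, and the exact method used in the project may differ.

```python
import numpy as np
import pandas as pd


def interpolate_within_region(df: pd.DataFrame, column: str = "PERCENT_POSITIVE") -> pd.DataFrame:
    """Linearly interpolate `column` separately for each region."""
    out = df.sort_values(["REGION", "DATE"]).copy()
    # transform keeps the original index, so the result aligns row by row.
    # limit_area="inside" fills only interior gaps; leading or trailing
    # missing values stay NaN (an assumption made for this sketch).
    out[column] = out.groupby("REGION")[column].transform(
        lambda s: s.interpolate(method="linear", limit_area="inside")
    )
    return out


# Made-up demo values with one interior gap per region and one leading gap.
lab_demo = pd.DataFrame({
    "REGION": ["Region 1"] * 4 + ["Region 2"] * 4,
    "DATE": list(pd.date_range("2020-01-06", periods=4, freq="7D")) * 2,
    "PERCENT_POSITIVE": [10.0, np.nan, 14.0, 16.0, np.nan, 20.0, np.nan, 30.0],
})
lab_filled = interpolate_within_region(lab_demo)
print(lab_filled)
```

Grouping by region before interpolating prevents the last value of one region from being blended into the first value of the next.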

Dates were converted to a proper datetime type, and we verified that each region has a continuous weekly series with no duplicate region–week rows. Because ILI% is already a percentage, we left it unscaled to keep interpretation simple.
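One way to run such a continuity and duplicate check is sketched below. The helper name `check_weekly_panel`, the returned keys, and the demo frames are assumptions for illustration; the check simply counts duplicate region–week rows and date steps that are not exactly seven days.

```python
import pandas as pd


def check_weekly_panel(df: pd.DataFrame) -> dict:
    """Count duplicate region-week rows and non-weekly date steps."""
    panel = df.copy()
    panel["DATE"] = pd.to_datetime(panel["DATE"])

    # Rows that repeat an earlier (REGION, DATE) pair.
    n_duplicates = int(panel.duplicated(subset=["REGION", "DATE"]).sum())

    # Difference between consecutive dates inside each region.
    # The first row of each region has no predecessor, so it is dropped.
    steps = (
        panel.sort_values(["REGION", "DATE"])
        .groupby("REGION")["DATE"]
        .diff()
        .dropna()
    )
    # Any step other than 7 days is a gap (or a duplicate, which gives 0 days).
    n_irregular = int((steps != pd.Timedelta(days=7)).sum())

    return {"duplicate_rows": n_duplicates, "irregular_steps": n_irregular}


# Made-up demo panels: one clean, one with a missing week in Region 1.
clean = pd.DataFrame({
    "REGION": ["Region 1"] * 3 + ["Region 2"] * 3,
    "DATE": ["2020-01-06", "2020-01-13", "2020-01-20"] * 2,
})
broken = clean.drop(index=1)

print(check_weekly_panel(clean))
print(check_weekly_panel(broken))
```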


---