# Main Insights

Revised research questions:
1. How does the frequency with which agencies collect samples affect the number of HPAI cases detected?
    - Which agencies collect samples most frequently?
    - When do agencies collect samples most frequently (time of year)?
    - Where do agencies collect samples most frequently (geographic region)?
2. Is there a correlation between geographic region and season, and number of HPAI cases?
    - How does agency collection frequency influence this correlation?
3. What is the forecast for the number of HPAI cases in each region over the next 6 months?
    - With consideration for agency collection frequency and its impact on case detection.

The goal of this notebook is to answer these questions through statistical analysis.

## Load Data

In [2]:
import pandas as pd

dataset_path = '../data/HPAI Detections in Wild Birds.csv'
hpai_data = pd.read_csv(dataset_path)
hpai_data.head(n=3)

Unnamed: 0,State,County,Collection Date,Date Detected,HPAI Strain,Bird Species,WOAH Classification,Sampling Method,Submitting Agency
0,North Dakota,Cass,9/12/2025,9/19/2025,EA H5,Canada goose,Wild bird,Morbidity/Mortality,ND Game and Fish
1,Pennsylvania,Bucks,9/8/2025,9/19/2025,EA H5,Black vulture,Wild bird,Morbidity/Mortality,PA Game Commission
2,Pennsylvania,Delaware,9/4/2025,9/19/2025,EA H5,Black vulture,Wild bird,Morbidity/Mortality,PA Game Commission
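Note that each row carries two dates, and they can be days apart. As a quick sketch (standard library only, using just the three rows shown above, so not representative of the whole dataset), the lag between collection and detection can be computed like this. The helper name `parse_mdy` is ours.

```python
from datetime import datetime

# (Collection Date, Date Detected) pairs copied from the three rows shown above.
rows = [
    ("9/12/2025", "9/19/2025"),
    ("9/8/2025", "9/19/2025"),
    ("9/4/2025", "9/19/2025"),
]

def parse_mdy(text):
    """Parse an 'M/D/YYYY' string into a datetime (hypothetical helper)."""
    return datetime.strptime(text, "%m/%d/%Y")

# Days elapsed between sample collection and confirmed detection.
lags_in_days = [
    (parse_mdy(detected) - parse_mdy(collected)).days
    for collected, detected in rows
]
print(lags_in_days)
```

A lag of one to two weeks means a sample collected at the end of one month can be detected in the next, which matters later when months are mapped to seasons.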


## Define Regions

We'll need to define regions for geographic analysis. The grouping below follows the four U.S. Census Bureau regions (Northeast, Midwest, South, West).

In [3]:
import us

region_map = {
    'Northeast': ['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'],
    'Midwest': ['IL', 'IN', 'MI', 'OH', 'WI', 'IA', 'KS', 'MN', 'MO', 'NE', 'ND', 'SD'],
    'South': ['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'],
    'West': ['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA', 'HI', 'OR', 'WA']
}

state_to_region = {}
for region, states in region_map.items():
    for state in states:
        state_to_region[state] = region

df1 = hpai_data.copy()
df1['State'] = df1['State'].apply(lambda x: us.states.lookup(x).abbr if us.states.lookup(x) else x)
df1['Region'] = df1['State'].map(state_to_region)

df1.head(n=5)

Unnamed: 0,State,County,Collection Date,Date Detected,HPAI Strain,Bird Species,WOAH Classification,Sampling Method,Submitting Agency,Region
0,ND,Cass,9/12/2025,9/19/2025,EA H5,Canada goose,Wild bird,Morbidity/Mortality,ND Game and Fish,Midwest
1,PA,Bucks,9/8/2025,9/19/2025,EA H5,Black vulture,Wild bird,Morbidity/Mortality,PA Game Commission,Northeast
2,PA,Delaware,9/4/2025,9/19/2025,EA H5,Black vulture,Wild bird,Morbidity/Mortality,PA Game Commission,Northeast
3,NJ,Warren,9/11/2025,9/19/2025,EA H5,Black vulture,Wild bird,Morbidity/Mortality,NJ DEP,Northeast
4,NJ,Warren,9/11/2025,9/19/2025,EA H5,Black vulture,Wild bird,Morbidity/Mortality,NJ DEP,Northeast
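One thing to watch: a value that is not in `region_map` ends up with a missing `Region`. The sketch below (standard library only) checks that the map covers 50 states plus DC without duplicates and shows how an unmapped abbreviation surfaces. `'PR'` is a hypothetical input; we have not confirmed that such rows exist in the dataset. `region_for` is our own helper name.

```python
region_map = {
    'Northeast': ['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'],
    'Midwest': ['IL', 'IN', 'MI', 'OH', 'WI', 'IA', 'KS', 'MN', 'MO', 'NE', 'ND', 'SD'],
    'South': ['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'],
    'West': ['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA', 'HI', 'OR', 'WA'],
}

# Flatten to abbreviation -> region, as in the cell above.
state_to_region = {
    abbr: region for region, states in region_map.items() for abbr in states
}

# Sanity check: no abbreviation is listed under two regions.
all_abbrs = [abbr for states in region_map.values() for abbr in states]
has_duplicates = len(all_abbrs) != len(set(all_abbrs))

def region_for(abbr):
    """Return the region for an abbreviation, or None if it is unmapped."""
    return state_to_region.get(abbr)

# 'PR' is a hypothetical value used only to illustrate the unmapped case.
sample = ['ND', 'PA', 'PR']
unmapped = [abbr for abbr in sample if region_for(abbr) is None]
print(len(state_to_region), has_duplicates, unmapped)
```

In the notebook itself, counting missing values in `df1['Region']` would give the same information for the real data.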


## Define Seasons

We'll need to define seasons for temporal analysis. We use meteorological seasons:
- Winter: December, January, February
- Spring: March, April, May
- Summer: June, July, August
- Fall: September, October, November

In [4]:
seasons_map = {
    'Winter': ['Dec', 'Jan', 'Feb'],
    'Spring': ['Mar', 'Apr', 'May'],
    'Summer': ['Jun', 'Jul', 'Aug'],
    'Fall': ['Sep', 'Oct', 'Nov']
}

def parse_date(s):
    '''
    Attempts to parse a date string of format 'MM/DD/YYYY' to a datetime.

    If parsing fails, it returns pd.NaT (Not a Time).
    '''

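    try:
        return pd.to_datetime(s, format='%m/%d/%Y', errors='coerce')
    except (ValueError, TypeError):
        # errors='coerce' already turns unparseable strings into pd.NaT;
        # this only guards against unexpected input types
        return pd.NaT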

month_to_season = {}
for season, months in seasons_map.items():
    for month in months:
        month_to_season[month] = season

df2 = df1.copy()

# convert date strings to datetime objects
# if parsing fails, the value will be set to pd.NaT because sometimes the date strings are invalid
df2['Date Detected'] = df2['Date Detected'].apply(parse_date)
df2['Collection Date'] = df2['Collection Date'].apply(parse_date)

# extract detection month and year
df2['Detection Month'] = pd.to_datetime(df2['Date Detected']).dt.strftime('%b')
df2['Detection Year'] = pd.to_datetime(df2['Date Detected']).dt.strftime('%Y')
df2 = df2.drop(columns=['Date Detected'])

# also extract collection month and year
df2['Collection Month'] = pd.to_datetime(df2['Collection Date']).dt.strftime('%b')
df2['Collection Year'] = pd.to_datetime(df2['Collection Date']).dt.strftime('%Y')
df2 = df2.drop(columns=['Collection Date'])

# map detection month to season
# use detection month because that's when the case is actually confirmed
df2['Season'] = df2['Detection Month'].map(month_to_season)

df2.head(n=1)

Unnamed: 0,State,County,HPAI Strain,Bird Species,WOAH Classification,Sampling Method,Submitting Agency,Region,Detection Month,Detection Year,Collection Month,Collection Year,Season
0,ND,Cass,EA H5,Canada goose,Wild bird,Morbidity/Mortality,ND Game and Fish,Midwest,Sep,2025,Sep,2025,Fall
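A caveat with meteorological seasons: winter spans two calendar years, so grouping by calendar year and season splits a single winter (December vs. January and February) across two years. Here is a minimal sketch, using only the standard library, of deriving the season from the month number and labeling December with the following year. The `season_year` convention is our assumption, not something the notebook does.

```python
from datetime import date

SEASONS = ["Winter", "Spring", "Summer", "Fall"]

def meteorological_season(d):
    """Map a date to its meteorological season.

    (month % 12) // 3 gives 0 for Dec-Feb, 1 for Mar-May,
    2 for Jun-Aug, and 3 for Sep-Nov.
    """
    return SEASONS[(d.month % 12) // 3]

def season_year(d):
    """Hypothetical convention: December counts toward the next year's winter."""
    return d.year + 1 if d.month == 12 else d.year

for d in [date(2025, 9, 19), date(2023, 12, 15), date(2024, 1, 10)]:
    print(d, meteorological_season(d), season_year(d))
```

With this convention, December 2023 and January 2024 land in the same winter bucket instead of being counted under two different years.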


## Question 1 - How does the frequency with which agencies collect samples affect the number of HPAI cases detected?

The goal is to check whether the data is biased by when, where, and how often agencies collect samples. Some agencies may collect more samples in a particular region because of proximity, or in a particular season because of staffing (among other reasons), which could skew the data.

Let's see which agencies collect the most samples. We'll look at the number of samples collected by each agency across each year and see if there's a pattern.


In [5]:

df3 = df2.groupby(['Submitting Agency', 'Collection Year']).size().reset_index(name='Samples Collected').sort_values(by=['Samples Collected'], ascending=False)
df3_pivot = df3.pivot_table(index='Submitting Agency', columns='Collection Year', values='Samples Collected', fill_value=0).sort_values(by=['2021','2022','2023','2024','2025'], ascending=False)
df3_pivot.head(n=20)

Submitting Agency,2021,2022,2023,2024,2025
NWDP,4.0,2939.0,1949.0,1950.0,916.0
Private (non-government) submission,0.0,833.0,191.0,303.0,91.0
SCWDS,0.0,256.0,90.0,65.0,4.0
CA DFW/CAHFS,0.0,218.0,113.0,72.0,40.0
NY DEC,0.0,159.0,66.0,77.0,145.0
MI DNR,0.0,145.0,14.0,6.0,81.0
UT DWR,0.0,112.0,33.0,5.0,1.0
FL FWCC,0.0,112.0,4.0,39.0,5.0
OR DFW,0.0,101.0,50.0,36.0,2.0
WA DFW,0.0,100.0,36.0,15.0,0.0


There are 126 agencies, so for brevity only the top-ranked ones are shown.

The NWDP (National Wildlife Disease Program) collects by far the most samples in every year shown. See https://www.aphis.usda.gov/national-wildlife-programs/nwdp for more info on the NWDP. Private (non-government) submissions are the next largest source from 2022 through 2024; in 2025, NY DEC (145) overtakes them (91).

Let's drill down into how many samples each agency collects during the seasons.

In [6]:
from IPython.display import display

df4 = df2.copy()

df4 = df4.groupby(['Submitting Agency', 'Collection Year', 'Season']).size().reset_index(name='Samples Collected').sort_values(by=['Samples Collected'], ascending=False)

# for each year, create a pivot table of agencies vs seasons
# do this for easier viewing
for year in sorted(df4['Collection Year'].unique()):
    year_df = df4.loc[df4['Collection Year'] == year]
    
    # only include years that have data for each season
    if len(year_df['Season'].unique()) < 4:
        continue

    year_pivot = year_df.pivot_table(index=['Submitting Agency'], columns='Season', values='Samples Collected', fill_value=0)
    year_pivot = year_pivot.sort_values(by=['Winter', 'Spring', 'Summer', 'Fall'], ascending=False)
    year_pivot['Collection Year'] = year

    display(year_pivot.head(n=10)) # pretty print the table


Submitting Agency,Fall,Spring,Summer,Winter,Collection Year
NWDP,1449.0,610.0,95.0,785.0,2022
Private (non-government) submission,236.0,350.0,101.0,146.0,2022
CA DFW/CAHFS,86.0,9.0,29.0,94.0,2022
SCWDS,124.0,51.0,38.0,43.0,2022
KS DWP/SCWDS,3.0,16.0,3.0,34.0,2022
MO Dept of Conservation,6.0,30.0,0.0,31.0,2022
PA Game Commission,9.0,17.0,21.0,27.0,2022
NY DEC,64.0,53.0,18.0,24.0,2022
CO Parks & Wildlife,12.0,20.0,2.0,24.0,2022
AR GFC/SCWDS,0.0,0.0,0.0,23.0,2022


Submitting Agency,Fall,Spring,Summer,Winter,Collection Year
NWDP,1162.0,43.0,233.0,511.0,2023
Private (non-government) submission,58.0,41.0,7.0,85.0,2023
SCWDS,4.0,12.0,6.0,68.0,2023
CA DFW/CAHFS,27.0,21.0,7.0,58.0,2023
CO Parks & Wildlife,5.0,5.0,3.0,32.0,2023
NY DEC,6.0,24.0,5.0,31.0,2023
USGS,5.0,3.0,1.0,27.0,2023
MT FWP,6.0,5.0,2.0,22.0,2023
WY GFD,12.0,5.0,1.0,20.0,2023
UT DWR,5.0,11.0,2.0,15.0,2023


Submitting Agency,Fall,Spring,Summer,Winter,Collection Year
NWDP,335.0,75.0,111.0,1429.0,2024
Private (non-government) submission,35.0,101.0,127.0,40.0,2024
CA DFW/CAHFS,2.0,5.0,32.0,33.0,2024
NY DEC,8.0,30.0,14.0,25.0,2024
SCWDS,0.0,40.0,1.0,24.0,2024
APHIS Wildlife Services,0.0,0.0,1.0,15.0,2024
KS DWP/SCWDS,0.0,0.0,0.0,12.0,2024
PA Game Commission,0.0,2.0,2.0,9.0,2024
USGS,0.0,3.0,28.0,6.0,2024
LA Co Dept of Public Health,0.0,2.0,6.0,6.0,2024


Submitting Agency,Fall,Spring,Summer,Winter,Collection Year
NWDP,87.0,143.0,49.0,637.0,2025
Private (non-government) submission,3.0,11.0,65.0,12.0,2025
MI DNR,0.0,0.0,71.0,10.0,2025
PA Game Commission,5.0,2.0,114.0,1.0,2025
CT VMDL,0.0,0.0,9.0,1.0,2025
APHIS Wildlife Services,0.0,0.0,3.0,1.0,2025
AL DAI,0.0,0.0,0.0,1.0,2025
TX PWD,0.0,0.0,0.0,1.0,2025
NJ DEP,8.0,4.0,29.0,0.0,2025
USGS,0.0,4.0,14.0,0.0,2025


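Only the top 10 agencies per year are shown for brevity.

The NWDP again collects the most samples in each year overall, although private submissions exceed it in some spring and summer seasons (for example, summer 2022: 101 vs. 95). Interestingly, the NWDP's sample counts peak in fall or winter in every year shown.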
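To put a number on the NWDP's seasonal pattern, here is a small sketch that recomputes its peak season and combined fall plus winter share per year. The counts are copied by hand from the NWDP rows of the tables above; the variable names are ours.

```python
# NWDP sample counts per season, copied from the tables above.
nwdp_by_season = {
    2022: {"Fall": 1449, "Spring": 610, "Summer": 95, "Winter": 785},
    2023: {"Fall": 1162, "Spring": 43, "Summer": 233, "Winter": 511},
    2024: {"Fall": 335, "Spring": 75, "Summer": 111, "Winter": 1429},
    2025: {"Fall": 87, "Spring": 143, "Summer": 49, "Winter": 637},
}

peak_season = {}
fall_winter_share = {}
for year, counts in nwdp_by_season.items():
    total = sum(counts.values())
    # Season with the largest count in this year.
    peak_season[year] = max(counts, key=counts.get)
    # Fraction of the year's NWDP samples that fall in fall or winter.
    fall_winter_share[year] = (counts["Fall"] + counts["Winter"]) / total
    print(f"{year}: peak={peak_season[year]}, "
          f"fall+winter share={fall_winter_share[year]:.2f}")
```

Keep in mind that `Season` is derived from the detection month while the tables are split by collection year, and that 2025 is a partial year.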

Finally, let's look at where agencies collect the most samples.

In [7]:
from IPython.display import display

df5 = df2.copy()

df5 = df5.groupby(['Submitting Agency', 'Collection Year', 'Region']).size().reset_index(name='Samples Collected').sort_values(by=['Samples Collected'], ascending=False)

# for each year, create a pivot table of agencies vs regions
# do this for easier viewing
for year in sorted(df5['Collection Year'].unique()):
    year_df = df5.loc[df5['Collection Year'] == year]

    if len(year_df['Region'].unique()) < 4:
        continue

    year_pivot = year_df.pivot_table(index=['Submitting Agency'], columns='Region', values='Samples Collected', fill_value=0)
    year_pivot = year_pivot.sort_values(by=['Northeast', 'Midwest', 'South', 'West'], ascending=False)
    year_pivot['Collection Year'] = year

    display(year_pivot.head(n=10)) # pretty print the table


Submitting Agency,Midwest,Northeast,South,West,Collection Year
NWDP,1074.0,397.0,842.0,626.0,2022
NY DEC,0.0,159.0,0.0,0.0,2022
Tufts University,0.0,78.0,0.0,0.0,2022
Private (non-government) submission,333.0,77.0,271.0,152.0,2022
PA Game Commission,0.0,74.0,0.0,0.0,2022
MA DFW/USGS,0.0,14.0,0.0,0.0,2022
USFWS/SCWDS,1.0,9.0,9.0,0.0,2022
NJ DEP,0.0,6.0,0.0,0.0,2022
CT DEEP,0.0,4.0,0.0,0.0,2022
ME DIFW/USGS,0.0,4.0,0.0,0.0,2022


Submitting Agency,Midwest,Northeast,South,West,Collection Year
NWDP,731.0,106.0,512.0,600.0,2023
NY DEC,0.0,66.0,0.0,0.0,2023
Private (non-government) submission,27.0,38.0,34.0,92.0,2023
PA Game Commission,0.0,26.0,0.0,0.0,2023
Tufts University,0.0,14.0,1.0,0.0,2023
SCWDS/USFWS National Eagle Repository,10.0,5.0,15.0,10.0,2023
ME DIFW,0.0,5.0,0.0,0.0,2023
USGS,19.0,2.0,1.0,14.0,2023
CT DEEP,0.0,2.0,0.0,0.0,2023
SCWDS,33.0,0.0,53.0,4.0,2023


Submitting Agency,Midwest,Northeast,South,West,Collection Year
NWDP,364.0,200.0,708.0,678.0,2024
Private (non-government) submission,48.0,123.0,81.0,51.0,2024
NY DEC,0.0,76.0,0.0,1.0,2024
Tufts University,0.0,40.0,0.0,0.0,2024
PA Game Commission,0.0,13.0,0.0,0.0,2024
NJ DEP,0.0,5.0,0.0,0.0,2024
SCWDS/USFWS National Eagle Repository,2.0,4.0,6.0,1.0,2024
USGS,11.0,2.0,10.0,14.0,2024
ME DIFW,0.0,2.0,0.0,0.0,2024
CT DEEP,0.0,1.0,0.0,0.0,2024


Submitting Agency,Midwest,Northeast,South,West,Collection Year
NWDP,205.0,441.0,155.0,115.0,2025
Tufts University,0.0,207.0,2.0,0.0,2025
NY DEC,0.0,145.0,0.0,0.0,2025
PA Game Commission,0.0,122.0,0.0,0.0,2025
NJ DEP,0.0,41.0,0.0,0.0,2025
Private (non-government) submission,24.0,30.0,23.0,14.0,2025
Cornell University,0.0,23.0,0.0,0.0,2025
CT VMDL,0.0,10.0,0.0,0.0,2025
PA Department of Ag,0.0,8.0,0.0,0.0,2025
NJ Dept of Ag,0.0,4.0,0.0,0.0,2025


Some key observations:
- NWDP generally collects the most samples across all years, seasons, and regions.
- NWDP seems to collect samples most frequently in the winter and fall seasons.
- From 2022 through 2024, NWDP collects fewer samples in the Northeast than in the Midwest, South, and West, where its counts are higher and broadly similar to one another.
    - The other agencies shown collect mostly in the Northeast, but the regional tables are sorted by Northeast count first, so the top 10 favors Northeast-focused agencies by construction.
    - 2025 is an exception: so far the NWDP has collected more samples in the Northeast (441) than in any other region.

These observations suggest the following:
1. NWDP has the most impact on detections overall.
2. There may be seasonal bias in the data due to the NWDP's collection patterns.
3. There may be regional bias in the data due to the NWDP's smaller number of samples collected in the Northeast region.
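As a rough check on the regional point, the sketch below computes the Northeast's share of NWDP samples per year. The counts are copied by hand from the NWDP rows of the regional tables above; the variable names are ours.

```python
# NWDP sample counts per region, copied from the tables above.
nwdp_by_region = {
    2022: {"Midwest": 1074, "Northeast": 397, "South": 842, "West": 626},
    2023: {"Midwest": 731, "Northeast": 106, "South": 512, "West": 600},
    2024: {"Midwest": 364, "Northeast": 200, "South": 708, "West": 678},
    2025: {"Midwest": 205, "Northeast": 441, "South": 155, "West": 115},
}

northeast_share = {}
top_region = {}
for year, counts in nwdp_by_region.items():
    total = sum(counts.values())
    # Fraction of the year's NWDP samples collected in the Northeast.
    northeast_share[year] = counts["Northeast"] / total
    # Region with the largest NWDP count in this year.
    top_region[year] = max(counts, key=counts.get)
    print(f"{year}: Northeast share={northeast_share[year]:.2f}, "
          f"top region={top_region[year]}")
```

If the four regions were sampled evenly, each would sit near 25%, so a share well below that in 2022 through 2024 and well above it in 2025 is consistent with the observations listed above.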

## Question 2 - Is there a correlation between geographic region and season, and number of HPAI cases?

The goal is to assess how region and season affect the number of HPAI cases, if at all.

Let's see which regions have the most cases overall across all years.

In [15]:
df6 = df2.copy()
df6 = df6.groupby(['Region', 'Detection Year']).size().reset_index(name='HPAI Cases').sort_values(by=['HPAI Cases'], ascending=False)
df6_pivot = df6.pivot_table(index='Region', columns='Detection Year', values='HPAI Cases', fill_value=0)
df6_pivot['Total'] = df6_pivot.sum(axis=1)
df6_pivot

Region,2022,2023,2024,2025,Total
Midwest,2039.0,859.0,567.0,637.0,4102.0
Northeast,800.0,273.0,471.0,1092.0,2636.0
South,1463.0,582.0,990.0,492.0,3527.0
West,1619.0,1083.0,1026.0,504.0,4232.0


Recall that the NWDP collects the most samples overall and, except in 2025, collects fewer of them in the Northeast than elsewhere. This may explain why the Northeast has the fewest cases overall (2,636, versus 3,527 to 4,232 for the other regions).
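One way to test whether the regional mix of cases really differs from year to year is a chi-square test of independence on the region-by-year table. The sketch below computes the statistic by hand from the counts above and compares it with 16.919, the standard 5% critical value for 9 degrees of freedom. It assumes detections are independent, which clustered mortality events probably violate, so treat the result as indicative only.

```python
# HPAI cases per region for detection years 2022-2025, copied from the table above.
years = [2022, 2023, 2024, 2025]
cases = {
    "Midwest": [2039, 859, 567, 637],
    "Northeast": [800, 273, 471, 1092],
    "South": [1463, 582, 990, 492],
    "West": [1619, 1083, 1026, 504],
}

row_totals = {region: sum(counts) for region, counts in cases.items()}
col_totals = [sum(counts[i] for counts in cases.values()) for i in range(len(years))]
grand_total = sum(row_totals.values())

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_square = 0.0
for region, counts in cases.items():
    for i, observed in enumerate(counts):
        expected = row_totals[region] * col_totals[i] / grand_total
        chi_square += (observed - expected) ** 2 / expected

dof = (len(cases) - 1) * (len(years) - 1)
CRITICAL_5_PERCENT_DOF_9 = 16.919  # standard chi-square table value

print(f"chi-square={chi_square:.1f}, dof={dof}")
print("region and year look dependent:", chi_square > CRITICAL_5_PERCENT_DOF_9)
```

A large statistic here says the regional distribution shifts across years, but it cannot separate a real change in disease occurrence from a change in where samples were collected.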

Let's see how many cases each region has in each season. Given the NWDP's collection pattern observed above, we would expect fall and winter to have the most cases.

In [18]:
df7 = df2.copy()
df7 = df7.groupby(['Region', 'Detection Year', 'Season']).size().reset_index(name='HPAI Cases').sort_values(by=['HPAI Cases'], ascending=False)
df7_pivot = df7.pivot_table(index=['Region', 'Season'], columns='Detection Year', values='HPAI Cases', fill_value=0)
df7_pivot['Total'] = df7_pivot.sum(axis=1)
df7_pivot


Region,Season,2022,2023,2024,2025,Total
Midwest,Fall,688.0,510.0,132.0,67.0,1397.0
Midwest,Spring,1039.0,82.0,47.0,99.0,1267.0
Midwest,Summer,137.0,14.0,143.0,290.0,584.0
Midwest,Winter,175.0,253.0,245.0,181.0,854.0
Northeast,Fall,308.0,22.0,35.0,16.0,381.0
Northeast,Spring,165.0,91.0,153.0,89.0,498.0
Northeast,Summer,140.0,4.0,107.0,580.0,831.0
Northeast,Winter,187.0,156.0,176.0,407.0,926.0
South,Fall,586.0,48.0,333.0,5.0,972.0
South,Spring,209.0,71.0,50.0,48.0,378.0
