# Phase 1 - Initial Exploration

Goals:
- Load data
- Gather basic info (shape, column names, data types, etc.)
- Generate summary statistics
- Identify and understand distribution of key variables
- Establish 3 questions to answer w/ data

Deliverables/Outcomes:
- Quarto slide deck (5-10 slides) summarizing findings

## Load Data

We are working with the USDA's Detections of Highly Pathogenic Avian Influenza (HPAI) in Wild Birds dataset. It is a tabular CSV file of confirmed HPAI detections in birds in the US. Records go back to 2022, and the dataset is updated regularly.

In [14]:
import pandas as pd

dataset_path = '../data/HPAI Detections in Wild Birds.csv'
hpai_data = pd.read_csv(dataset_path)

print(hpai_data.head(n=5)) # view the first 5 rows of the dataset


          State    County Collection Date Date Detected HPAI Strain  \
0  North Dakota      Cass       9/12/2025     9/19/2025       EA H5   
1  Pennsylvania     Bucks        9/8/2025     9/19/2025       EA H5   
2  Pennsylvania  Delaware        9/4/2025     9/19/2025       EA H5   
3    New Jersey    Warren       9/11/2025     9/19/2025       EA H5   
4    New Jersey    Warren       9/11/2025     9/19/2025       EA H5   

    Bird Species WOAH Classification      Sampling Method   Submitting Agency  
0   Canada goose           Wild bird  Morbidity/Mortality    ND Game and Fish  
1  Black vulture           Wild bird  Morbidity/Mortality  PA Game Commission  
2  Black vulture           Wild bird  Morbidity/Mortality  PA Game Commission  
3  Black vulture           Wild bird  Morbidity/Mortality              NJ DEP  
4  Black vulture           Wild bird  Morbidity/Mortality              NJ DEP  


## Gather Basic Info

- Shape (number of rows and columns)
- Column names and data types
- Possible values and their meanings

In [15]:
num_rows = len(hpai_data)

print("number of rows:", num_rows)

num_columns = len(hpai_data.columns)

print("number of columns:", num_columns)

print("column names: ", hpai_data.columns)

print("States: ", hpai_data.State.unique())

print("HPAI Strains: ", hpai_data["HPAI Strain"].unique())

print("Species: ", hpai_data["Bird Species"].unique())

print("WOAH Classifications: ", hpai_data["WOAH Classification"].unique())

print("Sampling Methods: ", hpai_data["Sampling Method"].unique())

number of rows: 14497
number of columns: 9
column names:  Index(['State', 'County', 'Collection Date', 'Date Detected', 'HPAI Strain',
       'Bird Species', 'WOAH Classification', 'Sampling Method',
       'Submitting Agency'],
      dtype='object')
States:  ['North Dakota' 'Pennsylvania' 'New Jersey' 'Iowa' 'Wyoming' 'Wisconsin'
 'Illinois' 'North Carolina' 'Montana' 'Minnesota' 'Idaho' 'Maryland'
 'Colorado' 'New Hampshire' 'Alaska' 'Ohio' 'Utah' 'New York' 'Vermont'
 'Rhode Island' 'Connecticut' 'Virginia' 'Nebraska' 'South Dakota' 'Maine'
 'Massachusetts' 'Michigan' 'Florida' 'Missouri' 'Nevada' 'Louisiana'
 'California' 'Arkansas' 'Georgia' 'West Virginia' 'Indiana' 'Kentucky'
 'Alabama' 'Oregon' 'Texas' 'Oklahoma' 'Arizona' 'Tennessee' 'Washington'
 'Mississippi' 'Delaware' 'South Carolina' 'Kansas' 'New Mexico' 'Hawaii'
 'DC']
HPAI Strains:  ['EA H5' 'EA/AM H5N1' 'EA H5N5' 'EA H5N1' 'EA/AM H5 mixed'
 'EA/AM H5N1 mixed' 'EA H5 mixed' 'EA/AM H5' nan 'EA H5N1 mixed'
 'EA/AM H5N2' 

In [16]:
# 'Wild Bird' and 'Wild bird' are the same category with inconsistent capitalization, so merge them into 'Wild bird'.
# The nested dict limits the replacement to the WOAH Classification column.

df1 = hpai_data.replace({"WOAH Classification": {"Wild Bird": "Wild bird"}})

print("WOAH Classifications: ", df1["WOAH Classification"].unique())

WOAH Classifications:  ['Wild bird' 'Captive wild bird']
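
The goals also call for data types and a look at missing values. A minimal sketch of both, run on a few illustrative rows rather than the real file: the `sample` frame and its values are stand-ins, and on the real data the same calls would presumably go on `df1`.

```python
import pandas as pd

# Toy stand-in for the real data (illustrative rows, not read from the CSV).
sample = pd.DataFrame({
    "State": ["North Dakota", "Pennsylvania", "New Jersey"],
    "Collection Date": ["9/12/2025", "Unknown", "9/11/2025"],
    "HPAI Strain": ["EA H5", None, "EA H5"],
    "WOAH Classification": ["Wild bird", "Wild Bird", "Captive wild bird"],
})

# Data type of every column.
print(sample.dtypes)

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Placeholder strings such as "Unknown" are not counted as missing, so check them separately.
unknown_dates = (sample["Collection Date"] == "Unknown").sum()
print("Unknown collection dates:", unknown_dates)

# Distinct values per column; an unexpectedly high count can point to near-duplicate categories.
distinct_per_column = sample.nunique()
print(distinct_per_column)
```

Here `WOAH Classification` reports three distinct values only because 'Wild bird' and 'Wild Bird' differ in capitalization, which is the situation the cell above cleans up.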


## Generate Summary Statistics



In [17]:
df1.describe()

| | State | County | Collection Date | Date Detected | HPAI Strain | Bird Species | WOAH Classification | Sampling Method | Submitting Agency |
|---|---|---|---|---|---|---|---|---|---|
| count | 14497 | 14497 | 14497 | 14497 | 14496 | 14497 | 14497 | 14497 | 14497 |
| unique | 51 | 974 | 1017 | 610 | 12 | 236 | 2 | 5 | 127 |
| top | Florida | Brevard | Unknown | 10/25/2022 | EA/AM H5N1 | Mallard | Wild bird | Morbidity/Mortality | NWDP |
| freq | 844 | 434 | 204 | 375 | 6479 | 2548 | 13446 | 7548 | 7758 |


In [18]:
df1.mode()

| | State | County | Collection Date | Date Detected | HPAI Strain | Bird Species | WOAH Classification | Sampling Method | Submitting Agency |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Florida | Brevard | Unknown | 10/25/2022 | EA/AM H5N1 | Mallard | Wild bird | Morbidity/Mortality | NWDP |


## Identify Distributions of Key Variables

County and Submitting Agency are probably the least important variables. Collection Date is also of limited use: it records when the bird was sampled, not whether or when it tested positive, so we can't infer much from it.

In [19]:
# let's get the top 10 states with the most cases across all years

df1['State'].value_counts(ascending=False).head(n=10)

State
Florida          844
California       830
Minnesota        810
New York         701
Oregon           571
Michigan         535
Massachusetts    520
Washington       510
North Dakota     490
Alaska           436
Name: count, dtype: int64

In [20]:
# let's see which months have the most cases (detections) across all years
# since we only care about month and year, we'll split the Date Detected column into month, day, and year columns (as strings) and then drop the day
# we'll also drop the Collection Date column since we don't need it

df1[['Month Detected', 'Day Detected', 'Year Detected']] = df1['Date Detected'].str.split('/', expand=True)
df2 = df1.drop(columns=['Collection Date', 'Day Detected', 'Date Detected']) # drop the columns we don't need anymore
df2 = df2[['State','County','Year Detected','Month Detected','HPAI Strain','Bird Species','WOAH Classification','Sampling Method','Submitting Agency']] # reorder the columns
print(df2.head(n=5))

df2['Month Detected'].value_counts(ascending=False).head(n=12)


          State    County Year Detected Month Detected HPAI Strain  \
0  North Dakota      Cass          2025              9       EA H5   
1  Pennsylvania     Bucks          2025              9       EA H5   
2  Pennsylvania  Delaware          2025              9       EA H5   
3    New Jersey    Warren          2025              9       EA H5   
4    New Jersey    Warren          2025              9       EA H5   

    Bird Species WOAH Classification      Sampling Method   Submitting Agency  
0   Canada goose           Wild bird  Morbidity/Mortality    ND Game and Fish  
1  Black vulture           Wild bird  Morbidity/Mortality  PA Game Commission  
2  Black vulture           Wild bird  Morbidity/Mortality  PA Game Commission  
3  Black vulture           Wild bird  Morbidity/Mortality              NJ DEP  
4  Black vulture           Wild bird  Morbidity/Mortality              NJ DEP  


Month Detected
12    2220
11    1857
1     1391
10    1316
9     1309
2     1232
6     1179
5      970
3      838
4      828
7      827
8      530
Name: count, dtype: int64
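
The counts above are ordered by frequency, and the month values are strings because they come from `str.split`. As an alternative sketch, the dates could be parsed with `pd.to_datetime` and the months listed in calendar order. The `sample` frame below is illustrative toy data, not the real file.

```python
import pandas as pd

# Toy stand-in for the Date Detected column (illustrative values).
sample = pd.DataFrame({
    "Date Detected": ["9/19/2025", "9/19/2025", "10/25/2022", "12/1/2023"],
})

# Parse month/day/year strings; values that do not match become NaT instead of raising.
detected = pd.to_datetime(sample["Date Detected"], format="%m/%d/%Y", errors="coerce")

# Count detections per month, then sort by month number (1-12) rather than by count.
month_counts = detected.dt.month.value_counts().sort_index()
print(month_counts)

# How many dates failed to parse (worth checking before trusting the counts).
print("unparsed dates:", detected.isna().sum())
```

Working with real datetimes also makes it easy to pull out the year with `detected.dt.year` instead of keeping a separate string column.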

In [None]:
# TODO: let's see which years have the most cases
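
One possible way to fill in this TODO, sketched on toy data. The `sample` frame is a stand-in for `df2`, whose `Year Detected` values are strings, so the sketch converts them to integers before counting.

```python
import pandas as pd

# Toy stand-in for df2["Year Detected"] (illustrative values stored as strings).
sample = pd.DataFrame({
    "Year Detected": ["2025", "2025", "2022", "2023", "2022", "2022"],
})

# Convert to integers, count rows per year, and list the years chronologically.
year_counts = sample["Year Detected"].astype(int).value_counts().sort_index()
print(year_counts)

# Year with the most detections.
busiest_year = year_counts.idxmax()
print("busiest year:", busiest_year)
```

Since the dataset is still being updated, the most recent year is likely to be incomplete, which matters when comparing it against earlier years.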


In [None]:
# TODO: let's see which bird species have the most cases across all years
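
A sketch of how this TODO might be approached, again on toy data standing in for `df2`. It reports each species' share of all detections next to the raw count, which makes a long-tailed distribution easier to read.

```python
import pandas as pd

# Toy stand-in for df2["Bird Species"] (illustrative values).
sample = pd.DataFrame({
    "Bird Species": ["Mallard", "Canada goose", "Mallard",
                     "Black vulture", "Black vulture", "Mallard"],
})

# Raw counts and the fraction of all rows per species.
species_counts = sample["Bird Species"].value_counts()
species_share = sample["Bird Species"].value_counts(normalize=True)

# Combine both into one table and keep the 10 most frequent species.
top_species = pd.DataFrame({
    "count": species_counts,
    "share": species_share.round(3),
}).head(10)
print(top_species)
```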


In [23]:
# TODO: let's see which HPAI strains are most commonly detected across all years
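
A sketch for this TODO on toy data standing in for `df2`. The summary statistics show one fewer `HPAI Strain` value than rows, so the sketch passes `dropna=False` to keep the missing entry visible in the counts.

```python
import pandas as pd

# Toy stand-in for df2["HPAI Strain"] (illustrative values, one of them missing).
sample = pd.DataFrame({
    "HPAI Strain": ["EA/AM H5N1", "EA H5", "EA/AM H5N1", None, "EA/AM H5N1", "EA H5N5"],
})

# Count each strain; dropna=False adds a row for missing values instead of silently skipping them.
strain_counts = sample["HPAI Strain"].value_counts(dropna=False)
print(strain_counts)

# Number of rows with no strain recorded.
missing_strains = strain_counts[strain_counts.index.isna()].sum()
print("rows without a strain:", missing_strains)
```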


In [26]:
# let's see the frequency of cases in each WOAH Classification category across all years

df2['WOAH Classification'].value_counts(ascending=False)

WOAH Classification
Wild bird            13446
Captive wild bird     1051
Name: count, dtype: int64

## Establish 3 Main Questions

1. TODO
2. TODO
3. TODO