# Phase 1 - Initial Exploration

Goals:
- Load data
- Gather basic info (shape, column names, data types, etc.)
- Generate summary statistics
- Identify and understand distribution of key variables
- Establish 3 questions to answer with the data

Deliverables/Outcomes:
- Quarto slide deck (5-10 slides) summarizing findings

## Load Data

We are working with the USDA's Detections of Highly Pathogenic Avian Influenza (HPAI) in Wild Birds dataset. This tabular CSV dataset records confirmed HPAI detections in birds in the US. The data goes back to 2022 and is updated regularly.

In [2]:
import pandas as pd

dataset_path = '../data/HPAI Detections in Wild Birds.csv'
hpai_data = pd.read_csv(dataset_path)

print(hpai_data.head(n=5)) # view the first 5 rows of the dataset


          State    County Collection Date Date Detected HPAI Strain  \
0  North Dakota      Cass       9/12/2025     9/19/2025       EA H5   
1  Pennsylvania     Bucks        9/8/2025     9/19/2025       EA H5   
2  Pennsylvania  Delaware        9/4/2025     9/19/2025       EA H5   
3    New Jersey    Warren       9/11/2025     9/19/2025       EA H5   
4    New Jersey    Warren       9/11/2025     9/19/2025       EA H5   

    Bird Species WOAH Classification      Sampling Method   Submitting Agency  
0   Canada goose           Wild bird  Morbidity/Mortality    ND Game and Fish  
1  Black vulture           Wild bird  Morbidity/Mortality  PA Game Commission  
2  Black vulture           Wild bird  Morbidity/Mortality  PA Game Commission  
3  Black vulture           Wild bird  Morbidity/Mortality              NJ DEP  
4  Black vulture           Wild bird  Morbidity/Mortality              NJ DEP  
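The `Collection Date` and `Date Detected` columns are read in as plain strings, and the summary statistics later in this notebook show that `Collection Date` sometimes holds the placeholder `Unknown`. A minimal sketch of converting them to datetimes follows. It assumes the `M/D/YYYY` layout seen in the rows above; `sample` is a small stand-in frame, and its `Unknown` row is illustrative rather than copied from the dataset.

```python
import pandas as pd

# Stand-in for hpai_data: two rows copied from the head() output above,
# plus one illustrative row that uses the "Unknown" placeholder.
sample = pd.DataFrame({
    "Collection Date": ["9/12/2025", "9/8/2025", "Unknown"],
    "Date Detected": ["9/19/2025", "9/19/2025", "9/19/2025"],
})

for col in ["Collection Date", "Date Detected"]:
    # errors="coerce" turns entries that do not match the format into NaT
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y", errors="coerce")

print(sample.dtypes)
print("unparseable collection dates:", sample["Collection Date"].isna().sum())
```

Coercing to `NaT` keeps the rows in the frame, so the unknown dates can be counted or filtered explicitly instead of disappearing silently.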


## Gather Basic Info

- Shape (number of rows and columns)
- Column names and data types
- Possible values and their meanings

In [12]:
num_rows = len(hpai_data)

print("number of rows:", num_rows)

num_columns = len(hpai_data.columns)

print("number of columns:", num_columns)

print("column names: ", hpai_data.columns)

print("States: ", hpai_data.State.unique())

print("HPAI Strains: ", hpai_data["HPAI Strain"].unique())

print("Species: ", hpai_data["Bird Species"].unique())

print("WOAH Classifications: ", hpai_data["WOAH Classification"].unique())

print("Sampling Methods: ", hpai_data["Sampling Method"].unique())

number of rows: 14497
number of columns: 9
column names:  Index(['State', 'County', 'Collection Date', 'Date Detected', 'HPAI Strain',
       'Bird Species', 'WOAH Classification', 'Sampling Method',
       'Submitting Agency'],
      dtype='object')
States:  ['North Dakota' 'Pennsylvania' 'New Jersey' 'Iowa' 'Wyoming' 'Wisconsin'
 'Illinois' 'North Carolina' 'Montana' 'Minnesota' 'Idaho' 'Maryland'
 'Colorado' 'New Hampshire' 'Alaska' 'Ohio' 'Utah' 'New York' 'Vermont'
 'Rhode Island' 'Connecticut' 'Virginia' 'Nebraska' 'South Dakota' 'Maine'
 'Massachusetts' 'Michigan' 'Florida' 'Missouri' 'Nevada' 'Louisiana'
 'California' 'Arkansas' 'Georgia' 'West Virginia' 'Indiana' 'Kentucky'
 'Alabama' 'Oregon' 'Texas' 'Oklahoma' 'Arizona' 'Tennessee' 'Washington'
 'Mississippi' 'Delaware' 'South Carolina' 'Kansas' 'New Mexico' 'Hawaii'
 'DC']
HPAI Strains:  ['EA H5' 'EA/AM H5N1' 'EA H5N5' 'EA H5N1' 'EA/AM H5 mixed'
 'EA/AM H5N1 mixed' 'EA H5 mixed' 'EA/AM H5' nan 'EA H5N1 mixed'
 'EA/AM H5N2' 
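The cell above reports row and column counts and the unique values, but not the data types listed in the goals. A minimal sketch of getting shape, data types, and distinct-value counts in a few calls is shown here. `sample` is a small stand-in built from rows in the `head()` output; with the real data the same calls would be made on `hpai_data`.

```python
import pandas as pd

# Stand-in for hpai_data: two rows copied from the head() output above.
sample = pd.DataFrame({
    "State": ["North Dakota", "Pennsylvania"],
    "County": ["Cass", "Bucks"],
    "HPAI Strain": ["EA H5", "EA H5"],
    "Bird Species": ["Canada goose", "Black vulture"],
})

num_rows, num_columns = sample.shape  # (rows, columns) in a single call
print("shape:", (num_rows, num_columns))

print(sample.dtypes)     # data type of each column
print(sample.nunique())  # number of distinct non-missing values per column
```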

In [None]:
# Merge the 'Wild Bird' label into 'Wild bird': they are the same category and differ only in capitalization.

df1 = hpai_data.copy()
df1["WOAH Classification"] = df1["WOAH Classification"].replace("Wild Bird", "Wild bird")

print("WOAH Classifications: ", df1["WOAH Classification"].unique())

WOAH Classifications:  ['Wild bird' 'Captive wild bird']
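One way to confirm that a label cleanup like this only merges categories and does not drop rows is to compare `value_counts()` before and after. The sketch below uses a hypothetical four-row `sample`; the counts are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample containing both spellings of the same category.
sample = pd.DataFrame({
    "WOAH Classification": ["Wild bird", "Wild Bird", "Captive wild bird", "Wild bird"],
})

before = sample["WOAH Classification"].value_counts()

cleaned = sample.copy()
cleaned["WOAH Classification"] = cleaned["WOAH Classification"].replace(
    "Wild Bird", "Wild bird"
)
after = cleaned["WOAH Classification"].value_counts()

print(before)
print(after)

# The total number of rows is unchanged; only the labels were merged.
print("rows before:", before.sum(), "rows after:", after.sum())
```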


## Generate Summary Statistics



In [16]:
df1.describe()

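|        | State   | County  | Collection Date | Date Detected | HPAI Strain | Bird Species | WOAH Classification | Sampling Method     | Submitting Agency |
|--------|---------|---------|-----------------|---------------|-------------|--------------|---------------------|---------------------|-------------------|
| count  | 14497   | 14497   | 14497           | 14497         | 14496       | 14497        | 14497               | 14497               | 14497             |
| unique | 51      | 974     | 1017            | 610           | 12          | 236          | 2                   | 5                   | 127               |
| top    | Florida | Brevard | Unknown         | 10/25/2022    | EA/AM H5N1  | Mallard      | Wild bird           | Morbidity/Mortality | NWDP              |
| freq   | 844     | 434     | 204             | 375           | 6479        | 2548         | 13446               | 7548                | 7758              |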
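Two details in this summary point to incomplete records: `HPAI Strain` has a count of 14496 against 14497 rows, so one value is missing, and the most frequent `Collection Date` is the placeholder `Unknown` (204 rows). Because `describe()` counts a placeholder string as a valid value, it can help to tally both kinds of gap separately. The following is a minimal sketch on a hypothetical three-row `sample`, not the real data.

```python
import pandas as pd

# Hypothetical sample: one "Unknown" collection date and one missing strain.
sample = pd.DataFrame({
    "Collection Date": ["9/12/2025", "Unknown", "9/4/2025"],
    "HPAI Strain": ["EA H5", "EA/AM H5N1", None],
})

missing = sample.isna().sum()          # true missing values per column
unknown = (sample == "Unknown").sum()  # "Unknown" placeholders per column

print(pd.DataFrame({"missing": missing, "unknown": unknown}))
```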


## Identifying Distributions of Key Variables