# Analysis of the White-fronted Goose (Anser albifrons) and Associated Subspecies

Author: Waley Wang


In [None]:
#
# Import necessary libraries and do necessary non-DataFrame prep work
#

import numpy as np
import pandas as pd

import warnings

import GooseUtils


# Suppress the dtype warning raised on the first import of the CSV file
warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)

Import and trim relevant data. This notebook focuses only on data related to Anser albifrons (species id: 1710) and its subspecies Anser albifrons elgasi (species id: 1719). This corresponds to $222{,}322$ rows of data.

In [None]:
#
# Retrieve the data from the csv file and filter out irrelevant species
#

# Retrieve group 1 data from the relevant CSV file
goose_data_raw = pd.read_csv('Bird_Banding_Data/NABBP_2023_grp_01.csv')

# Filter out all irrelevant species
goose_data_raw = goose_data_raw[goose_data_raw['SPECIES_ID'].isin([1710, 1719])]

In [None]:
#
# Get all relevant columns and display basic information about the data
#

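# Retrieve all relevant columns (the set shown in the output below).
# Copy so that later column assignments modify an independent DataFrame
# rather than a slice of goose_data_raw.
goose_data = goose_data_raw[['RECORD_ID',
                             'EVENT_TYPE',
                             'BAND',
                             'ORIGINAL_BAND',
                             'OTHER_BANDS',
                             'EVENT_DATE',
                             'EVENT_DAY',
                             'EVENT_MONTH',
                             'EVENT_YEAR',
                             'ISO_COUNTRY',
                             'ISO_SUBDIVISION',
                             'LAT_DD',
                             'LON_DD',
                             'COORD_PREC',
                             'BIRD_STATUS',
                             'PERMIT',
                             'BAND_STATUS',
                             'BAND_TYPE']].copy()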

# Display number of non-null entries in each column
display(goose_data.count())

RECORD_ID          222322
EVENT_TYPE         222322
BAND               222322
ORIGINAL_BAND      222322
OTHER_BANDS           900
EVENT_DATE         222322
EVENT_DAY          222322
EVENT_MONTH        222322
EVENT_YEAR         222322
ISO_COUNTRY        222322
ISO_SUBDIVISION    220769
LAT_DD             222099
LON_DD             222099
COORD_PREC         222263
BIRD_STATUS        187292
PERMIT             187344
BAND_STATUS        185695
BAND_TYPE          222322
dtype: int64

A large number of the date cells (roughly 5,000) cannot be parsed by `pd.to_datetime()`. Since dates are vital for the analysis, the cell below cleans them and then removes the columns that are no longer needed. Dates are chosen as follows:

1. If the `'EVENT_DATE'` column already holds a date that `pd.to_datetime()` can parse, that date is used.
2. Otherwise, if the `'EVENT_DAY'`, `'EVENT_MONTH'`, and `'EVENT_YEAR'` columns together form a date that `pd.to_datetime()` can parse, that date is used.
3. If neither of the above works, `NaT` is assigned and the row is dropped.

In [52]:
#
# Clean time-related columns as described above.
#

# Attempt to apply pd.to_datetime() to the EVENT_DATE column.
goose_data['EVENT_DATE'] = pd.to_datetime(goose_data['EVENT_DATE'], format='%m/%d/%Y', errors='coerce')

# Assemble date guesses from the EVENT_MONTH, EVENT_DAY, and EVENT_YEAR columns.
dates_from_columns = pd.to_datetime(
    goose_data['EVENT_MONTH'].apply(str) + '/'
    + goose_data['EVENT_DAY'].apply(str) + '/'
    + goose_data['EVENT_YEAR'].apply(str),
    format='%m/%d/%Y',
    errors='coerce',
)

# Fill in all NaT values that can be filled with the guesses from the previous line.
goose_data['EVENT_DATE'] = goose_data['EVENT_DATE'].fillna(dates_from_columns)

# Remove all rows where EVENT_DATE is still NaT after the above operations.
goose_data = goose_data[goose_data['EVENT_DATE'].notna()]

# Drop the EVENT_MONTH, EVENT_DAY, and EVENT_YEAR columns as they are no longer needed.
goose_data = goose_data.drop(labels=['EVENT_MONTH', 'EVENT_DAY', 'EVENT_YEAR'], axis=1)
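
The fallback logic above can be checked on a small toy frame. This is only a sketch: the values below are made up for illustration and are not taken from the banding data.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the date columns used above.
toy = pd.DataFrame({
    "EVENT_DATE":  ["09/25/1962", "99/99/2005", "13/40/1999"],
    "EVENT_DAY":   [25, 28, 40],
    "EVENT_MONTH": [9, 4, 13],
    "EVENT_YEAR":  [1962, 2005, 1999],
})

# Rule 1: parse EVENT_DATE directly; unparseable strings become NaT.
primary = pd.to_datetime(toy["EVENT_DATE"], format="%m/%d/%Y", errors="coerce")

# Rule 2: build a fallback date from the separate month/day/year columns.
fallback = pd.to_datetime(
    toy["EVENT_MONTH"].astype(str) + "/"
    + toy["EVENT_DAY"].astype(str) + "/"
    + toy["EVENT_YEAR"].astype(str),
    format="%m/%d/%Y",
    errors="coerce",
)

# Rule 3: rows where both attempts fail stay NaT and are dropped.
toy["EVENT_DATE"] = primary.fillna(fallback)
toy = toy[toy["EVENT_DATE"].notna()]
print(toy["EVENT_DATE"].tolist())
```

The first row keeps its original date, the second is recovered from the separate columns, and the third is dropped because neither source yields a valid date.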

Location data is also vital for the analysis, so a bit of cleaning is needed.

First, rows fitting any of the following conditions will be excluded:
1. Rows that lack a value for either `LAT_DD` or `LON_DD`, because this issue cannot be rectified.
2. Rows whose `COORD_PREC` value is `8`, `12`, `18`, `28`, `33`, `38`, or `72`, because the associated uncertainty either cannot be determined or is too large to be useful (this corresponds to $\sim 228$ entries).

In [None]:
#
# Clean the coordinates columns as described above.
#

# Filter out all rows where LAT_DD or LON_DD are NaN. Cannot rectify rows with this issue.
goose_data = goose_data[goose_data['LAT_DD'].notna() & goose_data['LON_DD'].notna()]

# Filter out all rows with unusable or useless coordinate precision values as outlined above.
goose_data = goose_data[~goose_data['COORD_PREC'].isin([8, 12, 18, 28, 33, 38, 72])]

Unnamed: 0,RECORD_ID,EVENT_TYPE,BAND,ORIGINAL_BAND,OTHER_BANDS,EVENT_DATE,ISO_COUNTRY,ISO_SUBDIVISION,LAT_DD,LON_DD,COORD_PREC,BIRD_STATUS,PERMIT,BAND_STATUS,BAND_TYPE
1490311,-42397131,B,B49465071455,B49465071455,,1962-09-25,CA,CA-AB,53.50000,-110.50000,60.0,7.0,P3939169,0.0,11
1490312,11845018,B,B48215946459,B48215946459,,2019-02-16,CA,CA-BC,48.50000,-123.50000,60.0,7.0,P3836474,0.0,01
1490313,-74968635,B,B49449673857,B49449673857,,1960-07-16,CA,CA-NT,69.50000,-129.50000,60.0,3.0,P9951575,0.0,11
1490314,-74968634,B,B79449673814,B79449673814,,1960-07-16,CA,CA-NT,69.50000,-129.50000,60.0,3.0,P9951575,0.0,11
1490315,-74968633,B,B79449673754,B79449673754,,1960-07-16,CA,CA-NT,69.50000,-129.50000,60.0,3.0,P9951575,0.0,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1712628,947925,E,B48739846377,B48739846377,,2022-11-12,US,US-OR,42.95154,-120.76332,11.0,,,,W1
1712629,-12449252,E,B19009393199,B19009393199,,2005-04-28,US,US-WA,47.08333,-124.08333,10.0,,,,01
1712630,-8806747,E,B99059535475,B99059535475,,2009-09-11,US,US-WA,46.58333,-122.91667,10.0,,,,01
1712631,-8717299,E,B19059537113,B19059537113,,2011-09-24,US,US-WA,48.25000,-122.41667,10.0,,,,01
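
As a quick sanity check, the two exclusion rules can be exercised on a toy frame. This is a sketch with made-up rows; `EXCLUDED_PREC` is a name introduced here for illustration. Note that a row whose `COORD_PREC` is itself missing matches none of the excluded codes, so it passes this filter.

```python
import numpy as np
import pandas as pd

# Precision codes treated as unusable, as listed above.
EXCLUDED_PREC = [8, 12, 18, 28, 33, 38, 72]

# Toy rows (hypothetical): one good row, one missing latitude,
# one excluded precision code, and one missing precision code.
toy = pd.DataFrame({
    "LAT_DD": [53.5, np.nan, 48.5, 42.95154],
    "LON_DD": [-110.5, -123.5, -123.5, -120.76332],
    "COORD_PREC": [60.0, 10.0, 72.0, np.nan],
})

# Rule 1: both coordinates must be present.
has_coords = toy["LAT_DD"].notna() & toy["LON_DD"].notna()

# Rule 2: the precision code must not be one of the excluded values.
bad_prec = toy["COORD_PREC"].isin(EXCLUDED_PREC)

kept = toy[has_coords & ~bad_prec]
print(kept.index.tolist())
```

If rows with a missing `COORD_PREC` should also be excluded, an additional `toy["COORD_PREC"].notna()` condition would be needed.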


Additionally, a new column holding the latitude and longitude uncertainties will be created. Its values obey the following rules:
1. If the `COORD_PREC` corresponds to an exact location (is `0`), then the uncertainty is $5 \times 10^{-6}$ degrees to account for the limited number of significant digits given by the data.
2. If the `COORD_PREC` corresponds to a 1-minute block (is `1`), then the uncertainty is $\frac{1}{120} \approx 0.01$ degrees (rounded up) since the coordinates are at the centroid of the block.
3. If the `COORD_PREC` corresponds to a 10-minute block (is `10`), then the uncertainty is $\frac{1}{12} \approx 0.1$ degrees (rounded up) since the coordinates are at the centroid of the block.
4. If the `COORD_PREC` corresponds to a 1-degree block (is `60`), then the uncertainty is $0.5$ degrees since the coordinates are at the centroid of the block.
5. If the `COORD_PREC` corresponds to a county (is `7`), then the uncertainty is estimated as $0.25$ degrees (the average county land area is about 1090.69 square miles, and a square of that area spans around $0.5$ degrees in latitude and longitude, so half its width is $0.25$ degrees).
6. If the `COORD_PREC` corresponds to a town/area (is `11`), then the uncertainty is also estimated as $0.25$ degrees (a town should be smaller than a county, so the county estimate serves as a conservative upper bound).
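
The rules above amount to a lookup from precision code to uncertainty. The implementation of `GooseUtils.get_coord_precision` is not shown in this notebook, so the following is only a sketch of what such a mapping might look like; `COORD_UNCERTAINTY_DEG` and `coord_uncertainty` are hypothetical names, and the value for code `0` assumes coordinates reported to five decimal places.

```python
import math

# Hypothetical lookup table built from the rules listed above; the real
# GooseUtils.get_coord_precision may be implemented differently.
COORD_UNCERTAINTY_DEG = {
    0: 5e-6,   # exact location: assumes five reported decimal places
    1: 0.01,   # 1-minute block: 1/120 degree, rounded up
    10: 0.1,   # 10-minute block: 1/12 degree, rounded up
    60: 0.5,   # 1-degree block: half the block width
    7: 0.25,   # county: rough estimate
    11: 0.25,  # town/area: bounded above by the county estimate
}

def coord_uncertainty(coord_prec):
    """Return the uncertainty in degrees for a COORD_PREC code, or NaN if unknown."""
    if coord_prec is None or (isinstance(coord_prec, float) and math.isnan(coord_prec)):
        return math.nan
    return COORD_UNCERTAINTY_DEG.get(int(coord_prec), math.nan)

# Rough check of the county estimate, assuming the quoted area is in square
# miles and that one degree of latitude spans about 69 miles.
county_side_deg = math.sqrt(1090.69) / 69.0
print(round(county_side_deg / 2, 2))
```

A function like this could be passed directly to `Series.apply`, for example `goose_data['COORD_PREC'].apply(coord_uncertainty)`. Returning NaN for unknown codes keeps unexpected values visible instead of silently assigning them an uncertainty.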

In [None]:
#
# Perform the coordinate precision conversion as described above.
#

# Compute coordinate uncertainties
goose_data['COORD_UNC'] = goose_data['COORD_PREC'].apply(GooseUtils.get_coord_precision)

# Drop the old column
goose_data = goose_data.drop(labels=['COORD_PREC'], axis=1)