# Analysis of the White Fronted Goose (Anser albifrons) and Associated Subspecies

Author: Waley Wang


In [89]:
#
# Import necessary libraries and do necessary non-DataFrame prep work
#

import numpy as np
import pandas as pd

import warnings

import GooseUtils


# Suppress the dtype warning (raised on the first import of the CSV file) and the SettingWithCopyWarning
warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

Import and trim the relevant data. This notebook focuses only on data related to Anser albifrons (species ID: 1710) and its subspecies Anser albifrons elgasi (species ID: 1719). This corresponds to $222,322$ rows of data.

In [90]:
#
# Retrieve the data from the csv file and filter out irrelevant species
#

# Retrieve group 1 data from the relevant CSV file
goose_data_raw = pd.read_csv('Bird_Banding_Data/NABBP_2023_grp_01.csv')

# Filter out all irrelevant species
goose_data_raw = goose_data_raw[goose_data_raw['SPECIES_ID'].isin([1710, 1719])]

In [91]:
#
# Get all relevant columns and display basic information about the data
#

# Retrieve all relevant columns
goose_data = goose_data_raw[['BAND', 
                             'ORIGINAL_BAND', 
                             'OTHER_BANDS', 
                             'EVENT_DATE', 
                             'EVENT_DAY', 
                             'EVENT_MONTH', 
                             'EVENT_YEAR', 
                             'LAT_DD', 
                             'LON_DD', 
                             'COORD_PREC']].copy()  # copy so later column assignments do not act on a slice

# Display the number of rows remaining after filtering
print(f'Rows: {goose_data.shape[0]}')

Rows: 222322


## 1: Data Cleaning

Here all irrelevant entries in the data are filtered out and the data is formatted for use.

### 1.1: Formatting Date

A large number of the date cells (~ $3,295$) do not work with the `pd.to_datetime()` function. Since this is vital information for the analysis, the cell below cleans the dates and then updates the related day, month, and year columns. Dates are chosen by the following process.

1. If the `'EVENT_DATE'` column already has a valid date that works with `pd.to_datetime()`, it will be the date used.
2. Otherwise, if the `'EVENT_DAY'`, `'EVENT_MONTH'`, and `'EVENT_YEAR'` columns together form a date that works with `pd.to_datetime()`, it will be the date used.
3. If neither of the above works, `NaT` will be assigned and the row will be dropped.


The `'EVENT_DAY'`, `'EVENT_MONTH'`, and `'EVENT_YEAR'` columns are then overwritten with values taken from the cleaned `'EVENT_DATE'` column, since they will be used for grouping later.

In [92]:
#
# Clean time-related columns as described above.
#

# Attempt to apply pd.to_datetime() to the EVENT_DATE column.
goose_data['EVENT_DATE'] = pd.to_datetime(goose_data['EVENT_DATE'], format='%m/%d/%Y', errors='coerce')

# Assemble date guesses from the EVENT_MONTH, EVENT_DAY, and EVENT_YEAR columns.
dates_from_columns = pd.to_datetime(
    goose_data['EVENT_MONTH'].apply(str) + '/'
    + goose_data['EVENT_DAY'].apply(str) + '/'
    + goose_data['EVENT_YEAR'].apply(str),
    format='%m/%d/%Y',
    errors='coerce')

# Fill in all NaT values that can be filled with the guesses from the previous line.
goose_data['EVENT_DATE'] = goose_data['EVENT_DATE'].fillna(dates_from_columns)

# Remove all rows where EVENT_DATE is still NaT after the above operations.
goose_data = goose_data[goose_data['EVENT_DATE'].notna()]

# Amend the EVENT_DAY, EVENT_MONTH, and EVENT_YEAR columns based on the EVENT_DATE column.
goose_data['EVENT_DAY'] = goose_data['EVENT_DATE'].dt.day
goose_data['EVENT_MONTH'] = goose_data['EVENT_DATE'].dt.month
goose_data['EVENT_YEAR'] = goose_data['EVENT_DATE'].dt.year

print(f'Rows: {goose_data.shape[0]}')

Rows: 219027
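
The fallback order described above can be checked on a small hand-made frame. The sketch below is only an illustration: the `toy` frame and its values are made up and are not taken from the banding data.

```python
import pandas as pd

# Hand-made rows (hypothetical values, not taken from the banding data).
toy = pd.DataFrame({
    'EVENT_DATE':  ['07/15/2001', '07/99/2001', 'unknown'],
    'EVENT_DAY':   [15, 20, 99],
    'EVENT_MONTH': [7, 7, 7],
    'EVENT_YEAR':  [2001, 2001, 2001],
})

# Rule 1: parse EVENT_DATE directly; values that cannot be parsed become NaT.
primary = pd.to_datetime(toy['EVENT_DATE'], format='%m/%d/%Y', errors='coerce')

# Rule 2: rebuild a candidate date from the month/day/year columns.
fallback = pd.to_datetime(
    toy['EVENT_MONTH'].astype(str) + '/'
    + toy['EVENT_DAY'].astype(str) + '/'
    + toy['EVENT_YEAR'].astype(str),
    format='%m/%d/%Y',
    errors='coerce')

# Rule 3: rows that are still NaT after the fallback are dropped.
toy['EVENT_DATE'] = primary.fillna(fallback)
toy = toy[toy['EVENT_DATE'].notna()]

print(toy['EVENT_DATE'].dt.strftime('%Y-%m-%d').tolist())
```

The first row keeps its own date, the second is recovered from the day, month, and year columns, and the third is dropped because neither source yields a valid date.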


### 1.2: Formatting Coordinates and Deriving a Coordinate Uncertainty

Location data is also vital for the analysis, so a bit of cleaning is needed.

First, rows fitting any of the following conditions will be excluded:
1. Rows that are missing a value for `LAT_DD` or `LON_DD`, because this issue cannot be rectified.
2. Rows whose `COORD_PREC` values are `8`, `12`, `18`, `28`, `33`, `38`, `72`, or `NaN`, because the uncertainty either cannot be determined or is too large to be useful (corresponds to $\sim 1015$ entries).

In [93]:
#
# Clean the coordinates columns as described above.
#

# Filter out all rows where LAT_DD or LON_DD are NaN. Cannot rectify rows with this issue.
goose_data = goose_data[goose_data['LAT_DD'].notna() & goose_data['LON_DD'].notna()]

# Filter out all rows with unusable or useless coordinate precision values as outlined above.
unusable_prec = [8, 12, 18, 28, 33, 38, 72]
goose_data = goose_data[~(goose_data['COORD_PREC'].isin(unusable_prec) |
                          goose_data['COORD_PREC'].isna())]

print(f'Rows: {goose_data.shape[0]}')

Rows: 218012


Additionally, a new column with latitude and longitude uncertainties will be made whose values obey the following rules:
1. If the `COORD_PREC` corresponds to an exact location (is `0`), then the uncertainty is $5 \times 10^{-6}$ degrees to account for the limited number of significant digits given by the data.
2. If the `COORD_PREC` corresponds to a 1-minute block (is `1`), then the uncertainty is $\frac{1}{120} \approx 0.01$ degrees (rounded up) since the coordinates are at the centroid of the block.
3. If the `COORD_PREC` corresponds to a 10-minute block (is `10`), then the uncertainty is $\frac{1}{12} \approx 0.1$ degrees (rounded up) since the coordinates are at the centroid of the block.
4. If the `COORD_PREC` corresponds to a 1-degree block (is `60`), then the uncertainty is $0.5$ degrees since the coordinates are at the centroid of the block.
5. If the `COORD_PREC` corresponds to a county (is `7`), then the uncertainty will be $0.25$ degrees by estimate (since the average county land area is 1090.69 square miles, and a square of that area is about 33 miles on a side, or around $0.5$ degrees in latitude and longitude).
6. If the `COORD_PREC` corresponds to a town/area (is `11`), then the uncertainty will be $0.25$ degrees by estimate (since a town should be smaller than a county, the county value serves as a conservative upper bound).
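
The rules above amount to a lookup from `COORD_PREC` code to an uncertainty in degrees. The source of `GooseUtils.get_coord_unc` is not shown in this notebook, so the following is only a minimal sketch of how such a mapping could look. The names `COORD_UNC_BY_PREC` and `sketch_coord_unc` are hypothetical, and the exact-location value is assumed to be $5 \times 10^{-6}$ degrees.

```python
import math

# Uncertainty in degrees for each retained COORD_PREC code, following the
# rules listed above. The value for code 0 is an assumed 5e-6 degrees.
COORD_UNC_BY_PREC = {
    0: 5e-6,   # exact location
    1: 0.01,   # 1-minute block: 1/120 degree, rounded up
    10: 0.1,   # 10-minute block: 1/12 degree, rounded up
    60: 0.5,   # 1-degree block
    7: 0.25,   # county (estimate)
    11: 0.25,  # town/area (estimate, bounded by the county value)
}


def sketch_coord_unc(coord_prec):
    """Hypothetical stand-in for GooseUtils.get_coord_unc.

    Returns the uncertainty in degrees, or NaN for a missing or
    unrecognised precision code.
    """
    value = float(coord_prec)
    if math.isnan(value):
        return math.nan
    return COORD_UNC_BY_PREC.get(int(value), math.nan)


print(sketch_coord_unc(10.0))
```

Returning `NaN` for unknown codes is a design choice of this sketch; the excluded codes have already been filtered out by the time the conversion runs, so the branch acts only as a safeguard.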

In [94]:
#
# Perform the coordinate precision conversion as described above.
#

# Compute coordinate uncertainties
goose_data['COORD_UNC'] = goose_data['COORD_PREC'].apply(GooseUtils.get_coord_unc)

# Drop the old column
goose_data = goose_data.drop(labels=['COORD_PREC'], axis=1)

## 2: Preliminary Analysis

### 2.1: Question 1: Does the White Fronted Goose Actually Migrate?

For this, we will compare the average location of the bird over all datapoints taken within each month of the year.

In [95]:
months_dict = goose_data.groupby('EVENT_MONTH')

for month, group in months_dict:
    print(f'Month: {month} - Rows: {group.shape[0]}')

Month: 1 - Rows: 6933
Month: 2 - Rows: 1720
Month: 3 - Rows: 2983
Month: 4 - Rows: 1082
Month: 5 - Rows: 362
Month: 6 - Rows: 4749
Month: 7 - Rows: 148863
Month: 8 - Rows: 7582
Month: 9 - Rows: 14402
Month: 10 - Rows: 15366
Month: 11 - Rows: 7557
Month: 12 - Rows: 6413
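
The row counts above are very uneven (July alone holds 148,863 rows while May holds 362), so it may help to keep the counts next to any monthly averages. The following is a minimal sketch of the per-month mean location on a made-up `toy` frame; the values are hypothetical and the names `monthly_mean`, `monthly_count`, and `summary` are introduced only for illustration.

```python
import pandas as pd

# Hand-made observations (hypothetical values, not taken from the banding data).
toy = pd.DataFrame({
    'EVENT_MONTH': [1, 1, 7, 7, 7],
    'LAT_DD': [30.0, 32.0, 68.0, 70.0, 69.0],
    'LON_DD': [-95.0, -93.0, -150.0, -152.0, -151.0],
})

# Mean latitude and longitude of all observations within each month.
monthly_mean = toy.groupby('EVENT_MONTH')[['LAT_DD', 'LON_DD']].mean()

# Number of observations behind each mean, kept alongside for context.
monthly_count = toy.groupby('EVENT_MONTH').size().rename('ROWS')

summary = monthly_mean.join(monthly_count)
print(summary)
```

A plain arithmetic mean of longitude is only meaningful when the observations do not straddle the $\pm 180$ degree meridian; if they do, the longitudes would need to be averaged as angles instead.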
