In [1]:
from IPython.display import Image, display, HTML

HTML(
    """
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/
jquery.min.js "></script>
<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
"""
)

# Why Do Birds Suddenly Appear?
### Analyzing Global Duck Migration Trends with eBird Data

![Title](Maps.png)

<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">I. ABSTRACT</h1><a id='ExecSum'></a><a id='Title'></a>


This study investigates the seasonal migration patterns of various duck species across the Americas using data from the eBird Science dataset. By leveraging PySpark for efficient data processing and visualization tools such as Matplotlib and GeoPandas, we provide a comprehensive analysis of bird sightings data. Our primary objectives are to identify the seasonal distribution of duck species, understand species-specific migratory behaviors, and support conservation efforts by highlighting critical habitats and migration corridors.

The dataset comprises extensive records of duck sightings, including geographic coordinates and timestamps, contributed by citizen scientists and researchers. We processed this data to extract relevant information on species, year, and season, enabling detailed analysis of migratory patterns.

Interactive visualizations were created to dynamically explore the data, revealing significant insights into the migratory routes and seasonal habitats of various duck species. Our findings highlight the importance of preserving key habitats and offer valuable information for conservation strategies. This study demonstrates the utility of large-scale citizen science data in understanding avian ecology and supporting biodiversity conservation.


<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">II. INTRODUCTION</h1><a id='ExecSum'></a><a id='Title'></a>

Ducks are a diverse group of waterfowl belonging to the family Anatidae, which also includes swans and geese. They are found in a variety of habitats across the globe, including freshwater lakes, rivers, marshes, and coastal wetlands. Ducks are characterized by their broad, flat bills, webbed feet, and relatively short legs, adaptations that make them well-suited for an aquatic lifestyle. They exhibit a range of behaviors and dietary preferences, from dabbling in shallow waters for plants and small invertebrates to diving deep underwater to catch fish and other prey (Wikipedia Contributors, 2019).

Migration is a fundamental aspect of duck ecology, driven primarily by seasonal changes in resource availability and climatic conditions. Ducks typically migrate between their breeding grounds, often located in the temperate or Arctic regions, and their wintering grounds, which are usually in warmer, more southerly areas. This migration allows them to exploit different habitats at different times of the year, ensuring access to food and suitable breeding sites (Birdfact, 2022).


Understanding the migration patterns of ducks is crucial for several reasons. Firstly, it helps in identifying critical habitats that need protection, such as stopover sites where ducks rest and refuel during their long migrations. These sites are often rich in food resources and provide essential respite for the birds. Secondly, analyzing migration data can reveal broader ecological changes, such as shifts in climate or habitat availability, which may impact not only duck populations but also other wildlife and plant species dependent on the same ecosystems.


In this lab, we utilize extensive bird observation data to analyze the migration patterns and population dynamics of ducks. Our goal is to determine the optimal locations and times for observing these species, providing insights into their migratory behavior. This analysis will also contribute to the development of effective conservation strategies, ensuring the long-term health and sustainability of duck populations and the ecosystems they inhabit.

<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; 
padding-left: 20px;">III. PROJECT OBJECTIVES AND PURPOSE</h1><a id='ExecSum'></a><a id='Title'></a>

<h2 style="color: #183318;">A. Problem Statement</h2>

How can we leverage comprehensive bird observation data to understand the migration patterns and population dynamics of ducks, while also identifying optimal duck-spotting locations and proposing conservation strategies to mitigate ecological impacts?

<h2 style="color: #183318;">B. Motivation</h2>



Understanding the migration patterns and population dynamics of ducks is crucial for several reasons. Ducks are not only a vital component of wetland ecosystems, but they also serve as key indicators of environmental health. As migratory species, ducks face numerous threats from habitat loss, climate change, and human activities. These challenges necessitate a comprehensive analysis of their movements and population trends to inform conservation strategies effectively.

Birdwatchers and citizen scientists contribute vast amounts of data, providing an unprecedented opportunity to gain insights into duck behavior and migration. By leveraging this extensive dataset, we can identify critical habitats, optimal observation locations, and periods of significant movement. This information is invaluable for researchers, conservationists, and policymakers aiming to preserve these species and their habitats.

Moreover, public interest in birdwatching and wildlife conservation is growing. By creating accessible and engaging visualizations of duck migration patterns, we can raise awareness about the importance of preserving wetland ecosystems and inspire more people to participate in conservation efforts. Ultimately, this project seeks to contribute meaningfully to the scientific understanding and conservation of migratory ducks, ensuring their survival for future generations.

<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">V. DATA COLLECTION</h1><a id='ExecSum'></a><a id='Title'></a>

<h2 style="color: #183318;">A. Data Source</h2>

The data source of this analysis is the eBird Status and Trends data, which consists of bird observation records collected from birdwatchers worldwide. This data includes specific details such as species identification, the time of observation, and geographic coordinates. These comprehensive datasets are instrumental in understanding bird species distribution and population trends over time.

In [2]:
# whole dataset
!du -sh /mnt/data/public/ebird/

1.4T	/mnt/data/public/ebird/


The entire dataset is 1.4 TB and consists of CSV files of bird sightings from North America, plus visualizations in a proprietary format.

In [3]:
# duck CSVs
!du -ch \
/mnt/data/public/ebird/bbwduc-ERD2018-EBIRD_SCIENCE-20191105-dc3957b5/data/*.csv \
/mnt/data/public/ebird/fuwduc-ERD2018-EBIRD_SCIENCE-20191108-754039df/data/*.csv \
/mnt/data/public/ebird/musduc-ERD2018-EBIRD_SCIENCE-20191113-70538914/data/*.csv \
/mnt/data/public/ebird/wooduc-ERD2018-EBIRD_SCIENCE-20191030-e29a115f/data/*.csv \
/mnt/data/public/ebird/buwtea-ERD2018-EBIRD_SCIENCE-20191026-e03b029e/data/*.csv \
/mnt/data/public/ebird/cintea-ERD2018-EBIRD_SCIENCE-20191105-145ef846/data/*.csv \
/mnt/data/public/ebird/norsho-ERD2018-EBIRD_SCIENCE-20191103-421d971b/data/*.csv \
/mnt/data/public/ebird/gadwal-ERD2018-EBIRD_SCIENCE-20191103-481aa100/data/*.csv \
/mnt/data/public/ebird/amewig-ERD2018-EBIRD_SCIENCE-20191103-90051c9b/data/*.csv \
/mnt/data/public/ebird/mallar3-ERD2018-EBIRD_SCIENCE-20191101-708131f3/data/*.csv \
/mnt/data/public/ebird/mexduc-ERD2018-EBIRD_SCIENCE-20191109-c1d8d500/data/*.csv \
/mnt/data/public/ebird/ambduc-ERD2018-EBIRD_SCIENCE-20191029-e2ed9548/data/*.csv \
/mnt/data/public/ebird/motduc-ERD2018-EBIRD_SCIENCE-20191105-f26b05f2/data/*.csv \
/mnt/data/public/ebird/norpin-ERD2018-EBIRD_SCIENCE-20191104-e1db18cd/data/*.csv \
/mnt/data/public/ebird/gnwtea-ERD2018-EBIRD_SCIENCE-20191103-8d623ff8/data/*.csv \
/mnt/data/public/ebird/canvas-ERD2018-EBIRD_SCIENCE-20191105-4eb1360b/data/*.csv \
/mnt/data/public/ebird/redhea-ERD2018-EBIRD_SCIENCE-20191029-6d235b26/data/*.csv \
/mnt/data/public/ebird/rinduc-ERD2018-EBIRD_SCIENCE-20191103-9f256851/data/*.csv \
/mnt/data/public/ebird/gresca-ERD2018-EBIRD_SCIENCE-20191104-7189d431/data/*.csv \
/mnt/data/public/ebird/lessca-ERD2018-EBIRD_SCIENCE-20191103-98c9bc5d/data/*.csv \
/mnt/data/public/ebird/steeid-ERD2018-EBIRD_SCIENCE-20191029-dba0f3bf/data/*.csv \
/mnt/data/public/ebird/speeid-ERD2018-EBIRD_SCIENCE-20191029-c6b3af35/data/*.csv \
/mnt/data/public/ebird/kineid-ERD2018-EBIRD_SCIENCE-20191108-3a80b49c/data/*.csv \
/mnt/data/public/ebird/comeid-ERD2018-EBIRD_SCIENCE-20191105-1f6d42f0/data/*.csv \
/mnt/data/public/ebird/harduc-ERD2018-EBIRD_SCIENCE-20191108-cf76de4d/data/*.csv \
/mnt/data/public/ebird/sursco-ERD2018-EBIRD_SCIENCE-20191104-fa814127/data/*.csv \
/mnt/data/public/ebird/whwsco4-ERD2018-EBIRD_SCIENCE-20191105-0cba4c36/data/*.csv \
/mnt/data/public/ebird/blksco2-ERD2018-EBIRD_SCIENCE-20191107-4b9c946b/data/*.csv \
/mnt/data/public/ebird/lotduc-ERD2018-EBIRD_SCIENCE-20191105-009ca0f9/data/*.csv \
/mnt/data/public/ebird/buffle-ERD2018-EBIRD_SCIENCE-20191103-76984696/data/*.csv \
/mnt/data/public/ebird/comgol-ERD2018-EBIRD_SCIENCE-20191103-accd7641/data/*.csv \
/mnt/data/public/ebird/bargol-ERD2018-EBIRD_SCIENCE-20191108-57c9ffb3/data/*.csv \
/mnt/data/public/ebird/hoomer-ERD2018-EBIRD_SCIENCE-20191103-c76d0aee/data/*.csv \
/mnt/data/public/ebird/commer-ERD2018-EBIRD_SCIENCE-20191103-150d2fe1/data/*.csv \
/mnt/data/public/ebird/rebmer-ERD2018-EBIRD_SCIENCE-20191103-79149b24/data/*.csv \
/mnt/data/public/ebird/rudduc-ERD2018-EBIRD_SCIENCE-20191103-c634b0f9/data/*.csv \
| grep total$

28G	total


Restricting the data to the sighting CSVs for duck species reduces it to 28 GB.

Each CSV has the following 76 features:
1. SAMPLING_EVENT_ID
2. OBSERVER_ID
3. LONGITUDE
4. LATITUDE
5. I_STATIONARY
6. YEAR
7. DAY
8. SOLAR_NOON_DIFF
9. EFFORT_HRS
10. EFFORT_DISTANCE_KM
11. NUMBER_OBSERVERS
12. EASTNESS_MEDIAN
13. EASTNESS_SD
14. ELEV_MEDIAN
15. ELEV_SD
16. INTERTIDAL_FS_C1_1500_ED
17. INTERTIDAL_FS_C1_1500_PLAND
18. ISLAND
19. MCD12Q1_LCCS1_FS_C1_1500_ED
20. MCD12Q1_LCCS1_FS_C1_1500_PLAND
21. MCD12Q1_LCCS1_FS_C2_1500_ED
22. MCD12Q1_LCCS1_FS_C2_1500_PLAND
23. MCD12Q1_LCCS1_FS_C11_1500_ED
24. MCD12Q1_LCCS1_FS_C11_1500_PLAND
25. MCD12Q1_LCCS1_FS_C12_1500_ED
26. MCD12Q1_LCCS1_FS_C12_1500_PLAND
27. MCD12Q1_LCCS1_FS_C13_1500_ED
28. MCD12Q1_LCCS1_FS_C13_1500_PLAND
29. MCD12Q1_LCCS1_FS_C14_1500_ED
30. MCD12Q1_LCCS1_FS_C14_1500_PLAND
31. MCD12Q1_LCCS1_FS_C15_1500_ED
32. MCD12Q1_LCCS1_FS_C15_1500_PLAND
33. MCD12Q1_LCCS1_FS_C16_1500_ED
34. MCD12Q1_LCCS1_FS_C16_1500_PLAND
35. MCD12Q1_LCCS1_FS_C21_1500_ED
36. MCD12Q1_LCCS1_FS_C21_1500_PLAND
37. MCD12Q1_LCCS1_FS_C22_1500_ED
38. MCD12Q1_LCCS1_FS_C22_1500_PLAND
39. MCD12Q1_LCCS1_FS_C255_1500_ED
40. MCD12Q1_LCCS1_FS_C255_1500_PLAND
41. MCD12Q1_LCCS1_FS_C31_1500_ED
42. MCD12Q1_LCCS1_FS_C31_1500_PLAND
43. MCD12Q1_LCCS1_FS_C32_1500_ED
44. MCD12Q1_LCCS1_FS_C32_1500_PLAND
45. MCD12Q1_LCCS1_FS_C41_1500_ED
46. MCD12Q1_LCCS1_FS_C41_1500_PLAND
47. MCD12Q1_LCCS1_FS_C42_1500_ED
48. MCD12Q1_LCCS1_FS_C42_1500_PLAND
49. MCD12Q1_LCCS1_FS_C43_1500_ED
50. MCD12Q1_LCCS1_FS_C43_1500_PLAND
51. MCD12Q1_LCCS2_FS_C25_1500_ED
52. MCD12Q1_LCCS2_FS_C25_1500_PLAND
53. MCD12Q1_LCCS2_FS_C35_1500_ED
54. MCD12Q1_LCCS2_FS_C35_1500_PLAND
55. MCD12Q1_LCCS2_FS_C36_1500_ED
56. MCD12Q1_LCCS2_FS_C36_1500_PLAND
57. MCD12Q1_LCCS3_FS_C27_1500_ED
58. MCD12Q1_LCCS3_FS_C27_1500_PLAND
59. MCD12Q1_LCCS3_FS_C50_1500_ED
60. MCD12Q1_LCCS3_FS_C50_1500_PLAND
61. MCD12Q1_LCCS3_FS_C51_1500_ED
62. MCD12Q1_LCCS3_FS_C51_1500_PLAND
63. MOD44W_OIC_FS_C1_1500_ED
64. MOD44W_OIC_FS_C1_1500_PLAND
65. MOD44W_OIC_FS_C2_1500_ED
66. MOD44W_OIC_FS_C2_1500_PLAND
67. MOD44W_OIC_FS_C3_1500_ED
68. MOD44W_OIC_FS_C3_1500_PLAND
69. NORTHNESS_MEDIAN
70. NORTHNESS_SD
71. NTL_MEAN
72. NTL_SD
73. obs
74. DATE
75. RES
76. DATA_TYPE

For brevity, the following section details only the features we used.

<h2 style="color: #183318;">B. Data Descriptions</h2>



### Table 1. Feature Description

| Feature Name  | Feature Description                                                                                                                                                                      | Sample Record         | Data Type |
|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|-----------|
| **year**      | The year of the observation, critical for understanding temporal patterns and trends over time.                                                                                           | 2023                  | integer   |
| **day**       | The day of the year of the observation, important for detailed temporal analysis and seasonal trend identification.                                                                       | 115                   | integer   |
| **longitude** | Geographic longitude of the observation, crucial for spatial analysis and mapping migration routes and population distributions.                                                          | -74.0060              | float     |
| **latitude**  | Geographic latitude of the observation, essential for spatial analysis and mapping migration routes and population distributions.                                                          | 40.7128               | float     |
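
Because `day` is a day-of-year value, it has to be combined with `year` to recover a calendar date. The following is a minimal sketch using only Python's standard library. The helper name `day_of_year_to_date` is ours for illustration and is not part of the eBird data or of the pipeline in this notebook.

```python
from datetime import date, timedelta


def day_of_year_to_date(year, day):
    """Convert a 1-based day-of-year into a calendar date (hypothetical helper)."""
    return date(year, 1, 1) + timedelta(days=day - 1)


# Sample record from Table 1: year 2023, day 115.
sample = day_of_year_to_date(2023, 115)
print(sample, sample.month)  # 2023-04-25 4
```

Day 115 therefore falls in April, which the season mapping used later treats as spring.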


<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">VII. METHODOLOGY</h1><a id='ExecSum'></a><a id='Title'></a>

1. Setting up Spark Session:
- Imported necessary modules and initialized a SparkSession.
- Specified configurations for Spark such as memory allocation and timeouts.

2. Defining Folder Paths:
- Defined the paths to the folders containing the data files.

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, input_file_name, regexp_extract

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('BirdSightings') \
    .config('spark.executor.memory', '4g') \
    .config('spark.driver.memory', '4g') \
    .config('spark.network.timeout', '600s') \
    .config('spark.executor.heartbeatInterval', '120s') \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()


folder_paths = [
    "/mnt/data/public/ebird/bbwduc-ERD2018-EBIRD_SCIENCE-20191105-dc3957b5/data",
    "/mnt/data/public/ebird/fuwduc-ERD2018-EBIRD_SCIENCE-20191108-754039df/data",
    "/mnt/data/public/ebird/musduc-ERD2018-EBIRD_SCIENCE-20191113-70538914/data",
    "/mnt/data/public/ebird/wooduc-ERD2018-EBIRD_SCIENCE-20191030-e29a115f/data",
    "/mnt/data/public/ebird/buwtea-ERD2018-EBIRD_SCIENCE-20191026-e03b029e/data",
    "/mnt/data/public/ebird/cintea-ERD2018-EBIRD_SCIENCE-20191105-145ef846/data",
    "/mnt/data/public/ebird/norsho-ERD2018-EBIRD_SCIENCE-20191103-421d971b/data",
    "/mnt/data/public/ebird/gadwal-ERD2018-EBIRD_SCIENCE-20191103-481aa100/data",
    "/mnt/data/public/ebird/amewig-ERD2018-EBIRD_SCIENCE-20191103-90051c9b/data",
    "/mnt/data/public/ebird/mallar3-ERD2018-EBIRD_SCIENCE-20191101-708131f3/data",
    "/mnt/data/public/ebird/mexduc-ERD2018-EBIRD_SCIENCE-20191109-c1d8d500/data",
    "/mnt/data/public/ebird/ambduc-ERD2018-EBIRD_SCIENCE-20191029-e2ed9548/data",
    "/mnt/data/public/ebird/motduc-ERD2018-EBIRD_SCIENCE-20191105-f26b05f2/data",
    "/mnt/data/public/ebird/norpin-ERD2018-EBIRD_SCIENCE-20191104-e1db18cd/data",
    "/mnt/data/public/ebird/gnwtea-ERD2018-EBIRD_SCIENCE-20191103-8d623ff8/data",
    "/mnt/data/public/ebird/canvas-ERD2018-EBIRD_SCIENCE-20191105-4eb1360b/data",
    "/mnt/data/public/ebird/redhea-ERD2018-EBIRD_SCIENCE-20191029-6d235b26/data",
    "/mnt/data/public/ebird/rinduc-ERD2018-EBIRD_SCIENCE-20191103-9f256851/data",
    "/mnt/data/public/ebird/gresca-ERD2018-EBIRD_SCIENCE-20191104-7189d431/data",
    "/mnt/data/public/ebird/lessca-ERD2018-EBIRD_SCIENCE-20191103-98c9bc5d/data",
    "/mnt/data/public/ebird/steeid-ERD2018-EBIRD_SCIENCE-20191029-dba0f3bf/data",
    "/mnt/data/public/ebird/speeid-ERD2018-EBIRD_SCIENCE-20191029-c6b3af35/data",
    "/mnt/data/public/ebird/kineid-ERD2018-EBIRD_SCIENCE-20191108-3a80b49c/data",
    "/mnt/data/public/ebird/comeid-ERD2018-EBIRD_SCIENCE-20191105-1f6d42f0/data",
    "/mnt/data/public/ebird/harduc-ERD2018-EBIRD_SCIENCE-20191108-cf76de4d/data",
    "/mnt/data/public/ebird/sursco-ERD2018-EBIRD_SCIENCE-20191104-fa814127/data",
    "/mnt/data/public/ebird/whwsco4-ERD2018-EBIRD_SCIENCE-20191105-0cba4c36/data",
    "/mnt/data/public/ebird/blksco2-ERD2018-EBIRD_SCIENCE-20191107-4b9c946b/data",
    "/mnt/data/public/ebird/lotduc-ERD2018-EBIRD_SCIENCE-20191105-009ca0f9/data",
    "/mnt/data/public/ebird/buffle-ERD2018-EBIRD_SCIENCE-20191103-76984696/data",
    "/mnt/data/public/ebird/comgol-ERD2018-EBIRD_SCIENCE-20191103-accd7641/data",
    "/mnt/data/public/ebird/bargol-ERD2018-EBIRD_SCIENCE-20191108-57c9ffb3/data",
    "/mnt/data/public/ebird/hoomer-ERD2018-EBIRD_SCIENCE-20191103-c76d0aee/data",
    "/mnt/data/public/ebird/commer-ERD2018-EBIRD_SCIENCE-20191103-150d2fe1/data",
    "/mnt/data/public/ebird/rebmer-ERD2018-EBIRD_SCIENCE-20191103-79149b24/data",
    "/mnt/data/public/ebird/rudduc-ERD2018-EBIRD_SCIENCE-20191103-c634b0f9/data"
]
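
The hard-coded list above could also be derived from the species codes. The sketch below assumes that every species folder follows the `<code>-ERD2018-EBIRD_SCIENCE-<date>-<hash>/data` naming visible in the paths above. The function name `find_data_folders` is hypothetical and is not used elsewhere in this notebook.

```python
from glob import glob
from pathlib import Path


def find_data_folders(root, species_codes):
    """Return the data folder(s) matching each species code under `root`."""
    folders = []
    for code in species_codes:
        # Match any release date and hash suffix for this species code.
        pattern = str(Path(root) / f"{code}-ERD2018-EBIRD_SCIENCE-*" / "data")
        matches = sorted(glob(pattern))
        if not matches:
            raise FileNotFoundError(f"no folder found for species code {code!r}")
        folders.extend(matches)
    return folders


# Example call (requires the dataset to be mounted):
# folder_paths = find_data_folders("/mnt/data/public/ebird", ["bbwduc", "fuwduc"])
```

Raising on a missing folder keeps a mistyped species code from silently dropping a species from the analysis.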

3. Defining User-Defined Function (UDF):
- Defined a Pandas UDF to map month numbers to seasons.

4. Reading and Processing Data:
- Read the data files into a Spark DataFrame.
- Extracted the species code from each file path and cast columns to appropriate data types.
- Filtered out rows with missing year or day information.
- Selected necessary columns and renamed them.
- Calculated the month from the day information.
- Applied the UDF to get the season corresponding to each month.
- Grouped the data by species name, year, and season.
- Aggregated coordinates into a list for each group.

5. Outputting Results:
- Displayed a sample of the processed DataFrame.
- Wrote the processed data to Parquet files partitioned by species name, year, and season.
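
The extraction and grouping steps above can be sketched in plain Python, which may help when checking the logic on a handful of rows before running Spark. This is an illustration rather than the pipeline itself: the function names are ours, and the sample file name and coordinates are made up. Only the regular expression and the month-to-season mapping are taken from the notebook's code.

```python
import re
from collections import defaultdict

# Same pattern the pipeline passes to regexp_extract.
SPECIES_PATTERN = re.compile(r"/([^/]+)-ERD")


def species_code_from_path(file_path):
    """Return the species code embedded in a file path, or "" if absent."""
    match = SPECIES_PATTERN.search(file_path)
    return match.group(1) if match else ""


def season_of_month(month):
    """Map a month number (1-12) to a season, as in the notebook's UDF."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


def group_coordinates(rows):
    """Collect (longitude, latitude) pairs per (species, year, season)."""
    groups = defaultdict(list)
    for file_path, year, month, longitude, latitude in rows:
        key = (species_code_from_path(file_path), year, season_of_month(month))
        groups[key].append((longitude, latitude))
    return dict(groups)


# Illustrative rows only: the file name and coordinates are made up.
path = ("/mnt/data/public/ebird/"
        "mallar3-ERD2018-EBIRD_SCIENCE-20191101-708131f3/data/example.csv")
rows = [
    (path, 2018, 1, -74.0, 40.7),
    (path, 2018, 2, -75.2, 39.9),
    (path, 2018, 7, -97.1, 49.9),
]
for key, coords in group_coordinates(rows).items():
    print(key, coords)
```

Each key corresponds to one output partition of the Parquet dataset, and each value to its `coordinates` list.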

![Pipeline](pipeline.png)

In [5]:
from pyspark.sql.functions import lit, input_file_name, regexp_extract, col
from pyspark.sql import functions as F
from pyspark import StorageLevel
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd


@pandas_udf("string")
def get_season_udf(month_series: pd.Series) -> pd.Series:
    """
    Map a month number to its corresponding season.

    Args:
        month_series (pd.Series): A Pandas Series containing month numbers (1-12).

    Returns:
        pd.Series: A Pandas Series with the season corresponding to each month.
    """
    def get_season(month):
        if month in [12, 1, 2]:
            return "Winter"
        elif month in [3, 4, 5]:
            return "Spring"
        elif month in [6, 7, 8]:
            return "Summer"
        else:
            return "Fall"

    return month_series.apply(get_season)


species_season_df = (spark.read.option("header", "true").csv(folder_paths)
                     .withColumn("file_path", input_file_name())
                     .withColumn(
                         "species_name", regexp_extract(
                             "file_path", r'/([^/]+)-ERD', 1)
                     )
                     .withColumn("year_int", col("year").cast("int"))
                     .withColumn("day_int", col("day").cast("int"))
                     .filter(col("year_int").isNotNull() & col("day_int")
                             .isNotNull()
                            )
                     .select(
                         "species_name",
                         "year_int",
                         "day_int",
                         "longitude",
                         "latitude"
                     )
                     .withColumnRenamed("year_int", "year")
                     .withColumnRenamed("day_int", "day")
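                     # Approximate the month as 30-day blocks of the day of
                     # year, capped at 12 so that days 361-366 stay in
                     # December (uncapped, they become month 13, which the
                     # season mapping labels "Fall").
                     .withColumn("month", F.least(
                         F.lit(12), ((col("day") - 1) / 30 + 1).cast("int"))
                                )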
                     .withColumn("season", get_season_udf(col("month")))
                     .groupBy("species_name", "year", "season")
                     .agg(F.collect_list(F.struct("longitude", "latitude"))
                          .alias("coordinates")
                         )
                     )

# Write first; partitioning by species, year, and season lets later reads
# load only the matching partition.
species_season_df.write.partitionBy("species_name", "year", "season").parquet(
    "ducks_seasons_parquet", mode="overwrite")

# Preview a few groups from the written data. The coordinate lists are very
# large, so show their sizes instead of printing them in full (printing them
# exceeds the Jupyter IOPub data rate limit).
(spark.read.parquet("ducks_seasons_parquet")
 .select("species_name", "year", "season",
         F.size("coordinates").alias("n_coordinates"))
 .show(3, truncate=False))



In [6]:
from pyspark import StorageLevel

species_season_df = spark.read.parquet("ducks_seasons_parquet")

6. Visualizing Species Sightings
- Defined a dictionary mapping species codes to their common names and inverted it for easy lookup.
- Created a function display_species_sightings() to display sightings of a specific species in a given year and season on a map.
- Retrieved species sightings data from the DataFrame based on the selected species, year, and season.
- Filtered world map data to include only North and South America for visualization.
- Plotted sightings on a map using longitude and latitude coordinates.
- Displayed interactive widgets for selecting species, year, and season, enabling dynamic visualization of sightings.
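
The widget code in this notebook limits 2019 to winter through a hard-coded check. As a hedged alternative, the available seasons per year could be derived from the data itself. The sketch below assumes the (year, season) pairs have already been collected, for example with `species_season_df.select("year", "season").distinct().collect()`. The function name `seasons_by_year` and the sample pairs are illustrative.

```python
from collections import defaultdict

SEASON_ORDER = ["Winter", "Spring", "Summer", "Fall"]


def seasons_by_year(year_season_pairs):
    """Map each year to the seasons that have data, in display order."""
    available = defaultdict(set)
    for year, season in year_season_pairs:
        available[year].add(season)
    return {
        year: [s for s in SEASON_ORDER if s in seasons]
        for year, seasons in available.items()
    }


# Illustrative pairs, consistent with 2019 having winter data only.
pairs = [
    (2018, "Winter"), (2018, "Spring"), (2018, "Summer"), (2018, "Fall"),
    (2019, "Winter"),
]
print(seasons_by_year(pairs)[2019])  # ['Winter']
```

A lookup like this would keep the season buttons correct if the underlying data were extended to other partial years.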


In [7]:
from ipywidgets import interact, widgets
import matplotlib.pyplot as plt
import geopandas as gpd
from pyspark.sql.functions import col
from pyspark import StorageLevel
import numpy as np
import warnings

# Suppressing
# /tmp/ipykernel_739/1372574148.py:40: FutureWarning: The geopandas.dataset 
# module is deprecated and will be removed in GeoPandas 1.0. You can get the
# original 'naturalearth_lowres' data from
# https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
#  world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
warnings.simplefilter(action='ignore', category=FutureWarning)

# Dictionary mapping species codes to actual species names
species_dict = {
    "bbwduc": "Black-bellied Whistling Duck",
    "fuwduc": "Fulvous Whistling Duck",
    "musduc": "Muscovy Duck",
    "wooduc": "Wood Duck",
    "buwtea": "Blue-winged Teal",
    "cintea": "Cinnamon Teal",
    "norsho": "Northern Shoveler",
    "gadwal": "Gadwall",
    "amewig": "American Wigeon",
    "mallar3": "Mallard",
    "mexduc": "Mexican Duck",
    "ambduc": "American Black Duck",
    "motduc": "Mottled Duck",
    "norpin": "Northern Pintail",
    "gnwtea": "Green-winged Teal",
    "canvas": "Canvasback",
    "redhea": "Redhead",
    "rinduc": "Ring-necked Duck",
    "gresca": "Greater Scaup",
    "lessca": "Lesser Scaup",
    "steeid": "Steller's Eider",
    "speeid": "Spectacled Eider",
    "kineid": "King Eider",
    "comeid": "Common Eider",
    "harduc": "Harlequin Duck",
    "sursco": "Surf Scoter",
    "whwsco4": "White-winged Scoter",
    "blksco2": "Black Scoter",
    "lotduc": "Long-tailed Duck",
    "buffle": "Bufflehead",
    "comgol": "Common Goldeneye",
    "bargol": "Barrow's Goldeneye",
    "hoomer": "Hooded Merganser",
    "commer": "Common Merganser",
    "rebmer": "Red-breasted Merganser",
    "rudduc": "Ruddy Duck"
}

# Invert the dictionary for easy lookup of species code by name
species_name_to_code = {v: k for k, v in species_dict.items()}

# Plot the sightings of one species for a selected year and season


def display_species_sightings(species_name, year, season):
    """
    Display the sightings of a specific species in a given year and season 
    on a map.

    Args:
    species_name (str): The common name of the species to display.
    year (int): The year of the sightings to display.
    season (str): The season of the sightings to display.

    Returns:
    None
    """
    if season is None or year is None:
        return

    species_code = species_name_to_code[species_name]
    season_df = species_season_df \
        .filter(
            (col("species_name") == species_code)
            & (col("year") == year)
            & (col("season") == season)
        ).select("coordinates").collect()

    # Read world shapefile
    world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

    # Filter world map to include only North and South America
    world = world[(world['continent'] == 'North America') |
                  (world['continent'] == 'South America')]

    # Create a single figure and axes for the selected season
    fig, ax = plt.subplots(figsize=(8, 6))
    fig.suptitle(
        f"Sightings for {species_name} - {season} {year}", fontsize=16)

    # Set static limits for the Americas
    ax.set_xlim([-170, -30])
    ax.set_ylim([-60, 80])

    if season_df:
        coordinates = [(float(coord["longitude"]), float(coord["latitude"]))
                       for coord in season_df[0]["coordinates"]]
        coordinates_array = np.array(coordinates)

        world.plot(ax=ax, color='lightgrey', edgecolor='black')

        # Plot original points on top
        ax.scatter(
            coordinates_array[:, 0],
            coordinates_array[:, 1],
            color='#183318',
            alpha=0.3,
            s=10)

        ax.set_aspect('auto')
        ax.axis('off')
        plt.draw()


# Get all seasons
all_seasons = ['Winter', 'Spring', 'Summer', 'Fall']
restricted_seasons = ['Winter']

# Get unique years from the data
years = sorted(species_season_df.select(
    "year").distinct().rdd.flatMap(lambda x: x).collect())

# Create species dropdown with actual names
species_dropdown = widgets.Dropdown(options=list(
    species_dict.values()), description='Species:')

# Create year dropdown using the dynamically retrieved years
year_dropdown = widgets.Dropdown(options=years, description='Year:')

# Create season button
season_button = widgets.ToggleButtons(
    options=all_seasons, description='Season:', button_style='')


def update_season_options(*args):
    """
    Update the available seasons based on the selected year.
    """
    selected_year = year_dropdown.value
    if selected_year == 2019:
        season_button.options = restricted_seasons
    else:
        season_button.options = all_seasons


# Update season options when the year is changed
year_dropdown.observe(update_season_options, 'value')

# Display widgets
interact(display_species_sightings, species_name=species_dropdown,
         year=year_dropdown, season=season_button)


<video width="720" height="720" 
       src="results.mp4"  
       controls>
</video>

Above is a screen recording of our output, since the notebook needs to be continuously running for the full dynamic functionality.

As the recording shows, the visualizations render quickly in real time despite the size of the data. This is owing to the partition structure of `ducks_seasons_parquet`, which lets Spark read only the files for the selected species, year, and season instead of scanning the whole dataset.
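
Spark's `partitionBy` writes one directory level per partition column, named `column=value`. The sketch below builds the directory that a given filter would resolve to. The helper name `partition_dir` is hypothetical, and the species, year, and season are example values.

```python
from pathlib import PurePosixPath


def partition_dir(root, species_code, year, season):
    """Build the Hive-style partition directory for one species, year, season."""
    return (PurePosixPath(root)
            / f"species_name={species_code}"
            / f"year={year}"
            / f"season={season}")


print(partition_dir("ducks_seasons_parquet", "mallar3", 2018, "Winter"))
# ducks_seasons_parquet/species_name=mallar3/year=2018/season=Winter
```

Because the filter columns are encoded in the directory names, Spark can skip every other directory without opening any Parquet file.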

<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">IX. RESULTS AND DISCUSSION</h1><a id='ExecSum'></a><a id='Title'></a>

### Results

Visualizations
- Interactive maps: We created interactive maps of duck sightings across the Americas. They show the geographical spread and density of sightings for each species, year, and season.

- Seasonal patterns: The maps illustrate the seasonal migration patterns of the selected duck species. In winter, most duck species are absent from the northern part of the map, while in summer they are widely distributed across northern breeding grounds.
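
One simple way to put a number on this north-south shift is to compare the mean latitude of sightings per season. The sketch below is illustrative only: the function name is ours and the coordinates are made-up values, not results from the eBird data.

```python
from statistics import mean


def mean_latitude_by_season(coordinates_by_season):
    """Return the mean latitude of (longitude, latitude) pairs per season."""
    return {
        season: mean(lat for _lon, lat in coords)
        for season, coords in coordinates_by_season.items()
        if coords  # skip seasons without sightings
    }


# Made-up coordinates for illustration.
example = {
    "Winter": [(-90.0, 30.0), (-95.0, 34.0)],
    "Summer": [(-100.0, 50.0), (-110.0, 54.0)],
}
print(mean_latitude_by_season(example))
# {'Winter': 32.0, 'Summer': 52.0}
```

A higher mean latitude in summer than in winter would be consistent with the pattern seen on the maps, although the mean is also affected by where observers are active.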

### Discussion

- The visualizations show that Canada is an especially good place to see ducks in the summer, and Mexico in the winter. Ducks can also be seen nearly year-round in the United States, especially the Eastern United States.

- Even duck species known to be migratory appear to be present year-round in certain areas, such as the US, which is a possible cause for concern. When migratory animals stop migrating, it is sometimes due to climate change or human intervention (such as when animals settle in areas where they are being fed). For example, some birds have stopped migrating "as rubbish dumps provide winter feeding grounds" (Sample, 2016). This behavior can disrupt ecosystems and impact local biodiversity.



<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">X. CONCLUSION</h1><a id='ExecSum'></a><a id='Title'></a>

Overall, this study demonstrates the value of large-scale citizen science data in ornithological research. The insights gained from this analysis not only contribute to our understanding of avian ecology but also support biodiversity conservation efforts by identifying important areas for preservation and management. Future research can build on this work by incorporating additional environmental variables and extending the analysis to other bird species, further enriching our knowledge of avian migration and ecology.

<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">XI. LIMITATIONS AND RECOMMENDATIONS</h1><a id='ExecSum'></a><a id='Title'></a>

### A. Limitations

1. Data Biases
   - The eBird data is based on citizen science contributions, which may introduce biases due to uneven observation efforts across different regions and times. 
   - There may be gaps in the dataset, particularly in remote or less accessible areas where fewer observations are recorded.

### B. Recommendations

1. Enhanced Data Collection
   - Encourage more consistent and widespread participation in bird observation programs to reduce data gaps and biases.
   - Integrate additional data sources, such as satellite tracking and habitat quality assessments, to complement citizen science data.

2. Conservation Strategies
   - Protect critical stopover sites identified in the study to ensure ducks have safe resting and refueling areas during migration.

3. Further Research
   - Conduct species-specific studies to understand unique ecological requirements and threats faced by different duck species.
   - Investigate the impacts of climate change on migration timing and routes to predict future changes and plan conservation efforts accordingly.

<a id='Header2'></a>
<h1 style="color:#E4DEBE; background-color:#183318; padding: 20px 0; text-align: left; font-weight: bold; padding-left: 20px;">XII. REFERENCES</h1>

Birdfact. (2022, January 16). *Do Ducks Migrate (All You Need To Know)*. Bird Fact. https://birdfact.com/articles/do-ducks-migrate

Sample, I. (2016, January 22). Birds stop migrating as rubbish dumps provide winter feeding grounds. *The Guardian.* https://www.theguardian.com/science/2016/jan/22/birds-stop-migrating-as-rubbish-dumps-provide-winter-feeding-grounds

Wikipedia Contributors. (2019, May 1). *Duck*. Wikipedia; Wikimedia Foundation. https://en.wikipedia.org/wiki/Duck