# Bird Population Decline in the United States
Calvin Davis, Sabine May, Braden Nowicki, Hudson Jones


## Introduction
Recent studies have shown a staggering decline in bird populations across North America since 1970 ([Nearly 3 Billion Birds Gone](https://www.birds.cornell.edu/home/bring-birds-back/#:~:text=NARRATOR%3A%20Birds%20are%20losing%20the,toxic%20pesticides%20and%20insect%20declines.)). Worldwide, we are seeing a similarly dramatic loss of biodiversity. These are consequences of a changing world, a world pushed to the brink by human activity. It is of utmost importance that we understand, address, and potentially alleviate these harmful changes.

More than ever, data science is an integral part of developing such understanding and promoting positive change in the world. The subject of bird loss, along with a wider range of environmental and conservation subjects, will benefit greatly from the application of quality data science. This project is therefore a pertinent exercise in using data science for positive change, as well as a fantastic example of the power of data science in the current day.

In this report, we present the complete data science pipeline, incorporating the following steps:

1. Data Collection
2. Data Processing
3. Exploratory Data Analysis and Visualization
4. Hypothesis Testing and Machine Learning
5. Insight and Policy Decision

The issue of bird population decline is a complex and nuanced one, but we hope to provide the reader with an improved understanding of some critical factors through our analysis, such as habitat loss, urbanization/human activity, commercial building density, tree cover loss, and temperature increase due to climate change. These factors barely scratch the surface of the complex ways ecosystems and biodiversity respond to widespread change, but analyzing them in connection with bird population decline will hopefully prompt us to consider the environmental impacts of what we do. On a local level, a national level, and an international level, we can implement policies that curtail this dramatic bird loss; to do so, we must first understand what the data tells us.

This project was done using [IPython](https://ipython.org/) and [Jupyter Notebooks](https://jupyter.org/), which provide easy and interactive coding environments for Python.

## Data Collection
### North American Breeding Bird Survey Dataset
In this part of the data science pipeline, we bring together multiple trustworthy sources of data.

The primary dataset we will be working with is the [North American Breeding Bird Survey Dataset](https://www.sciencebase.gov/catalog/item/52b1dfa8e4b0d9b325230cd9). This is a cohesive, well-maintained, and annually updated dataset containing bird count data across North America. The data is organized by state and, within each state, by route. Routes are designated paths/locations where skilled avian identifiers (birders) use their ears and eyes to count all birds in the area (by species). Typically, bird counts at a route are done in June of each year. This dataset is widely used by environmental and federal agencies to track the population health of over 700 species of birds.

We will be using this dataset to measure bird populations and, in particular, population decline over the years 1966-2022 (included in the 2023 data release). Eventually, we will bring in other environmental datasets to get an idea of which factors contribute to bird population decline.

The most recent data release can be downloaded from the North American Breeding Bird Survey Dataset page linked above. Under "Child Items," download "2023 Release - North American Breeding Bird Survey Dataset (1966-2022)." The README contains useful information on the structure of the data. We will mainly be working with the CSV files within the "States" folder; these files contain information about the number of birds of each species counted at each stop for each year. We will also use information in "routes.csv", which includes coordinates and information about each route that we can use to connect the population counts with other environmental data sets.

We store the desired files in a directory titled 'Data'. Feel free to change the directory structure and adapt the paths in the code to your specific file structure.
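As a minimal sketch of the loading step, assuming the per-state files have been extracted to plain CSVs in one folder and share the same columns, the state files could be combined like this. The helper name `load_state_counts` and the folder layout are our own illustrative choices, not part of the data release.

```python
from pathlib import Path

import pandas as pd


def load_state_counts(states_dir):
    """Read every per-state CSV in `states_dir` into one DataFrame.

    Assumes each file has the same columns; check the README in the
    release for the actual column names.
    """
    csv_paths = sorted(Path(states_dir).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No CSV files found in {states_dir}")
    frames = [pd.read_csv(path) for path in csv_paths]
    # ignore_index gives the combined table a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

With the directory structure described above, a call might look like `load_state_counts("Data/States")`; adjust the path to match your own layout.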

TODO - also SpeciesList?

TODO - consider pesticide use?

### Global Forest Watch Deforestation Data

Download the Excel file from the download icon next to "Share Dashboard" in [Global Forest Watch](https://www.globalforestwatch.org/dashboards/country/USA/?category=forest-change).

Each county has eight entries in the dataset, one for each "threshold" of forest coverage. The data was created using remote sensing, dividing the US into a very large grid of 30-meter by 30-meter squares. Each threshold determines which squares are counted, based on the percentage of forest coverage in each square; for example, the data point for Prince George's County, MD at threshold 50 gives the forest loss for all squares within the county that started with at least 50% forest coverage, and ignores squares with less initial forest coverage.

There are two CSV files (one that maintains the by-year distinction, and one that sums the forest loss across the years) saved for each threshold level: 0, 10, 15, 20, 25, 30, 50, 75. There are also two CSV files saved with all thresholds included.

The data is divided into counties, and further divided into forest loss by year. There is also a metric expressing the forest loss as a percentage of the total forest area in each county.
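To make the threshold idea concrete, here is a hedged sketch that keeps only the rows for one threshold and sums the yearly loss columns per county. The column names (`county`, `threshold`, and the `tc_loss_ha_` prefix for yearly columns) are assumptions for illustration and should be matched to the headers in the downloaded file.

```python
import pandas as pd


def total_loss_at_threshold(forest_df, threshold, loss_prefix="tc_loss_ha_"):
    """Sum yearly tree cover loss per county for one canopy threshold.

    Column names ("county", "threshold", and the yearly loss prefix) are
    assumptions for illustration; match them to the downloaded file.
    """
    subset = forest_df[forest_df["threshold"] == threshold]
    # Yearly loss columns are identified by their shared prefix.
    loss_cols = [c for c in subset.columns if c.startswith(loss_prefix)]
    result = subset[["county"]].copy()
    result["total_loss_ha"] = subset[loss_cols].sum(axis=1)
    return result.reset_index(drop=True)
```

Choosing a single threshold before combining with other data avoids counting the same county eight times.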

Not every bird thrives in forested habitat. However, deforestation is a form of habitat loss for many birds. This is a significant factor in areas with lots of forested space, such as Maryland, and so may prove important when we zoom in and analyze this bird population decline in Maryland.

### National Center for Health Statistics Urban-Rural Classification (By County)

This dataset is a 2013 classification rating each US county's level of urban development on a six-point scale. Urbanization may be an important factor affecting bird population health: migrating birds often die by striking tall buildings, and urban expansion is linked with habitat loss.

1 = Large central metropolitan (Big cities)

2 = Large fringe metropolitan

3 = Medium metropolitan

4 = Small metropolitan

5 = Micropolitan

6 = Non-core (Rural)

You'll need to download this data from [CDC NCHS Urban-Rural Classification](https://www.cdc.gov/nchs/data_access/urban_rural.htm#:~:text=NCHSurbruralcodes,XLS%20%E2%80%93%20175%20KB%5D).
The link highlights the file you need to download: the "NCHSUrbRuralCodes" XLS file.
Download it and place it in the data folder, making sure it is named "NCHSURCodes2013.xlsx".
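For readability in later tables and plots, the six codes listed above can be mapped to their labels. The sketch below assumes the codes have already been loaded into a DataFrame column; the column name `urban_code` is hypothetical.

```python
import pandas as pd

# Labels for the six 2013 NCHS urban-rural codes listed above.
URBAN_RURAL_LABELS = {
    1: "Large central metropolitan",
    2: "Large fringe metropolitan",
    3: "Medium metropolitan",
    4: "Small metropolitan",
    5: "Micropolitan",
    6: "Non-core",
}


def add_urban_rural_label(df, code_col="urban_code"):
    """Return a copy of `df` with a human-readable label column added.

    `code_col` is a hypothetical column name holding the 1-6 codes.
    """
    labeled = df.copy()
    # Codes outside 1-6 map to NaN, which makes bad rows easy to spot.
    labeled["urban_label"] = labeled[code_col].map(URBAN_RURAL_LABELS)
    return labeled
```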

### Open Energy Data Initiative: City and County Commercial Building Inventories

This dataset includes information about a large number of commercial buildings in the United States. At [Open Energy Data Initiative](https://data.openei.org/submissions/906), download the data in .xlsb files for all regions (Midwest, Northeast, South Atlantic, South Central, West). This is a very large dataset (over two million entries!), so once you run the data processing code for this dataset (shown below), **do not** run it again. Processing the data into a more concise, usable form takes several minutes. After processing, the relevant data will be stored as a CSV in the SharedData folder (or any folder you want).

Similarly to the urban-rural classification data, building density (which we will extract from this data) provides a fantastic metric for tracking human developments which may cause habitat loss and population decline.
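One way to avoid repeating the slow processing step is to cache its result on disk and reload it on later runs. The following is a sketch under stated assumptions, not the processing code itself: the helper names, the `county_fips` column, and the assumption of one row per building are all illustrative.

```python
import os

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the cached CSV if it exists; otherwise build and save it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    result = build_fn()
    result.to_csv(cache_path, index=False)
    return result


def buildings_per_county(buildings_df, county_col="county_fips"):
    """Count building records per county (one row per building assumed).

    `county_col` is a hypothetical column name.
    """
    counts = buildings_df.groupby(county_col).size()
    return counts.rename("building_count").reset_index()
```

Note that this yields a count, not a density; dividing by county land area (not shown here) would be needed for a true density. Also be aware that reloading a CSV may turn county codes into integers and drop leading zeros.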

### National Centers for Environmental Information: Average Temperature Time Series
We would like to consider temperature changes in the US (related to climate change). Temperature changes result in habitat changes and potential habitat loss. Birds have evolved and adapted to survive in very specific ecological niches; as temperatures change, food sources and landscapes within a habitat change. Because of this, many birds have trouble finding food and shelter in habitats where they previously thrived (TODO: LINK).

The [National Centers for Environmental Information Website](https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/national/time-series/110/tavg/12/0/1966-2022?base_prd=true&begbaseyear=1970&endbaseyear=2022) provides National Time Series data for a variety of metrics. For this project, we want the Average Temperature on a 12-Month timescale (considering All Months) for the years 1966-2022. Download the data as a CSV file and store it where desired.
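The downloaded CSV can then be turned into a simple year-by-year table. This sketch assumes the file begins with a few descriptive lines followed by a header row starting with `Date,`, that `Date` is formatted as YYYYMM, and that the temperature sits in a `Value` column; verify each of these against your own download. The function name is our own.

```python
import io

import pandas as pd


def parse_temperature_csv(text):
    """Parse the text of a temperature time series CSV into yearly rows.

    Assumes descriptive lines precede a header row starting with "Date,"
    and that Date is formatted YYYYMM. Verify against your file.
    """
    lines = text.splitlines()
    header_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("Date,")), None
    )
    if header_idx is None:
        raise ValueError("No header row starting with 'Date,' was found")
    df = pd.read_csv(io.StringIO("\n".join(lines[header_idx:])))
    # Keep only the year portion of the YYYYMM date.
    df["Year"] = df["Date"].astype(str).str[:4].astype(int)
    return df[["Year", "Value"]].rename(columns={"Value": "avg_temp"})
```

Locating the header row by its content, rather than skipping a fixed number of lines, keeps the sketch working if the number of descriptive lines differs.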

**Imports**

```python
# Imports

# Data Collection, Storage/Manipulation
import pandas as pd
import numpy as np
import geopandas as gpd
import cartopy.crs as ccrs
import cartopy
from shapely.geometry import Point
import requests  # TODO

# Data Visualization
import matplotlib.pyplot as plt

# Data Analysis/Hypothesis Testing
from sklearn.linear_model import LinearRegression

# Additional Useful Imports
import os
import datetime  # TODO
```


[Pandas](https://pandas.pydata.org/) is a versatile and expansive data analysis and manipulation tool built for Python. Pandas incorporates useful data structures like data frames and data series. We will make extensive use of Pandas for this project.

[Numpy](https://numpy.org/) is a widely used package for scientific computing.

[Geopandas](https://geopandas.org/en/stable/) and [Cartopy](https://scitools.org.uk/cartopy/docs/latest/) are used primarily for working with geospatial data and maps in Python. [Shapely](https://pypi.org/project/shapely/) also allows for more effective manipulation and analysis of geometric objects.

[Requests](https://pypi.org/project/requests/) is an HTTP library used for sending web requests, for example to download data from websites.

[Matplotlib](https://matplotlib.org/) and [PyPlot](https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html) are used for plotting and visualization in Python.

[scikit-learn's Linear Regression Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) is used to perform Ordinary Least Squares Linear Regression.

[OS](https://docs.python.org/3/library/os.html) is for dealing with operating system interfaces and filepaths.

[Datetime](https://docs.python.org/3/library/datetime.html) is used for working with dates and times.



## Data Processing
In Data Processing, we clean the data from our sources, adjust them for usability, and combine them. One particularly useful piece of information for combining datasets is the FIPS code, which uniquely identifies each county in the United States. As you will see, we are able to use the coordinates of each bird route in the North American Breeding Bird Survey Dataset to find the county, and therefore the FIPS code, of each route. Many environmental datasets use FIPS codes and/or longitude and latitude coordinates, so we are able to find environmental data associated with each route.
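A county FIPS code is five digits: a two-digit state code followed by a three-digit county code. Because some datasets store it as an integer (dropping leading zeros), it helps to normalize it before joining. The sketch below illustrates only that normalization and the join; the helper names and the `FIPS` column name are assumptions, and the coordinate-to-county lookup is not covered here.

```python
import pandas as pd


def make_fips(state_code, county_code):
    """Build a 5-digit county FIPS string: 2-digit state + 3-digit county."""
    return f"{int(state_code):02d}{int(county_code):03d}"


def attach_by_fips(routes_df, env_df, fips_col="FIPS"):
    """Left-join county-level environmental data onto routes by FIPS.

    Assumes FIPS values are stored as integers or digit strings.
    """
    left = routes_df.copy()
    right = env_df.copy()
    # Compare FIPS as zero-padded strings so "01001" and 1001 match.
    left[fips_col] = left[fips_col].astype(str).str.zfill(5)
    right[fips_col] = right[fips_col].astype(str).str.zfill(5)
    return left.merge(right, on=fips_col, how="left")
```

For example, `make_fips(24, 33)` gives `"24033"`, the code for Prince George's County, MD. A left join keeps every route, leaving missing values where a county has no environmental data.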

### Processing North American Breeding Bird Survey Dataset

### Processing Global Forest Watch Deforestation Data

### Processing National Center for Health Statistics Urban-Rural Classification (By County)

### Processing Open Energy Data Initiative: City and County Commercial Building Inventories

### Processing National Centers for Environmental Information: Average Temperature Time Series

## Exploratory Data Analysis and Visualization
Now that we have all of this data, we want to develop an understanding of what the data tells us and how different factors interact with each other. This section is largely exploratory, meaning we are playing with some different visualizations and relationships to determine what questions we want to ask and what we want to consider in our models. Some of the visualizations are also useful for outreach purposes; the ways in which data is presented dictate how people interpret it. In other words, a data scientist focusing on outreach must carefully think about how they want data to be received and understood to powerfully impact the target audience.

We are interested in both spatial and temporal relationships in the data. In other words, we want to understand how our metrics (bird population/population decline, environmental factors) vary both in time and across the country. We therefore make extensive use of plots showing time series (with date or some analogous metric on the independent axis) and of color gradient plots showing how certain metrics vary over a map of the United States.
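As a hedged example of the time series view, the sketch below averages the birds counted per route in each year and plots the result. The column names `Year`, `RouteID`, and `Count` are hypothetical stand-ins for whatever the processed bird data uses; `RouteID` is assumed to identify a route uniquely across states.

```python
import matplotlib.pyplot as plt
import pandas as pd


def mean_count_per_route(counts_df):
    """Average total birds counted per route for each year.

    Dividing by the number of routes surveyed keeps years with more
    routes from looking like population increases.
    """
    per_route = counts_df.groupby(["Year", "RouteID"])["Count"].sum()
    return per_route.groupby(level="Year").mean()


def plot_yearly_trend(yearly, ax=None):
    """Line plot of a year-indexed Series; returns the Axes."""
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(yearly.index, yearly.values, marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Mean birds counted per route")
    return ax
```

Averaging per route rather than summing raw counts is a design choice: the number of routes surveyed can differ from year to year, and a raw total would mix survey effort with population change.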

## Hypothesis Testing and Machine Learning
Now, we attempt to create several predictive models which will allow us to predict quantities like bird population and population decline based on a range of environmental factors. Such models will allow us to predict likely future trends and values based on previous data. Additionally, analyzing the data through these models will allow us to understand which environmental factors are the most significant predictors of bird population decline. Through this statistical analysis, we can begin to understand which practices are most harmful to bird populations, giving us some insight into policy decisions that could reduce population decline in the future.
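As an illustration of the kind of model described here, the sketch below fits an ordinary least squares regression with scikit-learn's `LinearRegression` and returns the fitted coefficients by feature name. The function name and any column names passed to it are hypothetical; this is not the final model used in the analysis.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_linear_model(df, feature_cols, target_col):
    """Fit OLS and return the model plus a Series of coefficients."""
    X = df[feature_cols].to_numpy(dtype=float)
    y = df[target_col].to_numpy(dtype=float)
    model = LinearRegression().fit(X, y)
    # Label each coefficient with the feature it belongs to.
    coefs = pd.Series(model.coef_, index=feature_cols)
    return model, coefs
```

A coefficient's size depends on the units of its feature, so features should be put on a common scale before their coefficients are compared, and a large coefficient is not by itself evidence of statistical significance.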

## Insight and Policy Decision
In this part of the data science pipeline, we interpret what our visualizations and models are telling us about the data. We considered a wide variety of factors and tested them for significance in their effects on bird population. 



TODO - include links throughout pointing to different aspects of data science process