# Effects of C. finmarchicus Distribution on Right Whale Population 

By Annabelle Platt  
Dedicated to Mark and the Bois

In [20]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import process_data as pro_dat
import visualizations as vis

## Introduction

### Project Goals 
In this project, I investigate how the changing distribution of the zooplankton species *Calanus finmarchicus*, the primary food source for North Atlantic Right Whales, affects sightings of those whales. I also look at whether the *C. finmarchicus* distribution has been changing over time, and whether any change lines up with a shift in whale locations. This matters because if *C. finmarchicus* is moving, and right whales are moving along with it, the environmental implications are many and serious. 

### *Calanus finmarchicus* Background
*Calanus finmarchicus* is a very important species of zooplankton, ecologically speaking. It is what is known as a keystone species, meaning that its ecosystem depends on it so much that if it were removed, the ecosystem would change dramatically. It plays a critical role in the food web and is eaten by cod, herring, shrimp, and right whales, among many others. Furthermore, it is a much-studied species, especially with regard to climate change, because it responds quickly to changing environmental factors.

### Project Significance
A move in *C. finmarchicus* distribution would imply a movement toward more favorable environmental conditions. If this shift occurs northward, warmer waters would be a plausible cause, although further analysis with temperature data would of course be necessary to verify this. Second, if *C. finmarchicus* is moving, this could potentially affect the migration of the right whales. Right whales migrate south in the winter, where calves are born in the warmer water. However, that same warm water has little in the way of nutrients, so they migrate north to nutrient-rich waters to feed in the summer. If *C. finmarchicus* is moving further north, this will mean a longer migration, less feeding time, and less calving time, which could be detrimental to the breeding of this already highly endangered species. 


### Findings 
In this project, I make map plots showing copepod density and right whale sightings year by year. I find that the copepod data was taken in inconsistent locations, making it difficult to detect any systematic shift in density. However, I also find that there has been significant *C. finmarchicus* presence at high northern latitudes since the 1930s, so it seems unlikely that a systematic northward shift is occurring. I also find that right whales do not seem to be exhibiting any kind of northward shift, and in fact seem to shift ever so slightly southward, but with only about 20 years of data in this respect it's hard to tell if this is a long-term trend. In short, the results were inconclusive. There are many ways in which this project could be extended or improved. For more on this, see the suggested improvements in the conclusion.

## Methodology

### Data Used

There are two datasets used in this project. The first contains data from an aerial wildlife survey, from 1999 - 2017 (see the readme for citations). From this, we will plot the location and number of right whale sightings over the years. 

The second dataset contains data on many species of zooplankton from plankton tows from 1934 - 2020. This data includes the density of the species, the latitude and longitude the measurements were taken at, and the date the observations were taken. From this data, we will plot the location and density of *Calanus finmarchicus* over time and compare it to the whale sightings over time. 

### Retrieving the Data

Data is downloaded using the Python `requests` library. 

The whale data is formatted as a Darwin Core Archive file, so it first needs to be downloaded and extracted using the `zipfile` and `io` libraries (both part of the Python standard library). The .txt file inside can then be read into a .csv file, using the `python-dwca-reader` library. For install instructions, see the readme. 
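To make the download-and-extract step concrete, here is a minimal sketch using only `zipfile` and `io` for the unpacking. The function names are my own for illustration and are not necessarily what `retrieve_data.py` uses, and the archive URL is left as a parameter rather than guessed.

```python
import io
import zipfile
from pathlib import Path


def extract_archive(zip_bytes, dest_dir):
    """Unpack an in-memory zip archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # BytesIO lets zipfile treat the downloaded bytes like a file on disk
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_and_extract(url, dest_dir):
    """Fetch a zip archive over HTTP and unpack it (needs the requests library)."""
    import requests  # imported here so extract_archive works without it

    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop early on a failed download
    return extract_archive(response.content, dest_dir)
```

Keeping the extraction separate from the download means the unpacking logic can be tried on any zip file already on disk, without touching the network.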

The copepod data is in a .txt file from the NOAA's plankton database that can be downloaded and written straight to a .csv. 

Due to licensing constraints, the dataset .csv files are not tracked in this GitHub repo. However, both .csvs used in this project can be obtained by running the script `retrieve_data.py`. Once this is done, the processing section below reads the data into Pandas dataframes and cleans it up for use. 

### Processing Data

Once all the data has been written to the .csv files, the aerial survey data can be read directly into a Pandas dataframe. We will read only the columns that contain relevant data for us, but the remaining columns can be accessed at any time, since we are not altering the original .csv. We can then filter the dataframe to only observations of North Atlantic Right Whales.

In [21]:
# Read the data into a Pandas dataframe
whale_survey = pd.read_csv("data/whale_df.csv", 
                           usecols=["id", "recordNumber", "individualCount", 
                                    "vernacularName", "verbatimEventDate", 
                                    "decimalLatitude", "decimalLongitude",
                                    "scientificName"])

# Assign a smaller dataframe containing only the right whale observations
right_whales = pro_dat.quick_filter(whale_survey, "vernacularName", "North Atlantic Right Whale")

# Split the date and time into separate columns
right_whales = pro_dat.split_column(right_whales, " ", "verbatimEventDate", 
                                   ["date", "time"], reset_index=True)

# Now split the date into year, month, and day
right_whales = pro_dat.split_column(right_whales, "-", "date",
                                    ["year", "month", "day"])

# Sort so the date is in chronological order
right_whales = right_whales.sort_values(by=["year", "month", "day"])
right_whales

Unnamed: 0,index,id,recordNumber,individualCount,decimalLatitude,decimalLongitude,scientificName,vernacularName,time,year,month,day
1262,40249,513_1616,513_1616,2.0,42.699500,-70.006000,Eubalaena glacialis,North Atlantic Right Whale,13:10:59,1999,03,26
959,39946,513_1688,513_1688,1.0,43.117500,-69.680000,Eubalaena glacialis,North Atlantic Right Whale,12:00:26,1999,04,25
962,39949,513_1689,513_1689,2.0,43.128833,-69.606000,Eubalaena glacialis,North Atlantic Right Whale,12:18:02,1999,04,25
983,39970,513_1690,513_1690,1.0,43.151500,-69.594333,Eubalaena glacialis,North Atlantic Right Whale,12:31:35,1999,04,25
1020,40007,513_1691,513_1691,1.0,43.148833,-69.589000,Eubalaena glacialis,North Atlantic Right Whale,12:32:33,1999,04,25
...,...,...,...,...,...,...,...,...,...,...,...,...
1137,40124,513_90508,513_90508,10.0,47.545400,-64.121500,Eubalaena glacialis,North Atlantic Right Whale,12:55:22,2017,07,26
1,38988,513_90590,513_90590,2.0,48.042010,-62.930500,Eubalaena glacialis,North Atlantic Right Whale,15:34:25,2017,07,29
515,39502,513_90618,513_90618,7.0,47.766370,-63.894520,Eubalaena glacialis,North Atlantic Right Whale,10:12:01,2017,07,29
786,39773,513_90575,513_90575,24.0,47.675000,-63.968480,Eubalaena glacialis,North Atlantic Right Whale,08:56:45,2017,07,29
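
For readers without `process_data.py` at hand, the same filter-then-split idea can be sketched in plain Pandas. This is an alternative under my own assumptions, not the code behind `quick_filter` or `split_column`: it uses a boolean mask for the filter and `pd.to_datetime` for the date parts, on a tiny made-up table.

```python
import pandas as pd

# Made-up rows in the same shape as the survey columns used above
sightings = pd.DataFrame({
    "vernacularName": ["North Atlantic Right Whale", "Fin Whale",
                       "North Atlantic Right Whale"],
    "verbatimEventDate": ["1999-04-25 12:00:26", "1999-04-25 12:05:00",
                          "1999-03-26 13:10:59"],
})

# Boolean mask: keep only the rows whose name matches exactly
mask = sightings["vernacularName"] == "North Atlantic Right Whale"
right_only = sightings[mask].copy()

# Parse the timestamp once, then pull out integer date parts
stamp = pd.to_datetime(right_only["verbatimEventDate"])
right_only["year"] = stamp.dt.year
right_only["month"] = stamp.dt.month
right_only["day"] = stamp.dt.day

# ISO-formatted timestamps sort chronologically even as plain strings
right_only = right_only.sort_values("verbatimEventDate").reset_index(drop=True)
```

One difference worth knowing: splitting on `-` leaves the year, month, and day as strings such as `"03"`, while the datetime route gives integers.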


The copepod data needs a bit more processing to convert to a .csv. Fortunately, most of its formatting oddities occur in columns that don't contain data we care about, so we can deal with it as we did before, by only reading specific columns into our dataframe. Once again, all the original data is still contained within the .csv and we can quickly add or remove columns from the dataframe at any time. 

In [22]:
# Read specific columns of the copepod data into a dataframe. 
copepods = pd.read_csv("data/copepod_data.csv", header=3, 
                       skiprows=[4], index_col=False,
                       usecols=["YEAR", "MON", "DAY", "TIMEgmt", "TIMEloc",
                                "LATITUDE", "LONGITDE", "LIF", "PSC", "SEX",
                                "V", "Water Strained", "Original-VALUE",
                                "Orig-UNITS", "VALUE-per-volu", "UNITS",
                                "F1", "F2", "F3", "F4", "VALUE-per-area",
                                "SCIENTIFIC NAME -[ modifiers ]-"], 
                       low_memory=False)
copepods.head()

Unnamed: 0,YEAR,MON,DAY,TIMEgmt,TIMEloc,LATITUDE,LONGITDE,LIF,PSC,SEX,...,Original-VALUE,Orig-UNITS,VALUE-per-volu,UNITS,F1,F2,F3,F4,VALUE-per-area,SCIENTIFIC NAME -[ modifiers ]-
0,1995,6,11,20.117,-99.0,-22.43,-41.02,0,0,0,...,423.025,#/m3,423.025,#/m3,0,0,0,-1,10068.0,Copepoda -[ ]-
1,1995,6,11,20.117,-99.0,-22.43,-41.02,0,0,0,...,28.5692,g/m3,,----,0,0,0,-1,,Copepoda -[ ]-
2,1995,6,11,20.117,-99.0,-22.43,-41.02,0,0,0,...,5.0014,g/m3,,----,0,0,0,-1,,Copepoda -[ ]-
3,1995,6,11,20.117,-99.0,-22.43,-41.02,0,0,0,...,4.6246,mg/m3,,----,0,0,0,-1,,Copepoda -[ ]-
4,1995,6,12,3.9,-99.0,-23.35,-41.08,0,0,0,...,891.2492,#/m3,891.2492,#/m3,0,0,0,-1,98305.0,Copepoda -[ ]-
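
The `header` and `skiprows` arguments do most of the work in that call, so here is a toy version of how they interact. The file layout below is invented for illustration; I am only assuming the real file has a few preamble lines, then the column names, then one non-data row directly beneath them.

```python
import io

import pandas as pd

# Made-up stand-in for the raw file: three preamble lines, the column names,
# a row of dashes under the names, then the observations.
raw_text = (
    "toy plankton export\n"
    "second preamble line\n"
    "third preamble line\n"
    "YEAR,LATITUDE,NOTES\n"
    "----,--------,-----\n"
    "1995,-22.43,first tow\n"
    "1995,-23.35,second tow\n"
)

toy = pd.read_csv(
    io.StringIO(raw_text),
    header=3,        # the column names sit on line 3, counting from 0
    skiprows=[4],    # drop the row of dashes right below them
    usecols=["YEAR", "LATITUDE"],  # ignore the free-text column
)
```

Because the messy free-text column is never loaded, its formatting oddities can't interfere with the numeric columns.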


Notice that the column `SCIENTIFIC NAME -[ modifiers ]-` contains two sets of data: the scientific name, and the modifiers, which are things like stage of life. Because we will want to filter by scientific name for *Calanus finmarchicus* later, we need to split this column into two, with the scientific name in one and the modifiers in the other. Fortunately, the modifiers are clearly marked within the `-[ ]-` structure. Note that the separator includes the space before the `-` symbol. Without it, every scientific name would keep a trailing space, since the name and the modifier block are separated by a space. 

Once done, we can actually drop the modifiers column since we won't be using it. 

In [23]:
copepods_split = pro_dat.split_column(copepods, " -",
                                      "SCIENTIFIC NAME -[ modifiers ]-",
                                      ["scientific_name", "modifiers"], n=1)
copepods_split = copepods_split.drop("modifiers", axis=1)
copepods_split

Unnamed: 0,YEAR,MON,DAY,TIMEgmt,TIMEloc,LATITUDE,LONGITDE,LIF,PSC,SEX,...,Original-VALUE,Orig-UNITS,VALUE-per-volu,UNITS,F1,F2,F3,F4,VALUE-per-area,scientific_name
0,1995,6,11,20.117,-99.0,-22.430,-41.02,0,0,0,...,423.0250,#/m3,423.0250,#/m3,0,0,0,-1,10068.,Copepoda
1,1995,6,11,20.117,-99.0,-22.430,-41.02,0,0,0,...,28.5692,g/m3,,----,0,0,0,-1,,Copepoda
2,1995,6,11,20.117,-99.0,-22.430,-41.02,0,0,0,...,5.0014,g/m3,,----,0,0,0,-1,,Copepoda
3,1995,6,11,20.117,-99.0,-22.430,-41.02,0,0,0,...,4.6246,mg/m3,,----,0,0,0,-1,,Copepoda
4,1995,6,12,3.900,-99.0,-23.350,-41.08,0,0,0,...,891.2492,#/m3,891.2492,#/m3,0,0,0,-1,98305.,Copepoda
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
588890,1970,9,8,17.000,-99.0,1.215,103.83,9,2,0,...,96472.,#/haul,557.1908,#/m3,0,0,0,-1,,Copepoda
588891,1970,9,24,17.000,-99.0,1.215,103.83,9,2,0,...,14771.,#/haul,,-----,-6,-4,-2,-1,,Copepoda
588892,1970,10,8,17.000,-99.0,1.215,103.83,9,2,0,...,73069.,#/haul,306.0097,#/m3,0,0,0,-1,,Copepoda
588893,1970,10,22,17.000,-99.0,1.215,103.83,9,2,0,...,61104.,#/haul,482.6540,#/m3,0,0,0,-1,,Copepoda
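
The split itself can be written with Pandas string methods. The helper below is a hypothetical stand-in I wrote for illustration, assuming `split_column` behaves roughly like `str.split` with `expand=True`; the project's own function may differ. The modifier text `adult` in the toy rows is made up.

```python
import pandas as pd


def split_column_sketch(df, sep, column, new_names, n=-1):
    """Replace `column` with new columns made by splitting its text on `sep`.

    Hypothetical stand-in for illustration, not the project's own helper.
    """
    # expand=True returns one DataFrame column per piece; n caps the number of splits
    parts = df[column].str.split(sep, n=n, expand=True)
    parts.columns = new_names
    return pd.concat([df.drop(columns=column), parts], axis=1)


toy = pd.DataFrame({
    "YEAR": [1995, 1938],
    "SCIENTIFIC NAME -[ modifiers ]-": ["Copepoda -[ ]-",
                                        "Calanus finmarchicus -[ adult ]-"],
})
split = split_column_sketch(toy, " -", "SCIENTIFIC NAME -[ modifiers ]-",
                            ["scientific_name", "modifiers"], n=1)
names = split["scientific_name"].tolist()
```

Passing `n=1` matters here: the modifier block ends with another `-`, so an unlimited split would produce more pieces than there are new column names.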


Finally, we want to filter our data to include only observations of *C. finmarchicus*. So that our time scale is chronological when we visualize, we also sort the values by date. Once the cell below has run, the only species of copepod left in the dataframe is *C. finmarchicus*, and with all our data in a workable format we can start visualizing it. 

In [24]:
# Filter to only C. finmarchicus observations
cal_fin = pro_dat.quick_filter(copepods_split, "scientific_name", "Calanus finmarchicus")
# Rename these columns to values we can actually call
cal_fin = cal_fin.rename(columns={"VALUE-per-area": "value_per_area", 
                                  "VALUE-per-volu": "value_per_volume"})

# Drop any n/a or "null" values. This is a pain because Pandas doesn't recognize
# them as n/a so we need our own filtering function
cal_fin["value_per_volume"] = cal_fin["value_per_volume"].str.strip()
cal_fin = pro_dat.anti_filter(cal_fin, "value_per_volume", "n/a")
cal_fin = pro_dat.anti_filter(cal_fin, "value_per_volume", "null")

# Cast the density as a float
cal_fin = cal_fin.astype({"value_per_volume": "float64"})

# Sort the values in ascending order by date
cal_fin = cal_fin.sort_values(by=["YEAR", "MON", "DAY"])

cal_fin

Unnamed: 0,YEAR,MON,DAY,TIMEgmt,TIMEloc,LATITUDE,LONGITDE,LIF,PSC,SEX,...,Original-VALUE,Orig-UNITS,value_per_volume,UNITS,F1,F2,F3,F4,value_per_area,scientific_name
96798,1938,7,9,99.990,-99.000,81.180,137.420,10,2,0,...,60.,#/haul,3.721,#/m3,0,0,0,-1,558.1,Calanus finmarchicus
96800,1938,7,9,99.990,-99.000,81.180,137.420,10,2,0,...,14.,#/haul,1.302,#/m3,0,0,0,-1,130.2,Calanus finmarchicus
96802,1938,7,9,99.990,-99.000,81.180,137.420,25,2,0,...,1.,#/haul,0.062,#/m3,0,0,-4,-1,9.3,Calanus finmarchicus
96804,1938,7,9,99.990,-99.000,81.180,137.420,25,2,0,...,1.,#/haul,0.093,#/m3,0,0,-4,-1,9.3,Calanus finmarchicus
96806,1938,7,9,99.990,-99.000,81.180,137.420,25,2,0,...,1.,#/haul,0.093,#/m3,0,0,-4,-1,9.3,Calanus finmarchicus
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
568452,2001,6,9,7.567,7.567,38.925,-71.996,0,0,0,...,100.43,#/m3,100.430,#/m3,0,0,0,-1,,Calanus finmarchicus
568482,2001,6,9,9.917,9.917,38.447,-71.402,0,0,0,...,33.48,#/m3,33.480,#/m3,0,0,0,-1,,Calanus finmarchicus
568618,2001,10,13,0.150,0.150,40.376,-73.756,0,0,0,...,100.43,#/m3,100.430,#/m3,0,0,0,-1,,Calanus finmarchicus
568665,2001,10,13,12.950,12.950,37.655,-70.593,0,0,0,...,33.48,#/m3,33.480,#/m3,0,0,0,-1,,Calanus finmarchicus
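
As an aside, Pandas can do the placeholder cleanup and the float cast in one step with `pd.to_numeric`. This is a sketch of an alternative, not what `anti_filter` does internally, and the sample values are made up to mimic padded entries.

```python
import pandas as pd

# Made-up density column with padded numbers and text placeholders
raw = pd.DataFrame({"value_per_volume": [" 3.721", "n/a", "null ", "100.430"]})

cleaned = raw.copy()
# errors="coerce" turns anything that is not a number into NaN
cleaned["value_per_volume"] = pd.to_numeric(
    cleaned["value_per_volume"].str.strip(), errors="coerce"
)
# Drop the rows whose density could not be parsed
cleaned = cleaned.dropna(subset=["value_per_volume"])
```

The trade-off is that coercion silently discards every unparseable value, so it is worth counting how many rows disappear before trusting the result.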


## Results

Let's plot the *C. finmarchicus* map first. Due to the creature's small size, its distributions are generally described by a density, so a density plot will display this data well here. We can call the `make_density_plot` function from `visualizations.py` to do this.

This map is created using Plotly Express, the high-level interface to the Plotly graphing library, whose concise style will feel familiar to users of R's `ggplot2`. It is intended for quickly iterating on graphs, and as such the parameters frequently change as the plot is improved. For this reason, it doesn't make sense to put the actual visualization call in a function, since the function can't accommodate all the possible plotting arguments, and every time the function is updated the kernel needs to be restarted. Accordingly, I have placed the figure call below. Feel free to mess around with any of the styling parameters. 

What does make sense, however, is to write a function to add the map and show the figure, which I have done in the `visualizations.py` file, and called below.

In [36]:
copepod_fig = px.density_mapbox(cal_fin, lat="LATITUDE", lon="LONGITDE", 
                                radius=7,
                                zoom=.75,
                                animation_frame="YEAR",
                                animation_group="value_per_volume",
                                height=600,
                                mapbox_style="stamen-terrain")
vis.add_map(copepod_fig)

In this figure, we encounter a massive limitation of our data - the collection locations are all over the place. This makes it nearly impossible to discern any systematic movement over time. 

What is clear, however, is that there has been significant, consistent *C. finmarchicus* presence at high northern latitudes since at least the 1930s. In the 1930s we can see a large clump above Russia, and in the 1960s a large bunch above Scandinavia. In 1991 we can see a large clump near Greenland that increases over the next two years, which could suggest a shift, but in 1986 we see a clump much further north, so this would actually represent a southward shift. In any case, it is clear that *C. finmarchicus* at high latitudes is nothing new. 

The lack of data in consistent locations makes it very difficult to tell if there is any kind of shift present, northward or otherwise. We do see a consistent presence of *C. finmarchicus* off the northeast coast of the US and Canada over the years, which, as we will see in the graph below, is consistent with where the whales tend to be sighted. However, there is no evidence of a long-term shift in the density of *C. finmarchicus*. 
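
One way to put a number on "is the distribution moving?" is a density-weighted mean latitude per year. The sketch below is something I did not run on the real data; the function name and the toy values are invented, and the column names simply mirror the ones used above.

```python
import pandas as pd


def weighted_mean_latitude(df, lat_col, weight_col, year_col):
    """Density-weighted mean latitude for each year."""
    # Weight each latitude by the density measured there
    weighted = df[lat_col] * df[weight_col]
    totals = weighted.groupby(df[year_col]).sum()
    weights = df[weight_col].groupby(df[year_col]).sum()
    return totals / weights


# Made-up observations: two tows per year
toy = pd.DataFrame({
    "YEAR": [1999, 1999, 2000, 2000],
    "LATITUDE": [40.0, 44.0, 38.0, 42.0],
    "value_per_volume": [1.0, 3.0, 2.0, 2.0],
})
centroids = weighted_mean_latitude(toy, "LATITUDE", "value_per_volume", "YEAR")
```

This does not fix the underlying problem: if the tows themselves move from year to year, the centroid moves with them, so it would only be meaningful within a consistently sampled region.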

Now let's plot the whale data. Because whales are described discretely, we can use a map scatterplot for this. We can assign the variable `individualCount` to make the size of each dot represent how many whales were sighted at each location. We can also make the color of the dot represent the month in which the sighting was recorded. 

As above, I have placed the map plot creation below for quick improvement access. The same function is called to create the map as for the previous figure. 

In [35]:
whale_fig = px.scatter_mapbox(right_whales, lat="decimalLatitude", lon="decimalLongitude",
                        animation_frame="year",
                        animation_group="individualCount",
                        hover_data=["month"],
                        size="individualCount",
                        color="month",
                        title="Right Whale Distribution Over Time",
                        zoom=4, height=600,
                        mapbox_style="stamen-terrain")

vis.add_map(whale_fig)

Although this data is limited, since it comes from an aerial survey, which is not a reliable population indicator, this plot still reveals some interesting things. 

These results are actually somewhat baffling. In general, instead of a northward trend in whale sightings, we actually see a southward trend. Up until about 2008, there were usually several whale sightings north of the Cape Cod hook of Massachusetts, but only one or two below. There were usually some fairly far above, near the Bay of Fundy. However, around 2006 we start to see a shift, with more whales appearing below Cape Cod and fewer near the Bay of Fundy. Just to confound all this, in the final year of available data we see several large clumps of whales much further north in the Gulf of St. Lawrence, in June and July. 

I was curious whether any part of these shifts was simply due to when the data was collected - maybe if in earlier years it was collected later in the season, the whales hadn't yet migrated as far north as the Bay of Fundy. This is why I plotted the color of each dot as the month the sighting was observed in. In general there does seem to be a trend northward in later months - this is especially clear in 2003, 2010, and 2014. But then there are years like 1999 - 2003, 2006 - 2009, and 2012, where the sightings in each month are all over the map, or all clustered together. The more neatly grouped years seem to be anomalies, although we would need more data to verify this. However, the Gulf of St. Lawrence groups were spotted in June and July, later on in the season, which would support the hypothesis that the lower latitude shift *might* be due to earlier seasonal observations. Given all this contradictory evidence, the only thing that could clear this up is more and better data. 
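
The month-versus-latitude question could also be checked in a table rather than by eye. Here is a rough sketch with made-up sightings, assuming the month has been cast to an integer first (after the string split above it would be text such as `"03"`).

```python
import pandas as pd

# Made-up sightings: year, month, and latitude only
toy = pd.DataFrame({
    "year": [2003, 2003, 2003, 2010, 2010],
    "month": [4, 4, 7, 5, 8],
    "decimalLatitude": [41.0, 42.0, 44.5, 41.5, 45.0],
})

# One row per year, one column per month, mean sighting latitude in each cell
lat_by_month = toy.pivot_table(index="year", columns="month",
                               values="decimalLatitude", aggfunc="mean")
```

Reading across a row shows whether sightings sit further north as the season goes on; cells with no sightings come out as NaN, which makes gaps in survey timing easy to spot.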

Unfortunately, the copepod data only goes up until 2001, while the whale data begins in 1999. If the copepod data had exhibited any noticeable trend, we might have been able to see the pattern continued in the whale data. Since it didn't, we can't. However, we can look at the three overlapping years of data that we have. In 1999, we actually see the May whale observations line up nicely with the *C. finmarchicus* distribution along the east coast. However, the measurements in 2000 and 2001 are in a neat line far below where the whales were seen. From this, we can conclude that our copepod data, due to the measurements being taken all over the place, is effectively useless for the kind of comparisons we're trying to make.  
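
Finding the overlapping years programmatically takes one small precaution: the copepod `YEAR` column is numeric, while the whale `year` column came out of a string split and so holds text. A minimal sketch with made-up year values:

```python
import pandas as pd

copepod_years = pd.Series([1938, 1999, 2000, 2001])        # integers, like YEAR
whale_years = pd.Series(["1999", "2000", "2001", "2017"])  # strings after the split

# Cast the strings to integers before comparing, or nothing will match
overlap = sorted(set(copepod_years.tolist()) & set(whale_years.astype(int).tolist()))
```

Skipping the cast would give an empty overlap, because `1999` and `"1999"` are not equal in Python.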

As a side note, I noticed there were more, and larger, groups of whales sighted in 2002 than in the previous and following years. I wonder if this may have something to do with 9/11. A study conducted in 2012 (linked in the citation section of the readme) found that the reduction in noise from shipping lanes due to low traffic after 9/11 reduced stress in North Atlantic Right Whales. While the sightings shown here may have been documented too long after 9/11 for the shipping lanes to still be quieter (by hovering over the dots on the map, we can tell they were observed between April and December), I wonder if the higher number of whale sightings, and the larger groups, are related to the decrease in noise.  
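
An impression like "more and larger groups in 2002" can be checked with a per-year summary. The numbers below are invented purely to show the shape of the calculation; they are not the survey's values.

```python
import pandas as pd

# Made-up sightings: one row per sighting, with the group size
toy = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002, 2002],
    "individualCount": [1.0, 2.0, 4.0, 6.0, 5.0],
})

# Number of sightings, total whales, and mean group size for each year
yearly = toy.groupby("year")["individualCount"].agg(
    sightings="count", whales="sum", mean_group="mean"
)
```

A summary like this still says nothing about survey effort, so a year with more flights would look busier even if the whales hadn't changed at all.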

## Conclusion

### Insights  
Overall, the results of this analysis are thoroughly inconclusive. It is impossible to find any kind of pattern in the *C. finmarchicus* data given the inconsistent sampling locations. There is no evidence of systematic copepod movement in any direction, and only loose, inconclusive evidence of any kind of movement from the whales. With under 20 years of whale data, we can't tell if the slight patterns detected are evidence of a long-term trend or just slight oscillations in a regular pattern. Furthermore, the whale data is from an aerial survey, which does not accurately represent the whale population and, to a lesser degree, *may* fall victim to the same problem as the copepod data, being dependent on where the plane flies to collect data. 

In the end, the few years of overlap between the two datasets didn't end up mattering, since the copepod data was so mixed up anyway there was no way to draw conclusions from it. 

### Contextual Implications

Although this analysis was inconclusive, the question being investigated remains an important one. Determining *if* any kind of shift (or diminishment) is occurring in *C. finmarchicus* as the ocean warms and acidifies, and what the impact on North Atlantic Right Whales would be, is key to protecting a critically endangered species. 

### Lessons Learned

One of my biggest personal takeaways from this project is that "publicly available" and "publicly accessible" data are not necessarily the same thing. As a government agency, NOAA is required to make most of its data publicly available, and indeed, they do have many databases containing hundreds of gigabytes of data. However, much of the data is in enormous files that are in a format inaccessible without special software. 

In the same vein, finding data on right whale population proved surprisingly difficult as well. Although the North Atlantic Right Whale Catalog is technically publicly available, it's nearly impossible to access without belonging to or being affiliated with some sort of research group or institution. The Continuous Plankton Recorder data is similarly restricted. In addition, although right whale sighting data was available in live map format, the data itself was not downloadable, even through HTML. This is the reason I ended up using a relatively small aerial survey to estimate population - it was the only thing I could find. 

My point in each of these cases is that it's incredibly difficult for casually interested observers to get access to this kind of information and draw their own conclusions. When people say to "do your own research", taking this suggestion literally turns out to be an incredible time sink, and in some cases not to be possible at all. This further widens the divide between the scientific community and everyday people, a divide that has been growing like an unhealthy tumor lately. Scientists are notoriously bad at communicating their findings effectively, and if their data is not readily available, it becomes even more difficult to discern fact from poor media coverage. Although this isn't coding related, I would say this is the most important, and troubling, lesson I learned about freedom of information during this project. 

### Challenges

As mentioned before, finding whale data was incredibly difficult. It turns out that, first of all, we haven't been studying whales for very long, and secondly, the data we do have is very difficult to access without being affiliated with a research group. I spent hours trying to find any kind of population data I could on right whales, and eventually ended up using the only dataset I could find, an aerial survey. 

Once I found datasets, reading and writing them, and then cleaning them, turned out to be more difficult than expected. It turns out real world data is messy, full of typos, and sometimes formatted inconveniently. Wrangling it into a usable dataframe is hard and time consuming. I also found it frustrating, having been working with R and the ggplot2 library recently, not being able to perform functions that I knew I could very quickly accomplish in R but that forced me to write long, laborious functions to accomplish in Python. However, it taught me a good lesson about how much processing happens under the hood for those quick functions I use without thinking about. 

Of course, it's also frustrating to do all this data wrangling and analysis and then find the data totally inconclusive, but I think it's more important not to impose false connections on the data just because I felt I needed to have connections for this project to be worthwhile. That's how data is sometimes: sometimes what you're looking for just isn't there, and it doesn't do any good to fool yourself into thinking it is. 

### Extensions

Given more time, I would try to find more *C. finmarchicus* data beyond 2001, hopefully taken from a consistent area. I would also try to find some data on sea surface temperature (SST) and perhaps other environmental factors, and see if there was any connection between those variables and both copepod density and whale population. I would also like to find proper whale population data, maybe attempting to access the official catalog. At a bare minimum, I think it would be interesting to explore the rest of the data above. There are many variables I didn't get a chance to play around with. I think it would be interesting to experiment with species other than right whales and *C. finmarchicus*, analyze other variables, or try visualizing the data differently. There are endless jumping-off points from here. 