<b><font size=20 color='#A020F0'>Homework 2</font></b>

#### In this homework you'll further explore pandas by working with oceanographic research cruise data from the Arctic

<b><font color='red'>Due Date: 10 October 2022</font></b><br>(by the beginning of class)

<b>How you will turn in this assignment</b><br> When you are ready to turn in your homework, do the following steps:
1. Execute all cells in your notebook so that the results are visible, and save one more time. It is ok if you have code that you practiced with, but <b><u>make sure your final answers to each question are clearly marked so that your TA and I know what to grade</u></b>. (You can also collapse the code and outputs that you _don't_ want us to grade; options to collapse and expand code are in the 'View' menu in the upper left)
2. Open a terminal and navigate to your local `aos573_completed_assignments` repository and make a new directory called `completed_HW2`
3. Move your completed jupyter notebook into this directory
4. `add` and `commit` the `completed_HW2` directory and its contents to your local `aos573_completed_assignments` repository
5. Finally, `push` your changes to your remote `aos573_completed_assignments` repository: `git push finished_work main` (you'll need to enter your username and personal access token)

---

## Part 1: Getting the data and summarizing it
The data we'll be using for this homework is from the [Global Ocean Data Analysis Project (GLODAP)](https://www.glodap.info/), which is a massive, global synthesis of quality-controlled ocean biogeochemical observations. GLODAP contains data from thousands of cruises from the early 1970s to the present. Here we'll only work with data from the Arctic Ocean.

Run the following commands to download and unzip the GLODAP Arctic Ocean data:
```bash
!curl -O https://www.glodap.info/glodap_files/v2.2021/GLODAPv2.2021_Arctic_Ocean.csv.zip
!unzip GLODAPv2.2021_Arctic_Ocean.csv.zip
```

The dataset has far more variables than you'll be working with in this assignment, so to help you out, below I've given you the code to read in only a subset of the variables, and I've renamed the columns slightly. If you want to compare what I've done here to the unsubsetted dataset, feel free to read the entire thing in on your own with `pandas.read_csv()`.

```python
import pandas as pd

# Variables to keep from the full dataset
names = ['station', 'year', 'month', 'day', 'hour', 'minute', 'latitude', 'longitude',
         'depth', 'theta', 'salinity', 'salinityf', 'sigma0', 'oxygen', 'oxygenf',
         'cfc11', 'cfc11f']

# Read in only the variables above with `usecols`
df = pd.read_csv('GLODAPv2.2021_Arctic_Ocean.csv', sep=',', usecols=['G2' + i for i in names])

# Replace the original column names, which all have 'G2' prepended,
# with the shorter names in the list above
df.columns = names
```

Some dataset info that you will find helpful (or access the entirety of the [metadata](https://www.ncei.noaa.gov/data/oceans/ncei/ocads/metadata/0237935.html) if you would like):

1. year, month, day, hour, minute = sampling date and time
2. latitude/longitude = geographical coordinates of the sampling location
3. depth (m) = the depth of the water sample in meters
4. missing fill value = -9999.0 (the fill value used when data is missing)

Here is a list of the remaining variables, their descriptions, and their units:

| Variable | Description | Unit | 
| - | - | - |
| theta | potential temperature | $^\circ$C |
| salinity | salinity on the practical salinity scale | none |
| salinityf | salinity flag; 0 = interpolated, 2 = good data, 9 = missing data | none |
| sigma0 | potential density referenced to the ocean surface | kg m$^{-3}$ |
| oxygen | oxygen content | $\mu$mol kg$^{-1}$ |
| oxygenf | oxygen flag; 0 = interpolated, 2 = good data, 9 = missing data | none |
| cfc11 | chlorofluorocarbon-11 (CFC-11) content | pmol kg$^{-1}$ |
| cfc11f | CFC-11 flag; 2 = good data, 9 = missing data | none |

### Q1.1 Get some basic information about your dataframe

#### Q1.1.1 Print the summary information about your dataframe. 
1. How many samples are there for any given variable?
2. How many different datatypes are there?

#### Q1.1.2 Print the summary statistics for your dataframe
1. What is the range of latitude and longitude that the dataset covers?
2. What is the mean potential temperature? Does this seem reasonable? If not, what do you think is the cause?

### Q1.2 Replacing missing values
Use [where()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html) to replace all of the missing values with NaNs. Rerun your summary statistics from Q1.1.2. What is the mean potential temperature now? For this question, please print _only_ the new mean value of potential temperature, and not the entire summary statistics table.
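
As a minimal sketch of how `where()` behaves, the toy dataframe below (made-up values, not GLODAP data) uses the same -9999.0 fill value. `where()` keeps the entries where the condition is True and replaces the rest, with NaN by default. The names `toy` and `toy_clean` are just placeholders for this illustration.

```python
import pandas as pd

# Toy data (hypothetical values); -9999.0 marks a missing measurement
toy = pd.DataFrame({'theta': [1.5, -9999.0, -0.5],
                    'salinity': [34.8, 34.9, -9999.0]})

# where() keeps entries where the condition is True; the rest become NaN
toy_clean = toy.where(toy != -9999.0)

print(toy['theta'].mean())        # the fill value drags the mean far below zero
print(toy_clean['theta'].mean())  # NaN is skipped, so only 1.5 and -0.5 are averaged
```

Note that pandas statistics such as `mean()` skip NaN values by default, which is why replacing the fill value changes the result.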

### Q1.3 A quick look at the data


#### Q1.3.1 Show the last 8 rows of the dataset. 
This is the data collected at one stop (a 'station') on one research cruise. You can tell this because the lat/lon values don't change, nor does the date. 

#### Q1.3.2 For which variables were data not collected at this station?

#### Q1.3.3 Over what depth range were water samples collected at this station?

---

## Part 2: Working with the data

### Q2.1 Sorting

#### Q2.1.1 Make a very quick line plot of the years in the dataset
Use pandas built-in plotting and don't bother making any adjustments to the output. What does the plot tell you about how the years are arranged in your dataframe?

#### Q2.1.2 Sort your dataframe so that the years are increasing
Use your sorted dataframe for the rest of the assignment.

### Q2.2 Counting

#### Q2.2.1 How many samples were taken above and below 500 m?
Base your answer on depth, not on one of the other variables such as salinity, since some of those have missing values.

#### Q2.2.2 How many salinity, oxygen, and CFC-11 samples in the entire dataset were interpolated?
Provide your answers as percentages of the total number of samples for each variable

#### Q2.2.3 Between 1980 and 1990, how many oxygen samples were taken north of 80˚N?

#### Q2.2.4 How many distinct locations in the Arctic were sampled?
In this question, I want you to find the total number of _unique_ lat/lon _pairs_. <b> Hint:</b> Take a look at [drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
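
To illustrate the hint, here is a small sketch on made-up data (not GLODAP) showing how the `subset` argument of `drop_duplicates()` treats a combination of columns as the thing that must be unique. The dataframe `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): three depths at one station, two at another
toy = pd.DataFrame({'latitude':  [80.0, 80.0, 80.0, 75.5, 75.5],
                    'longitude': [10.0, 10.0, 10.0, 10.0, 10.0],
                    'depth':     [5.0, 50.0, 500.0, 5.0, 100.0]})

# Keep the first row for each distinct (latitude, longitude) combination
unique_pairs = toy.drop_duplicates(subset=['latitude', 'longitude'])

# Two distinct locations, even though longitude alone has only one unique value
print(len(unique_pairs))
```

Counting unique values in each column separately would not give the same answer, which is why the pair has to be considered together.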

#### Q2.2.5 In what years were the largest and smallest number of samples taken?

#### Q2.2.6 In how many years were data collected?
How does this compare to the total number of years that the dataset spans?

### Q2.3 Make a histogram of the number of samples in each month 
For this problem, you'll need to create a new dataframe using `drop_duplicates()`. The number of sample depths varies per station, so a histogram based solely on the 'month' column would be skewed by the number of depths sampled at each station. To avoid counting multiple samples across depth at the same station, what we truly want is the unique _date_ and _time_ that sampling occurred at a particular station, so you should make your histogram based on your new dataframe that ignores duplicate entries for the date and time.

No need to bother making your plot look pretty for this question--as long as you can answer the questions below then it's fine (but do make sure that your plot has one bin per month)!

1. In what month were the largest number of samples taken?
2. In what month were the smallest number of samples taken?
3. Pretend that the data you plotted in your histogram are the data you have for one year. If you took the annual mean of your data, how might you expect your results to be biased (i.e., toward what season would your annual mean skew)?

### Q2.4 Visualizing data with a box plot

#### Q2.4.1 Make a box plot of the spread in number of samples across all months for each year
Use your new dataframe from Q2.3. Your [box plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) should have one box showing the spread in monthly sample numbers for each year. Your x-axis should be the years of the dataset and your y-axis should be the months (these should be numerical). To help get you started, I've set up a bit of the code so that your plot output won't be scrunched together.  All you need to do is add a line of code that makes the box plot and then do the following:

1. Set the ax keyword argument in your boxplot function to be the name of the axis I've created below
2. Turn off the grid
3. Set the x-axis label rotation to be 45˚ 
4. Add a y-axis label
5. Add a title (but first get rid of the default suptitle by setting the suptitle of the plot to an empty string)

```python
import matplotlib.pyplot as plt

fig,ax=plt.subplots()
fig.set_size_inches(15,4)
###YOUR BOX PLOT CODE HERE####
```
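
If the `column`, `by`, and `ax` arguments of `DataFrame.boxplot()` are unfamiliar, the sketch below shows them on a tiny made-up dataframe. The names `toy`, `fig_demo`, and `ax_demo` and all values are hypothetical, chosen so they don't overwrite the `fig` and `ax` created above. Treat it as one possible pattern rather than the required solution.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical): one row per station visit
toy = pd.DataFrame({'year':  [1990, 1990, 1990, 1991, 1991, 1991],
                    'month': [6, 7, 9, 4, 8, 8]})

fig_demo, ax_demo = plt.subplots()

# One box per year showing the spread of sampling months, drawn on an existing axis
toy.boxplot(column='month', by='year', ax=ax_demo, grid=False, rot=45)

# Grouping with `by` adds an automatic figure-level suptitle; blank it out first
fig_demo.suptitle('')
ax_demo.set_title('Toy example: sampling months by year')
ax_demo.set_ylabel('Month')
```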

#### Q2.4.2 Do all years have the same median sampling month?
Do the annual data generally agree with the histogram in Q2.3?

### Q2.5 Make a set of histograms based on density

#### Q2.5.1
In oceanography, it is often instructive to look at data on density surfaces rather than by depth. In this question, you'll make a 1x3 set of subplots consisting of histograms for the variables theta, oxygen, and cfc-11, binned by potential density (sigma0). In other words, you'll be making summary plots showing the total number of samples of these variables that fall within specific density ranges. 

You'll need to do the following:

1. Group your theta, oxygen, and cfc-11 data by potential density (sigma0). Use the following array as your density bin edges: `bin_edges=np.arange(19.9,28.2,0.2)`. <b>Hint:</b> Take a look at [cut()](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) 
2. Make a set of counts for each variable based on the density bins
3. Use [matplotlib bar graphs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) to make the 1x3 set of plots. Be sure to include axis labels, titles, etc. Change the color and edgecolor of the bars and set the bar width to be the width of a single density bin. <b>Note:</b> To make these plots properly with a bar graph, you will need to compute the bin centers from the bin edges and plot your counts vs. your bin centers
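
The three steps above can be sketched on generic made-up data. Here `x` plays the role of the binning variable and `y` the variable being counted; `toy`, `toy_edges`, `toy_centers`, and the values themselves are all hypothetical. One bin is deliberately left empty to show that `observed=False` keeps it in the result with a count of zero, so the counts stay aligned with the bin centers.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data (hypothetical): binning variable x, measurement y with one missing value
toy = pd.DataFrame({'x': [0.1, 0.4, 0.6, 0.9, 1.7, 1.8],
                    'y': [1.0, np.nan, 2.0, 3.0, 4.0, 5.0]})

toy_edges = np.array([0.0, 0.5, 1.0, 1.5, 2.0])       # 4 bins of width 0.5
toy_centers = 0.5 * (toy_edges[:-1] + toy_edges[1:])  # midpoint of each bin

# Label each row with the bin its x value falls in
x_bins = pd.cut(toy['x'], bins=toy_edges)

# Count the non-missing y values in each bin (empty bins are kept as 0)
y_counts = toy.groupby(x_bins, observed=False)['y'].count()

fig_demo, ax_demo = plt.subplots()
ax_demo.bar(toy_centers, y_counts.to_numpy(), width=np.diff(toy_edges)[0],
            color='lightgray', edgecolor='black')
ax_demo.set_xlabel('x (bin centers)')
ax_demo.set_ylabel('Number of y samples')
ax_demo.set_title('Toy example: counts of y binned by x')
```

Because `count()` ignores NaN, the row with a missing `y` does not contribute to its bin.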

#### Q2.5.2 What density bin contains the largest number of samples?

#### Q2.5.3 Understanding plt.hist() vs plt.bar()
Look up the documentation for matplotlib's `hist()` function. Given the inputs of this function, why do you suppose we couldn't just feed our output from Q2.5.1 into it in order to make the histograms?