# Day 4 Challenge: Compare data from lidar and a nearby buoy (SOLUTIONS)

In today's challenge, your goal is to compare the lidar wind speed data (what we looked at this morning) to a new dataset from a nearby meteorological buoy.

The lidar is located at 39.9695, -72.160. We can grab measurements from a meteorological buoy, courtesy of [NOAA National Data Buoy Center](https://www.ndbc.noaa.gov/). [Buoy 44025](https://www.ndbc.noaa.gov/station_history.php?station=44025) is at 40.251, -73.164.

Your objectives are to:
1. Read in and process the lidar data.
2. Read in and process the buoy data.
3. Plot the processed data.
4. Compare the two datasets.

<font color='blue'>Tasks you must complete are in blue</font>, <font color='red'>while extra challenge tasks are in red</font>.

## 1. Read in and process the lidar data

In this step, repeat most of what you did this morning to read in the `lidar_winds.csv` file and prepare it for plotting.

<font color='blue'>Begin by importing the libraries that you'll be using. You'll need the same four that you used this morning: `datetime`, `numpy`, `pandas`, and `matplotlib.pyplot`. You can just copy and paste this step from the notebook we used earlier if you'd like.</font>

In [None]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<font color='blue'>Read in the tabular data using `pandas`. Since you'll have two datasets, rather than calling the dataframe `df` like we did before, name it `df_lidar` so we can distinguish between the two datasets.</font>

In [None]:
df_lidar = pd.read_csv('lidar_winds.csv')

We've already looked at this dataset and we know what it looks like, so we can move right into setting the timestamps.

<font color='blue'>Use `pd.to_datetime` to convert the data in the `timestamp` column from an `object` format to a `datetime64` format.</font>

In [None]:
df_lidar['timestamp'] = pd.to_datetime(df_lidar['timestamp'])

<font color='blue'>Next, set the dataframe index as `timestamp`.</font>

In [None]:
df_lidar = df_lidar.set_index('timestamp')

<font color='blue'>Finally, type the name of the dataframe (`df_lidar`) to view the final product.</font>

In [None]:
df_lidar

## 2. Read in and process the buoy data

Now that you've had more practice reading in data, go ahead and read in and prepare the data from the meteorological buoy.

The file is called `buoy_data.csv`, and it's the same file type as the lidar dataset. Therefore, you'll want to use `pandas` to process and analyze the data, just like you did earlier.

We've given you the column names and units to help out a bit:

Column units:
* WDIR: degT
* WSPD: m/s
* GST: m/s
* WVHT: m
* DPD: sec
* APD: sec
* MWD: degT
* PRES: hPa
* ATMP: degC
* WTMP: degC
* DEWP: degC
* VIS: mi
* TIDE: ft

<font color='blue'>Start by reading in the tabular data with `pandas`. Name the dataframe `df_buoy` so you don't get it confused with the lidar dataframe.</font>

In [None]:
df_buoy = pd.read_csv('buoy_data.csv')

<font color='blue'>Just like we did this morning, use the `.head()` command to view the first 5 lines of the dataframe.</font>

In [None]:
df_buoy.head()

This dataset has much more than just wind speed, which is the only type of measurement from the lidar. For now, let's just focus on the wind speed, which is abbreviated `WSPD` and is in units of meters per second (m/s).

Let's check out the timestamp column to see what the current data type is.

<font color='blue'>View the time column using the same method we used this morning.</font>

In [None]:
df_buoy['time']

You should notice now that the data type ("dtype") is `object`. 

<font color='blue'>Use the same methods that we used this morning and in Step 1 to turn the timestamp into a `datetime64` format.</font>

In [None]:
df_buoy['time'] = pd.to_datetime(df_buoy['time'])

<font color='blue'>Now set the index as the newly formatted timestamps.</font>

In [None]:
df_buoy = df_buoy.set_index('time')

<font color='blue'>Finally, type the name of the dataframe (`df_buoy`) to view the final product.</font>

In [None]:
df_buoy

We checked out plots of the lidar data this morning, but we haven't taken a look at the buoy data yet. Let's go ahead and plot the buoy data to see what it looks like.

<font color='blue'>Plot the buoy wind speed as a function of the index (which we already set as the time stamp). The basics for creating the plot have already been written for you.</font>

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(8,3))

# Add axis labels
ax.set_ylabel("Wind Speed [m/s]", fontsize=12)
ax.set_xlabel("Date", fontsize=12)

# Plot the data
ax.plot(df_buoy.index, df_buoy['WSPD'])

# Show the figure
plt.show()

Whoa! What's going on with those really high values? Wind speeds of around 100 m/s (over 200 mph) seem a little high... Datasets often use placeholder numbers like 99, 999, or -999 to denote missing or corrupt data, for example when the instrument malfunctioned or there was some other issue with the measurements. Regardless, those readings make it difficult to view the real record, which has wind speeds from 0 to about 20 m/s. Let's check to see what those values are by printing the maximum wind speed value.

<font color='blue'>Type out `max(df_buoy['WSPD'])` to print the maximum value for that column.</font>

In [None]:
max(df_buoy['WSPD'])

It looks like we'll need to change every value of 99 to something that `pandas` will ignore. This is important not only for plotting, but also for any statistics we use. Imagine how much higher the mean would be if we included a bunch of 99's in our calculation!

There are many ways we could handle these data points, but in this example, we'll simply replace "99" with "NaN", which stands for "not a number." NaN's are ignored when you plot or calculate statistics (though sometimes you have to specifically tell the function to ignore them).

To replace "99" with "NaN", type and run the following line: `df_buoy = df_buoy.replace(99, np.nan)`

In [None]:
df_buoy = df_buoy.replace(99, np.nan)
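
To see how much a placeholder value can distort a statistic, here is a minimal sketch on a made-up five-value series (the numbers and the name `toy` are invented for illustration and are not from `buoy_data.csv`). It also shows one possible variation: replacing the placeholder in a single column rather than across the whole dataframe.

```python
import numpy as np
import pandas as pd

# Synthetic wind speeds [m/s]; 99 marks a missing reading (values are made up)
toy = pd.DataFrame({"WSPD": [5.0, 7.0, 99.0, 6.0, 99.0]})

# Mean with the placeholders still in: the 99s are treated as real winds
mean_with_sentinel = toy["WSPD"].mean()

# Replace 99 with NaN in this one column only
cleaned = toy["WSPD"].replace(99, np.nan)

# pandas skips NaN by default when computing the mean
mean_without_sentinel = cleaned.mean()

print(mean_with_sentinel, mean_without_sentinel)
```

Replacing column by column is a cautious choice when another column could legitimately contain the value 99; whether that matters for this particular buoy file is an assumption you would need to check.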

<font color='blue'>Plot the data again to see if that worked. You can copy and paste the code we used above to plot.</font>

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(8,3))

# Add axis labels
ax.set_ylabel("Wind Speed [m/s]", fontsize=12)
ax.set_xlabel("Date", fontsize=12)

# Plot the data
ax.plot(df_buoy.index, df_buoy['WSPD'])

# Show the figure
plt.show()

Success! Now instead of high values, you'll notice that there are gaps of missing data. Although it doesn't look "pretty," we're now able to view the data and calculate statistics that won't be skewed by the high values of 99.

Can you think of any alternative methods we could use to deal with bad data?

<font color='red'>As an extra challenge, look up some other methods for dealing with bad or missing data and apply one to the dataset. HINT: look into interpolating to fill gaps...</font>
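
One possible approach to the extra challenge is linear interpolation in time with `pandas`. The sketch below uses a made-up daily series (the values and the names `wspd` and `filled` are illustrative assumptions) so you can see exactly what gets filled in before trying it on `df_buoy['WSPD']`.

```python
import numpy as np
import pandas as pd

# Synthetic daily wind speeds [m/s] with gaps (values are made up)
times = pd.date_range("2020-01-01", periods=6, freq="D")
wspd = pd.Series([4.0, np.nan, 6.0, np.nan, np.nan, 9.0], index=times)

# Fill each gap along a straight line between its neighbors,
# weighting by the time between timestamps
filled = wspd.interpolate(method="time")

print(filled)
```

Interpolation invents values that were never measured, so it is usually only reasonable for short gaps. The `limit` argument of `.interpolate()` caps how many consecutive NaNs are filled, which is one way to leave long outages empty.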

## 3. Plot the data

We have now successfully prepared our two datasets to plot and compare. The buoy wind speeds are only measured at a single altitude, while the lidar wind speeds are measured at a range of heights. Let's start by plotting the buoy wind speed and the lidar wind speed at the lowest altitude (18 m). 

<font color='blue'>Following the same methods that we used for plotting above, plot the lidar wind speed at 18 m, which is the column labeled `wspd18m` in the lidar dataframe. Plot the buoy wind speed, which is the column labeled `WSPD` in the buoy dataframe.</font>

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(8,3))

# Add axis labels
ax.set_ylabel("Wind Speed [m/s]", fontsize=12)
ax.set_xlabel("Date", fontsize=12)

# Plot the data
ax.plot(df_lidar.index, df_lidar['wspd18m'])
ax.plot(df_buoy.index, df_buoy['WSPD'])

# Show the figure
plt.show()

We should now have one plot with both measured wind speeds. However, we don't have any labels on the data so it's hard to tell which record is which. Let's go ahead and label them and then add a legend. 

<font color='blue'>To do so, copy and paste the code from the cell above where we plotted both records.</font>

1. <font color='blue'>In the line where you plot the lidar data, before the `)`, type `, label='Lidar'`.</font>
2. <font color='blue'>In the line where you plot the buoy data, before the `)`, type `, label='Buoy'`.</font>
3. <font color='blue'>After those lines, type `plt.legend()`.</font>

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(8,3))

# Add axis labels
ax.set_ylabel("Wind Speed [m/s]", fontsize=12)
ax.set_xlabel("Date", fontsize=12)

# Plot the data
ax.plot(df_lidar.index, df_lidar['wspd18m'], label='Lidar')
ax.plot(df_buoy.index, df_buoy['WSPD'], label='Buoy')

# Show legend
plt.legend()

# Show the figure
plt.show()

<font color='red'>There are many ways to customize plots and add your own style! If you feel like playing around with the visualization of your data, try changing things like the line color or width.</font>
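
As one example of customizing a plot, the sketch below sets the color, width, and line style through keyword arguments to `ax.plot()`. The two curves are synthetic stand-ins (the names `lidar_like` and `buoy_like` and their values are made up); in your notebook you would pass the lidar and buoy columns instead.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the two wind speed records (values are made up)
days = np.arange(10)
lidar_like = 8 + np.sin(days)
buoy_like = 6 + np.sin(days)

fig, ax = plt.subplots(1, 1, figsize=(8, 3))
ax.set_ylabel("Wind Speed [m/s]", fontsize=12)
ax.set_xlabel("Day", fontsize=12)

# color, linewidth, and linestyle control how each line is drawn
lidar_line, = ax.plot(days, lidar_like, color="tab:blue",
                      linewidth=2.5, label="Lidar")
buoy_line, = ax.plot(days, buoy_like, color="tab:orange",
                     linewidth=1, linestyle="--", label="Buoy")

ax.legend()
# In a notebook the figure renders when the cell finishes;
# add plt.show() when running this as a script.
```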

You might notice now that the records do not cover the same time periods. To make it easier to compare the two records, let's select the overlapping time period. It looks like the lidar record begins after the buoy record, so let's find the start date of the lidar.

<font color='blue'>Print the first 5 lines of the lidar dataframe using `.head()`.</font>

In [None]:
df_lidar.head()

The first timestamp shows us that the first measurement was on 12 August 2019 at midnight. Therefore, we'll want to trim the buoy dataframe to start at that same date. 

<font color='blue'>Type out and run the following line to ignore all buoy data before 12 August 2019: `df_buoy = df_buoy.loc['2019-8-12':]`</font>

In [None]:
df_buoy = df_buoy.loc['2019-8-12':]

Now the start dates should be the same, but the end dates are different. The buoy dataset ends before the lidar, so let's find the last date of the buoy measurements.

<font color='blue'>Print the last 5 lines of the buoy dataframe using `.tail()`.</font>

In [None]:
df_buoy.tail()

The last measurement is on 31 December 2020, so let's trim the lidar dataframe to end on that date. 

<font color='blue'>Use the same method that we used to trim the buoy dataset to now trim the lidar data.</font>

**Hint:** When we trimmed the buoy data, we wanted all of the data *after* a certain date so we used the format `['date':]`. If we want all of the data *before* a certain date, we should use the format `[:'date']`.

In [None]:
df_lidar = df_lidar.loc[:'2020-12-31']
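
If the slicing behavior feels unclear, this small sketch on a made-up 12-hourly series (the name `toy` and its values are illustrative) shows which rows each form keeps. Note that slicing a datetime index by a date string includes the whole of that day.

```python
import numpy as np
import pandas as pd

# Six synthetic timestamps, 12 hours apart, from 30 Dec 2020 to 1 Jan 2021
idx = pd.date_range("2020-12-30", periods=6, freq=pd.Timedelta(hours=12))
toy = pd.DataFrame({"wspd": np.arange(6.0)}, index=idx)

# Everything from 31 Dec onward (start of that day included)
after = toy.loc["2020-12-31":]

# Everything up to and including all of 31 Dec
before = toy.loc[:"2020-12-31"]

print(after.index[0], before.index[-1])
```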

Now that the datasets are the same lengths, let's plot them again and see how they look.

<font color='blue'>Copy and paste the block of code that we used to plot both wind speeds below in order to see them over the same period.</font>

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(8,3))

# Add axis labels
ax.set_ylabel("Wind Speed [m/s]", fontsize=12)
ax.set_xlabel("Date", fontsize=12)

# Plot the data
ax.plot(df_lidar.index, df_lidar['wspd18m'], label='Lidar')
ax.plot(df_buoy.index, df_buoy['WSPD'], label='Buoy')

# Show legend
plt.legend()

# Show the figure
plt.show()

## 4. Compare the two datasets

So far we've been processing the two datasets, preparing them for plotting, and getting ready to compare them. Now let's actually compare the two using some basic statistics. There are an infinite number of statistical measures we could use to compare the records. For now, we'll stick to just a few, but you're more than welcome to use any other methods you're familiar with. [Here](https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm) is more information on other statistics to calculate. 

Let's start by looking at the averages of lidar versus the buoy wind speed. In the code below, we calculate the average of the lidar wind speeds at 18 m.

In [None]:
df_lidar['wspd18m'].mean()

<font color='blue'>Use the same method to find the average of the buoy data.</font>

In [None]:
df_buoy['WSPD'].mean()

The lidar average is higher than the buoy average, but let's take a look at the whole records again. Earlier, we smoothed the data by averaging winds every week. Let's try that again so we can more easily view the data.

<font color='blue'>For both datasets, find the weekly averages using `.resample()` and `.mean()` like we did this morning. Name the weekly averages `weekly_lidar` and `weekly_buoy`.</font>

In [None]:
weekly_lidar = df_lidar.resample('W').mean()
weekly_buoy = df_buoy.resample('W').mean()
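
To check what `.resample('W').mean()` does, here is a minimal sketch on two weeks of made-up daily values (the names `daily` and `weekly` are illustrative). With the `'W'` rule, each bin ends on a Sunday and is labeled with that date, and NaN values are skipped when averaging.

```python
import numpy as np
import pandas as pd

# Two synthetic weeks of daily data: Mon 6 Jan to Sun 19 Jan 2020
idx = pd.date_range("2020-01-06", periods=14, freq="D")
daily = pd.Series(np.r_[np.full(7, 5.0), np.full(7, 9.0)], index=idx)

# Knock out one reading to show that NaN does not spoil the weekly mean
daily.iloc[2] = np.nan

# One averaged value per week, labeled by the Sunday that ends the week
weekly = daily.resample("W").mean()

print(weekly)
```

Because both dataframes were trimmed to the same period, their weekly bins should line up, which is what lets us plot one weekly series against the other later on.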

<font color='blue'>Now plot the weekly averages for both the lidar and the buoy data on the same plot. Don't forget to add labels!</font>

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(8,3))

# Add axis labels
ax.set_ylabel("Wind Speed [m/s]", fontsize=12)
ax.set_xlabel("Date", fontsize=12)

# Plot the data
ax.plot(weekly_lidar.index, weekly_lidar['wspd18m'], label='Lidar')
ax.plot(weekly_buoy.index, weekly_buoy['WSPD'], label='Buoy')

# Show legend
plt.legend()

# Show the figure
plt.show()

It looks like the lidar wind speeds are consistently higher than the buoy wind speeds. Why do you think that is?

<font color='blue'>Brainstorm a few reasons why the lidar speeds are higher than the buoy speeds.</font>

We could also compare the weekly averages by making a scatter plot and fitting a line to the data.

<font color='blue'>Below, make a scatter plot of the weekly lidar averages versus the weekly buoy averages.</font>

**Hint:** Most of the code is completed for you already, so you'll just need to fill in the parentheses in the `ax.scatter()` line. First type out the lidar wind speed column that you want to use (`weekly_lidar['wspd18m']`). Type a comma after that and then type out the buoy wind speed column (`weekly_buoy['WSPD']`).

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(5,5))

# Set axis labels
ax.set_xlabel('Lidar Wind Speed [m/s]', fontsize=12)
ax.set_ylabel('Buoy Wind Speed [m/s]', fontsize=12)

# Set the axis limits
ax.set_xlim(0,14)
ax.set_ylim(0,14)

# Plot a dashed 1:1 line in black with line width of 1
ax.plot([0,14],[0,14],c='k',ls='--',lw=1)

# Plot the data as a scatter plot
ax.scatter(weekly_lidar['wspd18m'], weekly_buoy['WSPD'])

# Show the figure
plt.show()

You'll notice that there is a dashed black line already plotted on the graph. This is a one-to-one line, which is just like plotting `y = x`. If the data fall on the one-to-one line, then the lidar and the buoy measured the same wind speed. In our case, the points are all below that line, which means that the lidar wind speeds are greater than the buoy wind speeds. This is exactly what we saw before when we plotted them as functions of time. Let's now quantify the relationship we're seeing in the graph.

We're going to use `numpy` for some of these calculations, so let's turn the two weekly averages into arrays, which will be easier to deal with.

<font color='blue'>Run the following code to create the arrays.</font>

In [None]:
x = np.array(weekly_lidar['wspd18m'])
y = np.array(weekly_buoy['WSPD'])

Now, we want to use `np.polyfit` to fit a line to the data. With a degree of 1, this returns two values, `m` and `b`, which are the slope and intercept of a line, just like in the format: `y = mx + b`. Remember that we had a bunch of NaN's in the buoy dataset, so line 1 in the code block below builds a mask (`idx`) that keeps only the points where both the lidar and buoy values are finite (not infinite or NaN).

<font color='blue'>Run the following code to fit a line (a first-degree polynomial) to the data.</font>

In [None]:
idx = np.isfinite(x) & np.isfinite(y)
m, b = np.polyfit(x[idx], y[idx], 1)
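
Here is the same masking-and-fitting pattern on a tiny made-up example where the answer is known in advance (the arrays `x_toy` and `y_toy` are invented so that `y = 0.5 * x` wherever both values exist).

```python
import numpy as np

# Synthetic pairs; each array has one missing value in a different spot
x_toy = np.array([2.0, 4.0, np.nan, 8.0, 10.0])
y_toy = np.array([1.0, 2.0, 3.0, np.nan, 5.0])

# True only where BOTH x and y are finite, so pairs stay aligned
keep = np.isfinite(x_toy) & np.isfinite(y_toy)

# Degree-1 fit returns the slope first, then the intercept
slope, intercept = np.polyfit(x_toy[keep], y_toy[keep], 1)

print(keep.sum(), slope, intercept)
```

Only three of the five pairs survive the mask, and the fit recovers a slope of about 0.5 and an intercept of about 0. Applying one combined mask to both arrays matters: dropping NaNs from each array separately would misalign the pairs.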

Now let's plot!

<font color='blue'>In the plotting line below, fill in the empty parentheses in `ax.plot()` to plot the new line. You'll want to plot it as `x, (m*x)+b`. Change the color of the line so it's easier to see.</font>

In [None]:
# Set up figure for plotting
fig, ax = plt.subplots(1, 1, figsize=(5,5))

# Set axis labels
ax.set_xlabel('Lidar Wind Speed [m/s]', fontsize=12)
ax.set_ylabel('Buoy Wind Speed [m/s]', fontsize=12)

# Set the axis limits
ax.set_xlim(0,14)
ax.set_ylim(0,14)

# Plot a dashed 1:1 line in black with line width of 1
ax.plot([0,14],[0,14],c='k',ls='--',lw=1)

# Plot the data as a scatter plot
ax.scatter(weekly_lidar['wspd18m'], weekly_buoy['WSPD'])

# Plot the line of best fit
ax.plot(x, (m*x)+b,'r')

# Show the figure
plt.show()

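This was a simple way to show that the two datasets do not agree perfectly: the line of best fit that we just plotted differs from the one-to-one line. Keep in mind that this tells us about agreement rather than correlation, since two records can be strongly correlated and still be offset from each other. A more rigorous comparison would quantify how well the line fits the data, for example with the r$^2$ value of a linear regression.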

<font color='red'>As an extra challenge, try to perform a linear regression with a different technique and find the r$^2$ value.</font>
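
One possible solution sketch for the extra challenge solves the least-squares problem directly with `np.linalg.lstsq` and computes r$^2$ from the residuals. The data here are synthetic (the names `x_toy` and `y_toy`, the slope of 0.8, and the noise level are all made up); to apply it to the real data you would substitute `x[idx]` and `y[idx]` from above.

```python
import numpy as np

# Synthetic "lidar" and "buoy" weekly means with random scatter (made up)
rng = np.random.default_rng(0)
x_toy = np.linspace(2, 12, 50)
y_toy = 0.8 * x_toy - 0.5 + rng.normal(0, 0.5, size=x_toy.size)

# Least-squares fit of y = slope * x + intercept
A = np.column_stack([x_toy, np.ones_like(x_toy)])
(slope, intercept), *_ = np.linalg.lstsq(A, y_toy, rcond=None)

# r^2 = 1 - (residual sum of squares) / (total sum of squares)
pred = slope * x_toy + intercept
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Cross-check: for a straight-line fit, r^2 equals the squared
# Pearson correlation coefficient
r = np.corrcoef(x_toy, y_toy)[0, 1]

print(slope, intercept, r_squared, r ** 2)
```

If SciPy is installed, `scipy.stats.linregress` is another option that returns the slope, intercept, and correlation coefficient in one call.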