# Exercise 2: Comparing data to predictions

In this week's exercise we will work on comparing observations (data) to predictions.
In particular, we will explore two different ways in which we can compare data to a prediction:

1. Comparing individual measured values to their equivalent predicted values using a *goodness-of-fit* equation
2. Fitting a line to *x-y* data using *least squares regression*

Both of these approaches are widely used, and both are available in common software such as **Microsoft Excel**.
Our goal is to understand what the numbers from these "fits" mean and how they are calculated.

For each problem you need to modify the given notebook, and then upload your files to GitHub.
The answers to the questions in this week's exercise should be given by modifying the document in places where you are asked.

- **Exercise 2 is due by the start of class on 12.11.**
- Don't forget to check out [the hints for this week's exercise](https://introqg.github.io/qg/lessons/L2/exercise-2.html) if you're having trouble.
- Scores on this exercise are out of 20 points.

# Problem 2: Linear data regression - Detecting changing climate (11 points)

For this problem we will again use the NOAA climate data from Exercises 6 and 7 of the [Geo-Python course](https://geo-python.github.io) with a slight twist.
As a reminder, you will be working with a climate data file from the [US National Oceanic and Atmospheric Administration (NOAA) climate database](https://www.ncdc.noaa.gov/cdo-web/).
In order to jump right into things, we have provided you with [a data file that has already been processed](data/seasonal-temps.csv) containing the means and standard deviations of seasonal temperatures (in Celsius) for winter and summer for the years 1952-2016.
The first 5 lines of the data file are:

```
year,winter_mean,winter_std,summer_mean,summer_std
1952,-2.770692,1.065687,14.666325,1.128088
1953,-5.116566,3.016217,16.406215,0.913148
1954,-6.226597,4.711928,15.329372,1.236887
1955,-8.276137,2.293563,16.104136,2.578732
```

Your task is to load this data and use a least squares regression to fit trendlines to the temperature data.

*If you're looking for a challenge, you are welcome to load in the [original data file from Exercise 6](data/1091402.txt) and process it as was done for Exercises 6 and 7, or load the [seasonal data output file from Exercise 7](data/helsinki-numpy.csv) as a starting point for this problem.
As a reminder, the [original data file](data/1091402.txt) has daily mean, minimum, and maximum temperatures for the Helsinki-Vantaa airport weather station from January 1952-October 2017.
The [seasonal data output file](data/helsinki-numpy.csv) has columns for the date_monthly, year_monthly, month_monthly, temp_monthly, temp_monthly_celsius, ref_temps_monthly, and anomaly in that order.*

## Part 1: Reading the data file (1 point)

Your first task is to read in [the data file for this problem](data/seasonal-temps.csv) and split the data into column arrays.

For this part you should:

- Read the data file into a variable called `data`, skipping the header
- Split the data file into separate column arrays called `year`, `winter_mean`, `winter_std`, `summer_mean`, and `summer_std`
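
The two steps above can be sketched as follows. This is only an illustration, assuming NumPy's `genfromtxt` is used as the reader (the exercise does not prescribe one). It parses the five sample lines shown earlier from an in-memory string rather than the real file, and the names `sample_text`, `sample`, `sample_year`, and `sample_winter_mean` are hypothetical. For the actual problem you would pass the path to the data file instead and use the variable names requested above.

```python
import io
import numpy as np

# First five lines of the data file, copied from the listing earlier in this problem
sample_text = """year,winter_mean,winter_std,summer_mean,summer_std
1952,-2.770692,1.065687,14.666325,1.128088
1953,-5.116566,3.016217,16.406215,0.913148
1954,-6.226597,4.711928,15.329372,1.236887
1955,-8.276137,2.293563,16.104136,2.578732
"""

# skip_header=1 drops the row of column names; delimiter="," splits on commas
sample = np.genfromtxt(io.StringIO(sample_text), delimiter=",", skip_header=1)

# Slicing a column out of the 2D array gives a separate 1D array
sample_year = sample[:, 0]
sample_winter_mean = sample[:, 1]

print("Array shape (rows, columns):", sample.shape)
print("Years:", sample_year)
```

The remaining columns can be split off in the same way by changing the column index.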

In [None]:
# Import NumPy and Matplotlib
import numpy as np
import matplotlib.pyplot as plt

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# This test should print the first row of the data file
print("First row of the data file:\n", data[0,:])


## Part 2: Calculating linear regression lines (4 points)

Your next task in this problem is to calculate the slopes `B` and *y*-intercepts `A` for [linear regression lines](https://introqg.github.io/qg/lessons/L2/least-squares.html) of both seasonal average temperatures versus time.
You will do this for two time intervals: (1) the entire time covered by the data, and (2) 1997-2016, the last 20 years in the data.

For this part you should:

- Create a function `linregress` that can be used to determine the slope `B` and *y*-intercept `A` for a regression line to data `x` and `y`.
In this problem, `x` would represent time data, and `y` would represent temperature.
- Use your `linregress` function to calculate the slopes and intercepts for summer and winter for the time periods mentioned above (the entire range of years, and 1997-2016).
You should name these values as follows:
    - Entire range of years: `Aw` and `Bw` for winter, `As` and `Bs` for summer
    - Last 20 years: `Aw20` and `Bw20` for winter, `As20` and `Bs20` for summer
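
As a rough sketch of the idea, the example below uses one common textbook form of the unweighted least squares equations for a line *y = A + Bx*. It is an assumption that this matches the form given in the course lesson, so check it against the lesson before relying on it. The function name `fit_line` and the made-up arrays `x_demo` and `y_demo` are hypothetical, chosen so that the expected result is known in advance.

```python
import numpy as np

def fit_line(x, y):
    """Return (A, B) for the unweighted least squares line y = A + B * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Common denominator of both expressions
    delta = n * np.sum(x**2) - np.sum(x)**2
    A = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / delta
    B = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / delta
    return A, B

# Made-up points that lie exactly on y = 1 + 2x
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 1.0 + 2.0 * x_demo
A_demo, B_demo = fit_line(x_demo, y_demo)
print("Intercept:", A_demo, "Slope:", B_demo)

# A boolean mask selects a sub-range of the data, similar to picking 1997-2016
recent = x_demo >= 2.0
A_recent, B_recent = fit_line(x_demo[recent], y_demo[recent])
print("Intercept (sub-range):", A_recent, "Slope (sub-range):", B_recent)
```

Because the demo points lie exactly on a line, both fits should recover an intercept of 1 and a slope of 2, which makes this a quick check that the equations were typed correctly.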

In [None]:
def linregress(x, y):
    """Calculates the slope and y-intercept of a least squares regression line."""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This test should work
print("Winter (all years) slope {0:.4f} and intercept {1:.1f} should be 0.0382 and -80.8.".format(Bw, Aw))


## Part 3: Plotting your results (4.5 points)

Your last coding task for this problem is to create some plots of your seasonal temperature data and their trendlines.
An example of the type of plot you should create is shown below.
**Note**: You should create separate plots rather than the subplots shown in the example below.

![Temperature trends in Helsinki](temp-trends.png)

For this part you should:

- Create 2 plots of the seasonal temperature data: (1) one for the entire time covered in the data and (2) another for the last 20 years of data
    - Plot the seasonal average temperature data for both seasons on the same plot, using error bars to show the standard deviation in seasonal temperature
    - Plot the regression lines for each season as well to compare to the yearly temperature data. You may want to consult [the hints for this week's exercise](https://introqg.github.io/qg/lessons/L2/exercise-2.html) about how to plot the regression lines.
- Include axis labels, a title and a legend for each plot, as well as text displaying the slope of the regression lines for both the summer and winter seasons
- Add figure captions in the Markdown cell below each Python cell for your plots that describe the plots as if they were in a scientific journal article
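
The plot elements listed above could be combined roughly as in the sketch below. It uses made-up numbers rather than the Helsinki data, and `np.polyfit` appears only as a stand-in for your own `linregress` function. All variable names (`x_demo`, `y_demo`, `y_err`, `A_demo`, `B_demo`) are hypothetical, and the label text and text position are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration only (not the Helsinki temperatures)
x_demo = np.arange(2000, 2010)
y_demo = np.array([1.0, 1.4, 0.8, 1.9, 1.5, 2.2, 1.8, 2.6, 2.1, 2.9])
y_err = np.full(x_demo.size, 0.5)

# np.polyfit returns the slope first, then the intercept, for a degree-1 fit
B_demo, A_demo = np.polyfit(x_demo, y_demo, 1)

# One figure with a single set of axes
fig, ax = plt.subplots()

# Points with vertical error bars
ax.errorbar(x_demo, y_demo, yerr=y_err, fmt="o", capsize=3, label="Demo data")

# Regression line evaluated at the same x values
ax.plot(x_demo, A_demo + B_demo * x_demo, "-", label="Regression line")

# Slope text placed in axes coordinates (0-1), so it stays put if the data range changes
ax.text(0.05, 0.9, "Slope: {0:.3f} units/year".format(B_demo), transform=ax.transAxes)

ax.set_xlabel("Year")
ax.set_ylabel("Value (arbitrary units)")
ax.set_title("Demo data with least squares trendline")
ax.legend(loc="lower right")
```

For two seasons on the same plot, the `errorbar` and `plot` calls would simply be repeated for the second data set before adding the legend.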

In [None]:
# Plot for all years goes here

# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

In [None]:
# Plot for last 20 years goes here

# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Part 4: Questions for Problem 2 (1.5 points)

1. For the linear regressions for the entire range of years of seasonal average temperature data, is a line a suitable representation of the changes in seasonal average temperature from 1952-2016?
In other words, does your linear "fit" fall entirely within the uncertainty of your seasonal average temperatures, or are there some years where the line is outside of the standard deviation in seasonal temperature?
Is a linear fit an appropriate choice?
By how much do your linear regression lines suggest seasonal average temperatures have changed per year?
2. What about the linear regressions for the last 20 years?
Is a line a suitable representation of the changes in seasonal average temperature for this shorter time period?
By how much do your linear regression lines suggest seasonal average temperatures have changed per year?
Do you find these results surprising?
Why or why not?
3. Globally, climate data would suggest that not only are temperatures rising on average each year, but that the rate of change in temperature may also be increasing.
Does your analysis from the Helsinki-Vantaa weather data support this?
Does this kind of analysis help demonstrate why some people may have difficulty understanding global climate change?
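
For the first question, one way to check whether a line stays within the data uncertainty is to compare each residual to the corresponding standard deviation. The sketch below assumes that "within the uncertainty" means within one standard deviation of the seasonal mean, which is an interpretation rather than something the exercise defines. The values and names (`x_demo`, `y_demo`, `y_std`, `outside`) are made up, and `np.polyfit` again stands in for your own regression function.

```python
import numpy as np

# Made-up values for illustration only
x_demo = np.array([2000.0, 2001.0, 2002.0, 2003.0, 2004.0])
y_demo = np.array([1.0, 2.5, 1.2, 2.0, 2.4])
y_std = np.array([0.5, 0.5, 0.5, 0.5, 0.5])

# Stand-in for your own regression function (slope first, then intercept)
B_demo, A_demo = np.polyfit(x_demo, y_demo, 1)

# Difference between each observation and the line at the same x value
residuals = y_demo - (A_demo + B_demo * x_demo)

# True where the line falls outside one standard deviation of the observation
outside = np.abs(residuals) > y_std

print("Years where the line is outside the error bar:", x_demo[outside])
print("Number of such years:", outside.sum())
```

A count like this only supports your answer; the written response should still explain what it implies about the suitability of a linear fit.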

YOUR ANSWER HERE