# Exercise 2: Comparing data to predictions

In this week's exercise we will work on comparing observations (data) to predictions.
In particular, we will explore two different ways in which we can compare data to a prediction:

1. Comparing individual measured values to their equivalent predicted values using a *goodness-of-fit* equation
2. Fitting a line to *x-y* data using *least squares regression*

Both approaches are frequently used and can even be carried out in common software such as **Microsoft Excel**.
Our goal is to understand what the numbers from these "fits" mean and how they are calculated.

For each problem, modify the given notebook and then upload your files to GitHub.
Give your answers to the questions in this week's exercise by editing the document in the places where you are asked.

- **Exercise 2 is due by the start of class on 12.11.**
- Don't forget to check out [the hints for this week's exercise](https://introqg.github.io/qg/lessons/L2/exercise-2.html) if you're having trouble.
- Scores on this exercise are out of 20 points.

# Problem 1: Calculating a goodness-of-fit (9 points)

For this problem, you will work on reading some measured and predicted [thermochronometer age data](data/Coutand2014-AFT-ages.txt) (don't worry about what these ages mean for now), calculating a goodness-of-fit for the data, and making a data plot.
Data for this exercise comes from a [recent paper published on the exhumation of the Himalaya in Bhutan](http://dx.doi.org/10.1002/2013JB010891), in case you're curious.

## Part 1: Reading the data file (1 point)

Your first task for this problem is to read the data file [Coutand2014-AFT-ages.txt](data/Coutand2014-AFT-ages.txt) and split the data into separate arrays for each variable.

For this part you should:

- Read the data file into a variable called `data`
- Split the data file into separate column arrays called `longitude`, `latitude`, `elevation`, `measured_age`, `std_dev`, and `predicted_age`
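
As a rough illustration of the two steps above, the sketch below reads a small whitespace-separated table with `np.loadtxt` and slices it into column arrays. The table contents, the single header line, and the names `example_text`, `example`, `x`, `y`, and `z` are all hypothetical; the layout of the actual data file may differ, so check it before choosing options such as `skiprows`.

```python
import io
import numpy as np

# Hypothetical stand-in for a data file: one header line followed by
# whitespace-separated numeric columns. The real file layout may differ.
example_text = """x y z
1.0 10.0 100.0
2.0 20.0 200.0
3.0 30.0 300.0
"""

# skiprows=1 skips the assumed header line; adjust it to match the real file
example = np.loadtxt(io.StringIO(example_text), skiprows=1)

# Slice out each column as its own 1-D array: [all rows, column index]
x = example[:, 0]
y = example[:, 1]
z = example[:, 2]

print(example.shape)  # (3, 3)
print(y)              # [10. 20. 30.]
```

With a file on disk, the `io.StringIO(...)` object would simply be replaced by the path to the file.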

In [None]:
# Import NumPy and Matplotlib
import numpy as np
import matplotlib.pyplot as plt

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# This test should print the first row of the data file
print("First row of the data file:\n", data[0,:])


## Part 2: Calculating a goodness-of-fit (3 points)

Next, you should create a function to calculate the goodness-of-fit of the data in the data file.
For this, you can use the reduced chi-squared equation,

\begin{equation}
  \Large
  \chi^{2} = \frac{1}{N} \sum \frac{(O_{i} - E_{i})^{2}}{\sigma_{i}^2}
\end{equation}

where $N$ is the number of ages, $O_{i}$ is the $i$th measured age, $E_{i}$ is the $i$th predicted age, and $\sigma_{i}$ is the $i$th standard deviation.
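
To see how the terms in the equation combine, here is a small worked example with four made-up values. The arrays `observed`, `expected`, and `sigma` are hypothetical and are not taken from the exercise data file; this is only a sketch of the arithmetic, not a complete solution.

```python
import numpy as np

# Hypothetical values for illustration only
observed = np.array([10.0, 12.0, 9.0, 8.0])  # O_i
expected = np.array([9.0, 14.0, 9.0, 6.0])   # E_i
sigma = np.array([1.0, 2.0, 0.5, 1.0])       # sigma_i

# Squared misfit of each point, weighted by its squared standard deviation
terms = (observed - expected) ** 2 / sigma ** 2
print(terms)  # [1. 1. 0. 4.]

# Sum the terms and divide by the number of values N
chi2 = terms.sum() / len(observed)
print(chi2)  # 1.5
```

Note how the second point has the largest absolute misfit from zero apart from the fourth (2 units each), yet its larger uncertainty reduces its contribution to 1, while the fourth point contributes 4.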

For this part you should:

- Create a function called `chi_squared` that can be used to calculate the reduced chi-squared value for the data in the age data file
- Use your `chi_squared` function to calculate the goodness-of-fit between the predicted and measured ages

In [None]:
def chi_squared(measured, predicted, std):
    """Returns the reduced chi-squared value for input age data."""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This test should work
print("My goodness of fit {0:.2f} should be 7.64.".format(chi_squared(measured_age, predicted_age, std_dev)))


## Part 3: Plotting the data (4 points)

To get a sense of how the goodness-of-fit value and the age data relate, your next task is to produce a plot of the measured and predicted age data.
An example plot similar to the one you should produce is shown below.

![Bhutan age data](img/Bhutan-age-data.png)

For this part you should:

- Produce a plot of the measured and predicted ages as a function of latitude
    - Be sure to also plot the error bars for the measured ages. There's a [useful Matplotlib function](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.errorbar.html) for this.
- Display the calculated goodness-of-fit value from Part 2 as text on the plot
- Include axis labels and a title
- Add a figure caption in the Markdown cell below the Python cell for your plot that describes the plot as if it were in a scientific journal article
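
The plotting elements listed above can be sketched with made-up data as follows. All values, labels, and variable names here (`x`, `y_measured`, `y_error`, `y_predicted`) are hypothetical and chosen only to show how error bars, a text annotation, axis labels, and a title fit together; the styling of your own plot is up to you.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values for illustration only, not the exercise data
x = np.array([1.0, 2.0, 3.0, 4.0])
y_measured = np.array([2.1, 3.9, 6.2, 7.8])
y_error = np.array([0.3, 0.4, 0.2, 0.5])
y_predicted = 2.0 * x

fig, ax = plt.subplots()

# Measured values as points with vertical error bars
ax.errorbar(x, y_measured, yerr=y_error, fmt="o", capsize=3, label="Measured")

# Predicted values as a line
ax.plot(x, y_predicted, "-", label="Predicted")

# Text placed in axes coordinates (0-1), so it stays in the upper-left corner
ax.text(0.05, 0.9, "Example annotation", transform=ax.transAxes)

ax.set_xlabel("x (hypothetical units)")
ax.set_ylabel("y (hypothetical units)")
ax.set_title("Error bar sketch")
ax.legend()
```

In a notebook the figure is displayed when the cell finishes running. Using `transform=ax.transAxes` keeps the annotation position independent of the data range, which is one convenient option among several.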

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Part 4: Questions for Problem 1 (1 point)

1. Looking at your plot and without looking at the goodness of fit value, how well would you say the predicted ages fit the measured ages in this example?
Is this difficult to do?
Why or why not?
2. How well would you say the predicted ages fit the measurements using the calculated goodness of fit?
Is your calculated goodness of fit intuitive to use?
Why or why not?

YOUR ANSWER HERE

### References
[Coutand, I., Whipp, D.M., Grujic, D., Bernet, M., Fellin, M.G., Bookhagen, B., Landry, K.R., Ghalley, S.K. and Duncan, C., 2014. Geometry and kinematics of the Main Himalayan Thrust and Neogene crustal exhumation in the Bhutanese Himalaya derived from inversion of multithermochronologic data. *Journal of Geophysical Research: Solid Earth*, *119*(2), pp.1446-1481](https://dx.doi.org/10.1002/2013JB010891)