Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Lab 4: Bivariate Statistics and Regression  

In this lab, you will be using glacier mass-balance data collected by the SFU Glaciology Group. 
You will use these data to explore linear and non-linear relationships between dependent and independent variables.

__Background__:  

Ablation is any form of glacier mass loss, expressed in units of equivalent water depth. 
Ablation can be estimated locally from repeated measurements of stake height over time (the exposed length of a stake increases as the surrounding surface melts away), combined with the measured density of the material (e.g. snow, ice) being lost. 
This allows a calculation of mass loss expressed as the equivalent depth of water.
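
As a rough illustration of that conversion, the sketch below multiplies a change in exposed stake height by the ratio of material density to water density. The function name `water_equivalent`, the ice density of 900 kg m$^{-3}$, and the 0.50 m over 10 days example are all assumptions chosen for illustration; they are not values from the lab data.

```python
RHO_WATER = 1000.0  # kg m^-3, density of fresh water


def water_equivalent(height_change_m, density_kg_m3):
    """Convert a change in exposed stake height (m) to metres of water equivalent.

    Hypothetical helper for illustration only.
    """
    return height_change_m * density_kg_m3 / RHO_WATER


# Made-up example: 0.50 m of ice (assumed density 900 kg m^-3) lost over 10 days
we_m = water_equivalent(0.50, 900.0)

# Express as a rate in cm of water equivalent per day, the unit used in the plots
rate_cm_per_day = we_m * 100.0 / 10.0
print(we_m, rate_cm_per_day)
```

The same height change in snow would correspond to a much smaller water-equivalent loss, because snow is far less dense than ice.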

Run the script `load_data.py`. The output shows:  
  - Left Subplot: An array of stake locations from 2009 and 2012 within the glacier outline overlain on a map of surface elevations (colour, m above sea level). The x-y coordinates are UTM Easting (m) and UTM Northing (m).   
  - Right Subplot: Ablation rate (cm of water-equivalent loss per day) measured in 2009 and 2012 at the stake locations plotted as a function of elevation.  

As you can see in the right subplot, ablation appears to be a function of elevation. But what is the dependence? To answer this question, complete the following exercises.

In [None]:
%matplotlib notebook
%run load_data.py
plt.show()

In [None]:
import numpy as np
import pandas as pd 
import scipy.linalg as LA

ablation_2009 = pd.read_csv('GL1_stake_ablation_2009.csv',index_col=0)
ablation_2012 = pd.read_csv('GL1_stake_ablation_2012.csv',index_col=0)

In [None]:
ablation_2009.head()

---   

### Exercise 1: (9 pts)

1. Why are bivariate statistics useful in Earth Science?  
2. What are covariance and Pearson’s correlation coefficient?  
3. Calculate Pearson’s correlation coefficient between ablation rate and elevation for each of the two data sets individually (2009 ablation rate vs. elevation, 2012 ablation rate vs. elevation). Do these two values suggest a relationship between ablation rate and elevation? If so, is the relationship linear?

---   

---
1. Why are bivariate statistics useful in Earth Science?  
---

YOUR ANSWER HERE

---
2. What are covariance and Pearson’s correlation coefficient?  
---

YOUR ANSWER HERE

---
3. Calculate Pearson’s correlation coefficient between ablation rate and elevation for each of the two data sets individually (2009 ablation rate vs. elevation, 2012 ablation rate vs. elevation). Do these two values suggest a relationship between ablation rate and elevation? If so, is the relationship linear?
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

---   

### Exercise 2: (7 pts)

1. Write a function to calculate the slope and intercept of a classical least-squares regression line with generic input data vectors `x` and `y`. You can use the functions `np.sum()`, `len()`, `np.mean()`, and `LA.inv()`. The slope and y-intercept should be the function outputs.  

2. Calculate the slope and intercept of the regression line for each of the two data sets (2009 ablation rate vs. elevation, 2012 ablation rate vs. elevation) using your function. Print the results, then use them to plot each regression line on the same plot as the corresponding data.   

---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---  

### Exercise 3: (5 pts)

1. Calculate the correlation coefficient between the observed and estimated values of ablation rate using the ratio of the sum of squares of the regression ($SSR$) to the total sum of squares ($SST$). _Hint_: write a function for each. 

2. What do the correlation coefficients mean? Would the values of the ablation rate estimated by the regression in one year be appropriate estimates of ablation rate in the other year?  

---
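
For reference, the standard definitions behind this ratio can be written as follows, assuming a least-squares line with an intercept. The symbols $y_i$ (observed values), $\hat{y}_i$ (values estimated by the regression), $\bar{y}$ (mean of the observations), and $n$ (number of observations) are notation introduced here, not taken from the lab text.

$$
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
r^2 = \frac{SSR}{SST}
$$

The ratio gives $r^2$, so $r$ is its square root, taking the sign of the regression slope.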

---
1. Calculate the correlation coefficient between the observed and estimated values of ablation rate using the ratio of the sum of squares of the regression ($SSR$) to the total sum of squares ($SST$).  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
2. What do the correlation coefficients mean? Would the values of the ablation rate estimated by the regression in one year be appropriate estimates of ablation rate in the other year?
---

YOUR ANSWER HERE

---
### Exercise 4: (5 pts)

1. Plot a histogram of the residuals of the regression for each dataset. You may use `plt.hist()` if you would like.  

2. How would you determine if the residuals are normally distributed?  

3. What can you say about the data and the regression if the residuals have a Gaussian distribution?  

---

---
1. Plot a histogram of the residuals of the regression for each dataset. You may use `plt.hist()` if you would like.  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
2. How would you determine if the residuals are normally distributed?  
---

YOUR ANSWER HERE

---
3. What can you say about the data and the regression if the residuals have a Gaussian distribution?  
---

YOUR ANSWER HERE

---
### Exercise 5: (5 pts)  


1. What can an _ANOVA_ tell you?  
2. Evaluate the significance of fit of the regressions to the data using an _ANOVA_.  
3. What is your interpretation of the _ANOVA_ results?  
---

---
1. What can an _ANOVA_ tell you?  
---

YOUR ANSWER HERE

---
2. Evaluate the significance of fit of the regressions to the data using an _ANOVA_.  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
3. What is your interpretation of the _ANOVA_ results?  
---

YOUR ANSWER HERE

---
### Exercise 6: (10 pts)  

1. What does the reduced major axis linear regression do?  
2. Write a second function to compute the reduced major axis linear regression.  
3. Plot these regression lines with the data and with the classical linear regression lines.  
4. How do these regressions compare to the classical regressions?  
5. How would you decide which regression type is most appropriate for a particular problem?  
---

---
1. What does the reduced major axis linear regression do?  
---

YOUR ANSWER HERE

---
2. Write a second function to compute the reduced major axis linear regression.  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
3. Plot these regression lines with the data and with the classical linear regression lines.  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
4. How do these regressions compare to the classical regressions?  
---

YOUR ANSWER HERE

---
5. How would you decide which regression type is most appropriate for a particular problem?  
---

YOUR ANSWER HERE

---
### Exercise 7: (8 pts)  

1. Use the `np.polyfit()` function to fit a quadratic to each of the two datasets. Plot these functions with the data.  
  - _Hint_: You'll also need to use the `np.polyval()` function to evaluate the fitted polynomial from its coefficients  
  
2. Describe how you would determine whether the linear fit or the quadratic fit is more appropriate.
---
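
As a minimal sketch of how the two NumPy calls fit together, the example below uses synthetic, noise-free data (not the lab data) generated from a known quadratic. The variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (not the lab data): an exact quadratic y = 2x^2 - 3x + 1
x = np.linspace(-2.0, 2.0, 9)
y = 2.0 * x**2 - 3.0 * x + 1.0

# Fit a degree-2 polynomial; coefficients are returned highest power first
coeffs = np.polyfit(x, y, 2)

# Evaluate the fitted polynomial at the same x values
y_fit = np.polyval(coeffs, x)

print(coeffs)
```

Because the synthetic data contain no noise, the fitted coefficients should match the generating ones to within floating-point error; real measurements will not fit this cleanly.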

---
1. Use the `np.polyfit()` function to fit a quadratic to each of the two datasets. Plot these functions with the data.  
  - _Hint_: You'll also need to use the `np.polyval()` function to evaluate the fitted polynomial from its coefficients  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
2. Describe how you would determine whether the linear fit or the quadratic fit is more appropriate.
---

YOUR ANSWER HERE