# **Geophysics 310 Lab 1: Python, Linear Regression, and Sea Level Rise** 

The aim of this first 310 lab is to become familiar with using Python to analyse and visualise scientific data. We will also fit some models to sea level data to explain trends over time. 

**Lab work and submission:** Complete all the exercises within the notebook itself, unless otherwise stated. Make sure you save the notebook (ideally to your Google Drive) before beginning, and save regularly throughout. You are encouraged to work on the problems together, but everyone will submit their own work. Submit your .ipynb notebook file along with any other files or data through Canvas.

Linear Algebra is the topic of Chapter 12 of [A Guided Tour of Mathematical Methods for the Physical Sciences](http://www.cambridge.org/nz/academic/subjects/physics/mathematical-methods/guided-tour-mathematical-methods-physical-sciences-3rd-edition#KUoGXYx5FTwytcUg.97). In this notebook we treat linear regression on sea level measurements in the Auckland harbour as an application of linear algebra, as we solve the normal equations that describe linear regression. In the process, this notebook also has bearing on Inverse Problems (Chapter 22), and Statistics (Chapter 21). 

First, we import the Python libraries we'll use in this notebook:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import numpy as np # numerical tools
import scipy.stats # linear regression function
#import scipy.optimize # curve fitting


## **Sea Level Data** ##
Each annual sea level value for the Auckland harbour is the average of many measurements taken throughout that year, which gives both a mean value and a standard deviation.

The first step is to read in our data. The Python library [pandas](https://pandas.pydata.org) can easily read CSV data for us, without needing to know a lot of programming. Then we will convert the *pandas* data into a [NumPy](http://numpy.org) matrix, so we can easily work with the data throughout the rest of this notebook.

In [None]:
harbour_data = pd.read_csv("https://raw.githubusercontent.com/PALab/physnotebooks/main/GEOPHYS310/msl_auckland.csv")
harbour_data_matrix = harbour_data.to_numpy()
# check shape of matrix
print(harbour_data_matrix.shape)
# print tenth row
print(harbour_data_matrix[9]) # remember, 0 is first row
# print first column
print(harbour_data_matrix[:,0]) #format for indexing is [rows,columns]

####<font color=green>**Exercise 1**<font>####

In our case, the matrix has all the time data in the first column, all the sea level height in the second column, and the standard deviations in the third column. In the cell below, create three variables ```time```, ```height```, and ```errors```, and assign the correct data to each.

In [None]:
time   =  #Extract the time data
height =  #Extract the height data
errors =  #Extract the standard deviations

Next, we will plot sea level in the Auckland Harbour, as a function of time, using [matplotlib](https://matplotlib.org/):

In [None]:
plt.errorbar(time, height, yerr=errors, color='r', ecolor='gray', marker='o', linestyle='')
plt.grid()
plt.xlabel('Date (year)')
plt.ylabel('Mean Sea Level (m)')
plt.title('Princes Wharf, Auckland (New Zealand)')
plt.axis('tight')
plt.savefig('test.png', bbox_inches='tight')
plt.show()

## **Linear Regression** ##
There is an obvious spread in these measurements. You can read about the challenges of tidal gauge readings [here](http://www.fig.net/resources/monthly_articles/2010/hannah_july_2010.asp). Nevertheless, a trend of increasing water depth seems clear. Next, we'll attempt to answer the question: "What is the best fitting linear equation to these data?"

Later, in Chapter 22, we will discuss Inverse Problems, where we address important questions about what it means to fit the data. Here, we are more concerned with minimizing the misfit between a vector of $n$ data points, $\mathbf{d}$, and the values predicted by a model $\mathbf{m}$ that is represented by two parameters: the slope and intercept of a straight line.

If we accept that these data can be represented by a straight line, then any datum at time $t$ would have a water depth $d = \textrm{intercept} + \textrm{slope} \times t$. This holds for every datum, so we can write all $n$ equations at once in matrix form. If

$$ \mathbf{A}\mathbf{m}= \mathbf{d},$$

then

$$\mathbf{A} = \begin{pmatrix}1 & t_1 \\ \vdots & \vdots\\ 1& t_n\end{pmatrix}, \mathbf{m} = \begin{pmatrix} \textrm{intercept} \\ \textrm{slope} \end{pmatrix}, \textrm{ and } \mathbf{d} = \begin{pmatrix} d_1 \\ \vdots\\ d_n\end{pmatrix}.$$
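
To make the matrix form concrete, here is a minimal sketch on made-up numbers. The names `t_toy`, `m_toy`, and `A_toy`, and the values in them, are illustrative assumptions and are not taken from the Auckland data. It shows that multiplying $\mathbf{A}$ by $\mathbf{m}$ evaluates the straight line at every time in one step.

```python
import numpy as np

# Hypothetical toy data: four observation times (years) and a made-up model
t_toy = np.array([2000.0, 2001.0, 2002.0, 2003.0])
m_toy = np.array([-1.0, 0.002])  # [intercept (m), slope (m/year)], illustrative values only

# Each row of A is (1, t_i): the column of ones multiplies the intercept,
# the column of times multiplies the slope.
A_toy = np.column_stack((np.ones_like(t_toy), t_toy))

# A @ m evaluates intercept + slope * t for every time at once
d_pred = A_toy @ m_toy

print(A_toy.shape)  # (4, 2): n rows, 2 columns
print(d_pred)
```

Because $\mathbf{A}$ has $n$ rows and only 2 columns, it is not square whenever there are more than two data points.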

## **Normal equations** ##

The slope and intercept that best fit the data in a least-squares sense are derived in many textbooks, including Section 22.2 of ours. Suffice it to say here that we want to manipulate the linear system of equations so that we "free up" $\mathbf{m}$. If $\mathbf{A}$ had an inverse, we could multiply both sides of $\mathbf{A}\mathbf{m}= \mathbf{d}$ by that inverse to isolate $\mathbf{m}$. But $\mathbf{A}$ is not even square, so there is no chance of that! The next best option is to multiply both sides of $\mathbf{A}\mathbf{m}= \mathbf{d}$ from the left by the transpose of $\mathbf{A}$:

$$ \mathbf{A}^T\mathbf{A}\mathbf{m} = \mathbf{A}^T\mathbf{d}.$$ 

These are the so-called *normal equations*, and we can rewrite this system as

$$ \tilde{\mathbf{A}}\mathbf{m} = \tilde{\mathbf{d}},$$

where $ \tilde{\mathbf{A}} = \mathbf{A}^T\mathbf{A}$ and $ \tilde{\mathbf{d}}= \mathbf{A}^T\mathbf{d}$. 

## **Finding the best straight line through a set of points**

Here, we have a set of $n$ data points $\left(x_{i},y_{i}\right) $ for which we assume a linear relationship $y=a+bx$ is
expected to hold between the variables. The data points do not lie exactly on this line due to measurement errors, and we would  like to find the best values of $a$ and $b$ given the data. The
classical least-squares approach is to assume that the $x_{i}$ are
known exactly. With this assumption, the aim is to find a straight
line $\widehat{y}_{i}=a+bx_{i}$ that approximates to the $y_{i}$ such that the
value of the sum of the squared deviations

$$\varepsilon\left( a,b\right) =\left( \widehat{y}_{1}-y_{1}\right)
^{2}+\left( \widehat{y}_{2}-y_{2}\right) ^{2}+\ldots+\left( \widehat{y}
_{n}-y_{n}\right) ^{2}$$

is minimized over all $a$ and $b$. $\varepsilon\left( a,b\right) $ is
called *the misfit function*. We know from calculus that we can find the minimum of a function by differentiating and setting the derivative to zero. If we differentiate $\varepsilon\left( a,b\right) $ with respect to each of its variables and set these derivatives to zero, we find that the minimum of the misfit function can be expressed as:

$$  \begin{pmatrix}
        \sum_{i=1}^{n}x_{i}^{2} & \sum_{i=1}^{n}x_{i}\\
        \sum_{i=1}^{n}x_{i} & n
    \end{pmatrix}  
    \begin{pmatrix}
        b\\
        a
    \end{pmatrix}
     =
     \begin{pmatrix}
        \sum_{i=1}^{n}x_{i}y_{i}\\
        \sum_{i=1}^{n}y_{i}
     \end{pmatrix}
$$
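
The step from the misfit function to the matrix system above can be sketched as follows, using only the definition $\widehat{y}_{i}=a+bx_{i}$. Differentiating with respect to $a$ gives

$$\frac{\partial \varepsilon}{\partial a} = 2\sum_{i=1}^{n}\left( a+bx_{i}-y_{i}\right) = 0
\quad\Rightarrow\quad
b\sum_{i=1}^{n}x_{i} + a\,n = \sum_{i=1}^{n}y_{i},$$

and differentiating with respect to $b$ gives

$$\frac{\partial \varepsilon}{\partial b} = 2\sum_{i=1}^{n}\left( a+bx_{i}-y_{i}\right)x_{i} = 0
\quad\Rightarrow\quad
b\sum_{i=1}^{n}x_{i}^{2} + a\sum_{i=1}^{n}x_{i} = \sum_{i=1}^{n}x_{i}y_{i}.$$

The second result is the first row of the matrix system and the first result is its second row.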

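Note that the matrix system above is equivalent to the normal equations $$ \tilde{\mathbf{A}}\mathbf{m} = \tilde{\mathbf{d}},$$

where $ \tilde{\mathbf{A}} = \mathbf{A}^T\mathbf{A}$ and $ \tilde{\mathbf{d}}= \mathbf{A}^T\mathbf{d}$, with $x_i = t_i$ and $y_i = d_i$. The only difference is the ordering of the unknowns: the system above lists the slope $b$ before the intercept $a$, so its rows and columns are swapped relative to $\mathbf{A}^T\mathbf{A}$ built from the $\mathbf{A}$ defined earlier.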
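
This equivalence can be checked numerically. The sketch below uses a small made-up dataset (the arrays `x` and `y` are assumptions chosen so the sums are easy to verify by hand) and compares the matrix products with the explicit sums. The column of ones comes first here, so the entries appear in the order (intercept, slope); the system displayed above lists the slope first, which only swaps the rows and columns.

```python
import numpy as np

# Hypothetical toy data, chosen only to make the sums easy to check by hand
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
n = len(x)

# Design matrix with the column of ones first, matching m = (intercept, slope)
A = np.column_stack((np.ones(n), x))

A_tilde = A.T @ A   # 2x2 matrix
d_tilde = A.T @ y   # length-2 vector

# The same quantities written out as explicit sums
A_sums = np.array([[n,         np.sum(x)],
                   [np.sum(x), np.sum(x**2)]])
d_sums = np.array([np.sum(y), np.sum(x * y)])

print(np.allclose(A_tilde, A_sums))  # True
print(np.allclose(d_tilde, d_sums))  # True
```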

####<font color=green>**Exercise 2**<font>####

Python has many ways to solve this system of equations. In the following function definition, add lines to compute the elements of the matrix $\tilde{\mathbf{A}}$ and the vector $\tilde{\mathbf{d}}$. Then pass them to the ```np.linalg.solve()``` function, which requires a square matrix, to find the intercept and slope of our best-fit line. You may find the ```np.dot()``` ([see here](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.dot.html)) and ```np.sum()``` ([see here](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.sum.html)) functions useful.

In [None]:
def my_linregress(t,y):
    ''' Solve the normal equations (A^T A) m = A^T d to find the intercept and slope'''

    # Solve the normal equations for m = (intercept, slope)
    intercept, slope = 

    return intercept, slope

We can now call our function to calculate the intercept and slope of the best-fit line:

In [None]:
intercept, slope = my_linregress(time, height)
print(slope, intercept)

####<font color=green>**Exercise 3**<font>####
Copy your code from the last plot and add a line of code to plot the best-fit line through the data. Use ```plt.legend()``` to add a label for the line.

In [None]:
# Write code here

## **Residuals or misfit** ##
####<font color=green>**Exercise 4**<font>####
The line looks like a reasonable representation of the data, but how did we do from a quantitative point of view? Let's compute the mean and standard deviation of the residual values (these are the values of the water depth minus the best fitting straight line through the data). Compute and print these values below:

In [None]:
# Write code here


####<font color=green>**Exercise 5**<font>####
The mean is practically zero, which shows that underestimations of the observations are balanced by overestimations. The standard deviation turns out to be close to 3.5 cm. Why do you think this is? What factors can you think of that contribute to the standard deviation? Type your responses in the box below. (Note: the scientists involved in collecting these data estimate that the standard error in each of these annual means for sea level in Auckland is 2.5 cm.)

<font color='red'>*Type your answers here*<font>

If we feel that 3.5 cm is a "poor" fit, we can always fit the data better with a model that has more degrees of freedom. Does this experiment warrant a quadratic term? Or even higher-order polynomials? Maybe not over these 100 years of data, but if the rise is due to climate change and we have positive feedbacks, maybe we should account for that in our model. In any case:

**Given enough degrees of freedom in the model, we can fit any data perfectly!**
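
The statement above can be illustrated with a small sketch. The five data values below are made up and have no underlying trend. A straight line (2 parameters) leaves residuals, while a degree-4 polynomial (5 parameters for 5 points) passes through every point.

```python
import numpy as np

# Hypothetical data: five arbitrary values with no underlying trend
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.3, -1.2, 0.8, 2.5, -0.4])

# Straight line: 2 parameters for 5 points, so residuals remain
line = np.polyfit(x, y, 1)
res_line = y - np.polyval(line, x)

# Degree-4 polynomial: 5 parameters for 5 points, so the curve
# passes through every point
quartic = np.polyfit(x, y, 4)
res_quartic = y - np.polyval(quartic, x)

print(np.max(np.abs(res_line)))     # clearly non-zero
print(np.max(np.abs(res_quartic)))  # zero to within rounding error
```

A perfect fit of this kind says nothing about whether the model is physically meaningful, which is why the number of model parameters needs to be justified by more than the size of the residuals.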

## **Linear regression "out of the box"**##

By the way, there are many ways to do linear regression, or more advanced polynomial fitting, in Python. Here's one example of linear regression from the stats functions in scipy:

In [None]:
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(time, height)
print(slope, intercept)
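
As a cross-check, the result of `scipy.stats.linregress` can also be read by attribute name and compared with a degree-1 `np.polyfit`. The sketch below runs on made-up data (the arrays `t` and `h` are assumptions, not the Auckland record); both routines solve the same least-squares problem, so the two sets of numbers should agree.

```python
import numpy as np
import scipy.stats

# Hypothetical synthetic data: a known line plus small fixed "noise" values
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
h = 1.5 + 0.2 * t + np.array([0.01, -0.02, 0.015, -0.01, 0.02, -0.015])

result = scipy.stats.linregress(t, h)

# np.polyfit with degree 1 returns [slope, intercept] (highest power first)
slope_pf, intercept_pf = np.polyfit(t, h, 1)

print(result.slope, result.intercept)
print(slope_pf, intercept_pf)
print(result.stderr)  # standard error of the estimated slope
```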

####<font color=green>**Exercise 6**<font>####
Add this new best-fit line to the plot and confirm that it is identical to the previous line we calculated. Add the new line to the legend.

In [None]:
# Type your code here

If we decided to model the data with a polynomial instead of a straight line, then we could use the NumPy [polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) function. Here, we fit with a third-degree polynomial:

In [None]:
fit = np.polyfit(time, height, 3) #Least squares polynomial fit. Fit a polynomial p(x) = p[0] * x**deg + ... + p[deg] of degree deg to points (x, y). 

curve = (fit[0] * time ** 3) + (fit[1] * time ** 2) + (fit[2] * time) + fit[3]
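
The expression for `curve` can also be written with `np.polyval`, which evaluates the polynomial from the coefficient array. The sketch below uses made-up data, and it subtracts the mean time before fitting. That centring is an assumption added here, because raw calendar years raised to the third power can make the fit poorly conditioned; it is not a step taken in the cell above.

```python
import numpy as np

# Hypothetical synthetic data standing in for (time, height)
t = np.linspace(1900.0, 2000.0, 21)
h = 1.6 + 0.0015 * (t - 1900.0)

# Centre the time axis before fitting (an assumption made for numerical
# stability, not part of the original cell)
t0 = t.mean()
x = t - t0
coeffs = np.polyfit(x, h, 3)

# np.polyval evaluates p[0]*x**3 + p[1]*x**2 + p[2]*x + p[3]
curve_polyval = np.polyval(coeffs, x)

# The same curve written out term by term
curve_manual = coeffs[0] * x**3 + coeffs[1] * x**2 + coeffs[2] * x + coeffs[3]

print(np.allclose(curve_polyval, curve_manual))  # True
```

If the time axis is centred for the fit, the same shift has to be applied whenever the polynomial is evaluated.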

## **Climate change? An exercise** ##
####<font color=green>**Exercise 7**<font>####
Australian scientists confirm their historic data also support a [1.6 mm/y rise in sea level](https://en.wikipedia.org/wiki/Sea_level_rise) averaged over the last 100+ years. However, tidal gauge and satellite data from the last decade(s) indicate sea level may now be rising at double this rate! With this info, have another look at the Auckland data. Most sea level values in the 2000s fall *above* the regression line. Of course, it would take more than data from one tidal gauge to call this significant, especially when you learn that the Auckland tidal gauge has been moved three times since 2000. However, for the sake of a fitting exercise, write code to fit these data with a polynomial (see example above) and [an exponential function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html). Plot the data with the linear, polynomial, and exponential fits on the same plot, and calculate the residuals for each of the fits.
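
As a starting point for the exponential fit, here is a minimal sketch of how `scipy.optimize.curve_fit` is called. Everything specific in it is an assumption: the model form `a + b*exp(c*t)`, the function name `exp_model`, the synthetic data, and the initial guess `p0`. Time is measured in years since an arbitrary start rather than in calendar years, because the exponential of a value near 2000 multiplied by a rate is numerically awkward.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(t, a, b, c):
    """Hypothetical model: constant level plus exponential growth."""
    return a + b * np.exp(c * t)

# Synthetic, noise-free data generated from known parameters
t = np.linspace(0.0, 100.0, 51)   # years since an arbitrary start
true_params = (1.5, 0.05, 0.02)
h = exp_model(t, *true_params)

# p0 is the initial guess; a nonlinear fit usually needs a sensible one
popt, pcov = curve_fit(exp_model, t, h, p0=(1.4, 0.06, 0.018))

# Residuals: data minus fitted model
residuals = h - exp_model(t, *popt)
print(popt)
print(residuals.std())
```

With real, noisy data the recovered parameters depend on the initial guess and on the chosen model form, so it is worth trying more than one starting point.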

In [None]:
# Write code here


####<font color=green>**Exercise 8**<font>####
Are the polynomial residuals smaller than for a linear fit? How about the exponential residuals? Are any of the residuals closer to the reported standard error in the data? What do you conclude?

<font color='red'>*Type your answers here*<font>

## **Using multiple datasets to understand a trend** ##

As we have seen, it is often difficult to confirm or reject particular models based on one dataset. This is why geophysicists combine multiple different types of data to build a better picture of what is physically happening. However, it often takes a bit of geophysics detective work to prove or disprove a particular hypothesis...

####<font color=green>**Exercise 9**<font>####

The following list gives the average sea level height from 2011-2020 recorded at a location somewhere in New Zealand. Write some code to plot the data.

In [None]:
years = range(2011,2021)
sea_level = np.array([2.99, 3.2, 3.28, 3.29, 3.31, 3.21, 2.32, 2.19, 2.25, 2.22])

# Plot the data here

Is this trend different to what you would expect? What do you think could be causing this discrepancy? (Note: measurement error has been ruled out as a possible cause)

<font color='red'>*Type your answer here*<font>

####<font color=green>**Exercise 10**<font>####

In order to confirm or deny our hypothesis for what we observe, we need to use a second type of data. Locate and download a second dataset that will help you explain the trend you observe in the sea level data. Explain why you chose this dataset in the space below. You may find these links helpful:

[Map of sea level gauge locations in New Zealand](https://www.linz.govt.nz/sea/tides/sea-level-data/sea-level-data-downloads)

[Map of geodetic sensors in New Zealand](https://www.geonet.org.nz/data/gnss/map)

[Instructions for how to download data from GeoNet](https://fits.geonet.org.nz/api-docs/endpoint/observation)

<font color='red'>*Type your answer here*<font>

####<font color=green>**Exercise 11**<font>####
Plot your dataset with the sea level data and explain how they combine to help you understand the physical processes at work.

In [None]:
# Make a plot here

<font color='red'>*Type your answer here*<font>