# Homework 02 of EPS 88

## GPS data and subsidence of the San Joaquin Valley

The example data we are going to use today come from continuously operating high-precision GPS stations operated by UNAVCO, a non-profit, university-governed consortium that facilitates geoscience research and education using geodesy.

Let's get an introduction here: https://youtu.be/yxLMk120vMU

This data viewer gives a great summary velocity overview of the available GPS data: https://www.unavco.org/software/visualization/GPS-Velocity-Viewer/GPS-Velocity-Viewer.html

Let's look at data from a GPS station in Visalia, California. Visalia is in the San Joaquin Valley between Fresno and Bakersfield.

https://www.unavco.org/instrumentation/networks/status/pbo/overview/P566

Today we will focus on putting tools to use that we have used before, along with a couple of new tricks for dealing with time-series data.

## Using pandas to import and filter data

In previous classes we have used the pandas library to import and filter data. The DataFrame object has been the most common way we have dealt with data.

We have used the `numpy` library of functions to make numerical and statistical calculations. In particular we have put the numpy array data structure to work for us.

In [1]:
import pandas as pd
import numpy as np

One of the strengths of pandas is its ability to read and write different file formats. For example, we have used the `pd.read_csv()` function to import .csv files throughout the course. This function can be pointed either to a file that is on your computer or to a file that is posted on the internet. Some online databases let you request data of your choosing through a specially formatted URL (an API). We took this approach to get our birthquakes earlier in the course.

Let's import daily data since 2005 for the Visalia, California GPS station. The data are in the North American tectonic plate (NAM14) reference frame, which means that the interior eastern part of North America is treated as fixed and stable, and all motion is measured relative to it.

In [2]:
P566_GPS_data = pd.read_csv('data/P566.cwu.nam14.csv')

Whoops. There was an error. I kept this error in here as a reminder that I get errors all of the time. Remember that the errors are informative, but can also be a bit cryptic. In this case, it says "Expected 2 fields in line 10, saw 4." So it seems like there is a mismatch between the number of columns it is expecting and the number that there are. 

Let's look at the file.

It turns out that there are a bunch of header lines and the header row that contains the column names needs to be specified (`header = 11`).
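To see how the `header` argument behaves, here is a minimal sketch on a made-up miniature file. The text in `raw_text` and the name `demo` are hypothetical stand-ins, not the real P566 file, which has more metadata lines.

```python
import io
import pandas as pd

# Hypothetical miniature file: three metadata lines, then the row of
# column names, then two rows of data
raw_text = """Station: DEMO
Reference frame: demo
Units: mm
Date, North (mm), East (mm)
2005-11-16, 0.0, 0.0
2005-11-17, 0.1, -0.2
"""

# header=3 tells pandas that row 3 (counting from 0) holds the column names;
# the rows above it are skipped
demo = pd.read_csv(io.StringIO(raw_text), header=3)

print(demo.columns.tolist())
print(demo.shape)
```

Because rows are counted from 0, `header=3` skips three lines here. The same counting applies to `header = 11` for the real file.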

In [3]:
## Add header = 11 to skip the first 11 rows


We know how to take a peek at a DataFrame by applying the `.head()` method.

In [None]:
P566_GPS_data.head()

We have often extracted data from a single column using the syntax `DataFrameName['column_name']`. It can be helpful to look at the available columns:

In [None]:
P566_GPS_data.columns

Some column names have spaces before and after the name. To make it easier to work with the data, let's remove the spaces from the column names.

In [5]:
P566_GPS_data.columns = P566_GPS_data.columns.str.strip()
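As a small illustration of why the stray spaces matter, the sketch below uses a made-up DataFrame (`demo` is a hypothetical name) whose column name is padded with spaces. Looking the column up by its clean name fails until the names are stripped.

```python
import pandas as pd

# Hypothetical DataFrame with a space-padded column name
demo = pd.DataFrame({' North (mm) ': [0.0, 0.1],
                     'Date': ['2005-11-16', '2005-11-17']})

# Before stripping, the clean name is not a valid key
try:
    demo['North (mm)']
    found_before = True
except KeyError:
    found_before = False

# Strip leading and trailing whitespace from every column name
demo.columns = demo.columns.str.strip()
found_after = 'North (mm)' in demo.columns

print(found_before, found_after)
```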

Let's first look at how this point is moving north with respect to stable North America.

In [None]:
P566_GPS_data['North (mm)'].describe()

By themselves these data are pretty cool. It looks like Visalia has moved north relative to stable North America by ~170 mm (17 cm) over the past 18 years (the data starts in November 2005).

### Pandas timeseries

Pandas is good at dealing with time series data. We need to make sure that the values in the 'Date' column are datetime values.

In [None]:
P566_GPS_data['Date'][0]

In [None]:
type(P566_GPS_data['Date'][0])

Right now, pandas thinks that the values in the 'Date' column are strings (a sequence of characters) rather than datetime values. We can convert them to datetime values using `pd.to_datetime`.
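Here is a minimal sketch of that conversion on a made-up three-row DataFrame (`demo` and its dates are hypothetical). Once the column holds datetime values, subtracting dates gives elapsed time, which is not possible with strings.

```python
import pandas as pd

# Hypothetical column of date strings
demo = pd.DataFrame({'Date': ['2005-11-16', '2005-11-17', '2005-11-18']})
type_before = type(demo['Date'][0])   # a plain Python string

# Convert the whole column to datetime values
demo['Date'] = pd.to_datetime(demo['Date'])
type_after = type(demo['Date'][0])    # a pandas Timestamp

# Datetime values support arithmetic, e.g. time elapsed since the first row
elapsed = demo['Date'] - demo['Date'][0]

print(type_before, type_after)
print(elapsed)
```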

In [10]:
## Use the pd.to_datetime function to convert the Date column to a datetime object
P566_GPS_data['Date'] = 

In [None]:
P566_GPS_data['Date'][0]

In [None]:
type(P566_GPS_data['Date'][0])

## Making plots using `matplotlib`

We have relied on `matplotlib` to make plots throughout the course which we have imported as follows:

In [13]:
import matplotlib.pyplot as plt

### Plotting with pandas

One thing that you can do using pandas once you have imported matplotlib that we haven't done very much is use built-in plotting functions on the DataFrame. In this case we can use `.plot`.

In [None]:
P566_GPS_data.plot(x='Date', y='North (mm)')

### Plotting using plt functions

We have made a number of different plot types using `matplotlib` such as `plt.hist()`, `plt.plot()` and `plt.scatter()`. When dealing with datetime values, `plt.plot()` is the most reliable choice. Recent versions of `matplotlib` also accept datetime values in `plt.scatter()`, but older versions may not.

Let's visualize both the north and east columns using `plt.plot()`. The data are from every day between Nov. 16, 2005 and now.

In [None]:
## Create a plot between P566_GPS_data['Date'] and P566_GPS_data['North (mm)']


In [None]:
## Create a plot between P566_GPS_data['Date'] and P566_GPS_data['East (mm)']


What is going on with that drop midway through 2019? Let's take a look. 

Through some trial and error, the drop was between indices 4900 and 4950. Let's zoom in on that drop. 

In [None]:
plt.figure(figsize=(10,5))
plt.plot(P566_GPS_data['Date'][4900:4950],P566_GPS_data['East (mm)'][4900:4950],'.')
plt.ylabel('east since start (mm)')
plt.xlabel('date')
plt.title('GPS data from station P566 (Visalia, CA)')
plt.tight_layout()
plt.show()

What happened on July 6, 2019? Add your answer to the cell below.

https://earthquake.usgs.gov/earthquakes/eventpage/ci38457511/executive

## Fitting a line with scikit-learn

Scikit-learn has a class called `LinearRegression` that can be used to calculate best-fit lines using its `.fit()` method. We have used this to fit lines to data in the past.

Recall from class that we can also consider higher-order curves by using the `PolynomialFeatures` class in scikit-learn.

`PolynomialFeatures` transforms the data into a higher-order polynomial space (adding columns such as $x^{2}$), and `LinearRegression` then fits a linear model to the transformed data.
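As a reminder of how the two pieces fit together, here is a minimal sketch on synthetic, noise-free data. The variable names and the quadratic coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data: y = 2 + 0.5 x - 0.3 x^2 (made-up coefficients)
x = np.linspace(0, 10, 50).reshape(-1, 1)   # 2D array, one feature column
y = 2.0 + 0.5 * x[:, 0] - 0.3 * x[:, 0] ** 2

# Expand the single column x into two columns: x and x^2
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# A linear model in the expanded columns is a quadratic curve in x
quad_model = LinearRegression().fit(x_poly, y)

print(quad_model.intercept_, quad_model.coef_)
```

The fit is still "linear" in the sense that the model is a weighted sum of the columns. Only the columns themselves are nonlinear in `x`.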

In [48]:
from sklearn.linear_model import LinearRegression

We can calculate the number of days by making a new column in the DataFrame that is the 'Date' value minus the initial date. This will be the number of days since the first date in the data set (Nov. 16, 2005).

In [20]:
P566_GPS_data['days'] = (P566_GPS_data['Date'] - P566_GPS_data['Date'][0])/np.timedelta64(1,'D')

Let's take a look at our DataFrame and make sure it has a new column `days` and that the column looks good.

In [None]:
P566_GPS_data.head()

Now we can do a linear regression between the days (`P566_GPS_data['days']`) and the distance traveled north (`P566_GPS_data['North (mm)']`).

In [None]:
## Define and fit a linear regression model using the 'days' column as the independent variable and the 'North (mm)' column as the dependent variable
## Hint: the independent variable should be a 2D array, you should use double brackets [[]] to select the column 
model_north = 
model_north.fit(

We can get the best fitting slope and intercept of the line using the `.coef_` and `.intercept_` attributes of the `LinearRegression` object.
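The sketch below shows what these attributes look like on a made-up data set. `demo`, `demo_model`, and the numbers are hypothetical. Note that `.coef_` is an array with one entry per independent variable, so the slope is its first element, while `.intercept_` is a single number.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: position = 3.0 + 0.05 * days, with no noise
demo = pd.DataFrame({'days': np.arange(100.0)})
demo['position (mm)'] = 3.0 + 0.05 * demo['days']

# Double brackets keep the independent variable 2D (a one-column DataFrame)
demo_model = LinearRegression()
demo_model.fit(demo[['days']], demo['position (mm)'])

demo_slope = demo_model.coef_[0]         # first (and only) coefficient
demo_intercept = demo_model.intercept_   # a single float

print(demo_model.coef_.shape, demo_slope, demo_intercept)
```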

In [23]:
## Retrieve the slope and intercept from the model
slop, intercept = 

In [None]:
print(f'The slope is {slop:.2f} and the intercept is {intercept:.2f}')

**What are the units of this slope?** Write your answer in the cell below.

### Make a plot of the prediction

**Plot a best-fit line for the data.** *Recall that you can use model.predict() to predict the values of the best-fit line.*

**Calculate and plot the residual.** *Recall that the residual is the difference between the actual data and the values obtained with the linear model.*

**Use the same function to predict how far north (relative to stable North America) the Visalia station will go in the next 10 years.** *There are 365.25 days in a year.*
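As a rough illustration of these three steps, here is a sketch on synthetic data. The names (`demo`, `demo_model`, `demo_residuals`) and the slope of 0.02 mm/day are assumptions made up for the example, not values from the P566 record.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic record: steady drift of 0.02 mm/day plus random noise
rng = np.random.default_rng(0)
days = np.arange(0, 1000.0)
demo = pd.DataFrame({'days': days,
                     'position (mm)': 0.02 * days + rng.normal(0, 1.0, days.size)})

demo_model = LinearRegression().fit(demo[['days']], demo['position (mm)'])

# Values of the best-fit line at the observed days
fitted = demo_model.predict(demo[['days']])

# Residual = actual data minus model prediction
demo_residuals = demo['position (mm)'] - fitted

# Extrapolate 10 years (10 * 365.25 days) past the last observation
future = pd.DataFrame({'days': [demo['days'].max() + 10 * 365.25]})
future_position = demo_model.predict(future)[0]

print(f'mean residual: {demo_residuals.mean():.2e} mm')
print(f'predicted position in 10 years: {future_position:.1f} mm')
```

Passing a DataFrame with the same column name to `.predict()` keeps the input consistent with what the model was fit on.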

In [None]:
## Predict the dependent variable based on the independent variable 'days'
y_pred = 


In [50]:
## Plot both the data 'North (mm)' and the model 'y_pred' on the same plot. Using the 'Date' column as the x-axis


In [None]:
## Calculate the residuals between the data 'North (mm)' and the model prediction y_pred. 
residuals = 


In [None]:
## Plot the residuals over Date.


### Evaluating the model using $R^{2}$

We'd also like to know how well this model fits our data (i.e. how correlated the data are). The coefficient of determination, $R^{2}$, can be helpful in this regard. $R^{2}$ is zero for uncorrelated data, and 1 for perfectly linear data (so no misfit between the model line and data).
Let's calculate the $R^{2}$ value for our model.
Recall that the $R^{2}$ value is calculated as follows:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}$$

where $y_{i}$ is the actual data, $\hat{y}_{i}$ is the prediction, and $\bar{y}$ is the mean of the actual data.
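One way to check a hand calculation is to compare it against scikit-learn's built-in `.score()` method, which returns $R^{2}$ for a fitted `LinearRegression`. The sketch below does this on synthetic data. All names and numbers are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic noisy line
rng = np.random.default_rng(1)
x = np.arange(200.0).reshape(-1, 1)
y = 1.5 * x[:, 0] + rng.normal(0, 10.0, 200)

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_manual = 1 - ss_res / ss_tot

# scikit-learn's built-in value for comparison
r2_sklearn = model.score(x, y)

print(f'manual: {r2_manual:.6f}, sklearn: {r2_sklearn:.6f}')
```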

In [None]:
## Calculate the R^2 value for the model
y_data = 
y_pred = 

R2 = 

print(f'The R^2 value is {R2:.6f}')

### Defining a function

When you may be doing a calculation more than once, it is a good idea to define a function. Let's define a function that takes an east magnitude and a north magnitude and returns a direction between 0 and 360 degrees, measured clockwise from north (0 is north, 90 is east).

In [28]:
def GPS_direction(east_magnitude, north_magnitude):
    direction_rad = np.arctan2(east_magnitude, north_magnitude)
    direction = np.rad2deg(direction_rad) % 360
    return direction

In [None]:
GPS_direction(0,-1)
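A quick sanity check is to feed the function the four cardinal directions plus one diagonal. The sketch below repeats the function definition from above so that it runs on its own. The dictionary of test directions is made up for the check.

```python
import numpy as np

def GPS_direction(east_magnitude, north_magnitude):
    # Same definition as above: angle clockwise from north, in degrees
    direction_rad = np.arctan2(east_magnitude, north_magnitude)
    direction = np.rad2deg(direction_rad) % 360
    return direction

# (east, north) pairs for a few known directions
checks = {
    'north': (0, 1),
    'east': (1, 0),
    'south': (0, -1),
    'west': (-1, 0),
    'northwest': (-1, 1),
}

results = {name: GPS_direction(east, north) for name, (east, north) in checks.items()}
for name, angle in results.items():
    print(f'{name}: {angle:.1f} degrees')
```

Passing east as the first argument of `np.arctan2` is what makes the angle run clockwise from north rather than counterclockwise from east. The `% 360` wraps negative angles (such as west) into the 0 to 360 range.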

Let's repeat the process for the east data.

Calculate the slope for the east data. Use this slope and the slope for the north data to calculate the direction of the motion of the station using the `GPS_direction` function.

In [None]:
## Define and fit a linear regression model using the 'days' column as the independent variable and the 'East (mm)' column as the dependent variable
model_east = 
model_east.fit(

In [None]:
### Retrieve the slope and intercept from the model
slop_east, intercept_east = 

print(f'The slope is {slop_east:.2f} and the intercept is {intercept_east:.2f}')

Based on the estimated east and north slopes, we can calculate the direction of motion of the station.

In [None]:
# To make it clear, let's redefine the slope and intercept for the North component
slop_north, intercept_north = model_north.coef_[0], model_north.intercept_

direction = GPS_direction(slop_east, slop_north)

print(f'The direction is {direction:.2f} degrees')

## Making a map with `cartopy`

At the start of the course, we made a number of maps using the `cartopy` library. The code below will make a map showing the location of the P566 GPS station.

Let's define variables giving the latitude and longitude of the P566 GPS station.

In [32]:
P566_lat = 36.32445
P566_lon = -119.22929

In [None]:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import cartopy.io.img_tiles as cimgt

# P566_lon and P566_lat were defined in the cell above

# Add a background image from Google Maps
tiles = cimgt.GoogleTiles()

# Set up the map
plt.figure(figsize=(10, 10))
ax = plt.axes(projection=tiles.crs)
ax.set_extent((-125, -114, 32, 42.5))  # Set the geographical extent (lon_min, lon_max, lat_min, lat_max)

# Instead of adding features, we add the Google Maps image
# ax.add_feature(cfeature.LAND)
# ax.add_feature(cfeature.OCEAN)
# ax.add_feature(cfeature.STATES)

# Add the Google Maps image to the map
ax.add_image(tiles, 6) # The number 6 is the zoom level. The higher the number, the closer in you zoom

# Plot the station's location
ax.scatter(P566_lon, P566_lat, transform=ccrs.PlateCarree(), color='red', s=50, label='P566 Station')

# Annotate the station with its name
ax.text(P566_lon, P566_lat, 'P566\nstation\n', transform=ccrs.PlateCarree(),
        color='red', horizontalalignment='center', verticalalignment='bottom', size=12)

# Annotation for the direction of motion using an arrow
ax.arrow(P566_lon, P566_lat, np.sin(np.deg2rad(direction)), np.cos(np.deg2rad(direction)), head_width=0.1, head_length=0.1, fc='blue', ec='blue', transform=ccrs.PlateCarree())

plt.legend()
plt.show()


![alt text](https://static.temblor.net/wp-content/uploads/2016/05/eastern-california-21.jpg)

Does the direction of motion of the station you calculated match the direction of the Pacific Plate relative to North America? Write your answer in the cell below.

### Let's take a look at the vertical component of the GPS time-series
So far we have been looking at the north and east components of the GPS time-series. Now let's plot the vertical component.

In [53]:
## Create a plot between P566_GPS_data['Date'] and P566_GPS_data['Vertical (mm)']


What do these data show? What is happening to the land surface? Why? Add your answer to the cell below.

https://earthobservatory.nasa.gov/images/89761/san-joaquin-valley-is-still-sinking

https://www.earthdate.org/californias-sinking-valley

This is a big problem for the San Joaquin Valley. So is the rate of land subsidence increasing? Specifically, was the rate of land subsidence greater during the last 5 years (between 2018-01-01 and 2023-01-01) than it was in the first 5 years of the record (between 2006-01-01 and 2011-01-01)?

To answer this question, we need to:
- Filter the DataFrame to only include those years
- Compare the slopes between the two date ranges. Which one appears to be greater? Is this result significant, or do the slopes have overlapping confidence bounds?
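`LinearRegression` does not report confidence bounds on the slope. One possible way to get approximate bounds, sketched below, is `np.polyfit` with `cov=True`, which also returns the covariance matrix of the fitted coefficients. The helper `slope_with_bounds`, the synthetic data, and the choice of 1.96 standard errors for a roughly 95% interval are all assumptions made for illustration.

```python
import numpy as np

def slope_with_bounds(x, y, z=1.96):
    """Least-squares slope with an approximate 95% interval (hypothetical helper)."""
    coeffs, cov = np.polyfit(x, y, 1, cov=True)
    slope = coeffs[0]
    half_width = z * np.sqrt(cov[0, 0])   # z standard errors of the slope
    return slope, slope - half_width, slope + half_width

# Two synthetic 5-year records with different (made-up) subsidence rates
rng = np.random.default_rng(2)
days = np.arange(0, 1826.0)
early = -0.02 * days + rng.normal(0, 5.0, days.size)
late = -0.05 * days + rng.normal(0, 5.0, days.size)

early_slope, early_lo, early_hi = slope_with_bounds(days, early)
late_slope, late_lo, late_hi = slope_with_bounds(days, late)

# Two intervals overlap if each one starts before the other ends
intervals_overlap = (early_lo <= late_hi) and (late_lo <= early_hi)

print(f'early: {early_slope:.4f} ({early_lo:.4f} to {early_hi:.4f}) mm/day')
print(f'late:  {late_slope:.4f} ({late_lo:.4f} to {late_hi:.4f}) mm/day')
print('overlap:', intervals_overlap)
```

These bounds assume that the scatter about the line is independent from day to day. Daily GPS positions are usually correlated in time, so bounds computed this way are likely too narrow and should be read as a lower limit on the uncertainty.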

Let's look at a subset of the data for the past 5 years between 2018-01-01 and 2023-01-01. We have done a lot of this filtering using pandas. However, the syntax is hard to remember.

It can be helpful to remember how this is actually working under the hood. When we pass in a conditional statement like `P566_GPS_data['Date'] >= '2018-01-01'`, we are asking pandas to tell us, for every value in the `P566_GPS_data['Date']` column, whether it is true or false that the date is on or after 2018-01-01.

In [None]:
P566_GPS_data['Date'] >= '2018-01-01'

The result is a Series of true/false values. We then use these true/false values to filter the DataFrame, returning only the rows where the value is true. We can link multiple conditionals together with the `&` symbol, with each conditional wrapped in parentheses, such as in the example below:

In [None]:
P566_GPS_18_23 = P566_GPS_data[(P566_GPS_data['Date'] >= '2018-01-01') & (P566_GPS_data['Date'] < '2023-01-01')]
P566_GPS_18_23.head()
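To make the under-the-hood steps visible, the sketch below builds the two true/false Series separately on a made-up four-row DataFrame and then combines them. `demo`, its dates, and its values are hypothetical.

```python
import pandas as pd

# Made-up rows that straddle both ends of the date range
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2017-12-31', '2018-01-01', '2020-06-15', '2023-01-01']),
    'Vertical (mm)': [-10.0, -11.0, -60.0, -120.0],
})

# Each comparison gives one True/False value per row
after_start = demo['Date'] >= '2018-01-01'
before_end = demo['Date'] < '2023-01-01'

# & is True only where both conditions are True
mask = after_start & before_end

# Indexing with the mask keeps only the rows marked True
subset = demo[mask]

print(mask.tolist())
print(subset)
```

Because the end date uses `<` rather than `<=`, the row dated 2023-01-01 is excluded, while the row dated 2018-01-01 is kept.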

In [None]:
plt.figure()
plt.plot(P566_GPS_18_23['Date'],P566_GPS_18_23['Vertical (mm)'])
plt.ylabel('vertical since start (mm)')
plt.xlabel('date')
plt.title('GPS data from station P566 (Visalia, CA)')
plt.show()

In [None]:
## Let's do the same for the period 2006-2011
P566_GPS_06_11 = 
P566_GPS_06_11.head()

Now we can build a linear model for the 2006-2011 data and another for the 2018-2023 data, so that we can compare their slopes and see whether the rate of land subsidence has increased.

In [None]:
## Define and fit a linear regression model using P566_GPS_18_23['days'] and P566_GPS_18_23['Vertical (mm)']
model_18_23 =
model_18_23.fit(

## Retrieve the slope and intercept from the model
slop_18_23, intercept_18_23 =
print(f'The slope is {slop_18_23:.2f} and the intercept is {intercept_18_23:.2f} for the period 2018-2023')

In [None]:
## Define and fit a linear regression model using P566_GPS_06_11['days'] and P566_GPS_06_11['Vertical (mm)']
model_06_11 = 
model_06_11.fit(

## Retrieve the slope and intercept from the model
slop_06_11, intercept_06_11 =
print(f'The slope is {slop_06_11:.2f} and the intercept is {intercept_06_11:.2f} for the period 2006-2011')

Based on your results, think about the following questions and write your answers in the cell below:

What is the main economic activity around P566? And what resources does that activity require?


What does the vertical component of the GPS time-series tell us about the land movement in the San Joaquin Valley when comparing the periods 2006-2011 and 2018-2023?

What are the implications of the land subsidence in the San Joaquin Valley?

![alt text](https://upload.wikimedia.org/wikipedia/commons/1/1c/Drought_area_in_California.svg)
