# PyEarth: A Python Introduction to Earth Science

## Linear Regression in Earth Science

In [1]:
import json
import pandas as pd
import fsspec
import matplotlib.pyplot as plt
import numpy as np

### Part 1: Data Exploration

1. Load the Ridgecrest earthquake data.

In [2]:
# M6.0 earthquake used in class
# json_url = "https://earthquake.usgs.gov/product/shakemap/ci38443183/atlas/1594160017984/download/stationlist.json"

# M7.1 earthquake for assignment
json_url = "https://earthquake.usgs.gov/product/shakemap/ci38457511/atlas/1594160054783/download/stationlist.json"

with fsspec.open(json_url) as f:
    data = json.load(f)

def parse_data(data):
    """Flatten the ShakeMap station list into one row per station."""
    rows = []
    for feature in data["features"]:
        rows.append({
            "station_id": feature["id"],
            "longitude": feature["geometry"]["coordinates"][0],
            "latitude": feature["geometry"]["coordinates"][1],
            "pga": feature["properties"]["pga"],  # unit: %g
            "pgv": feature["properties"]["pgv"],  # unit: cm/s
            "distance": feature["properties"]["distance"],
        })
    return pd.DataFrame(rows)

data = parse_data(data)
# Drop stations whose PGA or PGV is stored as the string "null"
data = data[(data["pga"] != "null") & (data["pgv"] != "null")]
# Exclude "Did You Feel It?" (DYFI) entries, which are not instrument records
data = data[~data["station_id"].str.startswith("DYFI")]
data = data.dropna()
data = data.sort_values(by="distance", ascending=True)
# Base-10 logarithms of distance, PGA, and PGV for the regression
data["logR"] = np.log10(data["distance"].astype(float))
data["logPGA"] = np.log10(data["pga"].astype(float))
data["logPGV"] = np.log10(data["pgv"].astype(float))

2. Use pandas to print the first few rows of the data. Understand what each column means.

3. Create a scatter plot of latitude vs. longitude, with point colors representing PGA values.

4. Calculate and print the mean and standard deviation of PGA values.
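Steps 2 to 4 can be sketched as follows. This is a minimal sketch rather than a reference solution: the helper name `explore_pga` is hypothetical, and the small `demo` frame holds made-up values that only mimic the columns produced by `parse_data`. Because the JSON values may arrive as strings, the sketch casts `pga` to float before computing statistics.

```python
import matplotlib.pyplot as plt
import pandas as pd


def explore_pga(df):
    """Print the head, map PGA by location, and return (mean, std) of PGA."""
    pga = df["pga"].astype(float)  # values may be strings in the raw JSON

    # Step 2: first few rows
    print(df.head())

    # Step 3: station map, latitude vs. longitude, colored by PGA
    fig, ax = plt.subplots()
    sc = ax.scatter(df["longitude"], df["latitude"], c=pga, cmap="viridis", s=20)
    fig.colorbar(sc, ax=ax, label="PGA (%g)")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")

    # Step 4: summary statistics (pandas std defaults to the sample std, ddof=1)
    mean_pga, std_pga = pga.mean(), pga.std()
    print(f"Mean PGA: {mean_pga:.3f} %g, std: {std_pga:.3f} %g")
    return mean_pga, std_pga


# Synthetic stand-in data (NOT real observations), same columns as `data`
demo = pd.DataFrame({
    "station_id": ["A", "B", "C", "D"],
    "longitude": [-117.6, -117.9, -118.2, -118.4],
    "latitude": [35.8, 35.4, 34.5, 34.1],
    "pga": ["40.0", "20.0", "4.0", "2.0"],
    "pgv": ["30.0", "15.0", "3.0", "1.5"],
    "distance": [10.0, 50.0, 150.0, 200.0],
})
mean_pga, std_pga = explore_pga(demo)
```

In the notebook, `explore_pga(data)` would run the same steps on the Ridgecrest stations. PGA spans a wide range, so coloring by `logPGA` instead may make the map easier to read.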

### Part 2: Simple Linear Regression

1. Create a scatter plot of log(PGA) vs. log(R).

2. Use scikit-learn to fit a linear regression model to this data.


3. Print the slope and intercept of the fitted line.

4. Calculate and print the R-squared value of the model.
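One possible way to approach Part 2 is sketched below, assuming the model form log10(PGA) = intercept + slope * log10(R). The helper name `fit_loglog` is hypothetical, and the arrays are synthetic (a known line plus random noise), not Ridgecrest data, so the printed numbers say nothing about the real earthquake.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def fit_loglog(log_r, log_y):
    """Fit log_y = intercept + slope * log_r by ordinary least squares."""
    X = np.asarray(log_r, dtype=float).reshape(-1, 1)  # sklearn expects 2-D features
    y = np.asarray(log_y, dtype=float)
    model = LinearRegression().fit(X, y)
    r2 = r2_score(y, model.predict(X))  # equivalent to model.score(X, y)
    return model, r2


# Synthetic stand-in (NOT real data): line with slope -1.5, intercept 3.0, plus noise
rng = np.random.default_rng(0)
log_r = np.linspace(0.5, 2.5, 50)
log_pga = 3.0 - 1.5 * log_r + rng.normal(0.0, 0.1, size=log_r.size)

model, r2 = fit_loglog(log_r, log_pga)
slope, intercept = model.coef_[0], model.intercept_
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r2:.3f}")

# Scatter plot with the fitted line overlaid
fig, ax = plt.subplots()
ax.scatter(log_r, log_pga, s=15, label="stations (synthetic)")
ax.plot(log_r, model.predict(log_r.reshape(-1, 1)), color="r", label="linear fit")
ax.set_xlabel("log10(R)")
ax.set_ylabel("log10(PGA)")
ax.legend()
```

With the notebook's frame, the call would be `fit_loglog(data["logR"], data["logPGA"])`.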

### Part 3: Residual Analysis

1. Calculate the residuals of your model.

2. Create a residual plot of residuals vs. log(R).

3. Comment on any patterns you observe in the residual plot.
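The residual steps can be sketched as below, again on synthetic data rather than the Ridgecrest records. Residuals are taken as observed minus predicted log10(PGA). When reading the plot, a curved trend would hint that a single straight line misses part of the distance dependence, and a spread that changes with log(R) would hint at non-constant variance.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in (NOT real data): a known line plus random noise
rng = np.random.default_rng(1)
log_r = np.linspace(0.5, 2.5, 50)
log_pga = 3.0 - 1.5 * log_r + rng.normal(0.0, 0.1, size=log_r.size)

X = log_r.reshape(-1, 1)
model = LinearRegression().fit(X, log_pga)

# Residual = observed - predicted, in log10 units
residuals = log_pga - model.predict(X)

fig, ax = plt.subplots()
ax.scatter(log_r, residuals, s=15)
ax.axhline(0.0, color="k", linestyle="--", linewidth=1)  # zero-residual reference
ax.set_xlabel("log10(R)")
ax.set_ylabel("Residual in log10(PGA)")
```

For a least-squares fit with an intercept, the residuals average to zero by construction, so the informative part of the plot is how they are distributed along the distance axis.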

4. Find the actual PGA value recorded in Los Angeles for the Ridgecrest earthquake using the USGS ShakeMap.

USGS ShakeMap: [https://earthquake.usgs.gov/earthquakes/eventpage/ci38457511/shakemap/pga](https://earthquake.usgs.gov/earthquakes/eventpage/ci38457511/shakemap/pga)

Note that we are working with the M7.1 earthquake.

5. Use your model to predict the PGA in the Los Angeles basin (hint: use the distance of the station you selected).

6. Compare the recorded PGA with your model's prediction. Discuss possible reasons for any differences.
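A minimal sketch of the prediction and comparison follows. It assumes the model was fitted in log10 space, so the prediction has to be converted back with a power of ten. The helper name `predict_pga` is hypothetical, and every number below is a placeholder: substitute your fitted slope and intercept, the distance of the station you chose, and the PGA you read from the ShakeMap.

```python
import math


def predict_pga(slope, intercept, distance):
    """Invert log10(PGA) = intercept + slope * log10(R) to get PGA in %g."""
    return 10 ** (intercept + slope * math.log10(distance))


# Hypothetical placeholders, NOT fitted or observed values
slope, intercept = -1.5, 3.0
distance = 100.0     # same units as the `distance` column
observed_pga = 2.0   # %g, replace with the value read from the ShakeMap

predicted_pga = predict_pga(slope, intercept, distance)

# Compare in log space, which matches how the model was fitted
log_residual = math.log10(observed_pga) - math.log10(predicted_pga)
print(f"predicted = {predicted_pga:.3f} %g")
print(f"observed / predicted = {observed_pga / predicted_pga:.2f}")
print(f"log10 residual = {log_residual:.3f}")
```

Comparing the log10 residual of the selected station with the scatter in the residual plot shows whether that station is unusual or simply within the typical spread of the model.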
