# Lab 4. Data structures and arrays
#### Computational Methods for Geoscience - EPS 400/522
#### Instructor: Eric Lindsey

Due: Sept. 21, 2023

---------

Adrian Marziliano

In [None]:
# some useful imports and settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import interpolate
import netCDF4 as nc

%config InlineBackend.figure_format = 'retina' # better looking figures on high-resolution screens

### Using data structures to categorize data

The file 'worldwide_m4+_2022.csv' (on Canvas) contains all earthquakes larger than magnitude 4 recorded by the USGS in 2022 (more than 15,000 events). Let's use a dictionary to keep track of how many events happened in each state.

First, read the data into Python using pandas. The column 'place' contains a short description of the location of each event, and if it occurred in the US, this description will (usually) mention a state name. We can find out if a string is contained in another string using the keyword 'in' (see the notes).

Instructions: loop over the list of state names, and for each state count the number of M4+ earthquakes that occurred in that state (you may need to loop over the whole dataset for each state name). Add this number to a dictionary with the state name as the key; for example, it might contain 'New Mexico': 4.

Finally, print out the top 10 states by number of earthquakes in 2022.
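The counting procedure described above can be sketched as follows. This is a minimal illustration, not the graded solution: the `places` list is a hypothetical stand-in for the 'place' column of the CSV file, and the short `states` list is only a subset of the full list of state names. Note that a plain substring test also counts "West Virginia" events under "Virginia", so the real data may need extra care.

```python
# Hypothetical sample of 'place' strings; the real ones come from the CSV file
places = [
    "10 km NW of Anchorage, Alaska",
    "Alaska Peninsula",
    "5 km S of Ridgecrest, California",
    "20 km E of Pecos, Texas",
    "south of the Fiji Islands",
    None,  # stand-in for a missing description, skipped below
]

# Shortened list of state names (assumption: the full 50-state list is used in practice)
states = ["Alaska", "California", "New Mexico", "Texas"]

# Build the dictionary: state name -> number of events mentioning that state
counts = {}
for state in states:
    n = 0
    for place in places:
        # 'in' tests whether one string is contained in another
        if isinstance(place, str) and state in place:
            n += 1
    counts[state] = n

# Sort the (state, count) pairs by count, largest first, and keep the top 10
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]
for state, n in top:
    print(f"{state}: {n}")
```

With the real dataset, `places` would be replaced by `earthquake_df['place']` and `states` by the full list of state names.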

In [None]:
earthquake_df=pd.read_csv('worldwide_m4+_2022.csv')
print(earthquake_df[['longitude', 'latitude', 'mag', 'place']])

In [None]:
# Create list of states as the keys of your dictionary.
us_states = [ "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", 
             "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
             "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", 
             "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire",
             "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", 
             "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", 
             "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", 
             "Wisconsin", "Wyoming"]

# Create an empty dictionary to store the earthquake data
earthquake_data_dict = {
    "Time": [],
    "Latitude": [],
    "Longitude": [],
    "Depth": [],
    "Magnitude": [],
    "Place": []
}

# Iterate through the DataFrame and extract data for US state names
for index, row in earthquake_df.iterrows():
    if isinstance(row['place'], str):  # Check if 'place' is a string
        for state in us_states:
            if state in row['place']:
                earthquake_data_dict["Time"].append(row['time'])
                earthquake_data_dict["Latitude"].append(row['latitude'])
                earthquake_data_dict["Longitude"].append(row['longitude'])
                earthquake_data_dict["Depth"].append(row['depth'])
                earthquake_data_dict["Magnitude"].append(row['mag'])
                earthquake_data_dict["Place"].append(row['place'])
                break  # Break the loop once a match is found to avoid duplicate entries

# Convert the dictionary to a DataFrame if needed
earthquake_data_df = pd.DataFrame(earthquake_data_dict)

# Print or use the earthquake data as needed
print(earthquake_data_df[['Time', 'Magnitude', 'Place']])

### Resampling a dataset

Often, our data have missing values, large errors, or uneven sampling. In these cases, we need to 'resample' the data onto a regular grid. This is also known as 'gridding' the data.

In [None]:
# original data - slight variation in the time sampling
time = np.linspace(0, 10, 20) +  np.random.uniform(-0.2, 0.2, 20)
values = np.sin(time)

# add some bad data
ibad=np.random.randint(2,18,(4,))
values[ibad] += 5+10*np.random.rand(4)

# plot the data
plt.plot(time,values,'ks',label='original')

In [None]:
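# Plot US earthquake data
# Convert the time strings to datetimes so the x-axis is a true time axis rather than a list of text labels
plt.plot(pd.to_datetime(earthquake_data_df['Time']),earthquake_data_df['Magnitude'],'k.') # notice I set the marker to black dots with 'k.'
plt.show()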

### Assignment 1: remove outliers and resample the above data 

Step 1. Remove the outliers using logical indexing.

Step 2. Resample the remaining data onto a regularly spaced set of points sampled every 0.1 seconds, from 0 to 10. You can choose the interpolation method you find best!

Step 3. Plot the resampled data on top of the original data (without outliers), showing how the interpolation works.
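One possible way to carry out Steps 1 and 2 on the synthetic time series is sketched below. The seeded random generator, the outlier threshold of 1.5, and the choice of linear interpolation with `np.interp` are assumptions made for illustration; other thresholds and interpolation methods would work as well.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Synthetic series built like the one above: jittered sample times, sine values
time = np.linspace(0, 10, 20) + rng.uniform(-0.2, 0.2, 20)
values = np.sin(time)

# Add 4 bad points, each shifted upward by at least 5
ibad = rng.choice(np.arange(2, 18), size=4, replace=False)
values[ibad] += 5 + 10 * rng.random(4)

# Step 1: logical indexing. A clean sine never leaves [-1, 1], so anything
# with absolute value above 1.5 is treated as an outlier (assumed threshold).
good = np.abs(values) < 1.5
time_good = time[good]
values_good = values[good]

# Step 2: linear interpolation onto a regular grid sampled every 0.1 s from 0 to 10.
# np.interp needs increasing sample times, which holds here because the jitter
# (at most 0.2) is smaller than half the nominal spacing. Grid points outside
# the sampled range take the value of the nearest end point.
time_regular = np.linspace(0, 10, 101)
values_regular = np.interp(time_regular, time_good, values_good)

print(good.sum(), "points kept;", values_regular.size, "resampled points")
```

For Step 3, `time_good, values_good` and `time_regular, values_regular` can be passed to `plt.plot` with different marker styles to compare the two.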

#### STEP 1: Remove Outliers

In [None]:
# Remove magnitude outliers from 'earthquake_data_df' (the US events selected above)

# Calculate mean and standard deviation for the 'Magnitude' column
mean_magnitude = earthquake_data_df['Magnitude'].mean()
std_deviation_magnitude = earthquake_data_df['Magnitude'].std()

# Define a threshold for outliers: values more than 2 standard deviations from the mean
threshold = 2 * std_deviation_magnitude

# Create a boolean mask identifying outliers
outliers_mask = np.abs(earthquake_data_df['Magnitude'] - mean_magnitude) > threshold

# Logical indexing keeps the rows where the mask is False.
# .copy() gives an independent DataFrame, and reset_index renumbers the rows from 0.
filtered_earthquake_df = earthquake_data_df[~outliers_mask].copy().reset_index(drop=True)

# Print the filtered DataFrame
print(filtered_earthquake_df)

In [None]:
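# Plot US earthquake data without outliers
# Convert the times to datetimes (a no-op if already converted) so the x-axis is a true time axis
plt.plot(pd.to_datetime(filtered_earthquake_df['Time']),filtered_earthquake_df['Magnitude'],'k.') # notice I set the marker to black dots with 'k.'
plt.show()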

#### STEP 2: Resample Data

In [None]:
# Work on a copy so the outlier-filtered DataFrame from Step 1 is left unchanged
resample_input = filtered_earthquake_df.copy()

# Convert the 'Time' column to datetime objects if it is not already
resample_input['Time'] = pd.to_datetime(resample_input['Time'])

# Set 'Time' as the index (required by resample) and sort it in increasing order
resample_input = resample_input.set_index('Time').sort_index()

# Resample onto a regularly spaced time grid.
# The 0.1 s spacing from 0 to 10 s in the assignment refers to the synthetic series;
# the earthquake catalog spans a full year, so a daily grid ('1D') is used here instead.
# Only the numeric columns are averaged, since 'Place' is text.
numeric_columns = ['Latitude', 'Longitude', 'Depth', 'Magnitude']
resampled_df = resample_input[numeric_columns].resample('1D').mean()

# Days with no events are NaN after resampling; forward fill them from the previous day
resampled_df = resampled_df.ffill()

# Reset the index to have 'Time' as a column again
resampled_df = resampled_df.reset_index()

# Print the resampled DataFrame
print(resampled_df)

In [None]:
# Plot the resampled US earthquake data
plt.plot(resampled_df['Time'],resampled_df['Magnitude'],'k.') # notice I set the marker to black dots with 'k.'
plt.show()

### Assignment 2. Use 2D Interpolation to fill in the continents.

Remember our averaged-monthly SST dataset? (Filename: 'sst.mon.ltm.1981-2010.nc') Let's use this as a (strange) example of interpolation. Try masking out the NaNs in the grid of temperatures from September, then use griddata to fill in all the values over the continents.

I think this will prove a little challenging - good luck, work with each other!

In [None]:
# here is some code to get you started.
# note you will have to copy the data file into your current folder for it to work for you.

filename = 'sst.mon.ltm.1981-2010.nc'
dataset = nc.Dataset(filename)

# sst is stored as a 3D array (time,lat,lon)
# get the grid in September
sst_sept=dataset['sst'][8,:,:]

# Hint: note that this netCDF dataset comes with a 'mask' property that lets us know which values are NaN.
# we can access them with sst_sept.mask

print('whether each point is nan:\n',sst_sept.mask)

# you can use this to extract only the valid data from any given array, if it has the same size
zvalid = sst_sept[~sst_sept.mask]

# check the shapes:
print('shape of sst_sept is', np.shape(sst_sept))
# notice, now it became a vector instead of an array.
print('shape of zvalid is', np.shape(zvalid))


#### I suggest the following procedure:

**Step 1. Generate the gridded X and Y matrices**

Use np.meshgrid on the dataset['lon'] and dataset['lat'] vectors.
Make sure to verify that your output arrays have the same size as your SST data.
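As a small illustration of the shape check in Step 1, the sketch below uses hypothetical 1-degree longitude and latitude vectors in place of `dataset['lon']` and `dataset['lat']`; the actual grid spacing in the file may differ.

```python
import numpy as np

# Hypothetical coordinate vectors standing in for dataset['lon'] and dataset['lat']
lon = np.arange(0.5, 360.0, 1.0)    # 360 longitudes
lat = np.arange(89.5, -90.0, -1.0)  # 180 latitudes

# meshgrid returns 2D arrays with one row per latitude and one column per longitude,
# which matches data stored as (lat, lon)
X, Y = np.meshgrid(lon, lat)

print('shape of X is', X.shape)
print('shape of Y is', Y.shape)
```

Each row of `X` repeats the longitude vector and each column of `Y` repeats the latitude vector, so `X[i, j]` and `Y[i, j]` give the coordinates of grid cell `(i, j)`.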

**Step 2. Extract the valid points from each of your 3 arrays (X, Y, SST)**

Check out the hint above for how to use the mask property of the netCDF dataset.

**Step 3. Choose an interpolation method and do the interpolation from the scattered valid data back to the full X and Y grids**

**Step 4. Mask the ocean areas to show just the continents. You should end up with something cool!**
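The four steps above might look roughly like the sketch below. It runs on a small synthetic grid rather than the SST file: the coarse `lon` and `lat` vectors, the cosine "temperature" field, and the rectangular "continent" are all made up for illustration. Linear interpolation is one choice among several, and longitude wrap-around at 0/360 degrees is ignored for simplicity.

```python
import numpy as np
from scipy import interpolate

# Step 1: hypothetical coarse grid standing in for the real lon/lat vectors
lon = np.linspace(0.0, 350.0, 36)
lat = np.linspace(-80.0, 80.0, 17)
X, Y = np.meshgrid(lon, lat)

# Synthetic "SST": warm at the equator, cold toward the poles
sst = 28.0 * np.cos(np.radians(Y))

# Made-up rectangular continent, masked the way land is masked in the SST file
land = (X > 100) & (X < 200) & (np.abs(Y) < 40)
sst_masked = np.ma.masked_array(sst, mask=land)

# Step 2: extract the valid (ocean) points from X, Y and the data
valid = ~np.ma.getmaskarray(sst_masked)
points = np.column_stack((X[valid], Y[valid]))
zvalid = sst_masked.data[valid]

# Step 3: interpolate from the scattered valid points back onto the full grid
filled = interpolate.griddata(points, zvalid, (X, Y), method='linear')

# Step 4: mask the ocean so that only the interpolated continent values remain
continents = np.ma.masked_array(filled, mask=~land)

print('number of filled land points:', continents.count())
```

With the real data, `land` would be `sst_sept.mask`, and `continents` could be displayed with `plt.pcolormesh(X, Y, continents)`.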