# GPGN268 - Geophysical Data Analysis
## Types of Seismic Data – Part I: File manipulation
#### Due: April 6, in class

For the first part of this Data Story you will look at seismic data measured by [Distributed Acoustic Sensing](https://en.wikipedia.org/wiki/Distributed_acoustic_sensing)
(DAS) cables deployed at Kafadar Commons as part of the [Geophysical Discovery Lab](https://people.mines.edu/rkrahenb/geophysical-discovery-lab/). The data was collected and pre-processed by [Dr. Eileen Martin](https://geophysics.mines.edu/project/martin-eileen/) and her Mathematical Geophysics students.

- The data was collected on two separate days: Feb 17, 2023 and March 1, 2023.
- The file names are tagged with the date and time of collection in UTC
- The data was recorded at a rate of 500 samples per second
- The measurements recorded correspond to strain and timing of the measurement


You may discuss this assignment with your peers, but everyone should submit their assignment individually. If anyone has *significantly* contributed to your work, helped you figure something important out, etc., list them as a collaborator below with a short description of their input. **Collaborating will not impact your grade**. Please be honest.



### Preparation

- Navigate to the GPGN268-CORE directory and do a `git pull` to get this notebook. 

- Then navigate to your `ds03-seismology` directory, create a directory called `notebooks`, and copy this notebook to your `notebooks` directory. 

```
$ cd ~/work/classes/GPGN268/coursework-lastname/ds03-seismology/
$ cp ~/work/classes/GPGN268/GPGN268-CORE/assignments/DS03-DAS.ipynb notebooks
```

- On Canvas, download the data files from files/data/DAS. There should be 11 files. 
Then create a sub-directory `data` in your `ds03-seismology` directory and move the `*.h5` files that you just downloaded there. When you are done, 
you should have something like this:

```bash
$ pwd
~/work/classes/GPGN268/coursework-villasboas/ds03-seismology
$ ls data
Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164233Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164333Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164433Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164533Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164633Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-03-01T231427Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-03-01T231527Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-03-01T231627Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-03-01T231727Z.h5
Global_DAS_1078DT_25PR_14GL_5DEC_2023-03-01T231827Z.h5
```

- From the root of your `ds03-seismology` directory, launch Jupyter Lab. Remember to activate the GPGN268 conda environment first.

```
$ conda activate GPGN268
$ jupyter lab
```

- Using the left navigation toolbar in Jupyter Lab, go to the `notebooks` directory and rename this notebook to `dev.ipynb`. This is where you will develop the code for your data story (try things out, make draft figures, etc). **You will not** turn in (i.e., push to GitHub) the `dev.ipynb` file. 

- Create another notebook called `ds03-seismology.ipynb`. This is where you will put the final version of your Data Story, with polished text, and clean and well-documented code.

- Copy the text below onto the first cell (Markdown) of your `ds03-seismology.ipynb` notebook and fill it out with your name and date.

```markdown
# GPGN268 - Geophysical Data Analysis
## Data Story 03 - Seismology

**Student:** Blaster the Burro 
**Collaborators:**
- Yoda helped me figure out how to use the force
- Obi-Wan provided input on my code to plot resistivity
**Date:** May the 4th, 2078
```

- Complete the tasks below. Use this notebook (`dev.ipynb`) to explore and follow the instructions. After you are done with the final version of your assignment, git add `ds03-seismology.ipynb`, commit, and push to GitHub.

### Reading HDF5 data
The DAS data that we will use is in HDF5 format. To read this data we will use a Python library called [h5py](https://docs.h5py.org/en/stable/#).

In [63]:
# Import libraries
import os
import glob
import numpy as np

import matplotlib.pyplot as plt
import h5py
import pandas as pd

First we will list all files from 2023-02-17 using `glob` and wildcards

In [3]:
files_list = sorted(glob.glob('../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02*.h5'))
files_list

['../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5',
 '../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164233Z.h5',
 '../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164333Z.h5',
 '../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164433Z.h5',
 '../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164533Z.h5',
 '../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164633Z.h5']
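
To see how the `*` wildcard selects files without touching the disk, here is a small sketch using `fnmatch`, the standard-library module that `glob` relies on for pattern matching. The three names are taken from the listing above; the variable names are only illustrative.

```python
import fnmatch

# Three of the file names listed above (two from February, one from March)
names = [
    'Global_DAS_1078DT_25PR_14GL_5DEC_2023-03-01T231427Z.h5',
    'Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164233Z.h5',
    'Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5',
]

# '*' matches any run of characters, so this keeps only the February files
february = sorted(fnmatch.filter(names, '*_2023-02*.h5'))
print(february)
```

The call to `sorted` matters because `glob.glob` does not guarantee any particular order. Since the date and time in each name are zero-padded, sorting the names alphabetically also sorts them chronologically.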

To read a file with `h5py` we use the class `h5py.File` and pass the name of the file followed by the argument `'r'`, which stands for "reading mode". Let's read the first file from our list

In [5]:
data = h5py.File(files_list[0], 'r')
data

<HDF5 file "Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5" (mode r)>

We have no idea what to do with this object, so let's use `dir` to try to find something about it.

In [6]:
dir(data)

['_MutableMapping__marker',
 '__abstractmethods__',
 '__bool__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_d',
 '_e',
 '_gcpl_crt_order',
 '_id',
 '_ipython_key_completions_',
 '_lapl',
 '_lcpl',
 '_libver',
 'attrs',
 'build_virtual_dataset',
 'clear',
 'close',
 'copy',
 'create_dataset',
 'create_dataset_like',
 'create_group',
 'create_virtual_dataset',
 'driver',
 'file',
 'filename',
 'flush',
 'get',
 'id',
 'items',
 'key

This object has a method called `keys`, so let's try that

In [7]:
data.keys()

<KeysViewHDF5 ['Acquisition']>

... and keep exploring

In [8]:
data['Acquisition']

<HDF5 group "/Acquisition" (2 members)>

In [9]:
data['Acquisition'].keys()

<KeysViewHDF5 ['Custom', 'Raw[0]']>

In [10]:
data['Acquisition']['Raw[0]']

<HDF5 group "/Acquisition/Raw[0]" (3 members)>

In [11]:
data['Acquisition']['Raw[0]'].keys()

<KeysViewHDF5 ['Custom', 'RawData', 'RawDataTime']>
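
Stepping through the keys one level at a time works, but the whole hierarchy can also be listed in one pass. The sketch below is an illustration rather than part of the assignment: it builds a tiny toy file whose group and dataset names mimic the ones found above (the values and shapes are made up), then reopens it inside a `with` block, so the file is closed automatically, and walks it with `visititems`. The names `toy_path`, `found`, and `describe` are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a toy HDF5 file that mimics the layout above (values are made up)
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_das.h5')
with h5py.File(toy_path, 'w') as toy:
    raw = toy.create_group('Acquisition/Raw[0]')
    raw.create_dataset('RawData', data=np.zeros((4, 3), dtype='int32'))
    raw.create_dataset('RawDataTime', data=np.arange(4, dtype='int64'))

found = []  # names of every group and dataset in the file


def describe(name, obj):
    """Record and print each object visited by visititems."""
    found.append(name)
    kind = 'dataset' if isinstance(obj, h5py.Dataset) else 'group'
    shape = obj.shape if kind == 'dataset' else ''
    print(kind, name, shape)


# The with block closes the file even if something goes wrong inside it
with h5py.File(toy_path, 'r') as toy:
    toy.visititems(describe)
```

The same `describe` function could be passed to `data.visititems` on the real file to print its full structure.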

In [13]:
data['Acquisition']['Raw[0]']['RawData'][:]

array([[ 13963817,  15360819,  15728914, ...,  54194224, -72606198,
        -75493199],
       [ 13963223,  15360234,  15728365, ...,  54195131, -72605440,
        -75492494],
       [ 13962343,  15359339,  15727367, ...,  54193676, -72606535,
        -75492884],
       ...,
       [ 14339230,  15755428,  16149981, ...,  54189850, -72609147,
        -75494221],
       [ 14339863,  15756100,  16150714, ...,  54190729, -72608397,
        -75494259],
       [ 14339059,  15755116,  16149678, ...,  54189972, -72608558,
        -75493907]], dtype=int32)

Bingo! It looks like we found where in this data structure the strain measurements are stored. It looks like there is some time information too.

In [73]:
data['Acquisition']['Raw[0]']['RawDataTime'][:]

array([1676652093654000, 1676652093656000, 1676652093658000, ...,
       1676652153648000, 1676652153650000, 1676652153652000])
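
The magnitude of these integers suggests that they are microseconds since the Unix epoch (1970-01-01 UTC). That unit is an assumption, not something stated in the file, but it can be checked with a short sketch: converting the first few values printed above should give a time that agrees with the one in the file name.

```python
import numpy as np
import pandas as pd

# First three values of RawDataTime, copied from the output above
raw_time = np.array([1676652093654000, 1676652093656000, 1676652093658000],
                    dtype='int64')

# Assumption: the values are microseconds since 1970-01-01 UTC
as_datetime = pd.to_datetime(raw_time, unit='us', utc=True)
print(as_datetime[0])
```

Under that assumption the first value falls at 16:41:33.654 UTC on 2023-02-17, which agrees with the `2023-02-17T164133Z` tag in the file name.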

----------------------------
## Task 1 – Reading the data

Write a function that takes as input the path to a given DAS `.h5` file and returns the strain and timing

```python
def your_function(path_to_file):
    ...
    ...
    
    
    return strain, timing
```

### Manipulating file names and paths

Python has an incredibly useful library for manipulating files and paths called [os](https://docs.python.org/3/library/os.html). We've seen in the introduction that the time of the start of the data collection is written in the file name. We will use `os` to extract the time information from the filename.

We know from using `dir` above that our `h5py` object has an attribute `filename`

In [75]:
data

<HDF5 file "Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5" (mode r)>

In [76]:
data.filename

'../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5'

`data.filename` has two parts: the path and the actual filename (also known as basename). We can extract the basename using:

In [77]:
os.path.basename(data.filename)

'Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5'

Now, we would like to grab just the portion that has `2023-02-17T164133Z`. We see that parts of the filename are separated by an underscore. We can use the function `split` to split the filename every time there is an underscore

In [79]:
os.path.basename(data.filename).split('_')

['Global', 'DAS', '1078DT', '25PR', '14GL', '5DEC', '2023-02-17T164133Z.h5']

It's getting better. Now, we just want the last element of this list

In [80]:
os.path.basename(data.filename).split('_')[-1]

'2023-02-17T164133Z.h5'

And split again where there is a `.`

In [81]:
os.path.basename(data.filename).split('_')[-1].split('.')

['2023-02-17T164133Z', 'h5']

Great! Now, we want the first element of the list above

In [82]:
os.path.basename(data.filename).split('_')[-1].split('.')[0]

'2023-02-17T164133Z'

And we probably don't want to include the "Z" at the end

In [85]:
os.path.basename(data.filename).split('_')[-1].split('.')[0][:-1]

'2023-02-17T164133'

Combining all steps into a single line and saving it to a variable, we have:

In [151]:
time_string = os.path.basename(data.filename).split('_')[-1].split('.')[0][:-1]
time_string

'2023-02-17T164133'
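
As an alternative sketch, the same string can be reached with the standard-library module `pathlib`, whose `stem` attribute drops the directory and the extension in one step. The path below is copied from the output above; `alt_time_string` is just an illustrative name.

```python
from pathlib import Path

path = '../data/Global_DAS_1078DT_25PR_14GL_5DEC_2023-02-17T164133Z.h5'

# stem is the file name without the directory and without the '.h5' extension
stem = Path(path).stem

# Take the last underscore-separated piece and drop the trailing 'Z'
alt_time_string = stem.split('_')[-1].rstrip('Z')
print(alt_time_string)
```

Either approach is fine here; the `os.path` version makes each intermediate step visible, which is useful while exploring.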

-------------------------------
## Task 2 – Time and time again

### Task 2.1

- Test the code above for other files in your `files_list` and see if the time extracted from the filename makes sense.

- Using the steps above, write a function that takes as input the path to a given DAS `.h5` file and returns a string with the date and time portion of the filename.

```python
def your_function(path_to_file):
    ...
    ...
    
    
    return time_string
```

In [133]:
# Your code goes here

#### Creating a time vector

Now, the step above gave us a string with the time and date. We would really like to have an actual time with the proper units. The function `to_datetime` from `pandas` can help with that. For example, let's say I'm working with a time string `time_test`

In [91]:
time_test = '1989-05-21 03:14' # This string has spaces and special characters
pd.to_datetime(time_test, format='%Y-%m-%d %H:%M') # format will tell pandas how to read the date and time

Timestamp('1989-05-21 03:14:00')

We can be even more specific and pass to `pd.to_datetime` the argument `utc=True` to indicate that our timestamp is in reference to UTC time.
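
A minimal sketch of the difference, reusing the `time_test` string from the example above (the names `naive` and `aware` are only illustrative):

```python
import pandas as pd

time_test = '1989-05-21 03:14'

# Without utc=True the result is "naive": it carries no time zone
naive = pd.to_datetime(time_test, format='%Y-%m-%d %H:%M')

# With utc=True the same clock reading is labelled as UTC
aware = pd.to_datetime(time_test, format='%Y-%m-%d %H:%M', utc=True)

print(naive.tz, aware.tz)
```

Having a time-zone-aware timestamp is what makes a later conversion to local time possible, since `tz_convert` raises an error on a naive timestamp.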

### Task 2.2

Now, try a similar approach to convert the time from your filename (`time_string`) to a timestamp:

```python
time_stamp = pd.to_datetime(....,format=...., utc=True)
```

In [132]:
# Your code goes here

In [152]:
time_stamp = pd.to_datetime(time_string, format='%Y-%m-%dT%H%M%S', utc=True)
time_stamp

Timestamp('2023-02-17 16:41:33+0000', tz='UTC')

### Task 2.3
Given that these measurements were collected at a rate of 500 measurements per second, calculate:
- The total duration of your record in seconds
- The interval between two consecutive measurements in seconds

In [131]:
# Your code goes here

### Task 2.4

`pandas` has a function `Timedelta` which allows adding a certain amount of time to a timestamp. For example, if you wanted to add 10 seconds to your `time_stamp`, you would do:

```python
time_stamp + pd.Timedelta(10, 's') # Value to add, units of the value (seconds in this case)
```
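
The value passed to `Timedelta` does not have to be a whole number of seconds. As a small sketch on a made-up start time (the variable names are illustrative):

```python
import pandas as pd

start = pd.Timestamp('2023-01-01 00:00:00', tz='UTC')  # made-up start time

# The value can be fractional, and units other than seconds are accepted
plus_fraction = start + pd.Timedelta(1.5, 's')   # one and a half seconds
plus_millis = start + pd.Timedelta(250, 'ms')    # 250 milliseconds

print(plus_fraction)
print(plus_millis)
```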

Using the method above and the values that you calculated in Task 2.3, create a variable `end_time` to save the timestamp for the last measurement in your dataset.

```python
end_time = time_stamp + pd.Timedelta(..., ...)
```

In [135]:
# Your code goes here

### Task 2.5

Now you should have the start time (`time_stamp`), the end time (`end_time`) and you should know how many points you have in your data. This should be sufficient for you to create a time vector with the datetimes of all your records. To do that, use the function `date_range` from `pandas`, which, when given a start, an end, and a number of points, works much like `numpy.linspace`. The first argument is the start date, the second argument is the end date, and the argument `periods` is the number of elements that you want in your range (or the length of your record).

```python
time_utc = pd.date_range(start=..., end=..., periods=...)
```
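
One detail worth checking on a toy case is that `date_range` includes both the start and the end, so `periods` points are separated by `periods - 1` equal intervals. The sketch below uses made-up times and illustrative variable names.

```python
import pandas as pd

start = pd.Timestamp('2023-01-01 00:00:00', tz='UTC')  # made-up times
end = start + pd.Timedelta(1, 's')

# Both end points are included, so 5 periods means 4 equal intervals
toy_range = pd.date_range(start=start, end=end, periods=5)

step = toy_range[1] - toy_range[0]
print(step)
```

Here one second is split into four steps of 0.25 s. Comparing the step of your own time vector against the sampling interval is a quick way to catch an off-by-one error in `end_time`.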

In [147]:
# Your code goes here

### Task 2.6
The final step is to convert the time to Mountain time. You can achieve this by doing

```python
time = time_utc.tz_convert('US/Mountain')
```
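
As a quick sketch of what the conversion does, here it is applied to a single timestamp, the start time read from the first file name:

```python
import pandas as pd

utc_time = pd.Timestamp('2023-02-17 16:41:33', tz='UTC')

# tz_convert changes how the instant is displayed, not the instant itself
mountain_time = utc_time.tz_convert('US/Mountain')

print(mountain_time)
print(mountain_time == utc_time)
```

The clock reading shifts by the UTC offset (seven hours on this date), but the two timestamps still compare as equal because they describe the same moment.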

In [None]:
# Your code goes here

### Task 2.7

Combine all the steps that you used for creating the time into a function that takes as input the path to a given DAS .h5 file and returns a datetime vector referenced to US/Mountain time.


```python
def your_function(path_to_file):
    ...
    ...
    
    return time
```

--------
## Task 3

The data in the .h5 files is saved as strain, but for geophysics applications we're more interested in the **strain rate**. Write a function that takes as input the strain and a datetime vector and returns the strain **rate**:

$$
Strain\ Rate = \frac{\Delta Strain}{\Delta t}
$$
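
The idea can be tried on made-up numbers first. The sketch below assumes that strain rate means the change in strain between consecutive samples divided by the time step in seconds, and that time runs along the first axis (rows) with one column per channel. Both are assumptions for illustration, and all values and variable names are invented.

```python
import numpy as np
import pandas as pd

# Toy strain: 4 time samples (rows) x 2 channels (columns); values are made up
strain = np.array([[0.0, 10.0],
                   [1.0, 12.0],
                   [3.0, 14.0],
                   [6.0, 16.0]])

# Toy time vector sampled every 2 ms (500 samples per second)
time = pd.date_range('2023-02-17 16:41:33', periods=4, freq='2ms', tz='UTC')

# Time step between consecutive samples, converted to seconds
dt_seconds = (time[1:] - time[:-1]).total_seconds().to_numpy()

# Finite difference along the time axis, divided by the matching time step
strain_rate = np.diff(strain, axis=0) / dt_seconds[:, np.newaxis]
print(strain_rate.shape)
```

Note that differencing N samples gives N - 1 values, so the result has one fewer row than the input and lines up with the intervals between samples rather than with the samples themselves.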

-------
## Task 4

- Clean up your functions.
- Combining the steps and functions that we did in the previous tasks, write a function that takes as input the path to a given DAS .h5 file and returns the strain rate and the time of each measurement as a datetime object converted to MST. 
- Document your functions with a `docstring` and use expressive variable names. 