# ETL

Extract-Transform-Load (ETL) step for the project '**Aftershock pattern prediction based on earthquake rupture data for improved seismic hazard assessment**'. Throughout, DeVries18 refers to the article 'Deep learning of aftershock patterns following large earthquakes' by Phoebe M. R. DeVries, Fernanda Viégas, Martin Wattenberg & Brendan J. Meade, published in Nature in 2018 (https://www.nature.com/articles/s41586-018-0438-y).

## Extract

The data needed for this project were already extracted in the notebook pred_seism_aftXYZ.data_exp for preliminary data exploration.

-  **Mainshock rupture files (SRCMOD)** extracted from [the AfterShock GitHub repository](https://github.com/TheFenrisLycaon/AfterShock/), originally from [the SRCMOD database](http://equake-rc.info/SRCMOD/searchmodels/allevents/), then filtered manually to keep only the 199 mainshocks used in the DeVries18 publication;

-  **Labelled data of the baseline model (complete_dataset_labelled)** imported as a pickle file, created from a list of csv files originally imported from this [Google Drive folder](https://drive.google.com/drive/folders/1c5Rb_6EsuP2XedDjg37bFDyf8AadtGDa).

Here we simply 'extract' (load) all the required data from our working directory.

#### SRCMOD data (mainshock rupture models) 

In [1]:
import os
import fnmatch

SRCMOD_dir = './datasets/SRCMOD/'
SRCMOD_list = fnmatch.filter(os.listdir(SRCMOD_dir), '*.fsp')
SRCMOD_list.sort()
len(SRCMOD_list)

199

#### Labelled dataset of the baseline model

In [2]:
import pandas as pd

df_complete = pd.read_pickle('./Data/complete_dataset_labelled.pkl')
len(df_complete)

6121210

## Transform

### Transform SRCMOD mainshock rupture fsp files into a more manageable list of defaultdict objects

The [fsp format](http://equake-rc.info/SRCMOD/fileformats/fsp/) is quite complex and must be cleaned for feature engineering.

In [3]:
with open(SRCMOD_dir + SRCMOD_list[0]) as file0:
    line = file0.readline()
    while line:
        print(line.strip())
        line = file0.readline()

% ----------------------------------  FINITE-SOURCE RUPTURE MODEL  --------------------------------
%
% Event : Hyuga-nada (Japan) 		04/01/1968 		[Yagi et al. (1998) ]
% EventTAG: s1968HYUGAx01YAGI
%
% Loc  : LAT  = 32.28 		LON = 132.53		DEP = 15.0
% Size : LEN  = 72.0 km 	WID =  63.0 km		Mw = 7.53	Mo = 2.22e+20 Nm
% Mech : STRK = 227.0		DIP = 12.0		RAKE = 90.0	Htop = 10.32 km
% Rupt : HypX = 31.50 km 	HypZ = 22.50 km 		avTr = 12.0 s	avVr = 2.8 km/s
%
% ----------------------------------  inversion-related parameters  --------------------------------
%
% Invs :  Nx  =  8 	Nz  = 7 	Fmin = 999.00 Hz 	Fmax = 999.00 Hz
% Invs :  Dx  =  9.00 km 	Dz  = 9.00 km
% Invs :  Ntw =  1	Nsg =  1 			(# of time-windows,# of fault segments)
% Invs :  LEN =  999.0 s	SHF =  0.0 s 		(time-window length and time-shift)
% SVF  :  unknown 					(type of slip-velocity function used)
%
% Data : 	SGM	TELE	TRIL	LEVEL	GPS	INSAR	SURF	OTHER
% Data : 	999	0	0	0	0	0	0	0
% PHImx: 	999	0	0	0	0	0	0	0
% Rmin : 	999	0	0	0	

Fortunately, we can use the srcmod.py program, which reads fsp files and transforms them into the format (originally) used as input for Coulomb stress modelling. The original code is available in [Google's stress_transfer GitHub repository](https://github.com/google/stress_transfer/tree/master/stress_transfer).
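To give a feel for what such a reader has to do, here is a minimal sketch that pulls 'NAME = value' pairs out of two header lines of the file printed above. It reuses the regular expression that srcmod.py defines as FIELDS_RE; the `header` string and the `fields` dictionary are illustrative names only, and tabs in the header have been replaced by spaces.

```python
import re

# Same 'NAME = number' pattern that srcmod.py defines as FIELDS_RE.
FIELDS_RE = re.compile(r"\w+\s+=\s+\-?\d+\.?\d*[eE]?[\+\-]?\d*")

# Two header lines copied from the fsp file printed above (tabs -> spaces).
header = """
% Loc  : LAT  = 32.28   LON = 132.53   DEP = 15.0
% Size : LEN  = 72.0 km   WID =  63.0 km   Mw = 7.53   Mo = 2.22e+20 Nm
"""

fields = {}
for match in FIELDS_RE.findall(header):
    name, value = match.split("=")
    # Upper-case the name and cast the value to float, as the reader does.
    fields[name.strip().upper()] = float(value.strip())

print(fields)
```

Units such as 'km' or 'Nm' are not captured by the pattern, so only the numeric part of each field is kept.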

Although we won't do any Coulomb stress modelling, using the same input format will simplify feature engineering in the next step of the process model. Note that the original code was written for Python 2. For Python 3, we made the following modifications:

- 'print x' changed to 'print(x)';
- '.has_key' changed to 'in';
- the gcs module removed, with 'gcs.File(filename)' changed to 'open(filename, 'r')';
- the proj UTM source changed.

The updated code srcmod.py is available in this project's GitHub repository under utils.
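As an illustration of the first two changes, the sketch below shows the Python 2 idioms as comments next to their Python 3 equivalents. The dictionary content is only a toy example (values taken from the header printed earlier), not part of srcmod.py.

```python
# Toy header fields; the real reader fills such a dictionary from the fsp file.
fields = {"STRK": 227.0, "DIP": 12.0}

# Python 2: if fields.has_key("STRIKE"): ...
has_segment_strike = "STRIKE" in fields

# Python 2: print "segment strike present:", has_segment_strike
print("segment strike present:", has_segment_strike)
```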

Once srcmod.py is in our working directory, we load and run it; its full source is reproduced in the next cell. We then test it on one SRCMOD file and see that ReadSrcmodFile returns a dictionary-like defaultdict object:

In [8]:
# Copyright (c) 2015 Google, Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of
# this software and associated documentation files (the "Software"), to deal in
# the Software without restriction, including without limitation the rights to
# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
# the Software, and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
# FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

# Modified for Python 3 by Rishabh Anand, 2022.
# 'print x' changed to 'print(x)'
# '.has_key' changed to 'in'
# gcs module removed with 'gcs.File(filename)' changed to 'open(filename, 'r')'

"""Reads Srcmod data from .fsp files.

The original srcmod reader was a little fragile, failing to read in many of the
.mat files. ReadSrcmodFile fixes this -- it reads in the raw ascii from a .fsp
file, and returns the data in the same format. There are some small differences
in the data, mostly due to the fact that .mat files are 64-bit, and we read
in the ascii, and convert to floats. As such, there are some small differences in
the read in values, but this routine does work with all the data.
"""


import collections
import datetime
import logging
import math
import re

import numpy as np
import pyproj
import utm

# Regular expressions that will parse the text Srcmod files.
# TAGS are of the form: 'xxx : yyy zzz'
TAGS_RE = re.compile(r"(\w+\s*:\s*(?:\S+ ?)+)")

# FIELDS are of the form: 'xxxx = float'
FIELDS_RE = re.compile(r"\w+\s+=\s+\-?\d+\.?\d*[eE]?[\+\-]?\d*")

# DATES are of the form: 'nn/nn/nn'
DATE_RE = re.compile(r"\d+/\d+/\d+")

# DATA fields within a segment begin with '% LAT LON'
DATA_FIELDS_RE = re.compile(r"%\s+LAT\s+LON")

# Maps tags between what's given in the srcmod file, and the output fields we
# use.
TAG_MAP = [
    ("EVENTTAG", "tag"),
    ("EVENT", "description"),
]

# There are a number of data fields from the header of a Srcmod file that are
# directly copied over into the output of the file reader. This is an array of
# the tuples where:
#
#     (INPUT_NAME, OUTPUT_NAME)
FIELD_MAP = [
    ("LAT", "epicenterLatitude"),
    ("LON", "epicenterLongitude"),
    ("DEP", "depth"),
    ("MW", "magnitude"),
    ("MO", "moment"),
]

# Constants to do some conversions.
KM2M = 1e3  # Convert kilometers to meters
CM2M = 1e-2  # Convert centimeters to meters


def _FindFields(data, opt_ignore_duplicate=True):
    """Finds all 'FIELD = VAL' in given string.

    Args:
      data: String of data to search for.
      opt_ignore_duplicate: We have two options if we encounter a named field more
        than once: we can ignore the duplicate, or we can take the new value. By
        default, we will ignore the duplicate fields.
    Returns:
      Dictionaries 'field': 'val' where 'val' has been cast to float. NB: unless
      specified, only the first field found is specified.
    """
    # Extract the fields from the data.
    fields = {}
    for field in FIELDS_RE.findall(data):
        name, val = field.split("=")
        name = name.strip().upper()
        # Take the FIRST value seen.
        if not opt_ignore_duplicate or name not in fields:
            fields[name] = float(val.strip())
    return fields


def _SeparateSegments(num_segments, fields, data):
    """Pulls the segments out of the data.

    Depending on if the srcmod file is a multi or single segment file, this
    function will find the segment separator, and return the separated segment
    data.

    A single segment file looks like:

      % SRCMOD HEADER
      % SOURCE MODEL PARAMETERS
      %     [ SEGMENT_HEADER ]
      data

    A multi-segment file will look like:

      % SRCMOD HEADER
      % SEGMENT
      %     [ SEGMENT_HEADER ]
      data

      [.... num_segments ....]

      % SEGMENT
      %     [ SEGMENT_HEADER ]
      data

    Args:
      num_segments: The number of segments in the data.
      fields: The header of the srcmod file.
      data: The data (as a string) of the srcmod file.

    Returns:
      Tuple of (segments, segment_fields)
        segments: Array of all the segment data (as strings).
        segment_fields: The fields that have been stripped from the segment
          headers.
    """
    # Set up the segment data.
    if num_segments > 1:
        delimeter = "% SEGMENT"
        assert delimeter in data
        segments = [delimeter + _ for _ in data.split(delimeter)[1:]]
        segment_fields = [_FindFields(seg) for seg in segments]
    else:
        delimeter = "% SOURCE MODEL PARAMETERS"
        assert delimeter in data
        segments = [delimeter + _ for _ in data.split(delimeter)[1:]]
        segment_fields = [fields]
    assert len(segments) == num_segments
    assert len(segment_fields) == num_segments
    return segments, segment_fields


def _GetSegmentData(data):
    """Given a segment of data, we parse it into the appropriate fields.

    Args:
      data: String that contains all the characters in a segment's worth of data.
    Returns:
      List of lists of dictionaries.
    """
    ret = []
    rows = []
    names = []
    last_z = None
    for line in data.split("\n"):
        if not line:
            continue  # Skip blank lines
        if DATA_FIELDS_RE.match(line):  # Find field names
            # We extract the names of the fields.
            # The field names will be in a string of the following form:
            #
            #     '%     F1   F2    F3==X     Z'
            #
            # First we split up the string by removing all spaces, discard the first
            # one ('%'), and then we remove any pieces after and including '=' in the
            # field name. NB: The last row must be a 'Z'
            names = [x.upper() for x in line.split()[1:]]
            names = [x.split("=")[0] if "=" in x else x for x in names]
        if line[0] == "%":  # Skip comment lines.
            continue
        else:
            # Make a dict of our values.
            val = {n: float(v) for n, v in zip(names, line.split())}
            assert -180.0 <= val["LON"] <= 180.0
            assert -90.0 <= val["LAT"] <= 90.0

            # If the z value we've just read in doesn't equal the last z value we've
            # read in, we have a new row. We then save off the row we've read so far
            # before adding the new value to the rows.
            if last_z is not None and val["Z"] != last_z:
                ret.append(rows)
                assert len(ret[0]) == len(ret[-1])  # Is same length as previous?
                rows = []
            rows.append(val)
            last_z = val["Z"]
    if rows:
        ret.append(rows)
    assert len(ret[0]) == len(ret[-1])  # Is same length as previous?
    return ret


def ReadSrcmodFile(filename):
    """Reads a Srcmod file.

    Args:
      filename: Full path to Srcmod file.
    Returns:
      List of dictionaries. Each dictionary is a single segment of the fault.
    """
    logging.info("Reading SRCMOD file: %s", filename)

    src_mod = collections.defaultdict(list)
    with open(filename, "r") as f:
        data = f.read()
        # Read the date.
        date = DATE_RE.search(data).group(0)
        src_mod["date"] = date
        src_mod["datetime"] = datetime.datetime.strptime(date, "%m/%d/%Y")

        # Extract tags
        tags = {}
        for tag in TAGS_RE.findall(data):
            name, val = tag.split(":")
            tags[name.strip().upper()] = val.strip()

        # Remap tags to src_mod output.
        for in_name, out_name in TAG_MAP:
            if in_name not in tags:
                print("error", in_name, tags)
                continue
            src_mod[out_name] = tags[in_name]

        # Find fields, and remap them to src_mod output.
        fields = _FindFields(data)
        for in_name, out_name in FIELD_MAP:
            if in_name not in fields:
                print("error", in_name, fields)
                continue
            src_mod[out_name] = fields[in_name]

        # Calculate some epicenter projection stuff.
        _, _, number, letter = utm.from_latlon(
            src_mod["epicenterLatitude"], src_mod["epicenterLongitude"]
        )
        src_mod["zoneNumber"] = number
        src_mod["zoneLetter"] = letter
        proj = pyproj.Proj(
            proj="utm", zone="{}".format(number), ellps="WGS84"
        )
        src_mod["projEpicenter"] = proj
        src_mod["epicenterXUtm"], src_mod["epicenterYUtm"] = proj(
            src_mod["epicenterLongitude"], src_mod["epicenterLatitude"]
        )

        # Set up the segment data.
        num_segments = int(fields["NSG"])
        segments, segment_fields = _SeparateSegments(num_segments, fields, data)

        # Loop through the segments.
        for i in range(num_segments):
            if "STRIKE" in segment_fields[i]:
                seg_strike = segment_fields[i]["STRIKE"]
            else:
                seg_strike = fields["STRK"]
            angle = -(seg_strike - 90)
            if angle < 0:
                angle += 360

            data = _GetSegmentData(segments[i])
            if len(data) == 1:
                continue  # Skip short segments.

            # Calculate the length and width of individual patch elements in current
            # panel.
            length = segment_fields[i].get("DX", fields["DX"])
            if "LEN" in segment_fields[i]:
                width = segment_fields[i]["LEN"] / len(data)
            else:
                width = fields["DZ"]

            # Calculate the geometric coordinates of the segments.
            #
            # In the following code, we convert the srcmod data into a format we use
            # for our Coulomb stress calculations. Specifically, we take the srcmod
            # data and remap the geometry into a form we need. The original srcmod
            # data looks like:
            #
            #               v this coordinate is the x,y,z data point.
            #       +-------*--------+
            #       |                |
            #       |                |
            #       +----------------+
            #
            # The original srcmod data is also along a x,y,z coordinate system where
            # the Z vector is projected from the core of the earth. We need to
            # decompose the data (using the strikeslip and dipslip[*]) of the fault.
            #
            # The first thing we do is find the offsets between the x/y coordinates --
            # specifically, [xy]_top_offset and [xyz]_top_bottom_offset. We calculate
            # these values as follows:
            #
            #   [xy]_top_offset is calculated by assuming the fault patches are
            #     uniformly spaced, and sized on a given segment. Given this, and
            #     the length and angle of the fault, we calculate the offsets as the
            #     length rotated about the angle.
            #   [xyz]_top_bottom_offsets are calculated by (again assuming uniform
            #     patch size) taking the difference between two [xyz] coordinates.
            #
            # We remap the coordinates into the following format:
            #
            #       <---------------->  x_top_offset * 2
            #       |                |
            #
            # xyz1  +----------------+ xyz2  --^
            #       |                |         |  x_top_bottom_offset
            #       |                |         |
            # xyz3  +----------------+ xyz4  --v
            #
            # We do this remapping with a number of different transforms for x, y, and
            # z.
            #
            # [*] strikeslip is the angle the fault, and slip as the two plates move
            # laterally across each other. dipslip is the angle of the fault as the
            # two plates move under/over each other.
            rot = np.array(
                [
                    [math.cos(math.radians(angle)), -math.sin(math.radians(angle))],
                    [math.sin(math.radians(angle)), math.cos(math.radians(angle))],
                ]
            )
            x_orig = np.array([[length / 2.0], [0.0]])
            x_rot = np.dot(rot, x_orig)
            x_top_offset = x_rot[0] * KM2M
            y_top_offset = x_rot[1] * KM2M
            x_top_bottom_offset = (data[1][0]["X"] - data[0][0]["X"]) * KM2M
            y_top_bottom_offset = (data[1][0]["Y"] - data[0][0]["Y"]) * KM2M
            z_top_bottom_offset = (data[1][0]["Z"] - data[0][0]["Z"]) * KM2M

            # Loops over the down-dip and along-strike patches of the current panel
            for dip in range(0, len(data)):
                for strike in range(0, len(data[0])):
                    # Extract top center coordinates of current patch
                    x_top_center = data[dip][strike]["X"] * KM2M
                    y_top_center = data[dip][strike]["Y"] * KM2M
                    z_top_center = data[dip][strike]["Z"] * KM2M
                    src_mod["patchLongitude"].append(data[dip][strike]["LON"])
                    src_mod["patchLatitude"].append(data[dip][strike]["LAT"])

                    # Calculate location of top corners and convert from km to m
                    src_mod["x1"].append(x_top_center + x_top_offset)
                    src_mod["y1"].append(y_top_center + y_top_offset)
                    src_mod["z1"].append(z_top_center)
                    src_mod["x2"].append(x_top_center - x_top_offset)
                    src_mod["y2"].append(y_top_center - y_top_offset)
                    src_mod["z2"].append(z_top_center)

                    # Calculate location of bottom corners and convert from km to m
                    src_mod["x3"].append(
                        x_top_center + x_top_bottom_offset + x_top_offset
                    )
                    src_mod["y3"].append(
                        y_top_center + y_top_bottom_offset + y_top_offset
                    )
                    src_mod["z3"].append(z_top_center + z_top_bottom_offset)
                    src_mod["x4"].append(
                        x_top_center + x_top_bottom_offset - x_top_offset
                    )
                    src_mod["y4"].append(
                        y_top_center + y_top_bottom_offset - y_top_offset
                    )
                    src_mod["z4"].append(z_top_center + z_top_bottom_offset)

                    # Create UTM version of the same
                    x_top_center_utm, y_top_center_utm = proj(
                        src_mod["patchLongitude"][-1], src_mod["patchLatitude"][-1]
                    )
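                    # NB: plain assignment (not append), so these two keys keep only
                    # the value of the last patch processed.
                    src_mod["patchXUtm"] = x_top_center_utm
                    src_mod["patchYUtm"] = y_top_center_utm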
                    src_mod["x1Utm"].append(x_top_center_utm + x_top_offset)
                    src_mod["y1Utm"].append(y_top_center_utm + y_top_offset)
                    src_mod["z1Utm"].append(z_top_center)
                    src_mod["x2Utm"].append(x_top_center_utm - x_top_offset)
                    src_mod["y2Utm"].append(y_top_center_utm - y_top_offset)
                    src_mod["z2Utm"].append(z_top_center)
                    src_mod["x3Utm"].append(
                        x_top_center_utm + (x_top_bottom_offset + x_top_offset)
                    )
                    src_mod["y3Utm"].append(
                        y_top_center_utm + (y_top_bottom_offset + y_top_offset)
                    )
                    src_mod["z3Utm"].append(z_top_center + z_top_bottom_offset)
                    src_mod["x4Utm"].append(
                        x_top_center_utm + (x_top_bottom_offset - x_top_offset)
                    )
                    src_mod["y4Utm"].append(
                        y_top_center_utm + (y_top_bottom_offset - y_top_offset)
                    )
                    src_mod["z4Utm"].append(z_top_center + z_top_bottom_offset)

                    # Extract patch dip, strike, width, and length
                    # NB: dipMean and strikeMean are not length weighted
                    src_mod["dip"].append(segment_fields[i]["DIP"])
                    src_mod["strike"].append(seg_strike)
                    src_mod["dipMean"] = np.mean(np.array(src_mod["dip"]))
                    src_mod["strikeMean"] = np.mean(np.array(src_mod["strike"]))
                    src_mod["rake"].append(data[dip][strike].get("RAKE", 0))
                    src_mod["angle"].append(angle)
                    src_mod["width"].append(KM2M * width)
                    src_mod["length"].append(KM2M * length)

                    # Extract fault slip
                    src_mod["slip"].append(data[dip][strike]["SLIP"])
                    rot = np.array(
                        [
                            [
                                math.cos(math.radians(src_mod["rake"][-1])),
                                -math.sin(math.radians(src_mod["rake"][-1])),
                            ],
                            [
                                math.sin(math.radians(src_mod["rake"][-1])),
                                math.cos(math.radians(src_mod["rake"][-1])),
                            ],
                        ]
                    )
                    x_orig = np.array([[src_mod["slip"][-1]], [0]])
                    x_rot = np.dot(rot, x_orig)
                    src_mod["slipStrike"].append(x_rot[0])
                    src_mod["slipDip"].append(x_rot[1])

    # Check that our dips and strikes are within proper ranges.
    for dip in src_mod["dip"]:
        assert -180.0 <= dip <= 180.0
    for strike in src_mod["strike"]:
        assert 0.0 <= strike <= 360.0

    logging.info("Done reading SRCMOD file %s", filename)

    return src_mod


In [9]:
SRCMOD0_refmt = ReadSrcmodFile(SRCMOD_dir + SRCMOD_list[0])
SRCMOD0_refmt

defaultdict(list,
            {'date': '04/01/1968',
             'datetime': datetime.datetime(1968, 4, 1, 0, 0),
             'tag': 's1968HYUGAx01YAGI',
             'description': 'Hyuga-nada (Japan)',
             'epicenterLatitude': 32.28,
             'epicenterLongitude': 132.53,
             'depth': 15.0,
             'magnitude': 7.53,
             'moment': 2.22e+20,
             'zoneNumber': 53,
             'zoneLetter': 'S',
             'projEpicenter': <Other Coordinate Operation Transformer: utm>
             Description: PROJ-based coordinate operation
             Area of Use:
             - undefined,
             'epicenterXUtm': 267375.8991816583,
             'epicenterYUtm': 3574151.1392670088,
             'patchLongitude': [132.9,
              132.8299,
              132.7598,
              132.6898,
              132.6197,
              132.5496,
              132.4796,
              132.4095,
              132.8361,
              132.766,
              1

In [10]:
SRCMOD0_refmt.keys()

dict_keys(['date', 'datetime', 'tag', 'description', 'epicenterLatitude', 'epicenterLongitude', 'depth', 'magnitude', 'moment', 'zoneNumber', 'zoneLetter', 'projEpicenter', 'epicenterXUtm', 'epicenterYUtm', 'patchLongitude', 'patchLatitude', 'x1', 'y1', 'z1', 'x2', 'y2', 'z2', 'x3', 'y3', 'z3', 'x4', 'y4', 'z4', 'patchXUtm', 'patchYUtm', 'x1Utm', 'y1Utm', 'z1Utm', 'x2Utm', 'y2Utm', 'z2Utm', 'x3Utm', 'y3Utm', 'z3Utm', 'x4Utm', 'y4Utm', 'z4Utm', 'dip', 'strike', 'dipMean', 'strikeMean', 'rake', 'angle', 'width', 'length', 'slip', 'slipStrike', 'slipDip'])
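The 'slipStrike' and 'slipDip' entries come from rotating the vector (slip, 0) by the rake angle, which reduces to a cosine and a sine. The following minimal sketch reproduces that arithmetic with the standard library only; `decompose_slip` is a hypothetical helper, and the slip value of 2.0 is made up (the rake of 90 is the one from the header shown earlier).

```python
import math


def decompose_slip(slip, rake_deg):
    """Split slip into along-strike and along-dip parts (hypothetical helper)."""
    rake = math.radians(rake_deg)
    # Rotating (slip, 0) by the rake gives (slip*cos(rake), slip*sin(rake)).
    return slip * math.cos(rake), slip * math.sin(rake)


# Made-up slip amplitude; rake = 90 corresponds to pure dip slip.
slip_strike, slip_dip = decompose_slip(2.0, 90.0)
print(slip_strike, slip_dip)
```

Note that in ReadSrcmodFile the two components are taken from a 2x1 NumPy array, so each stored 'slipStrike' and 'slipDip' element is a one-element array rather than a plain float.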

The defaultdict objects are far more manageable than the fsp file format, and we will use them in the next step of the process model to define new features based on the SRCMOD rupture parameters (geometry and kinematics). First, the structure of the dictionary created by ReadSrcmodFile is further simplified to use cell centers instead of 4 corners per cell:

In [11]:
from collections import defaultdict

# Create list of SRCMOD dictionaries
SRCMOD_dictList = []

# Create geometric/kinematic parameters to be used as input for new feature definitions
# Use UTM coordinate system to match both datasets (SRCMOD and DeVries18 files);
# Use rupture's center of cells (x0,y0,z0) for future operations
for i in SRCMOD_list:
    SRCMODi_dict_orig = ReadSrcmodFile(SRCMOD_dir + i)   #read each mainshock rupture fsp file and transform into defaultdict
    # simplify cell corners [[x1,y1,z1], [x2,y2,z2], [x3,y3,z3], [x4,y4,z4]] to cell center [x0,y0,z0]
    x1_flt = np.asarray([y for x in SRCMODi_dict_orig['x1Utm'] for y in x])
    x2_flt = np.asarray([y for x in SRCMODi_dict_orig['x2Utm'] for y in x])
    x0_flt = x1_flt + (x2_flt - x1_flt)*.5
    y1_flt = np.asarray([y for x in SRCMODi_dict_orig['y1Utm'] for y in x])
    y2_flt = np.asarray([y for x in SRCMODi_dict_orig['y2Utm'] for y in x])
    y0_flt = y1_flt + (y2_flt - y1_flt)*.5
    z1_flt = np.asarray(SRCMODi_dict_orig['z1Utm'])
    z3_flt = np.asarray(SRCMODi_dict_orig['z3Utm'])
    z0_flt = z1_flt + (z3_flt - z1_flt)*.5
    SRCMODi_dict = defaultdict()
    #general info (all cells together)
    SRCMODi_dict['ID'] = SRCMODi_dict_orig['tag']
    SRCMODi_dict['epicenterXUtm'] = SRCMODi_dict_orig['epicenterXUtm']
    SRCMODi_dict['epicenterYUtm'] = SRCMODi_dict_orig['epicenterYUtm']
    SRCMODi_dict['moment'] = SRCMODi_dict_orig['moment']
    SRCMODi_dict['magnitude'] = SRCMODi_dict_orig['magnitude']
    SRCMODi_dict['strikeMean'] = SRCMODi_dict_orig['strikeMean']
    SRCMODi_dict['dipMean'] = SRCMODi_dict_orig['dipMean']
    #info per rupture cell
    SRCMODi_dict['x'] = x0_flt
    SRCMODi_dict['y'] = y0_flt
    SRCMODi_dict['z'] = -z0_flt     #same depth convention as DeVries18 
    # other parameters of interest
    SRCMODi_dict['width'] = SRCMODi_dict_orig['width']
    SRCMODi_dict['length'] = SRCMODi_dict_orig['length']
    SRCMODi_dict['slip'] = SRCMODi_dict_orig['slip']
    SRCMODi_dict['dip'] = SRCMODi_dict_orig['dip']
    SRCMODi_dict['strike'] = SRCMODi_dict_orig['strike']
    SRCMODi_dict['rake'] = SRCMODi_dict_orig['rake']
    SRCMODi_dict['slipStrike'] = SRCMODi_dict_orig['slipStrike']
    SRCMODi_dict['slipDip'] = SRCMODi_dict_orig['slipDip']
    SRCMOD_dictList.append(SRCMODi_dict)
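The sketch below walks through the geometry for a single patch, under stated assumptions: it rebuilds the two top corners from a top-centre point using the same strike-to-angle convention as ReadSrcmodFile, then applies the midpoint formulas used in the cell above. `top_corners` is a hypothetical helper, the coordinates and depths are made up, and depths are assumed positive down before the sign flip.

```python
import math

KM2M = 1e3  # kilometres to metres, as in srcmod.py


def top_corners(x_c, y_c, strike_deg, length_km):
    """Top-edge corners of a patch from its top centre (hypothetical helper)."""
    # Same strike-to-angle convention as ReadSrcmodFile: 90 - strike, in [0, 360).
    angle = math.radians((90.0 - strike_deg) % 360.0)
    dx = 0.5 * length_km * math.cos(angle) * KM2M
    dy = 0.5 * length_km * math.sin(angle) * KM2M
    return (x_c + dx, y_c + dy), (x_c - dx, y_c - dy)


# Made-up top centre (m); strike and patch length from the header shown earlier.
x_c, y_c = 250000.0, 3500000.0
(x1, y1), (x2, y2) = top_corners(x_c, y_c, strike_deg=227.0, length_km=9.0)

# Midpoint of the two top corners, written as in the cell above.
x0 = x1 + (x2 - x1) * 0.5
y0 = y1 + (y2 - y1) * 0.5

# Made-up top and bottom depths (m); the sign is flipped at the end.
z1, z3 = 10000.0, 12000.0
z = -(z1 + (z3 - z1) * 0.5)

print(x0, y0, z)
```

Because corners 1 and 2 both lie on the top edge, the horizontal midpoint (x0, y0) coincides with the top centre of the patch, while z is taken halfway between the top and bottom edges.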

### No transformation needed for the labelled dataset of the baseline model

We will directly use the dataframe df_complete (the DeVries18 labelled dataset loaded above) for feature engineering.

## Load

Two objects will be used for feature definition:
-  **SRCMOD_dictList**, a list of dictionaries, which contains all required rupture parameters per mainshock 'ID' (saved as SRCMOD_cleaned.pkl);
-  **df_complete**, a dataframe, which contains the labelled dataset of the DeVries18 baseline model and where mainshocks are identified by 'ID' (already saved as complete_dataset_labelled.pkl).
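Since both objects share the 'ID' key, they can be matched per mainshock. The following is only a sketch of that idea with toy stand-ins: plain dictionaries replace the real rupture dictionaries and dataframe rows, the 'n_cells' field and the 'aftershocksyn' values are made up, and the variable names are illustrative.

```python
from collections import defaultdict

# Toy stand-ins for SRCMOD_dictList ('n_cells' is a made-up field).
srcmod_list = [
    {"ID": "s1968HYUGAx01YAGI", "n_cells": 56},
    {"ID": "s1968TOKACH01NAGA", "n_cells": 4},
]

# Toy stand-ins for rows of the labelled dataset (labels are made up).
rows = [
    {"ID": "s1968HYUGAx01YAGI", "aftershocksyn": 0.0},
    {"ID": "s1968HYUGAx01YAGI", "aftershocksyn": 1.0},
    {"ID": "s1968TOKACH01NAGA", "aftershocksyn": 0.0},
]

# Index the rupture dictionaries by mainshock ID for direct lookup.
srcmod_by_id = {d["ID"]: d for d in srcmod_list}

# Group the labelled rows by the same ID.
rows_by_id = defaultdict(list)
for row in rows:
    rows_by_id[row["ID"]].append(row)

for event_id, event_rows in rows_by_id.items():
    rupture = srcmod_by_id[event_id]
    print(event_id, rupture["n_cells"], len(event_rows))
```

With the actual dataframe, a pandas `groupby('ID')` would play the role of `rows_by_id`.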

### SRCMOD mainshock rupture data as list of dictionaries

In [12]:
import pickle

# save the new list of dictionaries in file
with open('./Data/SRCMOD_cleaned.pkl', 'wb') as output:
    pickle.dump(SRCMOD_dictList, output)

#### Check that SRCMOD_dictList can be read back

In [13]:
SRCMOD_dictList_check = pd.read_pickle('./Data/SRCMOD_cleaned.pkl')

# print the mainshock ID of every dictionary read back from the pickle file
for srcmod_dict in SRCMOD_dictList_check:
    print(srcmod_dict['ID'])

s1968HYUGAx01YAGI
s1968TOKACH01NAGA
s1969GIFUxK01TAKE
s1971SANFER01HEAT
s1974IZUxHA01TAKE
s1974PERUCE01HART
s1978MIYAGI01YAMA
s1978TABASI01HART
s1979COYOTE01LIUx
s1979IMPERI01ARCH
s1979IMPERI01HART
s1979IMPERI01OLSO
s1979PETATL01MEND
s1980IZUxHA01TAKE
s1981PLAYAA01MEND
s1982NEWBRU01HART
s1983BORAHP01MEND
s1983JAPANE01FUKU
s1984MORGAN01HART
s1984NAGANO01TAKE
s1985CENTRA01MEND
s1985MICHOA01MEND
s1985NAHANN01HART
s1985NAHANN02HART
s1985ZIHUAT01MEND
s1986NORTHP01HART
s1986NORTHP01MEND
s1987ELMORE01LARS
s1987SUPERS01LARS
s1987SUPERS01WALD
s1987WHITTI01HART
s1988SAGUEN01HART
s1989LOMAPR01EMOL
s1989LOMAPR01STEI
s1989LOMAPR01WALD
s1991SIERRA01WALD
s1992JOSHUA01HOUG
s1992LANDER01COHE
s1992LANDER01COTT
s1992LANDER01HERN
s1992LANDER01WALD
s1992LANDER01ZENG
s1992LITTLE01SILV
s1993HOKKAI01MEND
s1994NORTHR01HART
s1994NORTHR01HUDN
s1994NORTHR01SHEN
s1994NORTHR01WALD
s1994SANRIK01NAGA
s1994SANRIK01NAKA
s1995COLIMA01MEND
s1995COPALA01COUR
s1995KOBEJA01CHOx
s1995KOBEJA01HORI
s1995KOBEJA01IDEx
s1995KOBEJ

In [14]:
dict_tmp = SRCMOD_dictList_check[0]
dict_tmp.keys()

dict_keys(['ID', 'epicenterXUtm', 'epicenterYUtm', 'moment', 'magnitude', 'strikeMean', 'dipMean', 'x', 'y', 'z', 'width', 'length', 'slip', 'dip', 'strike', 'rake', 'slipStrike', 'slipDip'])
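Beyond printing IDs and keys, the read-back could also be checked programmatically. The sketch below shows one possible way on a toy list written to a temporary file: it compares the restored object with the original and verifies that the per-cell entries have equal lengths. `check_per_cell_lengths` and `PER_CELL_KEYS` are hypothetical names, and plain lists are used instead of NumPy arrays so that `==` compares the whole structure directly.

```python
import os
import pickle
import tempfile

# Subset of the per-cell keys listed above (chosen for illustration).
PER_CELL_KEYS = ["x", "y", "z", "slip"]


def check_per_cell_lengths(rupture):
    """Return True if all per-cell entries have the same length (hypothetical helper)."""
    lengths = {len(rupture[key]) for key in PER_CELL_KEYS}
    return len(lengths) == 1


# Toy rupture with two cells; all values are made up.
toy_list = [
    {"ID": "toy_event", "x": [0.0, 1.0], "y": [0.0, 1.0],
     "z": [-1.0, -1.0], "slip": [0.5, 0.7]},
]

# Write to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    with open(path, "wb") as out:
        pickle.dump(toy_list, out)
    with open(path, "rb") as src:
        restored = pickle.load(src)

round_trip_ok = restored == toy_list
lengths_ok = all(check_per_cell_lengths(r) for r in restored)
print(round_trip_ok, lengths_ok)
```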

### Labelled dataset of baseline model as pandas dataframe

In [15]:
df_complete.columns

Index(['x', 'y', 'z', 'stresses_full_xx', 'stresses_full_xy',
       'stresses_full_yy', 'stresses_full_xz', 'stresses_full_yz',
       'stresses_full_zz', 'stresses_full_max_shear', 'stresses_full_cfs_1',
       'stresses_full_cfs_2', 'stresses_full_cfs_3', 'stresses_full_cfs_4',
       'von_mises', 'aftershocksyn', 'ID'],
      dtype='object')

In [16]:
df_complete.head(5)

| | x | y | z | stresses_full_xx | stresses_full_xy | stresses_full_yy | stresses_full_xz | stresses_full_yz | stresses_full_zz | stresses_full_max_shear | stresses_full_cfs_1 | stresses_full_cfs_2 | stresses_full_cfs_3 | stresses_full_cfs_4 | von_mises | aftershocksyn | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 228050.836661 | 3438285.0 | -2500.0 | -2643.095958 | -303.314853 | -8306.19013 | -62.011213 | -591.513951 | -23.523893 | 4191.904227 | 1278.43967 | -1609.139444 | 1609.139444 | -1278.43967 | 7423.266049 | 0.0 | s1968HYUGAx01YAGI |
| 1 | 233050.836661 | 3438285.0 | -2500.0 | -2506.143012 | 33.487142 | -8293.83754 | -40.27233 | -619.872533 | -25.140746 | 4180.998624 | 1353.632905 | -1706.330567 | 1706.330567 | -1353.632905 | 7427.85427 | 0.0 | s1968HYUGAx01YAGI |
| 2 | 238050.836661 | 3438285.0 | -2500.0 | -2390.411001 | 381.016088 | -8149.157556 | -16.212656 | -642.672511 | -26.597214 | 4124.540863 | 1417.713571 | -1789.742291 | 1789.742291 | -1417.713571 | 7351.107005 | 0.0 | s1968HYUGAx01YAGI |
| 3 | 243050.836661 | 3438285.0 | -2500.0 | -2299.36227 | 723.248609 | -7868.650242 | 9.526966 | -659.03552 | -27.846052 | 4021.524131 | 1466.875844 | -1854.619572 | 1854.619572 | -1466.875844 | 7190.271468 | 0.0 | s1968HYUGAx01YAGI |
| 4 | 248050.836661 | 3438285.0 | -2500.0 | -2233.960345 | 1043.357897 | -7454.950559 | 36.174998 | -668.289344 | -28.844615 | 3872.648486 | 1497.602675 | -1896.565287 | 1896.565287 | -1497.602675 | 6945.768984 | 0.0 | s1968HYUGAx01YAGI |
