# Earthquake Events Data Processing

This notebook reads earthquake event data from the Silver layer, enriches it with a country name and a significance class, and writes the result to the Gold layer.

## Configuration

Define the paths for Silver and Gold data in Azure Data Lake Storage (ADLS).

In [0]:
from datetime import date, timedelta

# Remove this before running Data Factory Pipeline
start_date = date.today() - timedelta(days=1)

# Define ADLS paths for Silver and Gold data
silver_adls = "abfss://silver@earthquakedatadb.dfs.core.windows.net/"
gold_adls = "abfss://gold@earthquakedatadb.dfs.core.windows.net/"

# Define the path for Silver earthquake events data
silver_data = f"{silver_adls}earthquake_events_silver/"
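The paths above are built by plain string concatenation, so the result depends on which pieces carry a slash. As a minimal sketch, a small helper can normalize the separators; `adls_path` is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
def adls_path(base: str, *parts: str) -> str:
    """Join a container URI and path segments with exactly one slash between them."""
    # Strip the trailing slash from the base and surrounding slashes from each part;
    # empty segments are skipped so they cannot produce a double slash.
    segments = [base.rstrip("/")] + [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(segments)


# Hypothetical usage with the Silver container URI from the configuration cell
silver_base = "abfss://silver@earthquakedatadb.dfs.core.windows.net/"
print(adls_path(silver_base, "earthquake_events_silver"))
```

The helper returns the path without a trailing slash, which Spark readers and writers generally accept for directory paths.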

## Import Libraries

Import necessary libraries for data processing and geocoding.

In [0]:
from pyspark.sql.functions import when, col, udf
from pyspark.sql.types import StructType, StructField, StringType
# Ensure reverse_geocoder and pycountry are installed on your cluster
import reverse_geocoder as rg
import pycountry

## Load and Filter Data

Load the Silver data and filter it based on the start date.

In [0]:
df = spark.read.parquet(silver_data).filter(col('time') > start_date)
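The filter above depends on `start_date`, which the configuration cell sets to yesterday and marks for removal before a Data Factory pipeline run. The notebook does not show how the pipeline supplies the value, so the following is only a sketch under the assumption that it arrives as an ISO `YYYY-MM-DD` string; `resolve_start_date` is a hypothetical helper, not part of the source.

```python
from datetime import date, timedelta
from typing import Optional


def resolve_start_date(raw: Optional[str], today: Optional[date] = None) -> date:
    """Return the parsed ISO date, or yesterday when no value is supplied."""
    if raw:
        # Assumption: the pipeline passes the date as 'YYYY-MM-DD'
        return date.fromisoformat(raw.strip())
    today = today or date.today()
    return today - timedelta(days=1)


# With an explicit value, the parsed date is used
print(resolve_start_date("2025-04-19"))
# Without one, the helper falls back to the day before `today`
print(resolve_start_date(None, today=date(2025, 4, 20)))
```

Passing `today` explicitly keeps the fallback deterministic, which makes the helper easy to check outside the cluster.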

In [0]:
# Cap the DataFrame at 100 rows (presumably for interactive testing; remove for full runs)
df = df.limit(100)

## Define UDF for Retrieving Country Name

Define a user-defined function (UDF) that uses reverse geocoding to retrieve the country name for a given latitude and longitude.

In [0]:
def get_country_name(lat, lon):
    """
    Retrieve the full country name for a given latitude and longitude.

    Parameters:
    lat (float or str): Latitude of the location.
    lon (float or str): Longitude of the location.

    Returns:
    str: Full country name of the location, resolved with the 'pycountry' library, or None if the lookup fails.

    Example:
    >>> get_country_name(48.8588443, 2.2943506)
    'France'
    """
    try:
        coordinates = (float(lat), float(lon))
        result = rg.search([coordinates])[0]
        country_code = result.get('cc')
        country = pycountry.countries.get(alpha_2=country_code)
        country_name = country.name if country else None
        print(f"Processed coordinates: {coordinates} -> {country_name}")
        return country_name
    except Exception as e:
        print(f"Error processing coordinates: {lat}, {lon} -> {str(e)}")
        return None
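The function above accepts floats or strings and relies on the `try`/`except` to absorb bad input. As a sketch of an alternative, coordinates can be validated before any geocoding call is made; `parse_coordinates` is a hypothetical helper introduced here, and the range checks are the standard latitude and longitude bounds rather than anything stated in the notebook.

```python
from typing import Optional, Tuple


def parse_coordinates(lat, lon) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) as floats, or None if either value is missing or invalid."""
    try:
        lat_f, lon_f = float(lat), float(lon)
    except (TypeError, ValueError):
        # None raises TypeError, non-numeric strings raise ValueError
        return None
    # NaN fails both comparisons, so it is rejected here as well
    if -90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0:
        return (lat_f, lon_f)
    return None


print(parse_coordinates("48.8588443", 2.2943506))
print(parse_coordinates(None, 2.2943506))
```

Returning `None` for rejected input matches the behavior of `get_country_name`, which also yields `None` when a lookup fails.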

In [0]:
# Register the function as a UDF so it can be applied to Spark DataFrames
get_country_name_udf = udf(get_country_name, StringType())

In [0]:
get_country_name(48.8588443, 2.2943506)

Loading formatted geocoded file...
Processed coordinates: (48.8588443, 2.2943506) -> France


'France'

In [0]:
# Add the country_name column
df_with_location = df.withColumn(
    'country_name',
    get_country_name_udf(col('latitude'), col('longitude'))
)

In [0]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- depth: double (nullable = true)
 |-- title: string (nullable = true)
 |-- place_description: string (nullable = true)
 |-- sig: long (nullable = true)
 |-- mag: double (nullable = true)
 |-- magType: string (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- updated: timestamp (nullable = true)



In [0]:
df_with_location.printSchema()

root
 |-- id: string (nullable = true)
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- depth: double (nullable = true)
 |-- title: string (nullable = true)
 |-- place_description: string (nullable = true)
 |-- sig: long (nullable = true)
 |-- mag: double (nullable = true)
 |-- magType: string (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- updated: timestamp (nullable = true)
 |-- country_name: string (nullable = true)



In [0]:
df.head()

Row(id='nn00896479', longitude=-115.9286, latitude=36.7856, depth=7.2, title='M 0.3 - 33 km NW of Indian Springs, Nevada', place_description='33 km NW of Indian Springs, Nevada', sig=1, mag=0.3, magType='ml', time=datetime.datetime(2025, 4, 19, 23, 58, 50, 799000), updated=datetime.datetime(2025, 4, 20, 0, 23, 17, 444000))

In [0]:
df_with_location.head()

Row(id='nn00896479', longitude=-115.9286, latitude=36.7856, depth=7.2, title='M 0.3 - 33 km NW of Indian Springs, Nevada', place_description='33 km NW of Indian Springs, Nevada', sig=1, mag=0.3, magType='ml', time=datetime.datetime(2025, 4, 19, 23, 58, 50, 799000), updated=datetime.datetime(2025, 4, 20, 0, 23, 17, 444000), country_name='United States')

In [0]:
# Add significance classification
df_with_location_sig_class = df_with_location.withColumn(
    'sig_class',
    when(col('sig') < 100, 'Low')
    .when((col('sig') >= 100) & (col('sig') < 500), 'Moderate')
    .otherwise('High')
)
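The thresholds in the cell above can be checked outside Spark with a plain-Python mirror. This is a sketch for boundary testing only; `classify_sig` is a hypothetical name and is not used by the notebook.

```python
from typing import Optional


def classify_sig(sig: Optional[int]) -> str:
    """Plain-Python mirror of the sig_class thresholds."""
    if sig is not None and sig < 100:
        return "Low"
    if sig is not None and 100 <= sig < 500:
        return "Moderate"
    # Spark's when/otherwise sends a null `sig` here too, because a
    # comparison against null never evaluates to true.
    return "High"


# Exercise each boundary plus the null case
for value in (99, 100, 499, 500, None):
    print(value, classify_sig(value))
```

The null case is worth noting: rows with a missing `sig` fall through to `otherwise` and are labeled `High`, which is consistent with `sig_class` being reported as non-nullable in the schema.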

In [0]:
df_with_location_sig_class.printSchema()

root
 |-- id: string (nullable = true)
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- depth: double (nullable = true)
 |-- title: string (nullable = true)
 |-- place_description: string (nullable = true)
 |-- sig: long (nullable = true)
 |-- mag: double (nullable = true)
 |-- magType: string (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- updated: timestamp (nullable = true)
 |-- country_name: string (nullable = true)
 |-- sig_class: string (nullable = false)



## Save Data with Location to Gold Layer

Save the DataFrame with the new `country_name` and `sig_class` columns to the Gold layer in ADLS.

In [0]:
# Define the output path in the Gold container (gold_adls already ends with "/")
gold_output_path = f"{gold_adls}earthquake_events_gold"

In [0]:
# Append DataFrame to Gold container in Parquet format
df_with_location_sig_class.write.mode("append").parquet(gold_output_path)