
In [None]:
# Filiz-Yıldız-Part1-Question3
"""
Dataset: EartquakeData-07032025.txt
Goal: For the 10 strongest earthquakes (1990–2019), find all earthquakes within 24 hours and 20 km of each one (foreshocks and aftershocks).

My approach:
I first restricted the dataset to the required date range (1990–2019).
I then used RDD transformations to:
  - parse and clean the data,
  - sort the earthquakes by magnitude (Mw),
  - select the top 10 events.
For each of these 10 events, I checked every other earthquake in the dataset to find those within 24 hours and a 20 km radius.
The time and distance checks use custom logic (datetime parsing and the Haversine formula), which keeps the solution RDD-based.

I solved the problem first with RDDs (Resilient Distributed Datasets). In that solution, I used map and filter to process the dataset
element by element, then computed the distance and time difference for each pair of earthquakes.

Afterward, I solved the same problem with Spark DataFrames. There I used higher-level column functions, which made the code
more readable. Specifically, I used isNotNull() and filter() to remove missing and empty values.
I then computed the time difference and distance and kept only the earthquake pairs that occurred within 24 hours and 20 km of each other.

Both approaches produced the same result, but the DataFrame version proved much faster and more efficient. This exercise helped me better understand
the differences between RDDs and DataFrames, and when to use each for large-scale data processing.
"""

# Connect colab to my drive account to fetch the dataset stored there
from google.colab import drive
drive.mount('/content/drive')

# List the files in the folder to check their names (optional)
import os
folder_path = "/content/drive/My Drive/datasets"
files = os.listdir(folder_path)
print(files)

Mounted at /content/drive
['2.txt', 'Capitals.txt', 'EartquakeData-07032025.txt', 'DollarDataset.txt', 'couples.txt', 'join-actors.txt', 'points-null-values.txt', 'numbers-test.txt', 'join-series.txt', 'points.txt', 'names.txt', 'Lottery.txt', 'JamesJoyce-Ulyses.txt', 'world.txt', 'points-places.txt', 'Iris.csv', 'ml-latest-small']


In [None]:
# Import the SparkSession class from the PySpark SQL module as the main entry point for working with structured data (like DataFrames) in Spark
from pyspark.sql import SparkSession

# Create or retrieve a SparkSession object named 'spark' and set the application name for tracking and logging purposes
spark = SparkSession.builder.appName("Part1-Question3-with-EathquakeData").getOrCreate()

In [None]:
# Initialize SparkContext to interact with Spark
sc = spark.sparkContext

# Read the text file into an RDD (Resilient Distributed Dataset)
rdd = sc.textFile("/content/drive/My Drive/datasets/EartquakeData-07032025.txt")

# Print the first 5 records from the RDD to get an idea of the data structure
print(rdd.take(5))

# Count the total number of records (lines) in the RDD
print(rdd.count())


['No    \tDeprem Kodu\tOlus tarihi\tOlus zamani\tEnlem\tBoylam\tDer(km)\txM\tMD\tML\tMw\tMs\tMb\tTip\tYer', '000001\t20241129191312\t2024.11.29\t19:13:12.53\t35.8453\t31.6895\t009.8\t4.9\t0.0\t4.9\t4.6\t0.0\t0.0\tKe\tAKDENIZ', '000002\t20241129122652\t2024.11.29\t12:26:52.13\t38.2525\t42.3815\t006.5\t3.8\t0.0\t3.7\t3.8\t0.0\t0.0\tKe\tICLIKAVAL-HIZAN (BITLIS) [South East  0.4 km]', '000003\t20241128071159\t2024.11.28\t07:11:59.85\t38.0462\t36.6528\t005.4\t4.0\t0.0\t4.0\t3.8\t0.0\t0.0\tKe\tGUCUKSU-GOKSUN (KAHRAMANMARAS) [South East  0.8 km]', '000004\t20241127025643\t2024.11.27\t02:56:43.09\t38.2540\t42.3613\t005.3\t4.3\t0.0\t4.2\t4.3\t0.0\t0.0\tKe\tICLIKAVAL-HIZAN (BITLIS) [West 1.5 km]']
20767


In [None]:
# Map operation to clean and split each line by tab (\t) and remove any leading/trailing whitespace
rdd1 = rdd.map(lambda x: x.strip().split('\t'))

# Print the first 5 records after stripping and splitting by tab
print(rdd1.take(5))

# Count the records in the mapped RDD (map preserves the record count)
print(rdd1.count())


[['No    ', 'Deprem Kodu', 'Olus tarihi', 'Olus zamani', 'Enlem', 'Boylam', 'Der(km)', 'xM', 'MD', 'ML', 'Mw', 'Ms', 'Mb', 'Tip', 'Yer'], ['000001', '20241129191312', '2024.11.29', '19:13:12.53', '35.8453', '31.6895', '009.8', '4.9', '0.0', '4.9', '4.6', '0.0', '0.0', 'Ke', 'AKDENIZ'], ['000002', '20241129122652', '2024.11.29', '12:26:52.13', '38.2525', '42.3815', '006.5', '3.8', '0.0', '3.7', '3.8', '0.0', '0.0', 'Ke', 'ICLIKAVAL-HIZAN (BITLIS) [South East  0.4 km]'], ['000003', '20241128071159', '2024.11.28', '07:11:59.85', '38.0462', '36.6528', '005.4', '4.0', '0.0', '4.0', '3.8', '0.0', '0.0', 'Ke', 'GUCUKSU-GOKSUN (KAHRAMANMARAS) [South East  0.8 km]'], ['000004', '20241127025643', '2024.11.27', '02:56:43.09', '38.2540', '42.3613', '005.3', '4.3', '0.0', '4.2', '4.3', '0.0', '0.0', 'Ke', 'ICLIKAVAL-HIZAN (BITLIS) [West 1.5 km]']]
20767


In [None]:
# Get the header row (first line) from the dataset
header = rdd1.first()

# Filter out the header row from the data
data = rdd1.filter(lambda row: row != header)

# Map each row of data to a dictionary using the header as keys and row values as values
records_rdd = data.map(lambda row: dict(zip(header, row)))

# Print the first 2 records after converting to dictionary
print(records_rdd.take(2))

# Count the total number of records in the transformed RDD
print(records_rdd.count())


[{'No    ': '000001', 'Deprem Kodu': '20241129191312', 'Olus tarihi': '2024.11.29', 'Olus zamani': '19:13:12.53', 'Enlem': '35.8453', 'Boylam': '31.6895', 'Der(km)': '009.8', 'xM': '4.9', 'MD': '0.0', 'ML': '4.9', 'Mw': '4.6', 'Ms': '0.0', 'Mb': '0.0', 'Tip': 'Ke', 'Yer': 'AKDENIZ'}, {'No    ': '000002', 'Deprem Kodu': '20241129122652', 'Olus tarihi': '2024.11.29', 'Olus zamani': '12:26:52.13', 'Enlem': '38.2525', 'Boylam': '42.3815', 'Der(km)': '006.5', 'xM': '3.8', 'MD': '0.0', 'ML': '3.7', 'Mw': '3.8', 'Ms': '0.0', 'Mb': '0.0', 'Tip': 'Ke', 'Yer': 'ICLIKAVAL-HIZAN (BITLIS) [South East  0.4 km]'}]
20766
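
The first key in the dictionaries above is `'No    '`, with trailing spaces, because only the whole line is stripped before splitting. A minimal sketch of a per-line parser that strips every header name and every field is shown below. The helper name `parse_line` and the shortened four-column sample are assumptions made for illustration, not part of the notebook.

```python
# Hypothetical helper: turn one tab-separated line into a dict with clean keys.
def parse_line(line, header_fields):
    """Split a tab-separated line and pair each stripped value with a header name."""
    values = [v.strip() for v in line.rstrip("\n").split("\t")]
    return dict(zip(header_fields, values))

# Shortened sample in the same layout as the dataset (only four columns shown).
raw_header = "No    \tDeprem Kodu\tOlus tarihi\tMw"
raw_line = "000001\t20241129191312\t2024.11.29\t4.6"

# Strip each header name once, so keys such as 'No    ' become 'No'.
header_fields = [h.strip() for h in raw_header.split("\t")]
record = parse_line(raw_line, header_fields)
print(record)
```

With clean keys, later lookups such as `record["No"]` do not depend on the exact padding in the file header.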


In [None]:
# Filter records by year (1990 to 2019, inclusive) based on 'Olus tarihi' (date of occurrence)
rdd_by_year = records_rdd.filter(lambda x: 1990 <= int(x['Olus tarihi'].split('.')[0]) <= 2019)

# Print the first 5 records from the filtered RDD
print(rdd_by_year.take(5))

# Count the total number of records after the year filter
print(rdd_by_year.count())

# Check the data type of the 'Mw' value in the first record (should be a string, but might need conversion)
print(type(rdd_by_year.take(1)[0]['Mw']))


[{'No    ': '003970', 'Deprem Kodu': '20191229112006', 'Olus tarihi': '2019.12.29', 'Olus zamani': '11:20:06.59', 'Enlem': '40.3462', 'Boylam': '42.1595', 'Der(km)': '005.0', 'xM': '4.3', 'MD': '0.0', 'ML': '4.2', 'Mw': '4.3', 'Ms': '0.0', 'Mb': '0.0', 'Tip': 'Ke', 'Yer': 'GULLUDAG-NARMAN (ERZURUM) [South East  3.7 km]'}, {'No    ': '003971', 'Deprem Kodu': '20191228014837', 'Olus tarihi': '2019.12.28', 'Olus zamani': '01:48:37.67', 'Enlem': '35.6587', 'Boylam': '32.0620', 'Der(km)': '031.0', 'xM': '3.6', 'MD': '0.0', 'ML': '3.6', 'Mw': '3.6', 'Ms': '0.0', 'Mb': '0.0', 'Tip': 'Ke', 'Yer': 'AKDENIZ'}, {'No    ': '003972', 'Deprem Kodu': '20191227071131', 'Olus tarihi': '2019.12.27', 'Olus zamani': '07:11:31.39', 'Enlem': '38.3725', 'Boylam': '39.0448', 'Der(km)': '002.2', 'xM': '3.6', 'MD': '0.0', 'ML': '3.6', 'Mw': '0.0', 'Ms': '0.0', 'Mb': '0.0', 'Tip': 'Ke', 'Yer': 'CEVRIMTAS-SIVRICE (ELAZIG) [North West  2.9 km]'}, {'No    ': '003973', 'Deprem Kodu': '20191227070225', 'Olus tarihi':

In [None]:
# Map over the filtered RDD to get the type of the 'Mw' field for each record
Mw_type = rdd_by_year.map(lambda x: type(x["Mw"])).distinct()

# Collect and print the distinct types of 'Mw'
Mw_type.collect()

# All values are strings, and some may be empty (''), so empty values must be filtered out before converting to float

[str]

In [None]:
# Keep only records with a non-empty 'Mw' value, then project the needed fields and convert 'Mw' to float
rdd_max = rdd_by_year.filter(lambda x: x['Mw'] and x['Mw'].strip() != '').map(lambda x: (
        x['Yer'],              # Location of the earthquake
        x['Enlem'],            # Latitude of the earthquake
        x['Deprem Kodu'],      # Earthquake code
        x['Boylam'],           # Longitude of the earthquake
        x['Olus tarihi'],      # Date of the earthquake
        x['Olus zamani'],      # Time of the earthquake
        float(x['Mw'])         # Magnitude of the earthquake, converted to float
    ))

# Print the first 5 records after filtering and converting 'Mw' to float
print(rdd_max.take(5))

# Print the total count of records after filtering
print(rdd_max.count())


[('GULLUDAG-NARMAN (ERZURUM) [South East  3.7 km]', '40.3462', '20191229112006', '42.1595', '2019.12.29', '11:20:06.59', 4.3), ('AKDENIZ', '35.6587', '20191228014837', '32.0620', '2019.12.28', '01:48:37.67', 3.6), ('CEVRIMTAS-SIVRICE (ELAZIG) [North West  2.9 km]', '38.3725', '20191227071131', '39.0448', '2019.12.27', '07:11:31.39', 0.0), ('TOPALUSAGI-SIVRICE (ELAZIG) [North East  0.3 km]', '38.3513', '20191227070225', '38.9847', '2019.12.27', '07:02:25.38', 4.8), ('BAYAT- (BALIKESIR) [East 0.6 km]', '39.4388', '20191226025534', '27.9092', '2019.12.26', '02:55:34.45', 3.3)]
1954
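
The filter-then-convert step above can also be expressed as one small function that returns `None` for blank values, which makes the cleaning rule explicit and easy to check. This is a sketch only; `parse_magnitude` is a hypothetical name, and treating non-numeric text as missing is an assumption rather than something the notebook does.

```python
def parse_magnitude(raw):
    """Return the magnitude as a float, or None when the field is missing or blank."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        # Assumption: non-numeric debris is treated as a missing value.
        return None

samples = ["4.3", " 3.6 ", "", "   ", None, "n/a"]
parsed = [parse_magnitude(s) for s in samples]
print(parsed)  # [4.3, 3.6, None, None, None, None]
```

Note that a reported value of `0.0` still parses as a valid float, as the third record in the output above shows.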


In [None]:
# Sort the filtered RDD by 'Mw' (magnitude) in descending order to get the most significant earthquakes
rdd_most = rdd_max.sortBy(lambda x: x[6], ascending=False)

# Print the top 20 most significant earthquakes based on 'Mw'
print(rdd_most.take(20))

# Print the total number of records in the sorted RDD
print(rdd_most.count())


[('BASISKELE (KOCAELI) [North East  2.0 km]', '40.7600', '19990817000137', '29.9700', '1999.08.17', '00:01:37.60', 7.4), ('YEMLICE- (VAN) [North West  1.5 km]', '38.7212', '20111023104121', '43.4110', '2011.10.23', '10:41:21.01', 7.2), ('UGUR- (DUZCE) [North East  0.3 km]', '40.7400', '19991112165720', '31.2100', '1999.11.12', '16:57:20.80', 7.2), ('GOKOVA KORFEZI (AKDENIZ)', '36.9693', '20170720223109', '27.4057', '2017.07.20', '22:31:09.66', 6.6), ('KURTULUS- (BINGOL) [South West  4.3 km]', '39.0100', '20030501002704', '40.4600', '2003.05.01', '00:27:04.40', 6.4), ('AKDENIZ', '35.7948', '20080715032633', '27.8798', '2008.07.15', '03:26:33.58', 6.3), ('EGE DENIZI', '38.8468', '20170612122837', '26.3252', '2017.06.12', '12:28:37.53', 6.1), ('AKDENIZ', '35.5138', '20110401132908', '26.5798', '2011.04.01', '13:29:08.56', 6.1), ('ISAAGAMEZRASI-KOVANCILAR (ELAZIG) [South West  0.6 km]', '38.8300', '20100308023231', '40.1308', '2010.03.08', '02:32:31.09', 6.1), ('SAGLAMTAS-PULUMUR (TUNCELI)
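
Only the 10 largest magnitudes are needed, so a full sort is more work than the task requires. The sketch below shows the top-k idea in plain Python with `heapq.nlargest`, using a few records copied from the outputs above. PySpark's `RDD.top(num, key=...)` and `RDD.takeOrdered(num, key=...)` offer the same kind of selection on an RDD; the notebook itself uses `sortBy` followed by `take`.

```python
import heapq

# A few records in the notebook's tuple layout:
# (place, lat, code, lon, date, time, Mw)
quakes = [
    ("AKDENIZ", "35.6587", "20191228014837", "32.0620",
     "2019.12.28", "01:48:37.67", 3.6),
    ("BASISKELE (KOCAELI) [North East  2.0 km]", "40.7600", "19990817000137", "29.9700",
     "1999.08.17", "00:01:37.60", 7.4),
    ("GOKOVA KORFEZI (AKDENIZ)", "36.9693", "20170720223109", "27.4057",
     "2017.07.20", "22:31:09.66", 6.6),
    ("YEMLICE- (VAN) [North West  1.5 km]", "38.7212", "20111023104121", "43.4110",
     "2011.10.23", "10:41:21.01", 7.2),
]

# Keep only the k largest records by magnitude (index 6); the input is not fully sorted.
top2 = heapq.nlargest(2, quakes, key=lambda q: q[6])
for place, _, code, _, _, _, mw in top2:
    print(code, mw, place)
```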

In [None]:
# Take the top 10 most significant earthquakes (based on 'Mw') from the sorted RDD
rdd_10 = rdd_most.take(10)

# Parallelize the top 10 earthquakes into a new RDD to prepare for Cartesian product calculation
rdd_top10 = sc.parallelize(rdd_10)

# Print the number of records in the rdd_top10 RDD (which should be 10)
print(rdd_top10.count())

# Print the top 10 earthquakes (as a check)
rdd_top10.take(10)


10


[('BASISKELE (KOCAELI) [North East  2.0 km]',
  '40.7600',
  '19990817000137',
  '29.9700',
  '1999.08.17',
  '00:01:37.60',
  7.4),
 ('YEMLICE- (VAN) [North West  1.5 km]',
  '38.7212',
  '20111023104121',
  '43.4110',
  '2011.10.23',
  '10:41:21.01',
  7.2),
 ('UGUR- (DUZCE) [North East  0.3 km]',
  '40.7400',
  '19991112165720',
  '31.2100',
  '1999.11.12',
  '16:57:20.80',
  7.2),
 ('GOKOVA KORFEZI (AKDENIZ)',
  '36.9693',
  '20170720223109',
  '27.4057',
  '2017.07.20',
  '22:31:09.66',
  6.6),
 ('KURTULUS- (BINGOL) [South West  4.3 km]',
  '39.0100',
  '20030501002704',
  '40.4600',
  '2003.05.01',
  '00:27:04.40',
  6.4),
 ('AKDENIZ',
  '35.7948',
  '20080715032633',
  '27.8798',
  '2008.07.15',
  '03:26:33.58',
  6.3),
 ('EGE DENIZI',
  '38.8468',
  '20170612122837',
  '26.3252',
  '2017.06.12',
  '12:28:37.53',
  6.1),
 ('AKDENIZ',
  '35.5138',
  '20110401132908',
  '26.5798',
  '2011.04.01',
  '13:29:08.56',
  6.1),
 ('ISAAGAMEZRASI-KOVANCILAR (ELAZIG) [South West  0.6 km]',


In [None]:
# Build an RDD of the remaining earthquakes: index the sorted RDD with zipWithIndex and drop the first 10 records (the top 10)
rdd_rest = rdd_most.zipWithIndex().filter(lambda x: x[1] >= 10).map(lambda x: x[0])

# Print the first 10 records of the rest of the earthquakes (excluding the top 10)
print(rdd_rest.take(10))

# Print the total count of earthquakes in the rdd_rest RDD
print(rdd_rest.count())

[('COBANLAR (AFYONKARAHISAR) [South East  4.1 km]', '38.6800', '20020203092644', '30.8200', '2002.02.03', '09:26:44.10', 6.0), ('TASKOPRU-SULTANDAGI (AFYONKARAHISAR) [West 3.4 km]', '38.5800', '20020203071128', '31.2500', '2002.02.03', '07:11:28.60', 6.0), ('VAN G�L�', '38.6377', '20111023204534', '43.0828', '2011.10.23', '20:45:34.85', 5.9), ('DEMIRCILI-URLA (IZMIR) [South West  9.6 km]', '38.1812', '20051020214001', '26.5940', '2005.10.20', '21:40:01.41', 5.9), ('OTLUCA- (HAKKARI) [South West  3.6 km]', '37.5700', '20050125164412', '43.6800', '2005.01.25', '16:44:12.20', 5.9), ('GIRIT ADASI (AKDENIZ)', '35.1613', '20150416180744', '26.9055', '2015.04.16', '18:07:44.80', 5.8), ('SOGUT-SIMAV (K�TAHYA) [North East  3.2 km]', '39.1553', '20110519201522', '29.0893', '2011.05.19', '20:15:22.94', 5.8), ('ZEYTINELI-URLA (IZMIR) [South East  9.5 km]', '38.1570', '20051017094653', '26.5268', '2005.10.17', '09:46:53.97', 5.8), ('KAZANLI-KARLIOVA (BINGOL) [South West  4.2 km]', '39.3475', '20050

In [None]:
# Create a Cartesian product of the top 10 earthquakes (rdd_top10) and the remaining earthquakes (rdd_rest)
rdd_cart = rdd_top10.cartesian(rdd_rest)

# Print the first 20 pairs of earthquakes from the Cartesian product
print(rdd_cart.take(20))

# Print the total count of pairs in the Cartesian product
print(rdd_cart.count())


[(('BASISKELE (KOCAELI) [North East  2.0 km]', '40.7600', '19990817000137', '29.9700', '1999.08.17', '00:01:37.60', 7.4), ('COBANLAR (AFYONKARAHISAR) [South East  4.1 km]', '38.6800', '20020203092644', '30.8200', '2002.02.03', '09:26:44.10', 6.0)), (('YEMLICE- (VAN) [North West  1.5 km]', '38.7212', '20111023104121', '43.4110', '2011.10.23', '10:41:21.01', 7.2), ('COBANLAR (AFYONKARAHISAR) [South East  4.1 km]', '38.6800', '20020203092644', '30.8200', '2002.02.03', '09:26:44.10', 6.0)), (('UGUR- (DUZCE) [North East  0.3 km]', '40.7400', '19991112165720', '31.2100', '1999.11.12', '16:57:20.80', 7.2), ('COBANLAR (AFYONKARAHISAR) [South East  4.1 km]', '38.6800', '20020203092644', '30.8200', '2002.02.03', '09:26:44.10', 6.0)), (('GOKOVA KORFEZI (AKDENIZ)', '36.9693', '20170720223109', '27.4057', '2017.07.20', '22:31:09.66', 6.6), ('COBANLAR (AFYONKARAHISAR) [South East  4.1 km]', '38.6800', '20020203092644', '30.8200', '2002.02.03', '09:26:44.10', 6.0)), (('KURTULUS- (BINGOL) [South West 

In [None]:
from math import radians, sin, cos, sqrt, atan2
from datetime import datetime

# Haversine function (calculating distance)
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)

    a = sin(dlat / 2)**2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c  # Distance in kilometers

# Function to convert date and time to datetime object
def to_datetime(date_str, time_str):
    return datetime.strptime(date_str + " " + time_str, "%Y.%m.%d %H:%M:%S.%f")

# Filter the RDD for earthquakes that are within 20 km and occurred within 24 hours
result_rdd = rdd_cart.filter(
    lambda x: (haversine(
        float(x[0][1]), float(x[0][3]),  # First earthquake (lat, lon)
        float(x[1][1]), float(x[1][3])   # Second earthquake (lat, lon)
    ) <= 20) and (abs(to_datetime(x[1][4], x[1][5]) - to_datetime(x[0][4], x[0][5])).total_seconds() <= 86400)
)

# Print the count of the filtered earthquakes
print(result_rdd.count())

# Take a sample of the filtered earthquakes
result_rdd.take(5)


88


[(('GOKOVA KORFEZI (AKDENIZ)',
   '36.9693',
   '20170720223109',
   '27.4057',
   '2017.07.20',
   '22:31:09.66',
   6.6),
  ('AKYARLAR-BODRUM (MUGLA) [South West  4.5 km]',
   '36.9465',
   '20170721170946',
   '27.2537',
   '2017.07.21',
   '17:09:46.28',
   4.9)),
 (('GOKOVA KORFEZI (AKDENIZ)',
   '36.9693',
   '20170720223109',
   '27.4057',
   '2017.07.20',
   '22:31:09.66',
   6.6),
  ('AKYARLAR-BODRUM (MUGLA) [South East  2.2 km]',
   '36.9580',
   '20170720232351',
   '27.3103',
   '2017.07.20',
   '23:23:51.44',
   4.9)),
 (('GOKOVA KORFEZI (AKDENIZ)',
   '36.9693',
   '20170720223109',
   '27.4057',
   '2017.07.20',
   '22:31:09.66',
   6.6),
  ('GOKOVA KORFEZI (AKDENIZ)',
   '36.9198',
   '20170721050359',
   '27.5538',
   '2017.07.21',
   '05:03:59.39',
   4.5)),
 (('GOKOVA KORFEZI (AKDENIZ)',
   '36.9693',
   '20170720223109',
   '27.4057',
   '2017.07.20',
   '22:31:09.66',
   6.6),
  ('GOKOVA KORFEZI (AKDENIZ)',
   '36.8833',
   '20170721021234',
   '27.3415',
   '2017.
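
As a sanity check on the distance logic, the Haversine formula can be run on one pair from the output above. The sketch below restates the same formula in a standalone function (the name `haversine_km` and the `radius_km` parameter are chosen here for illustration) and applies it to the Gökova main shock and the first related event.

```python
from math import radians, sin, cos, sqrt, atan2

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return radius_km * 2 * atan2(sqrt(a), sqrt(1 - a))

main = (36.9693, 27.4057)     # GOKOVA KORFEZI main shock (Mw 6.6)
related = (36.9465, 27.2537)  # AKYARLAR-BODRUM event (Mw 4.9)

distance = haversine_km(*main, *related)
print(f"{distance:.2f} km, within 20 km: {distance <= 20}")
```

The result is about 13.74 km, which agrees with the distance the notebook reports for this pair in the labeled output.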

In [None]:
# Map the filtered earthquake pairs to a new RDD with the following structure:
# (main_eq, related_eq, type, time_diff_hours, distance_km)
# - Assign "foreshock" if the related event occurred before the main event, otherwise "aftershock"
# - Compute time difference in hours
# - Compute spatial distance in kilometers using Haversine formula
rdd_labeled = result_rdd.map(
    lambda x: (
        x[0], x[1],  # main and related earthquake
        ("foreshock" if to_datetime(x[1][4], x[1][5]) < to_datetime(x[0][4], x[0][5]) else "aftershock"),
        abs((to_datetime(x[1][4], x[1][5]) - to_datetime(x[0][4], x[0][5])).total_seconds()) / 3600,
        haversine(
            float(x[0][1]), float(x[0][3]),  # main earthquake lat, lon
            float(x[1][1]), float(x[1][3])   # related earthquake lat, lon
        )
    )
)

# Show example output
print(rdd_labeled.take(5))
print(rdd_labeled.count())


[(('GOKOVA KORFEZI (AKDENIZ)', '36.9693', '20170720223109', '27.4057', '2017.07.20', '22:31:09.66', 6.6), ('AKYARLAR-BODRUM (MUGLA) [South West  4.5 km]', '36.9465', '20170721170946', '27.2537', '2017.07.21', '17:09:46.28', 4.9), 'aftershock', 18.643505555555553, 13.741603119208985), (('GOKOVA KORFEZI (AKDENIZ)', '36.9693', '20170720223109', '27.4057', '2017.07.20', '22:31:09.66', 6.6), ('AKYARLAR-BODRUM (MUGLA) [South East  2.2 km]', '36.9580', '20170720232351', '27.3103', '2017.07.20', '23:23:51.44', 4.9), 'aftershock', 0.8782722222222222, 8.568598104198921), (('GOKOVA KORFEZI (AKDENIZ)', '36.9693', '20170720223109', '27.4057', '2017.07.20', '22:31:09.66', 6.6), ('GOKOVA KORFEZI (AKDENIZ)', '36.9198', '20170721050359', '27.5538', '2017.07.21', '05:03:59.39', 4.5), 'aftershock', 6.547147222222222, 14.266057753856034), (('GOKOVA KORFEZI (AKDENIZ)', '36.9693', '20170720223109', '27.4057', '2017.07.20', '22:31:09.66', 6.6), ('GOKOVA KORFEZI (AKDENIZ)', '36.8833', '20170721021234', '27.34
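
The labeling rule can be checked by hand on the first pair in the output above. In this sketch each timestamp is parsed once and the signed difference drives both the label and the elapsed hours; the function name `classify` is hypothetical. The `%f` directive accepts the two-digit fractional seconds used in the dataset.

```python
from datetime import datetime

FMT = "%Y.%m.%d %H:%M:%S.%f"  # matches strings such as "2017.07.20 22:31:09.66"

def classify(main_date, main_time, other_date, other_time):
    """Return (label, hours) for an event relative to a main shock."""
    main_dt = datetime.strptime(f"{main_date} {main_time}", FMT)
    other_dt = datetime.strptime(f"{other_date} {other_time}", FMT)
    delta_s = (other_dt - main_dt).total_seconds()
    # Events strictly before the main shock are foreshocks, all others aftershocks.
    label = "foreshock" if delta_s < 0 else "aftershock"
    return label, abs(delta_s) / 3600

label, hours = classify("2017.07.20", "22:31:09.66", "2017.07.21", "17:09:46.28")
print(label, round(hours, 4))
```

For this pair the event follows the main shock by roughly 18.64 hours, so it is labeled an aftershock and falls inside the 24-hour window.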

In [None]:
# Collect the top 10 largest magnitude earthquakes as a Python list
top10_list = rdd_top10.collect()

# Collect the labeled related events (foreshocks/aftershocks) as a Python list
result_list = rdd_labeled.collect()

# Print the first main earthquake from the top 10 list to check list order
print(top10_list[0])

# Print the first labeled related event with its main earthquake, type, time diff, and distance to check list order
print(result_list[0])


('BASISKELE (KOCAELI) [North East  2.0 km]', '40.7600', '19990817000137', '29.9700', '1999.08.17', '00:01:37.60', 7.4)
(('GOKOVA KORFEZI (AKDENIZ)', '36.9693', '20170720223109', '27.4057', '2017.07.20', '22:31:09.66', 6.6), ('AKYARLAR-BODRUM (MUGLA) [South West  4.5 km]', '36.9465', '20170721170946', '27.2537', '2017.07.21', '17:09:46.28', 4.9), 'aftershock', 18.643505555555553, 13.741603119208985)


In [None]:
# This function prints a formatted summary of main earthquakes and their related shocks.
# It takes two inputs:
# 1. main_quakes: list of top 10 main earthquakes
# 2. labeled_shocks: list of related events (aftershocks or foreshocks) with their type, distance, and time difference

def print_quake_summary(main_quakes, labeled_shocks):
    for main in main_quakes:
        main_name, lat, quake_id, lon, date, time, mag = main

        # Print section header for each main earthquake
        print("-----------------------------------------------------------------------------------------------------------------------\n")
        print("MAIN EARTHQUAKE:")

        # Print table headers for main earthquake info
        print("Main EQ ID\t\tMain_Location\t\t\t\t\t\tMain Magnitude")
        # Print main earthquake info
        print(f"{quake_id:<20}\t{main_name:<50}\t\t{mag:<20}\t")
        print()

        # Print header for related shocks
        print("RELATED SHOCKS:")
        print("Related EQ ID\t\tRelated Location\t\t\t\t\tRelated Magnitude\tType\t\t\tDistance km\tTime Diff hrs")

        # Print all related shocks (foreshocks or aftershocks) for this main earthquake.
        # Each labeled tuple is (main, related, type, time_diff_hours, distance_km).
        for m, shock, shock_type, time_diff, distance in labeled_shocks:
            if m == main:
                print(f"{shock[2]:<20}\t{shock[0]:<50}\t\t{shock[6]:<6}\t\t{shock_type.capitalize()}\t\t{distance:.2f}\t\t{time_diff:.2f}\t\t")

        print("================================================================================\n")

# Call the function with collected main earthquakes and labeled shocks
print_quake_summary(top10_list, result_list)


-----------------------------------------------------------------------------------------------------------------------

MAIN EARTHQUAKE:
Main EQ ID		Main_Location						Main Magnitude
19990817000137      	BASISKELE (KOCAELI) [North East  2.0 km]          		7.4                 	

RELATED SHOCKS:
Related EQ ID		Related Location					Related Magnitude	Type			Distance km	Time Diff hrs

-----------------------------------------------------------------------------------------------------------------------

MAIN EARTHQUAKE:
Main EQ ID		Main_Location						Main Magnitude
20111023104121      	YEMLICE- (VAN) [North West  1.5 km]               		7.2                 	

RELATED SHOCKS:
Related EQ ID		Related Location					Related Magnitude	Type			Distance km	Time Diff hrs

-----------------------------------------------------------------------------------------------------------------------

MAIN EARTHQUAKE:
Main EQ ID		Main_Location						Main Magnitude
19991112165720      	UGUR- (DUZCE) [North East  
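
The summary function scans the full list of labeled shocks once per main earthquake. One alternative, sketched below, is to group the labeled tuples by main-earthquake ID in a single pass and then look each group up. The name `group_by_main` is hypothetical, and the two sample tuples are copied from the labeled output earlier in the notebook.

```python
from collections import defaultdict

def group_by_main(labeled_shocks):
    """Map main-earthquake ID -> list of (related, type, hours, km) tuples."""
    groups = defaultdict(list)
    for main, related, shock_type, hours, km in labeled_shocks:
        groups[main[2]].append((related, shock_type, hours, km))  # main[2] is the quake ID
    return groups

gokova = ("GOKOVA KORFEZI (AKDENIZ)", "36.9693", "20170720223109", "27.4057",
          "2017.07.20", "22:31:09.66", 6.6)
labeled = [
    (gokova,
     ("AKYARLAR-BODRUM (MUGLA) [South West  4.5 km]", "36.9465", "20170721170946",
      "27.2537", "2017.07.21", "17:09:46.28", 4.9),
     "aftershock", 18.643505555555553, 13.741603119208985),
    (gokova,
     ("AKYARLAR-BODRUM (MUGLA) [South East  2.2 km]", "36.9580", "20170720232351",
      "27.3103", "2017.07.20", "23:23:51.44", 4.9),
     "aftershock", 0.8782722222222222, 8.568598104198921),
]

groups = group_by_main(labeled)
for related, shock_type, hours, km in groups.get("20170720223109", []):
    print(f"{related[2]}  {shock_type:<10}  {km:6.2f} km  {hours:6.2f} h")
```

Main earthquakes with no related events simply have no entry, so `groups.get(quake_id, [])` yields an empty list for them.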

In [None]:
# ==================== EXTRA: SECOND SOLUTION (Spark DataFrames) ====================

In [None]:
# I also solved the question using Spark DataFrames

In [None]:
# DataFrame column helpers:
#   col                          -> refer to a column in an expression
#   to_timestamp, unix_timestamp -> parse strings into timestamps / epoch seconds
#   abs, round                   -> absolute value and rounding on columns (differences, distances)
#   when                         -> conditional expressions (if-else in SQL), used to create columns based on conditions
#   concat_ws                    -> concatenate columns with a separator
#   udf                          -> define user-defined functions in Spark
# Note: importing abs and round this way shadows Python's built-ins of the same name,
# so earlier cells that call abs() on a timedelta would pick up the Spark version if re-run after this cell.
from pyspark.sql.functions import (
    col, to_timestamp, unix_timestamp, abs, round, when, concat_ws, udf
)

# DoubleType is used to define or cast columns (latitude/longitude) as doubles
from pyspark.sql.types import DoubleType

# Standard math functions for the custom Haversine calculation (degrees to radians, trigonometry)
from math import radians, cos, sin, asin, sqrt


In [None]:
# Load the earthquake data file into a DataFrame using SparkSession
df = (
    spark.read
    .option("inferSchema", "false")           # Prevents Spark from automatically inferring data types; all columns will be read as strings
    .option("header", True)                   # The file contains a header row; Spark will use it as column names
    .option("delimiter", "\t")                # The columns are separated by tab characters (\t), not commas
    .option("encoding", "ISO-8859-9")         # Use Turkish encoding to correctly read special characters like ç, ğ, ı, etc.
    .csv("/content/drive/My Drive/datasets/EartquakeData-07032025.txt")  # File path
)

# Display the first 20 rows of the DataFrame (default)
df.show()

# Print the structure and data types of all columns
df.printSchema()

# Count the total number of rows in the dataset
df.count()


+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+
|No    |   Deprem Kodu|Olus tarihi|Olus zamani|  Enlem| Boylam|Der(km)| xM| MD| ML| Mw| Ms| Mb|Tip|                 Yer|
+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+
|000001|20241129191312| 2024.11.29|19:13:12.53|35.8453|31.6895|  009.8|4.9|0.0|4.9|4.6|0.0|0.0| Ke|             AKDENIZ|
|000002|20241129122652| 2024.11.29|12:26:52.13|38.2525|42.3815|  006.5|3.8|0.0|3.7|3.8|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000003|20241128071159| 2024.11.28|07:11:59.85|38.0462|36.6528|  005.4|4.0|0.0|4.0|3.8|0.0|0.0| Ke|GUCUKSU-GOKSUN (K...|
|000004|20241127025643| 2024.11.27|02:56:43.09|38.2540|42.3613|  005.3|4.3|0.0|4.2|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000005|20241127025006| 2024.11.27|02:50:06.71|38.2340|42.3765|  003.8|4.3|0.0|4.1|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000006|20241125172846| 2024.11.

20766
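
The `encoding` option matters for place names with Turkish characters. The earlier RDD output shows `VAN G�L�`, which is the pattern produced when ISO-8859-9 bytes are decoded as UTF-8. The sketch below reproduces the effect in plain Python; the sample string is an assumption about what the original text was.

```python
# Bytes as they would be stored in an ISO-8859-9 (Turkish) encoded file.
# Assumption: the garbled place name was originally "VAN GÖLÜ".
raw = "VAN GÖLÜ".encode("iso-8859-9")

# Decoding with the wrong codec replaces the invalid bytes with U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the codec the file was written in recovers the text.
right = raw.decode("iso-8859-9")

print(wrong)
print(right)
```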

In [None]:
# Remove leading and trailing spaces from column names (if any),
# especially to avoid issues when referencing columns with unexpected whitespace
df = df.toDF(*[column.strip() for column in df.columns])  # In this case, it removed trailing space in "No    " column

df.show()
df.printSchema()
df.count()


+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+
|    No|   Deprem Kodu|Olus tarihi|Olus zamani|  Enlem| Boylam|Der(km)| xM| MD| ML| Mw| Ms| Mb|Tip|                 Yer|
+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+
|000001|20241129191312| 2024.11.29|19:13:12.53|35.8453|31.6895|  009.8|4.9|0.0|4.9|4.6|0.0|0.0| Ke|             AKDENIZ|
|000002|20241129122652| 2024.11.29|12:26:52.13|38.2525|42.3815|  006.5|3.8|0.0|3.7|3.8|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000003|20241128071159| 2024.11.28|07:11:59.85|38.0462|36.6528|  005.4|4.0|0.0|4.0|3.8|0.0|0.0| Ke|GUCUKSU-GOKSUN (K...|
|000004|20241127025643| 2024.11.27|02:56:43.09|38.2540|42.3613|  005.3|4.3|0.0|4.2|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000005|20241127025006| 2024.11.27|02:50:06.71|38.2340|42.3765|  003.8|4.3|0.0|4.1|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000006|20241125172846| 2024.11.

20766

In [None]:
# Keep only rows whose "Mw" value is present (not null and not an empty string)
df_cleaned = df.filter(df["Mw"].isNotNull() & (df["Mw"] != ""))
df_cleaned.show()
df_cleaned.count()

+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+
|    No|   Deprem Kodu|Olus tarihi|Olus zamani|  Enlem| Boylam|Der(km)| xM| MD| ML| Mw| Ms| Mb|Tip|                 Yer|
+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+
|000001|20241129191312| 2024.11.29|19:13:12.53|35.8453|31.6895|  009.8|4.9|0.0|4.9|4.6|0.0|0.0| Ke|             AKDENIZ|
|000002|20241129122652| 2024.11.29|12:26:52.13|38.2525|42.3815|  006.5|3.8|0.0|3.7|3.8|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000003|20241128071159| 2024.11.28|07:11:59.85|38.0462|36.6528|  005.4|4.0|0.0|4.0|3.8|0.0|0.0| Ke|GUCUKSU-GOKSUN (K...|
|000004|20241127025643| 2024.11.27|02:56:43.09|38.2540|42.3613|  005.3|4.3|0.0|4.2|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000005|20241127025006| 2024.11.27|02:50:06.71|38.2340|42.3765|  003.8|4.3|0.0|4.1|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|
|000006|20241125172846| 2024.11.

7769

In [None]:
# Create a new column "datetime" by combining the separate date and time columns
df_cleaned = df_cleaned.withColumn(
    "datetime",  # Name of the new column to be added
    to_timestamp(  # Convert the resulting string into a proper Spark TimestampType
        concat_ws(" ", col("Olus tarihi"), col("Olus zamani")),  # Join date and time columns with a space in between
        "yyyy.MM.dd HH:mm:ss.SS"  # Specify the expected datetime format for correct parsing
    )
)
df_cleaned.show()
df_cleaned.printSchema()

+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+
|    No|   Deprem Kodu|Olus tarihi|Olus zamani|  Enlem| Boylam|Der(km)| xM| MD| ML| Mw| Ms| Mb|Tip|                 Yer|            datetime|
+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+
|000001|20241129191312| 2024.11.29|19:13:12.53|35.8453|31.6895|  009.8|4.9|0.0|4.9|4.6|0.0|0.0| Ke|             AKDENIZ|2024-11-29 19:13:...|
|000002|20241129122652| 2024.11.29|12:26:52.13|38.2525|42.3815|  006.5|3.8|0.0|3.7|3.8|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|2024-11-29 12:26:...|
|000003|20241128071159| 2024.11.28|07:11:59.85|38.0462|36.6528|  005.4|4.0|0.0|4.0|3.8|0.0|0.0| Ke|GUCUKSU-GOKSUN (K...|2024-11-28 07:11:...|
|000004|20241127025643| 2024.11.27|02:56:43.09|38.2540|42.3613|  005.3|4.3|0.0|4.2|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|2024-11-27 02:56:...|
|00000

In [None]:
# Create a new column "year" by extracting the first 4 characters from "Olus tarihi" (date string)
# Convert the extracted year to integer for future numeric operations like filtering or grouping
df = df_cleaned.withColumn("year", col("Olus tarihi").substr(1, 4).cast("int"))
df.show()
df.printSchema()
df.count()

+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+----+
|    No|   Deprem Kodu|Olus tarihi|Olus zamani|  Enlem| Boylam|Der(km)| xM| MD| ML| Mw| Ms| Mb|Tip|                 Yer|            datetime|year|
+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+----+
|000001|20241129191312| 2024.11.29|19:13:12.53|35.8453|31.6895|  009.8|4.9|0.0|4.9|4.6|0.0|0.0| Ke|             AKDENIZ|2024-11-29 19:13:...|2024|
|000002|20241129122652| 2024.11.29|12:26:52.13|38.2525|42.3815|  006.5|3.8|0.0|3.7|3.8|0.0|0.0| Ke|ICLIKAVAL-HIZAN (...|2024-11-29 12:26:...|2024|
|000003|20241128071159| 2024.11.28|07:11:59.85|38.0462|36.6528|  005.4|4.0|0.0|4.0|3.8|0.0|0.0| Ke|GUCUKSU-GOKSUN (K...|2024-11-28 07:11:...|2024|
|000004|20241127025643| 2024.11.27|02:56:43.09|38.2540|42.3613|  005.3|4.3|0.0|4.2|4.3|0.0|0.0| Ke|ICLIKAVAL-HIZAN (..

7769

In [None]:
# Filter the DataFrame to include only rows where the year is between 1990 and 2019 (inclusive)
df = df.filter((col("year") >= 1990) & (col("year") <= 2019))
df.show()
df.printSchema()
df.count()

+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+----+
|    No|   Deprem Kodu|Olus tarihi|Olus zamani|  Enlem| Boylam|Der(km)| xM| MD| ML| Mw| Ms| Mb|Tip|                 Yer|            datetime|year|
+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+----+
|003970|20191229112006| 2019.12.29|11:20:06.59|40.3462|42.1595|  005.0|4.3|0.0|4.2|4.3|0.0|0.0| Ke|GULLUDAG-NARMAN (...|2019-12-29 11:20:...|2019|
|003971|20191228014837| 2019.12.28|01:48:37.67|35.6587|32.0620|  031.0|3.6|0.0|3.6|3.6|0.0|0.0| Ke|             AKDENIZ|2019-12-28 01:48:...|2019|
|003972|20191227071131| 2019.12.27|07:11:31.39|38.3725|39.0448|  002.2|3.6|0.0|3.6|0.0|0.0|0.0| Ke|CEVRIMTAS-SIVRICE...|2019-12-27 07:11:...|2019|
|003973|20191227070225| 2019.12.27|07:02:25.38|38.3513|38.9847|  004.5|5.1|0.0|5.1|4.8|0.0|0.0| Ke|TOPALUSAGI-SIVRIC..

1954

In [None]:
# Take the 10 strongest earthquakes (ranked by Mw) so they can be matched against all other events afterwards
top10 = df.orderBy(col("Mw").cast("double").desc()).limit(10)
# Geospatial calculations (such as distance) need numeric (double) values
top10 = top10.withColumn("lat", col("Enlem").cast("double"))
top10 = top10.withColumn("lon", col("Boylam").cast("double"))
top10.show()
top10.printSchema()

+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+----+-------+-------+
|    No|   Deprem Kodu|Olus tarihi|Olus zamani|  Enlem| Boylam|Der(km)| xM| MD| ML| Mw| Ms| Mb|Tip|                 Yer|            datetime|year|    lat|    lon|
+------+--------------+-----------+-----------+-------+-------+-------+---+---+---+---+---+---+---+--------------------+--------------------+----+-------+-------+
|012369|19990817000137| 1999.08.17|00:01:37.60|40.7600|29.9700|   0018|7.4|6.7|0.0|7.4|0.0|0.0| Ke|BASISKELE (KOCAEL...|1999-08-17 00:01:...|1999|  40.76|  29.97|
|007358|20111023104121| 2011.10.23|10:41:21.01|38.7212|43.4110|  005.0|7.2|0.0|6.7|7.2|0.0|0.0| Ke|YEMLICE- (VAN) [N...|2011-10-23 10:41:...|2011|38.7212| 43.411|
|012153|19991112165720| 1999.11.12|16:57:20.80|40.7400|31.2100|   0025|7.2|0.0|0.0|7.2|0.0|0.0| Ke|UGUR- (DUZCE) [No...|1999-11-12 16:57:...|1999|  40.74|  31.21|
|004840|20170720223109

In [None]:
# Convert latitude and longitude to double type for numeric distance calculation
df = df.withColumn("lat", col("Enlem").cast("double"))
df = df.withColumn("lon", col("Boylam").cast("double"))


# UDF: Haversine formula, the great-circle distance in km between two (lat, lon) points given in degrees
# Inside a Python UDF the arguments are plain floats, so radians, sin, cos, asin and sqrt
# must be the versions from Python's math module, not Spark column functions of the same name
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Mean Earth radius (km)
    dlat = radians(lat2 - lat1)  # Latitude difference, converted to radians
    dlon = radians(lon2 - lon1)  # Longitude difference, converted to radians
    a = sin(dlat/2)**2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    return R * c

# Register the Python function as a Spark UDF returning DoubleType so it can be used with 'withColumn()'
haversine_udf = udf(haversine, DoubleType())

# Cross Join: Match every row in top10 with every row in the full dataset (df)
# Use alias("main") for the top10 DataFrame and alias("other") for the full DataFrame
joined = top10.alias("main").crossJoin(df.alias("other"))


# Compute the geographical distance (in km) between each pair using the Haversine formula
joined = joined.withColumn("distance", haversine_udf(
    col("main.lat"), col("main.lon"),
    col("other.lat"), col("other.lon"))
)


# Compute the absolute time difference between events in seconds
joined = joined.withColumn("time_diff", abs(unix_timestamp("main.datetime") - unix_timestamp("other.datetime")))

# Keep only pairs within 20 km and 24 hours (86400 seconds) of each other; these are treated as related events (foreshocks or aftershocks)
# The datetime inequality drops each main event matched with itself
result = joined.filter(
    (col("distance") <= 20) & (col("time_diff") <= 86400) & (col("main.datetime") != col("other.datetime"))
)

# Label each related event as a foreshock (before the main event) or an aftershock (after it)
result = result.withColumn(
    "type",
    when(col("other.datetime") < col("main.datetime"), "Foreshock").otherwise("Aftershock")
)


# Columns to be selected
final = result.select(
    col("main.No").alias("Main_EQ_ID"),
    col("main.Mw").alias("Main_Magnitude"),
    col("main.Yer").alias("Main_Location"),
    col("other.No").alias("Related_EQ_ID"),
    col("other.Mw").alias("Related_Magnitude"),
    col("other.Yer").alias("Related_Location"),
    "type",
    round(col("time_diff") / 3600, 2).alias("Time_Diff_hrs"),
    round(col("distance"), 2).alias("Distance_km")
)


# Show the results
final.show(truncate=False)
final.count()

+----------+--------------+------------------------+-------------+-----------------+--------------------------------------------+----------+-------------+-----------+
|Main_EQ_ID|Main_Magnitude|Main_Location           |Related_EQ_ID|Related_Magnitude|Related_Location                            |type      |Time_Diff_hrs|Distance_km|
+----------+--------------+------------------------+-------------+-----------------+--------------------------------------------+----------+-------------+-----------+
|004840    |6.6           |GOKOVA KORFEZI (AKDENIZ)|004763       |3.9              |GOKOVA KORFEZI (AKDENIZ)                    |Aftershock|23.33        |19.16      |
|004840    |6.6           |GOKOVA KORFEZI (AKDENIZ)|004765       |4.9              |AKYARLAR-BODRUM (MUGLA) [South West  4.5 km]|Aftershock|18.64        |13.74      |
|004840    |6.6           |GOKOVA KORFEZI (AKDENIZ)|004766       |3.8              |AKYARLAR-BODRUM (MUGLA) [North East  4.0 km]|Aftershock|18.4         |6.66       

88
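
The distance and time-window logic of the cell above can be sanity-checked on a single pair of events without Spark. The sketch below is an assumption-laden illustration, not the notebook's method: `classify`, `MAX_DISTANCE_KM`, and `MAX_TIME_DIFF_S` are hypothetical names, and the two sample events are rows 003973 and 003972 from the filtered output shown earlier, used only because their coordinates are visible (neither is claimed to be one of the top 10). Unlike `unix_timestamp`, which works at whole-second resolution, this version keeps fractional seconds.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

MAX_DISTANCE_KM = 20          # spatial threshold used in the filter above
MAX_TIME_DIFF_S = 24 * 3600   # 24 hours = 86400 seconds


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    R = 6371  # Mean Earth radius (km)
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))


def classify(main, other):
    """Hypothetical helper: return 'Foreshock', 'Aftershock', or None if unrelated."""
    if main["datetime"] == other["datetime"]:
        return None  # same timestamp, e.g. the main event matched with itself
    distance = haversine(main["lat"], main["lon"], other["lat"], other["lon"])
    time_diff = abs((main["datetime"] - other["datetime"]).total_seconds())
    if distance > MAX_DISTANCE_KM or time_diff > MAX_TIME_DIFF_S:
        return None
    return "Foreshock" if other["datetime"] < main["datetime"] else "Aftershock"


fmt = "%Y.%m.%d %H:%M:%S.%f"
# Sample events copied from the printed rows (No 003973 and No 003972)
quake_a = {"lat": 38.3513, "lon": 38.9847,
           "datetime": datetime.strptime("2019.12.27 07:02:25.38", fmt)}
quake_b = {"lat": 38.3725, "lon": 39.0448,
           "datetime": datetime.strptime("2019.12.27 07:11:31.39", fmt)}

# quake_b happens about nine minutes after quake_a and only a few km away
print(classify(quake_a, quake_b))
print(classify(quake_b, quake_a))
```

With `quake_a` as the main event, `quake_b` falls inside both thresholds and occurs later, so it is labelled an aftershock; swapping the roles labels `quake_a` a foreshock.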