Task:

Data

The provided dataset contains Automatic Identification System (AIS) data for vessels, including each vessel's MMSI (Maritime Mobile Service Identity), timestamp, latitude, and longitude. Students will calculate the distance traveled by each vessel throughout the day and determine which vessel has the longest route.

Tasks

    Data Retrieval
        Download the dataset from the given URL and unzip it to access the .csv or similar format file contained within.

    Data Preparation
        Load the data into a PySpark DataFrame.
        Ensure that the data types for latitude, longitude, and timestamp are appropriate for calculations and sorting.

    Data Processing with PySpark
        Calculate the distance between consecutive positions for each vessel using a suitable geospatial library or custom function that can integrate with PySpark.
        Aggregate these distances by MMSI to get the total distance traveled by each vessel on that day.

    Identifying the Longest Route
        Sort or use an aggregation function to determine which vessel traveled the longest distance.

    Output
        The final output should be the MMSI of the vessel that traveled the longest distance, along with the computed distance.

    Code Documentation and Comments
        Ensure the code is well-documented, explaining key PySpark transformations and actions used in the process.

    Deliverables
        A PySpark script that completes the task from loading to calculating and outputting the longest route.
        A brief report or set of comments within the code that discusses the findings and any interesting insights about the data or the computation process.

Evaluation Criteria

    Correct implementation of data loading and preprocessing.
    Accuracy of the distance calculation.
    Efficiency of PySpark transformations and actions.
    Clarity and completeness of documentation and code comments.
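
The Data Retrieval step is not shown in the notebook cells, so here is a minimal sketch of it using only the standard library. The function names `download_file` and `extract_csv_files` are hypothetical, and the dataset URL is left as a parameter because the task text does not state it.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Download url to dest, skipping the download if dest already exists."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_csv_files(archive: Path, out_dir: Path) -> list[Path]:
    """Extract only the .csv members of a zip archive and return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".csv"):
                # extract() returns the path of the file it wrote
                extracted.append(Path(zf.extract(name, out_dir)))
    return extracted
```

A call would look like `extract_csv_files(download_file(url, Path("data/ais.zip")), Path("data"))`, where `url` is the address given in the assignment and the `data/` paths are placeholders.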

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/05/29 10:43:34 WARN Utils: Your hostname, DESKTOP-QJASGSB, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/05/29 10:43:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/29 10:43:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Spark examples

In [2]:
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [3]:
pandas_df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})
df = spark.createDataFrame(pandas_df)
df

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [4]:
df.show()
df.printSchema()

                                                                                

+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)



# Data preparation

In [1]:
from pyspark.sql.functions import to_timestamp, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the CSV file into a DataFrame
data_path = "../lab1/data/aisdk-test.csv"
# data_path = "data/aisdk-2024-05-04.csv"

df = spark.read.csv(data_path, header=True, inferSchema=True)

# Convert the '# Timestamp' column to TimestampType
df = df.withColumn("Timestamp", to_timestamp(col("# Timestamp"), "dd/MM/yyyy HH:mm:ss"))

# Keep only these columns: Timestamp, MMSI, Latitude, Longitude
df = df.select("Timestamp", "MMSI", "Latitude", "Longitude")

# Display the schema of the DataFrame
df.printSchema()

# Print the number of rows in the DataFrame
print(f"Length of DataFrame: {df.count()}")

# Remove rows with null values in any of the columns
df = df.dropna()

# Print the number of rows after dropping null values
print(f"Length of DataFrame after dropping null values: {df.count()}")

root
 |-- Timestamp: timestamp (nullable = true)
 |-- MMSI: integer (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)



                                                                                

Length of DataFrame: 999999


Length of DataFrame after dropping null values: 999999
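
The row count stays at 999999 after `dropna()`, so null removal filters nothing in this file. AIS feeds can still contain positions outside the valid range (latitude 91 and longitude 181 are the standard "not available" values), and `dropna()` does not catch those. The sketch below shows, in plain Python, the same timestamp pattern as the cell above plus a coordinate range check. The function names and the sample timestamp are hypothetical; in the Spark pipeline the range check would be a filter on the `Latitude` and `Longitude` columns.

```python
from datetime import datetime
from typing import Optional

# Python strptime equivalent of the Spark pattern "dd/MM/yyyy HH:mm:ss"
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"


def parse_timestamp(text: str) -> Optional[datetime]:
    """Return the parsed timestamp, or None when the text does not match."""
    try:
        return datetime.strptime(text, TIMESTAMP_FORMAT)
    except ValueError:
        return None


def is_valid_position(lat: float, lon: float) -> bool:
    """True when the coordinates lie inside the valid geographic range."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Hypothetical sample values, not taken from the dataset
parsed = parse_timestamp("04/05/2024 00:00:01")
unavailable = is_valid_position(91.0, 181.0)  # AIS "not available" position
```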


                                                                                

# Data Processing with PySpark

In [2]:
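from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np

EARTH_RADIUS = 6371  # mean Earth radius in km

# Vectorized (pandas) UDF: each argument arrives as a pandas Series holding a batch of rows.
# The Series type hints replace the deprecated PandasUDFType.SCALAR argument (Spark 3.0+).
@pandas_udf(DoubleType())
def haversine_udf(lat1: pd.Series, lon1: pd.Series, lat2: pd.Series, lon2: pd.Series) -> pd.Series:
    # Convert degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula for the great-circle distance
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return pd.Series(EARTH_RADIUS * c)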
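
To check the accuracy of the distance calculation, the same formula can be evaluated on cases with a known answer. The sketch below is a scalar, standard-library version of the UDF above (the name `haversine_km` is hypothetical). Along a meridian, one degree of latitude should come out as R·π/180, and two antipodal points on the equator should be π·R apart.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean radius as the UDF above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    a = min(1.0, max(0.0, a))  # guard against rounding slightly outside [0, 1]
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))


one_degree = haversine_km(55.0, 12.0, 56.0, 12.0)   # about 111.19 km
half_circle = haversine_km(0.0, 0.0, 0.0, 180.0)    # about 20015.09 km
```

The clamp on `a` is an addition that the UDF does not have; without it, a rounding error that pushes `a` just above 1 would make `sqrt(1 - a)` undefined.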



In [3]:
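# One window per vessel, with rows ordered in time
window = Window.partitionBy("MMSI").orderBy("Timestamp")

# Previous position of the same vessel; null for each vessel's first row
df = df.withColumn("prev_lat", F.lag("Latitude").over(window))
df = df.withColumn("prev_lon", F.lag("Longitude").over(window))

# Distance in km between each position and the previous one
df = df.withColumn("distance", haversine_udf(
    F.col("Latitude"), F.col("Longitude"),
    F.col("prev_lat"), F.col("prev_lon")
))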

In [4]:
# Total distance per vessel; sum() ignores null values
df_total = df.groupBy("MMSI").agg(F.sum("distance").alias("total_distance"))
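
The window, lag and aggregation steps above can be cross-checked on a small sample with a plain-Python reference. This is a sketch and not part of the Spark job: `total_distance_by_vessel` is a hypothetical helper, the sample rows are made up, and integer timestamps stand in for any sortable value.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    a = min(1.0, max(0.0, a))
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def total_distance_by_vessel(rows):
    """rows: iterable of (mmsi, timestamp, lat, lon); returns {mmsi: km}."""
    tracks = defaultdict(list)
    for mmsi, ts, lat, lon in rows:
        tracks[mmsi].append((ts, lat, lon))        # like partitionBy("MMSI")
    totals = {}
    for mmsi, points in tracks.items():
        points.sort(key=lambda p: p[0])            # like orderBy("Timestamp")
        # zip(points, points[1:]) pairs each row with its predecessor, like lag()
        totals[mmsi] = sum(
            haversine_km(prev[1], prev[2], cur[1], cur[2])
            for prev, cur in zip(points, points[1:])
        )
    return totals


rows = [
    (1, 2, 57.0, 12.0),   # deliberately out of time order
    (1, 0, 55.0, 12.0),
    (1, 1, 56.0, 12.0),
    (2, 0, 10.0, 10.0),   # vessel with a single position
]
totals = total_distance_by_vessel(rows)
```

One difference from the Spark version: a vessel with a single position gets 0 here, whereas `F.sum` over only null distances returns null.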

In [5]:
# Keep only the vessel with the largest total distance
longest_route = df_total.orderBy(F.desc("total_distance")).limit(1)


In [6]:
# show() is an action: it triggers execution of the lazy transformations defined above
longest_route.show()

+-------+------------------+
|   MMSI|    total_distance|
+-------+------------------+
|2579999|20065.740571127666|
+-------+------------------+
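
The reported total deserves a plausibility check. 20065.74 km is slightly more than half of Earth's circumference at the radius used (π · 6371 ≈ 20015 km). Assuming the file covers at most one day, as the task states, that implies an average speed of at least 836 km/h, or about 451 knots, which no ship reaches. The figure therefore more likely reflects jumps between invalid and valid positions than a real route. The seven-digit MMSI is also worth a look: MMSIs have nine digits, `inferSchema` reads the column as an integer, and identities starting with 00 are assigned to coast stations, so this may not be a vessel at all. Below is a sketch of one possible mitigation, skipping segments whose implied speed exceeds a cut-off. The 100-knot threshold, the function names and the sample track are all hypothetical.

```python
import math
from datetime import datetime, timedelta

KM_PER_NAUTICAL_MILE = 1.852
MAX_SPEED_KNOTS = 100.0  # hypothetical cut-off, to be tuned for the data


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    a = min(1.0, max(0.0, a))
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def implied_speed_knots(distance_km, hours):
    """Average speed needed to cover distance_km in the given number of hours."""
    return distance_km / hours / KM_PER_NAUTICAL_MILE


def plausible_distance_km(points, max_speed_knots=MAX_SPEED_KNOTS):
    """points: list of (timestamp, lat, lon) sorted by timestamp."""
    total = 0.0
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        hours = (t1 - t0).total_seconds() / 3600.0
        if hours <= 0:
            continue  # duplicate timestamps give no usable speed
        d = haversine_km(lat0, lon0, lat1, lon1)
        if implied_speed_knots(d, hours) <= max_speed_knots:
            total += d  # keep only segments a ship could have sailed
    return total


# Average speed implied by the reported total over a full 24 hours
avg_knots = implied_speed_knots(20065.740571127666, 24.0)

# Made-up track: a normal leg, then a jump to (0, 0) and back
start = datetime(2024, 5, 4, 0, 0, 0)
track = [
    (start, 55.0, 12.0),
    (start + timedelta(hours=1), 55.1, 12.0),
    (start + timedelta(hours=1, minutes=1), 0.0, 0.0),
    (start + timedelta(hours=1, minutes=2), 55.1, 12.0),
]
filtered_km = plausible_distance_km(track)
```

In this made-up track only the first leg (0.1 degrees of latitude, about 11.1 km) survives the filter; the jump to (0, 0) and the return are both dropped.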



                                                                                