## Use-Case with Spark

This notebook contains the complete Python code to implement our big data use case using Spark. To see the results achieved by analyzing the data with Pandas instead of Spark, refer to the notebook titled ADE_Pro.

In addition to the commented code, this notebook also describes the complete flow of data during operation, from the input in the form of CSV files to the visualization of the data on a map.

### The Use-Case

We aim to use historical AIS (Automatic Identification System) data, the location data used in shipping, to visualize the journey of a specific ship during a specific time frame. Since ships typically report their location via AIS at intervals ranging from a few seconds to 3 minutes, a large amount of data is generated continuously. This, combined with the fact that AIS is in use all over the world, makes this a "big data" use case. For further information see: https://en.wikipedia.org/wiki/Automatic_identification_system.

To limit our project to a realistic size and cost, we use historical Danish AIS data (which is provided for free, link: https://web.ais.dk/aisdata/ ). This limits the use case to the waters around Denmark and reduces the dataset to a more manageable size (approximately 2 GB per analyzed day). Consequently, the system is not live and only supports the analysis and visualization of historical voyages.
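
To get a feel for these volumes, a rough back-of-the-envelope estimate can be derived from figures that appear in this notebook: 78,022,287 rows across the five daily files loaded further below, and approximately 2 GB of CSV data per day. The 10-second reporting interval used for the single-ship estimate is an assumed, illustrative value within the range mentioned above, not a measured one.

```python
# Rough volume estimate based on figures from this notebook.
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption for illustration only: one ship reporting every 10 seconds.
assumed_interval_s = 10
messages_per_ship_per_day = SECONDS_PER_DAY // assumed_interval_s

# Figures taken from this notebook: five daily files with 78,022,287 rows
# in total, and approximately 2 GB of CSV data per analyzed day.
total_rows = 78_022_287
days = 5
approx_bytes_per_day = 2 * 10**9

rows_per_day = total_rows / days
approx_bytes_per_row = approx_bytes_per_day / rows_per_day

print(f"Messages per ship per day at {assumed_interval_s} s: {messages_per_ship_per_day}")
print(f"Average rows per day: {rows_per_day:,.0f}")
print(f"Approximate bytes per row: {approx_bytes_per_row:.0f}")
```

The result is only an order-of-magnitude figure, since the 2 GB per day is itself an approximation.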

### Limitations

This application is only a prototype and serves as a proof of concept for our use case. As such, it is constrained in several ways. As already mentioned, the system uses only historical data and cannot process live AIS data. In this notebook, the application runs in Spark's local mode, meaning there is only one node executing the jobs. The data covers only the waters surrounding Denmark. Also, the application displays only the path taken by one ship at a time, not the paths of multiple or all vessels.

These limitations, however, have no impact on the use case in general or on our comparison of Spark with Pandas. The technical limitations are explained further in the notebooks titled Scalability and Fault Tolerance.

### Structure of this notebook

1. Imports and creation of Spark Session
2. Loading of data and joining of data
3. Processing and filtering of data
4. Data Conversion for Visualization
5. Display of ship on map
6. Conclusion

In [1]:
# Comparison: Data Processing with Spark vs. Pandas
# -------------------------------------------------
#
# Goal of this Notebook:
# - Compare the data processing speed between Apache Spark and Pandas.
# - Analyze the data size limits where Pandas reaches its boundaries and Spark shows its advantages.
#
# Overview:
# - This notebook implements the data processing with Spark.
# - The same logic using Pandas is implemented in the notebook titled ADE_Pro.
# - Both approaches are compared in terms of runtime and memory usage.
#
# Prerequisites:
# - Python version: 3.8 or higher
# - Apache Spark: 3.5.0
# - Installed libraries: pyspark, pandas, requests, folium
#
# Workflow:
# 1. Download and prepare the data.
# 2. Process the data using both approaches.
# 3. Measure and compare the runtime.
#
# Let's start by importing the necessary libraries and setting up the Spark environment.

In [2]:
from pyspark.sql import SparkSession
import requests
from io import BytesIO
import zipfile
from concurrent.futures import ThreadPoolExecutor
import tempfile
import os
from pyspark.sql.functions import col, to_timestamp, count, lit
import folium  # Import for map visualization
import time  # Import for time measurement

In [3]:
# Global configuration variables
# Path for local storage of CSV files
local_storage_path = "./data/csv_files"  # Configurable storage path for CSV files
os.makedirs(local_storage_path, exist_ok=True)  # Create the directory if it does not exist

In [4]:
# List of CSV (ZIP) URLs
csv_urls = [
    "https://web.ais.dk/aisdata/aisdk-2024-03-01.zip",
    "https://web.ais.dk/aisdata/aisdk-2024-03-02.zip",
    "https://web.ais.dk/aisdata/aisdk-2024-03-03.zip",
    "https://web.ais.dk/aisdata/aisdk-2024-03-04.zip",
    "https://web.ais.dk/aisdata/aisdk-2024-03-05.zip"
]

In [None]:
# Step 1: Create a Spark session
spark = SparkSession.builder \
    .appName("AIS Data Processing") \
    .getOrCreate()

# Step 2: Function to download, extract, and save CSV files locally if not already present
def download_and_unzip_to_csv(url):
    # Extract ZIP filename and corresponding CSV filename
    zip_filename = url.split("/")[-1]
    csv_filename = zip_filename.replace(".zip", ".csv")
    csv_filepath = os.path.join(local_storage_path, csv_filename)
    
    # Check if the CSV file already exists
    if os.path.exists(csv_filepath):
        print(f"File already exists: {csv_filepath}, skipping download.")
        return csv_filepath  # Return the path to the existing file
    
    # Download and extract the ZIP file
    print(f"Downloading and extracting: {url}")
    response = requests.get(url)
    response.raise_for_status()
    zipfile_bytes = BytesIO(response.content)
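    with zipfile.ZipFile(zipfile_bytes, 'r') as z:
        with z.open(z.namelist()[0]) as csv_file:
            # Save the extracted CSV locally, copying in 1 MiB chunks so the
            # uncompressed file is never held in memory all at once
            with open(csv_filepath, "wb") as output_file:
                for chunk in iter(lambda: csv_file.read(1024 * 1024), b""):
                    output_file.write(chunk)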
    
    return csv_filepath  # Return the path to the saved file

# Step 4: Parallel downloading and storing CSV files locally

# Start the timer
start_time = time.time()

# Count the total number of URLs
total_urls = len(csv_urls)

# Parallel downloading and storing CSV files
with ThreadPoolExecutor(max_workers=10) as executor:
    csv_file_paths = list(executor.map(download_and_unzip_to_csv, csv_urls))

# Stop the timer
end_time = time.time()

# Convert elapsed time into minutes and seconds
elapsed_time = end_time - start_time
minutes = int(elapsed_time // 60)
seconds = int(elapsed_time % 60)
download_time = f"The download time for {total_urls} files with Spark is {minutes} minutes and {seconds} seconds."

# Print the download time
print(download_time)

# Automatically update the README file
try:
    # Open the file and read its contents
    with open("README.md", "r") as readme:
        lines = readme.readlines()
    
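    # Create a new list of lines
    updated_lines = []
    section_found = False
    in_section = False  # True while reading lines that belong to the results section
    for line in lines:
        if line.strip() == "### Download Time Results with Spark":
            # Keep the heading and write the new value directly below it
            updated_lines.append(line)
            updated_lines.append(f"{download_time}\n")
            section_found = True
            in_section = True
        elif in_section and line.startswith("The download time for"):
            # Drop previously recorded values so only the latest one remains
            continue
        else:
            if line.startswith("#"):
                in_section = False  # A new heading ends the results section
            updated_lines.append(line)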

    # If the section was not found, append it
    if not section_found:
        updated_lines.append("\n### Download Time Results with Spark\n")
        updated_lines.append(f"{download_time}\n")

    # Overwrite the file
    with open("README.md", "w") as readme:
        readme.writelines(updated_lines)

    print("The download time and URL count have been successfully updated in the README.")
except FileNotFoundError:
    # If the file does not exist, create it
    with open("README.md", "w") as readme:
        readme.write("### Download Time Results with Spark\n")
        readme.write(f"{download_time}\n")
    print("The README file was created, and the download time has been added.")
except Exception as e:
    print(f"Error writing to the README file: {e}")

25/02/04 01:16:43 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.113 instead (on interface en0)
25/02/04 01:16:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/04 01:16:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


File already exists: ./data/csv_files/aisdk-2024-03-01.csv, skipping download.
File already exists: ./data/csv_files/aisdk-2024-03-02.csv, skipping download.
File already exists: ./data/csv_files/aisdk-2024-03-03.csv, skipping download.
File already exists: ./data/csv_files/aisdk-2024-03-04.csv, skipping download.
File already exists: ./data/csv_files/aisdk-2024-03-05.csv, skipping download.
The download time for 5 files with Spark is 0 minutes and 0 seconds.
The download time and URL count have been successfully updated in the README.


----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 50349)
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/socketserver.py", line 318, in _handle_request_noblock
    self.process_request(request, client_address)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/socketserver.py", line 349, in process_request
    self.finish_request(request, client_address)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/socketserver.py", line 362, in finish_request
    self.RequestHandlerClass(request, client_address, self)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.fra
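
One caveat of a skip-if-the-file-exists check like the one in `download_and_unzip_to_csv` is that an interrupted run can leave a partial CSV on disk, which the next run would treat as complete. A minimal sketch of one possible guard is shown below: the data is written to a temporary name and renamed only after extraction has finished. The helper name `extract_first_member_atomically` and the `.part` suffix are hypothetical and not part of the notebook. The demonstration uses a small throwaway archive instead of the real AIS files.

```python
import os
import shutil
import tempfile
import zipfile


def extract_first_member_atomically(zip_path, target_path):
    """Extract the first archive member to target_path via a temporary file."""
    part_path = target_path + ".part"  # hypothetical suffix for incomplete files
    with zipfile.ZipFile(zip_path, "r") as archive:
        with archive.open(archive.namelist()[0]) as source, open(part_path, "wb") as sink:
            shutil.copyfileobj(source, sink)  # copies in chunks, not all at once
    # The final name only appears once the whole member has been written
    os.replace(part_path, target_path)
    return target_path


# Small self-contained demonstration with a throwaway archive
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    demo_csv = os.path.join(tmp_dir, "demo.csv")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo.csv", "MMSI,Latitude,Longitude\n219000873,56.99091,10.304543\n")
    extract_first_member_atomically(demo_zip, demo_csv)
    with open(demo_csv, "r") as handle:
        extracted_text = handle.read()
    part_file_left_behind = os.path.exists(demo_csv + ".part")

print(extracted_text)
```

If extraction fails midway, only the `.part` file remains, so an existence check on the final CSV path would not be fooled.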

In [6]:
# Step 5: Read CSV files with Spark and combine them
# Create a list of DataFrames for each CSV file
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in csv_file_paths]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

# Print the number of rows in the combined DataFrame
print(f"The combined dataset contains {combined_df.count()} rows.")

25/02/04 01:16:54 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors

The combined dataset contains 78022287 rows.


                                                                                

In [7]:
# Step 6: Display some sample rows to show possible MMSI numbers
combined_df.show(10)

# Print the original number of entries
print(f"The original dataset contains {combined_df.count()} entries.")

25/02/04 01:17:25 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------------------+--------------+---------+---------+---------+--------------------+----+----+-----+-------+-------+--------+----+---------+----------+-----+------+------------------------------+-------+-----------+----+----------------+----+----+----+----+
|        # Timestamp|Type of mobile|     MMSI| Latitude|Longitude| Navigational status| ROT| SOG|  COG|Heading|    IMO|Callsign|Name|Ship type|Cargo type|Width|Length|Type of position fixing device|Draught|Destination| ETA|Data source type|   A|   B|   C|   D|
+-------------------+--------------+---------+---------+---------+--------------------+----+----+-----+-------+-------+--------+----+---------+----------+-----+------+------------------------------+-------+-----------+----+----------------+----+----+----+----+
|01/03/2024 00:00:00|       Class A|219000873| 56.99091|10.304543|Under way using e...|NULL| 0.0| 30.2|   NULL|Unknown| Unknown|NULL|Undefined|      NULL| NULL|  NULL|                     Undefined|   NULL|    Unknown



The original dataset contains 78022287 entries.


                                                                                

In [9]:
########################################################
# 2. Filter out base stations as they do not display navigation data ("Type of mobile" != "Base Station")
########################################################

# Check if the column "Type of mobile" exists and filter
if "Type of mobile" in combined_df.columns:
    combined_df = combined_df.filter(col("Type of mobile") != "Base Station")
else:
    print("Warning: 'Type of mobile' column not found, skipping this step.")

print(f"The adjusted dataset contains {combined_df.count()} entries.")



The adjusted dataset contains 72413518 entries.


                                                                                

In [10]:
########################################################
# 3. Keep only relevant columns to reduce data size
########################################################

relevant_columns = ["MMSI", "Latitude", "Longitude", "# Timestamp"]
combined_df = combined_df.select(*relevant_columns)

########################################################
# 4. Convert Timestamp to datetime format
########################################################

combined_df = combined_df.withColumn("# Timestamp", to_timestamp(col("# Timestamp"), "dd/MM/yyyy HH:mm:ss"))

########################################################
# 5. Filter MMSI numbers with enough data points to display meaningful routes
########################################################

# Count the number of data points per MMSI
mmsi_counts = combined_df.groupBy("MMSI").agg(count("*").alias("count"))

# Define a threshold (e.g., at least 50 points)
threshold = 50
valid_mmsi = mmsi_counts.filter(col("count") >= threshold).select("MMSI").rdd.flatMap(lambda x: x).collect()

# Filtered DataFrame containing only MMSI numbers with sufficient data points
filtered_by_count_df = combined_df.filter(col("MMSI").isin(valid_mmsi))
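
The threshold logic of step 5 can be illustrated in plain Python on a handful of rows. This is only a sketch of the idea and not the Spark implementation: the rows and MMSI numbers below are invented, and the threshold is lowered to 3 (the notebook uses 50) to keep the example short.

```python
from collections import Counter

# Hypothetical (MMSI, latitude, longitude) rows for illustration only
rows = [
    (111111111, 55.0, 10.0),
    (111111111, 55.1, 10.1),
    (111111111, 55.2, 10.2),
    (222222222, 56.0, 11.0),
]

threshold = 3  # the notebook uses 50; a small value keeps the example readable

# Count the data points per MMSI, then keep only MMSIs that reach the threshold
counts = Counter(mmsi for mmsi, _, _ in rows)
valid_mmsi = {mmsi for mmsi, n in counts.items() if n >= threshold}
filtered_rows = [row for row in rows if row[0] in valid_mmsi]

print(valid_mmsi)
print(len(filtered_rows))
```

The sketch stores the valid MMSIs in a set, which keeps membership tests cheap. Whether that matters in practice depends on how many distinct MMSIs pass the threshold.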

                                                                                

In [11]:
########################################################
# 6. Filter by specific MMSI and time range + plot the route
########################################################

mmsi_number = 245097000  # Replace with your desired MMSI (e.g., 219016832 is a ferry)

# Define start and end timestamps (in the format "dd/MM/yyyy HH:mm:ss")
start_str = "01/03/2024 00:00:00"  # Start time
end_str = "01/03/2024 23:59:59"    # End time

# Convert start and end times to Spark timestamp expressions
start_dt = to_timestamp(lit(start_str), "dd/MM/yyyy HH:mm:ss")
end_dt = to_timestamp(lit(end_str), "dd/MM/yyyy HH:mm:ss")

# Check if the MMSI has enough data points
if mmsi_number not in valid_mmsi:
    print(f"MMSI {mmsi_number} does not have enough data points to display a meaningful route.")
else:
    # Filter by MMSI and time range
    route_df = filtered_by_count_df.filter(
        (col("MMSI") == mmsi_number) &
        (col("# Timestamp") >= start_dt) &
        (col("# Timestamp") <= end_dt)
    ).orderBy("# Timestamp")

    # Check if filtered data is available
    if route_df.count() == 0:
        print(f"No data for MMSI {mmsi_number} between {start_str} and {end_str}")
    else:
        # Convert data to Pandas DataFrame for plotting
        pandas_df = route_df.toPandas()

        # Create a map and plot the route
        mean_lat = pandas_df["Latitude"].mean()
        mean_lon = pandas_df["Longitude"].mean()
        
        route_map = folium.Map(location=[mean_lat, mean_lon], zoom_start=8)
        
        # Create a list of coordinates for the PolyLine
        coords = pandas_df[["Latitude", "Longitude"]].values.tolist()
        
        # Add the PolyLine to the map
        folium.PolyLine(coords, color="blue", weight=2.5, opacity=1).add_to(route_map)
        
        # Save the map
        route_map.save("ship_route.html")
        print("Route successfully saved as 'ship_route.html'.")
        display(route_map)

                                                                                

Route successfully saved as 'ship_route.html'.


25/02/04 04:46:44 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 1936449 ms exceeds timeout 120000 ms
25/02/04 04:46:44 WARN SparkContext: Killing executors is not supported by current scheduler.
25/02/04 04:46:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:56)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:310)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:36)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.driverEndpoint$lzycompute(BlockManagerMasterEndpoint.scala:124)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$
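
The time-window filter of step 6 depends on the day-first timestamp format: "01/03/2024" is 1 March, not 3 January. The following plain-Python sketch shows the same parse, filter, and sort sequence on a few invented records. The `%d/%m/%Y %H:%M:%S` pattern is the `datetime.strptime` counterpart of the Spark pattern `dd/MM/yyyy HH:mm:ss`, and the coordinates are hypothetical.

```python
from datetime import datetime

PYTHON_FORMAT = "%d/%m/%Y %H:%M:%S"  # counterpart of Spark's "dd/MM/yyyy HH:mm:ss"

start_dt = datetime.strptime("01/03/2024 00:00:00", PYTHON_FORMAT)
end_dt = datetime.strptime("01/03/2024 23:59:59", PYTHON_FORMAT)

# Hypothetical (timestamp, latitude, longitude) records, deliberately out of order
records = [
    ("01/03/2024 12:00:00", 55.2, 10.2),
    ("02/03/2024 00:00:05", 55.9, 10.9),  # outside the window
    ("01/03/2024 00:00:00", 55.0, 10.0),
]

# Parse, keep only records inside the window, and sort chronologically
parsed = [(datetime.strptime(ts, PYTHON_FORMAT), lat, lon) for ts, lat, lon in records]
in_window = sorted(r for r in parsed if start_dt <= r[0] <= end_dt)

# Coordinate pairs in travel order, the shape expected by folium.PolyLine
coords = [[lat, lon] for _, lat, lon in in_window]

print(coords)
```

Sorting by timestamp before building the coordinate list matters, because the polyline connects the points in list order.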

### Conclusion
The cell above displays the resulting map inline. Alternatively, after executing the code, you can open the ship_route.html file to view the path taken by the specified ship (identified by its MMSI number) within the specified time frame. Both parameters can be edited in the cell above.

During our analysis, we executed this code with different numbers of CSV files, using both this notebook and the Pandas notebook, to compare the results. The observations from these tests are shown in the diagram below. For small amounts of data (≈ 4 GB), Pandas performed better than Spark. As the amount of data increased (≈ 6 GB), Spark began to outperform Pandas, whose performance started to decline. Increasing the amount of data further (≈ 10 GB) resulted in Spark vastly outperforming Pandas.
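
A measurement of this kind could be scripted roughly as follows. This is a sketch and not necessarily the exact procedure used for the diagram: the `timed` helper and the `timings` dictionary are hypothetical names, and a trivial computation stands in for the actual Spark or Pandas run. The formatting mirrors the minutes-and-seconds output used earlier in this notebook.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical container: label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def format_duration(elapsed_seconds):
    """Format seconds as minutes and seconds, as done earlier in the notebook."""
    minutes = int(elapsed_seconds // 60)
    seconds = int(elapsed_seconds % 60)
    return f"{minutes} minutes and {seconds} seconds"


with timed("example workload"):
    sum(range(100_000))  # stand-in for a Spark or Pandas run

print(format_duration(timings["example workload"]))
print(format_duration(125.4))
```

Running each variant several times and comparing the recorded values would reduce the influence of caching and background load.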

![](img/Spark_Pandas_Comparison.png)
*Download Time with Spark and Pandas*

These findings can be explained by the different architectures and execution models of Spark and Pandas. That Pandas matches or beats Spark for small to medium amounts of data is likely due to the overhead Spark incurs, which includes tasks like analyzing and partitioning files and creating an execution plan. For larger datasets, Spark's architecture begins to provide significant advantages, primarily due to parallelization, leading to increasingly better performance. This is visible in the diagram: Spark's run time grows much more slowly than that of Pandas, almost sub-linearly.
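
To make the partitioning step more concrete, the sketch below estimates how many input splits Spark might create for the CSV files. It is a simplified approximation of Spark's split-size rule, not the exact implementation. It assumes the default values of `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB), a file size of 2 GiB per day (based on the approximate figure given earlier), and 8 cores. The function name `estimate_input_splits` is hypothetical.

```python
import math

# Spark defaults (assumed unchanged in this notebook's session)
MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes
OPEN_COST_BYTES = 4 * 1024 * 1024        # spark.sql.files.openCostInBytes


def estimate_input_splits(file_sizes_bytes, cores):
    """Approximate the number of input splits for splittable files."""
    total_bytes = sum(size + OPEN_COST_BYTES for size in file_sizes_bytes)
    bytes_per_core = total_bytes / cores
    # Splits are capped at the maximum partition size and floored at the open cost
    max_split_bytes = min(MAX_PARTITION_BYTES, max(OPEN_COST_BYTES, bytes_per_core))
    return sum(math.ceil(size / max_split_bytes) for size in file_sizes_bytes)


one_day = 2 * 1024**3  # assumed size of one daily CSV file (2 GiB)

print(estimate_input_splits([one_day], cores=8))
print(estimate_input_splits([one_day] * 5, cores=8))
```

Under these assumptions the number of splits grows with the input size while each split stays the same size, so additional data becomes additional tasks that can be spread across the available cores.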

The increasingly poor performance of Pandas can be explained by its limitations: Pandas is single-node and single-threaded. This means that compared to Spark, Pandas has limited I/O bandwidth, memory, and CPU power. For larger datasets, these resources become insufficient, resulting in performance issues and, eventually, crashes.

This difference is also evident when examining the hardware utilization. The images below show the hardware monitor during the execution of the Spark version and of the Pandas version of the program.

As seen in the first image, when running with Spark, the load is well-balanced across all 8 cores, preventing bottlenecks. In contrast, the Pandas version exhibits a much less efficient load distribution, utilizing fewer cores and resulting in hardware bottlenecks. This performance gap was also noticeable during testing, as the overall system responsiveness significantly declined while executing the Pandas version.

![](img/Spark_CPU.png)
*CPU usage with the Spark version*

![](img/Pandas_CPU.png)
*CPU usage with the Pandas version*

To conclude, our analysis of both Spark and Pandas demonstrates that Spark is superior for processing large amounts of data. This superiority can be attributed to Pandas' limitations and to Spark's "big data" features and scalability-oriented architecture. While Pandas might be sufficient for small projects, Spark should be the preferred option for anything that is even remotely "big data". This is underlined by our tests regarding scalability and fault tolerance, both of which are highly relevant when working with large volumes of data.