# Scalability

This notebook aims to answer general questions about Spark’s scalability and its comparison with Pandas. It is structured into three sections:<br>

1. General Scalability of Spark (Tests 1-4)
2. Hardware Scalability of Spark (Tests 5-10)
3. Direct Comparison with Pandas (Tests 11-12)

To simplify the problem, the code executed here is not our entire use case but only a part of it. In every test, the data files are read and then combined into a single DataFrame. To verify correctness, the number of rows in the DataFrame is printed after each test; errors in the data would show up as a differing row count.

In [2]:
#import findspark
from pyspark.sql import SparkSession
import os
from pyspark.sql.functions import col, to_timestamp, count, lit
import pandas as pd
import psutil

In [3]:
local_storage_path = "./data/csv_files"
os.makedirs(local_storage_path, exist_ok=True)  # Create the directory if it does not exist

In [3]:
spark = SparkSession.builder \
    .appName("AIS Data Analysis") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 01:51:10 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 01:51:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 01:51:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
pathsTest1 = [
    "./data/csv_files/aisdk-2024-03-01.csv"
]  # unfinished: the paths here still need to be changed to our new ones

pathsTest2 = [
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv"
]

pathsTest3 = [
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv",
    "./data/csv_files/aisdk-2024-03-03.csv",
    "./data/csv_files/aisdk-2024-03-04.csv",
    "./data/csv_files/aisdk-2024-03-05.csv"
]

pathsTest4 = [
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv",
    "./data/csv_files/aisdk-2024-03-03.csv",
    "./data/csv_files/aisdk-2024-03-04.csv",
    "./data/csv_files/aisdk-2024-03-05.csv",
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv",
    "./data/csv_files/aisdk-2024-03-03.csv",
    "./data/csv_files/aisdk-2024-03-04.csv",
    "./data/csv_files/aisdk-2024-03-05.csv"
]

### Tests General Scalability

### Test 1 - One File - 2.63 GB

In [5]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest1]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 01:51:26 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

15512927

Results:<br>
Tasks - 46<br>
Input - 5.3 GiB<br>
Time - 9.4 s<br>
Note: The parameter `inferSchema` results in additional jobs, because Spark has to pass over the file to identify the data structure. This leads to a slightly higher number of jobs than expected.
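
Why schema inference costs an extra pass can be illustrated with a toy example. The sketch below is not Spark's implementation; it is a minimal standard-library illustration in which `infer_column_types` and the column names in `sample` are hypothetical. It shows that a column's type is only known once every value in that column has been seen.

```python
import csv
import io


def infer_column_types(csv_text):
    """Infer a type name per column by scanning every data row once."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Start with the narrowest type and widen whenever a value does not fit.
    types = ["int"] * len(header)
    order = {"int": 0, "float": 1, "str": 2}

    def type_of(value):
        # Try the narrowest type first; fall back to "str" if nothing fits.
        for name, cast in (("int", int), ("float", float)):
            try:
                cast(value)
                return name
            except ValueError:
                continue
        return "str"

    for row in reader:
        for i, value in enumerate(row):
            seen = type_of(value)
            if order[seen] > order[types[i]]:
                types[i] = seen
    return dict(zip(header, types))


# Hypothetical sample data; the real AIS files have different columns.
sample = "mmsi,latitude,ship_type\n219000001,55.1,Cargo\n219000002,56,Tanker\n"
schema = infer_column_types(sample)
print(schema)
```

The second `latitude` value alone would look like an integer, so the column can only be typed correctly after all rows have been scanned. Only then can the data be read with the final types.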

### Test 2 - 2 Files - 5.4 GB

In [6]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest2]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

31817670

Results:<br>
Tasks - 93<br>
Input - 10.8 GiB<br>
Time - 13.4 s <br>
Note: The input reported in the Spark UI is larger than the combined size of the files used; in these tests it is roughly twice as large. The Spark UI reports the bytes read by the tasks rather than the file size on disk, and with `inferSchema=True` each file is most likely read twice: once to infer the schema and once for the actual count.

### Test 3 - 5 Files - 13.32 GB

In [7]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Tasks - 226<br>
Input - 26.5 GiB<br>
Time - 32.9 s <br>

### Test 4 - 10 Files - 26.46 GB

In [8]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest4]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

156044574

Results:<br>
Tasks - 451<br>
Input - 52.9 GiB<br>
Time - 1 m 7 s <br>

### Conclusion General Scalability

In these tests, the scalability of Spark was demonstrated using our project's data and a simplified version of our project's code. For every test, the number of tasks created and executed by Spark, the size of the input data, and the execution time were noted. The tests were conducted in Spark's local mode. Comparing the individual tests shows that as the amount of data increases linearly, so do the input size, the number of Spark tasks, and the execution time. This behaviour is one of the things that makes Spark a good tool for Big Data use cases, since an execution time that grows only linearly with the data volume is desired.
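
The linear relationship can be checked with a few lines of arithmetic on the numbers recorded in Tests 1-4. The following is a minimal sketch; the dictionary `results` simply copies the values from the test headings and result lists above, and `per_unit_metrics` is a hypothetical helper name.

```python
# Values copied from Tests 1-4:
# (files, file size in GB, rows, tasks, input in GiB, time in seconds)
results = {
    "Test 1": (1, 2.63, 15_512_927, 46, 5.3, 9.4),
    "Test 2": (2, 5.4, 31_817_670, 93, 10.8, 13.4),
    "Test 3": (5, 13.32, 78_022_287, 226, 26.5, 32.9),
    "Test 4": (10, 26.46, 156_044_574, 451, 52.9, 67.0),  # 1 m 7 s
}


def per_unit_metrics(files, size_gb, rows, tasks, input_gib, seconds):
    """Normalise the raw results so that the tests become comparable."""
    return {
        "tasks_per_file": tasks / files,
        "input_per_file_size": input_gib / size_gb,
        "rows_per_second": rows / seconds,
    }


metrics = {name: per_unit_metrics(*values) for name, values in results.items()}

for name, m in metrics.items():
    print(
        f"{name}: {m['tasks_per_file']:.1f} tasks/file, "
        f"{m['input_per_file_size']:.2f}x input/file size, "
        f"{m['rows_per_second'] / 1e6:.2f} M rows/s"
    )
```

With the recorded numbers, the task count stays at about 45 to 46 per file and the reported input is about twice the file size in every test. Throughput is about 2.3 to 2.4 million rows per second for Tests 2-4. Test 1 is lower at about 1.65 million rows per second, which may reflect fixed startup overhead that weighs more heavily on a short run.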

### Tests Hardware Scalability - CPUs

### Test 5 - 2 Cores

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[2]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:03:04 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:03:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:03:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:03:24 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 1m 30.8s

### Test 6 - 4 Cores

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:05:27 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:05:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:05:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:05:38 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 49.6s

### Test 7 - 8 Cores

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[8]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:10:00 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:10:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:10:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:10:12 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 36.1s

### Tests Hardware Scalability - Memory

### Test 8 - 4GB RAM

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:14:20 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:14:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:14:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:14:31 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 52.5s

### Test 9 - 8GB RAM

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","8g") \
    .config("spark.driver.memory","8g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:16:05 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:16:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:16:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:16:17 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 47.5s

### Test 10 - 16GB RAM

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","16g") \
    .config("spark.driver.memory","16g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:20:27 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:20:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:20:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:20:40 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

25/02/01 05:00:08 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 228694 ms exceeds timeout 120000 ms
25/02/01 05:00:08 WARN SparkContext: Killing executors is not supported by current scheduler.
25/02/01 05:00:14 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:56)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:310)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:36)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.driverEndpoint$lzycompute(BlockManagerMasterEndpoint.scala:124)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$

Results:<br>
Time - 50s

### Conclusion Hardware Scalability
All previous tests executed the simplified Spark code with the same amount of data (5 files) but different hardware configurations. The first three tests (5-7) all ran with 4 GB of RAM but with different numbers of cores (2, 4, 8). These tests show that increasing the number of cores significantly improves the application's performance: the execution time dropped from 1m 30.8s with 2 cores to 49.6s with 4 cores and 36.1s with 8 cores. This is most likely because increasing the number of cores also increases the potential for parallelization.<br>
The next three tests (8-10) all ran with 4 cores but with different amounts of memory (4 GB, 8 GB, 16 GB). In these tests, the increase in memory had an extremely small impact on the application's performance (52.5s, 47.5s, and 50s). This is most likely because the test application is relatively simple, and 4 GB was already sufficient for the required tasks. This suggests that, for a workload like this one, an individual node does not need large amounts of memory.
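
To put numbers on these observations, the recorded times can be converted into speedup and parallel efficiency. The sketch below is plain arithmetic on the results of Tests 5-10. It uses the 2-core run as the baseline, which is an assumption made here because no single-core run was measured.

```python
# Execution times in seconds from Tests 5-7 (5 files, 4 GB RAM).
core_times = {2: 90.8, 4: 49.6, 8: 36.1}  # 1m 30.8s = 90.8 s

baseline_cores = 2  # assumption: no 1-core measurement exists
baseline_time = core_times[baseline_cores]

scaling = {}
for cores, seconds in core_times.items():
    speedup = baseline_time / seconds
    ideal = cores / baseline_cores  # speedup under perfect scaling
    scaling[cores] = {"speedup": speedup, "efficiency": speedup / ideal}
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {speedup / ideal:.0%}")

# Execution times in seconds from Tests 8-10 (5 files, 4 cores), keyed by GB of RAM.
memory_times = {4: 52.5, 8: 47.5, 16: 50.0}
memory_spread = (max(memory_times.values()) - min(memory_times.values())) / min(
    memory_times.values()
)

# Tests 6 and 8 use the same configuration (4 cores, 4 GB),
# so their difference gives a rough idea of run-to-run variation.
run_to_run = abs(memory_times[4] - core_times[4]) / core_times[4]

print(f"Spread across memory settings: {memory_spread:.1%}")
print(f"Difference between identical configurations: {run_to_run:.1%}")
```

Relative to 2 cores, 4 cores give a speedup of about 1.8x and 8 cores about 2.5x, so the gain per additional core shrinks as cores are added. The spread across the three memory settings (about 10%) is of the same order as the difference between the two runs with identical settings (about 6%), which supports reading the memory results as having no clear effect.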

### Tests Hardware Comparison Pandas - Spark

### Test 11 - Comparison with 2 Files

### Pandas:

In [6]:

dataframes = [pd.read_csv(path) for path in pathsTest2]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Get the number of rows  
row_count = len(combined_df)
print(row_count)


Memory Usage: 43.9%
Memory Usage: 64.2%
31817670


Results:<br>
Time - 48.8s<br>
Memory Usage Read - 43.9%<br>
Memory Usage Union - 64.2%

### Spark:

In [7]:
spark = SparkSession.builder \
    .appName("AIS Comparison Analysis") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/03 21:59:32 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.113 instead (on interface en0)
25/02/03 21:59:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/03 21:59:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [9]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest2]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

combined_df.count()

                                                                                

Memory Usage: 30.5%
Memory Usage: 30.5%


                                                                                

31817670

Results:<br>
Time - 10.2s<br>
Memory Usage Read - 30.5%<br>
Memory Usage Union - 30.5%

### Test 12 - Comparison with 5 Files

### Pandas:

In [10]:
dataframes = [pd.read_csv(path) for path in pathsTest3]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Get the number of rows  
row_count = len(combined_df)
print(row_count)

Memory Usage: 54.7%
Memory Usage: 82.4%
78022287


Results:<br>
Time - 2m 2.6s (122.6s)<br>
Memory Usage Read - 54.7%<br>
Memory Usage Union - 82.4%

### Spark:

In [11]:
spark = SparkSession.builder \
    .appName("AIS Comparison Analysis") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

In [12]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

combined_df.count()

                                                                                

Memory Usage: 54.3%
Memory Usage: 31.2%


                                                                                

78022287

Results:<br>
Time - 10.2s<br>
Memory Usage Read - 54.3%<br>
Memory Usage Union - 31.2%

### Conclusion Comparison Pandas - Spark

In both tests, the Spark implementation was significantly faster than the Pandas version (Test 11: 38.6s faster; Test 12: 77.7s faster). The difference roughly doubled from Test 11 to Test 12, in line with the increase in data volume. Notably, Spark's processing time for the test with five files was still shorter than Pandas' time for just two files. Additionally, in both tests, Pandas exhibited consistently higher memory usage while reading and merging the DataFrames. These results demonstrate that, even for data volumes like those used in these tests, Spark is generally the superior (i.e., faster and more memory-efficient) solution.