# Scalability

This notebook aims to answer general questions about Spark’s scalability and its comparison with Pandas. It is structured into three sections:<br>

1. General Scalability of Spark (Tests 1-4)
    These tests are designed to demonstrate Spark's behavior with varying data volumes.
2. Hardware Scalability of Spark (Tests 5-10)
    These tests demonstrate the impact of different hardware configurations on Spark's performance.
3. Direct Comparison with Pandas (Tests 11-12)
    These tests directly compare Pandas and Spark across different data volumes, analyzing their impact on execution time and hardware usage.

To simplify the problem, the code executed here represents only a portion of our overall use case. In all tests, data files are read and then combined into a single DataFrame. To verify correctness, the number of rows in the DataFrame is printed after each test. Any discrepancies in row count may indicate errors in data processing.<br>

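Note: To properly test Spark setups with different configurations, stop the old Spark session (`spark.stop()`) before executing a cell that contains a Spark session builder. Otherwise `getOrCreate()` returns the existing session, and Spark logs the warning "Using an existing Spark session; only runtime SQL configurations will take effect."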
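The note above can be sketched as a small helper. The name `fresh_session` is hypothetical and not part of PySpark; it only assumes that the session object has a `stop()` method and that the builder has `getOrCreate()`, as `SparkSession` and `SparkSession.builder` do.

```python
def fresh_session(builder, existing=None):
    """Stop `existing` (if given), then build a session from `builder`.

    `builder` is expected to behave like SparkSession.builder and
    `existing` like a SparkSession; both are passed in by the caller.
    """
    if existing is not None:
        existing.stop()  # release the old session so the new settings can apply
    return builder.getOrCreate()
```

In the notebook this could be called as `spark = fresh_session(SparkSession.builder.appName("AIS Hardware Analysis").master("local[2]"), existing=spark)`. JVM-level settings such as `spark.driver.memory` are read when the driver JVM starts, so changing them may additionally require restarting the notebook kernel.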

In [1]:
#import findspark
from pyspark.sql import SparkSession
import os
from pyspark.sql.functions import col, to_timestamp, count, lit
import pandas as pd
import psutil

In [2]:
local_storage_path = "./data/csv_files"
os.makedirs(local_storage_path, exist_ok=True)  # Create the directory if it does not exist

In [3]:
spark = SparkSession.builder \
    .appName("AIS Data Analysis") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:52:40 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.113 instead (on interface en0)
25/02/04 00:52:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/04 00:52:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/02/04 00:52:41 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [4]:
#Note: If you encounter any errors related to file paths, ensure that the placement of "/" symbols is correct.

pathsTest1 = [
    local_storage_path + "/aisdk-2024-03-01.csv"
] 

pathsTest2 = [
    local_storage_path + "/aisdk-2024-03-01.csv",
    local_storage_path + "/aisdk-2024-03-02.csv"
]

pathsTest3 = [
    local_storage_path + "/aisdk-2024-03-01.csv",
    local_storage_path + "/aisdk-2024-03-02.csv",
    local_storage_path + "/aisdk-2024-03-03.csv",
    local_storage_path + "/aisdk-2024-03-04.csv",
    local_storage_path + "/aisdk-2024-03-05.csv"
]

pathsTest4 = [
    local_storage_path + "/aisdk-2024-03-01.csv",
    local_storage_path + "/aisdk-2024-03-02.csv",
    local_storage_path + "/aisdk-2024-03-03.csv",
    local_storage_path + "/aisdk-2024-03-04.csv",
    local_storage_path + "/aisdk-2024-03-05.csv",
    local_storage_path + "/aisdk-2024-03-01.csv",
    local_storage_path + "/aisdk-2024-03-02.csv",
    local_storage_path + "/aisdk-2024-03-03.csv",
    local_storage_path + "/aisdk-2024-03-04.csv",
    local_storage_path + "/aisdk-2024-03-05.csv"
]
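The four path lists follow one pattern: consecutive daily files, with Test 4 repeating the five days twice. A minimal sketch of generating them, assuming the `aisdk-YYYY-MM-DD.csv` naming shown above; `daily_paths` is a hypothetical helper, and `os.path.join` sidesteps the "/" placement issue mentioned in the cell comment.

```python
import os
from datetime import date, timedelta

def daily_paths(base_dir, first_day, n_days, repeat=1):
    """Return paths like <base_dir>/aisdk-YYYY-MM-DD.csv for n_days consecutive days.

    `repeat` duplicates the whole list, as pathsTest4 does to double the volume.
    """
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    paths = [os.path.join(base_dir, f"aisdk-{d.isoformat()}.csv") for d in days]
    return paths * repeat

# Equivalent of pathsTest4: five days, listed twice
example = daily_paths("./data/csv_files", date(2024, 3, 1), 5, repeat=2)
print(len(example))
print(example[0])
```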

### Tests General Scalability

### Test 1 - One File - 2.63 GB

In [5]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest1]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

15512927

Results:<br>
Tasks - 46<br>
Input - 5.3 GiB<br>
Time - 9.4 s<br>
Note: The parameter `inferSchema=True` results in additional jobs, because Spark has to pass over the file once to identify the data structure. This leads to a slightly higher number of jobs than expected.
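Every Spark test in this notebook repeats the same read, union, and count steps. As a sketch, they could be wrapped in one function; `load_and_count` is a hypothetical name, and the function uses only the `spark.read.csv`, `union`, and `count` calls already shown above.

```python
from functools import reduce

def load_and_count(spark, paths):
    """Read each CSV, union the DataFrames, and return the total row count."""
    frames = [spark.read.csv(p, header=True, inferSchema=True) for p in paths]
    # reduce() folds the list pairwise; a single-element list is returned as-is
    combined = reduce(lambda left, right: left.union(right), frames)
    return combined.count()
```

A call such as `load_and_count(spark, pathsTest1)` would then replace the loop in each test cell. An empty `paths` list raises a `TypeError` from `reduce`, which is acceptable here because every test uses at least one file.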

### Test 2 - 2 Files - 5.4 GB

In [6]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest2]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/04 00:52:52 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

31817670

Results:<br>
Tasks - 93<br>
Input - 10.8 GiB<br>
Time - 13.4 s <br>
Note: The input reported in the Spark UI is about twice the size of the files on disk. The Spark UI reports the bytes read by the tasks rather than the file size, and with `inferSchema=True` each file is most likely read twice: once to infer the schema and once for the count.

### Test 3 - 5 Files - 13.32 GB

In [7]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Tasks - 226<br>
Input - 26.5 GiB<br>
Time - 32.9 s <br>

### Test 4 - 10 Files - 26.46 GB

In [8]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest4]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

156044574

Results:<br>
Tasks - 451<br>
Input - 52.9 GiB<br>
Time - 1 m 7 s <br>

### Conclusion General Scalability

In these tests, Spark's scalability was demonstrated using our project's data and a simplified version of our project's code. For every test, we recorded the number of tasks Spark created and executed, the input size, and the execution time. All tests ran in Spark's local mode. Comparing the results shows that as the amount of data increases, the input size, the number of Spark tasks, and the execution time increase approximately linearly. This behaviour is one reason Spark is well suited to Big Data use cases: execution time that grows linearly with data volume is predictable, which is the desired scaling behaviour.
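The linearity claim can be checked by normalising each result by its data volume. The sketch below uses only the numbers reported in Tests 1-4; note that the input ratio divides GiB (Spark UI) by GB (file size), so it is only approximate.

```python
# (test, file size in GB, tasks, Spark UI input in GiB, time in s), copied from Tests 1-4
results = [
    (1, 2.63, 46, 5.3, 9.4),
    (2, 5.40, 93, 10.8, 13.4),
    (3, 13.32, 226, 26.5, 32.9),
    (4, 26.46, 451, 52.9, 67.0),  # 1 m 7 s
]

def per_gb(rows):
    """Return (test, seconds per GB, tasks per GB, input/file-size ratio) per test."""
    return [(t, secs / gb, tasks / gb, inp / gb) for t, gb, tasks, inp, secs in rows]

for test, s_per_gb, tasks_per_gb, ratio in per_gb(results):
    print(f"Test {test}: {s_per_gb:.2f} s/GB, {tasks_per_gb:.1f} tasks/GB, input ratio {ratio:.2f}")
```

Tasks per GB and the input ratio stay nearly constant across all four tests (about 17 tasks per GB and a ratio of about 2). Seconds per GB is close to 2.5 for Tests 2-4 and higher for Test 1, which would be consistent with a fixed startup cost that matters more for small inputs.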

### Tests Hardware Scalability - CPUs

### Test 5 - 2 Cores

In [9]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[2]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:54:48 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [10]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Time - 1m 30.8s

### Test 6 - 4 Cores

In [11]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:55:24 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [12]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Time - 49.6s

### Test 7 - 8 Cores

In [13]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[8]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:56:00 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [14]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Time - 36.1s

### Tests Hardware Scalability - Memory

### Test 8 - 4GB RAM

In [15]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:56:36 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [16]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Time - 52.5s

### Test 9 - 8GB RAM

In [17]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","8g") \
    .config("spark.driver.memory","8g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:57:12 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [18]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Time - 47.5s

### Test 10 - 16GB RAM

In [19]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","16g") \
    .config("spark.driver.memory","16g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:57:50 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [20]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Time - 50s

### Conclusion Hardware Scalability
All previous tests executed the simplified Spark code with the same amount of data (5 files) but different hardware configurations. Tests 5-7 all ran with 4 GB of RAM but with different numbers of cores (2, 4, 8). They show that increasing the number of cores significantly improves performance, although not proportionally: going from 2 to 8 cores reduced the time from 90.8 s to 36.1 s. This is most likely because more cores allow more tasks to run in parallel.<br>
Tests 8-10 all ran with 4 cores but with different amounts of memory (4 GB, 8 GB, 16 GB). Here, the additional memory had very little impact on performance (47.5 s to 52.5 s). This is most likely because the test application is relatively simple and 4 GB was already sufficient for the required tasks. For workloads like this one, this suggests that an individual node does not need large amounts of memory.
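To quantify the effect of additional cores, the reported times can be converted into speedup and parallel efficiency relative to the 2-core run. This is a sketch based only on the times from Tests 5-10; `speedup_table` is a hypothetical helper.

```python
# Execution times in seconds, copied from Tests 5-7 (cores) and Tests 8-10 (memory)
core_times = {2: 90.8, 4: 49.6, 8: 36.1}
memory_times = {"4g": 52.5, "8g": 47.5, "16g": 50.0}

def speedup_table(times, baseline):
    """Return {cores: (speedup, efficiency)} relative to the baseline core count."""
    base_time = times[baseline]
    table = {}
    for cores, secs in sorted(times.items()):
        speedup = base_time / secs
        # efficiency of 1.0 would mean the speedup matches the increase in cores
        efficiency = speedup / (cores / baseline)
        table[cores] = (speedup, efficiency)
    return table

for cores, (speedup, eff) in speedup_table(core_times, baseline=2).items():
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {eff:.0%}")

spread = max(memory_times.values()) - min(memory_times.values())
print(f"Memory tests differ by at most {spread:.1f} s")
```

Doubling from 2 to 4 cores gives a speedup of roughly 1.8x, while 8 cores reach roughly 2.5x, so the benefit per added core shrinks. The three memory configurations differ by only 5 s.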

### Tests Hardware Comparison Pandas - Spark

### Test 11 - Comparison with 2 Files

### Pandas:

In [21]:

dataframes = [pd.read_csv(path) for path in pathsTest2]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Get the number of rows  
row_count = len(combined_df)
print(row_count)


Memory Usage: 54.5%
Memory Usage: 78.9%
31817670


Results:<br>
Time - 48.8s<br>
Memory Usage Read - 43.9%<br>
Memory Usage Union - 64.2%
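The notebook does not show how the time values were captured. One possible approach, given here as an assumption rather than the method actually used, is a small wrapper around `time.perf_counter`. The hypothetical `measure` helper also accepts an optional memory probe, for example `lambda: psutil.virtual_memory().percent`.

```python
import time

def measure(step, memory_probe=None):
    """Run step() and return (result, elapsed seconds, memory reading or None)."""
    start = time.perf_counter()
    result = step()
    elapsed = time.perf_counter() - start
    # take the memory reading right after the step has finished
    reading = memory_probe() if memory_probe is not None else None
    return result, elapsed, reading

# Stand-in workload; in the notebook the step would be the read/concat/count code
rows, seconds, _ = measure(lambda: sum(range(1000)))
print(rows, f"{seconds:.4f}s")
```

For Spark, the step should include the `count()` action, because transformations such as `union` are evaluated lazily and do no work until an action runs.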

### Spark:

In [22]:
spark = SparkSession.builder \
    .appName("AIS Comparison Analysis") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/04 00:59:13 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [23]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest2]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

combined_df.count()

                                                                                

Memory Usage: 60.1%
Memory Usage: 43.5%


                                                                                

31817670

Results:<br>
Time - 10.2s<br>
Memory Usage Read - 30.5%<br>
Memory Usage Union - 30.5%

### Test 12 - Comparison with 5 Files

### Pandas:

In [24]:
dataframes = [pd.read_csv(path) for path in pathsTest3]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Get the number of rows  
row_count = len(combined_df)
print(row_count)

Memory Usage: 55.7%
Memory Usage: 82.8%
78022287


Results:<br>
Time - 2m 2.6s (122.6s)<br>
Memory Usage Read - 54.7%<br>
Memory Usage Union - 82.4%

### Spark:

In [25]:
spark = SparkSession.builder \
    .appName("AIS Comparison Analysis") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

In [26]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

memory_info = psutil.virtual_memory()
print(f"Memory Usage: {memory_info.percent}%")

combined_df.count()

                                                                                

Memory Usage: 52.7%
Memory Usage: 28.0%


                                                                                

78022287

Results:<br>
Time - 10.2s<br>
Memory Usage Read - 54.3%<br>
Memory Usage Union - 31.2%

### Conclusion Comparison Pandas - Spark

In both tests, the Spark implementation was significantly faster than the Pandas version (Test 11: 38.6 s faster; Test 12: 77.7 s faster). The time difference roughly doubled between the two tests as the data volume grew from two to five files. Notably, Spark's processing time for the test with five files was still shorter than Pandas' time for just two files. Additionally, in both tests, Pandas exhibited consistently higher memory usage while reading and merging the DataFrames. However, memory usage varied considerably between runs, and `psutil.virtual_memory()` reports system-wide memory, so these figures are only indicative. These results suggest that, already for data volumes similar to those used in this test, Spark is generally the superior (i.e., faster and more memory-efficient) solution for this workload.