In [1]:
#import findspark
from pyspark.sql import SparkSession
import os
from pyspark.sql.functions import col, to_timestamp, count, lit

In [2]:
local_storage_path = "./data/csv_files"
os.makedirs(local_storage_path, exist_ok=True)  # Create the directory if it does not exist

In [3]:
spark = SparkSession.builder \
    .appName("AIS Data Analysis") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 01:51:10 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 01:51:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 01:51:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
pathsTest1 = [
    "./data/csv_files/aisdk-2024-03-01.csv"
]  # TODO: unfinished, the paths here still need to be changed to our new ones

pathsTest2 = [
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv"
]

pathsTest3 = [
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv",
    "./data/csv_files/aisdk-2024-03-03.csv",
    "./data/csv_files/aisdk-2024-03-04.csv",
    "./data/csv_files/aisdk-2024-03-05.csv"
]

pathsTest4 = [
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv",
    "./data/csv_files/aisdk-2024-03-03.csv",
    "./data/csv_files/aisdk-2024-03-04.csv",
    "./data/csv_files/aisdk-2024-03-05.csv",
    "./data/csv_files/aisdk-2024-03-01.csv",
    "./data/csv_files/aisdk-2024-03-02.csv",
    "./data/csv_files/aisdk-2024-03-03.csv",
    "./data/csv_files/aisdk-2024-03-04.csv",
    "./data/csv_files/aisdk-2024-03-05.csv"
]
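
The path lists above follow a regular pattern, so they could also be generated instead of written by hand. The following is a minimal sketch: the helper name `build_paths` and its parameters are hypothetical, and it assumes the daily files keep the `aisdk-YYYY-MM-DD.csv` naming used above.

```python
from datetime import date, timedelta


def build_paths(first_day, n_days, repeat=1, base_dir="./data/csv_files"):
    """Return CSV paths for n_days consecutive days, repeated `repeat` times."""
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    paths = [f"{base_dir}/aisdk-{d.isoformat()}.csv" for d in days]
    # Repeating the list mirrors pathsTest4, which reads the same 5 files twice.
    return paths * repeat


paths_test3 = build_paths(date(2024, 3, 1), 5)
paths_test4 = build_paths(date(2024, 3, 1), 5, repeat=2)
print(len(paths_test3), len(paths_test4))
```

Generating the lists keeps the four test inputs consistent with each other when the file location changes.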

### Tests General Scalability

### Test 1 - One File - 2.63 GB

In [5]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest1]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 01:51:26 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

15512927

Results:<br>
Tasks - 46<br>
Input - 5.3 GiB<br>
Time - 9.4 s<br>
Note: The `inferSchema` option results in additional jobs, because Spark makes an extra pass over the file to identify the data types of its columns. This leads to a slightly higher number of jobs than expected.
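
The extra inference pass can be avoided by passing a schema to the reader. The sketch below is one possible way to do that, not the code used for the measurements: the helper names `build_ddl` and `read_csv_with_schema` are hypothetical, and the column list is a placeholder that would have to be replaced by the real columns of the AIS files, in file order. It relies on `spark.read.csv` accepting a DDL-formatted string for `schema`.

```python
def build_ddl(columns):
    """Join (name, type) pairs into a DDL schema string; names are back-quoted."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


def read_csv_with_schema(spark, paths, columns):
    """Read CSV files with a fixed schema, so no schema-inference pass is needed.

    `spark` is an existing SparkSession; `paths` is one path or a list of paths.
    """
    return spark.read.csv(paths, header=True, schema=build_ddl(columns))


# Hypothetical column list, for illustration only: the real list must name
# every column of the AIS files, in the order they appear in the file.
example_columns = [("MMSI", "INT"), ("Latitude", "DOUBLE"), ("Longitude", "DOUBLE")]
print(build_ddl(example_columns))
```

With a fixed schema, the job count in the Spark UI should correspond more directly to the `count()` action alone.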

### Test 2 - 2 Files - 5.4 GB

In [6]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest2]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

31817670

Results:<br>
Tasks - 93<br>
Input - 10.8 GiB<br>
Time - 13.4 s <br>
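Note: The input shown in the Spark UI is larger than the files on disk, roughly twice their size. The Spark UI does not report the file size but the amount of data read across all jobs. The most likely explanation for the factor of two is that `inferSchema=True` makes Spark scan each file once to infer the schema and a second time for the count, rather than overhead in Spark's internal data structures.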
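
The relationship between file size and the Input figure can be checked with a little arithmetic on the numbers reported so far. This is only a rough sketch: it treats the sizes in the test headings and the Spark UI figures as the same unit, which is an assumption, since the headings say GB and the UI says GiB.

```python
# Sizes from the Test 1 and Test 2 headings and their Spark UI results.
# Assumption: both figures are treated as the same unit, so the ratios
# are approximate.
tests = {
    "Test 1": {"file_size": 2.63, "ui_input": 5.3},
    "Test 2": {"file_size": 5.4, "ui_input": 10.8},
}

ratios = {name: round(t["ui_input"] / t["file_size"], 2) for name, t in tests.items()}
for name, ratio in ratios.items():
    print(f"{name}: UI input is {ratio}x the file size")
```

Both ratios come out close to 2. That would fit the reading that each file is scanned twice when `inferSchema=True` is set, but this is an interpretation of the numbers, not something the measurements prove.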

### Test 3 - 5 Files - 13.32 GB

In [7]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

78022287

Results:<br>
Tasks - 226<br>
Input - 26.5 GiB<br>
Time - 32.9 s <br>

### Test 4 - 10 Files - 26.46 GB

In [8]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest4]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

                                                                                

156044574

Results:<br>
Tasks - 451<br>
Input - 52.9 GiB<br>
Time - 1 m 7 s <br>

### Conclusion General Scalability

These tests demonstrate the scalability of Spark using our project's data and a simplified version of our project's code. For every test we noted the number of tasks Spark created and executed, the size of the input data, and the execution time. All tests were run in Spark's local mode. Comparing the results shows that as the amount of data grows from 1 to 10 files, the input size and the number of tasks grow in proportion, and the execution time grows roughly linearly (9.4 s, 13.4 s, 32.9 s and 1 m 7 s for 1, 2, 5 and 10 files). This behaviour is one of the things that makes Spark a good tool for Big Data use cases, since a run time that scales linearly with the data volume is the desired outcome.
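
One way to make the linear trend visible is to divide each metric by the number of files. The sketch below does this with the values recorded in Tests 1 to 4; the function name `per_file` is a hypothetical helper, and 1 m 7 s is entered as 67 seconds.

```python
# Measurements copied from Tests 1-4: (files, tasks, input in GiB, seconds).
measurements = [
    (1, 46, 5.3, 9.4),
    (2, 93, 10.8, 13.4),
    (5, 226, 26.5, 32.9),
    (10, 451, 52.9, 67.0),
]


def per_file(measurements):
    """Normalise every metric by the number of files read in that test."""
    rows = []
    for files, tasks, input_gib, seconds in measurements:
        rows.append({
            "files": files,
            "tasks_per_file": round(tasks / files, 2),
            "gib_per_file": round(input_gib / files, 2),
            "seconds_per_file": round(seconds / files, 2),
        })
    return rows


rows = per_file(measurements)
for row in rows:
    print(row)
```

If scaling is linear, the per-file values stay roughly constant. Tasks and input per file do so across all four tests. The time per file is higher in Test 1 than in Tests 2 to 4, which may reflect a fixed start-up cost that weighs more on the smallest run.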

### Tests Hardware Scalability - CPUs

### Test 5 - 2 Cores

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[2]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:03:04 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:03:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:03:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:03:24 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 1m 30.8s

### Test 6 - 4 Cores

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:05:27 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:05:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:05:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:05:38 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 49.6s

### Test 7 - 8 Cores

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[8]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:10:00 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:10:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:10:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:10:12 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 36.1s
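
The three timings from Tests 5 to 7 can be turned into speedup and parallel efficiency figures. The following is a small sketch using only the run times recorded above; the helper name `speedup_table` is hypothetical, and the 2-core run is taken as the baseline because no single-core run was measured.

```python
# Run times from Tests 5-7 (5 files, 4 GB), in seconds, keyed by core count.
run_times = {2: 90.8, 4: 49.6, 8: 36.1}


def speedup_table(run_times):
    """Return {cores: (speedup, efficiency)} relative to the smallest core count."""
    base_cores = min(run_times)
    base_time = run_times[base_cores]
    table = {}
    for cores, seconds in sorted(run_times.items()):
        speedup = base_time / seconds
        # Efficiency: achieved speedup divided by the increase in core count.
        efficiency = speedup / (cores / base_cores)
        table[cores] = (round(speedup, 2), round(efficiency, 2))
    return table


table = speedup_table(run_times)
for cores, (speedup, efficiency) in table.items():
    print(f"{cores} cores: speedup {speedup}, efficiency {efficiency}")
```

An efficiency below 1 means that the added cores are not fully converted into shorter run time, for example because parts of the job do not parallelise or because the cores compete for disk bandwidth.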

### Tests Hardware Scalability - Memory

### Test 8 - 4GB RAM

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:14:20 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:14:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:14:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:14:31 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 52.5s

### Test 9 - 8GB RAM

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","8g") \
    .config("spark.driver.memory","8g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:16:05 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:16:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:16:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:16:17 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 47.5s

### Test 10 - 16GB RAM

In [3]:
spark = SparkSession.builder \
    .appName("AIS Hardware Analysis") \
    .master("local[4]") \
    .config("spark.executor.memory","16g") \
    .config("spark.driver.memory","16g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

25/02/01 03:20:27 WARN Utils: Your hostname, MBAO.local resolves to a loopback address: 127.0.0.1; using 192.168.0.117 instead (on interface en0)
25/02/01 03:20:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 03:20:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
dataframes = [spark.read.csv(path, header=True, inferSchema=True) for path in pathsTest3]

# Combine all DataFrames into a single large DataFrame
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.union(df)

combined_df.count()

25/02/01 03:20:40 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

78022287

Results:<br>
Time - 50s

### Conclusion Hardware Scalability
All previous tests executed the simplified Spark code with the same amount of data (5 files) but different hardware configurations. Tests 5 to 7 all ran with 4 GB of RAM but with different numbers of cores (2, 4, 8). They show that increasing the number of cores significantly improves the application's performance: the run time drops from 1 m 30.8 s with 2 cores to 49.6 s with 4 cores and 36.1 s with 8 cores. This is most likely because more cores allow more tasks to run in parallel. The gain from 4 to 8 cores is smaller than the gain from 2 to 4 cores, so the speedup is less than proportional to the number of cores.<br>
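Tests 8 to 10 all ran with 4 cores but with different amounts of memory (4 GB, 8 GB, 16 GB). In these tests, the increase in memory had an extremely small impact on the application's performance (52.5 s, 47.5 s and 50 s). These differences are in the same range as the gap between Test 6 and Test 8, which use an identical configuration (49.6 s vs. 52.5 s), so they are probably run-to-run variation. The small impact is most likely because the test application is relatively simple: it reads the files and counts rows without caching any data, so 4 GB was already sufficient. This suggests that, for a workload of this kind, an individual node does not need large amounts of memory.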
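
To judge whether the memory timings differ meaningfully, they can be compared with the gap between two runs of the same configuration. Test 6 and Test 8 both use `local[4]` with 4 GB, which gives a rough idea of run-to-run variation. The sketch below uses only the times recorded above; with a single repeat it is an indication, not a statistical test.

```python
# Run times in seconds, all with local[4] and 5 files.
memory_runs = {"4g": 52.5, "8g": 47.5, "16g": 50.0}  # Tests 8-10
repeat_4g = 49.6  # Test 6, same configuration as Test 8

# Largest difference between any two memory settings.
memory_spread = max(memory_runs.values()) - min(memory_runs.values())
# Difference between two runs of the identical 4 GB configuration.
repeat_gap = abs(memory_runs["4g"] - repeat_4g)

print(f"spread across memory sizes: {memory_spread:.1f} s")
print(f"gap between two identical 4g runs: {repeat_gap:.1f} s")
```

Repeating each configuration several times and comparing averages would give a firmer basis for the conclusion.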