# Step 1: Cluster Setup & Resource Management

**Requirement:**

- **Cluster Setup (Step 1):** Connect to the distributed cluster (1 Master, 2 Workers).

- **Resource Management (Step 4, 6):** Explicitly configure `spark.executor.memory`, `spark.executor.cores`, and `spark.sql.shuffle.partitions` to optimize performance and prevent resource contention.

**Configuration Strategy:**

- **Executor Memory:** Set to `1g` (leaving buffer for OS/Overhead within the 2GB container limit).
- **Executor Cores:** Set to `2` (utilizing the available cores per worker).
- **Shuffle Partitions:** Set to `6` (Calculation: 2 workers × 3 partitions/worker) to ensure balanced parallelism during joins/aggregations.
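
The sizing arithmetic above can be checked with a small sketch. It assumes Spark's documented default overhead rule (10% of executor memory, with a 384 MiB minimum), which is defined for YARN and Kubernetes deployments; on a standalone cluster like this one it is only a rough approximation. The helper names are hypothetical and not part of the notebook.

```python
def executor_footprint_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Estimate heap plus off-heap overhead for one executor, in MiB (assumed rule)."""
    overhead = max(int(executor_memory_mb * overhead_factor), min_overhead_mb)
    return executor_memory_mb + overhead


def shuffle_partitions(num_workers, partitions_per_worker):
    """Partition count used in this notebook: workers x partitions per worker."""
    return num_workers * partitions_per_worker


container_limit_mb = 2048  # the 2GB container limit stated above
footprint = executor_footprint_mb(1024)  # spark.executor.memory = 1g
print(f"Estimated footprint: {footprint} MiB of {container_limit_mb} MiB")
print(f"Shuffle partitions : {shuffle_partitions(2, 3)}")
```

Under these assumptions a `1g` executor stays well inside the container limit, which is the buffer the strategy refers to.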


In [None]:
from pyspark.sql import SparkSession
import socket
import time

# initialize spark session with clear resource management config
spark = (
    SparkSession.builder.appName("FlightDelay-Mandatory-Part1")
    .master("spark://spark-master:7077")
    # --- Resource Management (Req 6) ---
    .config("spark.executor.memory", "1g")  # heap memory per executor
    .config("spark.executor.cores", "2")  # cores per executor
    .config("spark.cores.max", "4")  # max total cores this app may claim
    # balanced partitions for shuffling
    .config("spark.sql.shuffle.partitions", "6")
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/16 11:19:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [None]:
# verify the connection and the effective configuration
print("=" * 50)
print("SPARK CLUSTER CONNECTION")
print("=" * 50)
print(f"Version    : {spark.version}")
print(f"Master URL       : {spark.sparkContext.master}")
print(f"App Name         : {spark.sparkContext.appName}")
print(f"Running on Node  : {socket.gethostname()}")
print("-" * 50)
print(f"Executor Memory  : {spark.conf.get('spark.executor.memory')}")
print(f"Executor Cores   : {spark.conf.get('spark.executor.cores')}")
print(f"Partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")
print("=" * 50)

SPARK CLUSTER CONNECTION
Version    : 3.5.0
Master URL       : spark://spark-master:7077
App Name         : FlightDelay-Mandatory-Part1
Running on Node  : spark-master
--------------------------------------------------
Executor Memory  : 1g
Executor Cores   : 2
Partitions: 6


In [None]:
# trigger the UI registration
spark.range(5).collect()
print("Cluster is active. Check Spark UI at http://localhost:8080")

Cluster is active. Check Spark UI at http://localhost:8080


                                                                                

# Step 2: Manual Data Splitting and Node Assignment

**Requirement:**

- **Manual Splitting (Step 2):** The dataset must be manually split into parts.
- **Node Assignment:** Worker 1, Worker 2, and the Master must each hold a specific portion of the data.

**Implementation:**
Verify that the three split files (`part1.csv`, `part2.csv`, `part3.csv`) exist in the shared volume, then define a logical mapping (`node_file_map`) that restricts each node to _only_ its assigned file during the independent processing phase.
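
The splitting script itself is not shown in this notebook. As a minimal sketch of one way such a manual split could be done, the hypothetical helper `split_csv` below writes equal-sized chunks, repeats the header in every part, and lets the last part absorb the remainder rows. The file names and the demo data are illustrative only.

```python
import csv
import os
import tempfile


def split_csv(src_path, out_dir, num_parts=3):
    """Split a CSV into num_parts files, repeating the header in each part."""
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)  # fine for a sketch; stream instead for very large files

    chunk = len(rows) // num_parts
    paths = []
    for i in range(num_parts):
        start = i * chunk
        # the last part absorbs the remainder rows
        end = (i + 1) * chunk if i < num_parts - 1 else len(rows)
        path = os.path.join(out_dir, f"part{i + 1}.csv")
        with open(path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows[start:end])
        paths.append(path)
    return paths


# tiny demo with 8 synthetic rows in a temporary directory
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "flights.csv")
with open(src, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["ArrDelay", "DepDelay"])
    w.writerows([[i, i + 1] for i in range(8)])

part_paths = split_csv(src, demo_dir)
part_sizes = []
for p in part_paths:
    with open(p, newline="") as f:
        part_sizes.append(sum(1 for _ in f) - 1)  # subtract the header line
print(part_sizes)
```

A split of this kind yields equal parts plus a slightly larger last part, which is the same shape as the per-file row counts reported by the verification cell.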


In [None]:
# each node "receives" one portion
node_file_map = {
    "spark-master": "/data/part1.csv",
    "spark-worker1": "/data/part2.csv",
    "spark-worker2": "/data/part3.csv",
}

total_rows = 0

In [None]:
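import os

# reset the counter here so that re-running this cell does not double-count
total_rows = 0

# data split verification: check that each file exists and count its rows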
for node, file_path in node_file_map.items():
    if os.path.exists(file_path):
        # row count (subtract 1 for header)
        with open(file_path, "r") as f:
            row_count = sum(1 for line in f) - 1

        file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
        print(
            f"Node: {node:<15} | File: {file_path:<15} | Rows: {row_count:<8,} | Size: {file_size_mb:.2f} MB"
        )
        total_rows += row_count
    else:
        print(f"ERROR: {file_path} not found for {node}!")

print("_" * 50)
print(f"Total Dataset Size: {total_rows:,} rows")
print("=" * 50)

Node: spark-master    | File: /data/part1.csv | Rows: 201,921  | Size: 21.57 MB
Node: spark-worker1   | File: /data/part2.csv | Rows: 201,921  | Size: 21.55 MB
Node: spark-worker2   | File: /data/part3.csv | Rows: 201,923  | Size: 21.56 MB
__________________________________________________
Total Dataset Size: 605,765 rows


# Step 3: Distributed Loading, Preprocessing & Optimization

**Requirement:**

- **Data Loading (Step 3):** Load data from all nodes and combine into a unified DataFrame.
- **Spark SQL Processing (Step 4):** Use SQL queries for cleaning and feature engineering.
- **Resource Management (Step 6):** Adjust shuffle partitions and cache data for performance.

**Implementation:**

1.  **Load:** Read `/data/part*.csv` with a single glob pattern. The driver plans the read and the executors load the file parts in parallel.
2.  **Clean:** Register a temp view `flights_raw` and use Spark SQL to cast columns and filter out cancelled and diverted flights.
3.  **Optimize:** Repartition the DataFrame into `6` partitions (matching the shuffle configuration) and call `.cache()` to keep it in executor memory.
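
To make the cleaning rule in step 2 concrete without a cluster, the sketch below runs the same `WHERE` conditions on a toy table. It uses Python's built-in `sqlite3` as a stand-in, not Spark SQL, and the five rows are invented for illustration. Casting behaviour for malformed strings differs between the two engines, so only the filtering logic should be read from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights_raw "
    "(ArrDelay TEXT, DepDelay TEXT, Cancelled INTEGER, Diverted INTEGER)"
)
conn.executemany(
    "INSERT INTO flights_raw VALUES (?, ?, ?, ?)",
    [
        ("12", "5", 0, 0),   # kept
        (None, None, 1, 0),  # cancelled: dropped
        (None, "7", 0, 1),   # diverted: dropped
        (None, "3", 0, 0),   # missing target: dropped
        ("-4", "0", 0, 0),   # kept
    ],
)

cleaned = conn.execute(
    """
    SELECT CAST(ArrDelay AS DOUBLE) AS ArrDelay,
           CAST(DepDelay AS DOUBLE) AS DepDelay
    FROM flights_raw
    WHERE Cancelled = 0
      AND Diverted = 0
      AND ArrDelay IS NOT NULL
      AND DepDelay IS NOT NULL
    """
).fetchall()
print(cleaned)
conn.close()
```

Only the two complete, non-cancelled, non-diverted rows survive, and their delay columns come back as floating-point values.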


In [None]:
import time

print("=" * 50)
print("DISTRIBUTED PREPROCESSING")
print("=" * 50)

start_time = time.time()

# distributed loading (read part1, part2, part3 in parallel)
df_raw = spark.read.csv("/data/part*.csv", header=True, inferSchema=True)

# Spark SQL Preprocessing
df_raw.createOrReplaceTempView("flights_raw")

# performing casting and filtering (Requirement: "Data Partitioning and Processing")
df_clean = spark.sql(
    """
    SELECT 
        -- target
        CAST(ArrDelay AS DOUBLE) AS ArrDelay,
        
        -- numeric features 
        CAST(DepDelay AS DOUBLE) AS DepDelay,
        CAST(Distance AS DOUBLE) AS Distance,
        CAST(TaxiOut AS DOUBLE) AS TaxiOut,
        CAST(CRSElapsedTime AS DOUBLE) AS CRSElapsedTime,
        
        -- categorical features
        DayOfWeek,
        UniqueCarrier,
        Origin,
        Dest
        
    FROM flights_raw
    WHERE Cancelled = 0 
      AND Diverted = 0
      AND ArrDelay IS NOT NULL
      AND DepDelay IS NOT NULL
"""
)

# repartition & Cache (Requirement: "Resource Management")
# repartition to 6 to match our shuffle config
df_processed = df_clean.repartition(6).cache()

# trigger action to materialize cache and count rows
row_count = df_processed.count()
duration = time.time() - start_time

print("_" * 50)
print(f"Input Partitions      : {df_raw.rdd.getNumPartitions()}")
print(f"Output Partitions     : {df_processed.rdd.getNumPartitions()}")
print(f"Final Row Count       : {row_count:,}")
print(f"Duration           : {duration:.2f} seconds")
print(f"Is Cached?            : {df_processed.is_cached}")
print("=" * 50)

df_processed.printSchema()

DISTRIBUTED PREPROCESSING


25/12/16 12:04:26 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

__________________________________________________
Input Partitions      : 4
Output Partitions     : 6
Final Row Count       : 587,130
Duration           : 9.97 seconds
Is Cached?            : True
root
 |-- ArrDelay: double (nullable = true)
 |-- DepDelay: double (nullable = true)
 |-- Distance: double (nullable = true)
 |-- TaxiOut: double (nullable = true)
 |-- CRSElapsedTime: double (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)



                                                                                

# Step 4: Distributed ML Pipeline with Classification Evaluation

**Requirement:**

- **Models (Step 5):** Implement Machine Learning models (Linear Regression and Random Forest).
- **Evaluation (Step 5):** Evaluate performance using **Accuracy, Precision, and F1-score**.

**Implementation Strategy:**

1.  **Feature Engineering:**

    - Convert `Origin`/`Dest` (high cardinality) and `UniqueCarrier` into numerical vectors using `StringIndexer` and `OneHotEncoder`.
    - Assemble with numeric features (`DepDelay`, `Distance`, etc.).

2.  **Distributed Training:** Train two regression models to predict the exact delay in minutes.

3.  **Metric Calculation (The Threshold Strategy):**
    - The models predict _time_ (continuous), but the requirements ask for _classification metrics_ (Accuracy/F1), so a **15-minute threshold** is applied to both the actual and the predicted delay.
    - **Logic:** If `delay >= 15` minutes, assign Class 1 (Late); otherwise Class 0 (On Time).
      - Accuracy, Weighted Precision, and Weighted F1-Score are then computed on this binary classification.
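
The threshold strategy can be illustrated in plain Python. The sketch below is not the notebook's evaluation code: it re-implements, under the usual definitions, what the `accuracy`, `weightedPrecision`, and `f1` metrics compute, where each per-class score is weighted by that class's share of the true labels. The delay values are hypothetical.

```python
def to_class(delay_minutes, threshold=15.0):
    """Map a delay in minutes to 1.0 (late) or 0.0 (on time)."""
    return 1.0 if delay_minutes >= threshold else 0.0


def weighted_metrics(labels, preds):
    """Return (accuracy, weighted precision, weighted F1) for two class lists."""
    total = len(labels)
    pairs = list(zip(labels, preds))
    accuracy = sum(y == p for y, p in pairs) / total
    w_precision = 0.0
    w_f1 = 0.0
    for cls in sorted(set(labels)):
        tp = sum(y == cls and p == cls for y, p in pairs)
        fp = sum(y != cls and p == cls for y, p in pairs)
        fn = sum(y == cls and p != cls for y, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = (tp + fn) / total  # share of true labels that belong to this class
        w_precision += weight * precision
        w_f1 += weight * f1
    return accuracy, w_precision, w_f1


actual = [30.0, 2.0, 16.0, -5.0, 14.0, 40.0]     # hypothetical ArrDelay values
predicted = [25.0, 1.0, 12.0, -3.0, 18.0, 35.0]  # hypothetical model outputs
labels = [to_class(x) for x in actual]
preds = [to_class(x) for x in predicted]
acc, prec, f1 = weighted_metrics(labels, preds)
print(f"Accuracy: {acc:.4f} | Precision: {prec:.4f} | F1: {f1:.4f}")
```

Note that a prediction of 12 minutes for a 16-minute delay counts as a misclassification even though the regression error is small, so these metrics reward being on the right side of the threshold rather than being close in minutes.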


In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import col, when

# handle high-cardinality categoricals (optimization)
# keep the top 20 airports; all others become "Other" to prevent massive feature vectors
top_origins = [
    r[0]
    for r in df_processed.groupBy("Origin")
    .count()
    .orderBy(col("count").desc())
    .limit(20)
    .collect()
]
top_dests = [
    r[0]
    for r in df_processed.groupBy("Dest")
    .count()
    .orderBy(col("count").desc())
    .limit(20)
    .collect()
]

df_ml = df_processed.withColumn(
    "Origin_group",
    when(col("Origin").isin(top_origins), col("Origin")).otherwise("Other"),
).withColumn(
    "Dest_group", when(col("Dest").isin(top_dests), col("Dest")).otherwise("Other")
)

# split data
train_data, test_data = df_ml.randomSplit([0.8, 0.2], seed=42)
print(f"Train Rows: {train_data.count():,} | Test Rows: {test_data.count():,}")

# pipeline stages
stages = []
categorical_cols = ["UniqueCarrier", "Origin_group", "Dest_group"]
numeric_cols = ["DepDelay", "Distance", "TaxiOut", "CRSElapsedTime", "DayOfWeek"]

for c in categorical_cols:
    stages.append(StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep"))
    stages.append(OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec"))

assembler = VectorAssembler(
    inputCols=numeric_cols + [c + "_vec" for c in categorical_cols],
    outputCol="features",
)
stages.append(assembler)

Train Rows: 469,668 | Test Rows: 117,462


In [None]:
# train & evaluate models
models = [
    (
        "Linear Regression",
        LinearRegression(labelCol="ArrDelay", featuresCol="features", maxIter=10),
    ),
    (
        "Random Forest",
        RandomForestRegressor(
            labelCol="ArrDelay", featuresCol="features", numTrees=20, maxDepth=5
        ),
    ),
]

results = {}

# train & evaluate loop
for name, algo in models:
    print(f"\nTraining {name}...")
    pipeline = Pipeline(stages=stages + [algo])

    start_time = time.time()
    model = pipeline.fit(train_data)
    train_time = time.time() - start_time

    # make predictions
    raw_preds = model.transform(test_data)

    # CONVERT TO BINARY CLASS FOR METRICS
    # threshold: 15 minutes
    preds_binary = raw_preds.withColumn(
        "label_class", when(col("ArrDelay") >= 15, 1.0).otherwise(0.0)
    ).withColumn("pred_class", when(col("prediction") >= 15, 1.0).otherwise(0.0))

    # calculate classification metrics
    evaluator_acc = MulticlassClassificationEvaluator(
        labelCol="label_class", predictionCol="pred_class", metricName="accuracy"
    )
    evaluator_prec = MulticlassClassificationEvaluator(
        labelCol="label_class",
        predictionCol="pred_class",
        metricName="weightedPrecision",
    )
    evaluator_f1 = MulticlassClassificationEvaluator(
        labelCol="label_class", predictionCol="pred_class", metricName="f1"
    )

    acc = evaluator_acc.evaluate(preds_binary)
    prec = evaluator_prec.evaluate(preds_binary)
    f1 = evaluator_f1.evaluate(preds_binary)

    results[name] = {"Accuracy": acc, "Precision": prec, "F1": f1, "Time": train_time}
    print(f"Done. Accuracy: {acc:.4f} | F1: {f1:.4f} | Time: {train_time:.2f}s")

# comparison table
print("\n" + "=" * 60)
print(
    f"{'Model':<20} | {'Accuracy':<10} | {'Precision':<10} | {'F1-Score':<10} | {'Time (s)':<10}"
)
print("-" * 75)
for name, m in results.items():
    print(
        f"{name:<20} | {m['Accuracy']:<10.4f} | {m['Precision']:<10.4f} | {m['F1']:<10.4f} | {m['Time']:<10.2f}"
    )
print("=" * 60)


Training Linear Regression...


25/12/16 13:49:33 WARN Instrumentation: [fc1458d5] regParam is zero, which might cause numerical instability and overfitting.
25/12/16 13:49:35 WARN Instrumentation: [fc1458d5] Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.
                                                                                

Done. Accuracy: 0.9233 | F1: 0.9218 | Time: 6.44s

Training Random Forest...




Done. Accuracy: 0.9070 | F1: 0.9049 | Time: 14.07s

Model                | Accuracy   | Precision  | F1-Score   | Time (s)  
---------------------------------------------------------------------------
Linear Regression    | 0.9233     | 0.9221     | 0.9218     | 6.44      
Random Forest        | 0.9070     | 0.9051     | 0.9049     | 14.07     


                                                                                

# Step 5: Model Tuning (Hyperparameter Optimization)

**Requirement:**

- **Tuning (Page 4):** Adjust hyperparameters (e.g., regularization) and run Cross-Validation to find the best configuration.

**Implementation:**

- use `CrossValidator` with a `ParamGridBuilder`.
- tune the **Linear Regression** model (since it was our best performer).
- **Hyperparameters to tune:**
  - `regParam`: Regularization parameter (0.01 vs 0.1).
  - `elasticNetParam`: Mixing parameter (0.0 for L2, 0.5 for ElasticNet).
- **Folds:** 3-Fold Cross-Validation.
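
The cost of this search follows directly from the grid size and the fold count. The sketch below enumerates the combinations in plain Python (it does not call Spark) and counts the model fits, assuming the cross-validator refits the best combination once on the full training set after the search.

```python
from itertools import product

reg_params = [0.01, 0.1]
elastic_net_params = [0.0, 0.5]
num_folds = 3

grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]
cv_fits = len(grid) * num_folds  # one fit per combination per fold
total_fits = cv_fits + 1         # plus the final refit of the best combination

for combo in grid:
    print(combo)
print(f"Combinations: {len(grid)} | CV fits: {cv_fits} | Total fits: {total_fits}")
```

Because every fit re-runs the whole pipeline on a fold of the training data, the search takes several times longer than a single training run.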


In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# define the estimator (pipeline with LR)
# - reuse the pipeline stages from Step 4, but with a new LR instance
lr_tune = LinearRegression(labelCol="ArrDelay", featuresCol="features", maxIter=10)
pipeline_tune = Pipeline(stages=stages + [lr_tune])

# define the ParamGrid
paramGrid = (
    ParamGridBuilder()
    .addGrid(lr_tune.regParam, [0.01, 0.1])
    .addGrid(lr_tune.elasticNetParam, [0.0, 0.5])
    .build()
)

# define CrossValidator
crossval = CrossValidator(
    estimator=pipeline_tune,
    estimatorParamMaps=paramGrid,
    evaluator=RegressionEvaluator(labelCol="ArrDelay", metricName="rmse"),
    numFolds=3,
)

In [None]:
print("Starting Cross-Validation...")
start_cv = time.time()

# run tuning
cvModel = crossval.fit(train_data)
end_cv = time.time()

# get best model results
best_model = cvModel.bestModel.stages[-1]
print(f"\nGrid Search Complete in {end_cv - start_cv:.2f}s")
print("-" * 60)
print(f"Best RegParam: {best_model.getRegParam()}")
print(f"Best ElasticNetParam: {best_model.getElasticNetParam()}")

# evaluate best model on test data
preds_tuned = cvModel.transform(test_data)
rmse_tuned = RegressionEvaluator(labelCol="ArrDelay", metricName="rmse").evaluate(
    preds_tuned
)
print(f"Test RMSE (Tuned Model): {rmse_tuned:.2f}")
print("=" * 60)

Starting Cross-Validation...


                                                                                


Grid Search Complete in 40.84s
------------------------------------------------------------
Best RegParam: 0.01
Best ElasticNetParam: 0.0
Test RMSE (Tuned Model): 10.34


# Part 1 - Part 2: Comparative Analysis (Independent vs. Distributed)

**Objective:**
Compare two execution paradigms as per Project Instructions Part 2:

1.  **Scenario A (Independent):** Simultaneous ML jobs running on isolated data chunks (simulating separate worker tasks).
2.  **Scenario B (Distributed):** One distributed ML job running on the unified dataset.

**Methodology:**

- use Python `threading` to launch 3 concurrent Spark jobs.
- each job loads a specific partition (`part1.csv`, `part2.csv`, `part3.csv`) and trains a local Linear Regression model.
- compare the **Average Accuracy** and **Total Time** against the distributed Linear Regression results from Step 4.
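
The timing logic of the concurrent scenario can be sketched without Spark. The example below swaps in `ThreadPoolExecutor` for the raw `threading.Thread` objects used in the notebook and replaces each training job with a `time.sleep` of a hypothetical duration. It shows why the total time for simultaneous jobs is governed by the slowest job rather than by the sum of all jobs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_job(name, seconds):
    """Stand-in for one training job; sleeps instead of calling Spark."""
    start = time.time()
    time.sleep(seconds)
    return name, time.time() - start


# hypothetical durations, not measurements from the cluster
jobs = [("Job_Master", 0.2), ("Job_Worker1", 0.3), ("Job_Worker2", 0.1)]

wall_start = time.time()
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    job_times = dict(pool.map(lambda j: fake_job(*j), jobs))
wall_time = time.time() - wall_start

print(f"Slowest job: {max(job_times.values()):.2f}s | Wall time: {wall_time:.2f}s")
print(f"Sum of jobs: {sum(job_times.values()):.2f}s")
```

In the real notebook the three jobs also compete for the same executors, so their individual durations are not independent of each other the way these sleeps are.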


In [None]:
import threading
import time
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import col, when

# dictionary to store results from the 3 concurrent jobs
independent_results = {}


def train_local_simulation(node_name, data_path):
    """
    simulates a worker node training a model on its isolated data partition.
    """
    try:
        print(f"[{node_name}] Starting local training on {data_path}...")
        start_local = time.time()

        # load ONLY the specific local file
        # - simulate isolation by reading specific paths.
        df_chunk = spark.read.csv(data_path, header=True, inferSchema=True)

        # local preprocessing
        # - replicate the main cleaning logic (filter cancellations, cast types)
        df_clean = (
            df_chunk.filter(
                "Cancelled == 0 AND Diverted == 0 AND ArrDelay IS NOT NULL AND DepDelay IS NOT NULL"
            )
            .withColumn("ArrDelay", col("ArrDelay").cast("double"))
            .withColumn("DepDelay", col("DepDelay").cast("double"))
            .withColumn("Distance", col("Distance").cast("double"))
        )

        # feature engineering
        # - skip the categorical columns (Origin/Dest/UniqueCarrier) for speed in this test
        # - use only numeric features, which are consistent across partitions
        assembler = VectorAssembler(
            inputCols=["DepDelay", "Distance"], outputCol="features"
        )

        # local model training
        lr = LinearRegression(labelCol="ArrDelay", featuresCol="features", maxIter=10)
        pipeline = Pipeline(stages=[assembler, lr])

        # split 80/20 locally
        train, test = df_clean.randomSplit([0.8, 0.2], seed=42)
        model = pipeline.fit(train)

        # local model eval
        preds = model.transform(test)

        # classif metrics (threshold 15m)
        preds_bin = preds.withColumn(
            "label_class", when(col("ArrDelay") >= 15, 1.0).otherwise(0.0)
        ).withColumn("pred_class", when(col("prediction") >= 15, 1.0).otherwise(0.0))

        acc = MulticlassClassificationEvaluator(
            labelCol="label_class", predictionCol="pred_class", metricName="accuracy"
        ).evaluate(preds_bin)

        duration = time.time() - start_local
        independent_results[node_name] = {"Accuracy": acc, "Time": duration}
        print(f"[{node_name}] Finished. Acc: {acc:.4f} | Time: {duration:.2f}s")

    except Exception as e:
        print(f"[{node_name}] FAILED: {str(e)}")

In [None]:
# define the 3 independent jobs
threads = []
file_parts = [
    ("Job_Master", "/data/part1.csv"),
    ("Job_Worker1", "/data/part2.csv"),
    ("Job_Worker2", "/data/part3.csv"),
]

global_start = time.time()

# launch threads to run simultaneously
for name, path in file_parts:
    t = threading.Thread(target=train_local_simulation, args=(name, path))
    threads.append(t)
    t.start()

# wait for all threads to complete
for t in threads:
    t.join()

global_end = time.time()
total_simultaneous_time = global_end - global_start

print("=" * 60)
print(f"All Simultaneous Jobs Finished in: {total_simultaneous_time:.2f}s")
print("_" * 60)

[Job_Master] Starting local training on /data/part1.csv...
[Job_Worker1] Starting local training on /data/part2.csv...
[Job_Worker2] Starting local training on /data/part3.csv...


25/12/16 15:35:31 WARN Instrumentation: [5ce40a69] regParam is zero, which might cause numerical instability and overfitting.
25/12/16 15:35:31 WARN Instrumentation: [7a4711e9] regParam is zero, which might cause numerical instability and overfitting.
25/12/16 15:35:32 WARN Instrumentation: [36d2dd83] regParam is zero, which might cause numerical instability and overfitting.
                                                                                

[Job_Worker2] Finished. Acc: 0.9006 | Time: 4.99s
[Job_Master] Finished. Acc: 0.8966 | Time: 5.44s
[Job_Worker1] Finished. Acc: 0.8977 | Time: 5.52s
All Simultaneous Jobs Finished in: 5.52s
____________________________________________________________


In [None]:
# Comparative Analysis

# calculate independent averages (divide by the number of jobs that actually finished)
avg_ind_acc = sum(r["Accuracy"] for r in independent_results.values()) / len(
    independent_results
)
max_ind_time = max(r["Time"] for r in independent_results.values())

# distributed baseline: the Linear Regression run from Step 4
dist_acc = results["Linear Regression"]["Accuracy"]
dist_time = results["Linear Regression"]["Time"]

# format the times with their unit first so the columns stay aligned
ind_time_str = f"{max_ind_time:.2f}s"
dist_time_str = f"{dist_time:.2f}s"

print("\n=== FINAL COMPARISON: INDEPENDENT vs. DISTRIBUTED ===")
print(
    f"{'Metric':<20} | {'Independent (Scenario A)':<25} | {'Distributed (Scenario B)':<25}"
)
print("-" * 75)
print(f"{'Accuracy':<20} | {avg_ind_acc:<25.4f} | {dist_acc:<25.4f}")
print(f"{'Training Time':<20} | {ind_time_str:<25} | {dist_time_str:<25}")
print(
    f"{'Data Scope':<20} | {'Local Partition Only':<25} | {'Global Unified Dataset':<25}"
)
print("=" * 75)


=== FINAL COMPARISON: INDEPENDENT vs. DISTRIBUTED ===
Metric               | Independent (Scenario A)  | Distributed (Scenario B) 
---------------------------------------------------------------------------
Accuracy             | 0.8983                    | 0.9233                   
Training Time        | 5.52s                     | 6.44s                    
Data Scope           | Local Partition Only      | Global Unified Dataset   
