<a href="https://colab.research.google.com/github/EonTechie/Big_Data_Processing/blob/main/Machine_Learning_Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os

folder_path = "/content/drive/My Drive/datasets"
files = os.listdir(folder_path)
print(files)



Mounted at /content/drive
['2.txt', 'Capitals.txt', 'EartquakeData-07032025.txt', 'DollarDataset.txt', 'couples.txt', 'join-actors.txt', 'points-null-values.txt', 'numbers-test.txt', 'join-series.txt', 'points.txt', 'names.txt', 'Lottery.txt', 'JamesJoyce-Ulyses.txt', 'world.txt', 'points-places.txt', 'Iris.csv', 'ml-latest-small', 'iris-dataset.txt', 'HousePrices-1.txt', 'HousePrices-2.txt', 'HousePrices-3.txt', 'CombinedHousePricesOutput', 'CombinedHousePricesOutput.txt', 'datasetsoutput_prices', 'house_prices_combined.csv', 'output_prices', 'leaf.csv', 'hello1.txt', 'hello2.txt', 'hello3.txt', 'movie_turkish_train.txt']


In [None]:
# In this notebook, I applied extensive hyperparameter tuning to decision tree, random forest, and multilayer perceptron classifiers,
# using both TrainValidationSplit and CrossValidator for each classifier.
# To ensure reproducibility, I used fixed random seeds throughout training and splitting.
# Due to the large parameter grids and 5-fold cross-validation,
# some training sessions took up to 2 hours.
# Since it was not feasible to run all models within the same Colab runtime session,
# I saved each model's best result to Google Drive and later combined them into a single summary table.

# The models can be reached from this drive link: https://drive.google.com/drive/folders/19dl89-bdIc8U5rFM4HUBx5mWSvJfnoHr?usp=sharing
# The results can be reached from this drive link: https://drive.google.com/drive/folders/1yJnPVaDa2iLqKbcHi72Tj9q8XWj_bgKj?usp=sharing

In [None]:
# Required imports
from pyspark.ml.classification import (
    DecisionTreeClassifier,
    RandomForestClassifier,
    MultilayerPerceptronClassifier
)
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit

spark = SparkSession.builder.getOrCreate()
# Set a fixed seed for reproducibility
seed_value = 42

# Initialize each classifier
dt = DecisionTreeClassifier(labelCol="Class", featuresCol="features", seed=seed_value)
rf = RandomForestClassifier(labelCol="Class", featuresCol="features", seed=seed_value)
mlp = MultilayerPerceptronClassifier(labelCol="Class", featuresCol="features", seed=seed_value)

# Print explainParams for each
print(" Decision Tree Parameters:\n", dt.explainParams(), "\n")
print(" Random Forest Parameters:\n", rf.explainParams(), "\n")
print(" MLP Parameters:\n", mlp.explainParams(), "\n")

 Decision Tree Parameters:
 cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featuresCol: features column name. (default: features, current: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini (default: gini)
labelCol: label column name. (default: label, current: Class)
leafCol: Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default: )
maxBins: Max

In [None]:
from pyspark.ml.feature import VectorAssembler,StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


leafDF = spark.read.option("inferSchema","true").option("header","false").csv("/content/drive/My Drive/datasets/leaf.csv")
leafDF.show()

+---+---+-------+------+-------+-------+-------+-------+---------+---------+---------+--------+---------+---------+---------+-------+
|_c0|_c1|    _c2|   _c3|    _c4|    _c5|    _c6|    _c7|      _c8|      _c9|     _c10|    _c11|     _c12|     _c13|     _c14|   _c15|
+---+---+-------+------+-------+-------+-------+-------+---------+---------+---------+--------+---------+---------+---------+-------+
|  1|  1|0.72694|1.4742|0.32396|0.98535|    1.0|0.83592|0.0046566|0.0039465|  0.04779| 0.12795| 0.016108|0.0052323|2.7477E-4| 1.1756|
|  1|  2|0.74173|1.5257|0.36116|0.98152|0.99825|0.79867|0.0052423|0.0050016|  0.02416|0.090476|0.0081195| 0.002708|7.4846E-5|0.69659|
|  1|  3|0.76722|1.5725|0.38998|0.97755|    1.0|0.80812|0.0074573| 0.010121| 0.011897|0.057445|0.0032891|9.2068E-4|3.7886E-5|0.44348|
|  1|  4|0.73797|1.4597|0.35376|0.97566|    1.0|0.81697|0.0068768|0.0086068|  0.01595|0.065491|0.0042707|0.0011544|6.6272E-5|0.58785|
|  1|  5|0.82301|1.7707|0.44462|0.97698|    1.0|0.75493| 0.007

In [None]:
# Correct column names based on the dataset description
column_names = [
    "Class", "Specimen Number", "Eccentricity", "Aspect Ratio", "Elongation",
    "Solidity", "Stochastic Convexity", "Isoperimetric Factor", "Maximal Indentation Depth",
    "Lobedness", "Average Intensity", "Average Contrast", "Smoothness",
    "Third moment", "Uniformity", "Entropy"
]

# Rename all columns in the DataFrame
for old, new in zip(leafDF.columns, column_names):
    leafDF = leafDF.withColumnRenamed(old, new)

# Show renamed columns and first rows
leafDF.show(5, truncate=False)

+-----+---------------+------------+------------+----------+--------+--------------------+--------------------+-------------------------+---------+-----------------+----------------+----------+------------+----------+-------+
|Class|Specimen Number|Eccentricity|Aspect Ratio|Elongation|Solidity|Stochastic Convexity|Isoperimetric Factor|Maximal Indentation Depth|Lobedness|Average Intensity|Average Contrast|Smoothness|Third moment|Uniformity|Entropy|
+-----+---------------+------------+------------+----------+--------+--------------------+--------------------+-------------------------+---------+-----------------+----------------+----------+------------+----------+-------+
|1    |1              |0.72694     |1.4742      |0.32396   |0.98535 |1.0                 |0.83592             |0.0046566                |0.0039465|0.04779          |0.12795         |0.016108  |0.0052323   |2.7477E-4 |1.1756 |
|1    |2              |0.74173     |1.5257      |0.36116   |0.98152 |0.99825             |0.7986

In [None]:
leafDF.select("Class").distinct().orderBy("Class").show(50)


+-----+
|Class|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
|   11|
|   12|
|   13|
|   14|
|   15|
|   22|
|   23|
|   24|
|   25|
|   26|
|   27|
|   28|
|   29|
|   30|
|   31|
|   32|
|   33|
|   34|
|   35|
|   36|
+-----+
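
The class IDs listed above are not contiguous: they run from 1 to 15 and from 22 to 36, which gives 30 distinct labels. The sketch below, in plain Python, shows one way to map such IDs to contiguous indices 0..29. The helper name `build_label_index` is hypothetical, and the sorted ordering is an assumption made for readability; Spark's `StringIndexer` (imported earlier but not used in this notebook) orders labels by frequency by default.

```python
# Class IDs as shown in the distinct() output above
class_ids = list(range(1, 16)) + list(range(22, 37))

def build_label_index(labels):
    """Map each distinct label to a contiguous index in sorted order (hypothetical helper)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

label_index = build_label_index(class_ids)
print(len(label_index))   # number of distinct classes
print(label_index[22])    # index assigned to class ID 22
```

The classifiers in this notebook are trained on the raw IDs, so this remapping is optional; it only matters where the number of classes is derived from the largest label value.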



In [None]:
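# Feature columns: every column except the "Class" label.
# Note: "Specimen Number" looks like an identifier rather than a measurement, but it is kept as a feature here.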
feature_cols = [
    "Specimen Number", "Eccentricity", "Aspect Ratio", "Elongation",
    "Solidity", "Stochastic Convexity", "Isoperimetric Factor", "Maximal Indentation Depth",
    "Lobedness", "Average Intensity", "Average Contrast", "Smoothness",
    "Third moment", "Uniformity", "Entropy"
]

# Create a VectorAssembler to combine all feature columns into a single 'features' column
vec = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Transform the original DataFrame to add the 'features' column
leafDF = vec.transform(leafDF)
leafDF.show(10, truncate=False)

+-----+---------------+------------+------------+----------+--------+--------------------+--------------------+-------------------------+---------+-----------------+----------------+----------+------------+----------+-------+---------------------------------------------------------------------------------------------------------------------------------+
|Class|Specimen Number|Eccentricity|Aspect Ratio|Elongation|Solidity|Stochastic Convexity|Isoperimetric Factor|Maximal Indentation Depth|Lobedness|Average Intensity|Average Contrast|Smoothness|Third moment|Uniformity|Entropy|features                                                                                                                         |
+-----+---------------+------------+------------+----------+--------+--------------------+--------------------+-------------------------+---------+-----------------+----------------+----------+------------+----------+-------+---------------------------------------------------------------
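
The assembled vector above includes "Specimen Number" alongside the shape and texture measurements. As a minimal sketch, assuming one wanted to keep only the measurements, the feature list could be derived from the full column list as shown below. The names `non_feature_cols` and `measurement_cols` are hypothetical, and whether to drop the column is a modeling choice; the results reported in this notebook were produced with it included.

```python
column_names = [
    "Class", "Specimen Number", "Eccentricity", "Aspect Ratio", "Elongation",
    "Solidity", "Stochastic Convexity", "Isoperimetric Factor", "Maximal Indentation Depth",
    "Lobedness", "Average Intensity", "Average Contrast", "Smoothness",
    "Third moment", "Uniformity", "Entropy"
]

# Assumption: the label and the identifier column are not treated as features
non_feature_cols = {"Class", "Specimen Number"}
measurement_cols = [c for c in column_names if c not in non_feature_cols]

print(len(measurement_cols))
print(measurement_cols[:3])
```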

In [None]:
# Show sample rows with class labels and generated features
leafDF.select("features","Class").show(5, truncate=False)

+------------------------------------------------------------------------------------------------------------------------------+-----+
|features                                                                                                                      |Class|
+------------------------------------------------------------------------------------------------------------------------------+-----+
|[1.0,0.72694,1.4742,0.32396,0.98535,1.0,0.83592,0.0046566,0.0039465,0.04779,0.12795,0.016108,0.0052323,2.7477E-4,1.1756]      |1    |
|[2.0,0.74173,1.5257,0.36116,0.98152,0.99825,0.79867,0.0052423,0.0050016,0.02416,0.090476,0.0081195,0.002708,7.4846E-5,0.69659]|1    |
|[3.0,0.76722,1.5725,0.38998,0.97755,1.0,0.80812,0.0074573,0.010121,0.011897,0.057445,0.0032891,9.2068E-4,3.7886E-5,0.44348]   |1    |
|[4.0,0.73797,1.4597,0.35376,0.97566,1.0,0.81697,0.0068768,0.0086068,0.01595,0.065491,0.0042707,0.0011544,6.6272E-5,0.58785]   |1    |
|[5.0,0.82301,1.7707,0.44462,0.97698,1.0,0.75493,0.0074

In [None]:
############################################################################################ DECISION TREE CLASSIFIER ###############################################################################################

Decision Tree Classifier with TrainValidationSplit

In [None]:
# Decision Tree Classifier with TrainValidationSplit and comprehensive parameter tuning
# This configuration tunes the key hyperparameters for the leaf dataset,
# which contains 340 samples across 30 classes (about 11 instances per class on average).
# A fixed seed is used for reproducibility across splitting and training.
# Parameters were selected based on their influence on training behavior and model complexity.
# Visualization-only or output-only parameters (e.g., thresholds, leafCol, probabilityCol) were intentionally excluded to avoid unnecessary tuning overhead.

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.sql import Row

# Set seed value for reproducibility
seed_value = 42

# Prepare dataset and perform a seeded random split (randomSplit does not stratify by class)
leafDF = leafDF.select("features", "Class")
trainDF, testDF = leafDF.randomSplit([0.8, 0.2], seed=seed_value)

# Print the sizes of the two splits
print("Training set size:", trainDF.count())
print("Test set size:", testDF.count())

# Initialize classifier with seed
dt = DecisionTreeClassifier(labelCol="Class", featuresCol="features", seed=seed_value)

# Define hyperparameter grid
paramGrid_dt = (
    ParamGridBuilder()
    .addGrid(dt.maxDepth, [3, 5, 7, 10])                 # Controls tree complexity to prevent overfitting
    .addGrid(dt.impurity, ["gini", "entropy"])           # Split criterion: Gini vs. Entropy
    .addGrid(dt.maxBins, [32, 64])                       # Number of bins used for feature discretization
    .addGrid(dt.minInstancesPerNode, [1, 2])             # Minimum number of instances per child after split
    .addGrid(dt.minInfoGain, [0.0, 0.01])                # Discards splits with low information gain
    .addGrid(dt.minWeightFractionPerNode, [0.0])         # Ensures minimum relative weight in child nodes (default 0.0)
    .addGrid(dt.maxMemoryInMB, [256, 512])               # Memory budget for histogram aggregation (mainly affects training speed)
    .addGrid(dt.cacheNodeIds, [True, False])             # Caches node IDs to accelerate deeper trees (a performance setting)
    .addGrid(dt.checkpointInterval, [10])                # Checkpoint frequency for the node ID cache; single value, so the grid size is unchanged
    .build()
)

# Evaluation metric
evaluator_dt = MulticlassClassificationEvaluator(
    labelCol="Class",
    predictionCol="prediction",
    metricName="accuracy"
)

# TrainValidationSplit setup
val_dt = TrainValidationSplit(
    estimator=dt,
    estimatorParamMaps=paramGrid_dt,
    evaluator=evaluator_dt,
    trainRatio=0.8,
    seed=seed_value
)

# Train model
tuned_model_dt = val_dt.fit(trainDF)
bestModel = tuned_model_dt.bestModel

# Extract best parameter values as string for later tabulation
params_dt_tvs = f"maxDepth={bestModel.getMaxDepth()}, impurity={bestModel.getImpurity()}, maxBins={bestModel.getMaxBins()}, " \
                f"minInstancesPerNode={bestModel.getMinInstancesPerNode()}, minInfoGain={bestModel.getMinInfoGain()}, " \
                f"minWeightFractionPerNode={bestModel.getMinWeightFractionPerNode()}, maxMemoryInMB={bestModel.getMaxMemoryInMB()}, " \
                f"cacheNodeIds={bestModel.getCacheNodeIds()}, checkpointInterval={bestModel.getCheckpointInterval()}"

# Predict on test set and evaluate
resultDF_dt = tuned_model_dt.transform(testDF)
resultDF_dt.select("features", "Class", "prediction").show(5)

accuracy_dt_tvs = evaluator_dt.evaluate(resultDF_dt)
print("Accuracy (Decision Tree with TVS):", accuracy_dt_tvs)

# One-row result object for final result table
result_row_dt_tvs = Row(Method="Decision Tree (TVS)", Parameters=params_dt_tvs, Accuracy=accuracy_dt_tvs)
spark.createDataFrame([result_row_dt_tvs]).coalesce(1).write.mode("overwrite").option("header", "true").csv("/content/drive/MyDrive/results/result_dt_tvs.csv")
tuned_model_dt.bestModel.write().overwrite().save('/content/drive/MyDrive/models/tuned_model_dt')  # Save best DT model from TrainValidationSplit; overwrite so reruns do not fail

Training set size: 288
Test set size: 52
+--------------------+-----+----------+
|            features|Class|prediction|
+--------------------+-----+----------+
|[1.0,0.4132,1.038...|   15|      15.0|
|[1.0,0.50924,1.21...|   30|      30.0|
|[1.0,0.60267,1.25...|   27|      26.0|
|[1.0,0.71763,1.50...|   13|       1.0|
|[1.0,0.86224,2.07...|   32|       7.0|
+--------------------+-----+----------+
only showing top 5 rows

Accuracy (Decision Tree with TVS): 0.6538461538461539
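
The split in the cell above uses `randomSplit`, which samples rows independently of their class, so a class with only a handful of specimens can end up under-represented in the test set. The following plain-Python sketch shows what a per-class (stratified) split does instead. The function `stratified_split` and the toy rows are hypothetical and synthetic; they are not the leaf data and not part of the notebook's pipeline.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=42):
    """Split rows so each class sends about test_fraction of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        # At least one test row per class, otherwise the rounded share
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Synthetic toy data: 30 classes with 10 specimens each
toy_rows = [(class_id, specimen) for class_id in range(30) for specimen in range(10)]
train_rows, test_rows = stratified_split(toy_rows, label_of=lambda r: r[0])
print(len(train_rows), len(test_rows))
```

In Spark, `DataFrame.sampleBy` offers per-class sampling fractions, although the resulting sizes are approximate rather than exact.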


Decision Tree Classifier with CrossValidator

In [None]:
# Decision Tree Classifier with CrossValidator and full hyperparameter tuning
# This configuration uses 5-fold cross-validation to ensure stable and robust parameter selection.
# The leaf dataset has 340 samples across 30 classes, about 11 samples per class on average.
# 5 folds maintain a balance between training size and validation reliability for such a small, multi-class dataset.
# Parameters were selected by analyzing model-relevant hyperparameters from explainParams().
# Parameters not affecting training (like probabilityCol, leafCol, thresholds, etc.) were excluded to keep the search space efficient.

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import Row

# Set seed for reproducibility
seed_value = 42

# Prepare data
leafDF = leafDF.select("features", "Class")
trainDF, testDF = leafDF.randomSplit([0.8, 0.2], seed=seed_value)
print("Training set size:", trainDF.count())
print("Test set size:", testDF.count())

# Initialize classifier
dt = DecisionTreeClassifier(labelCol="Class", featuresCol="features", seed=seed_value)

# Define hyperparameter grid
paramGrid_dt = (
    ParamGridBuilder()
    .addGrid(dt.maxDepth, [3, 5, 7, 10])                 # Prevents overfitting by controlling tree complexity
    .addGrid(dt.impurity, ["gini", "entropy"])           # Information gain criterion for node splitting
    .addGrid(dt.maxBins, [32, 64])                       # Number of bins used to discretize continuous features
    .addGrid(dt.minInstancesPerNode, [1, 2])             # Avoids splits that leave very few samples in children
    .addGrid(dt.minInfoGain, [0.0, 0.01])                # Prevents weak or meaningless splits
    .addGrid(dt.minWeightFractionPerNode, [0.0])         # Kept at the default 0.0 (no weight-based constraint)
    .addGrid(dt.maxMemoryInMB, [256, 512])               # Memory budget for histogram aggregation (mainly affects training speed)
    .addGrid(dt.cacheNodeIds, [True, False])             # Optionally caches node IDs (a performance setting)
    .addGrid(dt.checkpointInterval, [10])                # Checkpoint frequency for the node ID cache; single value, so the grid size is unchanged
    .build()
)

# Accuracy-based evaluator for multi-class classification
evaluator_dt = MulticlassClassificationEvaluator(
    labelCol="Class",
    predictionCol="prediction",
    metricName="accuracy"
)

# CrossValidator setup with 5 folds and fixed seed
crossval_dt = CrossValidator(
    estimator=dt,
    estimatorParamMaps=paramGrid_dt,
    evaluator=evaluator_dt,
    numFolds=5,
    seed=seed_value
)

# Train model using cross-validation
cv_model_dt = crossval_dt.fit(trainDF)
bestModel_cv = cv_model_dt.bestModel

# Extract best parameters into string format for table
params_dt_cv = f"maxDepth={bestModel_cv.getMaxDepth()}, impurity={bestModel_cv.getImpurity()}, maxBins={bestModel_cv.getMaxBins()}, " \
               f"minInstancesPerNode={bestModel_cv.getMinInstancesPerNode()}, minInfoGain={bestModel_cv.getMinInfoGain()}, " \
               f"minWeightFractionPerNode={bestModel_cv.getMinWeightFractionPerNode()}, maxMemoryInMB={bestModel_cv.getMaxMemoryInMB()}, " \
               f"cacheNodeIds={bestModel_cv.getCacheNodeIds()}, checkpointInterval={bestModel_cv.getCheckpointInterval()}"

# Evaluate on test data
resultDF_cv = cv_model_dt.transform(testDF)
resultDF_cv.select("features", "Class", "prediction").show(5)

accuracy_dt_cv = evaluator_dt.evaluate(resultDF_cv)
print("Accuracy (Decision Tree with CrossValidator):", accuracy_dt_cv)

# Save result row for final table
result_row_dt_cv = Row(Method="Decision Tree (CV)", Parameters=params_dt_cv, Accuracy=accuracy_dt_cv)
spark.createDataFrame([result_row_dt_cv]).coalesce(1).write.mode("overwrite").option("header", "true").csv("/content/drive/MyDrive/results/result_dt_cv.csv")
cv_model_dt.bestModel.write().overwrite().save('/content/drive/MyDrive/models/cv_model_dt')  # Save best DT model from CrossValidator; overwrite so reruns do not fail

Training set size: 288
Test set size: 52
+--------------------+-----+----------+
|            features|Class|prediction|
+--------------------+-----+----------+
|[1.0,0.4132,1.038...|   15|       6.0|
|[1.0,0.50924,1.21...|   30|      30.0|
|[1.0,0.60267,1.25...|   27|      24.0|
|[1.0,0.71763,1.50...|   13|       1.0|
|[1.0,0.86224,2.07...|   32|       2.0|
+--------------------+-----+----------+
only showing top 5 rows

Accuracy (Decision Tree with CrossValidator): 0.5769230769230769
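
To see why the cross-validated search takes so much longer than TrainValidationSplit, it helps to count model fits. The sketch below multiplies the number of candidate values per hyperparameter in the decision tree grid above. The helper `count_fits` is hypothetical; it assumes one fit per combination and fold, plus one final refit of the best combination on the full training set.

```python
from math import prod

# Number of candidate values per hyperparameter in the decision tree grid above
dt_grid_sizes = {
    "maxDepth": 4, "impurity": 2, "maxBins": 2, "minInstancesPerNode": 2,
    "minInfoGain": 2, "minWeightFractionPerNode": 1, "maxMemoryInMB": 2,
    "cacheNodeIds": 2, "checkpointInterval": 1,
}

def count_fits(grid_sizes, num_folds=None):
    """Return (combinations, total fits) for a grid search (hypothetical helper)."""
    combos = prod(grid_sizes.values())
    rounds = 1 if num_folds is None else num_folds
    return combos, combos * rounds + 1  # +1 for the final refit of the best combination

print(count_fits(dt_grid_sizes))               # TrainValidationSplit
print(count_fits(dt_grid_sizes, num_folds=5))  # CrossValidator with 5 folds
```

Parameters listed with a single value do not enlarge the grid, while every two-valued parameter doubles it, including those that mostly affect speed rather than the fitted tree.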


In [None]:
##################################################################################### RANDOM FOREST CLASSIFIER #####################################################################################

Random Forest Classifier with TrainValidationSplit

In [None]:
# Random Forest Classifier with TrainValidationSplit and full hyperparameter tuning
# This model performs hyperparameter tuning using an explicit grid focused on parameters that directly influence training and generalization.
# The leaf dataset contains 340 samples across 30 classes, with about 11 samples per class on average.
# We exclude parameters such as probabilityCol, leafCol, thresholds, and rawPredictionCol as they do not influence training behavior.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.sql import Row

# Set seed value for reproducibility
seed_value = 42

# Prepare dataset and split (seeded random split, not stratified)
leafDF = leafDF.select("features", "Class")
trainDF, testDF = leafDF.randomSplit([0.8, 0.2], seed=seed_value)

print("Training set size:", trainDF.count())
print("Test set size:", testDF.count())

# Initialize classifier with seed
rf = RandomForestClassifier(labelCol="Class", featuresCol="features", seed=seed_value)

# Define hyperparameter grid based on training-impactful params
paramGrid_rf = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [10, 20, 30])                        # numTrees: More trees may improve stability but increase training cost
    .addGrid(rf.featureSubsetStrategy, ["auto", "sqrt"])      # featureSubsetStrategy: for a classification forest with numTrees > 1, "auto" resolves to "sqrt", so both values train the same configuration
    .addGrid(rf.maxDepth, [5, 10])                             # maxDepth: Limited to prevent overfitting on small dataset
    .addGrid(rf.impurity, ["gini", "entropy"])                # impurity: Information gain criteria
    .addGrid(rf.maxBins, [32, 64])                             # maxBins: Discretization resolution for continuous features
    .addGrid(rf.minInstancesPerNode, [1, 2])                   # minInstancesPerNode: Avoids splits that leave very small child nodes
    .addGrid(rf.minInfoGain, [0.0, 0.01])                      # minInfoGain: Prevents meaningless or low-gain splits
    .addGrid(rf.maxMemoryInMB, [256, 512])                     # maxMemoryInMB: Memory budget for histogram aggregation (mainly affects training speed)
    .addGrid(rf.cacheNodeIds, [True, False])                   # cacheNodeIds: Caches node IDs per instance; can speed up deeper trees
    .addGrid(rf.subsamplingRate, [0.8, 1.0])                   # subsamplingRate: Tests bootstrap sampling vs full dataset use
    .build()
)

# Define evaluator based on classification accuracy
evaluator_rf = MulticlassClassificationEvaluator(
    labelCol="Class", predictionCol="prediction", metricName="accuracy"
)

# Setup TVS
tvs_rf = TrainValidationSplit(
    estimator=rf,
    estimatorParamMaps=paramGrid_rf,
    evaluator=evaluator_rf,
    trainRatio=0.8,
    seed=seed_value
)

# Train model
tuned_model_rf = tvs_rf.fit(trainDF)
bestModel_rf = tuned_model_rf.bestModel

# Extract best parameters as string for report table
# (on the fitted forest model, getNumTrees is a property, hence no parentheses)
params_rf_tvs = f"numTrees={bestModel_rf.getNumTrees}, featureSubsetStrategy={bestModel_rf.getFeatureSubsetStrategy()}, " \
                f"maxDepth={bestModel_rf.getMaxDepth()}, impurity={bestModel_rf.getImpurity()}, " \
                f"maxBins={bestModel_rf.getMaxBins()}, minInstancesPerNode={bestModel_rf.getMinInstancesPerNode()}, " \
                f"minInfoGain={bestModel_rf.getMinInfoGain()}, maxMemoryInMB={bestModel_rf.getMaxMemoryInMB()}, " \
                f"cacheNodeIds={bestModel_rf.getCacheNodeIds()}, subsamplingRate={bestModel_rf.getSubsamplingRate()}"

# Predict and evaluate
resultDF_rf = tuned_model_rf.transform(testDF)
resultDF_rf.select("features", "Class", "prediction").show(5)

accuracy_rf_tvs = evaluator_rf.evaluate(resultDF_rf)
print("Accuracy (Random Forest with TVS):", accuracy_rf_tvs)

# Format for final result table
result_row_rf_tvs = Row(Method="Random Forest (TVS)", Parameters=params_rf_tvs, Accuracy=accuracy_rf_tvs)
spark.createDataFrame([result_row_rf_tvs]).coalesce(1).write.mode("overwrite").option("header", "true").csv("/content/drive/MyDrive/results/result_rf_tvs.csv")
bestModel_rf.write().overwrite().save('/content/drive/MyDrive/models/bestModel_rf')  # Save best RF model from TrainValidationSplit; overwrite so reruns do not fail

Training set size: 288
Test set size: 52
+--------------------+-----+----------+
|            features|Class|prediction|
+--------------------+-----+----------+
|[1.0,0.4132,1.038...|   15|      15.0|
|[1.0,0.50924,1.21...|   30|      30.0|
|[1.0,0.60267,1.25...|   27|      27.0|
|[1.0,0.71763,1.50...|   13|      13.0|
|[1.0,0.86224,2.07...|   32|      32.0|
+--------------------+-----+----------+
only showing top 5 rows

Accuracy (Random Forest with TVS): 0.7115384615384616
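
The random forest grid above lists both "auto" and "sqrt" for `featureSubsetStrategy`. Spark's parameter description states that "auto" selects "all" when `numTrees` is 1, and otherwise "sqrt" for classification or "onethird" for regression. The plain-Python sketch below mirrors that documented rule to show how many distinct configurations the two grid axes really produce. The helper `resolve_feature_subset_strategy` is hypothetical and is not a Spark API.

```python
def resolve_feature_subset_strategy(strategy, num_trees, task="classification"):
    """Resolve 'auto' following the rule in Spark's parameter description (hypothetical helper)."""
    if strategy != "auto":
        return strategy
    if num_trees == 1:
        return "all"
    return "sqrt" if task == "classification" else "onethird"

num_trees_grid = [10, 20, 30]
strategy_grid = ["auto", "sqrt"]

# All (numTrees, strategy) pairs requested by the grid
requested = [(n, s) for n in num_trees_grid for s in strategy_grid]
# Pairs that remain distinct once 'auto' is resolved
effective = {(n, resolve_feature_subset_strategy(s, n)) for n, s in requested}

print(len(requested), len(effective))
```

Under this rule, half of the requested pairs duplicate the other half, so dropping one of the two values would halve the search time without changing which models are considered.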


Random Forest Classifier with CrossValidator

In [None]:
# Random Forest Classifier with CrossValidator and full hyperparameter tuning
# This version uses 5-fold cross-validation to achieve more robust and stable model selection.
# The leaf dataset consists of 340 samples over 30 classes, about 11 samples per class on average.
# Parameters selected below affect training behavior, model complexity, and generalization.
# Parameters such as thresholds, probabilityCol, rawPredictionCol, and leafCol were excluded because they affect only post-training output or are not relevant to tuning.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import Row

# Set seed for reproducibility
seed_value = 42

# Prepare data
leafDF = leafDF.select("features", "Class")
trainDF, testDF = leafDF.randomSplit([0.8, 0.2], seed=seed_value)
trainDF.cache()
trainDF.count()  # triggers caching

print("Training set size:", trainDF.count())
print("Test set size:", testDF.count())

# Initialize classifier
rf = RandomForestClassifier(labelCol="Class", featuresCol="features", seed=seed_value)

# Define parameter grid (only training-relevant hyperparameters)
paramGrid_rf = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [10, 20, 30])                          # numTrees: Controls ensemble size; more trees increase stability but cost more
    .addGrid(rf.featureSubsetStrategy, ["auto", "sqrt"])        # featureSubsetStrategy: for a classification forest with numTrees > 1, "auto" resolves to "sqrt", so both values train the same configuration
    .addGrid(rf.maxDepth, [5, 10])                               # maxDepth: Controls tree complexity to avoid overfitting on small dataset
    .addGrid(rf.impurity, ["gini", "entropy"])                  # impurity: Evaluates both splitting criteria
    .addGrid(rf.maxBins, [32, 64])                               # maxBins: Higher resolution for continuous variables; trade-off with speed
    .addGrid(rf.minInstancesPerNode, [1, 2])                     # minInstancesPerNode: Ensures children have enough data to avoid unstable splits
    .addGrid(rf.minInfoGain, [0.0, 0.01])                        # minInfoGain: Forces splits to have meaningful improvement
    .addGrid(rf.maxMemoryInMB, [256, 512])                       # maxMemoryInMB: Memory budget for node histogram aggregation
    .addGrid(rf.cacheNodeIds, [False])                           # cacheNodeIds: Fixed to False here, which halves the grid compared with the TVS run
    .addGrid(rf.subsamplingRate, [0.8, 1.0])                     # subsamplingRate: Tests effect of bootstrap-style sampling vs full dataset
    .build()
)

# Accuracy-based evaluator
evaluator_rf = MulticlassClassificationEvaluator(
    labelCol="Class",
    predictionCol="prediction",
    metricName="accuracy"
)

# CrossValidator setup
crossval_rf = CrossValidator(
    estimator=rf,
    estimatorParamMaps=paramGrid_rf,
    evaluator=evaluator_rf,
    numFolds=5,         # 5-fold: Good trade-off between validation reliability and training set size
    seed=seed_value,
    parallelism=4       # Parallel execution to speed up tuning
)

# Train model
cv_model_rf = crossval_rf.fit(trainDF)
bestModel_rf = cv_model_rf.bestModel

# Extract best parameter combination
params_rf_cv = f"numTrees={bestModel_rf.getNumTrees}, featureSubsetStrategy={bestModel_rf.getFeatureSubsetStrategy()}, " \
               f"maxDepth={bestModel_rf.getMaxDepth()}, impurity={bestModel_rf.getImpurity()}, " \
               f"maxBins={bestModel_rf.getMaxBins()}, minInstancesPerNode={bestModel_rf.getMinInstancesPerNode()}, " \
               f"minInfoGain={bestModel_rf.getMinInfoGain()}, maxMemoryInMB={bestModel_rf.getMaxMemoryInMB()}, " \
               f"cacheNodeIds={bestModel_rf.getCacheNodeIds()}, subsamplingRate={bestModel_rf.getSubsamplingRate()}"

# Predict on test set
resultDF_rf = cv_model_rf.transform(testDF)
resultDF_rf.select("features", "Class", "prediction").show(5)

# Evaluate accuracy
accuracy_rf_cv = evaluator_rf.evaluate(resultDF_rf)
print("Accuracy (Random Forest with CrossValidator):", accuracy_rf_cv)

# Save as result row for final table
result_row_rf_cv = Row(Method="Random Forest (CV)", Parameters=params_rf_cv, Accuracy=accuracy_rf_cv)

cv_model_rf.bestModel.write().overwrite().save('/content/drive/MyDrive/models/cv_model_rf_bestModel')  # Save best RF model from CrossValidator; overwrite so reruns do not fail

spark.createDataFrame([result_row_rf_cv]).coalesce(1).write.mode("overwrite").option("header", "true").csv("/content/drive/MyDrive/results/result_row_rf_cv.csv")



Training set size: 288
Test set size: 52
+--------------------+-----+----------+
|            features|Class|prediction|
+--------------------+-----+----------+
|[1.0,0.4132,1.038...|   15|      15.0|
|[1.0,0.50924,1.21...|   30|      30.0|
|[1.0,0.60267,1.25...|   27|      24.0|
|[1.0,0.71763,1.50...|   13|      13.0|
|[1.0,0.86224,2.07...|   32|      32.0|
+--------------------+-----+----------+
only showing top 5 rows

Accuracy (Random Forest with CrossValidator): 0.6730769230769231
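
With only 52 test rows, each accuracy above corresponds to a small whole number of correct predictions, and the differences between methods amount to a few rows. The sketch below recovers those counts from the printed accuracies and attaches a rough binomial standard error. The helper `correct_and_stderr` is hypothetical, and the standard error assumes independent test rows, so it is a rough guide rather than a formal test.

```python
from math import sqrt

test_size = 52  # test set size printed by each cell above
reported = {
    "Decision Tree (TVS)": 0.6538461538461539,
    "Decision Tree (CV)": 0.5769230769230769,
    "Random Forest (TVS)": 0.7115384615384616,
    "Random Forest (CV)": 0.6730769230769231,
}

def correct_and_stderr(accuracy, n):
    """Return the implied number of correct predictions and the binomial standard error."""
    correct = round(accuracy * n)
    return correct, sqrt(accuracy * (1.0 - accuracy) / n)

for method, acc in reported.items():
    correct, se = correct_and_stderr(acc, test_size)
    print(f"{method}: {correct}/{test_size} correct, accuracy {acc:.3f} +/- {se:.3f}")
```

Each standard error is larger than the gap between the TVS and CV variants of the random forest, which suggests that a ranking of these runs on this single split should be read with caution.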


In [None]:
##################################################################### MULTILAYER PERCEPTRON CLASSIFIER ###############################################################

Multilayer Perceptron Classifier with TrainValidationSplit

In [None]:
# Multilayer Perceptron Classifier with TrainValidationSplit and targeted hyperparameter tuning
# This configuration is designed for the leaf dataset, which has only 340 samples across 40 classes (sparse multi-class setup).
# To avoid overfitting and long training times, we carefully select meaningful parameters and keep the grid small but effective.
# Parameters such as probabilityCol, rawPredictionCol, thresholds, and initialWeights are excluded because they don't affect training behavior or are rarely used in tuning.

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.sql import Row

# Set seed for reproducibility
seed_value = 42

# Prepare dataset
leafDF = leafDF.select("features", "Class")
trainDF, testDF = leafDF.randomSplit([0.8, 0.2], seed=seed_value)

print("Training set size:", trainDF.count())
print("Test set size:", testDF.count())

# Define classifier
mlp = MultilayerPerceptronClassifier(labelCol="Class", featuresCol="features", seed=seed_value)

# Define network architectures
# The input layer has 15 units (one per feature) and the output layer has 40 units, so we try 1 shallow and 1 moderately deep network
layers_1 = [15, 32, 40]        # One hidden layer
layers_2 = [15, 32, 16, 40]    # Two hidden layers

# Define parameter grid
paramGrid_mlp = (
    ParamGridBuilder()
    .addGrid(mlp.layers, [layers_1, layers_2])              # layers: Balanced depth for small dataset; deeper model risks overfitting
    .addGrid(mlp.blockSize, [64, 128])                      # blockSize: Size of the blocks used to stack input rows into matrices (affects speed and memory)
    .addGrid(mlp.maxIter, [100, 200])                       # maxIter: Shorter vs. longer training budget
    .addGrid(mlp.stepSize, [0.03, 0.1])                     # stepSize: Smaller vs. larger step; only used by solver="gd", so it likely has no effect with the default "l-bfgs"
    .addGrid(mlp.tol, [1e-4, 1e-6])                         # tol: Looser tolerance (1e-4) vs. the default (1e-6)
    .build()
)

# Define evaluator
evaluator_mlp = MulticlassClassificationEvaluator(
    labelCol="Class",
    predictionCol="prediction",
    metricName="accuracy"
)

# TrainValidationSplit setup
tvs_mlp = TrainValidationSplit(
    estimator=mlp,
    estimatorParamMaps=paramGrid_mlp,
    evaluator=evaluator_mlp,
    trainRatio=0.8,
    seed=seed_value
)

# Train the model
tuned_model_mlp = tvs_mlp.fit(trainDF)
bestModel = tuned_model_mlp.bestModel

# Extract best parameters
params_mlp_tvs = f"layers={bestModel.getLayers()}, blockSize={bestModel.getBlockSize()}, " \
                 f"maxIter={bestModel.getMaxIter()}, stepSize={bestModel.getStepSize()}, tol={bestModel.getTol()}"

# Evaluate and print results
resultDF = tuned_model_mlp.transform(testDF)
resultDF.select("features", "Class", "prediction").show(5)

accuracy_mlp_tvs = evaluator_mlp.evaluate(resultDF)
print("Accuracy (MLP with TrainValidationSplit):", accuracy_mlp_tvs)

# For final result table
result_row_mlp_tvs = Row(Method="MLP (TVS)", Parameters=params_mlp_tvs, Accuracy=accuracy_mlp_tvs)
tuned_model_mlp.bestModel.save('/content/drive/MyDrive/models/tuned_model_mlp_bestModel')  # Save best MLP model found by TrainValidationSplit
spark.createDataFrame([result_row_mlp_tvs]).coalesce(1).write.mode("overwrite").option("header", "true").csv("/content/drive/MyDrive/results/result_row_mlp_tvs.csv")



Training set size: 288
Test set size: 52
+--------------------+-----+----------+
|            features|Class|prediction|
+--------------------+-----+----------+
|[1.0,0.4132,1.038...|   15|      36.0|
|[1.0,0.50924,1.21...|   30|       9.0|
|[1.0,0.60267,1.25...|   27|      24.0|
|[1.0,0.71763,1.50...|   13|      33.0|
|[1.0,0.86224,2.07...|   32|       2.0|
+--------------------+-----+----------+
only showing top 5 rows

Accuracy (MLP with TrainValidationSplit): 0.4807692307692308
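
To get a feel for the training cost of the grid above, the following sketch counts its parameter combinations in plain Python. The grid values are copied from the cell above. The helper names are hypothetical, and the "+ 1" assumes the tuner refits the best parameter map once on the full training set after the search.

```python
from itertools import product

# Values copied from the TrainValidationSplit grid in the cell above.
tvs_grid = {
    "layers": [[15, 32, 40], [15, 32, 16, 40]],
    "blockSize": [64, 128],
    "maxIter": [100, 200],
    "stepSize": [0.03, 0.1],
    "tol": [1e-4, 1e-6],
}


def count_combinations(grid: dict) -> int:
    """Number of parameter maps a full grid over these values produces."""
    return sum(1 for _ in product(*grid.values()))


def total_fits(n_combinations: int, n_folds: int = 1) -> int:
    """Model fits needed: one per combination per fold, plus one final refit."""
    return n_combinations * n_folds + 1


n_tvs = count_combinations(tvs_grid)
print("combinations:", n_tvs)
print("fits with a single train/validation split:", total_fits(n_tvs))
print("fits if the same grid were run with 5 folds:", total_fits(n_tvs, n_folds=5))
```

The last line is a hypothetical comparison: it shows how quickly the same grid would grow under 5-fold cross-validation.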


Multilayer Perceptron Classifier with CrossValidator

In [None]:
# Multilayer Perceptron Classifier with CrossValidator and carefully selected hyperparameter tuning
# This version uses 5-fold cross-validation for robust model evaluation on the leaf dataset (340 samples, 40 classes).
# Cross-validation increases training cost, so the tuning grid is kept minimal but meaningful.
# Parameters such as probabilityCol, rawPredictionCol, thresholds, initialWeights, and solver are excluded,
# as they do not affect learning performance or are not practical for tuning in this context.

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import Row

# Set seed for reproducibility
seed_value = 42

# Prepare dataset
leafDF = leafDF.select("features", "Class")
trainDF, testDF = leafDF.randomSplit([0.8, 0.2], seed=seed_value)

print("Training set size:", trainDF.count())
print("Test set size:", testDF.count())

# Initialize classifier
mlp = MultilayerPerceptronClassifier(labelCol="Class", featuresCol="features", seed=seed_value)

# Define two network architectures
# Due to high number of classes and low sample count, we test one simple and one moderately deep topology
layers_1 = [15, 32, 40]         # One hidden layer
layers_2 = [15, 32, 16, 40]     # Two hidden layers

# Define hyperparameter grid
paramGrid_mlp = (
    ParamGridBuilder()
    .addGrid(mlp.layers, [layers_1, layers_2])              # layers: Light vs deeper model to test complexity effect
    .addGrid(mlp.maxIter, [100])                            # maxIter: Single value to reduce cross-validation duration
    .addGrid(mlp.stepSize, [0.03])                          # stepSize: Default value; only used by solver="gd", so it likely has no effect with the default "l-bfgs"
    .addGrid(mlp.tol, [1e-4])                               # tol: Fixed tolerance, looser than the 1e-6 default
    .addGrid(mlp.blockSize, [64])                           # blockSize: Fixed block size for stacking input rows into matrices
    .build()
)

# Accuracy evaluator
evaluator_mlp = MulticlassClassificationEvaluator(
    labelCol="Class",
    predictionCol="prediction",
    metricName="accuracy"
)

# CrossValidator setup
crossval_mlp = CrossValidator(
    estimator=mlp,
    estimatorParamMaps=paramGrid_mlp,
    evaluator=evaluator_mlp,
    numFolds=5,          # 5 folds = good trade-off between stability and training data coverage
    seed=seed_value
)

# Train model
cv_model_mlp = crossval_mlp.fit(trainDF)
bestModel = cv_model_mlp.bestModel

# Save best parameters as string
params_mlp_cv = f"layers={bestModel.getLayers()}, blockSize={bestModel.getBlockSize()}, " \
                f"maxIter={bestModel.getMaxIter()}, stepSize={bestModel.getStepSize()}, tol={bestModel.getTol()}"

# Evaluate on test set
resultDF = cv_model_mlp.transform(testDF)
resultDF.select("features", "Class", "prediction").show(5)

accuracy_mlp_cv = evaluator_mlp.evaluate(resultDF)
print("Accuracy (MLP with CrossValidator):", accuracy_mlp_cv)

# Store for final summary table
result_row_mlp_cv = Row(Method="MLP (CV)", Parameters=params_mlp_cv, Accuracy=accuracy_mlp_cv)
cv_model_mlp.bestModel.save('/content/drive/MyDrive/models/cv_model_mlp_bestModel')  # Save best MLP model found by CrossValidator

spark.createDataFrame([result_row_mlp_cv]).coalesce(1).write.mode("overwrite").option("header", "true").csv("/content/drive/MyDrive/results/result_row_mlp_cv.csv")


Training set size: 288
Test set size: 52
+--------------------+-----+----------+
|            features|Class|prediction|
+--------------------+-----+----------+
|[1.0,0.4132,1.038...|   15|      36.0|
|[1.0,0.50924,1.21...|   30|       9.0|
|[1.0,0.60267,1.25...|   27|       1.0|
|[1.0,0.71763,1.50...|   13|      33.0|
|[1.0,0.86224,2.07...|   32|       2.0|
+--------------------+-----+----------+
only showing top 5 rows

Accuracy (MLP with CrossValidator): 0.5
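
The two topologies can also be compared by their number of trainable parameters. The sketch below assumes a standard fully connected network in which each layer has one weight per input-output pair plus one bias per output unit. The helper name is hypothetical and the counts are simple arithmetic, not values read from the trained models.

```python
def mlp_param_count(layers: list) -> int:
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


layers_1 = [15, 32, 40]       # one hidden layer
layers_2 = [15, 32, 16, 40]   # two hidden layers

for layers in (layers_1, layers_2):
    print(layers, "->", mlp_param_count(layers), "parameters")
```

Under this assumption the two-hidden-layer network has slightly fewer parameters than the one-hidden-layer network, because its 16-unit layer narrows the connection to the 40 outputs. Both counts are several times larger than the 288 training rows.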


In [None]:
################################################################################ REPORT CREATING THE TABLE ##########################################################################################

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Paths to the method names, parameters, and accuracies saved to Drive after each training and tuning run
paths = [
    "/content/drive/MyDrive/results/result_dt_tvs.csv/part*.csv",
    "/content/drive/MyDrive/results/result_dt_cv.csv/part*.csv",
    "/content/drive/MyDrive/results/result_rf_tvs.csv/part*.csv",
    "/content/drive/MyDrive/results/result_row_rf_cv.csv/part*.csv",
    "/content/drive/MyDrive/results/result_row_mlp_tvs.csv/part*.csv",
    "/content/drive/MyDrive/results/result_row_mlp_cv.csv/part*.csv"
]

# Read the first file
df_final = spark.read.option("header", "true").csv(paths[0])

# Merge with other files one by one
for path in paths[1:]:
    df_part = spark.read.option("header", "true").csv(path)
    df_final = df_final.union(df_part)

df_final.show(truncate=False)


+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
|Method             |Parameters                                                                                                                                                                            |Accuracy          |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
|Decision Tree (TVS)|maxDepth=7, impurity=entropy, maxBins=64, minInstancesPerNode=2, minInfoGain=0.0, minWeightFractionPerNode=0.0, maxMemoryInMB=256, cacheNodeIds=True, checkpointInterval=10           |0.6538461538461539|
|Decision Tree (CV) |maxDepth=10, impurity=gini, maxBins=64, minInstancesPerNode=1, minInfoGain=0.0, min
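
As a cross-check outside Spark, the same part files could be combined and ranked with the standard library alone. The sketch below is a swapped-in alternative, not the notebook's method. The function names are hypothetical, the demo writes small stand-in files to a temporary directory, and the Parameters values are placeholders. Only the three accuracies printed earlier in this section are reused.

```python
import csv
import glob
import os
import tempfile


def load_result_rows(result_dirs):
    """Read every part*.csv file inside each Spark output directory."""
    rows = []
    for directory in result_dirs:
        for part in sorted(glob.glob(os.path.join(directory, "part*.csv"))):
            with open(part, newline="", encoding="utf-8") as handle:
                rows.extend(csv.DictReader(handle))
    return rows


def rank_by_accuracy(rows):
    """Sort rows from highest to lowest accuracy (CSV values are strings)."""
    return sorted(rows, key=lambda row: float(row["Accuracy"]), reverse=True)


# Demo with stand-in files; the directory names mirror the ones used above.
sample = {
    "result_row_rf_cv.csv": ("Random Forest (CV)", "0.6730769230769231"),
    "result_row_mlp_tvs.csv": ("MLP (TVS)", "0.4807692307692308"),
    "result_row_mlp_cv.csv": ("MLP (CV)", "0.5"),
}

with tempfile.TemporaryDirectory() as root:
    dirs = []
    for name, (method, accuracy) in sample.items():
        directory = os.path.join(root, name)
        os.makedirs(directory)
        part_path = os.path.join(directory, "part-00000.csv")
        with open(part_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["Method", "Parameters", "Accuracy"])
            writer.writerow([method, "(placeholder)", accuracy])
        dirs.append(directory)
    ranked = rank_by_accuracy(load_result_rows(dirs))

for row in ranked:
    print(f'{row["Method"]:<20} {float(row["Accuracy"]):.4f}')
```

Converting Accuracy with float() matters here: the column is read as text, so sorting it directly would compare strings rather than numbers.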

In [None]:
# In this notebook, I applied extensive and diverse hyperparameter tuning to decision tree, random forest, and multilayer perceptron classifiers,
# tuning each classifier with both TrainValidationSplit and CrossValidator.
# To ensure reproducibility, I used fixed random seeds throughout training and splitting.
# Due to the large parameter grids and 5-fold cross-validation,
# some training sessions took up to 2 hours.
# Since it was not feasible to run all models within the same Colab runtime session,
# I saved each model's best result to Google Drive and later combined them into a single summary table.

# The models can be reached from this drive link: https://drive.google.com/drive/folders/19dl89-bdIc8U5rFM4HUBx5mWSvJfnoHr?usp=sharing
# The results can be reached from this drive link: https://drive.google.com/drive/folders/1yJnPVaDa2iLqKbcHi72Tj9q8XWj_bgKj?usp=sharing