## **In this notebook (GBT Regressor)**
I followed a step-by-step process to train and tune the model efficiently. First, I created a baseline model on a 100k-row subset with basic parameters, giving me a reference point for measuring how hyperparameter tuning affected performance. Next, I experimented with different parameter settings on this same 100k subset to identify the best configuration. Once I had the best parameters, I scaled up to a 600k-row sample to test how well these settings performed on more data.
<br>

This approach was necessary because a single run of `GBTRegressor` was taking around 11 hours, making it impractical to tune all 8 parameters directly on the 600k-row sample. By running hyperparameter tuning on 100k rows, I could find the best parameters and then apply them to 600k rows for final testing.
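To put the cost argument in rough numbers, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each of the 8 parameters gets two candidate values and that every fit costs the ~11 hours mentioned above. The names `n_params`, `values_per_param`, and `hours_per_fit` are hypothetical and not part of the notebook.

```python
from itertools import product

# Hypothetical assumptions for illustration only: 8 tunable parameters,
# two candidate values each, and ~11 hours per fit on the larger sample.
n_params = 8
values_per_param = 2
hours_per_fit = 11

# An exhaustive grid is the Cartesian product of all candidate values.
grid = list(product(range(values_per_param), repeat=n_params))
n_fits = len(grid)
total_hours = n_fits * hours_per_fit

print(f"Grid size: {n_fits} fits")
print(f"Estimated cost: {total_hours} hours (~{total_hours / 24:.0f} days)")
```

Under these assumptions an exhaustive search would take months, which is why tuning on a smaller subset is the practical choice.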

In [None]:
import importlib
import subprocess
import sys
import gc

def check_and_install_package(package_name, version=None):
    try:
        importlib.import_module(package_name)
        print(f"\n{package_name} is already installed.")
    except ImportError:
        print(f"\n{package_name} is NOT installed. Installing now...")
        if version:
            subprocess.check_call([sys.executable, "-m", "pip", "install", f"{package_name}=={version}"])
        else:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"{package_name} installation completed.")

# List of packages to check along with specific versions if necessary
packages = [
    {"name": "tqdm", "version": None},
    {"name": "pyspark", "version": "3.5.2"},
    {"name": "gdown", "version": None},
    {"name": "numpy", "version": "1.23.5"}
]

# Checking and installing the packages
for package in packages:
    check_and_install_package(package["name"], package["version"])


tqdm is already installed.

pyspark is NOT installed. Installing now...
pyspark installation completed.

gdown is already installed.

numpy is already installed.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pyspark.sql import SparkSession

# Paths to the XGBoost jar files copied into the local instance.
# Note: jar_files is defined here but is not passed to the session builder below.
jar_files = "/resources/xgboost4j_2.12-1.7.6.jar,/resources/xgboost4j-spark_2.12-1.7.6.jar"

# Building the Spark session with memory, serializer, and garbage-collection settings
spark = SparkSession.builder \
    .appName("GBTModel") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.driver.maxResultSize", "8g") \
    .config("spark.executor.memoryOverhead", "12g") \
    .config("spark.executor.cores", "5") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2047m") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4") \
    .getOrCreate()

# Verifying Spark session creation
print(f"Spark session started with version: {spark.version}")

Spark session started with version: 3.5.2


In [None]:
!cp '/content/drive/MyDrive/Big Data Analytics - Project/Datasets/Feature_Engineered_DF.parquet' /content/

output_path = '/content/Feature_Engineered_DF.parquet'
df = spark.read.parquet(output_path)
print("The Feature Engineered DataFrame has been loaded successfully.")


The Feature Engineered DataFrame has been loaded successfully.


In [None]:
# Printing the shape of the DataFrame
total_rows = df.count()
total_columns = len(df.columns)

print(f"The shape of the loaded DataFrame is: ({total_rows}, {total_columns})")

The shape of the loaded DataFrame is: (3000040, 47)


In [None]:
# Calculating the average price
avg_price = df.agg({"price": "avg"}).collect()[0][0]
print(f"Average price of a car: {round(avg_price)}")

Average price of a car: 29933


In [None]:
import pandas as pd
from IPython.display import display
import pyspark.sql.functions as F

# Converting the Spark DataFrame to a Pandas DataFrame and displaying 5 random rows with all columns
pd.set_option('display.max_columns', None)
pandas_df = df.orderBy(F.rand()).limit(5).toPandas()
display(pandas_df)


Unnamed: 0,fuel_type,body_type,city,city_fuel_economy,days_in_market,dealer_zip,engine_displacement,engine_type,exterior_color,franchise_dealer,fuel_tank_volume,height,highway_fuel_economy,horsepower,interior_color,is_new,latitude,length,listing_color,longitude,make_name,maximum_seating,model_name,price,savings_amount,seller_rating,sp_name,torque,transmission,transmission_display,wheel_system_display,wheelbase,width,manufactured_year,combined_fuel_economy,legroom,log_mileage,major_options_count,hp_x_engine_disp,hp_x_torque,listed_day,listed_month,listed_year,age,resale_value_score,maintenance_cost,luxury_score
0,Gasoline,Sedan,San Antonio,31.0,36,78249,2000.0,I4,Other,True,12.4,55.7,40.0,158.0,Gray,False,29.5802,182.3,UNKNOWN,-98.597298,Honda,5.0,Civic,16600.0,138,4.28,Gunn Honda,138.0,CVT,Continuously Variable Transmission,Front-Wheel Drive,106.3,70.8,2017,35.5,79.7,10.08,3,0.73,1.19255,6,8,2020,3,21,34,28
1,Gasoline,SUV / Crossover,Midvale,16.0,46,84047,3700.0,V6,Black,False,21.0,68.2,21.0,300.0,Black,False,40.599899,191.6,BLACK,-111.890999,Acura,7.0,MDX,11995.0,1217,4.428571,America Auto Group,270.0,A,Automatic,All-Wheel Drive,108.3,78.5,2012,18.5,79.9,11.76,7,0.34,0.02622,27,7,2020,8,17,35,33
2,Gasoline,Sedan,Tallahassee,28.0,1,32308,1800.0,I4,Red,False,13.2,57.3,36.0,132.0,Black,False,30.476999,183.1,RED,-84.236397,Toyota,5.0,Corolla,15688.0,1986,4.692307,David Lloyd Tallahassee,128.0,A,Automatic,Front-Wheel Drive,106.3,69.9,2017,32.0,83.7,10.62,0,1.14,1.65957,9,9,2020,3,26,30,28
3,Gasoline,Hatchback,Staunton,24.0,49,24401,2000.0,I4,Black,True,13.2,57.8,32.0,228.0,Black,True,38.117901,168.0,BLACK,-79.068199,Volkswagen,5.0,GTI,34475.0,0,5.0,Valley Volkswagen,258.0,Dual Clutch,7-Speed Dual Clutch,Front-Wheel Drive,103.6,70.8,2020,28.0,76.8,0.0,3,0.16,0.0148,23,7,2020,0,30,42,30
4,Hybrid,SUV / Crossover,Columbia,43.0,54,65203,2500.0,I4,Black,True,14.2,68.6,37.0,198.0,Black,True,38.9608,180.5,BLACK,-92.367599,Ford,5.0,Escape Hybrid,30805.0,0,4.26087,Joe Machens Ford Lincoln,265.22,CVT,Continuously Variable Transmission,All-Wheel Drive,106.7,85.6,2020,40.0,81.2,8.91,9,0.19,-1e-05,19,7,2020,0,26,43,34




---



# **GBT Regressor**

### **Initial Training on a Subset (100k Rows):**

I begin by training `GBTRegressor` on a subset of roughly 100k rows with fixed parameters (`maxIter` = 100, `maxDepth` = 5) to establish a baseline model. This baseline is the comparison point for every hyperparameter tuning experiment, showing whether a change improves or degrades the metrics.
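The subset is drawn with `df.sample(fraction=0.033, seed=42)`, which keeps each row independently with the given probability, so the result is only approximately 100k rows. A small sketch of the expected size, using the 3,000,040-row count reported when the DataFrame was loaded:

```python
import math

total_rows = 3_000_040   # row count reported when the DataFrame was loaded
fraction = 0.033         # fraction passed to df.sample for the ~100k subset

# Each row is kept independently with probability `fraction`, so the sample
# size follows a binomial distribution rather than being an exact count.
expected = total_rows * fraction
std_dev = math.sqrt(total_rows * fraction * (1 - fraction))

print(f"Expected sample size: {expected:,.0f} rows (std. dev. ~{std_dev:,.0f})")
```

This explains why the train and test counts reported below add up to slightly more or less than a round 100k.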

In [None]:
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.sql.functions import mean as sql_mean
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

print("Processing the data...")
with tqdm(total=5, desc="Progress") as pbar:

    df_sample = df.sample(fraction=0.033, seed=42)  # Randomly sample 100k records of the data
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

print("\n\nData preprocessing and splitting completed!")


Processing the data...


Progress: 100%|██████████| 5/5 [00:18<00:00,  3.66s/it]



Data preprocessing and splitting completed!





In [None]:
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
import time

# Starting to track overall run time
start_time = time.time()

# Model training
print("Training GBTRegressor model...")


# Using GBTRegressor with baseline parameters
gbt_regressor = GBTRegressor(
    featuresCol="scaled_features",
    labelCol="price",
    maxIter=100,
    maxDepth=5,
    seed=42,
  # parallelism=4
)

# Training the model
model = gbt_regressor.fit(train_df)

# Making predictions
print("Making predictions...")
predictions = model.transform(test_df)

# Evaluating the model
print("Evaluating the model...")
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

print(f"\nTrain size: {train_df.count():,} samples")
print(f"Test size: {test_df.count():,} samples")

print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\n\nOverall runtime: {round(total_runtime)} minutes.")

Training GBTRegressor model...
Making predictions...
Evaluating the model...

Train size: 79,346 samples
Test size: 19,771 samples


R-Squared Score (Accuracy): 78.56%


Overall runtime: 181 minutes.


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")


Additional Metrics:
Mean Absolute Error: 3604
Mean Squared Error: 76868028
Root Mean Squared Error: 8767
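For reference, here is a minimal pure-Python sketch of how these four metrics relate, using the standard textbook definitions and toy values that are not taken from the dataset. The helper name `regression_metrics` is hypothetical; the notebook itself relies on `RegressionEvaluator`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R² from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R² compares the squared error against the variance of the labels.
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Toy prices for illustration only (not from the dataset).
y_true = [10000.0, 20000.0, 30000.0, 40000.0]
y_pred = [12000.0, 18000.0, 33000.0, 39000.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# RMSE is the square root of MSE, which can be checked against the output above.
rmse_from_reported_mse = math.sqrt(76_868_028)
print(round(rmse_from_reported_mse, 2))
```

Because MSE and RMSE square the errors, they are more sensitive to a few large misses than MAE, which is worth keeping in mind when the metrics disagree about which model is best.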




---



# **Hyperparameter Tuning**

### **Hyperparameter Tuning (on the same 100k-row subset used above):**

Once I established baseline metrics, I proceeded with hyperparameter tuning on the same subset. Training on 100k rows with different parameter combinations let me evaluate the impact of each hyperparameter at a manageable cost and narrow down the most promising configurations before committing to a long run on the larger sample. Keeping the grid small also limits the risk of over-tuning to this particular subset.

In [None]:
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Starting to track overall run time
start_time = time.time()

# Data preprocessing and feature engineering
print("Processing the data...")
with tqdm(total=5, desc="Progress") as pbar:

    df_sample = df.sample(fraction=0.033, seed=42)  # Randomly sample 100k records of the data
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

print("\n\nData preprocessing and splitting completed!")

# Defining the GBT Regressor model
gbt_regressor = GBTRegressor(
    featuresCol="scaled_features",  # Using scaled features
    labelCol="price",               # Target column
    seed=42                         # Random seed
)

# Creating a ParamGridBuilder for hyperparameter tuning
param_grid = ParamGridBuilder() \
    .addGrid(gbt_regressor.maxIter, [50, 100]) \
    .addGrid(gbt_regressor.maxDepth, [5, 10]) \
    .build()

# Defining evaluators for each metric
r2_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")

# Initializing best scores and parameters
best_r2 = -float("inf")
best_mae = float("inf")
best_rmse = float("inf")
best_params_r2 = None
best_params_mae = None
best_params_rmse = None

# Manually iterate over each parameter combination and evaluate metrics
for params in param_grid:

    # Extracting the parameter names and values
    param_values = {param.name: value for param, value in params.items()}

    print(f"\nTraining model with parameters: {param_values}")

    # Using a copy to apply parameters
    model = gbt_regressor.copy(params).fit(train_df)

    # Making predictions on the test data
    predictions = model.transform(test_df)

    # Evaluating metrics
    r2 = r2_evaluator.evaluate(predictions)
    mae = mae_evaluator.evaluate(predictions)
    rmse = rmse_evaluator.evaluate(predictions)

    # Printing the metrics for this combination
    print(f"R² (Accuracy): {r2 * 100:.2f}%")
    print(f"MAE: {mae:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print("-" * 40)

    # Tracking the best scores and corresponding parameters
    if r2 > best_r2:
        best_r2 = r2
        best_params_r2 = param_values

    if mae < best_mae:
        best_mae = mae
        best_params_mae = param_values

    if rmse < best_rmse:
        best_rmse = rmse
        best_params_rmse = param_values

# Print the best model and its corresponding parameters
print(f"Best R² (Accuracy): {best_r2 * 100:.2f}% with parameters: {best_params_r2}")
print(f"Best MAE: {best_mae:.2f} with parameters: {best_params_mae}")
print(f"Best RMSE: {best_rmse:.2f} with parameters: {best_params_rmse}")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing the data...


Progress: 100%|██████████| 5/5 [00:17<00:00,  3.42s/it]




Data preprocessing and splitting completed!

Training model with parameters: {'maxIter': 50, 'maxDepth': 5}
R² (Accuracy): 75.68%
MAE: 4137.68
RMSE: 9336.50
----------------------------------------

Training model with parameters: {'maxIter': 50, 'maxDepth': 10}
R² (Accuracy): 77.32%
MAE: 3131.38
RMSE: 9017.60
----------------------------------------

Training model with parameters: {'maxIter': 100, 'maxDepth': 5}
R² (Accuracy): 78.56%
MAE: 3604.40
RMSE: 8767.44
----------------------------------------

Training model with parameters: {'maxIter': 100, 'maxDepth': 10}
R² (Accuracy): 77.61%
MAE: 3025.09
RMSE: 8957.95
----------------------------------------
Best R² (Accuracy): 78.56% with parameters: {'maxIter': 100, 'maxDepth': 5}
Best MAE: 3025.09 with parameters: {'maxIter': 100, 'maxDepth': 10}
Best RMSE: 8767.44 with parameters: {'maxIter': 100, 'maxDepth': 5}

Overall runtime: 817 minutes.




---





## **HPT Summary**

For the GBT Regressor model, I experimented with different configurations to find the best-performing parameters. The hyperparameters tested included `maxIter` (50 and 100) and `maxDepth` (5 and 10), aiming to improve accuracy and minimize error metrics.

**Best Configuration and Performance:**
- **Best Parameters:** `maxIter` = 100, `maxDepth` = 5
- **Best R² Score (Accuracy):** 78.56%
- **Best MAE:** 3025 (achieved with `maxIter` = 100, `maxDepth` = 10)
- **Best RMSE:** 8767

**Comparison to Baseline:**
The baseline GBT model used `maxIter` = 100 and `maxDepth` = 5 and achieved an R² of 78.56%, so the best configuration by R² and RMSE is the baseline itself; the grid search did not improve on it for those two metrics. Tuning did expose a trade-off: the configuration with `maxIter` = 100 and `maxDepth` = 10 lowered MAE to 3025 from the baseline's 3604, but its R² (77.61%) and RMSE (8958) were slightly worse than the baseline's 78.56% and 8767.
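The per-metric selection can be reproduced from the reported numbers alone. The sketch below copies the four results from the tuning output into a dictionary (the variable names are illustrative) and picks the best configuration for each metric.

```python
# Metrics copied from the tuning output above, keyed by (maxIter, maxDepth).
results = {
    (50, 5):   {"r2": 0.7568, "mae": 4137.68, "rmse": 9336.50},
    (50, 10):  {"r2": 0.7732, "mae": 3131.38, "rmse": 9017.60},
    (100, 5):  {"r2": 0.7856, "mae": 3604.40, "rmse": 8767.44},
    (100, 10): {"r2": 0.7761, "mae": 3025.09, "rmse": 8957.95},
}

# Higher is better for R²; lower is better for MAE and RMSE.
best_r2 = max(results, key=lambda k: results[k]["r2"])
best_mae = min(results, key=lambda k: results[k]["mae"])
best_rmse = min(results, key=lambda k: results[k]["rmse"])

print(f"Best R²:   maxIter={best_r2[0]}, maxDepth={best_r2[1]}")
print(f"Best MAE:  maxIter={best_mae[0]}, maxDepth={best_mae[1]}")
print(f"Best RMSE: maxIter={best_rmse[0]}, maxDepth={best_rmse[1]}")
```

The metrics do not agree on a single winner, so the choice of final parameters depends on which error measure matters most for the use case.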




---



# **Running with Best Parameters**

### **Scaling Up with Optimized Parameters (600k Rows):**
Once the best parameters were identified, I applied them to a 600k-row sample (20% of the loaded data) to test scalability. This step allowed me to verify performance improvements and confirm the effectiveness of the chosen parameters on the larger sample, and to compare the model before and after hyperparameter tuning.

**Best Parameters:** `maxIter` = 100, `maxDepth` = 5, `stepSize` = 0.1,
 `minInstancesPerNode` = 10, `maxBins` = 50

## **600k records**

In [None]:
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.sql.functions import mean as sql_mean
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

print("Processing the data...")
with tqdm(total=5, desc="Progress") as pbar:

    df_sample = df.sample(fraction=0.2, seed=42)  # Randomly sample ~600k records (20% of the data)
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

print("\n\nData preprocessing and splitting completed!")


Processing the data...


Progress: 100%|██████████| 5/5 [00:19<00:00,  3.83s/it]



Data preprocessing and splitting completed!





In [None]:
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
import time

# Starting to track overall run time
start_time = time.time()

# Model training
print("Training GBTRegressor model...")


# Using GBTRegressor with tuned parameters
gbt_regressor = GBTRegressor(
    featuresCol="scaled_features",
    labelCol="price",
    maxIter=100,
    maxDepth=5,
    seed=42,
    stepSize=0.1,
    minInstancesPerNode=10,
    maxBins=50
)


# Training the model
model = gbt_regressor.fit(train_df)

# Making predictions
print("Making predictions...")
predictions = model.transform(test_df)

# Evaluating the model
print("Evaluating the model...")
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

print(f"\nTrain size: {train_df.count():,} samples")
print(f"Test size: {test_df.count():,} samples")

print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\n\nOverall runtime: {round(total_runtime)} minutes.")

Training GBTRegressor model...
Making predictions...
Evaluating the model...

Train size: 480,411 samples
Test size: 120,366 samples


R-Squared Score (Accuracy): 88.82%


Overall runtime: 709 minutes.


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")


Additional Metrics:
Mean Absolute Error: 3717
Mean Squared Error: 36852583
Root Mean Squared Error: 6071




---



## **Comparison Before and After Training with the Best Hyperparameters**

### <font color='orange'>**Before**</font>
**Old parameters:** `{'maxDepth': 5, 'maxIter': 100}`
<br></br>
R-Squared Score (Accuracy): ***88.57 %***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 3649

Root Mean Squared Error: 6137



### <font color='yellow'>**After**</font>

**Best parameters:** `{'maxDepth': 5,'maxIter': 100,'stepSize': 0.1,'minInstancesPerNode': 10,'maxBins': 50}`
<br></br>
R-Squared Score (Accuracy): ***88.82 %***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 3717

Root Mean Squared Error: 6071
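To make the comparison concrete, the following sketch computes the change in each metric from the numbers listed above. The dictionary and variable names are illustrative only.

```python
# Values copied from the "Before" and "After" summaries above.
before = {"r2_pct": 88.57, "mae": 3649, "rmse": 6137}
after = {"r2_pct": 88.82, "mae": 3717, "rmse": 6071}

# Absolute change in R² (percentage points) and relative change in the errors.
r2_gain = after["r2_pct"] - before["r2_pct"]
mae_change_pct = (after["mae"] - before["mae"]) / before["mae"] * 100
rmse_change_pct = (after["rmse"] - before["rmse"]) / before["rmse"] * 100

print(f"R² change:   {r2_gain:+.2f} percentage points")
print(f"MAE change:  {mae_change_pct:+.2f}%")
print(f"RMSE change: {rmse_change_pct:+.2f}%")
```

The tuned parameters improve R² and RMSE slightly while MAE increases slightly, so the overall gain from tuning is modest.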