## **In this notebook (XGBoost)**
I followed a structured process to train and tune the `XGBoost` model efficiently. First, I created a baseline model on a `300k-row subset` with a specific set of parameters. This initial run served as a benchmark for measuring how subsequent hyperparameter tuning affected performance metrics such as R², MAE, and RMSE.

Each run on 300k rows took at least 1.25 hours, and 2-4 times longer with more demanding settings such as more boosting rounds and deeper trees. Even so, XGBoost was efficient enough for me to explore a grid of 8 parameter combinations (three hyperparameters with two values each).

By systematically adjusting each parameter, I aimed to identify the best settings that would enhance the model's accuracy and minimize error rates.
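
For reference, the three metrics named above can be written out directly. The following is a minimal pure-Python sketch using the standard definitions and made-up numbers. The function name `regression_metrics` and the sample values are illustrative assumptions, not part of this notebook, which computes these metrics with Spark's `RegressionEvaluator`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Hypothetical prices and predictions, for illustration only
actual = [10000.0, 20000.0, 30000.0, 40000.0]
predicted = [11000.0, 19000.0, 32000.0, 38000.0]
r2, mae, rmse = regression_metrics(actual, predicted)
print(f"R2: {r2:.4f}, MAE: {mae:.1f}, RMSE: {rmse:.1f}")
```

A higher R² and a lower MAE or RMSE indicate a better fit. RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.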

In [None]:
import importlib
import subprocess
import sys
import gc

def check_and_install_package(package_name, version=None):
    try:
        importlib.import_module(package_name)
        print(f"\n{package_name} is already installed.")
    except ImportError:
        print(f"\n{package_name} is NOT installed. Installing now...")
        if version:
            subprocess.check_call([sys.executable, "-m", "pip", "install", f"{package_name}=={version}"])
        else:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"{package_name} installation completed.")

# List of packages to check along with specific versions if necessary
packages = [
    {"name": "tqdm", "version": None},
    {"name": "pyspark", "version": "3.1.1"},
    {"name": "gdown", "version": None},
    {"name": "numpy", "version": "1.22.4"},
    {"name": "xgboost", "version": None},
    {"name": "sparkxgb", "version": None},
]

# Checking and installing packages
for package in packages:
    check_and_install_package(package["name"], package["version"])


tqdm is already installed.

pyspark is NOT installed. Installing now...
pyspark installation completed.

gdown is already installed.

numpy is already installed.

xgboost is NOT installed. Installing now...
xgboost installation completed.

sparkxgb is NOT installed. Installing now...
sparkxgb installation completed.


In [None]:
!pip install numpy==1.22.4



In [None]:
import numpy
print(numpy.__version__)

1.22.4


In [None]:
!pip install sparkxgb



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
import shutil

# Defining local resources directory
local_resources_path = "/resources"
os.makedirs(local_resources_path, exist_ok=True)

# Defining the source paths from your mounted Google Drive
xgboost4j_source = "/content/drive/MyDrive/Big Data Analytics - Project/resources/xgboost4j_2.12-1.7.6.jar"
xgboost4j_spark_source = "/content/drive/MyDrive/Big Data Analytics - Project/resources/xgboost4j-spark_2.12-1.7.6.jar"

# Defining the destination paths in the instance's local file system
xgboost4j_dest = os.path.join(local_resources_path, "xgboost4j_2.12-1.7.6.jar")
xgboost4j_spark_dest = os.path.join(local_resources_path, "xgboost4j-spark_2.12-1.7.6.jar")

# Copying the files from Google Drive to the local instance
shutil.copyfile(xgboost4j_source, xgboost4j_dest)
shutil.copyfile(xgboost4j_spark_source, xgboost4j_spark_dest)

# Verifying that the files are copied
print(f"Jar Files copied to: {local_resources_path}")
print(os.listdir(local_resources_path))


Jar Files copied to: /resources
['xgboost4j-spark_2.12-1.7.6.jar', 'xgboost4j_2.12-1.7.6.jar']


In [None]:
from pyspark.sql import SparkSession

# Defining the path to the copied jar files in the local instance
jar_files = "/resources/xgboost4j_2.12-1.7.6.jar,/resources/xgboost4j-spark_2.12-1.7.6.jar"

# Initializing Spark session with the JAR files
spark = SparkSession.builder \
    .appName("XGBoostRegressor") \
    .config("spark.driver.memory", "120g") \
    .config("spark.executor.memory", "120g") \
    .config("spark.driver.maxResultSize", "40g") \
    .config("spark.executor.memoryOverhead", "40g") \
    .config("spark.executor.cores", "5") \
    .config("spark.kryoserializer.buffer.max", "2047m") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.sql.shuffle.partitions", "400") \
    .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4") \
    .config("spark.jars", jar_files) \
    .getOrCreate()

# Verifying Spark session creation
print(f"Spark session started with version: {spark.version}")

Spark session started with version: 3.1.1


In [None]:
# Testing if sparkxgb is loaded properly
try:
    from sparkxgb import XGBoostRegressor

    model = XGBoostRegressor()
    print("sparkxgb loaded successfully!")
except Exception as e:
    print(f"Error loading sparkxgb: {e}")


sparkxgb loaded successfully!


In [None]:
!cp '/content/drive/MyDrive/Big Data Analytics - Project/Datasets/Feature_Engineered_DF.parquet' /content/

output_path = '/content/Feature_Engineered_DF.parquet'
df = spark.read.parquet(output_path)
print("The Feature Engineered DataFrame has been loaded successfully.")


The Feature Engineered DataFrame has been loaded successfully.


In [None]:
# Printing the shape of the DataFrame
total_rows = df.count()
total_columns = len(df.columns)

print(f"The shape of the loaded DataFrame is: ({total_rows}, {total_columns})")

The shape of the loaded DataFrame is: (3000040, 47)


In [None]:
# Calculating the average price
avg_price = df.agg({"price": "avg"}).collect()[0][0]
print(f"Average price of a car: {round(avg_price)}")

Average price of a car: 29933


In [None]:
import pandas as pd
from IPython.display import display
import pyspark.sql.functions as F

# Converting 5 randomly selected rows of the Spark DataFrame to a Pandas DataFrame and displaying them
pd.set_option('display.max_columns', None)
pandas_df = df.orderBy(F.rand()).limit(5).toPandas()
display(pandas_df)


Unnamed: 0,fuel_type,body_type,city,city_fuel_economy,days_in_market,dealer_zip,engine_displacement,engine_type,exterior_color,franchise_dealer,fuel_tank_volume,height,highway_fuel_economy,horsepower,interior_color,is_new,latitude,length,listing_color,longitude,make_name,maximum_seating,model_name,price,savings_amount,seller_rating,sp_name,torque,transmission,transmission_display,wheel_system_display,wheelbase,width,manufactured_year,combined_fuel_economy,legroom,log_mileage,major_options_count,hp_x_engine_disp,hp_x_torque,listed_day,listed_month,listed_year,age,resale_value_score,maintenance_cost,luxury_score
0,Gasoline,Sedan,Venice,26.0,12,34285,1800.0,I4,Red,True,13.2,57.7,34.0,132.0,Other,False,27.0926,178.7,RED,-82.432999,Toyota,5.0,Corolla,7980.0,821,2.909091,Nissan of Venice,128.0,A,4-Speed Automatic,Front-Wheel Drive,102.4,69.3,2010,30.0,78.0,11.06,0,1.14,1.65957,30,8,2020,10,20,27,24
1,Gasoline,SUV / Crossover,Texas City,21.0,19,77590,2300.0,I4,White,True,19.200001,69.9,28.0,300.0,White,True,29.395201,198.8,WHITE,-94.931702,Ford,7.0,Explorer,42700.0,0,4.25,Cook Ford,265.22,A,Automatic,Rear-Wheel Drive,119.1,89.3,2020,24.5,82.0,1.1,6,-0.29,1e-05,23,8,2020,0,34,40,37
2,Gasoline,SUV / Crossover,Knoxville,22.690001,50,37923,5700.0,Gasoline engine,Black,True,24.6,69.4,29.469999,360.0,Mixed Colors,False,35.9175,189.8,BLACK,-84.073601,Jeep,5.0,Grand Cherokee,16999.0,3409,4.1,Grayson Hyundai Subaru,390.0,A,8-Speed Automatic,Four-Wheel Drive,114.8,84.8,2015,26.08,78.9,11.63,8,2.69,1.46753,22,7,2020,5,20,39,35
3,Gasoline,SUV / Crossover,Murfreesboro,26.0,76,37129,1200.0,I3,Other,True,13.2,64.1,30.0,137.0,Gray,True,35.866199,171.4,UNKNOWN,-86.458702,Buick,5.0,Encore GX,21963.0,0,4.366667,Chevrolet Buick Cadillac GMC of Murfreesboro,162.0,CVT,Continuously Variable Transmission,Front-Wheel Drive,102.2,71.4,2020,28.0,76.9,8.91,4,1.67,1.19436,27,6,2020,0,24,35,30
4,Gasoline,SUV / Crossover,Houston,21.0,22,77034,2300.0,I4,Black,True,19.200001,69.9,28.0,300.0,Black,True,29.6222,198.8,BLACK,-95.222099,Ford,7.0,Explorer,36806.0,0,4.490566,AutoNation Ford Gulf Freeway,265.22,A,Automatic,Rear-Wheel Drive,119.1,89.3,2020,24.5,82.0,2.3,3,-0.29,1e-05,20,8,2020,0,31,40,36




---



# **XGBoost**

### **Initial Training on a Subset (300k Rows):**

This is the baseline model, trained after feature engineering.

I began by training XGBoost on a 300k-row subset with a fixed set of parameters. This baseline is the comparison point for every hyperparameter tuning experiment: each experiment is judged by whether it raises or lowers the metrics relative to the baseline.
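
The subset is only approximately 300k rows because `DataFrame.sample(fraction=...)` keeps each row independently with the given probability rather than taking an exact count. The sketch below simulates that behavior in plain Python, using the 3,000,040-row total reported earlier. It is a simplified illustration with a hypothetical helper `bernoulli_sample_size`, not Spark's actual sampler.

```python
import random

def bernoulli_sample_size(n_rows, fraction, seed):
    """Count how many of n_rows are kept when each is kept with probability `fraction`."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_rows) if rng.random() < fraction)

total_rows = 3_000_040   # row count of the full DataFrame
fraction = 0.1           # same fraction as df.sample(fraction=0.1, ...)

expected = total_rows * fraction
simulated = bernoulli_sample_size(total_rows, fraction, seed=42)

print(f"Expected sample size: {expected:.0f}")
print(f"Simulated sample size: {simulated}")
# An 80/20 split of the sample is likewise approximate
print(f"Approximate train/test sizes: {0.8 * expected:.0f} / {0.2 * expected:.0f}")
```

The same reasoning applies to `randomSplit([0.8, 0.2])`, so the train and test sizes will be close to, but usually not exactly, 80% and 20% of the sample.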

In [None]:
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.sql.functions import mean as sql_mean
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

print("Processing the data...")
with tqdm(total=6, desc="Progress") as pbar:

    df_sample = df.sample(fraction=0.1, seed=42)   # Randomly sampling 10% of the data
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Filling in missing values
    for col in df_sample.columns:
        if df_sample.schema[col].dataType.typeName() in ["double", "float", "int", "long"]:
            mean_value = df_sample.select(sql_mean(col)).first()[0]
            df_sample = df_sample.na.fill({col: mean_value})
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

print("\n\nData preprocessing and splitting completed!")


Processing the data...


Progress: 100%|██████████| 6/6 [00:34<00:00,  5.79s/it]



Data preprocessing and splitting completed!





In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
from sparkxgb import XGBoostRegressor
import time

# Model training
print("Training XGBoost model...")

xgb_regressor = XGBoostRegressor(
    featuresCol="scaled_features",
    labelCol="price",
    maxDepth=6,
    numRound=100,
    objective="reg:squarederror",
    treeMethod="hist",
)


# Before training
start_time = time.time()

# Training the model
model = xgb_regressor.fit(train_df)

# Making predictions
print("Making predictions...")
predictions = model.transform(test_df)

# Evaluating the model
print("Evaluating the model...")
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

print(f"\nTrain size: {train_df.count()} samples")
print(f"Test size: {test_df.count()} samples")
print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%\n")  # two decimals, matching the reported 91.84%

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")

Training XGBoost model...
Making predictions...
Evaluating the model...

Train size: 240,048 samples
Test size: 59,933 samples

R-Squared Score (Accuracy): 91.84%

Overall runtime: 75 minutes.



In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\n\nOverall runtime: {round(total_runtime)} minutes.")

Additional Metrics:
Mean Absolute Error: 3018
Mean Squared Error: 27751838
Root Mean Squared Error: 5268




---



## **Hyperparameter Tuning**

### **Hyperparameter Tuning on 300k Rows:**

Once I had established the baseline metrics, I proceeded with hyperparameter tuning on the same subset. Training on 300k rows with different parameter combinations let me evaluate the impact of each hyperparameter. This step was crucial for narrowing down the most promising configurations, meaning those with higher R² and lower RMSE/MAE.
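
To see how large the search is before launching it, the grid can be enumerated in plain Python. This sketch mirrors the three-parameter, two-value grid defined in the tuning cell and multiplies the number of combinations by the 1.25-hour minimum per run mentioned in the introduction. The variable names are illustrative assumptions.

```python
from itertools import product

# Same values as the ParamGridBuilder grid in this notebook
grid = {
    "maxDepth": [6, 8],
    "numRound": [100, 200],
    "eta": [0.1, 0.05],
}

# Cartesian product of all value lists, one dict per combination
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

for combo in combinations:
    print(combo)

min_hours_per_run = 1.25  # lower bound per run quoted in the introduction
print(f"\n{len(combinations)} combinations, at least "
      f"{len(combinations) * min_hours_per_run:.1f} hours in total")
```

Because the total is a lower bound, a single cell that runs the whole grid can easily exceed a hosted session's time limit.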

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
from sparkxgb import XGBoostRegressor
from pyspark.ml.tuning import ParamGridBuilder
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Start tracking overall runtime
start_time = time.time()

# Data preprocessing and feature engineering
print("Processing the data...")
with tqdm(total=6, desc="Progress") as pbar:  # six pbar.update(1) calls follow

    df_sample = df.sample(fraction=0.1, seed=42)  # Randomly sampling 10% of the data (about 300k rows)
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Filling in missing values
    for col in df_sample.columns:
        if df_sample.schema[col].dataType.typeName() in ["double", "float", "int", "long"]:
            mean_value = df_sample.select(F.mean(col)).first()[0]
            df_sample = df_sample.na.fill({col: mean_value})
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

print("\nData preprocessing and splitting completed!")

# Defining the XGBoost Regressor model
xgb_regressor = XGBoostRegressor(
    featuresCol="scaled_features",  # Using the scaled features
    labelCol="price",               # Target column
    objective="reg:squarederror",   # Regression task
    treeMethod="hist",              # Tree construction algorithm
    seed=42                         # Random seed
)

# Creating a ParamGridBuilder for hyperparameter tuning
param_grid = ParamGridBuilder() \
    .addGrid(xgb_regressor.maxDepth, [6, 8]) \
    .addGrid(xgb_regressor.numRound, [100, 200]) \
    .addGrid(xgb_regressor.eta, [0.1, 0.05]) \
    .build()

# Defining evaluators for each metric
r2_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")

# Initializing best scores and parameters
best_r2 = -float("inf")
best_mae = float("inf")
best_rmse = float("inf")
best_params_r2 = None
best_params_mae = None
best_params_rmse = None
print('---------------------------------------------------------------------------')

# Manually iterating over each parameter combination and evaluating metrics
for params in param_grid:

    # Extracting the parameter names and values
    param_values = {param.name: value for param, value in params.items()}

    print(f"\nTraining model with parameters: {param_values}")

    # Using copy to apply parameters
    model = xgb_regressor.copy(params).fit(train_df)

    # Making predictions on the test data
    predictions = model.transform(test_df)

    # Evaluating metrics
    r2 = r2_evaluator.evaluate(predictions)
    mae = mae_evaluator.evaluate(predictions)
    rmse = rmse_evaluator.evaluate(predictions)

    # Printing the metrics for this combination
    print(f"R² (Accuracy): {r2 * 100:.2f}%")
    print(f"MAE: {mae:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print("-" * 40)

    # Tracking the best scores and corresponding parameters
    if r2 > best_r2:
        best_r2 = r2
        best_params_r2 = param_values

    if mae < best_mae:
        best_mae = mae
        best_params_mae = param_values

    if rmse < best_rmse:
        best_rmse = rmse
        best_params_rmse = param_values

# Printing the best model and its corresponding parameters
print(f"Best R² (Accuracy): {best_r2 * 100:.2f}% with parameters: {best_params_r2}")
print(f"Best MAE: {best_mae:.2f} with parameters: {best_params_mae}")
print(f"Best RMSE: {best_rmse:.2f} with parameters: {best_params_rmse}")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing the data...


Progress: 6it [00:42,  7.16s/it]                       



Data preprocessing and splitting completed!
---------------------------------------------------------------------------

Training model with parameters: {'maxDepth': 6, 'numRound': 100, 'eta': 0.1}
R² (Accuracy): 90.63%
MAE: 3371.68
RMSE: 5644.62
----------------------------------------

Training model with parameters: {'maxDepth': 6, 'numRound': 100, 'eta': 0.05}
R² (Accuracy): 88.75%
MAE: 3739.64
RMSE: 6185.30
----------------------------------------

Training model with parameters: {'maxDepth': 6, 'numRound': 200, 'eta': 0.1}
R² (Accuracy): 91.75%
MAE: 3091.11
RMSE: 5294.57
----------------------------------------

Training model with parameters: {'maxDepth': 6, 'numRound': 200, 'eta': 0.05}
R² (Accuracy): 90.73%
MAE: 3360.87
RMSE: 5612.34
----------------------------------------

Training model with parameters: {'maxDepth': 8, 'numRound': 100, 'eta': 0.1}
R² (Accuracy): 92.06%
MAE: 2966.11
RMSE: 5196.14
----------------------------------------

Training model with parameters: {'ma



---



#### The cell above stopped because it timed out, so I manually ran the last parameter combination (`maxDepth=8`, `numRound=200`, `eta=0.05`), which did not complete in that run.
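
One way to avoid losing progress in a long grid search is to write each result to disk as soon as it is available and skip finished combinations on restart. The following is a minimal sketch of that idea, not the approach used in this notebook: `run_grid_with_checkpoint`, `fake_train_and_score`, and the file name `grid_results.json` are hypothetical, and the scoring function is a stand-in for fitting and evaluating a model.

```python
import json
import os
import tempfile

def run_grid_with_checkpoint(combinations, train_and_score, path):
    """Evaluate each combination once, persisting results after every run."""
    # Load results from a previous (possibly interrupted) session
    results = {}
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)

    for combo in combinations:
        key = json.dumps(combo, sort_keys=True)  # stable string key
        if key in results:
            continue  # already evaluated before the interruption
        results[key] = train_and_score(combo)
        with open(path, "w") as f:
            json.dump(results, f, indent=2)
    return results

# Stand-in for model fitting; records which combinations were actually run
calls = []
def fake_train_and_score(combo):
    calls.append(combo)
    return {"r2": 0.0}  # placeholder metric, not a real result

combos = [{"maxDepth": 6, "eta": 0.1}, {"maxDepth": 8, "eta": 0.1}]
checkpoint = os.path.join(tempfile.mkdtemp(), "grid_results.json")

run_grid_with_checkpoint(combos[:1], fake_train_and_score, checkpoint)  # "interrupted" run
results = run_grid_with_checkpoint(combos, fake_train_and_score, checkpoint)  # resumed run
print(f"{len(results)} results stored, {len(calls)} training calls made")
```

In a hosted environment the checkpoint file would need to live on storage that survives the session, for example a mounted drive.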

In [None]:
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.sql.functions import mean as sql_mean
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

print("Processing the data...")
with tqdm(total=6, desc="Progress") as pbar:  # six pbar.update(1) calls follow

    df_sample = df.sample(fraction=0.1, seed=42)  # Randomly sampling 10% of the data (about 300k rows)
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Filling in missing values
    for col in df_sample.columns:
        if df_sample.schema[col].dataType.typeName() in ["double", "float", "int", "long"]:
            mean_value = df_sample.select(sql_mean(col)).first()[0]
            df_sample = df_sample.na.fill({col: mean_value})
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

print("\n\nData preprocessing and splitting completed!")


Processing the data...


Progress: 6it [00:35,  5.84s/it]                       



Data preprocessing and splitting completed!





In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
from sparkxgb import XGBoostRegressor
import time

# Model training
print("Training XGBoost model...")

xgb_regressor = XGBoostRegressor(
    featuresCol="scaled_features",
    labelCol="price",
    maxDepth=8,
    eta=0.05,
    numRound=200,
    objective="reg:squarederror",
    treeMethod="hist",
)


# Before training
start_time = time.time()

# Training the model
model = xgb_regressor.fit(train_df)

# Making predictions
print("Making predictions...")
predictions = model.transform(test_df)

# Evaluating the model
print("Evaluating the model...")
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

print(f"\nTrain size: {train_df.count()} samples")
print(f"Test size: {test_df.count()} samples")
print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\n\nOverall runtime: {round(total_runtime)} minutes.")

Training XGBoost model...
Making predictions...
Evaluating the model...

Train size: 240048 samples
Test size: 59933 samples


R-Squared Score (Accuracy): 91.98%


Overall runtime: 121 minutes.


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")

Additional Metrics:
Mean Absolute Error: 2973
Mean Squared Error: 27269544
Root Mean Squared Error: 5222




---



## **Comparison before and after training with `Best Hyperparameters`**


### <font color='orange'>**Before**</font>
**Old parameters  :** `{'maxDepth': 6, 'numRound': 100}`
<br></br>
R-Squared Score (Accuracy): ***91.84 %***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 3018

Root Mean Squared Error: 5268



### <font color='yellow'>**After**</font>
**Best parameters**  : `{'maxDepth': 8, 'numRound': 200, 'eta': 0.1}`
<br></br>
R-Squared Score (Accuracy): ***92.75 %***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 2759

Root Mean Squared Error: 4964
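
The size of the change is easier to judge as differences relative to the baseline. Here is a small sketch using the figures listed above; the dictionary layout and variable names are illustrative assumptions.

```python
# Figures copied from the Before/After comparison above
before = {"r2_pct": 91.84, "mae": 3018, "rmse": 5268}
after = {"r2_pct": 92.75, "mae": 2759, "rmse": 4964}

# R² changes in percentage points; errors change in absolute and relative terms
r2_gain = after["r2_pct"] - before["r2_pct"]
mae_drop = before["mae"] - after["mae"]
rmse_drop = before["rmse"] - after["rmse"]

print(f"R² gain: {r2_gain:.2f} percentage points")
print(f"MAE reduction: {mae_drop} ({mae_drop / before['mae']:.1%})")
print(f"RMSE reduction: {rmse_drop} ({rmse_drop / before['rmse']:.1%})")
```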