## **In this notebook (Random Forest)**
I followed a step-by-step process to train and tune each model efficiently. First, I created a baseline model on a 100k-row subset with basic parameters, giving me a reference point for measuring the effect of hyperparameter tuning. Next, I tried different parameter settings on the same 100k subset to identify the best configuration. Finally, I applied the best parameters to a larger 600k-row sample (about 20% of the roughly 3-million-row dataset) to test how well they held up at scale.
<br>

This approach was necessary because a single run of `Random Forest` on 600k rows took around 9 hours, making it impractical to try all 8 parameter combinations directly at that scale. By running hyperparameter tuning on 100k rows, I could find the best parameters and then apply them to 600k rows for final testing.
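To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes every configuration would cost about as much as the roughly 9-hour run mentioned above, which is only an approximation since deeper or larger forests take longer. The row count is the one reported when the DataFrame is loaded later in this notebook, and the 2 × 2 × 2 grid is the one tested in the tuning section.

```python
import math

TOTAL_ROWS = 3_000_040      # shape reported after loading the DataFrame
HOURS_PER_LARGE_RUN = 9     # approximate figure taken from the text above

# Two candidate values for each of the three tuned hyperparameters
values_per_param = {"numTrees": 2, "maxDepth": 2, "minInstancesPerNode": 2}
n_combinations = math.prod(values_per_param.values())

# Expected sample sizes for the two sampling fractions used in this notebook
rows_tuning = round(TOTAL_ROWS * 0.033)
rows_final = round(TOTAL_ROWS * 0.2)

# Rough cost of running the whole grid at the larger scale
hours_for_full_grid = n_combinations * HOURS_PER_LARGE_RUN

print(f"{n_combinations} combinations")
print(f"~{rows_tuning:,} rows for tuning, ~{rows_final:,} rows for the final run")
print(f"~{hours_for_full_grid} hours to tune directly on the larger sample")
```

Because `DataFrame.sample` draws rows independently, the actual sample sizes differ slightly from these expected values.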

In [None]:
import importlib
import subprocess
import sys
import gc

def check_and_install_package(package_name):
    try:
        importlib.import_module(package_name)
        print(f"\n{package_name} is already installed.")
    except ImportError:
        print(f"\n{package_name} is NOT installed. Installing now...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"{package_name} installation completed.")

# List of packages to check
packages = [
    "tqdm",
    "dask",
    "nltk",
    "scikit-learn",
    "numpy",
    "pyspark",
    "gdown"
]

# Checking and installing the packages
for package in packages:
    check_and_install_package(package)



tqdm is already installed.

dask is already installed.

nltk is already installed.

scikit-learn is NOT installed. Installing now...
scikit-learn installation completed.

numpy is already installed.

pyspark is already installed.

gdown is already installed.
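
The output above reports `scikit-learn` as not installed because the pip distribution name differs from the module name (`sklearn`), so `importlib.import_module("scikit-learn")` raises `ImportError` on every run. One possible way around this, sketched below, is to keep a small mapping from pip names to import names. The `IMPORT_NAMES` dictionary and the `is_importable` helper are illustrative names, not part of the original cell.

```python
import importlib.util

# Hypothetical mapping for packages whose pip name differs from the import name
IMPORT_NAMES = {"scikit-learn": "sklearn"}

def is_importable(package_name):
    """Return True if the module behind a pip package name can be found."""
    module_name = IMPORT_NAMES.get(package_name, package_name)
    # find_spec returns None when no top-level module with that name exists
    return importlib.util.find_spec(module_name) is not None

for name in ["json", "scikit-learn", "definitely_not_a_real_pkg_xyz"]:
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

A check like this only looks the module up without importing it, so it avoids the side effects of loading heavy packages just to test for their presence.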


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RandomForestModel") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.driver.maxResultSize", "8g") \
    .config("spark.executor.memoryOverhead", "12g") \
    .config("spark.executor.cores", "5") \
    .config("spark.kryoserializer.buffer.max", "2047m") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4") \
    .getOrCreate()

# Verifying Spark session creation
print(f"Spark session started with version: {spark.version}")

Spark session started with version: 3.5.3


In [None]:
!cp '/content/drive/MyDrive/Big Data Analytics - Project/Datasets/Feature_Engineered_DF.parquet' /content/

output_path = '/content/Feature_Engineered_DF.parquet'
df = spark.read.parquet(output_path)
print("The Feature Engineered DataFrame has been loaded successfully.")


The Feature Engineered DataFrame has been loaded successfully.


In [None]:
# Printing the shape of the DataFrame
total_rows = df.count()
total_columns = len(df.columns)

print(f"The shape of the loaded DataFrame is: ({total_rows}, {total_columns})")

The shape of the loaded DataFrame is: (3000040, 47)


In [None]:
# Calculating the average price
avg_price = df.agg({"price": "avg"}).collect()[0][0]
print(f"Average price of a car: {round(avg_price)}")

Average price of a car: 29933


In [None]:
import pandas as pd
from IPython.display import display
import pyspark.sql.functions as F

# Converting 5 randomly ordered rows of the Spark DataFrame to a Pandas DataFrame and displaying them
pd.set_option('display.max_columns', None)
pandas_df = df.orderBy(F.rand()).limit(5).toPandas()
display(pandas_df)


Unnamed: 0,fuel_type,body_type,city,city_fuel_economy,days_in_market,dealer_zip,engine_displacement,engine_type,exterior_color,franchise_dealer,fuel_tank_volume,height,highway_fuel_economy,horsepower,interior_color,is_new,latitude,length,listing_color,longitude,make_name,maximum_seating,model_name,price,savings_amount,seller_rating,sp_name,torque,transmission,transmission_display,wheel_system_display,wheelbase,width,manufactured_year,combined_fuel_economy,legroom,log_mileage,major_options_count,hp_x_engine_disp,hp_x_torque,listed_day,listed_month,listed_year,age,resale_value_score,maintenance_cost,luxury_score
0,Gasoline,SUV / Crossover,North Dartmouth,21.0,18,2747,2000.0,I4,Blue,True,17.4,64.5,28.0,241.0,Mixed Colors,False,41.6399,183.3,BLUE,-71.005402,Mercedes-Benz,5.0,GLC-Class,37635.0,276,3.263158,Dartmouth Nissan,273.0,A,Automatic,All-Wheel Drive,113.1,82.5,2018,24.5,78.1,9.28,5,0.05,-0.00537,22,8,2020,2,29,40,32
1,Gasoline,Sedan,Antioch,33.0,19,37013,1600.0,I4,Red,True,11.9,57.1,41.0,120.0,Black,True,36.043598,172.6,RED,-86.657501,Kia,5.0,Rio,17202.0,0,4.636364,Greenway Kia Hickory Hollow,112.0,CVT,Continuously Variable Transmission,Front-Wheel Drive,101.6,67.9,2020,37.0,75.6,0.69,3,1.48,2.04545,23,8,2020,0,32,35,30
2,Gasoline,Sedan,San Antonio,29.0,168,78233,2000.0,I4,Other,True,12.4,56.9,39.0,149.0,Black,True,29.526501,182.7,UNKNOWN,-98.393097,Nissan,5.0,Sentra,19418.0,0,4.15942,World Car Nissan Hyundai,146.0,CVT,Continuously Variable Transmission,Front-Wheel Drive,106.8,71.5,2020,34.0,81.4,1.1,6,0.8,1.22982,27,3,2020,0,28,37,31
3,Gasoline,Pickup Truck,Apopka,16.0,61,32703,3500.0,V6,Red,True,26.0,77.2,22.0,375.0,Black,True,28.672899,231.9,RED,-81.481102,Ford,6.0,F-150,56199.0,0,4.285714,Mullinax Ford of Central Florida,400.0,A,Automatic,Four-Wheel Drive,145.0,96.8,2020,19.0,87.5,3.91,14,0.62,1.79666,11,7,2020,0,29,43,37
4,Gasoline,Sedan,Jupiter,16.0,67,33458,4400.0,V8,Black,True,18.0,57.8,25.0,456.0,Other,False,26.934799,195.4,BLACK,-80.122498,BMW,5.0,5 Series,51876.0,1139,4.777778,Braman BMW Jupiter,480.0,A,8-Speed Automatic,All-Wheel Drive,117.1,83.7,2018,20.5,77.9,10.81,12,2.63,4.68331,5,7,2020,2,24,47,38




---



# **Random Forest Regressor**

### **Initial Training on a Subset (100k Rows):**

I begin by training Random Forest on a subset of 100k rows with specific parameters, establishing a baseline model. This baseline serves as a comparison point, allowing me to evaluate how each hyperparameter tuning experiment impacts performance, either increasing or decreasing metrics relative to the baseline. The initial 100k subset is thus used specifically for benchmarking improvements.

In [None]:
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.sql.functions import log
import pyspark.sql.functions as F
import time

# Ignore warnings
warnings.filterwarnings('ignore')

# Starting to track overall runtime
start_time = time.time()

with tqdm(total=7, desc="Processing and Training") as pbar:

    df_sample = df.sample(fraction=0.033, seed=42)  # Random sampling 100k records
    pbar.update(1)

    # Removing rows where 'price' is <= 0 (to avoid issues with log transformation)
    df_sample = df_sample.filter(F.col("price") > 0)

    # Log transforming the target variable
    df_sample = df_sample.withColumn("log_price", log("price"))
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn("franchise_dealer", F.col("franchise_dealer").cast("int"))

    # Assembling features (ensure all columns used in 'VectorAssembler' are numeric)
    num_columns = [col for col in df_sample.columns if col not in ['price', 'log_price'] + cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline (to scale the assembled feature vectors)
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    try:
        pipeline_model = pipeline.fit(df_sample)
        df_sample = pipeline_model.transform(df_sample)
        pbar.update(1)
    except Exception as e:
        print(f"Error during pipeline fit: {e}")
        pbar.update(1)

    # Splitting the data into training and test sets
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

    # Defining the RandomForestRegressor model
    rf = RandomForestRegressor(
      featuresCol="scaled_features",
      labelCol="log_price",
      numTrees=50,
      maxDepth=10,
      minInstancesPerNode=10,
      seed=42
    )

    # Fitting the model to the training data
    try:
        model = rf.fit(train_df)
        pbar.update(1)
    except Exception as e:
        print(f"Error during model training: {e}")
        pbar.update(1)

# Making predictions
print("Making predictions...")
try:
    predictions = model.transform(test_df)
    predictions = predictions.withColumn("exp_prediction", F.exp("prediction"))
except Exception as e:
    print(f"Error during prediction: {e}")

# Evaluating the model
print("Evaluating the model...")
try:
    evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="r2")
    r2 = evaluator.evaluate(predictions)
    print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")
except Exception as e:
    print(f"Error during evaluation: {e}")

# Displaying results
print(f"\n\nTrain size: {train_df.count():,} samples")
print(f"Test size: {test_df.count():,} samples")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing and Training: 100%|██████████| 7/7 [50:48<00:00, 435.51s/it]


Making predictions...
Evaluating the model...


R-Squared Score (Accuracy): 75.25%


Train size: 79,346 samples
Test size: 19,771 samples

Overall runtime: 52 minutes.
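
The cell above trains on `log_price` and applies `F.exp` to the predictions before scoring them against `price`. A minimal plain-Python sketch of that round trip follows. The constant log-space error of 0.05 is made up purely for illustration, and the prices are taken from the row preview earlier in the notebook.

```python
import math

prices = [17202.0, 37635.0, 56199.0]   # sample prices from the preview above

# Training target: natural log of the price (only defined for price > 0)
log_prices = [math.log(p) for p in prices]

# Suppose a model is off by +0.05 in log space for every car (made-up error)
log_predictions = [lp + 0.05 for lp in log_prices]

# Back-transform to the original price scale before computing R², MAE, or RMSE
predictions = [math.exp(lp) for lp in log_predictions]

for price, pred in zip(prices, predictions):
    print(f"actual={price:>9.0f}  predicted={pred:>9.0f}  ratio={pred / price:.4f}")
```

A constant error in log space turns into a constant percentage error on the price scale, which is why the `price > 0` filter and the back-transform both matter for the reported metrics.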


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")

Additional Metrics:
Mean Absolute Error: 3517
Mean Squared Error: 88738726
Root Mean Squared Error: 9420
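
For reference, the four metrics reported above can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not Spark's implementation, and the `regression_metrics` helper and the toy values are illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R² from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R² compares the squared error against the variance around the mean
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(metrics)
```

The definitions also explain the relationship between the last two numbers above: the square root of the reported MSE (88,738,726) is about 9,420, the reported RMSE.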




---



# **Hyperparameter Tuning**

### **Hyperparameter Tuning (on the same 100k-row subset used above):**

Once I had baseline metrics, I tuned hyperparameters on the same subset. Training on 100k rows with different parameter combinations let me measure the impact of each hyperparameter. This step narrowed down the most promising configurations, and keeping the grid small limited the risk of overfitting through excessive tuning.
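
The grid used in the next cell has two values for each of three hyperparameters, so it expands to eight configurations. The sketch below enumerates the same grid with `itertools.product` from the standard library as a stand-in for `ParamGridBuilder`. It only illustrates the expansion and does not train anything.

```python
from itertools import product

grid = {
    "numTrees": [50, 100],
    "maxDepth": [10, 20],
    "minInstancesPerNode": [10, 20],
}

# Cartesian product of all candidate values, one dict per configuration
combinations = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]

for i, params in enumerate(combinations, start=1):
    print(i, params)
```

Each added hyperparameter value multiplies the number of training runs, which is why the grid is kept to two values per parameter here.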

In [None]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Start to track overall runtime
start_time = time.time()

with tqdm(total=6, desc="Processing and Training") as pbar:
    df_sample = df.sample(fraction=0.033, seed=42)  # Randomly sample 100k records of the data
    pbar.update(1)

    # Remove rows where 'price' is <= 0 (to avoid issues with log transformation)
    df_sample = df_sample.filter(F.col("price") > 0)

    # Log transforming the target variable
    df_sample = df_sample.withColumn("log_price", F.log("price"))
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn("franchise_dealer", F.col("franchise_dealer").cast("int"))

    # Assembling features
    num_columns = [col for col in df_sample.columns if col not in ['price', 'log_price'] + cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data into training and test sets
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

# Defining hyperparameters for Random Forest
param_grid = ParamGridBuilder() \
    .addGrid(RandomForestRegressor.numTrees, [50, 100]) \
    .addGrid(RandomForestRegressor.maxDepth, [10,20]) \
    .addGrid(RandomForestRegressor.minInstancesPerNode, [10,20]) \
    .build()

# Initializing best scores and parameters
best_r2 = -float("inf")
best_mae = float("inf")
best_rmse = float("inf")
best_params_r2 = None
best_params_mae = None
best_params_rmse = None

# Manually iterating over each parameter combination and evaluating metrics
for params in param_grid:

    # Extracting the parameter names and values
    param_values = {param.name: value for param, value in params.items()}

    print(f"\nTraining Random Forest model with parameters: {param_values}")

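    # Defining the RandomForestRegressor model with the current hyperparameters
    # Note: unlike the baseline cell, no seed is set here, so metrics for
    # identical parameters may differ slightly from the baseline run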
    rf = RandomForestRegressor(
        featuresCol="scaled_features",
        labelCol="log_price",
        **param_values
    )

    # Fitting the model to the training data
    try:
        model = rf.fit(train_df)
    except Exception as e:
        print(f"Error during model training: {e}")
        continue

    # Making predictions
    try:
        predictions = model.transform(test_df)
        predictions = predictions.withColumn("exp_prediction", F.exp("prediction"))
    except Exception as e:
        print(f"Error during prediction: {e}")
        continue

    # Evaluating the model
    try:
        evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="r2")
        r2 = evaluator.evaluate(predictions)
        print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")
    except Exception as e:
        print(f"Error during evaluation: {e}")
        continue

    # Calculating additional metrics
    mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="mae")
    mae = mae_evaluator.evaluate(predictions)

    mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="mse")
    mse = mse_evaluator.evaluate(predictions)

    rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="rmse")
    rmse = rmse_evaluator.evaluate(predictions)

    print("Additional Metrics:")
    print(f"Mean Absolute Error: {round(mae)}")
    print(f"Mean Squared Error: {round(mse)}")
    print(f"Root Mean Squared Error: {round(rmse)}")
    print("-" * 40)

    # Tracking the best scores and corresponding parameters
    if r2 > best_r2:
        best_r2 = r2
        best_params_r2 = param_values

    if mae < best_mae:
        best_mae = mae
        best_params_mae = param_values

    if rmse < best_rmse:
        best_rmse = rmse
        best_params_rmse = param_values

# Printing the best model and its corresponding parameters
print(f"\nBest R² (Accuracy): {best_r2 * 100:.2f}% with parameters: {best_params_r2}")
print(f"Best MAE: {best_mae:.2f} with parameters: {best_params_mae}")
print(f"Best RMSE: {best_rmse:.2f} with parameters: {best_params_rmse}")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing and Training: 100%|██████████| 6/6 [00:08<00:00,  1.49s/it]

Training Random Forest model with parameters: {'numTrees': 50, 'maxDepth': 10, 'minInstancesPerNode': 10}

R-Squared Score (Accuracy): 75.71%
Additional Metrics:
Mean Absolute Error: 3504
Mean Squared Error: 87085658
Root Mean Squared Error: 9332
----------------------------------------

Training Random Forest model with parameters: {'numTrees': 50, 'maxDepth': 10, 'minInstancesPerNode': 20}

R-Squared Score (Accuracy): 74.52%
Additional Metrics:
Mean Absolute Error: 3538
Mean Squared Error: 91340987
Root Mean Squared Error: 9557

----------------------------------------

Training Random Forest model with parameters: {'numTrees': 50, 'maxDepth': 20, 'minInstancesPerNode': 10}

R-Squared Score (Accuracy): 79.30%
Additional Metrics:
Mean Absolute Error: 2689
Mean Squared Error: 74216693
Root Mean Squared Error: 8615
----------------------------------------

Training Random Forest model with parameters: {'numTrees': 5



---




## **Summary of HPT on 100k rows**

I experimented with various configurations of the Random Forest model to identify the best-performing parameters. The parameters tested included `numTrees` (50 and 100), `maxDepth` (10 and 20), and `minInstancesPerNode` (10 and 20).

**Best Configuration and Performance:**
- **Best Parameters:** `numTrees` = 100, `maxDepth` = 20, `minInstancesPerNode` = 10
- **Best R² Score (Accuracy):** 79.49%
- **Best MAE:** 2675
- **Best RMSE:** 8574

**Comparison to Baseline:**
The baseline model, with an R² of 75.25%, used simpler parameters (`numTrees` = 50, `maxDepth` = 10, `minInstancesPerNode` = 10). Tuning raised R² by **4.24 percentage points**, from 75.25% to 79.49%. **Error metrics also dropped**: MAE fell from 3517 to 2675 and RMSE from 9420 to 8574.
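
The improvements quoted above can be expressed both as absolute and as relative changes. The short sketch below recomputes them from the reported numbers; the `relative_drop` helper is an illustrative name.

```python
baseline = {"r2": 75.25, "mae": 3517, "rmse": 9420}
tuned = {"r2": 79.49, "mae": 2675, "rmse": 8574}

def relative_drop(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100

# R² is already a percentage, so its change is in percentage points
r2_gain_points = tuned["r2"] - baseline["r2"]
mae_drop = relative_drop(baseline["mae"], tuned["mae"])
rmse_drop = relative_drop(baseline["rmse"], tuned["rmse"])

print(f"R² gain: {r2_gain_points:.2f} percentage points")
print(f"MAE reduction: {mae_drop:.1f}%")
print(f"RMSE reduction: {rmse_drop:.1f}%")
```

The R² gain is 4.24 percentage points, while MAE and RMSE fall by roughly 24% and 9% respectively.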



---



# **Running with Best Parameters**

### **Scaling Up with Optimized Parameters (600k Rows):**
Once the best parameters were identified, I applied them to a larger 600k-row sample (20% of the data) to test scalability. This step let me verify the performance improvement and confirm that the chosen parameters remain effective on the larger sample.

**Best Parameters:** `numTrees` = 100, `maxDepth` = 20, `minInstancesPerNode` = 10

## **600k records**

In [None]:
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.sql.functions import log
import pyspark.sql.functions as F
import time  # To measure overall runtime

# Ignore warnings
warnings.filterwarnings('ignore')

# Start tracking overall runtime
start_time = time.time()

with tqdm(total=7, desc="Processing and Training") as pbar:

    df_sample = df.sample(fraction=0.2, seed=42)  # Randomly sample ~600k records (20% of the data)
    pbar.update(1)

    # Removing rows where 'price' is <= 0 (to avoid issues with log transformation)
    df_sample = df_sample.filter(F.col("price") > 0)

    # Log transforming the target variable
    df_sample = df_sample.withColumn("log_price", log("price"))
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn("franchise_dealer", F.col("franchise_dealer").cast("int"))

    # Assembling features (to ensure all columns used in 'VectorAssembler' are numeric)
    num_columns = [col for col in df_sample.columns if col not in ['price', 'log_price'] + cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline (scale the assembled feature vectors)
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    try:
        pipeline_model = pipeline.fit(df_sample)
        df_sample = pipeline_model.transform(df_sample)
        pbar.update(1)
    except Exception as e:
        print(f"Error during pipeline fit: {e}")
        pbar.update(1)

    # Splitting the data into training and test sets
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

    # Defining the RandomForestRegressor model
    rf = RandomForestRegressor(
      featuresCol="scaled_features",
      labelCol="log_price",
      numTrees=100,
      maxDepth=20,
      minInstancesPerNode=10,
      seed=42
    )

    # Fitting the model to the training data
    try:
        model = rf.fit(train_df)
        pbar.update(1)
    except Exception as e:
        print(f"Error during model training: {e}")
        pbar.update(1)

# Making predictions
print("Making predictions...")
try:
    predictions = model.transform(test_df)
    predictions = predictions.withColumn("exp_prediction", F.exp("prediction"))
except Exception as e:
    print(f"Error during prediction: {e}")

# Evaluating the model
print("Evaluating the model...")
try:
    evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="r2")
    r2 = evaluator.evaluate(predictions)
    print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")
except Exception as e:
    print(f"Error during evaluation: {e}")

# Displaying the results
print(f"\n\nTrain size: {train_df.count():,} samples")
print(f"Test size: {test_df.count():,} samples")

# Calculate total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing and Training: 100%|██████████| 7/7 [8:22:00<00:00, 4302.89s/it]
Making predictions...
Evaluating the model...


R-Squared Score (Accuracy): 87.92%

Train size: 480,411 samples
Test size: 120,366 samples

Overall runtime: 1134 minutes.


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="exp_prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")

Additional Metrics:
Mean Absolute Error: 3442
Mean Squared Error: 44980420
Root Mean Squared Error: 6543




---



## **Comparison Before and After Training with the Best Hyperparameters**

### <font color='orange'>**Before**</font>
**Old parameters:** `{'numTrees': 50, 'maxDepth': 10, 'minInstancesPerNode': 10}`
<br></br>
R-Squared Score (Accuracy): ***86.35%***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 3488

Root Mean Squared Error: 6707



### <font color='yellow'>**After**</font>
**Best parameters:** `{'numTrees': 100, 'maxDepth': 20, 'minInstancesPerNode': 10}`
<br></br>
R-Squared Score (Accuracy): ***87.92%***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 3442

Root Mean Squared Error: 6543