## **In this notebook (Decision Tree)**
I followed a structured process to train and tune the `Decision Tree regression model` efficiently. First, I created a baseline model using a `600k-row subset` with a specific set of parameters. This initial run served as a benchmark for measuring how subsequent hyperparameter tuning affected performance metrics such as R², MAE, and RMSE.

Each run on 600k rows took at least 1.25 hours, and 2-4 times longer with more demanding settings such as a higher number of bins, greater depth, and higher minimum-count settings. Even so, the Decision Tree was computationally efficient enough for me to explore 8 different parameter combinations.

By systematically adjusting each parameter, I aimed to identify the best settings that would enhance the model's accuracy and minimize error rates.

In [None]:
import importlib
import subprocess
import sys
import gc

def check_and_install_package(package_name, version=None):
    try:
        importlib.import_module(package_name)
        print(f"\n{package_name} is already installed.")
    except ImportError:
        print(f"\n{package_name} is NOT installed. Installing now...")
        if version:
            subprocess.check_call([sys.executable, "-m", "pip", "install", f"{package_name}=={version}"])
        else:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"{package_name} installation completed.")

# List of packages to check along with specific versions if necessary
packages = [
    {"name": "tqdm", "version": None},
    {"name": "pyspark", "version": "3.5.2"},
    {"name": "gdown", "version": None},
    {"name": "numpy", "version": "1.23.5"}
]

# Checking and installing the packages
for package in packages:
    check_and_install_package(package["name"], package["version"])



tqdm is already installed.

pyspark is NOT installed. Installing now...
pyspark installation completed.

gdown is already installed.

numpy is already installed.
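
The helper above reports a package as installed as soon as it can be imported, so a pinned version such as `pyspark==3.5.2` is only enforced when the package is missing. Below is a minimal sketch of a version-aware check using only the standard library. The names `installed_version` and `needs_install` are hypothetical and are not part of the notebook.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


def needs_install(distribution_name, required_version=None):
    """True when the package is missing or its version differs from the pin."""
    current = installed_version(distribution_name)
    if current is None:
        return True
    return required_version is not None and current != required_version


# Example: report which of the notebook's pinned packages would need (re)installing
for name, pin in [("pyspark", "3.5.2"), ("numpy", "1.23.5")]:
    print(name, "needs install:", needs_install(name, pin))
```

A check like this could be called before `pip install` so that an already-installed but mismatched version is not silently accepted.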


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DecisionTreeModel") \
    .config("spark.driver.memory", "120g") \
    .config("spark.executor.memory", "120g") \
    .config("spark.driver.maxResultSize", "40g") \
    .config("spark.executor.memoryOverhead", "40g") \
    .config("spark.executor.cores", "5") \
    .config("spark.kryoserializer.buffer.max", "2047m") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.sql.shuffle.partitions", "400") \
    .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4") \
    .getOrCreate()

# Verifying Spark session creation
print(f"Spark session started with version: {spark.version}")

Spark session started with version: 3.5.2


In [None]:
!cp '/content/drive/MyDrive/Big Data Analytics - Project/Datasets/Feature_Engineered_DF.parquet' /content/

output_path = '/content/Feature_Engineered_DF.parquet'
df = spark.read.parquet(output_path)
print("The Feature Engineered DataFrame has been loaded successfully.")


The Feature Engineered DataFrame has been loaded successfully.


In [None]:
# Printing the shape of the DataFrame
total_rows = df.count()
total_columns = len(df.columns)

print(f"The shape of the loaded DataFrame is: ({total_rows}, {total_columns})")

The shape of the loaded DataFrame is: (3000040, 47)


In [None]:
# Calculating the average price
avg_price = df.agg({"price": "avg"}).collect()[0][0]
print(f"Average price of a car: {round(avg_price)}")

Average price of a car: 29933


In [None]:
import pandas as pd
from IPython.display import display
import pyspark.sql.functions as F

# Converting the Spark DataFrame to a Pandas DataFrame and displaying 5 random rows with all columns
pd.set_option('display.max_columns', None)
pandas_df = df.orderBy(F.rand()).limit(5).toPandas()
display(pandas_df)


Unnamed: 0,fuel_type,body_type,city,city_fuel_economy,days_in_market,dealer_zip,engine_displacement,engine_type,exterior_color,franchise_dealer,fuel_tank_volume,height,highway_fuel_economy,horsepower,interior_color,is_new,latitude,length,listing_color,longitude,make_name,maximum_seating,model_name,price,savings_amount,seller_rating,sp_name,torque,transmission,transmission_display,wheel_system_display,wheelbase,width,manufactured_year,combined_fuel_economy,legroom,log_mileage,major_options_count,hp_x_engine_disp,hp_x_torque,listed_day,listed_month,listed_year,age,resale_value_score,maintenance_cost,luxury_score
0,Biodiesel,Pickup Truck,Rector,22.690001,30,72461,6600.0,V8,White,True,36.0,79.8,29.469999,445.0,Other,True,36.258999,250.1,WHITE,-90.287498,GMC,5.0,Sierra 2500HD,74770.0,0,5.0,Glen Sain Motor Sales,265.22,A,Automatic,Four-Wheel Drive,158.9,81.9,2020,26.08,87.9,1.95,0,6.26,2e-05,12,8,2020,0,34,46,32
1,Gasoline,Pickup Truck,Naperville,16.0,190,60540,6200.0,V8,White,True,24.0,75.5,22.0,420.0,Black,True,41.7728,231.7,WHITE,-88.185898,Chevrolet,6.0,Silverado 1500,49719.0,0,4.608696,Chevrolet of Naperville,383.0,A,Automatic,Four-Wheel Drive,147.4,81.2,2020,19.0,87.9,1.39,15,4.87,2.12459,4,3,2020,0,30,46,37
2,Gasoline,SUV / Crossover,Houston,22.0,254,77090,2000.0,I4,White,True,17.1,65.7,28.0,272.0,Other,True,30.0119,186.8,WHITE,-95.446297,Acura,5.0,RDX,42025.0,0,4.391304,Team Gillman Acura,280.0,A,Automatic,Front-Wheel Drive,108.3,74.8,2020,25.0,80.0,8.91,0,-0.2,0.03774,1,1,2020,0,22,40,32
3,Gasoline,Sedan,Covington,22.0,4,70433,1600.0,I4,Red,True,14.0,56.5,30.0,201.0,Black,False,30.431999,179.9,RED,-90.089996,Hyundai,5.0,Elantra,16500.0,274,4.540984,Honda of Covington,195.0,M,6-Speed Manual,Front-Wheel Drive,106.3,70.9,2017,26.0,77.9,10.28,6,0.54,0.34232,7,9,2020,3,26,34,29
4,Gasoline,SUV / Crossover,Platteville,26.0,319,53818,1500.0,I3,Other,True,14.7,66.1,31.0,180.0,Gray,True,42.7262,180.5,UNKNOWN,-90.482399,Ford,5.0,Escape,30485.0,0,4.833333,"Pioneer Ford Sales, Ltd.",265.22,A,8-Speed Automatic,All-Wheel Drive,106.7,85.6,2019,28.5,83.1,1.1,1,0.84,-1e-05,27,10,2019,0,28,39,33




---



# **Decision Tree Regressor**

### **Initial Training on a Subset (600k Rows):**

After feature engineering, I began by training a Decision Tree Regressor on a subset of 600k rows with a fixed set of parameters to establish a baseline model. This baseline is the comparison point for every hyperparameter tuning experiment, showing whether each change raises or lowers the metrics. The 600k subset is used specifically for benchmarking improvements.

In [None]:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Starting to track overall runtime
start_time = time.time()

with tqdm(total=6, desc="Processing and Training") as pbar:

    df_sample = df.sample(fraction=0.2, seed=42)  # Randomly sample ~20% of the rows (about 600k)
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn("franchise_dealer", F.col("franchise_dealer").cast("int"))

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

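    # Creating and applying the pipeline
    # Note: the pipeline is fitted on the full sample before the train/test split,
    # so the indexers and scaler also see rows that later land in the test set
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)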
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

    # Training Decision Tree Regressor model
    dt = DecisionTreeRegressor(
        featuresCol="scaled_features",
        labelCol="price",
        maxDepth=15,
        maxBins=128,
        minInstancesPerNode=5,
        minInfoGain=0.01,
        seed=42
    )

    model = dt.fit(train_df)
    pbar.update(1)


# Making predictions
predictions = model.transform(test_df)

# Evaluating the model
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

# Displaying results
print(f"\nTrain size: {train_df.count():,} samples")
print(f"Test size: {test_df.count():,} samples")

# Multiplying R-Squared by 100 for percentage calculation
print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing and Training: 100%|██████████| 6/6 [1:06:35<00:00, 665.95s/it]

Train size: 480,411 samples
Test size: 120,366 samples


R-Squared Score (Accuracy): 88.38%


Overall runtime: 77 minutes.
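
The train and test sizes above do not add up to exactly 600,000 because Spark's `sample` and `randomSplit` decide row by row at random, so the fraction and the split weights are only met approximately. A small sanity check of the reported counts, using only numbers printed in this notebook:

```python
total_rows = 3_000_040      # row count reported when the DataFrame was loaded
sample_fraction = 0.2       # fraction passed to df.sample
train_share = 0.8           # first weight passed to randomSplit

# Sizes one would expect if the fractions were met exactly
expected_sample = total_rows * sample_fraction
expected_train = expected_sample * train_share

# Sizes printed by the baseline cell
observed_train, observed_test = 480_411, 120_366
observed_sample = observed_train + observed_test

print(f"expected sample ~ {expected_sample:,.0f}, observed {observed_sample:,}")
print(f"expected train  ~ {expected_train:,.0f}, observed {observed_train:,}")
print(f"observed train share: {observed_train / observed_sample:.4f}")
```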


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)


print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")

Additional Metrics:
Mean Absolute Error: 3161
Mean Squared Error: 38316567
Root Mean Squared Error: 6190
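
As a quick consistency check, RMSE is the square root of MSE, and both error metrics can be read relative to the average car price reported earlier (29,933). The sketch below uses only values printed in this notebook:

```python
import math

mse = 38_316_567       # Mean Squared Error printed above
rmse_reported = 6190   # Root Mean Squared Error printed above
mae_reported = 3161    # Mean Absolute Error printed above
avg_price = 29_933     # average price printed earlier in the notebook

# RMSE should equal sqrt(MSE) up to rounding
rmse_from_mse = math.sqrt(mse)
print(f"sqrt(MSE) = {rmse_from_mse:.2f} (reported RMSE: {rmse_reported})")

# Errors expressed as a share of the average price
relative_mae = mae_reported / avg_price
relative_rmse = rmse_reported / avg_price
print(f"MAE  / average price = {relative_mae:.1%}")
print(f"RMSE / average price = {relative_rmse:.1%}")
```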




---



## **Hyperparameter Tuning**

### **Hyperparameter Tuning on 600k Rows:**

Once I had established baseline metrics, I proceeded with hyperparameter tuning on the same subset. Training on 600k rows with different parameter combinations let me evaluate the impact of each hyperparameter. This step was crucial for narrowing down the most promising configurations (higher R² and lower RMSE/MAE).
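
To make the size of the search concrete, here is a plain-Python sketch of how a grid of three parameters with two values each expands into eight combinations. The values are the ones that appear in the recorded training log of this notebook, and the dictionary name `grid` is only illustrative.

```python
from itertools import product

# Candidate values per hyperparameter (taken from the recorded training log)
grid = {
    "maxDepth": [10, 15],
    "maxBins": [64, 128],
    "minInstancesPerNode": [5, 10],
}

# Cartesian product of all value lists, one dict per combination
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

for combo in combinations:
    print(combo)
print(f"{len(combinations)} combinations")
```

Each extra value for one parameter multiplies the number of runs, which matters when a single run takes over an hour.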

In [None]:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Starting to track overall runtime
start_time = time.time()

with tqdm(total=6, desc="Processing and Training") as pbar:

    # Sampling the data (no repartitioning is applied here)
    df_sample = df.sample(fraction=0.2, seed=42)  # Randomly sample ~20% of the rows (about 600k)
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(
            inputCol=col_name,
            outputCol=f"{col_name}_indexed",
            handleInvalid="keep")

        encoder = OneHotEncoder(
            inputCol=f"{col_name}_indexed",
            outputCol=f"{col_name}_encoded")

        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn(
        "franchise_dealer",
        F.col("franchise_dealer").cast("int"))

    # Assembling the features
    num_columns = [
        col for col in df_sample.columns
        if col != 'price' and col not in cat_columns]

    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns

    assembler = VectorAssembler(
        inputCols=feature_columns,
        outputCol="features")

    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(
        inputCol="features",
        outputCol="scaled_features",
        withMean=True,
        withStd=True)

    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

    train_df.cache()

    # Defining the Decision Tree Regressor model
    dt = DecisionTreeRegressor(
        featuresCol="scaled_features",
        labelCol="price"
    )

    # Creating a ParamGridBuilder for hyperparameter tuning
    # (2 x 2 x 2 = 8 combinations; values match the training log printed by this cell)
    param_grid = ParamGridBuilder() \
        .addGrid(dt.maxDepth, [10, 15]) \
        .addGrid(dt.maxBins, [64, 128]) \
        .addGrid(dt.minInstancesPerNode, [5, 10]) \
        .build()

    # Defining evaluators for each metric
    r2_evaluator = RegressionEvaluator(
        labelCol="price",
        predictionCol="prediction",
        metricName="r2")

    mae_evaluator = RegressionEvaluator(
        labelCol="price",
        predictionCol="prediction",
        metricName="mae")

    rmse_evaluator = RegressionEvaluator(
        labelCol="price",
        predictionCol="prediction",
        metricName="rmse")

    pbar.update(1)

# Initializing best scores and parameters
best_r2 = -float("inf")
best_mae = float("inf")
best_rmse = float("inf")
best_params_r2 = None
best_params_mae = None
best_params_rmse = None

# Manually iterating over each parameter combination and evaluating metrics
for params in param_grid:

    # Extracting the parameter names and values
    param_values = {param.name: value for param, value in params.items()}

    print(f"\nTraining model with parameters: {param_values}")

    # Using copy to apply parameters
    model = dt.copy(params).fit(train_df)

    # Making predictions on the test data
    predictions = model.transform(test_df)

    # Evaluating metrics
    r2 = r2_evaluator.evaluate(predictions)
    mae = mae_evaluator.evaluate(predictions)
    rmse = rmse_evaluator.evaluate(predictions)

    # Printing the metrics for this combination
    print(f"R² (Accuracy): {r2 * 100:.2f}%")
    print(f"MAE: {mae:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print("-" * 40)

    # Tracking the best scores and corresponding parameters
    if r2 > best_r2:
        best_r2 = r2
        best_params_r2 = param_values

    if mae < best_mae:
        best_mae = mae
        best_params_mae = param_values

    if rmse < best_rmse:
        best_rmse = rmse
        best_params_rmse = param_values

# Printing the best model and its corresponding parameters
print(f"Best R² (Accuracy): {best_r2 * 100:.2f}% with parameters: {best_params_r2}")
print(f"Best MAE: {best_mae:.2f} with parameters: {best_params_mae}")
print(f"Best RMSE: {best_rmse:.2f} with parameters: {best_params_rmse}")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing and Training: 100%|██████████| 6/6 [00:21<00:00,  3.60s/it]

Training model with parameters: {'maxDepth': 10, 'maxBins': 64, 'minInstancesPerNode': 5}
R² (Accuracy): 80.72%
MAE: 4209.52
RMSE: 7899.94
----------------------------------------
Training model with parameters: {'maxDepth': 10, 'maxBins': 64, 'minInstancesPerNode': 10}
R² (Accuracy): 81.64%
MAE: 4183.13
RMSE: 7451.94
----------------------------------------
Training model with parameters: {'maxDepth': 10, 'maxBins': 128, 'minInstancesPerNode': 5}
R² (Accuracy): 81.20%
MAE: 4171.91
RMSE: 7042.25
----------------------------------------
Training model with parameters: {'maxDepth': 10, 'maxBins': 128, 'minInstancesPerNode': 10}
R² (Accuracy): 83.52%
MAE: 4130.43
RMSE: 6833.63
----------------------------------------
Training model with parameters: {'maxDepth': 15, 'maxBins': 64, 'minInstancesPerNode': 5}
R² (Accuracy): 85.08%
MAE: 3222.04
RMSE: 6902.65
----------------------------------------
Training model with para
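
The log above is cut off, so the overall best combination cannot be read from it. As an illustration of the selection logic only, the sketch below picks the best entry per metric among the five complete results shown. The helper name `best_by` is hypothetical.

```python
# The five complete entries from the training log above (R² in percent)
results = [
    {"maxDepth": 10, "maxBins": 64, "minInstancesPerNode": 5, "r2": 80.72, "mae": 4209.52, "rmse": 7899.94},
    {"maxDepth": 10, "maxBins": 64, "minInstancesPerNode": 10, "r2": 81.64, "mae": 4183.13, "rmse": 7451.94},
    {"maxDepth": 10, "maxBins": 128, "minInstancesPerNode": 5, "r2": 81.20, "mae": 4171.91, "rmse": 7042.25},
    {"maxDepth": 10, "maxBins": 128, "minInstancesPerNode": 10, "r2": 83.52, "mae": 4130.43, "rmse": 6833.63},
    {"maxDepth": 15, "maxBins": 64, "minInstancesPerNode": 5, "r2": 85.08, "mae": 3222.04, "rmse": 6902.65},
]


def best_by(rows, metric, higher_is_better):
    """Return the row with the best value for the given metric."""
    chooser = max if higher_is_better else min
    return chooser(rows, key=lambda row: row[metric])


best_r2 = best_by(results, "r2", higher_is_better=True)
best_mae = best_by(results, "mae", higher_is_better=False)
best_rmse = best_by(results, "rmse", higher_is_better=False)

for label, row in [("R²", best_r2), ("MAE", best_mae), ("RMSE", best_rmse)]:
    print(label, "->", row)
```

Among these five entries the metrics do not agree on a single winner, which is why the notebook tracks the best parameters for each metric separately.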



---



## **Comparison before and after training with `Best Hyper Parameters`**

### <font color='orange'>**Before**</font>

**Old parameters:** `{'maxDepth': 15, 'maxBins': 128, 'minInstancesPerNode': 5, 'minInfoGain': 0.01}`
<br></br>
R-Squared Score (Accuracy): ***88.38 %***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 3161

Root Mean Squared Error: 6190



### <font color='yellow'>**After**</font>

**Best parameters:** `{'maxDepth': 15, 'maxBins': 128, 'minInstancesPerNode': 10, 'minInfoGain': 0.01}`
<br></br>
R-Squared Score (Accuracy): ***88.90 %***
<br></br>
**Additional Metrics:**

Mean Absolute Error: 3232

Root Mean Squared Error: 6577
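
The two configurations differ only in `minInstancesPerNode` (5 versus 10). A short sketch of the change in R² and MAE between them, using the figures listed above, shows that the two metrics move in opposite directions:

```python
# Figures listed in the Before/After comparison above
before = {"r2_pct": 88.38, "mae": 3161}
after = {"r2_pct": 88.90, "mae": 3232}

# Absolute change in R² (percentage points) and in MAE
r2_change = after["r2_pct"] - before["r2_pct"]
mae_change = after["mae"] - before["mae"]

# MAE change relative to the baseline MAE
mae_change_pct = 100 * mae_change / before["mae"]

print(f"R² change:  {r2_change:+.2f} percentage points")
print(f"MAE change: {mae_change:+d} ({mae_change_pct:+.2f}%)")
```

A higher R² alongside a slightly higher MAE suggests the tuned model reduces large errors more than typical ones, so the choice between the two settings depends on which metric matters more for the use case.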