## **In this notebook (Linear Regression)**
I followed a step-by-step process to train and tune the model efficiently. First, I trained a baseline model on a 100k-row subset with default parameters, which gave me a reference point for measuring the effect of hyperparameter tuning. Next, I tried different parameter settings on the same 100k-row subset to identify the best configuration. Finally, I applied the best parameters to a larger 600k-row sample (about 20% of the roughly 3 million rows in the dataset) to check how well they held up at scale.
<br>

This approach was necessary because a single run of `Linear Regression` on 600k rows took around 17 hours, making it impractical to evaluate all 8 parameter combinations directly at that scale. By tuning on 100k rows, I could find the best parameters and then apply them to the 600k-row sample for final testing.
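As a rough illustration of how the two sample sizes map to `sample()` fractions, the sketch below derives them from the row count reported later in this notebook (3,000,040). The helper name `sample_fraction` and the constant `TOTAL_ROWS` are illustrative and not part of the original code.

```python
# Total row count as printed by df.count() further down in this notebook.
TOTAL_ROWS = 3_000_040


def sample_fraction(target_rows: int, total_rows: int = TOTAL_ROWS) -> float:
    """Return the fraction that yields roughly `target_rows` rows when sampling."""
    return target_rows / total_rows


# Fractions for the tuning subset (~100k rows) and the final run (~600k rows).
tuning_fraction = round(sample_fraction(100_000), 3)
final_fraction = round(sample_fraction(600_000), 3)

print(tuning_fraction, final_fraction)  # 0.033 0.2
```

Because Spark's `sample()` draws rows probabilistically, the resulting counts are only approximately 100k and 600k, which is consistent with the train and test sizes printed in the outputs.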

In [None]:
import importlib
import subprocess
import sys
import gc

def check_and_install_package(package_name, version=None):
    try:
        importlib.import_module(package_name)
        print(f"\n{package_name} is already installed.")
    except ImportError:
        print(f"\n{package_name} is NOT installed. Installing now...")
        if version:
            subprocess.check_call([sys.executable, "-m", "pip", "install", f"{package_name}=={version}"])
        else:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"{package_name} installation completed.")

# List of packages to check along with specific versions if necessary
packages = [
    {"name": "tqdm", "version": None},
    {"name": "pyspark", "version": "3.5.2"},
    {"name": "gdown", "version": None},
    {"name": "numpy", "version": "1.23.5"}
]

# Checking and installing the packages
for package in packages:
    check_and_install_package(package["name"], package["version"])


tqdm is already installed.

pyspark is already installed.

gdown is already installed.

numpy is already installed.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LinearRegressor") \
    .config("spark.driver.memory", "24g") \
    .config("spark.executor.memory", "24g") \
    .config("spark.driver.maxResultSize", "8g") \
    .config("spark.executor.memoryOverhead", "12g") \
    .config("spark.executor.cores", "5") \
    .config("spark.kryoserializer.buffer.max", "2047m") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4") \
    .getOrCreate()

# Verifying Spark session creation
print(f"Spark session started with version: {spark.version}")


Spark session started with version: 3.5.3


In [None]:
!cp '/content/drive/MyDrive/Big Data Analytics - Project/Datasets/Feature_Engineered_DF.parquet' /content/

output_path = '/content/Feature_Engineered_DF.parquet'
df = spark.read.parquet(output_path)
print("The Feature Engineered DataFrame has been loaded successfully.")


The Feature Engineered DataFrame has been loaded successfully.


In [None]:
# Printing the shape of the DataFrame
total_rows = df.count()
total_columns = len(df.columns)

print(f"The shape of the loaded DataFrame is: ({total_rows}, {total_columns})")

The shape of the loaded DataFrame is: (3000040, 47)


In [None]:
# Calculating the average price
avg_price = df.agg({"price": "avg"}).collect()[0][0]
print(f"Average price of a car: {round(avg_price)}")

Average price of a car: 29933


In [None]:
import pandas as pd
from IPython.display import display
import pyspark.sql.functions as F

# Converting the Spark DataFrame to a Pandas DataFrame and displaying 5 random rows with all columns
pd.set_option('display.max_columns', None)
pandas_df = df.orderBy(F.rand()).limit(5).toPandas()
display(pandas_df)


Unnamed: 0,fuel_type,body_type,city,city_fuel_economy,days_in_market,dealer_zip,engine_displacement,engine_type,exterior_color,franchise_dealer,fuel_tank_volume,height,highway_fuel_economy,horsepower,interior_color,is_new,latitude,length,listing_color,longitude,make_name,maximum_seating,model_name,price,savings_amount,seller_rating,sp_name,torque,transmission,transmission_display,wheel_system_display,wheelbase,width,manufactured_year,combined_fuel_economy,legroom,log_mileage,major_options_count,hp_x_engine_disp,hp_x_torque,listed_day,listed_month,listed_year,age,resale_value_score,maintenance_cost,luxury_score
0,Gasoline,Sedan,Westmont,22.690001,146,60559,4000.0,V8,Other,True,21.1,56.4,29.469999,677.0,Other,True,41.8097,198.8,UNKNOWN,-87.970497,Porsche,4.0,Panamera E-Hybrid,224240.0,0,4.375,Napleton Westmont Porsche,265.22,A,8-Speed Automatic,All-Wheel Drive,116.1,85.2,2020,26.08,80.16,3.18,0,3.94,5e-05,17,4,2020,0,30,49,36
1,Gasoline,Pickup Truck,Roanoke,13.0,61,24012,5700.0,V8,Black,False,26.4,76.2,17.0,381.0,Black,False,37.335701,228.9,BLACK,-79.859001,Toyota,6.0,Tundra,34995.0,0,3.666667,Blue Ridge Auto Sales Inc.,401.0,A,Automatic,Four-Wheel Drive,145.7,79.9,2014,15.0,84.8,11.03,3,3.19,1.89523,11,7,2020,6,15,38,28
2,Gasoline,SUV / Crossover,Costa Mesa,13.0,54,92626,5600.0,V8,Black,False,26.0,75.8,18.0,390.0,Black,False,33.687,208.9,BLACK,-117.918999,Nissan,8.0,Armada,30998.0,1337,5.0,CarMax Costa Mesa - Now offering Curbside Pick...,394.0,A,Automatic,Four-Wheel Drive,121.1,79.9,2019,15.5,82.9,10.35,7,3.28,1.91879,20,7,2020,1,24,44,32
3,Gasoline,Pickup Truck,Conyers,20.0,217,30013,3500.0,V6,White,True,26.0,75.6,26.0,375.0,Black,True,33.645802,231.9,WHITE,-83.992203,Ford,6.0,F-150,35799.0,0,4.314286,Courtesy Ford Mitsubishi,400.0,A,Automatic,4X2,145.0,96.8,2020,23.0,87.5,1.1,9,0.62,1.79666,6,2,2020,0,28,43,36
4,Gasoline,SUV / Crossover,Milwaukee,20.0,72,53227,2000.0,I4,Black,True,19.799999,65.2,27.0,220.0,Black,False,43.000301,182.6,BLACK,-88.046898,Audi,5.0,Q5,23000.0,2317,4.255814,International Autos West Allis,258.0,A,8-Speed Automatic,All-Wheel Drive,110.5,82.2,2017,23.5,78.4,11.02,9,0.23,0.02084,30,6,2020,3,21,37,37


# **Linear Regression**

### **Baseline Model**  [on a 100k-row subset] :

I begin by training Linear Regression on a subset of 100k rows with default parameters to establish a baseline model. This baseline serves as the comparison point for every hyperparameter tuning experiment, showing whether a given configuration improves or degrades the metrics. The initial 100k subset is thus used specifically for benchmarking improvements.

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Starting to track the overall runtime
start_time = time.time()

with tqdm(total=6, desc="Processing and Training") as pbar:

    df_sample = df.sample(fraction=0.033, seed=42)  # Randomly sampling 100k records
    pbar.update(1)

    # Handling categorical columns
    # DataFrame.dtypes is a list of (name, type) tuples, so it is iterated directly
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCols=[f"{col_name}_indexed"], outputCols=[f"{col_name}_encoded"])
        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn("franchise_dealer", F.col("franchise_dealer").cast("int"))

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

    # Using Linear Regression
    lr = LinearRegression(featuresCol="scaled_features", labelCol="price")
    model = lr.fit(train_df)
    pbar.update(1)

# Making predictions
print("Making predictions...")
predictions = model.transform(test_df)

# Evaluating the model
print("Evaluating the model...")
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

# Displaying results
print(f"\nTrain size: {train_df.count():,} samples")
print(f"Test size: {test_df.count():,} samples")

# Converting R-Squared to a percentage with two decimal places
print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing and Training: 100%|██████████| 6/6 [07:28<00:00, 74.74s/it]


Making predictions...
Evaluating the model...

Train size: 79,346 samples
Test size: 19,771 samples


R-Squared Score (Accuracy): 79.50%

Overall runtime: 8 minutes.


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")

Additional Metrics:
Mean Absolute Error: 4470
Mean Squared Error: 73492920
Root Mean Squared Error: 8573
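Since RMSE is by definition the square root of MSE, the printed metrics can be sanity-checked against each other. The sketch below does this for the baseline figures above; the helper `rmse_matches_mse` and its one-unit tolerance (to allow for rounding in the printed values) are assumptions made for illustration.

```python
import math


def rmse_matches_mse(mse: float, rmse: float, tol: float = 1.0) -> bool:
    """Check that a reported RMSE agrees with sqrt(MSE) within `tol`."""
    return abs(math.sqrt(mse) - rmse) <= tol


# Baseline figures copied from the output above (both were rounded when printed).
baseline_mse = 73_492_920
baseline_rmse = 8_573

recomputed_rmse = round(math.sqrt(baseline_mse))
print(recomputed_rmse)                                # 8573
print(rmse_matches_mse(baseline_mse, baseline_rmse))  # True
```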




---



# **Hyperparameter Tuning**

### **Hyperparameter Tuning on 100k Rows:**

Once I established baseline metrics, I proceeded with hyperparameter tuning on the same subset. Training on 100k rows with different parameter combinations let me measure the impact of each hyperparameter at a manageable cost. This step was crucial for narrowing down the most promising configurations while keeping the grid small, which limits the risk of overfitting through excessive tuning.
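To make the size of the search concrete, the following plain-Python sketch enumerates the same value lists that the tuning cell passes to `ParamGridBuilder`. It uses `itertools.product` as a stand-in for illustration only and makes no claim about the order in which Spark itself yields the combinations.

```python
from itertools import product

# Candidate values, copied from the ParamGridBuilder call in the tuning cell.
grid = {
    "regParam": [0.1, 0.5],
    "elasticNetParam": [0.5, 1.0],
    "maxIter": [50, 100],
}

# Cartesian product of all value lists: one dict per model to train.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 8
for combo in combinations:
    print(combo)
```

With two candidate values for each of three parameters, the grid contains 2 x 2 x 2 = 8 models, each of which needs its own full training run.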

### **for 100k records**

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Starting to track the overall runtime
start_time = time.time()

with tqdm(total=6, desc="Processing and Training") as pbar:

    # Randomly sampling roughly 100k records (same fraction and seed as the baseline)
    df_sample = df.sample(fraction=0.033, seed=42)
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(
            inputCol=col_name,
            outputCol=f"{col_name}_indexed",
            handleInvalid="keep")

        encoder = OneHotEncoder(
            inputCol=f"{col_name}_indexed",
            outputCol=f"{col_name}_encoded")

        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn(
        "franchise_dealer",
        F.col("franchise_dealer").cast("int"))

    # Assembling features
    num_columns = [
        col for col in df_sample.columns
        if col != 'price' and col not in cat_columns]

    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns

    assembler = VectorAssembler(
        inputCols=feature_columns,
        outputCol="features")

    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(
        inputCol="features",
        outputCol="scaled_features",
        withMean=True,
        withStd=True)

    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

    train_df.cache()

    # Defining the Linear Regression model
    lr = LinearRegression(
        featuresCol="scaled_features",
        labelCol="price"
    )

    # Creating a ParamGridBuilder for hyperparameter tuning
    param_grid = ParamGridBuilder() \
        .addGrid(lr.regParam, [0.1, 0.5]) \
        .addGrid(lr.elasticNetParam, [0.5, 1.0]) \
        .addGrid(lr.maxIter, [50, 100]) \
        .build()

    # Defining evaluators for each metric
    r2_evaluator = RegressionEvaluator(
        labelCol="price",
        predictionCol="prediction",
        metricName="r2")

    mae_evaluator = RegressionEvaluator(
        labelCol="price",
        predictionCol="prediction",
        metricName="mae")

    rmse_evaluator = RegressionEvaluator(
        labelCol="price",
        predictionCol="prediction",
        metricName="rmse")

    pbar.update(1)

# Initializing best scores and parameters
best_r2 = -float("inf")
best_mae = float("inf")
best_rmse = float("inf")
best_params_r2 = None
best_params_mae = None
best_params_rmse = None

# Manually iterating over each parameter combination and evaluating metrics
for params in param_grid:

    # Extracting the parameter names and values
    param_values = {param.name: value for param, value in params.items()}

    print(f"\nTraining model with parameters: {param_values}")

    # Using copy to apply parameters
    model = lr.copy(params).fit(train_df)

    # Making predictions on the test data
    predictions = model.transform(test_df)

    # Evaluating metrics
    r2 = r2_evaluator.evaluate(predictions)
    mae = mae_evaluator.evaluate(predictions)
    rmse = rmse_evaluator.evaluate(predictions)

    # Printing the metrics for this combination
    print(f"R² (Accuracy): {r2 * 100:.2f}%")
    print(f"MAE: {mae:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print("-" * 40)

    # Tracking the best scores and corresponding parameters
    if r2 > best_r2:
        best_r2 = r2
        best_params_r2 = param_values

    if mae < best_mae:
        best_mae = mae
        best_params_mae = param_values

    if rmse < best_rmse:
        best_rmse = rmse
        best_params_rmse = param_values

# Printing the best model and its corresponding parameters
print(f"Best R² (Accuracy): {best_r2 * 100:.2f}% with parameters: {best_params_r2}")
print(f"Best MAE: {best_mae:.2f} with parameters: {best_params_mae}")
print(f"Best RMSE: {best_rmse:.2f} with parameters: {best_params_rmse}")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")


Processing and Training: 100%|██████████| 6/6 [00:09<00:00,  1.50s/it]



Training model with parameters: {'regParam': 0.1, 'elasticNetParam': 0.5, 'maxIter': 50}
R² (Accuracy): 79.36%
MAE: 4685.96
RMSE: 8730.45
----------------------------------------

Training model with parameters: {'regParam': 0.1, 'elasticNetParam': 0.5, 'maxIter': 100}
R² (Accuracy): 79.60%
MAE: 4671.10
RMSE: 8680.27
----------------------------------------

Training model with parameters: {'regParam': 0.1, 'elasticNetParam': 1.0, 'maxIter': 50}
R² (Accuracy): 79.38%
MAE: 4680.76
RMSE: 8727.83
----------------------------------------

Training model with parameters: {'regParam': 0.1, 'elasticNetParam': 1.0, 'maxIter': 100}
R² (Accuracy): 79.60%
MAE: 4671.27
RMSE: 8679.60
----------------------------------------

Training model with parameters: {'regParam': 0.5, 'elasticNetParam': 0.5, 'maxIter': 50}
R² (Accuracy): 79.42%
MAE: 4672.95
RMSE: 8717.34
----------------------------------------

Training model with parameters: {'regParam': 0.5, 'elasticNetParam': 0.5, 'maxIter': 100}
R² (Acc



---



## **Summary of HPT on 100k rows**

For the Linear Regression model, I explored various configurations to maximize accuracy and minimize errors. The primary parameters tuned included `regParam`, `elasticNetParam`, and `maxIter`.
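For context, the roles of `regParam` and `elasticNetParam` can be read off the elastic-net objective that Spark documents for squared-error linear regression. The form below is a sketch with the intercept omitted, writing $\lambda$ for `regParam` and $\alpha$ for `elasticNetParam`; Spark's internal standardization may change the effective scaling of the penalty.

$$
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2 \;+\; \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$

Under this form, $\alpha = 1$ corresponds to a pure L1 (Lasso) penalty, $\alpha = 0$ to a pure L2 (ridge) penalty, and $\lambda$ controls the overall penalty strength. `maxIter` only caps the number of optimizer iterations.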

**Best Configuration and Performance:**
- **Best Parameters:** `regParam` = 0.5, `elasticNetParam` = 1.0, `maxIter` = 100
- **Best R² Score (Accuracy):** 79.74%
- **Best MAE:** 4640.64
- **Best RMSE:** 8650.42

**Comparison to Baseline:**
The baseline Linear Regression model achieved an R² of 79.50% with a runtime of 8 minutes. Hyperparameter tuning **increased the R² slightly to 79.74%**, but the reported MAE rose from 4470 to 4640.64 and the RMSE from 8573 to 8650.42.
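To put these differences on a common scale, the sketch below expresses the R² change in percentage points and the error changes as relative percentages. The helper `pct_change` is illustrative; the numbers are copied from the comparison above.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Metrics on the 100k-row subset, as reported above.
baseline = {"r2": 79.50, "mae": 4470, "rmse": 8573}
tuned = {"r2": 79.74, "mae": 4640.64, "rmse": 8650.42}

r2_gain_points = round(tuned["r2"] - baseline["r2"], 2)
mae_change = round(pct_change(baseline["mae"], tuned["mae"]), 2)
rmse_change = round(pct_change(baseline["rmse"], tuned["rmse"]), 2)

print(f"R² change: {r2_gain_points:+.2f} percentage points")
print(f"MAE change: {mae_change:+.2f}%")
print(f"RMSE change: {rmse_change:+.2f}%")
```

All three changes are small (a fraction of a point for R² and a few percent at most for the errors), so on this subset the tuned and baseline models perform about the same.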



---



# **Running with Best Parameters**

### **Scaling Up with Optimized Parameters (600k Rows):**
Once the best parameters were identified, I applied them to a larger 600k-row sample (20% of the 3,000,040-row dataset) to test scalability. This step allowed me to verify performance improvements and confirm the effectiveness of the chosen parameters on the final, larger sample.

**Best Parameters:** `regParam` = 0.5, `elasticNetParam` = 1.0, `maxIter` = 100

## **600k records**

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
import time
import warnings
from tqdm import tqdm
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Ignore warnings
warnings.filterwarnings('ignore')

# Starting to track the overall runtime
start_time = time.time()

with tqdm(total=6, desc="Processing and Training") as pbar:

    df_sample = df.sample(fraction=0.2, seed=42)  # Randomly sample 600k records of the data
    pbar.update(1)

    # Handling categorical columns
    cat_columns = [field for (field, dtype) in df_sample.dtypes if dtype == "string"]
    stages = []
    for col_name in cat_columns:
        indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed", handleInvalid="keep")
        encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
        stages += [indexer, encoder]
    pbar.update(1)

    # Converting 'franchise_dealer' to numeric
    df_sample = df_sample.withColumn("franchise_dealer", F.col("franchise_dealer").cast("int"))

    # Assembling features
    num_columns = [col for col in df_sample.columns if col != 'price' and col not in cat_columns]
    encoded_columns = [f"{col}_encoded" for col in cat_columns]
    feature_columns = num_columns + encoded_columns
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    stages += [assembler]
    pbar.update(1)

    # Adding scaling to the pipeline
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    stages += [scaler]

    # Creating and applying the pipeline
    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(df_sample)
    df_sample = pipeline_model.transform(df_sample)
    pbar.update(1)

    # Splitting the data
    train_df, test_df = df_sample.randomSplit([0.8, 0.2], seed=42)
    pbar.update(1)

    # Using Linear Regression
    lr = LinearRegression(
        featuresCol="scaled_features",
        labelCol="price",
        regParam=0.5,
        elasticNetParam=1.0,
        maxIter=100
    )

    model = lr.fit(train_df)
    pbar.update(1)

# Making predictions
print("Making predictions...")
predictions = model.transform(test_df)

# Evaluating the model
print("Evaluating the model...")
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

# Displaying results
print(f"\nTrain size: {train_df.count():,} samples")
print(f"Test size: {test_df.count():,} samples")

print(f"\n\nR-Squared Score (Accuracy): {r2 * 100:.2f}%")

# Calculating total runtime
end_time = time.time()
total_runtime = (end_time - start_time) / 60
print(f"\nOverall runtime: {round(total_runtime)} minutes.")

Processing and Training: 100%|██████████| 6/6 [17:03:52<00:00, 10238.69s/it]
Making predictions...
Evaluating the model...

Train size: 480,411 samples
Test size: 120,366 samples


R-Squared Score (Accuracy): 84.41%

Overall runtime: 1296 minutes.


In [None]:
# Calculating additional metrics
mae_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = mae_evaluator.evaluate(predictions)

mse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
mse = mse_evaluator.evaluate(predictions)

rmse_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)

print("Additional Metrics:")
print(f"Mean Absolute Error: {round(mae)}")
print(f"Mean Squared Error: {round(mse)}")
print(f"Root Mean Squared Error: {round(rmse)}")

Additional Metrics:
Mean Absolute Error: 4097
Mean Squared Error: 52079104
Root Mean Squared Error: 7204




---



## **Comparison before and after training with the best hyperparameters**

### <font color='orange'>**Before**</font>
**Old parameters:**

No specific parameters: `lr = LinearRegression(featuresCol="scaled_features", labelCol="price")`
<br>
R-Squared Score (Accuracy): ***84.20 %***
<br>
**Additional Metrics:**

Mean Absolute Error: 4131

Root Mean Squared Error: 7217



### <font color='yellow'>**After**</font>
**Best parameters:** `{'regParam': 0.5, 'elasticNetParam': 1.0, 'maxIter': 100}`
<br>
R-Squared Score (Accuracy): ***84.41 %***
<br>
**Additional Metrics:**

Mean Absolute Error: 4097

Root Mean Squared Error: 7204