### Phase 1: Infrastructure & Data Ingestion (Bronze Layer)

#### Project Setup & Database Initialization



In [0]:
# Import required libraries & functions
from pyspark.sql import functions as f

In [0]:

# Creating the tiered database structure
# The Source of Truth : bronze_db
# The Cleaning Zone : silver_db
# The Insight Hub : gold_db
spark.sql("CREATE DATABASE IF NOT EXISTS bronze_db") 
spark.sql("CREATE DATABASE IF NOT EXISTS silver_db") 
spark.sql("CREATE DATABASE IF NOT EXISTS gold_db")

# Define the cloud storage path for the Amazon Fitness source file
source_path = "/Volumes/workspace/default/pyspark/AmazonProducts.csv"

In [0]:
# Verifying the database creation in the catalog
spark.sql("SHOW DATABASES").filter("databaseName LIKE '%_db%'").show()

+------------+
|databaseName|
+------------+
|   bronze_db|
|     gold_db|
|   silver_db|
+------------+



#### Bronze Layer - Raw Data Ingestion

In [0]:
# Loading the Amazon Fitness dataset into the Bronze Layer
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(source_path)

#### Data Lineage

In [0]:
from pyspark.sql import functions as f
# Adding audit metadata (ingestion time) for tracking
bronze_df = raw_df.withColumn("_ingested_at", f.current_timestamp())

In [0]:
# Save as a Delta Table for high-performance access
bronze_df.write.mode("overwrite").saveAsTable("bronze_db.amazon_fitness_raw")

In [0]:
bronze_df.count()

551585

In [0]:
bronze_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- main_category: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- image: string (nullable = true)
 |-- link: string (nullable = true)
 |-- ratings: string (nullable = true)
 |-- no_of_ratings: string (nullable = true)
 |-- discount_price: string (nullable = true)
 |-- actual_price: string (nullable = true)
 |-- _ingested_at: timestamp (nullable = false)



In [0]:
display(
    bronze_df.filter(f.col("main_category") == "sports & fitness").limit(10)
)

_c0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price,_ingested_at
0,Kore Regular Gym Bag,sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/715B2hG++tL._AC_UL320_.jpg,https://www.amazon.in/Kore-K-REGULAR-GYM-BAG-Regular-Gym-Bag/dp/B09VLK27NF/ref=sr_1_49?qid=1679218203&s=sports&sr=1-49,2.9,3836,,₹59,2026-01-29T21:26:36.910Z
1,"Boldfit Resistance Tube with Foam Handles, Door Anchor for Exercise & Stretching, Suitable in Home & Gym Workout for Men &...",sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/71l0xZ29TfS._AC_UL320_.jpg,https://www.amazon.in/Boldfit-Resistance-Tube-Handles-10KG/dp/B08NDNFGTP/ref=sr_1_50?qid=1679218203&s=sports&sr=1-50,4.0,3018,₹279,₹699,2026-01-29T21:26:36.910Z
2,"Strauss Exercise Latex Resistance Bands, (Set of 5)",sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/71yLuDGjJUL._AC_UL320_.jpg,https://www.amazon.in/Strauss-Latex-Band-Set-5/dp/B07TLVRKKH/ref=sr_1_51_mod_primary_new?qid=1679218203&s=sports&sbo=RZvfv%2F%2FHxDF%2BO5021pAnSA%3D%3D&sr=1-51,4.1,4403,₹489,"₹1,049",2026-01-29T21:26:36.910Z
3,AUXTER Faux Leather 23 Cms Duffle Bag(AUX_GB_LE_Black -SE_0521_Black),sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/71xeudlt5EL._AC_UL320_.jpg,https://www.amazon.in/AUXTER-BLACKY-Duffel-Emboss-Black/dp/B07F2H25NP/ref=sr_1_52?qid=1679218203&s=sports&sr=1-52,4.3,8901,₹376,₹999,2026-01-29T21:26:36.910Z
4,Kore PVC 10-40 Kg Home Gym Set with One 3 Ft Curl and One Pair Dumbbell Rods with Gym Accessories,sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/81XNzjmXi+L._AC_UL320_.jpg,https://www.amazon.in/Kore-PVC-Home-Gym-Accessories/dp/B089DF1PXS/ref=sr_1_53?qid=1679218203&s=sports&sr=1-53,3.6,33078,"₹1,049","₹4,090",2026-01-29T21:26:36.910Z
5,MAXWELL®FIT-Yoga Mat for Women and Men Fitness (SIZE-4mm) with Carrying Strap Extra Thick & Large Excercise Mat for Workou...,sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/61xWyTStjTL._AC_UL320_.jpg,https://www.amazon.in/MAXWELL%C2%AEFIT-Yoga-SIZE-4mm-Carrying-Excercise-Color-BLACK/dp/B0BPXTSCG8/ref=sr_1_54?qid=1679218203&s=sports&sr=1-54,4.0,165,₹320,"₹1,699",2026-01-29T21:26:36.910Z
6,"OJS Skipping Rope for Men and Women Jumping Rope With Adjustable Height Speed Skipping Rope for Exercise, Gym, Sports Fitn...",sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/71V7BaHBcqL._AC_UL320_.jpg,https://www.amazon.in/OJS-Skipping-Jumping-Adjustable-Exercise/dp/B0BP24TYCC/ref=sr_1_55?qid=1679218203&s=sports&sr=1-55,3.9,48,₹99,₹999,2026-01-29T21:26:36.910Z
7,"Kore PVC DM-5kg-Combo 161 Fixed Dumbbells Set and Fitness Kit for Men and Women Whole Body Workout, Multicolor",sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/713zoEM3wlL._AC_UL320_.jpg,https://www.amazon.in/Kore-DM-5kg-Combo-Dumbbells-Fitness-Multicolor/dp/B098MGTF4X/ref=sr_1_56?qid=1679218203&s=sports&sr=1-56,3.8,5158,₹449,₹499,2026-01-29T21:26:36.910Z
8,NIVIA Beast Gym Bag-4 Polyester/Unisex Gym Bags/Shoulder Bag for Men & Women with Separate Shoes Compartment/Carry Gym Acc...,sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/51i9ODagRZL._AC_UL320_.jpg,https://www.amazon.in/Nivia-5183-Polyester-Duffle-Medium/dp/B071KNXQ7Y/ref=sr_1_57?qid=1679218203&s=sports&sr=1-57,4.2,3755,₹528,₹799,2026-01-29T21:26:36.910Z
9,"Starx Plastic Xpvc Fixed Dumbbell Set, Adult 1Kg, Set of 2 (Blue)",sports & fitness,All Exercise & Fitness,https://m.media-amazon.com/images/I/51k8rbleUML._AC_UL320_.jpg,https://www.amazon.in/StarX-XPVC-Plastic-Dumbbell-Adult/dp/B01HD66M8M/ref=sr_1_58?qid=1679218203&s=sports&sr=1-58,3.4,990,₹99,₹590,2026-01-29T21:26:36.910Z


### Phase 2: Data Cleaning & Standardization (Silver Layer)

In [0]:
# Filter for Sports & Fitness domain
silver_df = bronze_df.filter(f.col("main_category") == "sports & fitness")

# Record initial count for comparison
initial_count = silver_df.count()

# Drop non-analytical columns and duplicate rows
silver_df = silver_df.drop("image", "link", "_c0").dropDuplicates()

# Record final count
final_count = silver_df.count()

# Standardize sub_category: replace umbrella/empty/null values with 'Unknown'
# Nulls are handled by isNull(); isin() cannot match None, so it is not listed here
umbrella_cats = ["All Exercise & Fitness", "All Sports, Fitness & Outdoors", ""]
silver_df = silver_df.withColumn(
    "sub_category_standardized",
    f.when(
        f.col("sub_category").isin(umbrella_cats) | f.col("sub_category").isNull(), "Unknown"
    ).otherwise(f.col("sub_category"))
)

# Verification Output
print(f"Initial Records: {initial_count}")
print(f"Records after Deduplication: {final_count}")
print(f"Duplicates Removed: {initial_count - final_count}")
print()
print("Remaining Columns:", silver_df.columns)
print()
print("Distinct standardized sub-categories")
display(silver_df.select("sub_category_standardized").distinct())

Initial Records: 12629
Records after Deduplication: 12496
Duplicates Removed: 133

Remaining Columns: ['name', 'main_category', 'sub_category', 'ratings', 'no_of_ratings', 'discount_price', 'actual_price', '_ingested_at', 'sub_category_standardized']

Distinct standardized sub-categories


sub_category_standardized
Unknown
Badminton
Cycling
Cardio Equipment
Cricket
Camping & Hiking
Football
Fitness Accessories
Running
Strength Training


#### Price Normalization & Data Quality Assurance

In [0]:
# Clean actual_price
silver_df = silver_df.withColumn(
    "actual_price_clean",
    f.when(
        f.length(f.regexp_extract(f.col("actual_price"), r"(\d+[\d,.]*)", 1)) > 0,
        f.regexp_replace(f.regexp_extract(f.col("actual_price"), r"(\d+[\d,.]*)", 1), ",", "")
    ).otherwise(None).cast("float")
)

# Clean discount_price
silver_df = silver_df.withColumn(
    "discount_price_clean",
    f.when(
        f.length(f.regexp_extract(f.col("discount_price"), r"(\d+[\d,.]*)", 1)) > 0,
        f.regexp_replace(f.regexp_extract(f.col("discount_price"), r"(\d+[\d,.]*)", 1), ",", "")
    ).otherwise(None).cast("float")
)

# Audit schema and values
print("Current Schema (Verify types for 'clean' columns)")
silver_df.printSchema()
print("Row Audit: Old String Columns vs. New Float Columns")
silver_df.select(
    "actual_price", "actual_price_clean", 
    "discount_price", "discount_price_clean"
).show(10, truncate=False)

Current Schema (Verify types for 'clean' columns)
root
 |-- name: string (nullable = true)
 |-- main_category: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- ratings: string (nullable = true)
 |-- no_of_ratings: string (nullable = true)
 |-- discount_price: string (nullable = true)
 |-- actual_price: string (nullable = true)
 |-- _ingested_at: timestamp (nullable = false)
 |-- sub_category_standardized: string (nullable = true)
 |-- actual_price_clean: float (nullable = true)
 |-- discount_price_clean: float (nullable = true)

Row Audit: Old String Columns vs. New Float Columns
+------------+------------------+--------------+--------------------+
|actual_price|actual_price_clean|discount_price|discount_price_clean|
+------------+------------------+--------------+--------------------+
|₹7,999      |7999.0            |₹2,499        |2499.0              |
|₹999        |999.0             |₹251.74       |251.74              |
|₹1,999      |1999.0            |₹909 

In [0]:
from pyspark.sql import functions as f

# Clean and safely cast using try_cast
# try_cast handles empty strings by returning NULL instead of crashing
silver_df_cleaned = silver_df.select(
    "name", 
    "main_category", 
    "sub_category",
    "sub_category_standardized",
    
    # Aggressive cleaning for every numerical column
    f.coalesce(f.expr("try_cast(regexp_replace(ratings, '[^0-9.]', '') as double)"), f.lit(0.0)).alias("ratings"),
    f.coalesce(f.expr("try_cast(regexp_replace(no_of_ratings, '[^0-9.]', '') as double)"), f.lit(0.0)).alias("no_of_ratings"),
    f.coalesce(f.expr("try_cast(regexp_replace(actual_price, '[^0-9.]', '') as double)"), f.lit(0.0)).alias("actual_price"),
    f.coalesce(f.expr("try_cast(regexp_replace(discount_price, '[^0-9.]', '') as double)"), f.lit(0.0)).alias("discount_price"),
    "_ingested_at"
)

# Exclude rows whose actual_price was missing (coalesced to 0 above) so the
# correlation only covers products that actually have a listed price.
# Rows with a missing discount_price (also 0) and extreme price values are
# still included here, so the correlation can remain close to 0.
valid_assets_df = silver_df_cleaned.filter(f.col("actual_price") > 0)

# Validation: Correlation Check
corr_val = valid_assets_df.stat.corr("discount_price", "actual_price")
print(f"Validation: Correlation between Discount and Actual Price is {corr_val:.4f}")

# Persist the cleaned data
silver_df_cleaned.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("silver_db.amazon_fitness_clean")

print(f"Silver Layer SUCCESS: {silver_df_cleaned.count()} rows persisted.")

Validation: Correlation between Discount and Actual Price is 0.0164
Silver Layer SUCCESS: 12496 rows persisted.
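
The cleaning expression used above (`regexp_replace` to strip non-numeric characters, `try_cast` to parse, `coalesce` to default to 0.0) can be approximated in plain Python. This is a rough sketch under that assumption; `clean_numeric` is a hypothetical helper, and edge cases may differ from Spark's casting rules.

```python
import re

def clean_numeric(raw):
    """Strip everything except digits and '.', then parse; fall back to 0.0."""
    if raw is None:
        return 0.0
    stripped = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(stripped)
    except ValueError:
        # Empty or malformed strings (for example "" or "1.2.3") become 0.0
        return 0.0

for raw in ["₹1,049", "₹251.74", "", None]:
    print(repr(raw), "->", clean_numeric(raw))
```

Because missing values are replaced with 0.0, a later null count on these columns will report zero nulls. A 0.0 in `ratings` or a price column should therefore be read as "missing", not as a real measurement.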


#### Verifying Data 

In [0]:
display(silver_df_cleaned.limit(15))

name,main_category,sub_category,sub_category_standardized,ratings,no_of_ratings,actual_price,discount_price,_ingested_at
"Gyming World { 7 feet } Solid 28 mm Thickness Barbell | Standard Straight Weight Bar | with 4 Locks, Steel Chrome",sports & fitness,All Exercise & Fitness,Unknown,3.9,17.0,3999.0,1804.0,2026-01-29T21:26:58.039Z
Compression T-Shirt Top Full Sleeve Trainer Fit Multi Sports Lycra Skin Inner Wear,sports & fitness,"All Sports, Fitness & Outdoors",Unknown,3.8,138.0,999.0,399.0,2026-01-29T21:26:58.039Z
"Inditradition Yoga Strap Resistance Band, Latex Rubber, 01 Pc",sports & fitness,All Exercise & Fitness,Unknown,3.9,966.0,400.0,249.0,2026-01-29T21:26:58.039Z
"Nivia Adjustable Hand Gripper 2.0 for Strengthener (10Kg to 40Kg), Training, Forearm Exerciser, Finger Exercise, Power Gri...",sports & fitness,"All Sports, Fitness & Outdoors",Unknown,4.1,540.0,370.0,269.0,2026-01-29T21:26:58.039Z
Klapp Zigma Badminton Set; Pack of Two Badminton Set with 10 Pcs Feather Shuttlecock with Cover,sports & fitness,Badminton,Badminton,3.6,19.0,570.0,349.0,2026-01-29T21:26:58.039Z
"Nivia Powerstrike Badminton Shoe, Green, 11",sports & fitness,Badminton,Badminton,4.0,13.0,1749.0,1610.0,2026-01-29T21:26:58.039Z
"Lining Airforce 80 Lite Badmintonracquet,Ivory/Purple with Free Cover, Material: Carbon Fibre",sports & fitness,Badminton,Badminton,0.0,0.0,5590.0,4082.0,2026-01-29T21:26:58.039Z
Neu Look Gym wear Leggings Ankle Length Workout Tights | Stretchable Sports Leggings | Sports Fitness Yoga Track Pants for...,sports & fitness,"All Sports, Fitness & Outdoors",Unknown,3.6,88.0,1499.0,349.0,2026-01-29T21:26:58.039Z
"Eliteemo LED Badminton Pickleball Net Light, 17Ft Remote Control LED Net Light,16Color change by yourself, Waterproof, A g...",sports & fitness,Badminton,Badminton,3.0,4.0,7429.0,5198.0,2026-01-29T21:26:58.039Z
HEAD HCD-402 Tennis T-Shirt,sports & fitness,All Exercise & Fitness,Unknown,4.2,18.0,549.0,494.0,2026-01-29T21:26:58.039Z


#### Verifying NULL Values in Data

In [0]:
# Count nulls for key columns in silver_df_cleaned
null_counts = silver_df_cleaned.select(
    f.sum(f.col("ratings").isNull().cast("int")).alias("ratings_nulls"),
    f.sum(f.col("no_of_ratings").isNull().cast("int")).alias("no_of_ratings_nulls"),
    f.sum(f.col("actual_price").isNull().cast("int")).alias("actual_price_nulls"),
    f.sum(f.col("discount_price").isNull().cast("int")).alias("discount_price_nulls")
)
display(null_counts)

ratings_nulls,no_of_ratings_nulls,actual_price_nulls,discount_price_nulls
0,0,0,0


In [0]:
silver_df_cleaned.select("sub_category_standardized").distinct().show()

+-------------------------+
|sub_category_standardized|
+-------------------------+
|                  Unknown|
|                Badminton|
|                  Cycling|
|         Cardio Equipment|
|                  Cricket|
|         Camping & Hiking|
|                 Football|
|      Fitness Accessories|
|                  Running|
|        Strength Training|
|                     Yoga|
+-------------------------+



### Phase 3: Feature Selection & Dimensionality Reduction

In [0]:
# Check whether the column carries any information
# (a single distinct value means zero variance)
variance_check = silver_df_cleaned.select("main_category").distinct().count()

if variance_check == 1:
    print("Action: Dropping main_category (Zero Variance detected)")
    silver_df_cleaned = silver_df_cleaned.drop("main_category")

Action: Dropping main_category (Zero Variance detected)


### Phase 4: Machine Learning Preparation (Gold Layer)

In [0]:
# Standard Spark Utilities 
from pyspark.ml.linalg import Vectors

# Data Transformation
from pyspark.ml.feature import StringIndexer, VectorAssembler

# The Three Competitors (All built-in to PySpark)
from pyspark.ml.regression import LinearRegression, RandomForestRegressor, GBTRegressor

# Accuracy & Error Benchmarking
from pyspark.ml.evaluation import RegressionEvaluator

#### Categorical Encoding: Converting Text to Numbers

In [0]:
# String Indexing
indexer = StringIndexer(
    inputCol="sub_category_standardized", 
    outputCol="sub_cat_index",
    handleInvalid="keep" # Ensures the code doesn't crash on new/unseen labels
)

# Fit and Transform the data
indexed_df = indexer.fit(silver_df_cleaned).transform(silver_df_cleaned)

In [0]:
# Displaying the mapping to verify logic
indexed_df.select("sub_category_standardized", "sub_cat_index").distinct().show(10)

+-------------------------+-------------+
|sub_category_standardized|sub_cat_index|
+-------------------------+-------------+
|                  Unknown|          0.0|
|                     Yoga|          7.0|
|         Cardio Equipment|         10.0|
|         Camping & Hiking|          9.0|
|      Fitness Accessories|          1.0|
|                  Running|          8.0|
|                  Cycling|          5.0|
|                  Cricket|          2.0|
|                 Football|          3.0|
|                Badminton|          4.0|
+-------------------------+-------------+
only showing top 10 rows
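
By default, `StringIndexer` orders labels by descending frequency, so index 0.0 goes to the most common label. The sketch below imitates that ordering in plain Python on made-up toy labels; `frequency_index` is a hypothetical helper, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def frequency_index(labels):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

toy_labels = ["Unknown", "Unknown", "Unknown", "Cricket", "Cricket", "Yoga"]
print(frequency_index(toy_labels))
```

The resulting numbers encode how common a category is, not any natural ordering between categories, which matters when a linear model treats the index as a numeric feature.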


#### Vector Assembly

In [0]:
# Define numerical columns for model input
feature_columns = ["sub_cat_index", "ratings", "no_of_ratings", "discount_price"]

# Initialize Assembler to bundle individual columns into a single vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Transform data into Gold format and rename target variable to 'label'
gold_df = assembler.transform(indexed_df) \
                   .withColumnRenamed("actual_price", "label") \
                   .select("name", "sub_category_standardized", "features", "label")

print("Gold Layer Created: Features bundled into vectors.")
display(gold_df.limit(10))

Gold Layer Created: Features bundled into vectors.


name,sub_category_standardized,features,label
Reach B-101 Stationary Upright Bike for Home Gym | Exercise Cycle with Adjustable Resistance and Height Adjusting Cushione...,Unknown,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.0"",""3.8"",""241.0"",""4397.0""]}",12800.0
GRS® Rajson Double Rod Badminton Racquet Pair with 10 Shuttles for Kids 4 to 8 Years for Kids (Multicolor),Unknown,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.0"",""4.0"",""442.0"",""269.0""]}",529.0
"SPEED BIRD Cooper Kids Cycle 16-T Baby Cycle for Boys & Girls - Age Group 3-8 Years (Cooper Blue) (Blue, Cooper) 16 Inches",Unknown,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.0"",""3.1"",""13.0"",""3149.0""]}",4599.0
serveuttam Weight Lifting Wrist Support with Long Straps Gym and Fitness,Unknown,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.0"",""3.3"",""7.0"",""219.0""]}",500.0
"PulGos Half Cut Bike Motorcycle Gloves/Motorbike Half Racing Gloves (M, Black)_P61",Unknown,"{""type"":""0"",""size"":""4"",""indices"":[""3""],""values"":[""299.0""]}",379.0
Adidas Men's Shorts,Unknown,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.0"",""3.9"",""514.0"",""1049.0""]}",1499.0
Gosen Gungnir 85R-HT - 35lbs - Grey - Fully Graphite Badminton Racquet - Unstrung - with Cover,Badminton,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""4.0"",""5.0"",""2.0"",""2850.0""]}",5750.0
"Lining Airforce 80 Lite Badmintonracquet,Coral/Ivory with Free Cover, Material: Carbon Fibre",Badminton,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""4.0"",""0.0"",""0.0"",""4082.0""]}",5590.0
"USI UNIVERSAL THE UNBEATABLE Power Lifting Belt, 790PL_S Light Lifting Belt for Men & Women Weightlifting Competition Work...",Unknown,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.0"",""4.4"",""982.0"",""1529.0""]}",1799.0
USI UNIVERSAL THE UNBEATABLE Wall Pull Up System,Unknown,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.0"",""4.1"",""49.0"",""3855.0""]}",3999.0
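
In the `features` column above, rows with `"type":"1"` are dense vectors, while the row with `"type":"0"` is a sparse vector that stores only its non-zero entries (`size`, `indices`, `values`). The following plain-Python sketch expands a sparse representation into the equivalent dense list; `sparse_to_dense` is a hypothetical helper for illustration only.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# The sparse row from the output above: size 4, one non-zero entry at index 3
print(sparse_to_dense(4, [3], [299.0]))
```

Read against `feature_columns`, that row has `sub_cat_index`, `ratings`, and `no_of_ratings` all equal to 0.0 and a `discount_price` of 299.0.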


In [0]:
# Check for price anomalies
gold_df.describe("label").show()

# Visualize the largest prices to identify outliers
gold_df.orderBy(f.col("label").desc()).show(10)

+-------+-----------------+
|summary|            label|
+-------+-----------------+
|  count|            12496|
|   mean|7893.890125640205|
| stddev|546496.5794390141|
|    min|              0.0|
|    max|       6.108299E7|
+-------+-----------------+

+--------------------+-------------------------+--------------------+----------+
|                name|sub_category_standardized|            features|     label|
+--------------------+-------------------------+--------------------+----------+
|Clovia Women's Ac...|                  Running| [8.0,3.2,4.0,688.0]|6.108299E7|
|PowerMax Fitness ...|         Cardio Equipment|[10.0,0.0,0.0,225...|  365000.0|
|PowerMax Fitness ...|         Cardio Equipment|[10.0,4.0,1.0,145...|  280000.0|
|PowerMax Fitness ...|         Cardio Equipment|[10.0,0.0,0.0,201...|  265000.0|
|PowerMax Fitness ...|         Cardio Equipment|[10.0,0.0,0.0,155...|  255000.0|
|FITNESS WORLD Ser...|         Cardio Equipment|[10.0,0.0,0.0,115...|  224600.0|
|Spirit XT 385 Mot

#### Outliers Removal

In [0]:

# 1. Filter to realistic fitness equipment price ranges
# Removes the 6.1 Crore outliers and items with 0 price
gold_dfc = gold_df.filter(
    (f.col("label") > 100) &          # Must be more than 100 Rs
    (f.col("label") < 100000) &       # Must be less than 1 Lakh (removes the 6.1 Cr error)
    (f.col("discount_price") > 0)     # Must have a valid selling price
)

# 2. Check the new distribution
gold_dfc.describe("label").show()


+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|             11221|
|   mean|2790.5724204616345|
| stddev| 6118.556094972235|
|    min|             119.0|
|    max|           99999.0|
+-------+------------------+
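
A single extreme value like the one removed above can dominate summary statistics. As an illustration, the plain-Python sketch below computes the Pearson correlation on made-up toy prices, with and without one outlier modeled on the maximum label seen earlier. All values and the `pearson` helper are assumptions for illustration, not data from the table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: actual price is exactly twice the discount price
discount = [100.0, 300.0, 500.0, 700.0, 900.0, 1100.0]
actual = [2 * d for d in discount]

r_clean = pearson(discount, actual)
# Add one row with an ordinary discount price but an absurd actual price
r_outlier = pearson(discount + [600.0], actual + [61082990.0])

print(f"without outlier: {r_clean:.4f}")
print(f"with outlier:    {r_outlier:.4f}")
```

In this toy case the correlation is essentially perfect without the outlier and close to zero with it, which is one plausible reason a correlation computed before outlier removal can look uninformative.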



#### Dataset Partitioning

In [0]:
# Partition data using an 80/20 split
# Seed=42 ensures consistent results across multiple executions
train_data, test_data = gold_dfc.randomSplit([0.8, 0.2], seed=42)

print(f"Training set count: {train_data.count()}")
print(f"Testing set count: {test_data.count()}")

Training set count: 9033
Testing set count: 2190
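
`randomSplit` assigns rows to splits by random sampling with the given weights, so the resulting sizes are approximately, not exactly, 80/20. The plain-Python sketch below shows the idea with Python's `random` module; `random_split` is a hypothetical helper and does not reproduce Spark's sampler or its results for `seed=42`.

```python
import random

def random_split(rows, train_weight=0.8, seed=42):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable for the same input, but the exact counts still depend on the random draws and not on a strict 80/20 cut.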


#### Model Comparative Analysis

In [0]:
# Initialize candidate models
lr = LinearRegression(featuresCol="features", labelCol="label")
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=100, seed=42)
gbt = GBTRegressor(featuresCol="features", labelCol="label", maxIter=20, seed=42)

# Candidate List for the Tournament
model_list = [
    ("Linear Regression", lr),
    ("Random Forest", rf),
    ("Gradient Boosted Trees (GBT)", gbt)
]

# Execution and Evaluation Loop
for name, model_obj in model_list:
    # Train model on 80% of the data
    model_fit = model_obj.fit(train_data)
    
    # Generate predictions on the remaining 20%
    predictions = model_fit.transform(test_data)
    
    # Calculate performance metrics
    eval_rmse = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    eval_r2 = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
    
    rmse = eval_rmse.evaluate(predictions)
    r2 = eval_r2.evaluate(predictions)
    
    # Report metrics for this model
    print(f"--- {name} Performance ---")
    print(f"RMSE: {rmse:.2f} (Avg. Price Valuation Error in ₹)")
    print(f"R2 Score: {r2:.4f} (Model Accuracy / Variance Explained)")
    print("-" * 35)

--- Linear Regression Performance ---
RMSE: 2292.00 (Avg. Price Valuation Error in ₹)
R2 Score: 0.8776 (Model Accuracy / Variance Explained)
-----------------------------------
--- Random Forest Performance ---
RMSE: 2533.29 (Avg. Price Valuation Error in ₹)
R2 Score: 0.7688 (Model Accuracy / Variance Explained)
-----------------------------------
--- Gradient Boosted Trees (GBT) Performance ---
RMSE: 2500.68 (Avg. Price Valuation Error in ₹)
R2 Score: 0.8280 (Model Accuracy / Variance Explained)
-----------------------------------
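
For reference, RMSE is the square root of the mean squared error, so it weights large misses more heavily than a plain average error, and R2 is the share of label variance explained by the predictions, not a classification-style accuracy. The sketch below computes both on made-up toy numbers using the standard definitions; the helper names are hypothetical.

```python
import math

def rmse(labels, preds):
    """Root mean squared error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, preds)]
    return math.sqrt(sum(squared_errors) / len(labels))

def r2(labels, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Toy example: every prediction is off by exactly 10
labels = [100.0, 200.0, 300.0]
preds = [110.0, 190.0, 310.0]
print(f"RMSE: {rmse(labels, preds):.2f}")
print(f"R2:   {r2(labels, preds):.4f}")
```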


### Phase 5: Prediction & Strategic Insights

In [0]:
# Apply the final Linear Regression model to estimate expected prices,
# identify underpriced fitness products, and calculate deal strength.

# 1. Final Model Selection
lr_final = LinearRegression(featuresCol="features", labelCol="label")
final_model = lr_final.fit(train_data)
print("Final pricing model trained successfully.")

# 2. Apply the trained model to estimate the expected market price
final_predictions = final_model.transform(test_data)

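# 3. Model Coefficients (Interpretability)
# Display labels for feature_columns, in the same order:
# sub_cat_index, ratings, no_of_ratings, discount_price
columns = ["product_category", "customer_rating", "rating_volume", "listed_price"]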
coefficients = final_model.coefficients.toArray()

print("\nModel Feature Weights (Impact on Price Prediction):")
for col, weight in zip(columns, coefficients):
    print(f"{col}: {weight:.4f}")

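# 4. Identify products priced below expected market value
# Gap = predicted (expected) market price - current market price (label);
# a positive gap means the product is priced below the model's expectation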
analysis_df = final_predictions.withColumn(
    "price_gap_vs_market",
    f.round(f.col("prediction") - f.col("label"), 2)
)

# 5. Undervalued Product Identification (Readable Output)
undervalued_market_assets = analysis_df.filter(f.col("price_gap_vs_market") > 500) \
    .withColumn(
        "Product Name",
        f.when(
            f.length(f.col("name")) > 55,
            f.concat(f.substring(f.col("name"), 1, 42), f.lit("..."))
        ).otherwise(f.col("name"))
    ) \
    .withColumn(
        "Undervaluation Percentage (%)",
        f.round((f.col("price_gap_vs_market") / f.col("prediction")) * 100, 1)
    ) \
    .select(
        f.col("Product Name"),
        f.col("sub_category_standardized").alias("Product Category"),
        f.col("label").alias("Current Market Price (₹)"),
        f.round(f.col("prediction"), 2).alias("Predicted Fair Market Price (₹)"),
        f.round(f.col("price_gap_vs_market"), 2).alias("Price Gap vs Market Expectation (₹)"),
        f.col("Undervaluation Percentage (%)")
    ) \
    .orderBy(f.col("Price Gap vs Market Expectation (₹)").desc()) \
    .limit(10)

# 6. Display Results
print("\n--- MARKET PRICING INSIGHTS: TOP UNDERVALUED FITNESS PRODUCTS ---")
display(undervalued_market_assets)



Final pricing model trained successfully.

Model Feature Weights (Impact on Price Prediction):
product_category: 18.7438
customer_rating: 44.5259
rating_volume: 0.0102
listed_price: 1.6387

--- MARKET PRICING INSIGHTS: TOP UNDERVALUED FITNESS PRODUCTS ---


Product Name,Product Category,Current Market Price (₹),Predicted Fair Market Price (₹),Price Gap vs Market Expectation (₹),Undervaluation Percentage (%)
COSCO FITNESS Low Noise LCD Blue Screen Ru...,Cardio Equipment,89900.0,125700.81,35800.81,28.5
Lifeline Exercise Motorized Treadmill DK 1...,Cardio Equipment,49500.0,69112.37,19612.37,28.4
Viva Fitness Magnetic Exercise Bike,Cardio Equipment,36500.0,52806.84,16306.84,30.9
Cosco CEB Wave 600 U Upright Bike,Cardio Equipment,29999.0,46124.87,16125.87,35.0
Wecare Fitness and Sports K2 2.75 CHP (5.5...,Cardio Equipment,83999.0,98690.13,14691.13,14.9
Montra Downtown 7X3 Geared with Disc 700CX...,Cycling,25500.0,38163.14,12663.14,33.2
Jordan Fitness JF-38 (4 HP DC Peak) Motori...,Cardio Equipment,66000.0,76077.05,10077.05,13.2
Jordan Fitness 3.5 HP DC Peak with Incline...,Cardio Equipment,64500.0,73618.93,9118.93,12.4
Hashtag Fitness 20 In 1 Home Gym Equipment...,Strength Training,18599.0,26672.14,8073.14,30.3
"Fitbit Ace 2 Activity Tracker for Kids, One Size",Cycling,31712.0,39504.83,7792.83,19.7

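The two derived columns in the table above follow simple arithmetic: the gap is the predicted price minus the current price, and the undervaluation percentage is that gap divided by the predicted price. The sketch below reproduces the calculation in plain Python using the rounded values displayed in the first row; the helper names are hypothetical, and Spark works from unrounded predictions, so small rounding differences are possible.

```python
def price_gap(predicted, current):
    """Predicted fair price minus current price, rounded to 2 decimals."""
    return round(predicted - current, 2)

def undervaluation_pct(predicted, current):
    """Gap as a percentage of the predicted price, rounded to 1 decimal."""
    return round(price_gap(predicted, current) / predicted * 100, 1)

# Values taken from the first row of the table above
predicted, current = 125700.81, 89900.0
gap = price_gap(predicted, current)
pct = undervaluation_pct(predicted, current)
print(gap, pct)
```

Because the percentage is relative to the predicted price, two products with the same rupee gap can have very different undervaluation percentages.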

### Notebook Summary
**Problem:** Fitness product prices are inconsistent, so high-value deals are often hidden.

**Data Work:** Loaded the source CSV from a catalog volume, then cleaned and standardized it and handled missing values and outliers.

**Modeling:** Trained Linear Regression, Random Forest, and GBT. Linear Regression was chosen: it had the lowest RMSE and highest R2 on the test set, while the other models were also accurate but less explainable. Outlier removal improved its performance.

**Insights:** Predicted fair prices, calculated price gaps and undervaluation percentages, and identified the top underpriced fitness products for buyers and sellers.