## Training GBT Model

In [14]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

In [13]:
spark = SparkSession.builder \
    .appName("Amazon Price Prediction GBT") \
    .config("spark.driver.memory", "8g") \
    .master("local[*]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")


In [15]:
DATA_PATH = "../../data/cleaned/regression_data"

In [16]:
df = spark.read.parquet(DATA_PATH)

print(f"Total rows: {df.count()}")

Total rows: 699283


In [17]:
from pyspark.sql.functions import col, size, length, when
from pyspark.sql.types import ArrayType

# Derive features_count from "features": element count for arrays, string length otherwise.
# Checking for the column first avoids an IndexError when "features" is missing.
if "features" in df.columns:
    if isinstance(df.schema["features"].dataType, ArrayType):
        df = df.withColumn("features_count", size(col("features")))
    elif "features_count" not in df.columns:
        df = df.withColumn("features_count", length(col("features")))

In [18]:
required_cols = ["rating_number", "average_rating", "main_category", "price"]

In [19]:
if "features_count" in df.columns: required_cols.append("features_count")
if "desc_len" in df.columns: required_cols.append("desc_len")

df_clean = df.dropna(subset=required_cols)

In [20]:
df_clean = df_clean.withColumn("title_len", length(col("title")))

In [21]:
from pyspark.sql.functions import length, col, count

store_counts = df_clean.groupBy("store").agg(count("*").alias("store_freq"))

In [22]:
df_improved = df_clean.join(store_counts, on="store", how="left")
df_improved = df_improved.na.fill(0, subset=["store_freq", "features_count", "title_len"])

#### Prepare data and train the model

This code transforms the target variable using log1p(price) to stabilize variance and improve training.

It prepares NLP and numeric features, builds a full pipeline, and trains a Gradient Boosted Trees model.
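
To make the text-feature step concrete, the sketch below mimics in plain Python what the Tokenizer, StopWordsRemover, and HashingTF stages do to a title. It is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, the helper name `hash_title` is made up, and `zlib.crc32` stands in for the MurmurHash3 function that Spark uses, so the bucket indices will not match Spark's.

```python
import zlib
from collections import Counter

# Tiny illustrative stop-word list (Spark's default English list is much longer).
STOP_WORDS = {"a", "an", "the", "of", "for", "and"}

def hash_title(title: str, num_features: int = 1000) -> dict:
    """Return a sparse term-frequency vector as {bucket_index: count}."""
    # Tokenizer: lowercase and split on whitespace.
    words = title.lower().split()
    # StopWordsRemover: drop very common words.
    filtered = [w for w in words if w not in STOP_WORDS]
    # HashingTF: map each word to a bucket with a hash, then count per bucket.
    # crc32 is a stand-in here; Spark uses MurmurHash3.
    buckets = [zlib.crc32(w.encode("utf-8")) % num_features for w in filtered]
    return dict(Counter(buckets))

vec = hash_title("Analog/Dual Shock Controller - Emerald")
print(vec)
```

Different words can land in the same bucket, so `numFeatures` trades vector size against collisions.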

After training, it converts predictions back to real prices using expm1, evaluates the model with R² on the log scale and with RMSE and MAE in dollars, and shows example predictions.
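
As a quick check of the target transform, the following sketch uses Python's `math` module (not Spark) to show that `expm1` undoes `log1p`, including for a price of 0, where a plain logarithm would be undefined. The sample prices are illustrative.

```python
import math

prices = [0.0, 5.99, 22.41, 180.09]

for price in prices:
    label = math.log1p(price)      # what the model is trained to predict
    restored = math.expm1(label)   # what is reported as prediction_price
    print(f"price={price:>7.2f}  label={label:.4f}  restored={restored:.2f}")
    assert math.isclose(price, restored, rel_tol=1e-12, abs_tol=1e-12)
```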

In [23]:
from pyspark.sql.functions import log1p, expm1, col
from pyspark.ml.regression import GBTRegressor 
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, StopWordsRemover

df_log = df_improved.withColumn("label", log1p(col("price")))

stages = []

tokenizer = Tokenizer(inputCol="title", outputCol="words")
stages.append(tokenizer)
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
stages.append(remover)
hashingTF = HashingTF(inputCol="filtered_words", outputCol="title_features", numFeatures=1000)
stages.append(hashingTF)

# Indexer
indexer = StringIndexer(inputCol="main_category", outputCol="category_index", handleInvalid="keep")
stages.append(indexer)

# Assembler
numeric_cols = ["category_index", "store_freq", "average_rating", "rating_number"]
if "title_len" in df_log.columns: numeric_cols.append("title_len")
if "features_count" in df_log.columns: numeric_cols.append("features_count")
if "desc_len" in df_log.columns: numeric_cols.append("desc_len")

input_cols = numeric_cols + ["title_features"]
assembler = VectorAssembler(inputCols=input_cols, outputCol="features_vector")
stages.append(assembler)

gbt = GBTRegressor(
    featuresCol="features_vector", 
    labelCol="label",    
    maxIter=50,           
    maxDepth=8,           
    stepSize=0.1,         
    seed=42,
    maxBins=64
)
stages.append(gbt)

pipeline = Pipeline(stages=stages)

train_data, test_data = df_log.randomSplit([0.8, 0.2], seed=42)
print(f"Train: {train_data.count()}, Test: {test_data.count()}")

print("Training GBT on Log(Price)...")
model = pipeline.fit(train_data)
print("Ready!")

predictions = model.transform(test_data)

predictions = predictions.withColumn("prediction_price", expm1(col("prediction")))

r2_eval = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
print(f"R2 (Log Scale): {r2_eval.evaluate(predictions):.4f}")

rmse_eval = RegressionEvaluator(labelCol="price", predictionCol="prediction_price", metricName="rmse")
mae_eval = RegressionEvaluator(labelCol="price", predictionCol="prediction_price", metricName="mae")

print(f"RMSE (Real $): {rmse_eval.evaluate(predictions):.2f}")
print(f"MAE (Real $):  {mae_eval.evaluate(predictions):.2f}")

predictions.select("title", "price", "prediction_price").show(5, truncate=False)

Train: 515254, Test: 128985
Training GBT on Log(Price)...
Ready!
R2 (Log Scale): 0.3104
RMSE (Real $): 187.57
MAE (Real $):  45.27
+-------------------------------------------------------------------------------+------+------------------+
|title                                                                          |price |prediction_price  |
+-------------------------------------------------------------------------------+------+------------------+
|Warhammer 40k Adeptus Mechanicus Codex                                         |22.41 |17.730866095623913|
|[ST125] 2 Piece VVT-I DOHC Vinyl Sticker JDM Stickers 2JZ Supra Corrolla SILVER|5.99  |5.794295090694468 |
|Analog/Dual Shock Controller - Emerald                                         |57.77 |31.35317626681251 |
|Beetle Adventure Racing                                                        |43.98 |29.378666176271825|
|Wave Race 64 (Japan)                                                           |180.09|29.892769364260023|
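+-------------------------------------------------------------------------------+------+------------------+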

We used log1p(price), that is log(1 + price), instead of the raw price because:

1. Log transformation reduces the impact of extreme outliers.

2. It makes the distribution more symmetric.

3. Many ML models perform better when the target follows a near-normal distribution.

4. It stabilizes variance, so errors on expensive items do not dominate the squared-error loss.
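
The effect on symmetry can be checked on a small sample. The sketch below computes the population skewness of the five example prices printed above, before and after `log1p`. The `skewness` helper is written here for illustration, and five rows are far too few to characterize the full dataset.

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation divided by std**3."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5

# Prices of the five example rows printed above.
prices = [22.41, 5.99, 57.77, 43.98, 180.09]
log_prices = [math.log1p(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of price:        {skew_raw:.2f}")
print(f"skewness of log1p(price): {skew_log:.2f}")
```

A skewness closer to 0 means a more symmetric distribution.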

In [24]:
model_path = "../../models/regression/gbt_price_log_v1"

print(f"\nSaving model to {model_path}...")

model.write().overwrite().save(model_path)


Saving model to ../../models/regression/gbt_price_log_v1...


GBTRegressor significantly outperformed the Random Forest model because:

1. GBT builds its trees sequentially, and each new tree is fitted to the errors left by the trees before it.

2. It captures complex non-linear relationships better.

3. It handles skewed and noisy data more effectively.
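
The iterative error correction in point 1 can be sketched in plain Python. This is a toy version for squared loss with one-split trees (stumps) on made-up one-dimensional data, not Spark's implementation. `step_size` and `n_rounds` play the roles of `stepSize` and `maxIter`, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that minimizes squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=50, step_size=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Move predictions a small step toward the residuals.
        preds = [p + step_size * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + step_size * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2, 6.0, 6.1]

for rounds in (1, 10, 50):
    model = boost(xs, ys, n_rounds=rounds)
    print(f"rounds={rounds:2d}  train MSE={mse(model, xs, ys):.4f}")
```

The training error shrinks as rounds are added. A Random Forest instead averages trees that are grown independently of one another.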

As a result, GBT provides:

1. Higher R² on the log scale

2. Lower RMSE and MAE after inverse-transforming predictions
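
For reference, the three metrics can be computed by hand. The sketch below uses the standard definitions in plain Python, applied to the five example rows printed by the training cell, with predictions rounded to cents. The helper name `regression_metrics` is hypothetical, and five rows say nothing reliable about the full test set.

```python
import math

def regression_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equal-length lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e ** 2 for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot   # negative when worse than predicting the mean
    return rmse, mae, r2

# The five example rows printed by the training cell (predictions rounded to cents).
price = [22.41, 5.99, 57.77, 43.98, 180.09]
prediction_price = [17.73, 5.79, 31.35, 29.38, 29.89]

rmse, mae, r2 = regression_metrics(price, prediction_price)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.4f}")
```

RMSE squares each error, so a single large miss such as the $180.09 item raises it far more than it raises MAE. That is consistent with the gap between the reported RMSE (187.57) and MAE (45.27).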

#### Try another way to process data

In [25]:
from pyspark.sql.functions import (
    log1p, expm1, col, when, length, regexp_extract, 
    lower, expr, percentile_approx, mean, stddev
)
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    StringIndexer, VectorAssembler, StandardScaler, 
    Tokenizer, StopWordsRemover, HashingTF
)
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator


initial_count = df_improved.count()
print(f"Initial number of records: {initial_count:,}")

price_stats = df_improved.select(
    percentile_approx("price", 0.25).alias("q1"),
    percentile_approx("price", 0.75).alias("q3")
).collect()[0]

q1, q3 = price_stats["q1"], price_stats["q3"]
iqr = q3 - q1
lower_bound = max(1.0, q1 - 1.5 * iqr)  
upper_bound = q3 + 1.5 * iqr

print(f"\nRemoving price outliers:")
print(f"  Q1 = ${q1:.2f}, Q3 = ${q3:.2f}, IQR = ${iqr:.2f}")

df_clean = df_improved.filter(
    (col("price") >= lower_bound) & (col("price") <= upper_bound)
)
print(f"  Deleted: {initial_count - df_clean.count():,}")

df_clean = df_clean.filter(col("rating_number") >= 5)
print(f"Remaining after rating filter: {df_clean.count():,}")

df_clean = df_clean.filter(
    (col("average_rating") >= 1.0) & (col("average_rating") <= 5.0)
)


df_featured = df_clean

df_featured = df_featured.withColumn(
    "rating_popularity", col("average_rating") * log1p(col("rating_number"))
)
df_featured = df_featured.withColumn(
    "rating_confidence", when(col("rating_number") < 10, 0)
                         .when(col("rating_number") < 50, 1)
                         .when(col("rating_number") < 100, 2)
                         .otherwise(3)
)

if "title" in df_featured.columns:
    df_featured = df_featured.withColumn("title_len", length(col("title")))
    
    # Note: the pattern is unanchored, so "pro" also matches inside words such as "product".
    df_featured = df_featured.withColumn(
        "is_premium", 
        when(lower(col("title")).rlike("premium|deluxe|professional|pro"), 1).otherwise(0)
    )
    df_featured = df_featured.withColumn(
        "is_bundle", 
        when(lower(col("title")).rlike("bundle|pack|set"), 1).otherwise(0)
    )
    df_featured = df_featured.withColumn(
        "has_size", 
        when(lower(col("title")).rlike("\\d+\\s*(oz|lb|ml|kg|inch|ft)"), 1).otherwise(0)
    )

if "features_count" in df_featured.columns:
    df_featured = df_featured.withColumn(
        "features_count", when(col("features_count").isNull(), 0).otherwise(col("features_count"))
    )
if "desc_len" in df_featured.columns:
    df_featured = df_featured.withColumn(
        "desc_len", when(col("desc_len").isNull(), 0).otherwise(col("desc_len"))
    )
df_featured = df_featured.withColumn("log_rating_number", log1p(col("rating_number")))
if "store_freq" in df_featured.columns:
    df_featured = df_featured.withColumn("log_store_freq", log1p(col("store_freq")))

df_featured = df_featured.withColumn("label", log1p(col("price")))


stages = []

if "title" in df_featured.columns:
    tokenizer = Tokenizer(inputCol="title", outputCol="words")
    stages.append(tokenizer)
    
    hashingTF = HashingTF(inputCol="words", outputCol="title_features", numFeatures=500)
    stages.append(hashingTF)

indexer = StringIndexer(
    inputCol="main_category", 
    outputCol="category_index", 
    handleInvalid="keep"
)
stages.append(indexer)

numeric_cols = [
    "category_index",
    "average_rating",
    "log_rating_number",      
    "rating_popularity",
    "rating_confidence",
]

optional_cols = [
    "title_len", "is_premium", "is_bundle", "has_size",
    "features_count", "desc_len", "log_store_freq"
]

for col_name in optional_cols:
    if col_name in df_featured.columns:
        numeric_cols.append(col_name)

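input_cols = numeric_cols
# Note: "title_features" is created by the HashingTF stage when the pipeline runs,
# so it is not in df_featured.columns and this check is False here. The results
# below therefore come from a model trained on numeric_cols only. To include the
# hashed title features, test for "title" instead and expand feature_names per vector slot.
if "title_features" in df_featured.columns:
    input_cols = numeric_cols + ["title_features"]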


assembler = VectorAssembler(inputCols=input_cols, outputCol="features_raw")
stages.append(assembler)

scaler = StandardScaler(
    inputCol="features_raw", 
    outputCol="features_vector",
    withStd=True,
    withMean=False 
)
stages.append(scaler)


gbt = GBTRegressor(
    featuresCol="features_vector",
    labelCol="label",
    maxIter=30,         
    maxDepth=6,          
    stepSize=0.05,       
    subsamplingRate=0.8, 
    minInstancesPerNode=10,
    seed=42
)
stages.append(gbt)

Initial number of records: 644,239

Removing price outliers:
  Q1 = $12.99, Q3 = $49.99, IQR = $37.00
  Deleted: 78,561
Remaining after rating filter: 444,684
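
The price bounds implied by these quartiles can be reproduced by hand. The sketch below repeats the IQR arithmetic from the cell in plain Python. The `is_kept` helper is hypothetical and only mirrors the filter condition.

```python
import math

# Quartiles reported in the cell output above.
q1, q3 = 12.99, 49.99

iqr = q3 - q1
lower_bound = max(1.0, q1 - 1.5 * iqr)   # floor at $1, as in the notebook
upper_bound = q3 + 1.5 * iqr

print(f"IQR = {iqr:.2f}")
print(f"keep prices in [{lower_bound:.2f}, {upper_bound:.2f}]")

def is_kept(price: float) -> bool:
    """True if the price would survive the outlier filter."""
    return lower_bound <= price <= upper_bound
```

With these quartiles the filter keeps prices from $1.00 to about $105.49, so an item like the $180.09 example from the first run is dropped. Part of the lower RMSE in this second experiment likely reflects that narrower price range rather than a better model.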


In [26]:
pipeline = Pipeline(stages=stages)

train_data, test_data = df_featured.randomSplit([0.8, 0.2], seed=42)
print(f"Train: {train_data.count():,}, Test: {test_data.count():,}")

train_data.cache()
test_data.cache()

model = pipeline.fit(train_data)

Train: 355,713, Test: 88,971


In [27]:
predictions = model.transform(test_data)

predictions = predictions.withColumn("prediction_price", expm1(col("prediction")))
predictions = predictions.withColumn("error", col("price") - col("prediction_price"))
predictions = predictions.withColumn("error_pct", 
    (col("error") / col("price")) * 100
)

r2_log = RegressionEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="r2"
).evaluate(predictions)
rmse = RegressionEvaluator(
    labelCol="price", 
    predictionCol="prediction_price", 
    metricName="rmse"
).evaluate(predictions)

mae = RegressionEvaluator(
    labelCol="price", 
    predictionCol="prediction_price", 
    metricName="mae"
).evaluate(predictions)

r2_real = RegressionEvaluator(
    labelCol="price", 
    predictionCol="prediction_price", 
    metricName="r2"
).evaluate(predictions)

print("\nResults:")
print(f"  R² (Log Scale):  {r2_log:.4f}")
print(f"  R² (Real Price): {r2_real:.4f}")
print(f"  RMSE:            ${rmse:.2f}")
print(f"  MAE:             ${mae:.2f}")

print("\n Examples of predictions:")
predictions.select(
    "title", 
    col("price").alias("real_$"), 
    col("prediction_price").alias("pred_$"),
    col("error_pct").alias("error_%")
).show(10, truncate=60)


gbt_model = model.stages[-1]

importances = gbt_model.featureImportances.toArray()
feature_names = numeric_cols + (["title_features"] if "title_features" in df_featured.columns else [])

feature_imp = list(zip(feature_names, importances))
feature_imp.sort(key=lambda x: x[1], reverse=True)

print("\nImportant features:")
for i, (name, importance) in enumerate(feature_imp[:10], 1):
    print(f"  {i:2d}. {name:25s} {importance:.4f}")

train_data.unpersist()
test_data.unpersist()

Results:
  R² (Log Scale):  0.0943
  R² (Real Price): -0.0215
  RMSE:            $22.37
  MAE:             $15.18

 Examples of predictions:
+------------------------------------------------------------+------+------------------+-------------------+
|                                                       title|real_$|            pred_$|            error_%|
+------------------------------------------------------------+------+------------------+-------------------+
|          .30-06 Outdoors Premium Parallel Limb Bow Case 41"| 49.99| 25.99273602864092|  48.00413051428783|
|                .30-06 Mustang Compact Camo Release Web Stem| 25.99|29.892387867223743|-15.014960101854927|
|                               .30-06 Outdoors K3 Stabilizer| 54.95|21.933703036420997|  60.08425344515375|
|                   30-06 Outdoors Alpha Crossbow Case, Black| 94.76| 26.21370614809161|  72.33673959778324|
|                       .30-06 10 Ring Paper Target 100 Count| 51.59|26.105859987138903|  49.39

DataFrame[store: string, parent_asin: string, title: string, main_category: string, average_rating: double, rating_number: bigint, features: array<string>, description: array<string>, price: float, main_category_label: string, features_count: int, title_len: int, store_freq: bigint, rating_popularity: double, rating_confidence: int, is_premium: int, is_bundle: int, has_size: int, log_rating_number: double, log_store_freq: double, label: double]

The feature importances show which inputs have the strongest impact on the price prediction; category, store frequency, and title length are the most influential.
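
One caveat when reading importances: `featureImportances` has one entry per slot of the assembled vector, so a hashed text column contributes `numFeatures` entries rather than one. In the run above the name list is just `numeric_cols`, because `title_features` is not a column of `df_featured`. If a hashed column is assembled, the names need to be expanded per slot before zipping. Below is a minimal sketch in plain Python. Both helper names and all numbers are hypothetical, for illustration only.

```python
def expand_feature_names(numeric_cols, hashed_col=None, num_hashed=0):
    """One name per slot of the assembled vector: scalars first, then hashed slots."""
    names = list(numeric_cols)
    if hashed_col is not None:
        names += [f"{hashed_col}[{i}]" for i in range(num_hashed)]
    return names

def top_features(names, importances, k=10):
    """Pair names with importances and return the k largest."""
    if len(names) != len(importances):
        raise ValueError(f"{len(names)} names but {len(importances)} importances")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical values for illustration only, not output from the model above.
numeric_cols = ["category_index", "average_rating", "title_len"]
names = expand_feature_names(numeric_cols, hashed_col="title_features", num_hashed=4)
importances = [0.40, 0.10, 0.20, 0.05, 0.00, 0.15, 0.10]

for name, score in top_features(names, importances, k=3):
    print(f"{name:20s} {score:.2f}")
```

The length check guards against `zip` silently truncating when the name list and the importance vector disagree.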