## Training Random Forest Model

In [1]:
import zipfile
import os

zip_path = "../../data/cleaned/regression_price.zip"  
extract_path = "../../data/cleaned/regression_data"   

if not os.path.exists(extract_path):
    print("Unzip...")
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(extract_path)
    print("Ready!")
else:
    print("Folder already exists.")

Folder already exists.


In [1]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

In [2]:
spark = SparkSession.builder \
    .appName("Amazon Price Prediction RF") \
    .config("spark.driver.memory", "8g") \
    .master("local[*]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

In [3]:
DATA_PATH = "../../data/cleaned/regression_data"

In [4]:
df = spark.read.parquet(DATA_PATH)

print(f"Total rows: {df.count()}")

Total rows: 699283


In [5]:
from pyspark.sql.functions import col, size, length, when
from pyspark.sql.types import ArrayType

# Derive features_count from the "features" column:
# array size if the column is an array, string length otherwise.
# Guarding on the column's presence avoids an IndexError when "features" is missing.
if "features" in df.columns:
    if isinstance(df.schema["features"].dataType, ArrayType):
        df = df.withColumn("features_count", size(col("features")))
    elif "features_count" not in df.columns:
        df = df.withColumn("features_count", length(col("features")))

In [6]:
required_cols = ["rating_number", "average_rating", "main_category", "price"]

#### Add some features

In [7]:
if "features_count" in df.columns: required_cols.append("features_count")
if "desc_len" in df.columns: required_cols.append("desc_len")

df_clean = df.dropna(subset=required_cols)

In [8]:
df_clean = df_clean.withColumn("title_len", length(col("title")))

In [9]:
from pyspark.sql.functions import length, col, count

store_counts = df_clean.groupBy("store").agg(count("*").alias("store_freq"))

In [10]:
df_improved = df_clean.join(store_counts, on="store", how="left")
df_improved = df_improved.na.fill(0, subset=["store_freq", "features_count", "title_len"])

#### Split data and train the model

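The next cell joins the per-store frequency counts, fills missing numeric values with 0, drops rows with a null title, tokenizes product titles, removes stop words, hashes the remaining tokens into a 1000-dimensional term-frequency vector, indexes the product category, assembles all features into a single vector, splits the data 80/20 into training and test sets, and finally trains the model and reports R2 on the test set.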
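
To make the title-vectorization step concrete, here is a minimal plain-Python sketch of the hashing trick behind `HashingTF`. It is an illustration only: the stop-word set is a tiny made-up subset, the example title is hypothetical, and `zlib.crc32` stands in for the MurmurHash3 function Spark uses, so bucket indices will not match Spark's.

```python
import zlib
from collections import Counter

NUM_FEATURES = 1000  # same bucket count as numFeatures in the pipeline
STOP_WORDS = {"a", "an", "the", "for", "with", "and", "of"}  # tiny illustrative subset

def hash_title(title, num_features=NUM_FEATURES):
    """Map a title to sparse term counts keyed by hashed bucket index."""
    words = title.lower().split()                     # like Tokenizer: lowercase + whitespace split
    kept = [w for w in words if w not in STOP_WORDS]  # like StopWordsRemover
    # crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3
    buckets = [zlib.crc32(w.encode("utf-8")) % num_features for w in kept]
    return dict(Counter(buckets))

# Hypothetical product title
vec = hash_title("Stainless Steel Water Bottle with Steel Lid")
print(vec)
```

Because the vector has a fixed size, different words can collide in the same bucket; a larger `numFeatures` reduces collisions at the cost of a wider feature vector.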

In [17]:
from pyspark.ml.feature import Tokenizer, HashingTF, StopWordsRemover
from pyspark.sql.functions import broadcast

df_improved = df_clean.join(broadcast(store_counts), on="store", how="left")

cols_to_fill = ["store_freq", "title_len", "average_rating", "rating_number"]
if "features_count" in df_improved.columns: cols_to_fill.append("features_count")
if "desc_len" in df_improved.columns: cols_to_fill.append("desc_len")

df_improved = df_improved.na.fill(0, subset=cols_to_fill)

df_improved = df_improved.filter(col("title").isNotNull())
df_improved.cache()
print(f"Caching...: {df_improved.count()}")
stages = []

tokenizer = Tokenizer(inputCol="title", outputCol="words")
stages.append(tokenizer)

remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
stages.append(remover)

hashingTF = HashingTF(inputCol="filtered_words", outputCol="title_features", numFeatures=1000)
stages.append(hashingTF)

indexer = StringIndexer(inputCol="main_category", outputCol="category_index", handleInvalid="keep")
stages.append(indexer)

numeric_cols = ["category_index", "store_freq", "average_rating", "rating_number"]
if "title_len" in df_improved.columns: numeric_cols.append("title_len")
if "features_count" in df_improved.columns: numeric_cols.append("features_count")
if "desc_len" in df_improved.columns: numeric_cols.append("desc_len")

assembler_inputs = numeric_cols + ["title_features"]
print(f"Features: {assembler_inputs}")

assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features_vector")
stages.append(assembler)

rf = RandomForestRegressor(
    featuresCol="features_vector", 
    labelCol="price", 
    numTrees=30, 
    maxDepth=8, 
    seed=42,
    maxBins=64  
)
stages.append(rf)

pipeline = Pipeline(stages=stages)


train_data, test_data = df_improved.randomSplit([0.8, 0.2], seed=42)
print("Training with NLP features...")
model = pipeline.fit(train_data)
print("Ready!")

predictions = model.transform(test_data)
r2_eval = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
print(f"R2 Score: {r2_eval.evaluate(predictions):.4f}")

Caching...: 644239
Features: ['category_index', 'store_freq', 'average_rating', 'rating_number', 'title_len', 'features_count', 'title_features']
Training with NLP features...
Ready!
R2 Score: 0.1024
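
An R2 of 0.1024 means the model explains roughly 10% of the variance in price on the test split. As a reminder of what that number measures, here is a minimal plain-Python sketch of R2 alongside RMSE and MAE. The function name `regression_metrics` and the toy numbers are illustrative assumptions, not part of the notebook; in Spark, the same metrics are available from `RegressionEvaluator` with `metricName="rmse"` or `metricName="mae"`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae

# Toy prices (hypothetical): predicting the mean everywhere gives R2 = 0
prices = [10.0, 20.0, 30.0, 40.0]
baseline = [25.0] * 4
r2, rmse, mae = regression_metrics(prices, baseline)
print(f"R2={r2:.4f} RMSE={rmse:.4f} MAE={mae:.4f}")
```

Reporting RMSE or MAE next to R2 expresses the error in dollars, which is easier to judge against the price distribution.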


In [18]:
model_path = "../../models/regression/random_forest_price_v1"

print(f"Save model to {model_path}...")
model.write().overwrite().save(model_path)

Save model to ../../models/regression/random_forest_price_v1...


The next cell computes descriptive statistics and percentiles for the product price distribution, prints the min, median, and max values, and computes the correlation between price and selected numeric features such as ratings, feature count, or description length.

In [16]:
from pyspark.sql.functions import (
    log1p, expm1, col, when, length, regexp_extract, 
    lower, expr, percentile_approx, mean, stddev
)

print("\nPrice distribution:")
df_improved.select("price").describe().show()

price_stats = df_improved.agg(
    expr("percentile_approx(price, 0.05) as p5"),
    expr("percentile_approx(price, 0.25) as p25"),
    expr("percentile_approx(price, 0.50) as median"),
    expr("percentile_approx(price, 0.75) as p75"),
    expr("percentile_approx(price, 0.95) as p95"),
    expr("min(price) as min_price"),
    expr("max(price) as max_price")
).collect()[0]

print(f"  Min:    ${price_stats['min_price']:.2f}")
print(f"  5%:     ${price_stats['p5']:.2f}")
print(f"  25%:    ${price_stats['p25']:.2f}")
print(f"  Median: ${price_stats['median']:.2f}")
print(f"  75%:    ${price_stats['p75']:.2f}")
print(f"  95%:    ${price_stats['p95']:.2f}")
print(f"  Max:    ${price_stats['max_price']:.2f}")

print("\nCorrelation:")

numeric_features = ["average_rating", "rating_number"]
if "features_count" in df_improved.columns:
    numeric_features.append("features_count")
if "desc_len" in df_improved.columns:
    numeric_features.append("desc_len")

for feat in numeric_features:
    if feat in df_improved.columns:
        corr = df_improved.stat.corr("price", feat)
        print(f"  {feat:20s} <-> price: {corr:.4f}")


Price distribution:
+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|            644239|
|   mean|63.137878977474756|
| stddev|183.67803447921332|
|    min|              0.01|
|    max|          21999.98|
+-------+------------------+

  Min:    $0.01
  5%:     $6.25
  25%:    $12.99
  Median: $23.99
  75%:    $49.99
  95%:    $219.99
  Max:    $21999.98

Correlation:
  average_rating       <-> price: 0.0094
  rating_number        <-> price: -0.0138
  features_count       <-> price: 0.0023


### Conclusions from Price Distribution:

1. Highly skewed price distribution

    The mean price is about $63.14, but the median is only $23.99, meaning most products are inexpensive while a small number of very expensive products pull the mean upward.

2. Most prices fall in the lower range

    25% of products cost $12.99 or less.

    75% cost $49.99 or less.

    Only 5% exceed $219.99.

3. There are extreme outliers

    The maximum price is $21,999.98, far above typical product prices, which suggests the presence of niche or incorrectly priced items.

4. Price range is wide

    Prices range from $0.01 to $21,999.98, which points to very diverse product types or potentially inconsistent listings.
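
Given this skew, a common option is to train on a log-transformed target and invert the predictions afterwards. The notebook imports `log1p` and `expm1` but never applies them, so the following is only a sketch of that idea in plain Python, using the percentiles printed above. Whether it would improve this model's R2 has not been tested here.

```python
import math

# Price percentiles reported in the output above: min, 5%, 25%, median, 75%, 95%, max
prices = [0.01, 6.25, 12.99, 23.99, 49.99, 219.99, 21999.98]

# log1p(x) = log(1 + x) compresses the long right tail
log_prices = [math.log1p(p) for p in prices]

# expm1 is the inverse of log1p, so predictions can be mapped back to dollars
restored = [math.expm1(v) for v in log_prices]

for p, lp in zip(prices, log_prices):
    print(f"${p:>10.2f} -> {lp:.3f}")

# Spread between the maximum and the median, before and after the transform
raw_ratio = prices[-1] / prices[3]
log_ratio = log_prices[-1] / log_prices[3]
print(f"max/median raw: {raw_ratio:.1f}, log scale: {log_ratio:.2f}")
```

On the log scale the maximum is only a few times the median rather than hundreds of times larger, so extreme listings would dominate a squared-error objective far less.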

### Conclusions from Correlation Analysis:

All correlations between the numeric features and price are close to 0, meaning there is almost no linear relationship:

1. Average rating → price (0.0094)

    Product rating has no meaningful correlation with price.

    High-rated products are not necessarily more expensive.

2. Number of ratings → price (-0.0138)

    Popular products (many reviews) are not systematically more expensive or cheaper.

3. Features count → price (0.0023)

    The number of features extracted from the listing shows no linear relationship with price.
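
Note that `df.stat.corr` computes the Pearson coefficient, which only measures linear association. A near-zero value therefore does not rule out a nonlinear relationship that a tree-based model could still exploit. A minimal plain-Python sketch with made-up data (the helper `pearson` is illustrative) shows a fully deterministic relationship whose Pearson correlation is zero:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = list(range(-5, 6))
ys_nonlinear = [x * x for x in xs]   # y depends entirely on x, but not linearly
ys_linear = [3 * x + 1 for x in xs]  # y is a linear function of x

print(f"quadratic: {pearson(xs, ys_nonlinear):.4f}")
print(f"linear:    {pearson(xs, ys_linear):.4f}")
```

For the notebook's features, a rank-based measure or a look at binned price averages would be a reasonable follow-up before concluding that these columns carry no signal.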

In [19]:
spark.stop()