# Sports and Outdoors Category - Business Questions Implementation

**Category**: Sports and Outdoors  
**Team Member**: Denys Koval


In [1]:
import os
from pathlib import Path
from datetime import datetime, timedelta
import warnings

import rootutils

rootutils.setup_root(Path.cwd(), indicator=".project-root", pythonpath=True)

ROOT_DIR = Path(os.environ.get("PROJECT_ROOT", Path.cwd()))
CLEANED_REVIEWS_PATH = ROOT_DIR / "data/cleaned/review_categories/sports_and_outdoors_reviews_cleaned.parquet"
CLEANED_METADATA_PATH = ROOT_DIR / "data/cleaned/meta_categories/sports_and_outdoors_metadata_cleaned.parquet"

warnings.filterwarnings("ignore")

In [2]:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from src.amazon_reviews_analysis.utils import build_spark

# Initialize Spark
spark = build_spark()
print("✓ Spark Session created successfully!")



✓ Spark Session created successfully!


In [3]:
print("📂 Loading cleaned datasets...")
reviews_df = spark.read.parquet(str(CLEANED_REVIEWS_PATH))
metadata_df = spark.read.parquet(str(CLEANED_METADATA_PATH))

print(f"Reviews: {reviews_df.count():,} records")
print(f"Metadata: {metadata_df.count():,} records")

📂 Loading cleaned datasets...


                                                                                

Reviews: 19,362,313 records
Metadata: 1,587,309 records


## Q1: Heavy-Duty Outdoor Gear (FILTER)

**Business Question**: Which outdoor equipment products are highly rated (≥4.5 stars), frequently reviewed (≥200 reviews), and receive detailed written feedback (average review length ≥150 characters)?

**Purpose**: Identify reliable, battle-tested outdoor gear that customers trust enough to write comprehensive reviews about their experiences.


In [4]:
print("Finding heavy-duty outdoor gear with extensive customer feedback\n")

# Calculate average review text length per product
reviews_with_length = reviews_df.withColumn("text_length", length(col("text")))

avg_review_length_per_product = (
    reviews_with_length
    .groupBy("parent_asin")
    .agg(
        avg("text_length").alias("avg_text_length"),
        count("*").alias("review_count")
    )
)

# Join with metadata and filter
q1_result = (
    metadata_df
    .join(avg_review_length_per_product, "parent_asin", "inner")
    .filter(
        (col("average_rating") >= 4.5)
        & (col("review_count") >= 200)
        & (col("avg_text_length") >= 150)  # Detailed reviews
    )
    .select("parent_asin", "title", "average_rating", "rating_number", "review_count", "avg_text_length", "store")
    .orderBy(col("review_count").desc())
)

print(f"Found {q1_result.count()} heavy-duty outdoor gear products with detailed reviews\n")
q1_result.show(15, truncate=60)

Finding heavy-duty outdoor gear with extensive customer feedback



                                                                                

Found 4934 heavy-duty outdoor gear products with detailed reviews



                                                                                

+-----------+------------------------------------------------------------+--------------+-------------+------------+------------------+-----------------+
|parent_asin|                                                       title|average_rating|rating_number|review_count|   avg_text_length|            store|
+-----------+------------------------------------------------------------+--------------+-------------+------------+------------------+-----------------+
| B0C5RBPW2Y|BalanceFrom All Purpose 1/2-Inch Extra Thick High Density...|           4.5|        88491|       20076|164.03905160390516|      BalanceFrom|
| B0B7J8Y581|Repel Umbrella The Original Portable Travel Umbrella - Um...|           4.5|        62293|       14407|170.20163809259387|   Repel Umbrella|
| B0C5XW2T2N|                                    Physix Gear Sport Modern|           4.5|        71254|       14344| 176.0022308979364|Physix Gear Sport|
| B0BBFB48YQ|Howard Leight by Honeywell Impact Sport Sound Amplificati...|  

**Business Insights:**

- Products with long, detailed reviews indicate customers are passionate and engaged
- High review volume with detailed feedback = proven reliability in real-world use
- These are "workhorse" products that outdoor enthusiasts depend on
- Perfect for featuring in "Pro's Choice" or "Field-Tested" collections
- Can be highlighted in content marketing and influencer partnerships
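
The Q1 logic can be sketched in plain Python, without Spark, to make the two steps explicit: aggregate review count and average text length per product, then apply the three thresholds. The sample reviews and the helper names `review_stats` and `is_heavy_duty` are hypothetical and only illustrate the idea; the thresholds are the ones used in the cell above.

```python
from collections import defaultdict

# Hypothetical sample data: (parent_asin, review text). Not taken from the dataset.
reviews = [
    ("A1", "x" * 200),
    ("A1", "y" * 120),
    ("B2", "short"),
]


def review_stats(rows):
    """Return {parent_asin: (review_count, avg_text_length)}."""
    totals = defaultdict(lambda: [0, 0])  # [review count, summed text length]
    for asin, text in rows:
        totals[asin][0] += 1
        totals[asin][1] += len(text)
    return {asin: (n, total / n) for asin, (n, total) in totals.items()}


def is_heavy_duty(average_rating, review_count, avg_text_length):
    """Apply the same three thresholds as the Q1 filter."""
    return average_rating >= 4.5 and review_count >= 200 and avg_text_length >= 150


stats = review_stats(reviews)
print(stats)

# First row of the Q1 output: rating 4.5, 20,076 reviews, average length ~164 characters.
print(is_heavy_duty(4.5, 20076, 164.03905160390516))
```

Note that `average_rating` comes from the metadata while `review_count` is counted from the reviews table, so the two figures describe slightly different populations (compare `rating_number` and `review_count` in the output above).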

---


## Q2: Polarizing Products - High Rating Variance (FILTER)

**Business Question**: Which products have significant disagreement among reviewers (mix of 5-star and 1-star ratings), indicating quality inconsistency or specific use-case suitability?

**Purpose**: Identify products that work great for some but not others - these need better descriptions, quality control, or targeted marketing.


In [None]:
print("Finding products with high rating variance (polarizing reviews)\n")

# Calculate rating statistics per product and cache the small aggregated result,
# since it is reused by both count() and show() below
rating_variance = (
    reviews_df
    .select("parent_asin", "rating")
    .groupBy("parent_asin")
    .agg(
        count("*").alias("total_reviews"),
        avg("rating").alias("avg_rating"),
        stddev("rating").alias("rating_stddev"),
        sum(when(col("rating") == 5, 1).otherwise(0)).alias("five_star_count"),
        sum(when(col("rating") == 1, 1).otherwise(0)).alias("one_star_count"),
    )
    .filter(col("total_reviews") >= 100)  # Minimum reviews for a stable variance estimate
    .cache()  # Released by rating_variance.unpersist() at the end of this cell
)

# Join with metadata
q2_result = (
    rating_variance.join(metadata_df.select("parent_asin", "title", "store", "average_rating"), "parent_asin", "inner")
    .filter(
        (col("rating_stddev") >= 1.3)  # High variance
        & (col("five_star_count") >= 20)  # Has both extremes
        & (col("one_star_count") >= 20)
    )
    .withColumn(
        "polarization_score",
        col("rating_stddev") * (col("five_star_count") + col("one_star_count")) / col("total_reviews"),
    )
    .select(
        "parent_asin",
        "title",
        "store",
        "average_rating",
        "total_reviews",
        "rating_stddev",
        "five_star_count",
        "one_star_count",
        "polarization_score",
    )
    .orderBy(col("polarization_score").desc())
)

print(f"Found {q2_result.count()} polarizing products\n")
q2_result.show(15, truncate=60)

# Clean up
rating_variance.unpersist()
print("✓ Q2 memory cleaned")

Finding products with high rating variance (polarizing reviews)



                                                                                

Found 8394 polarizing products





+-----------+-------------------------------------------------------------+-----------------+--------------+-------------+------------------+---------------+--------------+------------------+
|parent_asin|                                                        title|            store|average_rating|total_reviews|     rating_stddev|five_star_count|one_star_count|polarization_score|
+-----------+-------------------------------------------------------------+-----------------+--------------+-------------+------------------+---------------+--------------+------------------+
| B08JYSZX78| EVAPLUS 42V Charger PowerFast 3-Prong Inline Connector fo...|          EVAPLUS|           4.1|          170| 1.924289210653477|             75|            79|1.7431796378860909|
| B07Z9HD61R|                               NCAA License Plate Frame Black|   Elite Fan Shop|           4.6|          115| 1.891729964296823|             65|            40|1.7272317065318818|
| B00MHTDQCG|                           

                                                                                

**Business Insights:**

- High variance = product quality issues OR highly specific use-case fit
- Products with mixed reviews need investigation: manufacturing inconsistency? Size/fit issues? Misleading descriptions?
- Can improve product listings with better specifications, size guides, use-case descriptions
- May need quality control review or customer education content
- Understanding why ratings vary helps improve customer satisfaction
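
As a sanity check on the `polarization_score` formula, the sketch below recomputes it in plain Python. It assumes the sample standard deviation, which is what Spark's `stddev` returns; the two rating lists are hypothetical, while `row_score` uses the values from the first row of the output above.

```python
from statistics import stdev


def polarization_score(ratings):
    """rating_stddev * (five_star_count + one_star_count) / total_reviews."""
    total = len(ratings)
    five_star = sum(1 for r in ratings if r == 5)
    one_star = sum(1 for r in ratings if r == 1)
    return stdev(ratings) * (five_star + one_star) / total


# Hypothetical products: one split between extremes, one with broad agreement.
split_ratings = [5] * 50 + [1] * 50
consensus_ratings = [4] * 50 + [5] * 50

print(polarization_score(split_ratings))
print(polarization_score(consensus_ratings))

# Recompute the first output row: stddev 1.924..., 75 five-star, 79 one-star, 170 reviews.
row_score = 1.924289210653477 * (75 + 79) / 170
print(row_score)
```

The score rewards both a wide spread and a large share of extreme ratings, so a product with many 5-star and few 1-star reviews scores lower than an evenly split one.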

---


## Q3: Gift-Worthy Products - High Ratings, Low Price (FILTER)

**Business Question**: Which low-cost products (under $25) have exceptional ratings (≥4.7) and substantial reviews (≥150), making them perfect gift recommendations?

**Purpose**: Identify affordable, highly-rated products ideal for holiday gift guides, starter kits, and value promotions.


In [6]:
print("Finding gift-worthy products: low price + exceptional quality\n")

q3_result = (
    metadata_df.withColumn("price", col("price").try_cast("double"))
    .filter(
        (col("price").isNotNull())
        & (col("price") > 0)  # Ensure price is positive to avoid division by zero
        & (col("price") < 25)
        & (col("average_rating") >= 4.7)
        & (col("rating_number") >= 150)
    )
    .withColumn("value_score", col("average_rating") * col("rating_number") / col("price"))
    .select("parent_asin", "title", "price", "average_rating", "rating_number", "value_score", "store")
    .orderBy(col("value_score").desc())
)

print(f"Found {q3_result.count()} gift-worthy products\n")
q3_result.show(10, truncate=30)

Finding gift-worthy products: low price + exceptional quality

Found 9305 gift-worthy products





+-----------+------------------------------+-----+--------------+-------------+------------------+---------+
|parent_asin|                         title|price|average_rating|rating_number|       value_score|    store|
+-----------+------------------------------+-----+--------------+-------------+------------------+---------+
| B081ZSPJ9Z|Crosman Copperhead 4.5mm Co...| 2.98|           4.7|        50432| 79540.40268456377|  Crosman|
| B0C1VN55TH|Crosman CO2 Cartridges for ...| 4.64|           4.7|        56123| 56848.72844827588|  Crosman|
| B0B6R8N6WQ|SABRE Pepper Spray, Quick R...| 9.99|           4.7|        99602|  46859.7997997998|    SABRE|
| B0849H54B7|HOPPE'S No. 9 Lubricating O...| 3.99|           4.8|        21307|25632.481203007515|  HOPPE'S|
| B0BX5QFWQN|LifeStraw Personal Water Fi...|19.95|           4.8|        90349|21738.105263157897|LifeStraw|
| B09LW2KHPM|Vont LED Camping Lantern, L...|14.38|           4.7|        65951|21555.611961057024|     Vont|
| B0C7DTY1CL|      

                                                                                

**Business Insights:**

- "Value score" = (average rating × number of ratings) / price, a rough "bang for the buck" measure
- Perfect for gift guides, stocking stuffers, starter packs
- Low risk purchases that over-deliver on quality
- High satisfaction at low price = great for customer acquisition
- Can bundle these into "Beginner's Outdoor Kit" promotions
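
The value score can be reproduced by hand from the output rows. The sketch below is a plain-Python version of the same formula; the function name `value_score` and the `None` handling for missing or non-positive prices are assumptions that mirror the filter in the cell above.

```python
def value_score(average_rating, rating_number, price):
    """(average rating * number of ratings) / price, or None if price is unusable."""
    if price is None or price <= 0:
        return None  # mirrors the isNotNull() and > 0 filters in Q3
    return average_rating * rating_number / price


# Values taken from the Q3 output above.
crosman = value_score(4.7, 50432, 2.98)
lifestraw = value_score(4.8, 90349, 19.95)
print(crosman, lifestraw)
```

Because the filter already restricts `average_rating` to the narrow band from 4.7 to 5.0, the ranking is driven almost entirely by the number of ratings and the price. That is why very cheap consumables with many ratings top the list.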

---


## Q4: Activity Category Performance Comparison (JOIN + GROUP BY)

**Business Question**: How do different sports/outdoor activity categories (based on product titles) compare in customer satisfaction and engagement?

**Purpose**: Identify which activity categories are thriving vs struggling to optimize inventory and marketing focus.


In [7]:
print("Comparing performance across different sports/outdoor activity categories\n")

# Categorize products based on keywords in title
metadata_categorized = metadata_df.withColumn(
    "activity_category",
    when(lower(col("title")).rlike("camping|tent|sleeping bag|backpack|hiking"), "Camping & Hiking")
    .when(lower(col("title")).rlike("bike|cycling|bicycle"), "Cycling")
    .when(lower(col("title")).rlike("fishing|rod|reel|lure"), "Fishing")
    .when(lower(col("title")).rlike("water|kayak|swim|boat|paddle"), "Water Sports")
    .when(lower(col("title")).rlike("fitness|gym|weight|dumbbell|yoga|exercise"), "Fitness")
    .when(lower(col("title")).rlike("running|runner|marathon|trail"), "Running")
    .when(lower(col("title")).rlike("golf|club|ball|tee"), "Golf")
    .otherwise("Other"),
).select("parent_asin", "activity_category", "average_rating", "rating_number")

# Cache the categorized metadata (small dataset)
metadata_categorized.cache()
metadata_categorized.count()

# Join with reviews and analyze
q4_result = (
    reviews_df.select("parent_asin", "rating", "helpful_vote")
    .join(metadata_categorized, "parent_asin", "inner")
    .groupBy("activity_category")
    .agg(
        count_distinct("parent_asin").alias("unique_products"),
        count("*").alias("total_reviews"),
        avg("rating").alias("avg_review_rating"),
        avg("helpful_vote").alias("avg_helpful_votes"),
        (count("*") / count_distinct("parent_asin")).alias("reviews_per_product"),
    )
    .filter(col("unique_products") >= 50)  # Categories with at least 50 products
    .orderBy(col("avg_review_rating").desc())
)

print("Activity Category Performance:\n")
q4_result.show(truncate=False)

# Clean up
metadata_categorized.unpersist()
print("✓ Q4 memory cleaned")

Comparing performance across different sports/outdoor activity categories



                                                                                

Activity Category Performance:





+-----------------+---------------+-------------+-----------------+------------------+-------------------+
|activity_category|unique_products|total_reviews|avg_review_rating|avg_helpful_votes |reviews_per_product|
+-----------------+---------------+-------------+-----------------+------------------+-------------------+
|Camping & Hiking |109774         |2452017      |4.262118084825676|1.3433153195919931|22.336955927633138 |
|Golf             |190545         |1904745      |4.247968100716894|0.7117236165470968|9.996300086593719  |
|Other            |867621         |7594369      |4.24119238872907 |0.9207802517891875|8.753094957360414  |
|Fitness          |117814         |2474269      |4.234562208070344|1.3644575428136552|21.001485392228428 |
|Fishing          |81639          |917849       |4.213848901071962|0.8758728287550567|11.242776124156347 |
|Running          |14177          |153326       |4.19735726491267 |1.155250903304071 |10.81512308668971  |
|Water Sports     |107052         |19

                                                                                

**Business Insights:**

- Reveals which sports/activities have the most satisfied customers
- "Reviews per product" shows engagement level for each category
- High ratings in a category = opportunity to expand inventory
- Low ratings = need to improve product selection or quality in that category
- Can inform seasonal marketing (camping in summer, fitness in January)
- Helps allocate marketing budget to highest-performing categories
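
One caveat when reading these numbers: `rlike` matches the pattern anywhere in the title, so short keywords such as `ball`, `tee`, `rod`, `weight`, or `water` also match inside longer words, and the first matching `when` branch wins. The sketch below reproduces the rules with Python's `re.search`, which has the same "match anywhere" behavior, and compares them with a hypothetical word-boundary variant. The titles are made up for illustration, and `WORD_RULES` is a suggestion, not what the notebook ran.

```python
import re

# Patterns copied from the Q4 cell, in the same order (first match wins).
SUBSTRING_RULES = [
    ("Camping & Hiking", r"camping|tent|sleeping bag|backpack|hiking"),
    ("Cycling", r"bike|cycling|bicycle"),
    ("Fishing", r"fishing|rod|reel|lure"),
    ("Water Sports", r"water|kayak|swim|boat|paddle"),
    ("Fitness", r"fitness|gym|weight|dumbbell|yoga|exercise"),
    ("Running", r"running|runner|marathon|trail"),
    ("Golf", r"golf|club|ball|tee"),
]

# Hypothetical variant: the same keywords, but only as whole words.
WORD_RULES = [(name, r"\b(?:" + pattern + r")\b") for name, pattern in SUBSTRING_RULES]


def categorize(title, rules):
    """Return the first category whose pattern is found in the lowercased title."""
    lowered = title.lower()
    for category, pattern in rules:
        if re.search(pattern, lowered):
            return category
    return "Other"


for title in ["Official Size Basketball", "Lightweight Folding Chair", "Golf Ball Retriever"]:
    print(title, "|", categorize(title, SUBSTRING_RULES), "|", categorize(title, WORD_RULES))
```

Whole-word matching has its own trade-off: plurals such as "tents" no longer match `tent` unless they are listed explicitly. Either way, category sizes such as the large Golf bucket above are best read with the matching rule in mind.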

---


## Q5: Review Quality by Store/Brand (JOIN + GROUP BY)

**Business Question**: Which stores/brands generate the most helpful and informative reviews from their customers?

**Purpose**: Identify brands that inspire customer engagement and loyalty - potential partners for co-marketing and featured brand campaigns.


In [8]:
print("Analyzing review quality and customer engagement by brand\n")

# Calculate review quality metrics - select only needed columns
reviews_with_quality = (
    reviews_df
    .select("parent_asin", "rating", "helpful_vote", "text", "images", "verified_purchase")
    .withColumn("text_length", length(col("text")))
    .withColumn("has_images", when(size(col("images")) > 0, 1).otherwise(0))
    .select("parent_asin", "rating", "helpful_vote", "text_length", "has_images", "verified_purchase")
)

# Join with metadata and aggregate by store
q5_result = (
    reviews_with_quality.join(metadata_df.select("parent_asin", "store"), "parent_asin", "inner")
    .filter(col("store").isNotNull())
    .groupBy("store")
    .agg(
        count("*").alias("total_reviews"),
        count_distinct("parent_asin").alias("unique_products"),
        avg("rating").alias("avg_rating"),
        avg("helpful_vote").alias("avg_helpful_votes"),
        avg("text_length").alias("avg_review_length"),
        (sum("has_images") / count("*") * 100).alias("pct_reviews_with_images"),
        (sum(when(col("verified_purchase") == True, 1).otherwise(0)) / count("*") * 100).alias(
            "verified_purchase_pct"
        ),
    )
    .filter(col("total_reviews") >= 100)  # Brands with sufficient data
    .withColumn(
        "engagement_score",
        (col("avg_helpful_votes") * 0.4 + col("avg_review_length") / 10 * 0.3 + col("pct_reviews_with_images") * 0.3),
    )
    .orderBy(col("engagement_score").desc())
)

print("Top brands by review quality and customer engagement:\n")
q5_result.show(20, truncate=50)
print("✓ Q5 complete")

Analyzing review quality and customer engagement by brand

Top brands by review quality and customer engagement:



                                                                                

+-----------------------+-------------+---------------+------------------+--------------------+------------------+-----------------------+---------------------+------------------+
|                  store|total_reviews|unique_products|        avg_rating|   avg_helpful_votes| avg_review_length|pct_reviews_with_images|verified_purchase_pct|  engagement_score|
+-----------------------+-------------+---------------+------------------+--------------------+------------------+-----------------------+---------------------+------------------+
|                    GMC|          157|              3| 3.515923566878981|   2.522292993630573| 1454.216560509554|     3.1847133757961785|   49.681528662420384|  45.5908280254777|
|          Yowza Fitness|          162|             23| 3.574074074074074|   7.074074074074074|1218.8086419753085|     3.7037037037037033|    64.19753086419753|40.504999999999995|
|       LifeSpan Fitness|          358|             17| 4.279329608938547|  10.600558659217876|1007.

**Business Insights:**

- **Engagement score** combines average helpful votes (weight 0.4), average review length divided by 10 (weight 0.3), and the percentage of reviews with images (weight 0.3)
- Brands with high engagement have passionate, loyal customers who share detailed experiences
- These brands are ideal for:
  - Featured brand partnerships and co-marketing campaigns
  - Ambassador programs (customers already advocate for the brand)
  - Premium placement in store (they drive organic marketing through reviews)
- High verified purchase % = authentic customer base
- Long reviews with images = customers invested in helping others make informed decisions
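
To see how the three components contribute, the sketch below recomputes the engagement score for the top row of the output (store `GMC`) in plain Python. The function name and the returned breakdown are illustrative assumptions; the weights are the ones from the cell above.

```python
def engagement_score(avg_helpful_votes, avg_review_length, pct_reviews_with_images):
    """Return (score, per-component contributions) using the Q5 weights."""
    parts = {
        "helpful_votes": avg_helpful_votes * 0.4,
        "review_length": avg_review_length / 10 * 0.3,
        "images": pct_reviews_with_images * 0.3,
    }
    return sum(parts.values()), parts


# Values from the first row of the Q5 output.
total, parts = engagement_score(2.522292993630573, 1454.216560509554, 3.1847133757961785)
length_share = parts["review_length"] / total

print(total)
print(parts)
print(length_share)
```

For this row the review-length term accounts for well over 90% of the score, so with these weights the ranking is mostly a ranking by average review length. Rescaling the components to comparable ranges before weighting would be one way to balance them, if that is the intent.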

---


## Q6: Power Reviewers - Customer Loyalty Ranking (WINDOW FUNCTION)

**Business Question**: Who are the most active reviewers in Sports & Outdoors, and how do their rating patterns compare to each other?

**Purpose**: Identify influential customers for potential brand ambassador programs, beta testing, and customer advisory panels.


In [9]:
print("Identifying power reviewers and their influence in Sports & Outdoors\n")

# Aggregate reviewer statistics - cache this aggregated result
reviewer_stats = (
    reviews_df
    .select("user_id", "parent_asin", "rating", "helpful_vote", "text")
    .groupBy("user_id")
    .agg(
        count("*").alias("total_reviews"),
        count_distinct("parent_asin").alias("unique_products_reviewed"),
        avg("rating").alias("avg_rating_given"),
        sum("helpful_vote").alias("total_helpful_votes"),
        avg("helpful_vote").alias("avg_helpful_per_review"),
        avg(length("text")).alias("avg_review_length"),
    )
    .filter(col("total_reviews") >= 15)  # Active reviewers
)

reviewer_stats.cache()
reviewer_stats.count()

# Use window function to rank reviewers
window_spec = Window.orderBy(col("total_helpful_votes").desc())

q6_result = (
    reviewer_stats.withColumn("reviewer_rank", row_number().over(window_spec))
    .withColumn(
        "influence_score",
        (col("total_helpful_votes") * 0.5 + col("total_reviews") * col("avg_helpful_per_review") * 0.5),
    )
    .withColumn(
        "reviewer_type",
        when(col("avg_rating_given") >= 4.5, "Optimistic")
        .when(col("avg_rating_given") >= 3.5, "Balanced")
        .otherwise("Critical"),
    )
    .select(
        "user_id",
        "reviewer_rank",
        "total_reviews",
        "unique_products_reviewed",
        "avg_rating_given",
        "total_helpful_votes",
        "avg_helpful_per_review",
        "avg_review_length",
        "reviewer_type",
        "influence_score",
    )
    .orderBy("reviewer_rank")
)

print("Top 30 power reviewers:\n")
q6_result.show(30, truncate=False)

# Clean up
reviewer_stats.unpersist()
print("✓ Q6 memory cleaned")

Identifying power reviewers and their influence in Sports & Outdoors



                                                                                

Top 30 power reviewers:



25/11/11 16:32:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+----------------------------+-------------+-------------+------------------------+-----------------+-------------------+----------------------+------------------+-------------+---------------+
|user_id                     |reviewer_rank|total_reviews|unique_products_reviewed|avg_rating_given |total_helpful_votes|avg_helpful_per_review|avg_review_length |reviewer_type|influence_score|
+----------------------------+-------------+-------------+------------------------+-----------------+-------------------+----------------------+------------------+-------------+---------------+
|AE46WNJS3O6HQX3B6S55NKOYHJ5Q|1            |103          |102                     |4.660194174757281|4180               |40.58252427184466     |4000.4757281553398|Optimistic   |4180.0         |
|AFAZKXZDC3TFR6NQGPUJNUZKHSZQ|2            |15           |15                      |4.533333333333333|3708               |247.2                 |1064.5333333333333|Optimistic   |3708.0         |
|AFTNR5UYSHEYWK6ENKB4PKFZQXBA|

                                                                                

**Business Insights:**

- **Power reviewers** have significant influence on purchase decisions
- **Reviewer types**:
  - **Optimistic** (avg 4.5+): Brand advocates, perfect for ambassador programs
  - **Balanced** (3.5-4.5): Trustworthy, objective reviewers
  - **Critical** (<3.5): High standards, valuable for quality feedback
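- **Influence score** is defined as 0.5 × total helpful votes + 0.5 × (reviews × average helpful votes per review); because reviews × average helpful votes equals total helpful votes, the score reduces to total helpful votes, as the output confirms (4180 votes gives a score of 4180.0)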
- These customers are ideal for:
  - Beta testing new products
  - Product development advisory panels
  - Influencer/ambassador partnerships
  - Early access programs
- High helpful votes = trusted opinions that drive sales
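
The two derived columns in Q6 can be checked with a few lines of plain Python. The sketch below recomputes `influence_score` and `reviewer_type` for the first two rows of the output; the function names mirror the column names, and everything else is illustrative.

```python
def influence_score(total_reviews, total_helpful_votes):
    """Same arithmetic as the Q6 cell, starting from the two raw totals."""
    avg_helpful_per_review = total_helpful_votes / total_reviews
    return total_helpful_votes * 0.5 + total_reviews * avg_helpful_per_review * 0.5


def reviewer_type(avg_rating_given):
    """Same thresholds as the when/otherwise chain in the Q6 cell."""
    if avg_rating_given >= 4.5:
        return "Optimistic"
    if avg_rating_given >= 3.5:
        return "Balanced"
    return "Critical"


# (total_reviews, total_helpful_votes, avg_rating_given) from the first two output rows.
rows = [(103, 4180, 4.660194174757281), (15, 3708, 4.533333333333333)]
for total_reviews, votes, avg_rating in rows:
    print(influence_score(total_reviews, votes), reviewer_type(avg_rating))
```

Since the second term multiplies the review count by the per-review average, both halves equal half of the total helpful votes, and the score carries no information beyond `total_helpful_votes`. A score meant to separate reach from consistency would need a term that is not derived from the same total.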

---

## Summary of All Questions

**FILTER Questions:**

1. Q1: Heavy-duty gear with detailed reviews
2. Q2: Polarizing products with rating variance
3. Q3: Gift-worthy affordable products

**JOIN + GROUP BY Questions:**

4. Q4: Activity category performance comparison
5. Q5: Review quality by brand/store

**WINDOW FUNCTION Questions:**

6. Q6: Power reviewer ranking and influence
