# Part B: Data Modelling and Analytics

**Course**: DSC3108 - Big Data Mining and Analytics  
**Scenario**: Large-Scale Retail Recommendation System

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col
import time

# Initialize Spark
spark = SparkSession.builder \
    .appName("RetailRecommendation_Modeling") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print(f"✓ Spark {spark.version} initialized")

## 1. Technique Selection: Alternating Least Squares (ALS)

**Justification:**
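- ALS is a widely used matrix factorization algorithm for **collaborative filtering** at scale
- Designed for distributed computing: with one side's factors fixed, each user (or item) factor vector can be solved independently and in parallel
- Supports explicit ratings (used here) as well as implicit feedback (`implicitPrefs=True`), and fits only the observed entries of a sparse user-product matrix
- Available in Spark MLlib, so it runs in the same Spark session as the data preparation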
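
To make the "alternating" part concrete, here is a minimal pure-Python sketch of ALS with a single latent factor. It is a toy under stated assumptions, not Spark's implementation: the function name `als_rank1`, the tiny `ratings` dictionary, and the values of `reg` and `iters` are all hypothetical and chosen only for illustration.

```python
# Toy rank-1 ALS in pure Python (illustrative only; not how Spark MLlib implements it).
# All names and values below are hypothetical.

def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """ratings: dict mapping (user, item) -> observed rating."""
    u = [1.0] * n_users  # one latent factor per user
    v = [1.0] * n_items  # one latent factor per item
    for _ in range(iters):
        # Step 1: hold item factors fixed, solve each user's ridge regression.
        for user in range(n_users):
            rated = [(item, r) for (a, item), r in ratings.items() if a == user]
            num = sum(r * v[item] for item, r in rated)
            den = reg + sum(v[item] ** 2 for item, _r in rated)
            u[user] = num / den
        # Step 2: hold user factors fixed, solve each item's ridge regression.
        for item in range(n_items):
            raters = [(user, r) for (user, b), r in ratings.items() if b == item]
            num = sum(r * u[user] for user, r in raters)
            den = reg + sum(u[user] ** 2 for user, _r in raters)
            v[item] = num / den
    return u, v

# Four observed ratings out of a 3 x 2 user-item matrix (made-up data).
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}
u, v = als_rank1(ratings, n_users=3, n_items=2)

# Predicted rating for a pair that was never observed: user 1, item 1.
print(f"Predicted rating for (1, 1): {u[1] * v[1]:.2f}")
```

Each inner loop touches only one user's (or one item's) ratings, which is why the per-user and per-item solves can be distributed across workers.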

## 2. Load Data

In [None]:
# Load cleaned data (or reload from CSV)
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Cast columns
df = df.withColumn("user_id", col("user_id").cast("integer")) \
       .withColumn("product_id", col("product_id").cast("integer")) \
       .withColumn("rating", col("rating").cast("float"))

print(f"Loaded {df.count():,} transactions")
df.show(5)

## 3. Train-Test Split

In [None]:
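(training, test) = df.randomSplit([0.8, 0.2], seed=42)

# Cache both splits: they are reused by count(), two fit() calls, and evaluation
training.cache()
test.cache()

print(f"Training set: {training.count():,} rows")
print(f"Test set: {test.count():,} rows")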

## 4. Model Scalability: Base ALS Model

In [None]:
# Base model configuration
als_base = ALS(
    maxIter=5,
    regParam=0.01,
    userCol="user_id",
    itemCol="product_id",
    ratingCol="rating",
    coldStartStrategy="drop"  # Handle users/items not seen in training
)

print("Training base ALS model...")
start_time = time.time()
model_base = als_base.fit(training)
train_time_base = time.time() - start_time

print(f"✓ Base model trained in {train_time_base:.2f} seconds")
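
The `coldStartStrategy="drop"` setting above matters because a random split can leave some users or products only in the test set. With Spark's default strategy (`"nan"`), those rows receive `NaN` predictions and the RMSE itself becomes `NaN`; `"drop"` removes them before evaluation. The sketch below shows which rows are affected, using plain Python sets and made-up `(user_id, product_id)` pairs rather than the notebook's data.

```python
# Hypothetical (user_id, product_id) pairs, for illustration only.
train_pairs = [(1, 10), (1, 11), (2, 10), (3, 12)]
test_pairs = [(1, 12), (4, 10), (2, 13), (3, 10)]

seen_users = {user for user, _ in train_pairs}
seen_items = {item for _, item in train_pairs}

# A test row can be scored only if both its user and its product were seen in training.
scorable = [(user, item) for user, item in test_pairs
            if user in seen_users and item in seen_items]
cold_start = [pair for pair in test_pairs if pair not in scorable]

print("Scorable test rows:  ", scorable)
print("Cold-start test rows:", cold_start)
```

Because dropped rows are excluded from the metric, the reported RMSE describes only users and products that the model has already seen.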

## 5. Model Execution and Evaluation

In [None]:
# Make predictions
predictions_base = model_base.transform(test)

# Evaluate using RMSE
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)

rmse_base = evaluator.evaluate(predictions_base)
print(f"Base Model RMSE: {rmse_base:.4f}")

# Show sample predictions
predictions_base.select("user_id", "product_id", "rating", "prediction").show(10)
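
For reference, the RMSE reported by `RegressionEvaluator` is the square root of the mean squared difference between the `rating` and `prediction` columns. A small pure-Python sketch of the same calculation follows; the helper name `rmse` and the sample values are made up for illustration and are not results from this notebook.

```python
import math

def rmse(pairs):
    """pairs: iterable of (actual_rating, predicted_rating) tuples."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration only.
sample = [(5.0, 4.0), (3.0, 3.0), (1.0, 2.0), (4.0, 2.0)]
print(f"RMSE: {rmse(sample):.4f}")  # squared errors 1, 0, 1, 4 -> sqrt(1.5) -> 1.2247
```

RMSE is expressed in the same units as the ratings, so a value of 1.2 on a 1-5 scale means predictions are off by a little over one rating point in the root-mean-square sense.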

## 6. Model Optimization

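We try one alternative configuration that changes three hyperparameters relative to the base model:
- **rank**: Number of latent factors per user and product (higher values capture more structure but cost more to train and can overfit)
- **maxIter**: Number of alternating sweeps over user and item factors (more sweeps give the factors more time to converge)
- **regParam**: Regularization strength (larger values shrink the factors and reduce overfitting)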
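
A more systematic search would score several combinations instead of one. Spark provides `ParamGridBuilder` and `CrossValidator` in `pyspark.ml.tuning` for this; the sketch below only illustrates the underlying idea of expanding a grid, in plain Python. The candidate values and the helper name `expand_grid` are assumptions for illustration, not settings used in this notebook.

```python
from itertools import product

# Hypothetical candidate values for the three hyperparameters discussed above.
grid = {
    "rank": [10, 20],
    "maxIter": [5, 10],
    "regParam": [0.01, 0.1],
}

def expand_grid(grid):
    """Return one dict of keyword arguments per hyperparameter combination."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(grid)
print(f"{len(candidates)} candidate configurations")
for params in candidates[:3]:
    # Each dict could be unpacked into the ALS constructor as keyword arguments
    # and scored on a validation split kept separate from the final test set.
    print(params)
```

Selecting the configuration on a validation split, rather than on the test set, keeps the final test RMSE an unbiased estimate.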

In [None]:
# Optimized model
als_opt = ALS(
    rank=20,           # Increased from default 10
    maxIter=10,        # Increased from 5
    regParam=0.1,      # Increased from 0.01 (stronger regularization)
    userCol="user_id",
    itemCol="product_id",
    ratingCol="rating",
    coldStartStrategy="drop"
)

print("Training optimized ALS model...")
start_time = time.time()
model_opt = als_opt.fit(training)
train_time_opt = time.time() - start_time

print(f"✓ Optimized model trained in {train_time_opt:.2f} seconds")

In [None]:
# Evaluate optimized model
predictions_opt = model_opt.transform(test)
rmse_opt = evaluator.evaluate(predictions_opt)

print(f"Optimized Model RMSE: {rmse_opt:.4f}")

## 7. Performance Comparison

In [None]:
import pandas as pd

comparison = pd.DataFrame({
    'Model': ['Base', 'Optimized'],
    'RMSE': [rmse_base, rmse_opt],
    'Training Time (s)': [train_time_base, train_time_opt]
})

print("\n=== Model Comparison ===")
display(comparison)

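# Positive values mean the optimized model has a lower RMSE than the base model
improvement = ((rmse_base - rmse_opt) / rmse_base) * 100
direction = "improved" if improvement >= 0 else "worsened"
print(f"\nRMSE {direction} by {abs(improvement):.2f}% relative to the base model")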

## 8. Result Interpretation: Generate Recommendations

In [None]:
# Generate top 5 product recommendations for each user
user_recs = model_opt.recommendForAllUsers(5)

print("Sample recommendations for 5 users:")
user_recs.show(5, truncate=False)

In [None]:
# Generate top 5 users for each product (useful for targeted marketing)
product_recs = model_opt.recommendForAllItems(5)

print("Sample user recommendations for 5 products:")
product_recs.show(5, truncate=False)
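
To interpret the scores in these outputs: an ALS model predicts a rating as the dot product of a user's factor vector and a product's factor vector (exposed as `model_opt.userFactors` and `model_opt.itemFactors`), and a top-N list is the N highest-scoring products for that user. The sketch below reproduces that idea in plain Python; the helper name `top_n` and all factor values are invented for illustration.

```python
import heapq

def top_n(user_vec, item_factors, n=5):
    """Score every item by dot product with the user vector; return the n best."""
    scores = {
        item_id: sum(a * b for a, b in zip(user_vec, item_vec))
        for item_id, item_vec in item_factors.items()
    }
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

# Invented two-factor vectors, for illustration only.
user_vec = [1.0, 0.5]
item_factors = {
    101: [2.0, 0.0],  # score 2.0
    102: [1.0, 4.0],  # score 3.0
    103: [0.0, 1.0],  # score 0.5
}

print(top_n(user_vec, item_factors, n=2))
```

The same calculation with the roles swapped (one product vector scored against every user vector) gives the per-product user lists used for targeted marketing.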

## 9. Business Insights

**Key Findings:**
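- ALS factorizes the user-product rating matrix into latent factors, so each prediction reflects patterns shared across similar users and products
- RMSE is measured on the held-out test set in rating units: a lower value means predicted ratings are closer to actual ratings, although it does not directly measure the quality of the top-5 ranking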
- Recommendations can be used for:
  - Personalized homepage displays
  - Email marketing campaigns
  - Cross-selling at checkout

## Summary

**Part B Completed:**
- ✓ Selected and justified ALS technique
- ✓ Implemented scalable model using Spark MLlib
- ✓ Executed and optimized model
- ✓ Interpreted results and generated recommendations

**Next**: Compile final report (Part C)