# Univariate Analysis

**Goal**: Understand the distribution of individual variables (sales, prices, categories) using PySpark on a large dataset (58M rows).

> **Important Note**: This notebook was run in the Conda environment `pyspark_env`.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, count
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set up the Spark session
spark = SparkSession.builder \
    .appName("M5 Univariate Analysis") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.python.worker.faulthandler.enabled", "true") \
    .getOrCreate()

print("Spark Session created successfully")

In [None]:
# Load Data
df = spark.read.parquet("train.parquet")
df.printSchema()

## Analysis A: Global Sales Distribution

**Strategy**:
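1. **Filter**: Exclude zero values (`sales == 0`, approx. 70% of the data).
2. **Sample**: Draw a random sample in Spark and collect it to Pandas. This is used here instead of Spark's RDD `.histogram()` function, which ran into serialization issues in this environment.
3. **Visualize**: Plot a histogram with a log-scaled count axis.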

In [None]:
# 1. Filter zero sales, then sample
# Sampling and Pandas are used to avoid RDD serialization issues in this env.
# The seed makes the sample reproducible; limit() caps the rows sent to the driver.
pdf = df.filter(col('sales') > 0).select('sales').sample(fraction=0.1, seed=42).limit(100000).toPandas()

# 2. Plotting
plt.figure(figsize=(12, 6))
plt.hist(pdf['sales'], bins=50, log=True, alpha=0.7)
plt.title('Global Sales Distribution (Log Scale - Sampled)')
plt.xlabel('Sales Units')
plt.ylabel('Count (Log Scale)')
plt.grid(axis='y', which='both', linestyle='--', alpha=0.5)
plt.show()


## Analysis B: Price Distribution

**Strategy**:
1. **Distinct Values**: Select unique `(item_id, sell_price)` pairs to reduce the data size.
2. **Collection**: Convert the small result to Pandas.
3. **Visualization**: Plot a kernel density estimate (KDE).

In [None]:
# 1. Query Distinct Values
distinct_prices_df = df.select("item_id", "sell_price") \
    .filter(col("sell_price").isNotNull()) \
    .distinct()

# 2. Convert to Pandas
pdf_prices = distinct_prices_df.toPandas()
print(f"Number of distinct price points: {len(pdf_prices)}")

# 3. Plotting (KDE)
plt.figure(figsize=(12, 6))
sns.kdeplot(data=pdf_prices, x='sell_price', fill=True, color='purple')
plt.title("Price Distribution (Distinct Item-Price Pairs)")
plt.xlabel("Sell Price ($)")
plt.grid(True, alpha=0.3)
plt.show()

## Analysis C: Category Balance

**Strategy**:
1. **Aggregation**: Group by `cat_id` and count rows.
2. **Visualization**: Plot a bar chart of the row count per category.

In [None]:
# 1. Aggregation and Counting
cat_counts = df.groupBy("cat_id").count().orderBy(desc("count"))

# Convert to Pandas
pdf_cat = cat_counts.toPandas()

# 2. Plotting
plt.figure(figsize=(10, 6))
sns.barplot(data=pdf_cat, x='cat_id', y='count', palette='viridis')
plt.title("Category Balance (Total Rows)")
plt.xlabel("Category")
plt.ylabel("Count")
plt.show()