### Descriptive statistics

**1. The "Big Picture" Summary**
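- In Databricks, the DataFrame `summary()` and `describe()` methods give a quick overview of a numeric column such as `price`. Databricks SQL has no `summary()` aggregate, so in SQL you compute the same statistics with aggregate functions.
```sql
-- Count, Mean, StdDev, Min, quartiles (25%, 50%, 75%), and Max
SELECT count(price) AS n, avg(price) AS mean_price, stddev(price) AS stddev_price, min(price) AS min_price,
       percentile(price, array(0.25, 0.5, 0.75)) AS quartiles, max(price) AS max_price
FROM ecommerce_prod.silver.cleaned_events;
```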
- What to look for:
  - **Mean vs. Median (50%):** If the Mean is much higher than the Median, you have a few very expensive "outlier" products pulling the average up.
  - **Min/Max:** If Min is negative or zero, you might have data errors (unless they are "refund" events).
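
A small Python sketch of the mean-versus-median check. The prices are made up for illustration (they are not taken from the dataset), and the "twice the median" cutoff is an arbitrary threshold, not a standard rule.

```python
import statistics

# Hypothetical prices: six cheap items and one very expensive outlier
prices = [10, 12, 9, 11, 10, 13, 10_000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)
print(f"mean={mean_price:.2f}, median={median_price}")

# A mean far above the median suggests a few large values are pulling it up.
# The factor of 2 is an illustrative assumption only.
is_right_skewed = mean_price > 2 * median_price
print("Right-skewed:", is_right_skewed)
```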

**2. Measuring Central Tendency**
- These tell you where the "middle" of your data lies.
- Mean (Average): The sum of all prices divided by the count.
- Median: The middle value. It is more "robust" because it isn't affected by one $10,000 TV in a sea of $10 headphones.
- Mode: The most frequent value (e.g., the most common price point).
```sql
SELECT 
  avg(price) as mean_price,
  percentile(price, 0.5) as median_price, -- The 50th percentile
  count(*) as total_rows
FROM ecommerce_prod.silver.cleaned_events;
```
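
The query above covers the mean and median but not the mode. As a rough illustration of what the mode is, here is a Python sketch on invented price points; `statistics.multimode` returns every value tied for most frequent.

```python
import statistics

# Hypothetical price points; 9.99 appears three times
price_points = [9.99, 19.99, 9.99, 49.99, 19.99, 9.99]

# multimode returns a list, so ties between price points are not hidden
modes = statistics.multimode(price_points)
print("Most common price point(s):", modes)
```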

**3. Measuring Dispersion (The "Spread")**
- These tell you how much your data varies. Is every product $50, or do prices range from $1 to $5,000?
- Standard Deviation (StdDev): On average, how far are the prices from the Mean? A high StdDev means high price variety.
- Range: The difference between Max and Min.
- Interquartile Range (IQR): The difference between the 75th and 25th percentiles, i.e. the width of the band that holds the middle 50% of prices. Unlike the Range, it is not distorted by a few extreme values.
```sql
SELECT 
  stddev(price) as price_volatility,
  max(price) - min(price) as price_range,
  percentile(price, 0.75) - percentile(price, 0.25) as iqr
FROM ecommerce_prod.silver.cleaned_events;
```
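
To see why these three measures can disagree, the sketch below computes them in plain Python on a made-up list with a single extreme value. The data and variable names are assumptions for illustration only.

```python
import statistics

# Hypothetical prices: nine small values and one extreme one
prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

price_range = max(prices) - min(prices)   # driven entirely by the extreme value
std_dev = statistics.stdev(prices)        # sample standard deviation
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1                             # spread of the middle 50%

print(f"range={price_range}, stddev={std_dev:.2f}, iqr={iqr}")
```

The range and standard deviation are inflated by the single value of 100, while the IQR still describes the bulk of the data.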

**4. Categorical Statistics (Frequency)**
- For non-numeric columns like event_type or brand, we look at distribution and uniqueness.
- Distinct Count: How many unique brands do we have?
- Mode (Top Category): Which category has the most events?
```sql
SELECT 
  count(distinct brand) as unique_brands,
  count(distinct main_category) as unique_categories
FROM ecommerce_prod.silver.cleaned_events;
```
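
The query above returns the distinct counts but not the top category. A minimal Python sketch of the frequency view, using an invented list of brand values:

```python
from collections import Counter

# Hypothetical brand column values
brands = ["samsung", "apple", "samsung", "xiaomi", "apple", "samsung"]

counts = Counter(brands)
unique_brands = len(counts)                        # distinct count
top_brand, top_count = counts.most_common(1)[0]    # mode (top category)
shares = {brand: n / len(brands) for brand, n in counts.items()}

print(unique_brands, top_brand, top_count)
print(shares)
```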

### Hypothesis testing

**1. The Core Concept**
- Null Hypothesis (H_0): There is no real difference. (e.g., "Apple and Samsung users spend the same amount of money.")
- Alternative Hypothesis (H_1): There is a real difference. (e.g., "Apple and Samsung users spend different amounts of money.")
- P-Value: The probability of seeing a difference at least as large as the one observed if the Null Hypothesis were true. It is not the probability that the Null is true.
  - p < 0.05: We reject the Null. The difference is "Statistically Significant."
  - p ≥ 0.05: We fail to reject the Null. The data are consistent with random noise, which is not proof that there is no difference.
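
One way to make the p-value definition concrete is a permutation test, sketched below in plain Python. This technique is not part of the notes above; it is used here only because it computes the p-value directly from its definition. The spend values, the seed, and the number of shuffles are all assumptions.

```python
import random
from statistics import mean

# Hypothetical spend per user in two groups
group_a = [10, 11, 12, 10, 11, 12, 10, 11]
group_b = [20, 21, 22, 20, 21, 22, 20, 21]
observed = abs(mean(group_a) - mean(group_b))

rng = random.Random(0)          # fixed seed so the run is repeatable
pooled = group_a + group_b
n_permutations = 5000
at_least_as_extreme = 0

for _ in range(n_permutations):
    # Under H_0 the group labels are interchangeable, so shuffle them
    rng.shuffle(pooled)
    diff = abs(mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):]))
    if diff >= observed:
        at_least_as_extreme += 1

# Share of shuffles with a difference at least as large as the observed one
p_value = at_least_as_extreme / n_permutations
print(f"observed difference={observed}, p-value={p_value:.4f}")
```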

**2. Common Tests for your Data**

| Test Type | When to use it | Example |
| ----- | ----- | ----- |
| T-Test | Comparing the Means of two groups. | Do 'View' events have a higher average price than 'Purchase' events? |
| ANOVA | Comparing the Means of 3+ groups. | Is there a price difference between Electronics, Appliances, and Apparel? |
| Chi-Square | Comparing Categorical distributions. | Does one Brand have a significantly higher conversion rate than others? |

**3. Practical Example: The T-Test**
- Let's test if Smartphone sales have a significantly different average price than Laptop sales.
- Hypothesis: $H_0: \mu_{\text{smartphones}} = \mu_{\text{laptops}}$ (the two categories have the same mean price).
- In Databricks, you can use the scipy.stats library in a Python notebook to run this test on your dataset:
```python
from scipy import stats

# 1. Get data for two groups
smartphones = spark.sql("SELECT price FROM ecommerce_prod.silver.cleaned_events WHERE main_category = 'electronics' AND sub_category = 'smartphone'").toPandas()
laptops = spark.sql("SELECT price FROM ecommerce_prod.silver.cleaned_events WHERE main_category = 'electronics' AND sub_category = 'laptop'").toPandas()

# 2. Run T-Test
# equal_var=False selects Welch's t-test, which does not assume both groups have the same variance
t_stat, p_val = stats.ttest_ind(smartphones['price'], laptops['price'], equal_var=False)
print(f"P-Value: {p_val}")
if p_val < 0.05:
    print("Significant: The average prices differ.")
else:
    print("Not Significant: No difference in average price was detected.")
```

### A/B test design

**1. The Design Phase (The Blueprint)**
- Before touching any data, you must define the "rules" of the test.
- **Control Group (A):** Users who see the current website.
- **Treatment Group (B):** Users who see the "New Buy Now Button."
- **Metric:** Conversion Rate (Purchases / Total Visits).
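- **Hypothesis:** The new button will increase the conversion rate by at least 2%. State up front whether that is a relative lift or an absolute change in percentage points, because the two imply very different sample sizes.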

**2. Splitting the Data**
- To avoid bias, you must split users (not individual rows) randomly, so that all of a user's events land in the same group. You can do this in your Silver layer using a hash of the user_id.
```sql
-- Creating a split in SQL
CREATE OR REPLACE TABLE ecommerce_prod.silver.ab_test_data AS
SELECT 
  *,
  CASE 
    -- pmod() is never negative, so it avoids the abs() overflow at the minimum INT hash value
    WHEN pmod(hash(user_id), 2) = 0 THEN 'Group A' 
    ELSE 'Group B' 
  END AS test_group
FROM ecommerce_prod.silver.cleaned_events;
```
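
The same idea can be sketched outside Spark. The Python version below is an assumption-laden illustration: it uses SHA-256 rather than Spark's `hash`, so its assignments will not match the SQL above, and the `salt` parameter is a hypothetical addition that keeps different experiments from reusing the same split.

```python
import hashlib

def assign_group(user_id, salt="buy_now_button"):
    """Deterministically map a user to 'Group A' or 'Group B'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return "Group A" if int(digest, 16) % 2 == 0 else "Group B"

# The same user always gets the same group, and the split is close to 50/50
groups = [assign_group(uid) for uid in range(10_000)]
share_a = groups.count("Group A") / len(groups)
print(assign_group(42), f"share in A: {share_a:.3f}")
```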

**3. Determining Sample Size (Power Analysis)**
- How many users do you need before you can trust the result? With 42M rows, you likely have enough, but for a professional design, you need to consider:
- Alpha: Usually 0.05 (the risk of a false positive).
- Power (1-beta): Usually 0.80 (the chance of detecting an effect if it exists).
- MDE (Minimum Detectable Effect): The smallest change you care about (e.g., 1%).
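
These three inputs combine into a required sample size. The sketch below uses the common normal-approximation formula for comparing two proportions; the function name is hypothetical. The 2% baseline and 0.2 percentage-point effect in the example are borrowed from the Calculation Table in the analysis step.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided test of two proportions.

    mde_abs is the minimum detectable effect in absolute terms (0.002 = 0.2 points).
    """
    p_treatment = p_baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # critical value for power
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

n_small_effect = sample_size_per_group(0.02, 0.002)
n_large_effect = sample_size_per_group(0.02, 0.004)
print(n_small_effect, n_large_effect)
```

Halving the effect you want to detect roughly quadruples the required sample, which is why the MDE should be fixed before the test starts.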

**4. Running the Analysis**
- After the test runs for a week, you compare the groups. You use a Chi-Square Test because "Converted" vs. "Not Converted" is categorical data (Yes/No).
- The Calculation Table:

| Group | Total Users | Purchases | Conversion Rate |
| ----- | ----- | ----- | ----- |
| Group A | 1,000,000 | 20,000 | 2.0% |
| Group B | 1,000,000 | 22,000 | 2.2% |

**5. Interpreting the Result in Python**
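- You can use the statsmodels library to check whether that 0.2 percentage-point jump (a 10% relative lift) is "real" or just luck. `proportions_ztest` runs a two-proportion z-test, which for a 2x2 table like this one is equivalent to the Chi-Square test without continuity correction.
```python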
from statsmodels.stats.proportion import proportions_ztest

# Successes (purchases) and Observations (total users)
count = [20000, 22000]
nobs = [1000000, 1000000]

stat, pval = proportions_ztest(count, nobs)

print(f"P-value: {pval:.4f}")

if pval < 0.05:
    print("Result is Significant: Deploy the new button!")
else:
    print("Result is not Significant: Keep the old design.")
```
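
For intuition about what the test computes, here is a hand-rolled pooled two-proportion z-test using only the standard library. It is a sketch and not the statsmodels implementation; the function name is hypothetical, and the sign convention (B minus A) is a choice made here.

```python
from math import sqrt, erfc

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # erfc keeps precision for very small p-values
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

z, p_value = two_proportion_ztest(20_000, 1_000_000, 22_000, 1_000_000)
print(f"z={z:.2f}, p-value={p_value:.3g}")
```

With a million users per group, even a 0.2 percentage-point difference sits many standard errors from zero. Whether a lift that size justifies shipping the button is a separate business question.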

**6. Common Pitfalls to Avoid**
- Peeking: Don't stop the test the moment it looks "good." Wait until the sample size is reached.
- Selection Bias: Ensure users don't "switch" groups halfway through.
- Seasonality: Don't run a test during Black Friday and assume it will apply to a normal Tuesday in March.

### Feature engineering

**1. Time-Based Features**
- Raw timestamps are hard for models to read. We extract parts of the date to find patterns like "Weekend Shopping" or "Night Owls."
- Is_Weekend: Do people buy more on Saturdays?
- Hour_of_Day: When is the peak traffic?
- Part_of_Day: Morning, Afternoon, Evening, or Night.
```sql
CREATE OR REPLACE TABLE ecommerce_prod.silver.featured_events AS
SELECT 
  *,
  date_format(event_time, 'EEEE') AS day_of_week,
  hour(event_time) AS hour_of_day,
  -- dayofweek() returns 1 for Sunday through 7 for Saturday
  CASE WHEN dayofweek(event_time) IN (1, 7) THEN 1 ELSE 0 END AS is_weekend
FROM ecommerce_prod.silver.cleaned_events;
```
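
The query above does not derive Part_of_Day. A possible Python sketch follows; the hour boundaries are assumptions chosen for illustration and should be adjusted to whatever your analysis defines as morning, afternoon, evening, and night.

```python
from datetime import datetime

def time_features(event_time):
    """Derive hour_of_day, is_weekend, and part_of_day from a timestamp."""
    hour = event_time.hour
    # Assumed boundaries: 05-12 morning, 12-17 afternoon, 17-21 evening, else night
    if 5 <= hour < 12:
        part = "Morning"
    elif 12 <= hour < 17:
        part = "Afternoon"
    elif 17 <= hour < 21:
        part = "Evening"
    else:
        part = "Night"
    return {
        "hour_of_day": hour,
        # Python's weekday(): Monday=0 ... Saturday=5, Sunday=6
        "is_weekend": int(event_time.weekday() >= 5),
        "part_of_day": part,
    }

saturday_night = time_features(datetime(2024, 1, 6, 23, 30))
monday_morning = time_features(datetime(2024, 1, 8, 9, 0))
print(saturday_night, monday_morning)
```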

**2. User-Level Aggregations (Behavioral Features)**
- We want to summarize a user's entire history into a single row. This is often called a User Profile.
- Recency: How many days since their last purchase?
- Frequency: How many times have they visited?
- Monetary: What is their total lifetime spend? Together, these three features form the RFM model.
```sql
CREATE OR REPLACE TABLE ecommerce_prod.gold.user_features AS
SELECT 
  user_id,
  count(DISTINCT session_id) AS total_sessions,
  sum(CASE WHEN event_type = 'purchase' THEN price ELSE 0 END) AS total_spend,
  count(CASE WHEN event_type = 'cart' THEN 1 END) AS total_adds_to_cart,
  datediff(max(event_date), min(event_date)) AS customer_tenure_days
FROM ecommerce_prod.silver.cleaned_events
GROUP BY user_id;
```
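
The query above captures frequency and monetary value but computes tenure rather than recency. A minimal Python sketch of recency on invented events follows; the `as_of` reference date is an assumption, since recency only has meaning relative to some "today".

```python
from datetime import date

# Hypothetical events; user 2 has never purchased
events = [
    {"user_id": 1, "event_type": "view", "event_date": date(2024, 1, 1)},
    {"user_id": 1, "event_type": "purchase", "event_date": date(2024, 1, 3)},
    {"user_id": 1, "event_type": "purchase", "event_date": date(2024, 1, 10)},
    {"user_id": 2, "event_type": "view", "event_date": date(2024, 1, 5)},
]

def recency_days(events, as_of):
    """Days since each user's most recent purchase (non-buyers are omitted)."""
    last_purchase = {}
    for event in events:
        if event["event_type"] != "purchase":
            continue
        uid = event["user_id"]
        if uid not in last_purchase or event["event_date"] > last_purchase[uid]:
            last_purchase[uid] = event["event_date"]
    return {uid: (as_of - day).days for uid, day in last_purchase.items()}

recency = recency_days(events, as_of=date(2024, 1, 15))
print(recency)
```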

**3. Product-Level Features (Popularity)**
- We can create features that describe the products themselves based on how users interact with them.
- Conversion_Rate: What share of the people who view this product go on to buy it?
- Price_vs_Average: Is this specific product cheaper or more expensive than the brand's average?
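
Both product features can be sketched in a few lines of Python. The per-product rows below are invented, and the conversion rate here divides purchase events by view events, which only approximates "people who buy after viewing".

```python
from collections import defaultdict

# Hypothetical per-product aggregates
products = [
    {"product_id": "p1", "brand": "samsung", "price": 300.0, "views": 200, "purchases": 10},
    {"product_id": "p2", "brand": "samsung", "price": 500.0, "views": 100, "purchases": 2},
    {"product_id": "p3", "brand": "apple", "price": 900.0, "views": 0, "purchases": 0},
]

# Average price per brand
brand_prices = defaultdict(list)
for product in products:
    brand_prices[product["brand"]].append(product["price"])
brand_avg = {brand: sum(p) / len(p) for brand, p in brand_prices.items()}

for product in products:
    # Leave the rate undefined when a product has no views
    views = product["views"]
    product["conversion_rate"] = product["purchases"] / views if views else None
    # Above 1.0 means pricier than the brand's average, below 1.0 means cheaper
    product["price_vs_average"] = product["price"] / brand_avg[product["brand"]]

for product in products:
    print(product["product_id"], product["conversion_rate"], product["price_vs_average"])
```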

**4. Encoding Categorical Data**
- Machine Learning models can't read "Samsung" or "Electronics." We have to turn them into numbers.
- One-Hot Encoding: Creating a column for each brand (1 if it's Samsung, 0 if not).
- Target Encoding: Replacing "Samsung" with the average purchase rate for all Samsung products.
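
A plain-Python sketch of both encodings on a handful of invented rows. In practice you would compute target-encoding averages on training data only, so that the label does not leak into the feature.

```python
# Hypothetical rows: (brand, purchased flag)
rows = [
    ("samsung", 1), ("samsung", 0),
    ("apple", 1), ("apple", 1),
    ("xiaomi", 0),
]

# One-hot encoding: one 0/1 column per brand, in sorted brand order
brand_columns = sorted({brand for brand, _ in rows})
one_hot = [[int(brand == column) for column in brand_columns] for brand, _ in rows]

# Target encoding: replace each brand with its average purchase rate
totals = {}
for brand, purchased in rows:
    total, count = totals.get(brand, (0, 0))
    totals[brand] = (total + purchased, count + 1)
target_encoding = {brand: total / count for brand, (total, count) in totals.items()}

print(brand_columns, one_hot[0])
print(target_encoding)
```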

### Task 1: Calculate Statistical Summaries

```sql
%sql
SELECT 
    brand,
    COUNT(*) as total_events,
    ROUND(AVG(price), 2) as avg_price,
    ROUND(STDDEV(price), 2) as price_volatility,
    PERCENTILE(price, 0.5) as median_price,
    MIN(price) as min_price,
    MAX(price) as max_price
FROM ecommerce_prod.silver.cleaned_events
WHERE event_type = 'purchase'
GROUP BY brand
ORDER BY total_events DESC
LIMIT 10;
```

### Task 2: Test Hypotheses (Weekday vs. Weekend)

```sql
%sql
SELECT 
    CASE WHEN dayofweek(event_date) IN (1, 7) THEN 'Weekend' ELSE 'Weekday' END AS day_type,
    COUNT(*) as total_purchases,
    ROUND(AVG(price), 2) as avg_spent
FROM ecommerce_prod.silver.cleaned_events
WHERE event_type = 'purchase'
GROUP BY 1;
```
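
The query above only reports the two group averages; it does not test whether they differ. One possible follow-up, sketched here under assumptions, is Welch's t statistic computed from per-group summary numbers (count, mean, standard deviation; the query would need a `STDDEV(price)` column added). The p-value uses a normal approximation, which is reasonable only for large groups, and the example figures are invented.

```python
from math import sqrt, erfc

def welch_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Welch's t statistic and an approximate two-sided p-value from summary stats."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_stat = (mean1 - mean2) / standard_error
    # Normal approximation to the t distribution (assumes large samples)
    p_value = erfc(abs(t_stat) / sqrt(2))
    return t_stat, p_value

# Hypothetical weekday vs. weekend summaries: (count, avg_spent, stddev)
t_stat, p_value = welch_from_summary(10_000, 305.0, 350.0, 4_000, 290.0, 340.0)
print(f"t={t_stat:.2f}, p-value={p_value:.4f}")
```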

### Task 3: Identify Correlations

```sql
%sql
WITH user_behavior AS (
  SELECT 
    user_id,
    COUNT(CASE WHEN event_type = 'view' THEN 1 END) as view_count,
    SUM(CASE WHEN event_type = 'purchase' THEN price ELSE 0 END) as total_spend
  FROM ecommerce_prod.silver.cleaned_events
  GROUP BY user_id
)
SELECT 
    corr(view_count, total_spend) as correlation_coefficient
FROM user_behavior;
```
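
For reference, the Pearson correlation coefficient that `corr` returns can be written out by hand. The sketch below uses invented per-user numbers and a hypothetical helper name.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)

# Hypothetical users: spend rises in lockstep with views
view_count = [1, 2, 3, 4, 5]
total_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
r = pearson(view_count, total_spend)
r_reversed = pearson(view_count, total_spend[::-1])
print(r, r_reversed)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship. A high coefficient does not show that viewing causes spending.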

### Task 4: Engineer Features for ML

```sql
%sql
CREATE OR REPLACE TABLE ecommerce_prod.gold.user_ml_features AS
SELECT 
    user_id,
    -- Feature 1: Total interactions
    COUNT(*) as interaction_count,
    -- Feature 2: Conversion rate (purchases as a share of all interactions)
    COUNT(CASE WHEN event_type = 'purchase' THEN 1 END) / COUNT(*) as conversion_rate,
    -- Feature 3: Weekend warrior (share of activity on weekends; 1 = Sunday, 7 = Saturday)
    COUNT(CASE WHEN dayofweek(event_date) IN (1, 7) THEN 1 END) / COUNT(*) as weekend_ratio,
    -- Feature 4: Average price of items viewed (view events only)
    AVG(CASE WHEN event_type = 'view' THEN price END) as avg_viewed_price,
    -- Feature 5: Category diversity (number of distinct categories the user touched)
    count(DISTINCT main_category) as category_diversity
FROM ecommerce_prod.silver.cleaned_events
GROUP BY user_id;
```

```sql
%sql
select * from ecommerce_prod.gold.user_ml_features limit 10;
```