# Databricks Data Preparation in ML - Notebook 05
## Data Standardization Fundamentals

**Part of the Databricks Data Preparation in ML Training Series**

---

## Objectives

This notebook covers essential data standardization techniques required for Databricks ML Associate Certification:

- **Feature Scaling** - StandardScaler, MinMaxScaler, RobustScaler methods
- **Normalization** - Unit vectors and L2 normalization techniques
- **String Standardization** - Case, whitespace, and format cleaning
- **Date/Time Standardization** - Consistent formats and timezone handling
- **Categorical Standardization** - Uniform category naming conventions
- **Outlier Treatment** - Detection and handling strategies for extreme values

## Duration: ~40 minutes
## Level: Fundamental → Intermediate

---

## Why is Data Standardization Critical?

**Data Standardization** is a crucial preprocessing step for ML model success:
- **Algorithm Performance** - Many algorithms require features on similar scales
- **Convergence Speed** - Gradient descent converges faster with standardized features
- **Feature Importance** - Ensures equal treatment regardless of original scale
- **Distance-based Algorithms** - KNN, clustering, and SVM require standardized inputs
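
To make the distance-based point concrete, the following plain-Python sketch compares Euclidean distances before and after min-max scaling. The three records and the assumed feature ranges (age 18 to 65, salary 20,000 to 200,000) are invented for illustration and are not taken from this notebook's dataset.

```python
import math

# Hypothetical records: (age, salary)
a = (25.0, 50_000.0)
b = (60.0, 51_000.0)   # very different age, similar salary
c = (26.0, 58_000.0)   # similar age, somewhat different salary

# Assumed feature ranges used for min-max scaling (illustrative only)
ranges = [(18.0, 65.0), (20_000.0, 200_000.0)]

def min_max(point):
    """Scale each coordinate to [0, 1] using the assumed ranges."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(point, ranges))

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab = math.dist(min_max(a), min_max(b))
scaled_ac = math.dist(min_max(a), min_max(c))

# On raw values salary dominates, so b looks closer to a than c does.
print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
# After scaling, the 35-year age gap makes b the more distant record.
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

A nearest-neighbour search on the raw values would pick `b` as the closest match to `a`; on the scaled values it would pick `c`.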

---

## Theory: Standardization Types

### Feature Scaling Methods

Different scaling methods serve different purposes and algorithm requirements:

```
Standard Scaling (Z-score): (x - μ) / σ
- Centers data around 0 with unit variance
- Best for: Linear algorithms, neural networks

Min-Max Scaling: (x - min) / (max - min)  
- Scales data to [0,1] range
- Best for: When bounded range is required

Robust Scaling: (x - median) / IQR
- Uses median and interquartile range
- Best for: Data with outliers
```
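
As a rough illustration of the three formulas above, here is a plain-Python sketch on a toy list. It uses only the standard library, not Spark, and the data values are made up. Quartile conventions differ between libraries, so the robust-scaled numbers may not match another tool exactly.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy data; 100.0 acts as an outlier

# Standard scaling: subtract the mean, divide by the sample standard deviation
mu = statistics.mean(values)
sigma = statistics.stdev(values)
z_scores = [(x - mu) / sigma for x in values]

# Min-max scaling: map the observed minimum to 0 and the maximum to 1
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Robust scaling: subtract the median, divide by the interquartile range
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(x - median) / (q3 - q1) for x in values]

for name, scaled in [("z-score", z_scores), ("min-max", min_max), ("robust", robust)]:
    print(f"{name:8}", [round(v, 3) for v in scaled])
```

With the outlier present, min-max scaling squeezes the four ordinary values into roughly the bottom 3% of the range, while robust scaling keeps them spread between -1 and 0.5.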

### Normalization Methods

Normalization adjusts individual samples rather than features:

```
L2 Normalization: x / ||x||₂
- Unit length vectors
- Best for: Text analysis, cosine similarity

L1 Normalization: x / ||x||₁  
- Manhattan distance normalization
- Best for: Sparse data scenarios

Unit Vector: x / |x|
- Simple magnitude normalization
- Best for: Direction-based analysis
```
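
The row-wise nature of these formulas can be sketched in plain Python as follows. The helper names `l1_normalize` and `l2_normalize` are illustrative, not part of any library, and the sketch assumes a non-zero input vector.

```python
import math

def l1_normalize(v):
    """Divide by the sum of absolute values (L1 norm)."""
    norm = sum(abs(x) for x in v)
    return [x / norm for x in v]

def l2_normalize(v):
    """Divide by the Euclidean length (L2 norm)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

sample = [3.0, 4.0]          # one row, i.e. one sample
l1 = l1_normalize(sample)    # absolute values now sum to 1
l2 = l2_normalize(sample)    # Euclidean length is now 1

print("L1:", l1, "sum of |x| =", sum(abs(x) for x in l1))
print("L2:", l2, "length =", math.sqrt(sum(x * x for x in l2)))
```

Each sample is rescaled using only its own values, so no statistics need to be learned from the dataset.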

### Selection Guidelines:
- **Linear Models** → Standard Scaling
- **Tree-based Models** → Often no scaling needed
- **Neural Networks** → Standard or Min-Max Scaling
- **Distance-based** → Standard or Robust Scaling

## Environment Setup

In [0]:
# Basic imports for Databricks ML
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, when, trim, lower, upper, regexp_replace, 
    to_timestamp, date_format, percentile_approx,
    mean, stddev, min as spark_min, max as spark_max
)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
from pyspark.ml.feature import StandardScaler, MinMaxScaler, RobustScaler, Normalizer, VectorAssembler
from pyspark.ml.stat import Summarizer
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

In [0]:
# Creating a demonstration dataset with different scales
np.random.seed(42)

n_samples = 1000
ages = np.random.randint(12, 35, n_samples).clip(18, 65).tolist()
salaries = np.random.lognormal(10.5, 0.8, n_samples).tolist()  # Different scales!
heights = np.random.normal(170, 10, n_samples).clip(150, 200).tolist()
scores = np.random.uniform(0, 100, n_samples).tolist()
years_exp = np.random.exponential(5, n_samples).clip(0, 30).tolist()

# Adding outliers
outlier_indices = np.random.choice(n_samples, 20, replace=False)
for idx in outlier_indices:
    salaries[idx] = salaries[idx] * 3  # Extreme outliers

# Different string formats for standardization
cities = ["  warsaw  ", "KRAKÓW", "gdańsk", "Poznań ", "  WROCŁAW"]
departments = ["  IT", "Finance ", "MARKETING", "hr  ", "Sales"]
city_samples = np.random.choice(cities, n_samples).tolist()
dept_samples = np.random.choice(departments, n_samples).tolist()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
    StructField("height", DoubleType(), True),
    StructField("score", DoubleType(), True),
    StructField("years_experience", DoubleType(), True),
    StructField("city", StringType(), True),
    StructField("department", StringType(), True)
])

data = [(i, int(ages[i]), float(salaries[i]), float(heights[i]), 
         float(scores[i]), float(years_exp[i]), city_samples[i], dept_samples[i]) 
        for i in range(n_samples)]

df_raw = spark.createDataFrame(data, schema)
display(df_raw)

# Feature Scaling - Numerical Features

## Theory
**Feature Scaling** brings all numerical features to a similar scale, which is critical for many ML algorithms.

### When to use each scaler:

| Scaler | When to use | Properties |
|--------|-------------|------------|
| **StandardScaler** | Approximately normal distributions | μ=0, σ=1 |
| **MinMaxScaler** | Approximately uniform distributions | [0,1] range |
| **RobustScaler** | Outliers present | Median-based |

In [0]:
# Scale analysis before standardization
numerical_cols = ["age", "salary", "height", "score", "years_experience"]

print("Scale analysis of numerical variables:")
for col_name in numerical_cols:
    stats = df_raw.select(
        spark_min(col_name).alias("min"),
        spark_max(col_name).alias("max"),
        mean(col_name).alias("mean"),
        stddev(col_name).alias("std")
    ).collect()[0]
    
    print(f"  {col_name:15} | Min: {stats['min']:8.1f} | Max: {stats['max']:10.1f} | Mean: {stats['mean']:8.1f} | Std: {stats['std']:8.1f}")

print("\n  Problem: Salary has a much larger scale than other variables!")
print("    This can dominate distance-based algorithms (KNN, SVM) and slow gradient-based training (neural networks)")

In [0]:
display(df_raw.describe())

In [0]:
# Data preparation for scaling
assembler = VectorAssembler(
    inputCols=numerical_cols,
    outputCol="features_raw"
)

df_assembled = assembler.transform(df_raw)
display(df_assembled.select("features_raw"))

## StandardScaler - Z-score Normalization

**Formula**: `(x - μ) / σ`

✅ **Use when**: Features are approximately normally distributed

In [0]:
# StandardScaler - most commonly used
standard_scaler = StandardScaler(
    inputCol="features_raw",
    outputCol="features_standard",
    withMean=True,
    withStd=True
)

standard_model = standard_scaler.fit(df_assembled)
df_standard = standard_model.transform(df_assembled)

display(df_standard.select("features_standard"))

After applying feature standardization (e.g., with StandardScaler), verify that the transformation worked. Two statistics are typically checked for each feature:

- Mean ≈ 0
- Standard deviation ≈ 1

We use `pyspark.ml.stat.Summarizer` to compute these values for each feature in the assembled feature vector: define the summarization metrics (`mean` and `std`), apply `.summary()` to the vector column, and select the result with an alias.

This check matters because many machine learning algorithms, such as linear regression, logistic regression, and clustering, are sensitive to features that sit on different scales.

In [0]:
# Verification: mean ≈ 0, std ≈ 1
summary = Summarizer.metrics("mean", "std").summary(df_standard["features_standard"])
display(df_standard.select(summary.alias("summary")))

## MinMaxScaler - Range Normalization

**Formula**: `(x - min) / (max - min)`

**Use when**: You need features in a specific range [0,1]

In [0]:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.stat import Summarizer
from pyspark.sql.functions import col

# MinMaxScaler - scale to [0,1]
minmax_scaler = MinMaxScaler(
    inputCol="features_raw",
    outputCol="features_minmax"
)

minmax_model = minmax_scaler.fit(df_assembled)
df_minmax = minmax_model.transform(df_assembled)

print("MinMaxScaler applied (all values in [0,1]):")
df_minmax.select("features_minmax").show(5, truncate=False)

In [0]:
# Verification: min = 0, max = 1
summary = Summarizer.metrics("min", "max").summary(col("features_minmax"))
display(df_minmax.select(summary.alias("summary")))

## RobustScaler - Outlier-Resistant Scaling

**Formula**: `(x - median) / IQR`

✅ **Use when**: The data contains outliers (such as the salary outliers in our dataset)

In [0]:
# RobustScaler - resistant to outliers
robust_scaler = RobustScaler(
    inputCol="features_raw",
    outputCol="features_robust",
    withCentering=True,
    withScaling=True
)

robust_model = robust_scaler.fit(df_assembled)
df_robust = robust_model.transform(df_assembled)

print("RobustScaler applied (median-based, resistant to outliers):")
df_robust.select("features_robust").show(5, truncate=False)

In [0]:
# Select necessary columns from each DataFrame to avoid ambiguity
df_standard_selected = df_standard.select("id", "salary", "features_standard")
df_robust_selected = df_robust.select("id", "features_robust")

# Comparison with StandardScaler for salary outliers
comparison = df_standard_selected.join(df_robust_selected, "id").select(
    "salary", "features_standard", "features_robust"
).orderBy(col("salary").desc())

print("Comparison: StandardScaler vs RobustScaler for salary outliers:")
display(comparison.limit(10))

# Normalization - Unit Vector Scaling

## Theory
**Normalization** scales each row (sample) to a unit vector of length 1.

**Use when**: 
- Features represent similar concepts
- Magnitude does not matter, only direction
- Text analysis, recommendation systems
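
The "only direction matters" idea can be sketched in plain Python: once rows are L2-normalized, their dot product equals their cosine similarity. The vectors and the helper names `l2_normalize` and `dot` below are hypothetical, chosen only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

small = [1.0, 2.0]
large = [10.0, 20.0]     # same direction, ten times the magnitude
other = [2.0, -1.0]      # perpendicular to the first two

# After normalization the dot product is the cosine of the angle between rows
cos_same = dot(l2_normalize(small), l2_normalize(large))
cos_perp = dot(l2_normalize(small), l2_normalize(other))

print(f"same direction: {cos_same:.3f}")
print(f"perpendicular:  {cos_perp:.3f}")
```

The tenfold difference in magnitude between `small` and `large` disappears after normalization, which is the desired behavior when only the proportions between features carry meaning.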

In [0]:
# Imports used to verify that each normalized vector has length 1
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import VectorUDT, Vectors
import math

In [0]:
# L2 Normalization (most commonly used)
normalizer = Normalizer(
    inputCol="features_standard",  # Using already standardized features
    outputCol="features_normalized",
    p=2.0  # L2 norm
)

df_normalized = normalizer.transform(df_standard)

display(df_normalized.select("features_normalized"))

In [0]:
def vector_length(v):
    return float(math.sqrt(sum(x*x for x in v)))

length_udf = udf(vector_length, DoubleType())

df_lengths = df_normalized.withColumn("vector_length", length_udf("features_normalized"))
display(df_lengths.select("vector_length").describe())

## StandardScaler vs Normalizer (L2) – Comparison

In [0]:
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

data = [
    Row(features=Vectors.dense([10.0, 100.0])),
    Row(features=Vectors.dense([20.0, 400.0])),
    Row(features=Vectors.dense([30.0, 900.0]))
]

df = spark.createDataFrame(data)

In [0]:
display(df)

In [0]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
model = scaler.fit(df)
df_scaled = model.transform(df)

df_scaled.select("features", "scaled_features").show(truncate=False)

In [0]:
from pyspark.ml.feature import Normalizer

normalizer = Normalizer(inputCol="features", outputCol="normalized_features", p=2.0)
df_normalized = normalizer.transform(df)

df_normalized.select("features", "normalized_features").show(truncate=False)

# String Standardization

## Theory
**String Standardization** cleans and unifies text data:
- **Case normalization** (lower/upper)
- **Whitespace cleaning** (trim, multiple spaces)
- **Special character handling**
- **Consistent formats**
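
Before moving to Spark, the cleaning steps listed above can be sketched in plain Python on the same city strings used in this notebook's sample data. The `standardize` helper is a hypothetical name introduced here for illustration.

```python
import re

def standardize(text, case="lower"):
    """Collapse whitespace runs, trim the ends, then normalize case."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.lower() if case == "lower" else collapsed.upper()

raw_cities = ["  warsaw  ", "KRAKÓW", "gdańsk", "Poznań ", "  WROCŁAW"]
clean_cities = [standardize(c) for c in raw_cities]
print(clean_cities)

# Internal runs of whitespace are collapsed to a single space as well
print(standardize("  human   resources ", case="upper"))
```

Case conversion here is Unicode-aware, so the Polish characters in the city names survive the round trip.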

In [0]:
# String data analysis before cleaning
print("📊 Raw string data before standardization:")
df_raw.select("city", "department").distinct().orderBy("city", "department").show(20, truncate=False)

print("\n⚠️  Issues:")
print("   • Different case (warsaw, KRAKÓW, gdańsk)")
print("   • Leading/trailing whitespaces")
print("   • Inconsistent formatting")

In [0]:
# String standardization pipeline
df_string_clean = df_raw.select(
    "*",
    # City standardization
    trim(lower(col("city"))).alias("city_clean"),
    
    # Department standardization with additional cleaning
    trim(upper(regexp_replace(col("department"), "\\s+", " "))).alias("department_clean")
)

print("String standardization applied:")
display(df_string_clean.select("city", "city_clean", "department", "department_clean").distinct().orderBy("city_clean", "department_clean").limit(20))

# Bucketizer

## Theory
**Bucketizer** is a feature engineering tool that transforms a continuous numerical column into discrete bins or "buckets." This process, known as binning, is useful for:

- **Handling non-linear relationships:** Converts continuous features into categorical intervals.
- **Reducing the effect of outliers:** Groups extreme values into boundary buckets.
- **Improving model interpretability:** Makes features easier to understand.

### How it works:
- Define split points (boundaries) for the buckets.
- Each value is assigned to a bucket based on which interval it falls into.
- Common use cases: age groups, income ranges, or discretizing scores.

**Example:**  
A salary column can be bucketized into low, medium, and high salary ranges for downstream analysis or modeling.
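
The assignment rule can be sketched in plain Python with the standard `bisect` module. This is an illustration of half-open binning, not Spark's implementation. The `bucket_index` helper is hypothetical, and treating the last bucket as closed on the right is an assumption made to mirror the convention described in Spark's `Bucketizer` documentation.

```python
from bisect import bisect_right

# Boundaries: <30k, 30k-60k, 60k-100k, >=100k
splits = [float("-inf"), 30_000.0, 60_000.0, 100_000.0, float("inf")]

def bucket_index(value, splits):
    """Return i such that splits[i] <= value < splits[i + 1]."""
    # bisect_right counts how many split points are <= value
    idx = bisect_right(splits, value) - 1
    # Treat the last bucket as closed on the right
    return min(idx, len(splits) - 2)

for salary in [12_000.0, 30_000.0, 59_999.0, 250_000.0]:
    print(salary, "->", bucket_index(salary, splits))
```

A value that sits exactly on a boundary, such as 30,000, falls into the bucket that starts at that boundary.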

In [0]:
from pyspark.ml.feature import Bucketizer

# Example boundaries: <30k, 30k-60k, 60k-100k, >100k
salary_splits = [-float("inf"), 30000.0, 60000.0, 100000.0, float("inf")]

bucketizer = Bucketizer(
    splits=salary_splits,
    inputCol="salary",
    outputCol="salary_bucket"
)

In [0]:
# Apply the bucketizer to the raw DataFrame
df_buckets = bucketizer.transform(df_raw)

# Display the result
display(df_buckets.select("salary", "salary_bucket"))

# Outlier Detection and Treatment

## Theory
**Outliers** can significantly impact model performance. Key methods:

### Statistical Methods:
- **IQR Method**: Values outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
- **Z-score**: |z| > 3 (or 2.5)
- **Modified Z-score**: Median-based for non-normal distributions

The code cell at the end of this section detects outliers in the `salary` column using the Interquartile Range (IQR) method, a common technique in data cleaning.

### Step-by-step explanation:

1. **Calculate quantiles using `percentile_approx()`:**
   - **Q1 (25th percentile):** Lower boundary of the middle 50% of values
   - **Q3 (75th percentile):** Upper boundary of the middle 50%
   - **Median (50th percentile):** The center value (not used for outlier detection, but included for insight)

2. **Compute IQR:**
   - `IQR = Q3 - Q1`
   - The IQR captures the “spread” of the middle 50% of salaries.

3. **Define outlier thresholds:**
   - **Lower bound:** `Q1 − 1.5 × IQR`
   - **Upper bound:** `Q3 + 1.5 × IQR`
   - Any value outside this range is considered a potential outlier.

4. **Print summary:**
   - Displays the computed bounds and IQR statistics in a readable format.

---

**Why use IQR?**

The IQR method is robust to skewed distributions and not sensitive to extreme values, unlike mean-based techniques. It is often used in preprocessing pipelines to flag or remove anomalous data points.

In [0]:
# IQR-based outlier detection for salary
salary_stats = df_raw.select(
    percentile_approx("salary", 0.25).alias("q1"),
    percentile_approx("salary", 0.75).alias("q3"),
    percentile_approx("salary", 0.5).alias("median")
).collect()[0]

q1, q3, median = salary_stats['q1'], salary_stats['q3'], salary_stats['median']
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print("  Salary outlier bounds (IQR method):")
print(f"   Q1: {q1:,.0f}, Q3: {q3:,.0f}, IQR: {iqr:,.0f}")
print(f"   Lower bound: {lower_bound:,.0f}")
print(f"   Upper bound: {upper_bound:,.0f}")

In [0]:
# Outlier identification
df_outliers = df_raw.withColumn(
    "is_outlier",
    when((col("salary") < lower_bound) | (col("salary") > upper_bound), True).otherwise(False)
)

display(df_outliers)
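
The modified Z-score listed among the statistical methods can be sketched in plain Python. It replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The salary values are invented, the 3.5 cutoff is a commonly used convention rather than something this notebook prescribes, and the sketch assumes the MAD is non-zero.

```python
import statistics

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    return [0.6745 * (x - med) / mad for x in values]

# Hypothetical salaries with one extreme value
salaries = [31_000.0, 35_000.0, 36_000.0, 38_000.0, 40_000.0, 42_000.0, 250_000.0]
scores = modified_z_scores(salaries)

# Flag values whose modified Z-score exceeds the assumed cutoff of 3.5
flagged = [s for s, z in zip(salaries, scores) if abs(z) > 3.5]
print(flagged)
```

Because both the center and the spread come from medians, the single extreme salary barely affects the scores of the remaining values.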

# Complete Standardization Pipeline

## Combining all techniques into one production-ready pipeline

In [0]:
# Complete standardization pipeline
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.sql.functions import col, trim, lower, upper, when
from pyspark.ml.stat import Summarizer

# Step 1: String cleaning
df_pipeline = df_raw.select(
    "*",
    trim(lower(col("city"))).alias("city_std"),
    trim(upper(col("department"))).alias("department_std")
)

# Step 2: Outlier capping
df_pipeline = df_pipeline.withColumn(
    "salary_std",
    when(col("salary") > upper_bound, upper_bound)
    .when(col("salary") < lower_bound, lower_bound)
    .otherwise(col("salary"))
)

# Step 3: Feature assembly for numerical features
numerical_cols_std = ["age", "salary_std", "height", "score", "years_experience"]
assembler_final = VectorAssembler(
    inputCols=numerical_cols_std,
    outputCol="features_raw_final"
)

# Step 4: Standard scaling
scaler_final = StandardScaler(
    inputCol="features_raw_final",
    outputCol="features_final",
    withMean=True,
    withStd=True
)

# Pipeline execution
df_assembled_final = assembler_final.transform(df_pipeline)
scaler_model_final = scaler_final.fit(df_assembled_final)
df_final = scaler_model_final.transform(df_assembled_final)

print("Complete standardization pipeline executed:")
display(df_final.select("city_std", "department_std", "salary_std", "features_final").limit(10))

# Final validation
print("Final standardized features statistics:")
summary_final = Summarizer.metrics("mean", "std", "min", "max").summary(col("features_final"))
display(df_final.select(summary_final.alias("summary")))

# Summary and Best Practices

## When to use each standardization method?

### 1️⃣ **StandardScaler**
 **Use when:**
- Features have a **normal distribution**
- **Linear models** (logistic regression, SVM)
- **Neural networks** (gradient descent optimization)
- **PCA** and other dimensionality reduction

### 2️⃣ **MinMaxScaler**
 **Use when:**
- You need a **bounded range** [0,1]
- **Image processing** (pixel values)
- **Neural networks** with activation functions
- Features with **known, fixed bounds** (not ideal for sparse data: zeros are generally mapped to non-zero values, so sparsity is lost)

### 3️⃣ **RobustScaler**
 **Use when:**
- Data contains **outliers**
- **Non-normal distributions**
- **Scale-sensitive models** trained on skewed or heavy-tailed data (tree-based models often do not require scaling at all)

### 4️⃣ **Normalizer**
 **Use when:**
- **Text analysis** (TF-IDF vectors)
- **Recommendation systems**
- Direction matters more than magnitude

## Key principles

1. **Fit scaler only on training data** - prevent data leakage!
2. **Apply transformation to train/validation/test** with the same scaler
3. **Handle outliers before scaling** for better results
4. **Standardize strings** for consistent categorical encoding
5. **Tree-based algorithms** often do not require feature scaling
6. **Save scaler models** for production deployment
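
Principles 1 and 2 can be sketched in plain Python: the scaling statistics are learned from the training split only and then reused unchanged on other splits. The numbers and the `transform` helper are hypothetical. In Spark the same pattern corresponds to calling `fit` on the training DataFrame and `transform` on every split.

```python
import statistics

train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [11.0, 30.0]          # hypothetical unseen values

# "Fit": learn the statistics from the training split only
mu = statistics.mean(train)
sigma = statistics.stdev(train)

def transform(values):
    """Apply the training statistics to any split."""
    return [(x - mu) / sigma for x in values]

train_scaled = transform(train)
test_scaled = transform(test)

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The test values are not guaranteed to have mean 0 or to stay within the training range after scaling. That is expected: recomputing the statistics on the test split would leak information about it into the preprocessing step.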

## QUICK REFERENCE for Databricks ML Associate

### **1️⃣ StandardScaler:**
```python
scaler = StandardScaler(inputCol='features', outputCol='scaled', withMean=True, withStd=True)
model = scaler.fit(train_df)
scaled_df = model.transform(df)
```

### **2️⃣ MinMaxScaler:**
```python
scaler = MinMaxScaler(inputCol='features', outputCol='scaled')
```

### **3️⃣ String cleaning:**
```python
df.withColumn('clean', trim(lower(col('text_col'))))
```

### **4️⃣ Outlier detection (IQR):**
```python
# Named *_bound so they do not shadow pyspark.sql.functions.lower/upper
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
```

---

**Congratulations! You've completed Data Standardization in Databricks ML**  
