# Databricks Data Preparation in ML - Quick Reference
## Essential Code Patterns for ML Associate Certification

**Part of the Databricks Data Preparation in ML Training Series**

---

## Must-Know PySpark Patterns

This quick reference provides essential code patterns and concepts needed for the Databricks ML Associate certification exam, covering all key data preparation techniques from the training series.

### 1️⃣ **Data Quality & Missing Values**

```python
# Missing value analysis
from pyspark.sql.functions import col, when, count

# Count missing values per column
# (isNull() catches SQL nulls only; NaN in float columns needs isnan() as well)
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Completeness percentage (count(c) counts only non-null values)
total_count = df.count()
df.select([(count(c) / total_count * 100).alias(f"{c}_completeness") for c in df.columns]).show()

# Drop rows with missing values
df_clean = df.dropna()  # Drop any row containing a null
df_clean = df.dropna(subset=['col1', 'col2'])  # Drop only if null in specific columns

# Fill missing values
df_filled = df.fillna(0)  # Fill nulls in numeric columns with 0 (string columns are untouched)
df_filled = df.fillna({'numeric_col': 0, 'string_col': 'unknown'})  # Column-specific
```

### 2️⃣ **Imputation with PySpark ML**

```python
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col, when

# Imputer for numerical columns (treats both null and NaN as missing)
imputer = Imputer(
    inputCols=['age', 'salary', 'score'],
    outputCols=['age_imp', 'salary_imp', 'score_imp'],
    strategy='mean'  # or 'median'; 'mode' requires Spark 3.1+
)

# Fit and transform
imputer_model = imputer.fit(df)
df_imputed = imputer_model.transform(df)

# Custom imputation with business logic
df_custom = df.withColumn(
    'age_filled',
    when(col('age').isNull(), 35.0).otherwise(col('age'))  # Fill with business default
)
```

### 3️⃣ **Categorical Encoding**

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# String indexing (label encoding)
# Indices follow descending label frequency by default, not any semantic order;
# for a truly ordinal feature, map the levels to numbers explicitly.
indexer = StringIndexer(inputCol='category', outputCol='category_idx')
indexer_model = indexer.fit(df)
df_indexed = indexer_model.transform(df)

# One-Hot Encoding (Nominal); the last category is dropped by default (dropLast=True)
encoder = OneHotEncoder(inputCols=['category_idx'], outputCols=['category_ohe'])
encoder_model = encoder.fit(df_indexed)
df_encoded = encoder_model.transform(df_indexed)

# Complete pipeline: fit() fits each stage in order
pipeline = Pipeline(stages=[indexer, encoder])
pipeline_model = pipeline.fit(df)
df_final = pipeline_model.transform(df)
```

### 4️⃣ **Feature Scaling**

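```python
from pyspark.ml.feature import StandardScaler, MinMaxScaler, RobustScaler, VectorAssembler

# Prepare features for scaling (scalers operate on a single vector column)
# VectorAssembler raises an error on nulls by default, so impute or drop them first
assembler = VectorAssembler(
    inputCols=['age', 'salary', 'score'],
    outputCol='features'
)
df_assembled = assembler.transform(df)

# StandardScaler (z-score normalization)
scaler = StandardScaler(
    inputCol='features',
    outputCol='scaled_features',
    withMean=True,  # Defaults to False; True produces dense vectors
    withStd=True
)
scaler_model = scaler.fit(df_assembled)
df_scaled = scaler_model.transform(df_assembled)

# MinMaxScaler (0-1 scaling)
minmax_scaler = MinMaxScaler(
    inputCol='features',
    outputCol='minmax_features'
)
df_minmax = minmax_scaler.fit(df_assembled).transform(df_assembled)

# RobustScaler (outlier-resistant; divides by the IQR,
# set withCentering=True to also subtract the median)
robust_scaler = RobustScaler(
    inputCol='features',
    outputCol='robust_features'
)
df_robust = robust_scaler.fit(df_assembled).transform(df_assembled)
```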
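
The three scalers differ mainly in which statistics they take from the data. The plain-Python sketch below (no Spark required) applies the underlying formulas to a made-up column containing one outlier. The sample values and variable names are illustrative assumptions, and the robust variant here subtracts the median, which in Spark's `RobustScaler` corresponds to `withCentering=True` (centering is off by default).

```python
from statistics import mean, median, quantiles, stdev

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up column with one outlier

# Z-score: subtract the mean, divide by the sample standard deviation
mu, sigma = mean(values), stdev(values)
z_scores = [(x - mu) / sigma for x in values]

# Min-max: map the observed minimum to 0 and the maximum to 1
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Robust: subtract the median, divide by the interquartile range
q1, _, q3 = quantiles(values, n=4, method="inclusive")
robust = [(x - median(values)) / (q3 - q1) for x in values]

print([round(v, 3) for v in z_scores])
print([round(v, 3) for v in min_max])
print(robust)
```

The outlier squeezes the four ordinary values into the bottom few percent of the min-max range, while the robust version keeps them evenly spread around zero.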

### 5️⃣ **Data Splitting**

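```python
# Basic train-test split (split sizes are approximate, not exact)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Three-way split (60/20/20) in a single call
train_df, val_df, test_df = df.randomSplit([0.6, 0.2, 0.2], seed=42)

# Cross-validation setup
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Illustrative estimator and grid; assumes 'features' and 'label' columns exist
model = LogisticRegression(featuresCol='features', labelCol='label')
param_grid = ParamGridBuilder().addGrid(model.regParam, [0.01, 0.1]).build()

cv = CrossValidator(
    estimator=model,
    estimatorParamMaps=param_grid,
    evaluator=BinaryClassificationEvaluator(),  # areaUnderROC by default
    numFolds=5,
    seed=42
)

cv_model = cv.fit(train_df)
```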
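
To make `numFolds=5` concrete, here is a plain-Python sketch of how k-fold validation partitions rows: every row is held out for validation exactly once and used for training in the other folds. The helper name `k_fold_indices` and the round-robin fold assignment are assumptions for illustration only; `CrossValidator` assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    # Round-robin assignment: row i goes to fold i % num_folds
    folds = [list(range(start, n_rows, num_folds)) for start in range(num_folds)]
    for held_out, validation in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=10, num_folds=5))
for train, validation in splits:
    print(f"train on {len(train)} rows, validate on rows {validation}")
```

Each parameter combination in the grid is trained once per fold, so a grid of 2 settings with 5 folds means 10 model fits before the final refit on the full training set.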

### 6️⃣ **Data Validation & Quality**

```python
from pyspark.sql.functions import col, count, when, percentile_approx

# Basic statistics (count, mean, stddev, min, max)
df.describe().show()

# Outlier detection (IQR method)
stats = df.select(
    percentile_approx('salary', 0.25).alias('q1'),
    percentile_approx('salary', 0.75).alias('q3')
).collect()[0]

iqr = stats['q3'] - stats['q1']
lower_bound = stats['q1'] - 1.5 * iqr
upper_bound = stats['q3'] + 1.5 * iqr

outliers = df.filter(
    (col('salary') < lower_bound) | (col('salary') > upper_bound)
)

# Duplicate detection
duplicates = df.groupBy('customer_id').count().filter(col('count') > 1)
df_deduped = df.dropDuplicates(['customer_id'])  # Keeps one arbitrary row per customer_id

# Data quality checks: completeness percentage per column
total_records = df.count()
quality_report = df.select(
    [((total_records - count(when(col(c).isNull(), c))) / total_records * 100).alias(f"{c}_completeness")
     for c in df.columns]
)
quality_report.show()
```
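
As a worked example of the IQR bounds used in the outlier check, the sketch below runs the same arithmetic on a small made-up list in plain Python. The salary values are invented, and `statistics.quantiles` computes exact quartiles by interpolation, so on real data `percentile_approx` may return slightly different values.

```python
from statistics import quantiles

salaries = [40, 42, 45, 47, 50, 52, 55, 58, 60, 250]  # made-up values, in thousands

q1, _, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside [lower_bound, upper_bound] is flagged
outliers = [s for s in salaries if s < lower_bound or s > upper_bound]
print(q1, q3, lower_bound, upper_bound, outliers)
```

Only the value 250 falls outside the bounds; the nine ordinary salaries all sit inside them.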

## 🎯 Key Decision Trees

### **Missing Data Strategy**
```
Missing < 5%? → Drop rows
Missing 5-15%? → Impute (mean/median/mode)
Missing > 15%? → Investigate mechanism (MCAR/MAR/MNAR)
                 ↓
                Advanced imputation or feature engineering
```
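
The thresholds above are rules of thumb, and they can be written down as a small helper that turns a per-column missing fraction into an action. This is a minimal sketch: the function name, the returned labels, and the example fractions are all hypothetical, and the boundary values (exactly 5% and 15%) are assigned to the "impute" branch by assumption.

```python
def missing_data_strategy(missing_fraction: float) -> str:
    """Map a column's missing fraction (0.0-1.0) to a rule-of-thumb action."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    if missing_fraction < 0.05:
        return "drop_rows"
    if missing_fraction <= 0.15:
        return "impute"
    return "investigate_mechanism"


# Hypothetical per-column missing fractions, e.g. derived from per-column null counts
missing_by_column = {"age": 0.02, "salary": 0.10, "referral_code": 0.40}
plan = {name: missing_data_strategy(frac) for name, frac in missing_by_column.items()}
print(plan)
```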

### **Encoding Strategy**
```
Categorical Variable?
├─ Ordinal → StringIndexer (Label Encoding)
└─ Nominal
   ├─ Low cardinality (<10) → OneHotEncoder
   ├─ High cardinality (>50) → Target Encoding
   └─ Medium → Context dependent
```
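
The same tree can be expressed as a function of two inputs: whether the variable is ordinal, and how many distinct values it has. The sketch below is illustrative only; `choose_encoding` and its return labels are made-up names, and the distinct count would typically come from something like `df.select('category').distinct().count()`.

```python
def choose_encoding(is_ordinal: bool, cardinality: int) -> str:
    """Pick an encoding following the decision tree above."""
    if is_ordinal:
        return "StringIndexer"
    if cardinality < 10:
        return "OneHotEncoder"
    if cardinality > 50:
        return "target_encoding"
    return "context_dependent"


print(choose_encoding(is_ordinal=True, cardinality=4))      # e.g. education level
print(choose_encoding(is_ordinal=False, cardinality=3))     # e.g. color
print(choose_encoding(is_ordinal=False, cardinality=5000))  # e.g. zip code
```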

### **Scaling Strategy**
```
Algorithm?
├─ Tree-based (RF, XGBoost) → No scaling needed
├─ Distance-based (KNN, SVM) → StandardScaler
├─ Neural Networks → StandardScaler or MinMaxScaler
└─ Linear Models → StandardScaler
```
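
Why distance-based algorithms need scaling is easiest to see with numbers. In this plain-Python sketch (made-up customers, hypothetical helper names), salary is measured in dollars and age in years, so unscaled Euclidean distance is driven almost entirely by salary; standardizing the features changes which point counts as the nearest neighbor.

```python
import math
from statistics import mean, stdev

# Made-up customers as (age in years, salary in dollars)
points = {"a": (25, 50_000), "b": (60, 50_300), "c": (26, 50_400)}


def nearest_to_a(data):
    """Return the name of the point closest to 'a' by Euclidean distance."""
    others = [name for name in data if name != "a"]
    return min(others, key=lambda name: math.dist(data["a"], data[name]))


def standardize(data):
    """Z-score each feature using its mean and sample standard deviation."""
    columns = list(zip(*data.values()))
    stats = [(mean(column), stdev(column)) for column in columns]
    return {
        name: tuple((x - mu) / sigma for x, (mu, sigma) in zip(row, stats))
        for name, row in data.items()
    }


print(nearest_to_a(points))               # raw features
print(nearest_to_a(standardize(points)))  # standardized features
```

On the raw features the nearest neighbor of `a` is `b`, who is 35 years older but earns nearly the same; after standardization it is `c`, who is almost the same age. Tree-based models split on one feature at a time, so they are unaffected by this kind of rescaling.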

## ⚠️ Common Pitfalls

### **Data Leakage Prevention**
```python
# ❌ WRONG - fit on full dataset
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(df)  # Data leakage: test rows influence the mean and std

# ✅ CORRECT - split first, then fit only on training
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
scaler_model = scaler.fit(train_df)  # Fit only on train
train_scaled = scaler_model.transform(train_df)
test_scaled = scaler_model.transform(test_df)  # Apply same transformation
```
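
A small numeric sketch shows what the leak looks like. It uses plain Python and made-up values rather than Spark, and the variable names are illustrative. A single extreme test value shifts the statistics when they are computed on the combined data, so the scaled training values already carry information about the test set.

```python
from statistics import mean, stdev

train = [10.0, 12.0, 14.0, 16.0]
test = [100.0]  # made-up held-out value, far from the training range

# Leaky: statistics computed on train + test together
leaky_mean, leaky_std = mean(train + test), stdev(train + test)

# Correct: statistics computed on the training rows only
train_mean, train_std = mean(train), stdev(train)

train_scaled_leaky = [(x - leaky_mean) / leaky_std for x in train]
train_scaled_clean = [(x - train_mean) / train_std for x in train]
test_scaled = [(x - train_mean) / train_std for x in test]

print(leaky_mean, train_mean)
print([round(v, 2) for v in train_scaled_leaky])
print([round(v, 2) for v in train_scaled_clean])
```

With the leaky statistics every training value ends up below zero, because the test outlier pulled the mean upward; with train-only statistics the training values are centered on zero as intended.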

### **Encoding Mistakes**
```python
# ❌ WRONG - Label encoding alone for nominal categories
# Using color_idx directly as a feature implies an artificial ordering, e.g. Red=0, Blue=1, Green=2
colors_indexed = StringIndexer(inputCol='color', outputCol='color_idx')

# ✅ CORRECT - One-hot for nominal categories
colors_indexed = StringIndexer(inputCol='color', outputCol='color_idx')
colors_ohe = OneHotEncoder(inputCols=['color_idx'], outputCols=['color_ohe'])
```
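
The "artificial ordering" can be measured directly. The sketch below, in plain Python with hand-built encodings, compares pairwise distances between three colors under label encoding and under a full one-hot encoding. It assumes one slot per category, which in Spark corresponds to `OneHotEncoder(dropLast=False)`; the default drops the last category.

```python
import math

colors = ["Red", "Blue", "Green"]

# Label encoding: one number per category
label = {color: [float(i)] for i, color in enumerate(colors)}

# Full one-hot encoding: one binary slot per category
one_hot = {
    color: [1.0 if i == j else 0.0 for j in range(len(colors))]
    for i, color in enumerate(colors)
}

pairs = [("Red", "Blue"), ("Blue", "Green"), ("Red", "Green")]
label_distances = {p: math.dist(label[p[0]], label[p[1]]) for p in pairs}
one_hot_distances = {p: math.dist(one_hot[p[0]], one_hot[p[1]]) for p in pairs}

print(label_distances)
print(one_hot_distances)
```

Label encoding makes Red look twice as far from Green as from Blue, a relationship that does not exist in the data; under one-hot encoding every pair of colors is equally distant.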

### **Imputation Errors**
```python
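# ❌ WRONG - Imputing target variable
imputer = Imputer(inputCols=['feature1', 'target'],
                  outputCols=['feature1_imp', 'target_imp'])  # Never impute target!

# ✅ CORRECT - Only impute features; drop rows with a missing target instead
df_labeled = df.dropna(subset=['target'])
imputer = Imputer(inputCols=['feature1', 'feature2'],
                  outputCols=['feature1_imp', 'feature2_imp'])  # Only features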
```

## 📊 Exam Quick Facts

### **PySpark ML Key Classes**
- `StringIndexer` - Categorical → Numeric
- `OneHotEncoder` - Nominal categories → Binary vectors
- `VectorAssembler` - Multiple columns → Single vector
- `StandardScaler` - Z-score normalization
- `MinMaxScaler` - 0-1 scaling
- `Imputer` - Fill missing values
- `CrossValidator` - K-fold validation

### **Data Quality Dimensions**
- **Completeness** - % non-null values
- **Accuracy** - Correctness of values
- **Consistency** - Format uniformity
- **Validity** - Meets business rules
- **Uniqueness** - No unintended duplicate records
- **Timeliness** - Currency of data
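
Several of these dimensions reduce to simple ratios. The sketch below computes three of them on a handful of made-up records in plain Python; the field names and the age rule (0 to 120) are hypothetical stand-ins for real business rules.

```python
records = [  # made-up rows
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},
    {"customer_id": 2, "age": 151},
    {"customer_id": 3, "age": 28},
]
total = len(records)

# Completeness: share of rows with a non-null age
completeness = sum(r["age"] is not None for r in records) / total

# Uniqueness: distinct customer_id values relative to the row count
uniqueness = len({r["customer_id"] for r in records}) / total

# Validity: non-null ages must satisfy a hypothetical business rule (0-120)
known_ages = [r["age"] for r in records if r["age"] is not None]
validity = sum(0 <= age <= 120 for age in known_ages) / len(known_ages)

print(f"completeness={completeness:.0%} uniqueness={uniqueness:.0%} validity={validity:.0%}")
```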

### **Missing Data Mechanisms**
- **MCAR** - Missing Completely at Random
- **MAR** - Missing at Random (missingness depends only on observed data)
- **MNAR** - Missing Not at Random (missingness depends on the unobserved value itself)
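
The three mechanisms can be simulated to see how they differ in practice. This is a plain-Python sketch on synthetic data; the age and income distributions, the 30% and 60% missing rates, and the thresholds are arbitrary assumptions. Income is made missing in three ways, and the mean of the remaining values is compared with the true mean.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 20_000

# Made-up data: income rises with age, plus noise
ages = [rng.uniform(20, 60) for _ in range(n)]
incomes = [1000 * age + rng.gauss(0, 5000) for age in ages]

# True means the income value is missing for that row
mcar = [rng.random() < 0.3 for _ in range(n)]                    # same chance for every row
mar = [age > 40 and rng.random() < 0.6 for age in ages]          # depends on observed age
mnar = [inc > 40_000 and rng.random() < 0.6 for inc in incomes]  # depends on income itself


def older(age):
    return age > 40


def observed_mean(missing, keep=lambda age: True):
    """Mean of the incomes still observed, optionally restricted to an age group."""
    return mean(
        inc for inc, age, gone in zip(incomes, ages, missing) if not gone and keep(age)
    )


true_mean = mean(incomes)
true_older_mean = mean(inc for inc, age in zip(incomes, ages) if older(age))

print(round(true_mean), round(observed_mean(mcar)), round(observed_mean(mnar)))
print(round(true_older_mean), round(observed_mean(mar, keep=older)))
```

Under MCAR the observed mean stays close to the true mean. Under MAR the estimate is still accurate once you condition on the observed column (here, within the older age group). Under MNAR the observed mean is biased downward, and no observed column can correct it.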

### **Medallion Architecture**
- **Bronze** - Raw data with minimal processing
- **Silver** - Cleaned, validated data
- **Gold** - Business-ready aggregated data

## 🎯 Last-Minute Review Checklist

### ✅ **Must Remember Code Patterns**
- [ ] Missing value counting: `count(when(col(c).isNull(), c))`
- [ ] Train-test split: `df.randomSplit([0.8, 0.2], seed=42)`
- [ ] String indexing: `StringIndexer(inputCol, outputCol)`
- [ ] One-hot encoding: `OneHotEncoder(inputCols, outputCols)`
- [ ] Standard scaling: `StandardScaler(inputCol, outputCol, withMean=True)`
- [ ] Imputation: `Imputer(inputCols, outputCols, strategy='mean')`
- [ ] Outlier detection: `percentile_approx` + IQR calculation

### ✅ **Must Know Concepts**
- [ ] When to use each encoding method
- [ ] When to use each scaling method
- [ ] Data leakage prevention
- [ ] Missing data mechanism types
- [ ] Cross-validation setup
- [ ] Business rule validation
- [ ] Medallion architecture layers

### ✅ **Must Avoid Mistakes**
- [ ] Fitting transformers on full dataset before split
- [ ] Using label encoding for nominal categories
- [ ] Imputing target variables
- [ ] Ignoring business context in data decisions
- [ ] Scaling features for tree-based algorithms when it is not needed

---

## 🚀 **You're Ready! Good Luck on Your Certification!** 🎯