# Databricks Data Preparation in ML - Notebook 06
## Data Splitting Fundamentals

**Part of the Databricks Data Preparation in ML Training Series**

---

## Objectives

This notebook covers the essential data splitting techniques for the Databricks ML Associate certification:

- **Basic Split** - Train/Test (80/20) for simple model validation
- **Three-way Split** - Train/Validation/Test (60/20/20) for hyperparameter tuning
- **Cross-Validation** - K-fold strategies for maximum data utilization
- **Stratified Sampling** - Maintaining class distributions across splits
- **Time-based Splitting** - Temporal considerations for time series data

## Duration: ~45 minutes
## Level: Fundamental → Intermediate

---

## Why is Proper Data Splitting Critical?

Proper data splitting forms the **foundation of every ML project**:
- **Objective Evaluation** - Detecting overfitting and measuring model reliability on held-out data
- **Generalization** - Ensuring models perform well on unseen data
- **Production Readiness** - Simulating real-world deployment scenarios
- **Fair Comparison** - Enabling objective comparison between different models

---

## Theory: Data Splitting Strategies

### Basic Train/Test Split (80/20)
```
Dataset → Train (80%) + Test (20%)
               ↓            ↓
             Model        Final
            Fitting     Evaluation
```
- **Use case**: Simple model validation
- **Pros**: Simple, fast, sufficient for large datasets
- **Cons**: Single evaluation estimate, no data reserved for hyperparameter tuning

### Three-way Split (60/20/20)
```
Dataset → Train (60%) + Validation (20%) + Test (20%)
               ↓               ↓               ↓
             Model      Hyperparameter       Final
            Fitting         Tuning        Evaluation
```
- **Use case**: Model selection and hyperparameter optimization
- **Pros**: Unbiased final evaluation, enables tuning
- **Cons**: Reduces training data size

### Cross-Validation (K-fold)
```
Fold 1: [TEST ] [TRAIN] [TRAIN] [TRAIN] [TRAIN]
Fold 2: [TRAIN] [TEST ] [TRAIN] [TRAIN] [TRAIN]
Fold 3: [TRAIN] [TRAIN] [TEST ] [TRAIN] [TRAIN]
...
Final Score = Average(Fold1, Fold2, ..., FoldK)
```
- **Use case**: Robust model evaluation with limited data
- **Pros**: Maximum data utilization, robust estimates
- **Cons**: Computationally expensive (K times training)

## Environment Setup

In [0]:
# Basic imports for Databricks ML
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, row_number
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.window import Window
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import numpy as np

In [0]:
# Create demonstration dataset - Customer Churn
np.random.seed(42)

# Generate customer data
n_customers = 1000
ages = np.random.normal(45, 15, n_customers).clip(18, 80)
incomes = np.random.lognormal(10.5, 0.6, n_customers)
monthly_charges = np.random.normal(65, 25, n_customers).clip(10, 150)
support_calls = np.random.poisson(2, n_customers)

# Create realistic churn probability
churn_prob = (
    0.3 * (support_calls > 3) + 
    0.25 * (monthly_charges > 80) + 
    0.2 * (ages < 30) + 
    0.1 * np.random.random(n_customers)
)
churn = (churn_prob > 0.4).astype(int)

# Schema and data
schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("age", DoubleType(), True),
    StructField("income", DoubleType(), True),
    StructField("monthly_charges", DoubleType(), True),
    StructField("support_calls", IntegerType(), True),
    StructField("churn", IntegerType(), True)
])

data = [(i, float(ages[i]), float(incomes[i]), float(monthly_charges[i]), 
         int(support_calls[i]), int(churn[i])) for i in range(n_customers)]

df = spark.createDataFrame(data, schema)
# Preview the first 10 rows
display(df.limit(10))

# Basic Split - Train/Test (80/20)

## Theory
**Basic Split** divides data into two parts: training (80%) and test (20%).

### When to use:
- **Large datasets** (>10,000 samples)
- **Simple models** without hyperparameter tuning
- **Quick prototyping** and baseline models

### Advantages:
- **Simplicity** - easy to implement
- **Speed** - minimal overhead
- **Clarity** - clear structure

### Limitations:
- **No tuning** - no validation set
- **Variance** - results depend on random split

In [0]:
# Prepare features for the model
from pyspark.ml.feature import VectorAssembler

# Define input columns
feature_cols = ['age', 'income', 'monthly_charges', 'support_calls']

# Create VectorAssembler
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol='features'
)

# Transform data
df_features = assembler.transform(df)

print("Data with feature vectors:")
display(df_features.select("customer_id", "features", "churn").limit(5))

In [0]:
# Split into training and test sets (80/20)
train_data, test_data = df_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set size: {train_data.count()}")
print(f"Test set size: {test_data.count()}")

# Check target class distribution
print("\nClass distribution in training set:")
train_data.groupBy("churn").count().display()

print("Class distribution in test set:")
test_data.groupBy("churn").count().display()

Churn (customer churn) refers to the loss of customers or users over time. In a business or ML context, churn typically means that a customer:

- cancels a subscription,
- stops using a product or service,
- becomes inactive.

In [0]:
# Train binary classification model
from pyspark.ml.classification import LogisticRegression

# Create model
lr = LogisticRegression(
    featuresCol='features', 
    labelCol='churn',
    maxIter=10
)

# Train model on training set
lr_model = lr.fit(train_data)

print("Model trained on training set")
print(f"Number of iterations: {lr_model.summary.totalIterations}")

In [0]:
# Evaluate model on test set
predictions = lr_model.transform(test_data)

# Check predictions
print("Predictions on test set:")
predictions.select("customer_id", "churn", "prediction", "probability").show(10)

# Calculate AUC (area under the ROC curve), the default metric of BinaryClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="churn")
auc = evaluator.evaluate(predictions)

print(f"\nAUC on test set: {auc:.3f}")

# Three-way Split - Train/Validation/Test (60/20/20)

## Theory

**Three-way Split** divides data into **three independent sets**:

- **Training Set (60%)**: For training the model
- **Validation Set (20%)**: For hyperparameter tuning and model selection
- **Test Set (20%)**: For final model evaluation

### Why do we use three sets?

1. **Training Set**: Model learns patterns in the data
2. **Validation Set**: We test different model configurations without "looking" at the test set
3. **Test Set**: Final, objective evaluation of model performance

### Process:
1. Train model on **Training Set**
2. Evaluate different hyperparameters on **Validation Set**
3. Select the best configuration
4. Final evaluation on **Test Set**
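The four steps can be sketched end to end in plain Python, independent of Spark. Everything here is an illustrative assumption: the toy one-feature data, the hypothetical helpers `three_way_split`, `knn_predict` and `accuracy`, and the choice of a k-nearest-neighbour classifier with `k` as the hyperparameter. The point is only the order of operations: fit on train, choose on validation, touch the test set once.

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle once, then slice into train/validation/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_val = round(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, rows, k):
    """Share of rows whose label matches the k-NN prediction."""
    hits = sum(knn_predict(train, x, k) == y for x, y in rows)
    return hits / len(rows)

# Toy data: one feature; the label is 1 when the noisy feature exceeds 0.5
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.random()
    data.append((x, int(x + rng.gauss(0, 0.15) > 0.5)))

train, val, test = three_way_split(data)

# Steps 1-3: fit on train, score each candidate on validation, keep the best
scores = {k: accuracy(train, val, k) for k in (1, 5, 15)}
best_k = max(scores, key=scores.get)

# Step 4: use the test set exactly once, with the chosen configuration
test_accuracy = accuracy(train, test, best_k)
print(best_k, round(test_accuracy, 3))
```

The validation score of the winning candidate tends to be slightly optimistic because it was used for selection, which is why the test score is reported separately.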

In [0]:
# Split into three sets: train (60%), validation (20%), test (20%)
train_data_3way, temp_data = df_features.randomSplit([0.6, 0.4], seed=42)
val_data, test_data_3way = temp_data.randomSplit([0.5, 0.5], seed=42)

# Count each split once; every count() call triggers a Spark job
n_train = train_data_3way.count()
n_val = val_data.count()
n_test = test_data_3way.count()

print(f"Training set: {n_train} records")
print(f"Validation set: {n_val} records")
print(f"Test set: {n_test} records")

# Check proportions
total = n_train + n_val + n_test
print("\nProportions:")
print(f"Train: {n_train/total:.1%}")
print(f"Validation: {n_val/total:.1%}")
print(f"Test: {n_test/total:.1%}")

In [0]:
# Test different hyperparameters
hyperparams = [1, 5, 10, 20]
best_auc = 0
best_maxIter = None

print("Hyperparameter tuning on validation set:")

for maxIter in hyperparams:
    # Train model
    lr = LogisticRegression(
        featuresCol='features', 
        labelCol='churn',
        maxIter=maxIter
    )
    
    model = lr.fit(train_data_3way)
    
    # Evaluate on validation set
    val_predictions = model.transform(val_data)
    val_auc = evaluator.evaluate(val_predictions)
    
    print(f"maxIter={maxIter}: AUC = {val_auc:.3f}")
    
    # Save best result
    if val_auc > best_auc:
        best_auc = val_auc
        best_maxIter = maxIter

print(f"Best hyperparameter: maxIter={best_maxIter} (AUC={best_auc:.3f})")

In [0]:
# Train final model with best hyperparameters
final_lr = LogisticRegression(
    featuresCol='features', 
    labelCol='churn',
    maxIter=best_maxIter
)

final_model = final_lr.fit(train_data_3way)

In [0]:
# Final evaluation on test set
test_predictions = final_model.transform(test_data_3way)
final_auc = evaluator.evaluate(test_predictions)

print(f"Final model performance on test set:")
print(f"AUC = {final_auc:.3f}")

# Comparison with validation set
print(f"\nComparison:")
print(f"Validation AUC: {best_auc:.3f}")
print(f"Test AUC: {final_auc:.3f}")
print(f"Difference: {abs(final_auc - best_auc):.3f}")

# Cross-Validation (K-Fold)

## Theory

**Cross-Validation** is a technique that **maximizes data utilization** by repeatedly training and testing the model on different subsets.

### K-Fold Cross-Validation:

1. **Split the data into K parts** (e.g., K=5)
2. **Train the model K times**:
   - Each time, use K-1 parts for training
   - 1 part is used for testing
3. **Average the results** from all K iterations

### Advantages:
- **Better data utilization** - every record is used for both training and testing
- **More stable evaluation** - averaging reduces the impact of randomness
- **Overfitting detection** - high variance between folds may indicate a problem

### Disadvantages:
- **Computationally expensive** - training is performed K times
- **Longer duration** - especially for large models
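The fold rotation described above can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for illustration; Spark's `CrossValidator` performs its own fold assignment internally, so this shows only the idea, not Spark's implementation.

```python
import random

def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_rows=10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={len(train_idx)} rows, test={sorted(test_idx)}")

# Every row appears in exactly one test fold
tested = sorted(idx for _, test_idx in splits for idx in test_idx)
print(tested == list(range(10)))  # True
```

In a real run, a model would be fitted on `train_idx` and scored on `test_idx` in each iteration, and the K scores would then be averaged.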

We use a logistic regression classifier to predict churn (labelCol='churn') based on input features.

In [0]:
# Cross-Validation with Spark ML
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Define model
cv_lr = LogisticRegression(featuresCol='features', labelCol='churn')

The next cell builds a grid of hyperparameter combinations to evaluate during cross-validation:

- `maxIter`: maximum number of optimizer iterations
- `regParam`: regularization strength (higher = stronger regularization)

With 3 values for each parameter, there are 3 × 3 = 9 combinations to evaluate.
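The count can be reproduced in plain Python with a Cartesian product. This is only a sketch of what a parameter grid contains, using the same values as this notebook; it does not call Spark's `ParamGridBuilder`.

```python
from itertools import product

max_iter_values = [5, 10, 20]
reg_param_values = [0.01, 0.1, 1.0]

# Cartesian product: every maxIter paired with every regParam
grid = [{"maxIter": m, "regParam": r}
        for m, r in product(max_iter_values, reg_param_values)]
print(len(grid))  # 9

# With k-fold CV, each combination is trained once per fold
num_folds = 3
print(len(grid) * num_folds)  # 27 model fits during the search
```

The grid size multiplies with the number of folds, so adding a third parameter with three values would triple the search cost.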

In [0]:
# Hyperparameter grid to test
paramGrid = ParamGridBuilder() \
    .addGrid(cv_lr.maxIter, [5, 10, 20]) \
    .addGrid(cv_lr.regParam, [0.01, 0.1, 1.0]) \
    .build()

print(f"Number of hyperparameter combinations: {len(paramGrid)}")

The `CrossValidator` in the next cell takes these arguments:

- `estimator`: the ML model to train (logistic regression)
- `estimatorParamMaps`: the grid of parameter combinations
- `evaluator`: the metric evaluator (here `BinaryClassificationEvaluator`)
- `numFolds=3`: 3-fold cross-validation, so the dataset is split into 3 parts and each part is used once as validation

In [0]:
# Configure Cross-Validator
crossval = CrossValidator(
    estimator=cv_lr,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=3,  # 3-fold CV for faster execution
    seed=42
)

print("Cross-Validator configured (3-fold CV)")

In [0]:
# Split data into train and test for CV
cv_train_data, cv_test_data = df_features.randomSplit([0.8, 0.2], seed=42)

print("Starting Cross-Validation...")
print("This may take a moment...")

# Training with Cross-Validation
cv_model = crossval.fit(cv_train_data)

print("Cross-Validation completed!")
print(f"Best model selected from {len(paramGrid)} combinations")

In [0]:
# Analyze Cross-Validation results
best_model = cv_model.bestModel
best_params = {
    "maxIter": best_model.getMaxIter(),
    "regParam": best_model.getRegParam()
}

print("Best hyperparameters from CV:")
print(f"maxIter: {best_params['maxIter']}")
print(f"regParam: {best_params['regParam']}")

# avgMetrics holds the mean AUC across folds for each parameter combination
avg_metrics = cv_model.avgMetrics
print(f"Best average AUC from CV: {max(avg_metrics):.3f}")

# Final Notes

- Choose your data splitting strategy based on dataset size, model complexity, and evaluation needs.
- Always keep the test set untouched until the very end for unbiased performance assessment.
- Use cross-validation for small datasets or when you need robust, stable metrics.
- Set random seeds for reproducibility.
- For classification, use stratified splits to maintain class balance.



In [0]:
# Final evaluation of best model from CV
cv_test_predictions = cv_model.transform(cv_test_data)
cv_test_auc = evaluator.evaluate(cv_test_predictions)

print(f"Final evaluation of Cross-Validation model:")
print(f"Test AUC: {cv_test_auc:.3f}")

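# Comparison of all three methods
# Note: the three-way split is scored on a different test set than the other two,
# so small differences between these numbers are not conclusive
print("\n📊 Comparison of all methods:")
print(f"Basic Split AUC:      {auc:.3f}")
print(f"Three-way Split AUC:  {final_auc:.3f}")
print(f"Cross-Validation AUC: {cv_test_auc:.3f}")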

# Summary: When to use which method?

## Decision Framework

### 1️⃣ **Basic Split (80/20)** 
✅ **Use when:**
- You have a **large dataset** (>10,000 samples)
- Model is **simple** without complex hyperparameters
- You need a **quick prototype** or baseline
- **Time** is limited

### 2️⃣ **Three-way Split (60/20/20)**
✅ **Use when:**
- You have a **medium dataset** (roughly 1,000-50,000 samples; the size ranges in this section are rules of thumb and overlap)
- You plan **hyperparameter tuning**
- You want to **compare different models**
- You need **objective final evaluation**

### 3️⃣ **Cross-Validation**
✅ **Use when:**
- You have a **small dataset** (<5,000 samples)
- You want to **maximize data utilization**
- You need **stable metrics**
- **Computation time** is not critical

## ⚠️ Key principles

1. **Never use test set for hyperparameter tuning!**
2. **Always set random seed for reproducibility**
3. **Check class distribution in each set**
4. **For imbalanced data use stratified split**
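Principles 3 and 4 can be illustrated with a plain-Python sketch of a stratified split, which splits each class separately so that both sides keep the original class ratio. `stratified_split` is a hypothetical helper and the 900/100 toy data is made up for illustration. In PySpark, `DataFrame.sampleBy` offers approximate per-class sampling for the same purpose.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=42):
    """Split each class separately so both sides keep the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy imbalanced data: 900 non-churners (0) and 100 churners (1)
rows = [(i, 0) for i in range(900)] + [(i, 1) for i in range(900, 1000)]
train, test = stratified_split(rows, label_of=lambda row: row[1])

train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts)  # Counter({0: 720, 1: 80})
print(test_counts)   # Counter({0: 180, 1: 20})
```

Both sides keep the 9:1 ratio of the full dataset, which a purely random split only achieves on average.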

# Quick Reference for Databricks ML Associate

## Essential Data Splitting Code Patterns

### 1️⃣ Basic Train/Test Split
```python
# Simple 80/20 split with reproducible results
train, test = df.randomSplit([0.8, 0.2], seed=42)
```

### 2️⃣ Three-way Split (Train/Validation/Test)
```python
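# First split: 60% train, 40% temp
train, temp = df.randomSplit([0.6, 0.4], seed=42)

# Second split: halve temp into validation and test (20% of the full data each)
val, test = temp.randomSplit([0.5, 0.5], seed=42)

# Alternative single call with the same proportions:
# train, val, test = df.randomSplit([0.6, 0.2, 0.2], seed=42)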
```

### 3️⃣ Cross-Validation
```python
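# K-fold cross-validation for robust evaluation
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# lr and evaluator are the estimator and evaluator defined earlier in the notebook
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=5, seed=42)
cv_model = cv.fit(train)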
```

### 4️⃣ Stratified Split (for Classification)
```python
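# Approximate stratified split: sample ~80% of each class for training
fractions = {0: 0.8, 1: 0.8}  # one entry per label value in "churn"
train = df.sampleBy("churn", fractions=fractions, seed=42)
test = df.join(train, on="customer_id", how="left_anti")  # rows not sampled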
```

---

## Congratulations!

You have completed **Data Splitting in Databricks ML**!  


### Key Skills Acquired:
- ✅ Proper train/validation/test split strategies
- ✅ Cross-validation implementation techniques
- ✅ Stratified sampling for balanced datasets
- ✅ Reproducible data splitting with random seeds
- ✅ Best practices for unbiased model evaluation