# Validation & Experimental Discipline

## Research Question
Why do we need validation data in addition to test data?

## Hypothesis
Choosing model complexity on a separate validation set keeps those choices from overfitting to the test set, so the test error remains an honest estimate of how well the model generalizes.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

In [2]:
# Create synthetic data: quadratic signal (x^2) plus Gaussian noise with standard deviation 3
rng = np.random.default_rng(seed=43)
X = rng.random((150, 1)) * 5
y = X.squeeze()**2 + rng.standard_normal(150) * 3

### Note
The modern NumPy Generator API (`default_rng`) is used so that results are reproducible and the random state is isolated from NumPy's global seed.

In [4]:
# Train/validation/test split
# Step 1: hold out 20% of the data as the test set (X_test, y_test).
# The remaining 80% stays in X_temp, y_temp and is split again below.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=43
)

# Step 2: take 25% of the remaining 80% (20% of the total) as the validation set,
# leaving 60% of the total for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=43
)

### Why Three Sets?
- Training (60%): learn the model parameters
- Validation (20%): tune model complexity
- Test (20%): final unbiased evaluation
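
The 60/20/20 proportions follow from chaining two splits: the second `test_size` applies to what is left after the first. A minimal sketch of that arithmetic, where `chained_split_fractions` is a hypothetical helper written for illustration and not part of scikit-learn:

```python
# Sketch: effective fractions produced by two chained hold-out splits.
# `chained_split_fractions` is a hypothetical helper, not a scikit-learn function.

def chained_split_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) as fractions of the full dataset."""
    remainder = 1.0 - test_size                # share left after removing the test set
    val = remainder * val_size_of_remainder    # validation share of the full dataset
    train = remainder - val                    # what is left for training
    return train, val, test_size


# Values used in the notebook: test_size=0.2, then test_size=0.25 on the remainder.
train_frac, val_frac, test_frac = chained_split_fractions(0.2, 0.25)

n_total = 150  # number of samples generated above
sizes = {
    "train": round(train_frac * n_total),
    "val": round(val_frac * n_total),
    "test": round(test_frac * n_total),
}
print(sizes)
```

With 150 samples this works out to 90 training, 30 validation, and 30 test points.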

In [5]:
# Tune model using validation set
# Try three model complexities and compare the validation error
degrees = [1, 2, 5]
results = {}

for d in degrees:
    poly = PolynomialFeatures(degree=d)
    X_train_p = poly.fit_transform(X_train)
    X_val_p = poly.transform(X_val)

    model = LinearRegression()
    model.fit(X_train_p, y_train)

    val_mse = mean_squared_error(y_val, model.predict(X_val_p))
    results[d] = val_mse

results

{1: 14.65957569526338, 2: 8.208992679407546, 5: 9.219451784576465}

### Validation Results
- The degree with the lowest validation error is selected: degree 2 (MSE ≈ 8.21)
- The test set remains untouched during tuning
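
Training error alone could not make this choice. Because each higher-degree model contains the lower-degree ones, its least-squares training error can only stay the same or decrease, so it always favors the most complex model. The sketch below repeats the notebook's setup and records both errors side by side; the `train_mse` and `val_mse` dictionaries are names introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Same data recipe and splits as the notebook cells above.
rng = np.random.default_rng(seed=43)
X = rng.random((150, 1)) * 5
y = X.squeeze() ** 2 + rng.standard_normal(150) * 3

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=43
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=43
)

train_mse, val_mse = {}, {}
for d in [1, 2, 5]:
    poly = PolynomialFeatures(degree=d)
    X_train_p = poly.fit_transform(X_train)   # fit the feature expansion on training data only
    X_val_p = poly.transform(X_val)

    model = LinearRegression().fit(X_train_p, y_train)
    train_mse[d] = mean_squared_error(y_train, model.predict(X_train_p))
    val_mse[d] = mean_squared_error(y_val, model.predict(X_val_p))

for d in train_mse:
    print(f"degree={d}  train MSE={train_mse[d]:.2f}  val MSE={val_mse[d]:.2f}")
```

The training column never increases with degree, while the validation column can turn upward once the model starts fitting noise, which is the signal used for selection.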

In [6]:
# Final evaluation on Test set
best_degree = min(results, key=results.get)

poly = PolynomialFeatures(degree=best_degree)
X_train_p = poly.fit_transform(X_train)
X_test_p = poly.transform(X_test)

final_model = LinearRegression()
final_model.fit(X_train_p, y_train)

test_mse = mean_squared_error(y_test, final_model.predict(X_test_p))
test_mse

9.092860577694728

### Final Test Performance
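- The test set is used only once, after the degree has been chosen
- This gives an unbiased performance estimate: the test MSE of about 9.09 is close to the noise variance of 9 (noise standard deviation 3) built into the data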
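
To see why the estimate would stop being unbiased if the test set also drove the choice of degree, the sketch below deliberately breaks the rule. It simulates many datasets with the notebook's recipe and compares the test MSE reported when the degree is picked on validation data against the test MSE reported when the degree is picked by peeking at the test set. The function `one_run`, the 50 repetitions, and the use of the loop index as the seed are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures


def one_run(seed, degrees=(1, 2, 5)):
    """Return (honest, optimistic) test MSE for one freshly simulated dataset."""
    rng = np.random.default_rng(seed)
    X = rng.random((150, 1)) * 5
    y = X.squeeze() ** 2 + rng.standard_normal(150) * 3

    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=0.25, random_state=seed
    )

    val_mse, test_mse = {}, {}
    for d in degrees:
        poly = PolynomialFeatures(degree=d)
        model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
        val_mse[d] = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
        # Scoring every candidate on the test set is the mistake being demonstrated.
        test_mse[d] = mean_squared_error(y_test, model.predict(poly.transform(X_test)))

    honest = test_mse[min(val_mse, key=val_mse.get)]  # degree picked on validation data
    optimistic = min(test_mse.values())               # degree picked by peeking at the test set
    return honest, optimistic


runs = np.array([one_run(seed) for seed in range(50)])
mean_honest, mean_optimistic = runs.mean(axis=0)
print(f"mean test MSE, degree chosen on validation: {mean_honest:.2f}")
print(f"mean test MSE, degree chosen on test set:   {mean_optimistic:.2f}")
```

By construction the test-selected figure can never exceed the validation-selected one within a run, so reporting it would systematically understate the error expected on new data.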

## Reflection
- Validation data lets model-selection decisions be made without overfitting them to the test set
- Test data should be touched only once, at the very end
- Experimental discipline is as important as model choice