# Week 3 — Practical Coding Exercise (Constant Model & MSE)
This notebook follows the Week 3 instructions:
- Generate synthetic data `Y`.
- Try a range of constants `c` and compute the MSE for each.
- Find the optimal `c` and compare it with the dataset mean.
- Plot MSE vs. `c` and mark the optimum.
- Reflect on the shape of the curve and the role of a dummy model.
- **Bonus:** implement a simple `train_test_split` with shuffling.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Make outputs reproducible
rng = np.random.default_rng(42)

## 1) Generate synthetic data `Y`

In [None]:
# You can change the distribution if you want to experiment:
# Example A: Normal distribution
n = 300
Y = rng.normal(loc=4.0, scale=2.0, size=n)

# Example B (uncomment to try): Skewed (exponential) distribution
# Y = rng.exponential(scale=2.0, size=n)

# Quick sanity check
print(f"n = {Y.size}, mean(Y) ≈ {Y.mean():.4f}, std(Y) ≈ {Y.std(ddof=0):.4f}")

## 2) Grid of constants `c` and MSE calculation

In [None]:
# Range of constants to test
c_grid = np.linspace(0, 10, 201)  # 0, 0.05, ..., 10

# Compute MSE for each c (vectorized)
# MSE(c) = (1/n) * sum_i (Y_i - c)^2
mse_values = ((Y[:, None] - c_grid[None, :])**2).mean(axis=0)

# Find the best c (minimum MSE)
best_idx = np.argmin(mse_values)
c_star = c_grid[best_idx]
print(f"Best c on the grid: c* = {c_star:.4f}")
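
To make the broadcasting step less opaque, here is a minimal sketch that compares it with an explicit loop over the constants. The names `rng_demo`, `Y_demo` and `c_demo` are illustrative and not part of the exercise; the data is a small separate sample so the notebook's own `Y` is untouched.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator, for illustration only
Y_demo = rng_demo.normal(loc=4.0, scale=2.0, size=50)
c_demo = np.linspace(0, 10, 21)

# Broadcasting: (50, 1) - (1, 21) gives a (50, 21) matrix of residuals,
# which is then averaged over the sample axis.
mse_broadcast = ((Y_demo[:, None] - c_demo[None, :]) ** 2).mean(axis=0)

# Same computation, one constant at a time.
mse_loop = np.array([np.mean((Y_demo - c) ** 2) for c in c_demo])

print(np.allclose(mse_broadcast, mse_loop))
```

The broadcast version builds an `n × len(c_grid)` array in memory, which is fine at this size but worth keeping in mind for much larger grids or datasets.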

## 3) Compare optimal `c` with the mean of `Y`

In [None]:
y_mean = Y.mean()
print(f"Mean of Y: {y_mean:.6f}")
print(f"Difference |c* - mean(Y)| = {abs(c_star - y_mean):.6e}")
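
The difference printed above is usually small but not zero, because `c*` is restricted to grid points. As a rough sketch (assuming the mean lies inside the grid range, and using illustrative `_demo` names that are not part of the exercise), the gap should be at most half the grid step:

```python
import numpy as np

rng_demo = np.random.default_rng(1)  # separate generator, for illustration only
Y_demo = rng_demo.normal(loc=4.0, scale=2.0, size=300)
c_demo = np.linspace(0, 10, 201)
step = c_demo[1] - c_demo[0]

mse_demo = ((Y_demo[:, None] - c_demo[None, :]) ** 2).mean(axis=0)
c_best = c_demo[np.argmin(mse_demo)]

# The grid optimum is the grid point nearest to the mean,
# so the gap cannot exceed half a step (up to floating-point error).
gap = abs(c_best - Y_demo.mean())
print(f"step = {step:.3f}, gap = {gap:.4f}, bound = {step / 2:.4f}")
```

A finer grid shrinks the bound, but the exact minimizer is available in closed form as the mean, so the grid search is mainly a teaching device here.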

## 4) Plot MSE vs. `c` and mark the optimum

In [None]:
plt.figure(figsize=(6,4))
plt.plot(c_grid, mse_values)
plt.axvline(c_star, linestyle="--", color="tab:red", label=f"c* = {c_star:.2f}")  # mark the grid optimum
plt.legend()
plt.title("MSE vs. c (constant model)")
plt.xlabel("c")
plt.ylabel("MSE(c)")
plt.tight_layout()
plt.show()

## 5) Interpretation
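- The MSE curve is **convex** in `c`: it is a parabola, because `MSE(c) = Var(Y) + (c - mean(Y))^2`.  
- The minimum occurs at **`c = mean(Y)`**, where the MSE equals the variance of `Y`. The MSE grows with the squared distance of `c` from the mean, so the curve is symmetric around the mean. The grid optimum `c*` is the grid point closest to the mean, so the two can differ by up to half the grid step.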
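
The parabolic shape follows from the decomposition `MSE(c) = Var(Y) + (c - mean(Y))^2`, which holds for any sample, not only a symmetric one. A quick numerical check, sketched here on deliberately skewed data with illustrative `_demo` names that are not part of the exercise:

```python
import numpy as np

rng_demo = np.random.default_rng(2)  # separate generator, for illustration only
Y_demo = rng_demo.exponential(scale=2.0, size=300)  # skewed on purpose
c_demo = np.linspace(0, 10, 201)

# Direct definition: average squared residual for each constant.
mse_direct = ((Y_demo[:, None] - c_demo[None, :]) ** 2).mean(axis=0)

# Decomposition: population variance (ddof=0) plus squared offset from the mean.
mse_decomp = Y_demo.var(ddof=0) + (c_demo - Y_demo.mean()) ** 2

print(np.allclose(mse_direct, mse_decomp))
```

Note that `ddof=0` matters: the identity uses the population variance, matching the `1/n` factor in the MSE.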


## 6) Why a simple dummy model can be useful
- It gives a **baseline** to compare complex models against (sanity check).
- It is **fast, stable, and interpretable**.
- If a sophisticated model can't beat this baseline, there may be data/feature/label issues.
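
As a sketch of the baseline idea, the following compares the constant (mean) predictor with a least-squares line on held-out data. The feature `x`, the target `y` and the 150/50 holdout are hypothetical choices made only for this illustration; they are not part of the exercise.

```python
import numpy as np

rng_demo = np.random.default_rng(3)  # separate generator, for illustration only
x = rng_demo.uniform(0, 10, size=200)                 # hypothetical single feature
y = 1.5 * x + rng_demo.normal(scale=2.0, size=200)    # hypothetical noisy target

# Simple holdout: 150 shuffled rows for training, the remaining 50 for testing.
perm = rng_demo.permutation(x.size)
train, test = perm[:150], perm[150:]

# Baseline: always predict the training mean.
baseline_pred = np.full(test.size, y[train].mean())
mse_baseline = np.mean((y[test] - baseline_pred) ** 2)

# Candidate model: least-squares line fitted on the training rows.
slope, intercept = np.polyfit(x[train], y[train], deg=1)
mse_line = np.mean((y[test] - (slope * x[test] + intercept)) ** 2)

print(f"baseline MSE = {mse_baseline:.3f}, line MSE = {mse_line:.3f}")
```

Because the target really does depend on `x` here, the line should beat the baseline clearly. If it did not, that would be a hint to look for problems in the data or the pipeline before trying a more complex model.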


## Bonus: Simple `train_test_split` implementation with shuffling

In [None]:
def train_test_split(X, y, test_size=0.2, rng=None):
    """Split X, y into train and test using shuffling.
    
    Parameters
    ----------
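    X : array-like of shape (n_samples, n_features)
        Feature matrix.
    y : array-like of shape (n_samples,)
        Targets, aligned row by row with X.
    test_size : float in (0,1)
        Fraction of samples placed in the test set (rounded down).
    rng : numpy.random.Generator or None
        Generator used for shuffling; a fresh, unseeded one is created if None.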
    
    Returns
    -------
    X_train, X_test, y_train, y_test
    """
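    X = np.asarray(X)
    y = np.asarray(y)
    # Raise explicit errors instead of using assert, which is skipped under `python -O`
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of samples")
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a float in (0, 1)")
    n = X.shape[0]
    n_test = int(np.floor(test_size * n))  # rounds down; very small inputs can give an empty test set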
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(n)
    test_idx = perm[:n_test]
    train_idx = perm[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Example use (toy data)
X = np.array([[1,2],[3,4],[5,6],[7,8]])
y = np.array([0,1,0,1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, rng=rng)
print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:", y_train)
print("y_test:", y_test)
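
A few properties are worth checking for any shuffled split: the two parts have the expected sizes, they do not overlap, together they cover every sample, and the same seed reproduces the same split. The sketch below tests these on indices only, using a hypothetical helper `split_indices` that mirrors the permutation logic of the function above rather than calling it.

```python
import numpy as np

def split_indices(n, test_size, rng):
    """Hypothetical helper: return (train_idx, test_idx) from one permutation."""
    perm = rng.permutation(n)
    n_test = int(np.floor(test_size * n))
    return perm[n_test:], perm[:n_test]

# Two generators with the same seed should give identical splits.
train_a, test_a = split_indices(10, 0.2, np.random.default_rng(7))
train_b, test_b = split_indices(10, 0.2, np.random.default_rng(7))

print("sizes:", len(train_a), len(test_a))
print("disjoint:", np.intersect1d(train_a, test_a).size == 0)
print("covers all:", np.array_equal(np.sort(np.concatenate([train_a, test_a])), np.arange(10)))
print("reproducible:", np.array_equal(test_a, test_b))
```

Passing a seeded generator, as in the toy example above, is what makes the split repeatable between runs.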