# Activity: Underfitting vs Overfitting

In this activity, you'll fit models of increasing complexity to the same noisy data and compare how well each one generalizes to held-out test data.

## Steps:
1. Generate synthetic data (quadratic pattern + noise).
2. Fit three models:
   - Linear (underfit)
   - Quadratic (good fit)
   - High-degree polynomial (overfit)
3. Compare the fitted curves in the plots.
4. Compare the training and test errors (MSE) of each model.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

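# Step 1: Generate data
np.random.seed(42)  # fixed seed so the noise is reproducible
X = np.linspace(-3, 3, 40).reshape(-1, 1)  # 40 evenly spaced inputs as a column vector
y = 0.5 * X[:, 0]**2 + np.random.randn(40) * 1.5  # y = 0.5*x^2 plus Gaussian noise (std 1.5)
# Hold out 30% of the points as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)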

plt.scatter(X_train, y_train, label='Train data')
plt.scatter(X_test, y_test, label='Test data')
plt.legend()
plt.title('Generated Data (Quadratic with noise)')
plt.show()

## Step 2: Fit models of different complexity
- Linear (degree=1)
- Quadratic (degree=2)
- High-degree polynomial (degree=15)


In [None]:
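def fit_and_plot(degree):
    """Fit a polynomial regression of the given degree and plot it with its train/test MSE.

    Uses the global X_train, X_test, y_train, and y_test created in the previous cell.
    """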
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly_train = poly.fit_transform(X_train)
    X_poly_test = poly.transform(X_test)
    model = LinearRegression().fit(X_poly_train, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_poly_train)
    y_test_pred = model.predict(X_poly_test)
    
    # Errors
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    
    # Plot
    X_line = np.linspace(-3, 3, 200).reshape(-1,1)
    y_line = model.predict(poly.transform(X_line))
    plt.scatter(X_train, y_train, label='Train data')
    plt.scatter(X_test, y_test, label='Test data')
    plt.plot(X_line, y_line, color='red', label=f'Degree {degree}')
    plt.title(f'Polynomial degree {degree}\nTrain MSE={train_mse:.2f}, Test MSE={test_mse:.2f}')
    plt.legend()
    plt.show()

for d in [1, 2, 15]:
    fit_and_plot(d)
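
For step 4, it can help to see the errors side by side instead of reading them off three plot titles. The cell below is a minimal sketch of one way to do that. It recreates the same data so it runs on its own, and the names `mse_by_degree` and `errors` are illustrative, not part of the original activity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreate the activity's data so this cell is self-contained.
np.random.seed(42)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 + np.random.randn(40) * 1.5
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


def mse_by_degree(degrees):
    """Return {degree: (train_mse, test_mse)} (hypothetical helper for this sketch)."""
    results = {}
    for degree in degrees:
        # Same model as fit_and_plot, wrapped in a pipeline for brevity.
        model = make_pipeline(
            PolynomialFeatures(degree=degree, include_bias=False),
            LinearRegression(),
        )
        model.fit(X_train, y_train)
        results[degree] = (
            mean_squared_error(y_train, model.predict(X_train)),
            mean_squared_error(y_test, model.predict(X_test)),
        )
    return results


errors = mse_by_degree([1, 2, 15])
print(f"{'degree':>6} {'train MSE':>10} {'test MSE':>10}")
for degree, (train_mse, test_mse) in errors.items():
    print(f"{degree:>6} {train_mse:>10.2f} {test_mse:>10.2f}")
```

Passing a longer list, such as `range(1, 16)`, shows how the gap between the two columns changes as the degree grows.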

## Steps 3 and 4: Discussion Questions
1. Which model underfits the data? Why?
2. Which model overfits the data? How can you tell?
3. Why does the quadratic model strike the best balance between underfitting and overfitting?
4. What do you notice about **train vs test error** for each case?
5. How could we avoid overfitting in real-world problems?
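
As a starting point for question 5, one common approach (among several, such as regularization or collecting more data) is to choose the model complexity with cross-validation on the training set, leaving the test set untouched. The cell below is a sketch under assumed settings: 5 shuffled folds, degrees 1 through 8, and the illustrative names `cv_mse` and `best_degree`, none of which come from the original activity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreate the activity's data so this cell is self-contained.
np.random.seed(42)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 + np.random.randn(40) * 1.5
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumed settings for this sketch: 5 shuffled folds, degrees 1 to 8.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for degree in range(1, 9):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn reports negated MSE so that higher is always better.
    scores = cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_mean_squared_error"
    )
    cv_mse[degree] = -scores.mean()

# Pick the degree with the lowest average validation error.
best_degree = min(cv_mse, key=cv_mse.get)
for degree, mse in cv_mse.items():
    print(f"degree {degree}: mean CV MSE = {mse:.2f}")
print(f"Degree with lowest CV MSE: {best_degree}")
```

With only 28 training points the fold averages are noisy, so the selected degree may change with a different `random_state`. Treat the result as a guide, not a definitive answer.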
