# Polynomial Fits: Compare Degrees 1–4
Generate a synthetic dataset with a clearly non-linear relationship, fit **polynomial regressions** of degrees **1, 2, 3, and 4**, and compare their metrics and fitted curves.

**What you’ll see**
- A reproducible dataset with a few injected outliers to highlight under/overfitting differences.
- Train/test evaluation across polynomial degrees 1–4.
- Clear plots overlaying the fits on the data.
- A short interpretation guide.

In [None]:
# Imports & global style
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Plot defaults: light grid, higher DPI, slightly larger font
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3
plt.rcParams['figure.dpi'] = 140
plt.rcParams['font.size'] = 12


## 1) Generate Data
Create a clearly non-linear signal (a cubic polynomial) with Gaussian noise and a few outliers, so that the differences between low and high polynomial degrees are easier to see.
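
Written out, the generating process implemented in the next cell looks as follows. This is read directly from the code; the symbols $f$, $\varepsilon_i$, $\delta_i$, and $\mathcal{O}$ are notation introduced here for illustration, not names used in the notebook.

$$
\begin{aligned}
x_i &\sim \mathrm{Uniform}(-3, 3), \qquad i = 1, \dots, 120 \\
f(x) &= 0.6x^3 - x^2 + 2.2x + 1 \\
y_i &= f(x_i) + \varepsilon_i + \mathbb{1}[i \in \mathcal{O}]\,\delta_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\, 2.5^2), \quad \delta_i \sim \mathcal{N}(0,\, 10^2)
\end{aligned}
$$

Here $\mathcal{O}$ is the set of randomly chosen outlier indices.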

In [None]:
# Reproducible data
rng = np.random.default_rng(42)

n = 120 # total samples
x = rng.uniform(-3.0, 3.0, size=n)
# True function: a cubic polynomial (cubic, quadratic, linear, and constant terms)
def true_fn(t):
    return 0.6*t**3 - 1.0*t**2 + 2.2*t + 1.0

noise = rng.normal(0, 2.5, size=n)  # base noise
y = true_fn(x) + noise

# Add a few outliers (to emphasize overfitting at higher degrees)
n_outliers = max(4, n // 25)  # at least 4; exactly 4 for n = 120
idx_out = rng.choice(np.arange(n), size=n_outliers, replace=False)
y[idx_out] += rng.normal(0, 10.0, size=n_outliers)

# Train/test split (outliers were injected before the split, so they can land in either set)
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

print(f"Samples: {n}   Train: {len(X_train)}   Test: {len(X_test)}   Outliers: {n_outliers}")

### Quick look at the training data

In [None]:
plt.figure()
plt.scatter(X_train[:, 0], y_train, s=16, alpha=0.8, label='train points')
plt.title('Training Data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.tight_layout()
plt.show()

## 2) Fit & Evaluate Polynomial Models (Degrees 1–4)
Fit four models using a scikit-learn pipeline: `PolynomialFeatures(degree=d)` + `LinearRegression()`.
Report **MSE** and **R²** on both train and test sets.
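
To make the pipeline's first step concrete, here is a minimal sketch of what `PolynomialFeatures` produces for a single input column. The tiny array `X_demo` is a made-up example, not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical two-row input with one feature column
X_demo = np.array([[2.0], [-1.0]])

# Degree 3 without the constant column: each row becomes [x, x^2, x^3]
poly = PolynomialFeatures(degree=3, include_bias=False)
X_expanded = poly.fit_transform(X_demo)
print(X_expanded)
```

`LinearRegression` then fits ordinary least squares on these expanded columns. With `include_bias=False` the constant column is left out of the expansion, and the intercept comes from `LinearRegression` itself.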


In [None]:
def fit_poly(degree, X_train, y_train, X_test, y_test):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linreg', LinearRegression())
    ])
    model.fit(X_train, y_train)
    yhat_train = model.predict(X_train)
    yhat_test = model.predict(X_test)
    
    metrics = {
        'degree': degree,
        'train_MSE': mean_squared_error(y_train, yhat_train),
        'test_MSE': mean_squared_error(y_test, yhat_test),
        'train_R2': r2_score(y_train, yhat_train),
        'test_R2': r2_score(y_test, yhat_test),
        'model': model
    }
    return metrics

results = []
models = {}
for d in [1, 2, 3, 4]:
    m = fit_poly(d, X_train, y_train, X_test, y_test)
    results.append({k:v for k,v in m.items() if k != 'model'})
    models[d] = m['model']

df = pd.DataFrame(results).sort_values('degree').reset_index(drop=True)
df_display = df[['degree','train_MSE','test_MSE','train_R2','test_R2']]
df_display
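
As a sanity check on what the table reports, the two metrics can be computed by hand and compared with scikit-learn's functions. The sketch below uses small made-up arrays (`y_true`, `y_pred`) rather than the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R²: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Note that R² on a test set can be negative when a model predicts worse than the constant mean of the test targets.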

## 3) Visual Comparison
Overlay the fitted curves for degrees 1–4 on a dense grid. Differences should be visible:
- **Degrees 1–2** underfit the cubic trend.
- **Degree 3** matches the underlying pattern best.
- **Degree 4** may start to overfit, especially with outliers.


In [None]:
# Dense grid for smooth curves
x_grid = np.linspace(X.min(), X.max(), 400).reshape(-1, 1)

plt.figure()
plt.scatter(X_train[:,0], y_train, s=12, alpha=0.6, label='train points')
# Plot each degree fit
for d in [1,2,3,4]:
    y_grid = models[d].predict(x_grid)
    # Show test R2 in legend to emphasize generalization
    r2 = df_display.loc[df_display['degree']==d, 'test_R2'].iloc[0]
    plt.plot(x_grid[:,0], y_grid, linewidth=2, label=f'degree {d} (test R²={r2:.3f})')

plt.title('Polynomial Fits (Degrees 1–4)')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.tight_layout()
plt.show()

### Optional: Residual View per Degree
Inspect the residuals (errors) for each degree on the **test set**. A smaller spread around 0 with no visible pattern along x usually indicates a better fit.


In [None]:
for d in [1,2,3,4]:
    yhat_test = models[d].predict(X_test)
    residuals = y_test - yhat_test
    plt.figure()
    plt.scatter(X_test[:,0], residuals, s=16, alpha=0.8, label=f'residuals (deg {d})')
    plt.axhline(0, linestyle='--', linewidth=1)
    plt.title(f'Residuals vs x (Degree {d})')
    plt.xlabel('x')
    plt.ylabel('test residual')
    plt.legend()
    plt.tight_layout()
    plt.show()
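
The residual plots can be complemented with a numeric summary. The sketch below is a simplified stand-in, not the notebook's pipeline: it uses `np.polyfit` instead of the scikit-learn pipeline, draws its own data from the same cubic without outliers, and uses a plain slice as the train/test split. It summarizes spread only; a pattern along x still has to be checked visually.

```python
import numpy as np

# Assumption: fresh data from the same cubic, no outliers, seed chosen arbitrarily
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=300)
y = 0.6 * x**3 - 1.0 * x**2 + 2.2 * x + 1.0 + rng.normal(0, 2.5, size=300)

# Simple split: first 200 points for training, last 100 for testing
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]

resid_std = {}
for d in (1, 2, 3, 4):
    coefs = np.polyfit(x_train, y_train, deg=d)      # least-squares polynomial fit
    residuals = y_test - np.polyval(coefs, x_test)   # test residuals
    resid_std[d] = residuals.std()
    print(f"degree {d}: mean={residuals.mean():+.2f}  std={resid_std[d]:.2f}")
```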

## 4) What to Notice
- **Underfitting**: Degree **1**, and to a lesser extent degree **2**, cannot capture the cubic shape of the true signal → lower R² and higher error.
- **Right capacity**: Degree **3** matches the data-generating function (a cubic) → typically the best or near-best test performance.
- **Overfitting risk**: Degree **4** can **chase outliers/noise**. Because the models are nested, its train error is never higher than that of degree 3, but its test metrics can be worse.
- When you change the random seed or noise level, the gap may widen or narrow, and on a single small split the ranking can occasionally flip. The general lesson still holds: match model complexity to signal complexity.
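
One way to probe how stable the ranking is across seeds is to repeat the experiment and count which degree gets the lowest test MSE. This is a rough sketch under stated assumptions: it uses `np.polyfit` rather than the scikit-learn pipeline, omits the outliers, and holds out a random third of the points. The helper `best_degree` is a hypothetical name introduced here.

```python
import numpy as np
from collections import Counter

def true_fn(t):
    return 0.6 * t**3 - 1.0 * t**2 + 2.2 * t + 1.0

def best_degree(seed, n=120, noise_sd=2.5, degrees=(1, 2, 3, 4)):
    """Return the degree with the lowest test MSE for one random dataset."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=n)
    y = true_fn(x) + rng.normal(0, noise_sd, size=n)

    # Random hold-out: one third for testing, the rest for training
    perm = rng.permutation(n)
    n_test = n // 3
    test, train = perm[:n_test], perm[n_test:]

    test_mse = {}
    for d in degrees:
        coefs = np.polyfit(x[train], y[train], deg=d)
        test_mse[d] = np.mean((y[test] - np.polyval(coefs, x[test])) ** 2)
    return min(test_mse, key=test_mse.get)

# Count the winning degree over 20 seeds
wins = Counter(best_degree(seed) for seed in range(20))
print(dict(wins))
```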


### Try it: Change parameters
You can tweak `n`, noise levels, or the number of outliers and re-run the notebook to observe how the comparison changes.
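
To make such experiments easier, the data-generation steps could be wrapped in a function. The sketch below mirrors the steps of the data cell; the function name `make_data` and its parameters `noise_sd`, `outlier_sd`, and `seed` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

def make_data(n=120, noise_sd=2.5, n_outliers=4, outlier_sd=10.0, seed=42):
    """Cubic signal plus Gaussian noise, with a few randomly placed outliers."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=n)
    y = 0.6 * x**3 - 1.0 * x**2 + 2.2 * x + 1.0 + rng.normal(0, noise_sd, size=n)

    # Perturb a random subset of points with much larger noise
    idx_out = rng.choice(n, size=n_outliers, replace=False)
    y[idx_out] += rng.normal(0, outlier_sd, size=n_outliers)

    # Return X as a column matrix, as scikit-learn expects
    return x.reshape(-1, 1), y, idx_out

# Example: a smaller, cleaner dataset with two outliers
X_small, y_small, idx_small = make_data(n=60, noise_sd=1.0, n_outliers=2, seed=0)
print(X_small.shape, y_small.shape, idx_small)
```

The returned `X_small` and `y_small` can be passed to `train_test_split` and `fit_poly` in place of `X` and `y`.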
