# Overfitting & Underfitting Experiment

## Research Question
How does model complexity affect generalization?

## Hypothesis
- Simple models underfit complex data
- Very complex models overfit training data
- There exists a balance that generalizes well

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [11]:
# Create non-linear data: y = x^2 plus Gaussian noise (std 3, variance 9)
rng = np.random.default_rng(seed=43)
X = rng.random((100, 1)) * 5  # 100 points, uniform on [0, 5)
y = X.squeeze()**2 + rng.standard_normal(100) * 3

In [12]:
# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=43
)

In [13]:
# Baseline: straight-line (degree 1) fit
lin = LinearRegression()
lin.fit(X_train, y_train)

train_mse_lin = mean_squared_error(y_train, lin.predict(X_train))
test_mse_lin = mean_squared_error(y_test, lin.predict(X_test))

train_mse_lin, test_mse_lin

(11.075137425110585, 14.606739463222393)

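### Linear Model (Underfitting)
- High training error (MSE ≈ 11.08)
- High test error (MSE ≈ 14.61), the worst of the three models
- Model too simple for the data: a straight line cannot follow the quadratic trend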

In [14]:
# Expand x into [1, x, x^2]; fit the transformer on the training data only
poly2 = PolynomialFeatures(degree=2)
X_train_2 = poly2.fit_transform(X_train)
X_test_2 = poly2.transform(X_test)

model2 = LinearRegression()
model2.fit(X_train_2, y_train)

train_mse_2 = mean_squared_error(y_train, model2.predict(X_train_2))
test_mse_2 = mean_squared_error(y_test, model2.predict(X_test_2))

train_mse_2, test_mse_2

(9.51454505570951, 10.791843273586142)

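### Polynomial Degree 2 (Good Fit)
- Lower training error (MSE ≈ 9.51)
- Lower test error (MSE ≈ 10.79)
- Balanced complexity: the model matches the quadratic form of the data, and both errors sit near the noise variance of 9 (standard deviation 3), which no model can be expected to beat on unseen data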
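To judge whether an error is "low", it helps to compare it with the noise floor. The sketch below regenerates the same data and scores the true function `x**2` on the held-out points. The names `noise_variance` and `oracle_test_mse` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same data-generating process and split as the cells above
rng = np.random.default_rng(seed=43)
X = rng.random((100, 1)) * 5
y = X.squeeze() ** 2 + rng.standard_normal(100) * 3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=43
)

# The noise has standard deviation 3, so its variance is 9.
# This is the expected MSE of a perfect model on unseen data.
noise_variance = 3.0 ** 2

# "Oracle" predictions: the true function with no fitting at all
# (illustrative only; in practice the true function is unknown).
oracle_test_mse = mean_squared_error(y_test, X_test.squeeze() ** 2)

print(f"noise variance:  {noise_variance:.2f}")
print(f"oracle test MSE: {oracle_test_mse:.2f}")
```

The oracle test MSE will differ from 9 because the test set is small. That difference gives a rough sense of how much the reported test errors can move from sampling noise alone.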

In [15]:
poly10 = PolynomialFeatures(degree=10)
X_train_10 = poly10.fit_transform(X_train)
X_test_10 = poly10.transform(X_test)

model10 = LinearRegression()
model10.fit(X_train_10, y_train)

train_mse_10 = mean_squared_error(y_train, model10.predict(X_train_10))
test_mse_10 = mean_squared_error(y_test, model10.predict(X_test_10))

train_mse_10, test_mse_10

(8.180622888730372, 10.206721949505837)

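### Polynomial Degree 10 (Overfitting)
- Lowest training error of the three models (MSE ≈ 8.18), consistent with the extra terms fitting noise
- Test error (MSE ≈ 10.21) is not higher on this split: it is slightly below the degree 2 result (10.79), and the train-test gap widens from about 1.28 to about 2.03
- The eight extra polynomial terms lower training error without a test improvement that a test set of this size can separate from sampling noise; the textbook overfitting penalty does not show up clearly in this run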
  
## Critical Insight (Read Carefully)

| Situation | Training Error | Test Error |
|---|---|---|
| Underfitting | High | High |
| Good fit | Low | Low |
| Overfitting | Very low | High |
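One way to see where a model falls in this pattern is to sweep the polynomial degree and compare training and test error side by side. This is a minimal sketch that reuses the notebook's data and split. The `results` dictionary and the use of `make_pipeline` are choices made here for brevity, not part of the original cells.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same data-generating process and split as the cells above
rng = np.random.default_rng(seed=43)
X = rng.random((100, 1)) * 5
y = X.squeeze() ** 2 + rng.standard_normal(100) * 3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=43
)

# Map each degree to (train MSE, test MSE)
results = {}
for degree in range(1, 11):
    # The pipeline fits the feature expansion on the training data only
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)

for degree, (train_mse, test_mse) in results.items():
    gap = test_mse - train_mse
    print(f"degree={degree:2d}  train={train_mse:6.2f}  test={test_mse:6.2f}  gap={gap:6.2f}")
```

Training error can only fall or stay flat as the degree grows, because each model contains the previous one as a special case. The test error and the train-test gap are the columns that indicate where added complexity stops helping.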

## Reflection
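- More complex models are not always better: extra flexibility lowers training error but does not guarantee lower test error
- Generalization, measured on data the model has not seen, is the true goal
- Evaluating both train and test error is essential, and a single small test set (30% of 100 points here) gives only a noisy estimate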


Whenever a model behaves badly, ask:

- Is it too simple?
- Is it too complex?
- Is my evaluation method flawed?
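The last question can be checked by replacing the single train/test split with k-fold cross-validation, which averages the error over several held-out folds. The sketch below is one possible setup: the choice of 5 folds, the `cv_mse` dictionary, and running the folds over the full dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same data-generating process as the cells above
rng = np.random.default_rng(seed=43)
X = rng.random((100, 1)) * 5
y = X.squeeze() ** 2 + rng.standard_normal(100) * 3

# Shuffled 5-fold split (the fold count is an assumption, not from the notebook)
cv = KFold(n_splits=5, shuffle=True, random_state=43)

# Map each degree to (mean MSE across folds, std of MSE across folds)
cv_mse = {}
for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn reports negated MSE so that higher scores are better
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = (-scores.mean(), scores.std())

for degree, (mean_mse, std_mse) in cv_mse.items():
    print(f"degree={degree:2d}  CV MSE={mean_mse:6.2f} (std across folds {std_mse:.2f})")
```

If the spread across folds is comparable to the difference between two models, a single split is not enough to rank them.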