# Overfitting 

## Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## Overfitting

Overfitting is a phenomenon in machine learning and statistics where a model learns the training data too closely, including its noise and outliers, rather than generalizing from the underlying distribution. The result is often a model that performs exceptionally well on the training data but poorly on unseen (test) data.

### Characteristics of Overfitting:

1. **High Variance**: The fitted model is highly sensitive to the particular training sample, typically because it has too many parameters for the amount of data. In regression problems this often shows up as a wavy or erratic curve.
2. **Low Bias**: The model is flexible enough to fit the training data very accurately, often with almost no training error.
3. **Poor Generalization**: Despite the low training error, the model has a much higher test error.
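The gap between training and test error described above can be measured directly. The following is a minimal sketch, assuming synthetic noisy sine data and a fixed 80/20 split (the seed, sample size, and the `errors` dictionary are illustrative choices, not part of the original notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): noisy samples of a sine curve.
rng = np.random.RandomState(0)
X = rng.rand(30)[:, np.newaxis]
y = np.sin(2 * np.pi * X.ravel()) + rng.randn(30) * 0.1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

errors = {}  # degree -> (training MSE, test MSE)
for degree in (1, 4, 15):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. A test error that stops following it, or rises sharply, is the usual sign of overfitting.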

### Causes of Overfitting:

- **Too many features or parameters** relative to the number of observations.
- **Lack of regularization**: without a penalty on the model parameters, nothing discourages the model from becoming overly complex.
- **Noisy data**: the model may learn noise in the training data as if it were real structure.
- **Too few data points** to train on.
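To illustrate how the ratio of parameters to observations matters, the sketch below fits the same degree 15 polynomial to a small and a larger training set and compares the error on a separate test set. The data generator `noisy_sine`, the sample sizes, and the seed are hypothetical choices made for this example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def noisy_sine(n, rng):
    """Draw n noisy samples of sin(2*pi*x) with x uniform on [0, 1)."""
    X = rng.rand(n)[:, np.newaxis]
    y = np.sin(2 * np.pi * X.ravel()) + rng.randn(n) * 0.1
    return X, y


rng = np.random.RandomState(0)
X_test, y_test = noisy_sine(1000, rng)  # large independent test set

test_mse = {}  # number of training points -> test MSE
for n_train in (20, 200):
    X_train, y_train = noisy_sine(n_train, rng)
    model = make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    test_mse[n_train] = mean_squared_error(y_test, model.predict(X_test))
    print(f"n_train = {n_train:3d}: test MSE = {test_mse[n_train]:.4f}")
```

With 20 points the model has almost as many coefficients as observations, so it can nearly interpolate the noise. With 200 points the same model is much better constrained.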

### Prevention:

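1. **Regularization**: Techniques like L1 (Lasso) and L2 (Ridge) regularization add a penalty on the size of the model coefficients, which discourages overly complex fits and reduces the risk of overfitting.
2. **Cross-validation**: Estimates how the model performs on unseen data by repeatedly training on part of the data and evaluating on the held-out remainder.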
3. **Pruning**: For decision trees, pruning can remove branches that have little power in predicting target values.
4. **Adding more data**: Sometimes, increasing the volume of data can help in reducing overfitting.
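As a rough illustration of the regularization item, the sketch below compares an unpenalized degree 15 fit with an L2-penalized one using scikit-learn's `Ridge`. The penalty strength `alpha=1e-3` and the synthetic data are assumptions chosen for the example, not tuned values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): noisy samples of a sine curve.
rng = np.random.RandomState(0)
X = rng.rand(30)[:, np.newaxis]
y = np.sin(2 * np.pi * X.ravel()) + rng.randn(30) * 0.1

coef_norm = {}  # estimator name -> Euclidean norm of the fitted coefficients
for name, regressor in (("ols", LinearRegression()), ("ridge", Ridge(alpha=1e-3))):
    model = make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False),
        regressor,
    )
    model.fit(X, y)
    # model[-1] is the final step of the pipeline, i.e. the fitted regressor.
    coef_norm[name] = np.linalg.norm(model[-1].coef_)
    print(f"{name:5s}: coefficient norm = {coef_norm[name]:.3g}")
```

The penalty shrinks the coefficients, which in turn damps the large oscillations a high-degree polynomial would otherwise produce. In practice `alpha` would be chosen by cross-validation rather than fixed by hand.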

In essence, the main goal is to strike a balance between bias (underfitting) and variance (overfitting) to create a model that generalizes well to new data.
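One common way to look for that balance is to let cross-validation choose the model complexity. The sketch below scores polynomial degrees 1 through 15 with 5-fold cross-validation and picks the degree with the lowest mean squared error. The fold count, seed, and degree range are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): noisy samples of a sine curve.
rng = np.random.RandomState(0)
X = rng.rand(30)[:, np.newaxis]
y = np.sin(2 * np.pi * X.ravel()) + rng.randn(30) * 0.1

# Shuffle the folds so each one covers the whole input range.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}  # degree -> mean cross-validated MSE
for degree in range(1, 16):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn maximizes scores, so MSE is reported negated.
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print(f"best degree by 5-fold CV: {best_degree} (MSE = {cv_mse[best_degree]:.4f})")
```

The selected degree depends on the data and on the random folds, so it should be read as an estimate rather than a fixed answer.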


In [None]:
# Generate some data
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]  # Linear, some polynomial, and high degree polynomial

X = np.sort(np.random.rand(n_samples))
y = np.sin(2 * np.pi * X) + np.random.randn(n_samples) * 0.1

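# Reshape X into a column vector of shape (n_samples, 1), since scikit-learn
# expects 2-D inputs, then hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X[:, np.newaxis], y, test_size=0.2)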

plt.figure(figsize=(14, 5))
for i, degree in enumerate(degrees):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X_train, y_train)  # X_train is already 2-D, so no reshaping is needed

    # Evaluate on the held-out test set; score() returns R^2 for regressors
    # (this is a single train/test split, not cross-validation)
    scores = pipeline.score(X_test, y_test)

    X_test_fit = np.linspace(0, 1, 100)[:, np.newaxis]
    plt.plot(X_test_fit, pipeline.predict(X_test_fit), label="Model")
    plt.scatter(X_train, y_train, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(f"Degree {degree}\nTest $R^2$ = {scores:.2f}")

plt.show()


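In this example, we fit polynomial regressions of degree 1 (linear), 4, and 15 to noisy samples from a sine curve. The degree 15 model clearly overfits the training data: its curve is very wavy because it follows the noise in the training points. The degree 1 model underfits, since a straight line cannot follow the shape of the sine. The degree 4 model offers a balanced fit. The score in each panel title is the $R^2$ on the held-out 20% of the data, as returned by `pipeline.score`.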