# Chapter 4: Training Models

## 1. Chapter Overview
**Goal:** Up to this point, we have treated Machine Learning models as "black boxes". In this chapter, we open the box to see how they work under the hood. Understanding these details helps you quickly home in on the right model, the right training algorithm, and a good set of hyperparameters.

**Key Concepts:**
* **Linear Regression:** The Normal Equation and Computational Complexity.
* **Gradient Descent:** Batch, Stochastic, and Mini-batch GD.
* **Polynomial Regression:** Fitting complex data with linear models.
* **Learning Curves:** Analyzing overfitting and underfitting.
* **Regularized Linear Models:** Ridge, Lasso, and Elastic Net.
* **Logistic Regression:** Using regression for classification (estimating probabilities).

**Practical Skills:**
* Training models using the Normal Equation (`numpy.linalg`).
* Using `PolynomialFeatures` to transform data.
* Plotting Learning Curves to diagnose model performance.
* Implementing Logistic and Softmax Regression on the Iris dataset.

In [None]:
# Setup
import numpy as np
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## 2. Theoretical Explanation

### 1. Linear Regression
Linear Regression is one of the simplest models. It makes a prediction by computing a weighted sum of the input features, plus a constant called the bias term (or intercept).
$$ \hat{y} = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n $$

**How to train it?**
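We need to find the parameter vector $\theta$ that minimizes the Mean Squared Error (MSE) on the training set. There is a closed-form solution called the **Normal Equation**, where $X$ is the feature matrix (with a leading column of 1s for the bias term) and $y$ is the vector of target values: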
$$ \hat{\theta} = (X^T X)^{-1} X^T y $$
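
As a brief sketch of where this formula comes from (the symbol $m$ for the number of training instances is introduced here for illustration, and $X^T X$ is assumed to be invertible), write the MSE in matrix form:

$$ \text{MSE}(\theta) = \frac{1}{m} \lVert X\theta - y \rVert^2 $$

Setting its gradient with respect to $\theta$ to zero and solving gives the Normal Equation:

$$ \nabla_\theta \text{MSE}(\theta) = \frac{2}{m} X^T (X\theta - y) = 0 \quad\Rightarrow\quad X^T X \theta = X^T y \quad\Rightarrow\quad \hat{\theta} = (X^T X)^{-1} X^T y $$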

### 2. Gradient Descent
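The Normal Equation gets very slow when the number of features is large: it inverts $X^T X$, an $(n+1) \times (n+1)$ matrix, and the cost of that inversion typically grows between $O(n^{2.4})$ and $O(n^3)$ with the number of features $n$. Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea is to tweak the parameters iteratively in order to minimize a cost function: measure the local gradient of the cost with respect to $\theta$, then take a step in the direction of descending gradient, with the step size set by the learning rate hyperparameter.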

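* **Batch GD:** Computes the gradient over the whole training set at every step (stable, but slow on large datasets).
* **Stochastic GD:** Computes the gradient on a single randomly picked instance at every step (fast, but the cost bounces up and down instead of decreasing smoothly).
* **Mini-batch GD:** Computes the gradient on small random sets of instances (a compromise between the two).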
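
The three variants differ only in how many instances feed each gradient step, so they can be sketched with one function. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the learning rates, and the epoch counts are assumptions chosen for this toy problem, and the learning rate is kept constant (no learning schedule).

```python
import numpy as np

def gradient_descent(X_b, y, batch_size, eta=0.1, n_epochs=50, seed=42):
    """Minimize the MSE of a linear model with (mini-batch) Gradient Descent.

    batch_size == len(X_b) -> Batch GD
    batch_size == 1        -> Stochastic GD
    anything in between    -> Mini-batch GD
    """
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.standard_normal((n, 1))  # random initialization
    for epoch in range(n_epochs):
        indices = rng.permutation(m)  # reshuffle the instances every epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            X_batch, y_batch = X_b[batch], y[batch]
            # Gradient of the MSE computed on the current batch only
            gradients = (2 / len(batch)) * (X_batch.T @ (X_batch @ theta - y_batch))
            theta -= eta * gradients
    return theta

# Synthetic data of the same kind used in this chapter: y = 4 + 3x + noise
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance

theta_batch = gradient_descent(X_b, y, batch_size=100, eta=0.1, n_epochs=1000)
theta_sgd = gradient_descent(X_b, y, batch_size=1, eta=0.01, n_epochs=50)
theta_mini = gradient_descent(X_b, y, batch_size=20, eta=0.05, n_epochs=200)

print("Batch GD:     ", theta_batch.ravel())
print("Stochastic GD:", theta_sgd.ravel())
print("Mini-batch GD:", theta_mini.ravel())
```

With a constant learning rate, Batch GD settles on the least-squares solution, while the stochastic and mini-batch variants keep hovering close to it.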

### 3. Bias/Variance Tradeoff
* **Bias:** Error due to wrong assumptions (e.g., assuming data is linear when it is quadratic). Leads to underfitting.
* **Variance:** Error due to model sensitivity to small variations in training data. Leads to overfitting.
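
A rough way to see both effects is to fit polynomials of increasing degree to quadratic data and compare training and validation error. The sketch below uses `np.polyfit` for brevity; the helper names `make_quadratic` and `mse`, the degrees tried, and the data itself are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_quadratic(m):
    """Quadratic data with Gaussian noise: y = 0.5 x^2 + x + 2 + noise."""
    X = 6 * rng.random(m) - 3
    y = 0.5 * X**2 + X + 2 + rng.standard_normal(m)
    return X, y

X_train, y_train = make_quadratic(100)
X_val, y_val = make_quadratic(100)

def mse(coeffs, X, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, X) - y) ** 2)

results = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(X_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, X_train, y_train), mse(coeffs, X_val, y_val))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.2f}  "
          f"val MSE={results[degree][1]:.2f}")
```

The degree-1 model has high error on both sets (high bias). Raising the degree always lowers the training error, but past the true degree the validation error typically stops improving, which is the signature of variance.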

### 4. Regularization
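Regularization constrains a model in order to reduce overfitting. For linear models, this is typically done by adding a penalty on the weights to the cost function used during training:
* **Ridge (L2):** Adds the sum of the squared coefficients as a penalty term, which keeps the weights small.
* **Lasso (L1):** Adds the sum of the absolute values of the coefficients. It tends to set the weights of the least important features exactly to zero, so it effectively performs feature selection.
* **Elastic Net:** A weighted mix of both penalties.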
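
The effect of each penalty on the learned weights can be sketched with scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` classes. The dataset below is an assumption made up for illustration (five features, of which only the first two influence the target), and the `alpha` and `l1_ratio` values are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Hypothetical data: only features 0 and 1 matter, features 2-4 are pure noise
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(200)

models = {
    "Linear": LinearRegression(),                       # no penalty
    "Ridge": Ridge(alpha=10.0),                         # L2 penalty
    "Lasso": Lasso(alpha=0.3),                          # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.3, l1_ratio=0.5),  # mix of L1 and L2
}

coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_
    print(f"{name:10s}", np.round(model.coef_, 2))
```

On this data, Ridge shrinks every weight but leaves them all nonzero, while Lasso drives the weights of the three irrelevant features to exactly zero.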

## 3. Code Reproduction

### 3.1 Linear Regression using the Normal Equation
Let's generate some linear-looking data to test this.

In [None]:
import numpy as np

# Generate synthetic data
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()

In [None]:
# Compute the Normal Equation manually using NumPy
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

print("Best Theta calculated:\n", theta_best)

# Make predictions using the calculated theta
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)

# Plot the regression line
plt.plot(X_new, y_predict, "r-", label="Predictions")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.legend(loc="upper left")
plt.show()

### 3.2 Polynomial Regression
What if the data is not linear? We can use a linear model to fit nonlinear data by adding powers of each feature as new features.

In [None]:
# Generate quadratic data
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Transform training data: adds the square of each feature
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Fit Linear Regression on transformed data
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

# Plotting
X_new = np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)

plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.legend()
plt.show()

### 3.3 Learning Curves
Learning curves are plots of the model's performance on the training set and the validation set as a function of the training set size. They help detect underfitting and overfitting.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    """Plot training and validation RMSE as the training set grows."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    sizes = range(1, len(X_train) + 1)  # include the full training set
    for m in sizes:
        # Retrain on the first m instances, then score on both sets
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))

    # Pass the sizes explicitly so the x-axis shows the actual training set size
    plt.plot(sizes, np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(sizes, np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.ylabel("RMSE")
    plt.xlabel("Training set size")
    plt.legend()

# Plot for a simple Linear Regression (Underfitting)
plot_learning_curves(LinearRegression(), X, y)
plt.title("Underfitting Example (Linear Model on Quadratic Data)")
plt.show()
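
For contrast with the underfitting case, here is a sketch of learning curves for a deliberately over-complex model, a 10th-degree polynomial on the same kind of quadratic data. It uses scikit-learn's `learning_curve` helper with 5-fold cross-validation instead of a hand-written loop; the degree, the training set sizes, and the fold count are assumptions chosen for illustration, and the numbers are printed rather than plotted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Quadratic data, generated the same way as in section 3.2
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = (0.5 * X**2 + X + 2 + rng.standard_normal((100, 1))).ravel()

# A model that is far more flexible than the data requires
model = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

# With 100 instances and cv=5, each fold trains on at most 80 instances
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=[16, 32, 48, 64, 80],
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSEs: flip the sign, average over folds, take the root
train_rmse = np.sqrt(-train_scores.mean(axis=1))
val_rmse = np.sqrt(-val_scores.mean(axis=1))
for size, tr, va in zip(train_sizes, train_rmse, val_rmse):
    print(f"size={size:3d}  train RMSE={tr:.2f}  val RMSE={va:.2f}")
```

A training RMSE that stays well below the validation RMSE, with the gap narrowing as the training set grows, is the pattern usually associated with overfitting.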

### 3.4 Logistic Regression (Classification)
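Logistic Regression estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (the positive class, labeled 1); otherwise it predicts that it does not (the negative class, labeled 0).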

We use the **Iris dataset** to detect the "Iris-Virginica" type.

In [None]:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]  # petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0 (np.int was removed from NumPy)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X, y)

# Visualize the probabilities
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(8, 3))
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")
plt.xlabel("Petal width (cm)")
plt.ylabel("Probability")
plt.legend(loc="center left")
plt.show()
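
The same class can also separate all three Iris species at once (Softmax Regression). The sketch below assumes a recent scikit-learn version, where `LogisticRegression` with its default solver fits a multinomial (softmax) model when the target has more than two classes. The choice of two petal features, `C=10`, `max_iter=1000`, and the sample flower are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]  # petal length, petal width
y = iris["target"]           # three classes: 0, 1, 2

# A larger C means weaker regularization
softmax_reg = LogisticRegression(C=10, max_iter=1000, random_state=42)
softmax_reg.fit(X, y)

# A hypothetical flower with petals 5 cm long and 2 cm wide
sample = [[5.0, 2.0]]
predicted = softmax_reg.predict(sample)
proba = softmax_reg.predict_proba(sample)

print("Predicted class:", iris["target_names"][predicted[0]])
print("Class probabilities:", np.round(proba[0], 3))
```

`predict_proba` returns one probability per class, and each row sums to 1.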

## 4. Step-by-Step Explanation

### 1. Normal Equation
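* **Input:** A matrix `X_b` containing our features plus a leading column of 1s for the bias term, and the target vector `y`.
* **Process:** We evaluate $(X^T X)^{-1} X^T y$ using matrix multiplication and inversion. This is a closed-form solution, computed in one shot rather than approximated iteratively.
* **Output:** The $\theta$ values that minimize the MSE on the training set. For our synthetic data, generated as $y = 4 + 3x$ plus Gaussian noise, the result should be close to 4 (intercept) and 3 (slope); the noise keeps it from matching exactly.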
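
Explicitly inverting $X^T X$ fails when that matrix is singular (for example, with redundant features). As a sketch of two common alternatives, `np.linalg.lstsq` and the pseudoinverse `np.linalg.pinv` solve the same least-squares problem without forming the inverse. The data below is regenerated with its own random generator, so the exact numbers are an assumption and will differ from the cell in section 3.1.

```python
import numpy as np

# Synthetic data of the same kind as in section 3.1: y = 4 + 3x + noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance

# 1. Normal Equation with an explicit inverse
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# 2. Least-squares solver (no explicit inverse)
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)

# 3. Moore-Penrose pseudoinverse, defined even when X_b.T @ X_b is singular
theta_pinv = np.linalg.pinv(X_b) @ y

print("inv:  ", theta_inv.ravel())
print("lstsq:", theta_lstsq.ravel())
print("pinv: ", theta_pinv.ravel())
```

On well-conditioned data like this, all three give the same answer up to floating-point error.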

### 2. Polynomial Features
* **Problem:** A straight line cannot fit a curve (parabola).
* **Solution:** We square the existing feature $x$ to create $x^2$. Now the linear regression model sees two features ($x$ and $x^2$) and learns a formula like $y = \theta_0 + \theta_1 x + \theta_2 x^2$. This allows it to fit quadratic data while still being a "Linear Regression" algorithm mathematically, because the model remains linear in the parameters $\theta$.
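
To check this concretely, the sketch below inspects the transformed features and the learned parameters. The data is regenerated with its own random generator, so the exact values are an assumption and will differ from the cell in section 3.2; the names `theta_0`, `theta_1`, and `theta_2` simply mirror the formula above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data: y = 0.5 x^2 + x + 2 + noise
rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.standard_normal((m, 1))

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Each row now holds the original feature and its square
print("x        =", X[0])
print("[x, x^2] =", X_poly[0])

lin_reg = LinearRegression().fit(X_poly, y)
theta_0 = lin_reg.intercept_[0]
theta_1, theta_2 = lin_reg.coef_[0]
print(f"Learned: y = {theta_0:.2f} + {theta_1:.2f} x + {theta_2:.2f} x^2")
```

The learned parameters should land near the values used to generate the data (2, 1, and 0.5), with the noise accounting for the difference.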

### 3. Learning Curves Interpretation
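* **Underfitting:** Both training and validation errors are high, close to each other, and reach a plateau. Adding more data won't help; a more complex model or better features are needed.
* **Overfitting:** Training error is low, but validation error is noticeably higher (a gap between the curves). The model is too complex for the amount of data; more training data or regularization can help close the gap.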

### 4. Logistic Regression Decision Boundary
In the plot, the decision boundary is where the two probability curves cross (at 50%). If the petal width is greater than approximately 1.6 cm, the classifier predicts "Iris-Virginica"; below that, it predicts "Not Iris-Virginica".
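
The boundary can also be computed directly instead of being read off the plot. The estimated probability is the sigmoid of $\theta_0 + \theta_1 x$, which equals 50% exactly when $\theta_0 + \theta_1 x = 0$, so the boundary sits at $x = -\theta_0 / \theta_1$. A minimal sketch, refitting the same single-feature model (the exact value may vary slightly with the scikit-learn version and solver):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                # petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression(random_state=42).fit(X, y)

# The probability is 0.5 exactly where theta_0 + theta_1 * x = 0
theta_0 = log_reg.intercept_[0]
theta_1 = log_reg.coef_[0, 0]
decision_boundary = -theta_0 / theta_1
print(f"Decision boundary: {decision_boundary:.2f} cm")

# One petal width on each side of the boundary
predictions = log_reg.predict([[1.7], [1.5]])
print("Predictions for 1.7 cm and 1.5 cm:", predictions)
```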

## 5. Chapter Summary

* **Iterative Optimization:** Gradient Descent (and its variants) drives the training of many ML models, Neural Networks in particular. It tweaks the parameters step by step to minimize a cost function.
* **Polynomial Regression:** A powerful trick to add complexity to simple linear models.
* **Diagnostics:** Learning curves are essential tools to diagnose if your model needs more data, more features, or regularization.
* **Regularization:** Techniques like Ridge and Lasso are critical to prevent complex models from memorizing noise (overfitting).
* **Logistic Regression:** Despite the name "Regression", it is a classification algorithm used to estimate the probability of binary outcomes.