# Training Models

This notebook contains sample code and the **solutions to the Chapter 4 assignment**.

Credit: O'Reilly book *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*

## Set-up

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

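# Scikit-Learn ≥0.20 is required
import sklearn
# compare numerically: a plain string comparison would misorder versions such as "0.9" and "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)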

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Assignment

## 1.

Which Linear Regression training algorithm can you use if you have a training
set with millions of features?

Answer:

If you have a training set with millions of features you can use Stochastic Gradient
Descent or Mini-batch Gradient Descent, and perhaps Batch Gradient
Descent if the training set fits in memory. But you cannot use the Normal Equation
or the SVD approach because the computational complexity grows quickly
(more than quadratically) with the number of features.
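
As a rough illustration of why Mini-batch Gradient Descent scales with the number of features, here is a minimal sketch for Linear Regression. It never forms or inverts $X^T X$; each update only touches one mini-batch. The function name `minibatch_gd`, the toy data, and the hyperparameter values are assumptions chosen for this example, not part of the original assignment.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.05, n_epochs=50, batch_size=32, seed=42):
    """Fit Linear Regression weights with Mini-batch Gradient Descent.

    X is expected to already contain a bias column of ones.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(n_epochs):
        indices = rng.permutation(m)  # reshuffle the instances every epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            # MSE gradient computed on the mini-batch only
            gradients = 2 / len(batch) * X_b.T @ (X_b @ theta - y_b)
            theta -= eta * gradients
    return theta

# Toy data (hypothetical): y = 4 + 3*x1 - 2*x2 + small noise
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(1000, 2))
y_toy = 4 + X_raw @ np.array([3.0, -2.0]) + 0.01 * rng.normal(size=1000)
X_toy = np.c_[np.ones(len(X_raw)), X_raw]

theta_hat = minibatch_gd(X_toy, y_toy)
print(theta_hat)
```

The cost of one update grows linearly with the number of features, which is what makes this family of algorithms usable when the Normal Equation is not.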

## 2.

Suppose the features in your training set have very different scales. Which algorithms might suffer from this, and how? What can you do about it?

Answer:

If the features in your training set have very different scales, the cost function will
have the shape of an elongated bowl, so the Gradient Descent algorithms will take
a long time to converge. To solve this you should scale the data before training
the model. Note that the Normal Equation or SVD approach will work just fine
without scaling. Moreover, regularized models may converge to a suboptimal solution
if the features are not scaled: since regularization penalizes large weights,
features with smaller values will tend to be ignored compared to features with
larger values.
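
A minimal sketch of standardization in plain NumPy is shown below. The helper name `standardize` and the two toy features are assumptions made for this example; Scikit-Learn's `StandardScaler` performs the same kind of transformation. The statistics are computed on the training portion only and then reused for the other split.

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale features to zero mean and unit variance using training statistics only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X_train - mean) / std, (X_other - mean) / std

rng = np.random.default_rng(42)
# Two hypothetical features on very different scales
X_demo = np.c_[rng.normal(0, 1, 200), rng.normal(5000, 1000, 200)]
X_tr, X_va = X_demo[:150], X_demo[150:]

X_tr_scaled, X_va_scaled = standardize(X_tr, X_va)

# Condition number of X^T X: the closer to 1, the rounder the bowl
cond_raw = np.linalg.cond(X_tr.T @ X_tr)
cond_scaled = np.linalg.cond(X_tr_scaled.T @ X_tr_scaled)
print(cond_raw, cond_scaled)
```

A much smaller condition number after scaling is one way to see that the elongated bowl has become closer to round.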

## 3.

Can Gradient Descent get stuck in a local minimum when training a Logistic
Regression model?

Answer:

Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.

## 4.

Do all Gradient Descent algorithms lead to the same model, provided you let
them run long enough?

Answer:

If the optimization problem is convex (such as Linear Regression or Logistic
Regression), and assuming the learning rate is not too high, then all Gradient
Descent algorithms will approach the global optimum and end up producing
fairly similar models. However, unless you gradually reduce the learning rate,
Stochastic GD and Mini-batch GD will never truly converge; instead, they will
keep jumping back and forth around the global optimum. This means that even
if you let them run for a very long time, these Gradient Descent algorithms will
produce slightly different models.


## 5.

Suppose you use Batch Gradient Descent and you plot the validation error at
every epoch. If you notice that the validation error consistently goes up, what is
likely going on? How can you fix this?

Answer:

If the validation error consistently goes up after every epoch, then one possibility
is that the learning rate is too high and the algorithm is diverging. If the training
error also goes up, then this is clearly the problem and you should reduce the
learning rate. However, if the training error is not going up, then your model is
overfitting the training set and you should stop training.

## 6.

Is it a good idea to stop Mini-batch Gradient Descent immediately when the
validation error goes up?

Answer:

Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch
Gradient Descent is guaranteed to make progress at every single training iteration.
So if you immediately stop training when the validation error goes up, you
may stop much too early, before the optimum is reached. A better option is to
save the model at regular intervals; then, when it has not improved for a long
time (meaning it will probably never beat the record), you can revert to the best
saved model.
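
The "save the best model and revert" idea can be sketched as a small patience loop. Everything here is illustrative: `train_with_patience`, the `step` and `val_loss` callables, and the `patience` value are hypothetical names chosen for this example, and the noisy quadratic stands in for a real training problem.

```python
import numpy as np

def train_with_patience(step, val_loss, params, max_iters=1000, patience=20):
    """Run `step` repeatedly and remember the parameters with the lowest validation loss.

    step(params) returns updated parameters; val_loss(params) returns a float.
    Training stops once the loss has not improved for `patience` iterations.
    """
    best_loss, best_params, since_best = np.inf, params.copy(), 0
    for iteration in range(max_iters):
        params = step(params)
        loss = val_loss(params)
        if loss < best_loss:
            best_loss, best_params, since_best = loss, params.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_params, best_loss

# Toy problem: noisy gradient steps on the loss ||w||^2
rng = np.random.default_rng(42)
noisy_step = lambda w: w - 0.1 * (2 * w + rng.normal(scale=0.5, size=w.shape))
val = lambda w: float(np.sum(w ** 2))

w0 = np.array([3.0, -2.0])
best_w, best = train_with_patience(noisy_step, val, w0)
print(best_w, best)
```

Unlike stopping at the first increase, this tolerates a few bad iterations and returns the best snapshot rather than the last one.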

## 7.

Which Gradient Descent algorithm (among those we discussed) will reach the
vicinity of the optimal solution the fastest? Which will actually converge? How
can you make the others converge as well?

Answer:

Stochastic Gradient Descent has the fastest training iteration since it considers
only one training instance at a time, so it is generally the first to reach the vicinity
of the global optimum (or Mini-batch GD with a very small mini-batch size).
However, only Batch Gradient Descent will actually converge, given enough
training time. As mentioned, Stochastic GD and Mini-batch GD will bounce
around the optimum, unless you gradually reduce the learning rate.
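
One way to "gradually reduce the learning rate" is a simple decaying schedule. The sketch below applies it to Stochastic Gradient Descent for Linear Regression; the schedule form `t0 / (t + t1)`, its constants, and the toy data are assumptions picked for illustration.

```python
import numpy as np

def learning_schedule(t, t0=5, t1=50):
    """Learning rate that shrinks as the step counter t grows (t0, t1 are illustrative)."""
    return t0 / (t + t1)

def sgd_linear(X, y, n_epochs=50, seed=42):
    """Stochastic Gradient Descent for Linear Regression with a decaying learning rate."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = rng.normal(size=n)
    for epoch in range(n_epochs):
        for i in range(m):
            idx = rng.integers(m)  # pick one training instance at random
            xi, yi = X[idx], y[idx]
            gradients = 2 * xi * (xi @ theta - yi)
            eta = learning_schedule(epoch * m + i)
            theta -= eta * gradients
    return theta

# Toy data (hypothetical): y = 4 + 3*x + noise
rng = np.random.default_rng(0)
x_toy = 2 * rng.random(100)
y_toy = 4 + 3 * x_toy + 0.1 * rng.normal(size=100)
X_toy = np.c_[np.ones(100), x_toy]

theta_sgd = sgd_linear(X_toy, y_toy)
theta_ls = np.linalg.lstsq(X_toy, y_toy, rcond=None)[0]  # least-squares reference
print(theta_sgd, theta_ls)
```

With a constant learning rate the estimate would keep jittering around the least-squares solution; with the decaying rate the jitter shrinks over time.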

## 8.

Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?

Answer:

If the validation error is much higher than the training error, this is likely because
your model is overfitting the training set. One way to try to fix this is to reduce
the polynomial degree: a model with fewer degrees of freedom is less likely to
overfit. Another thing you can try is to regularize the model, for example by
adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This
will also reduce the degrees of freedom of the model. Lastly, you can try to
increase the size of the training set.
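
To make the regularization remedy concrete, the following sketch fits a degree-10 polynomial to 20 noisy points with a closed-form Ridge solution and prints how the weight norm shrinks as `alpha` grows. The helper `ridge_closed_form`, the toy data, and the `alpha` values are assumptions for this example, and the scaling of `alpha` follows one common convention.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Closed-form Ridge solution; column 0 (the bias) is left unregularized."""
    A = np.eye(X.shape[1])
    A[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + alpha * A, X.T @ y)

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=20)
y_noisy = 0.5 * x**2 + x + 2 + rng.normal(scale=0.3, size=20)

# Degree-10 polynomial features: columns 1, x, x^2, ..., x^10
X_poly = np.vander(x, 11, increasing=True)

weight_norms = []
for alpha in (1e-4, 1e-2, 1.0):
    theta = ridge_closed_form(X_poly, y_noisy, alpha)
    weight_norms.append(np.linalg.norm(theta[1:]))  # norm of the non-bias weights
    print(alpha, weight_norms[-1])
```

Smaller weights mean a smoother curve, which is how the penalty reduces the effective degrees of freedom.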

## 9.

Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

Answer:

If both the training error and the validation error are almost equal and fairly
high, the model is likely underfitting the training set, which means it has a high
bias. You should try reducing the regularization hyperparameter α.

## 10.

Why would you want to use:

a. Ridge Regression instead of plain Linear Regression (i.e., without any
regularization)?

b. Lasso instead of Ridge Regression?

c. Elastic Net instead of Lasso?

Answer:

a) A model with some regularization typically performs better than a model
without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression.

b) Lasso Regression uses an ℓ1 penalty, which tends to push the weights down to
exactly zero. This leads to sparse models, where all weights are zero except for
the most important weights. This is a way to perform feature selection automatically,
which is good if you suspect that only a few features actually matter.
When you are not sure, you should prefer Ridge Regression.

c) Elastic Net is generally preferred over Lasso since Lasso may behave erratically
in some cases (when several features are strongly correlated or when there are
more features than training instances). However, it does add an extra hyperparameter
to tune. If you want Lasso without the erratic behavior, you can just
use Elastic Net with an `l1_ratio` close to 1.
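
The sparsity claim in answer b) can be illustrated with the shrinkage step that each penalty induces. The sketch below compares soft-thresholding (the proximal step for an ℓ1 penalty, used inside many Lasso solvers) with the proportional shrinkage produced by an ℓ2 penalty. It is not a full Lasso or Ridge solver, and the weight vector and threshold are made up for the example.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal step for an l1 penalty: shrink toward zero and clip at exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l2_shrink(w, t):
    """Proximal step for (t/2) * ||w||^2: rescales weights but never zeroes them."""
    return w / (1.0 + t)

w = np.array([2.0, -0.3, 0.05, -1.2, 0.1])  # hypothetical weights

w_l1 = soft_threshold(w, 0.5)
w_l2 = l2_shrink(w, 0.5)
print(w_l1)
print(w_l2)
print(np.count_nonzero(w_l1), np.count_nonzero(w_l2))
```

Only the large weights survive the ℓ1 step, while the ℓ2 step keeps every weight nonzero, just smaller.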

## 11.

Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

Answer:

If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.
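
A small sketch of the difference, using made-up scores for a single picture: two sigmoid outputs give independent yes/no probabilities, whereas a softmax forces the outputs to compete for a total of 1. The function names and the score values are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(logits):
    """Row-wise softmax (shifted by the row maximum for numerical safety)."""
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical scores for one picture: [outdoor, daytime]
scores = np.array([[2.0, 1.5]])

independent = sigmoid(scores)     # two separate binary classifiers
exclusive = softmax_rows(scores)  # one multiclass classifier

print(independent, independent.sum())
print(exclusive, exclusive.sum())
```

With the sigmoid outputs, the picture can be confidently both outdoor and daytime; with softmax, raising one probability necessarily lowers the other.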

## 12.

Implement Batch Gradient Descent with early stopping for Softmax Regression
(without using Scikit-Learn).

## Load the Iris dataset

In [None]:
# Load X and Y data
from sklearn import datasets

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

Add the bias term for every instance ($x_0 = 1$):



In [None]:
# add bias term
X_with_bias = np.c_[np.ones([len(X), 1]), X]

# set the random seed
np.random.seed(42)

## Split Data Manually

In [None]:
# compute ratio
test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

# compute formula
test_size = int(total_size * test_ratio)
validation_size = int(total_size * validation_ratio)
train_size = total_size - test_size - validation_size

rnd_indices = np.random.permutation(total_size)

# split X and Y data
X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]

**The targets are currently class indices (0, 1 or 2)**

# Data Transformation

In [None]:
# vector of class indices
y_train[:10]

This function converts the **vector of class indices** into **a matrix containing a one-hot vector** for each instance.

In [6]:
# one-hot vector converter
def to_one_hot(y):
    n_classes = y.max() + 1
    m = len(y)
    Y_one_hot = np.zeros((m, n_classes))
    Y_one_hot[np.arange(m), y] = 1
    return Y_one_hot
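
One thing to watch: because `n_classes` is inferred from `y.max()`, a subset that happens to lack the highest class would produce a narrower matrix. A possible variant, sketched here under the assumption that the number of classes is known in advance, takes it as an explicit argument. The name `to_one_hot_fixed` is hypothetical.

```python
import numpy as np

def to_one_hot_fixed(y, n_classes):
    """One-hot encode class indices using an explicit number of classes."""
    return np.eye(n_classes)[y]  # row k of the identity matrix is the one-hot vector for class k

# A subset that contains no instance of class 2
y_subset = np.array([0, 1, 1, 0])
Y_subset = to_one_hot_fixed(y_subset, n_classes=3)
print(Y_subset.shape)  # → (4, 3)
```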

In [None]:
# matrix containing a one-hot vector for each instance
to_one_hot(y_train[:10])

**Create the target class probabilities matrix for the training, validation, and test sets**


In [None]:
# create target class probabilities matrix
Y_train_one_hot = to_one_hot(y_train)
Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)

# Select Model: Softmax Regression

In [None]:
# define softmax function
def softmax(logits):
    exps = np.exp(logits)
    exp_sums = np.sum(exps, axis=1, keepdims=True)
    return exps / exp_sums
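
The softmax above works for the small logits in this exercise, but `np.exp` overflows for large inputs. A common remedy, sketched below as an optional variant, is to subtract each row's maximum before exponentiating; this leaves the result mathematically unchanged. The name `stable_softmax` and the test logits are assumptions for this example.

```python
import numpy as np

def stable_softmax(logits):
    """Softmax that subtracts each row's maximum before exponentiating."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

big_logits = np.array([[1000.0, 1001.0, 1002.0]])  # would overflow a naive softmax
small_logits = np.array([[0.0, 1.0, 2.0]])         # same differences, small values

p_big = stable_softmax(big_logits)
p_small = stable_softmax(small_logits)
print(p_big)
print(p_small)
```

Both rows give the same probabilities because softmax depends only on the differences between logits.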

**Define number of inputs and outputs**

In [None]:
# 2 features plus the bias term
n_inputs = X_train.shape[1]

# 3 iris classes
n_outputs = len(np.unique(y_train))

# Train Model

In [None]:
# translating the Softmax Regression math equations into Python code
eta = 0.01
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    if iteration % 500 == 0:
        loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
        print(iteration, loss)
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients
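
For reference, the loop above appears to implement the standard cross-entropy cost and its gradient for Softmax Regression. In the notation assumed here, $m$ is the number of training instances, $K$ the number of classes, $y_k^{(i)}$ the one-hot target, and $\hat{p}_k^{(i)}$ the predicted probability that instance $i$ belongs to class $k$:

$$
J(\boldsymbol{\Theta}) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)
$$

$$
\nabla_{\boldsymbol{\theta}^{(k)}} J(\boldsymbol{\Theta}) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right)\mathbf{x}^{(i)}
$$

Stacking the gradients for all classes gives the matrix form $\frac{1}{m}\mathbf{X}^\top(\hat{\mathbf{P}} - \mathbf{Y})$, which corresponds to `1/m * X_train.T.dot(error)`. The small constant `epsilon` in the code only guards against computing $\log(0)$; it is not part of the cost function itself.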

In [None]:
# observe model parameter
Theta

# Make Predictions

In [None]:
logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

# Model Evaluation: Accuracy Score

In [None]:
accuracy_score = np.mean(y_predict == y_valid)
accuracy_score

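Now let's add early stopping. For this we just need to measure the loss on the validation set at every iteration and stop when the error starts growing. The cell below also adds an ℓ2 penalty (controlled by `alpha`, and not applied to the bias term) to both the gradients and the loss, and raises the learning rate to 0.1.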

In [None]:
eta = 0.1 
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1  # regularization hyperparameter
best_loss = np.inf  # np.infty was removed in NumPy 2.0

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

    logits = X_valid.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba + epsilon), axis=1))
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))
    loss = xentropy_loss + alpha * l2_loss
    if iteration % 500 == 0:
        print(iteration, loss)
    if loss < best_loss:
        best_loss = loss
    else:
        print(iteration - 1, best_loss)
        print(iteration, loss, "early stopping!")
        break

**Make prediction**

In [None]:
logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

**Model Evaluation**

In [None]:
# accuracy score
accuracy_score = np.mean(y_predict == y_valid)
accuracy_score

# Visualization

Plot the model's predictions on the whole dataset

In [None]:
x0, x1 = np.meshgrid(
        np.linspace(0, 8, 500).reshape(-1, 1),
        np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]
X_new_with_bias = np.c_[np.ones([len(X_new), 1]), X_new]

logits = X_new_with_bias.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

zz1 = Y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()

**Make predictions on the test set**

In [None]:
logits = X_test.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

**Model Evaluation**

In [None]:
accuracy_score = np.mean(y_predict == y_test)
accuracy_score