In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 10, 6
np.random.seed(32371)


# Validation for hyperparameter selection for ridge regression
In this notebook, we will see how to use validation to select the optimal regularization coefficient for ridge regression, working on a synthetic dataset so as to compare our observations to the theoretical optimum.


In [None]:
def featurize_fourier(x, d, normalize = False):
    assert (d-1) % 2 == 0, "d must be odd"
    max_r = (d - 1) // 2
    n = len(x)
    A = np.zeros((n, d))
    A[:,0] = 1
    for d_ in range(1,max_r+1):
        A[:,2*(d_-1)+1] =  np.sin(d_*x*np.pi)
        A[:,2*(d_-1)+2] =  np.cos(d_*x*np.pi)

    if normalize:
        A[:,0] *= (1/np.sqrt(2))
        A *= np.sqrt(2)
    return A

def generate_X(n, d):
    """
    This function generates a random n x d synthetic data matrix X such that X^T X is approximately I
    """
    x = (np.random.uniform(-1, 1, n))
    return featurize_fourier(x, d, normalize=True)* 1./np.sqrt(n)


Before we begin, we should visualize our data matrix and verify that it satisfies the properties we want.


In [None]:
n = 1000
d = 63
X = generate_X(n, d)

# Show X and X^T X
plt.imshow(X)
plt.show()
plt.imshow((X.T @ X))

plt.show()
print("diag(X^T X):", np.diag(X.T @ X));


You should see that $X^T X$ is _approximately_ the identity, i.e. the columns of $X$ are approximately orthonormal. Think about why they are not exactly orthonormal. How would we change `generate_X` to ensure exact orthonormality?
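
One way to put a number on "approximately" is to measure how far $X^T X$ is from the identity as $n$ grows. The sketch below is illustrative only: `fourier_design` is a hypothetical, compact stand-in for `generate_X` (its columns are ordered differently, which does not affect the result), and the Frobenius norm is just one reasonable choice of distance.

```python
import numpy as np

def fourier_design(n, d, rng):
    """Hypothetical stand-in for generate_X: scaled Fourier features of uniform samples."""
    x = rng.uniform(-1, 1, n)
    r = (d - 1) // 2
    k = np.arange(1, r + 1)
    cols = [
        np.ones((n, 1)),                                  # constant feature
        np.sqrt(2) * np.sin(np.pi * x[:, None] * k),      # sine features
        np.sqrt(2) * np.cos(np.pi * x[:, None] * k),      # cosine features
    ]
    return np.hstack(cols) / np.sqrt(n)

rng = np.random.default_rng(0)
d = 63
deviations = {}
for n in [100, 1000, 10000]:
    X_check = fourier_design(n, d, rng)
    # Frobenius-norm distance between X^T X and the identity
    deviations[n] = np.linalg.norm(X_check.T @ X_check - np.eye(d))
    print(n, deviations[n])
```

The deviation should shrink as $n$ grows, since each entry of $X^T X$ is an empirical average over the random samples `x`.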

Now, we will write a function to generate the true weight vector $\vec{w}$ that we are trying to learn, as well as the perturbed observation $\vec{y} = X \vec{w} + \vec{\varepsilon}$ that we can actually see.


In [None]:
def generate_data(n, d, sigma=0.4, w=None, suppress_output=False):
    X = generate_X(n, d)

    # if w is not specified in the input, pick a
    # random weight for each of the `d` dimensions
    if w is None:
        w = np.random.randn(d)
        if not suppress_output:
            print("true w:", w)

    # the standard deviation of the measurement error
    eps = np.random.randn(n) * sigma
    y = X @ w + eps

    return X, w, y


Now we've got some synthetic data. Let's look at the basic steps we need to take to optimize our hyperparameter:
 - Pick different values of the hyperparameter $\lambda$
 - Train our model on some data using each particular value of $\lambda$
 - Evaluate the performance of our model on *different* data
 - Pick the $\lambda$ that yielded a model with the best performance.
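
The four steps above can be sketched on a small toy problem as follows. Everything here is an assumption made for illustration: the Gaussian toy data, the helper `fit_ridge`, the candidate grid, and the simple "last 25% of rows" hold-out are not the notebook's own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (assumed): 200 samples, 10 features, noisy linear labels.
n, d, sigma = 200, 10, 0.5
X_toy = rng.standard_normal((n, d))
w_toy = rng.standard_normal(d)
y_toy = X_toy @ w_toy + sigma * rng.standard_normal(n)

# Hold out the last 25% of rows; the rows are i.i.d. here, so their order is arbitrary.
n_val = n // 4
X_tr, y_tr = X_toy[:-n_val], y_toy[:-n_val]
X_va, y_va = X_toy[-n_val:], y_toy[-n_val:]

def fit_ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

candidate_lambdas = np.logspace(-3, 2, 30)                   # step 1: pick candidates
val_errors = []
for lam in candidate_lambdas:
    w_hat = fit_ridge(X_tr, y_tr, lam)                       # step 2: train
    val_errors.append(np.sum((y_va - X_va @ w_hat) ** 2))    # step 3: evaluate on held-out data

best_lambda = candidate_lambdas[int(np.argmin(val_errors))]  # step 4: pick the best
print("selected lambda:", best_lambda)
```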

Notice one thing here: when evaluating the performance of our model, we evaluate it on *different* data that was not used in training. Why? Fundamentally, this is to detect overfitting. When our model is trained on some input data, we can imagine that it tries to "memorize" the training data as best it can. If we then evaluated our model on the same training data, all we'd see is how well it memorized that data. In contrast, by evaluating it on *different* data, we are able to see how well it can generalize to new data it has not previously seen.

Therefore, we will need to **split our dataset up into two parts: data used for training, and data used for validation.** You will do so here.
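
As a point of reference, here is a minimal sketch of one common way to split a dataset: shuffle the row indices, then cut. The helper name `shuffled_split` and its `rng` argument are hypothetical and not part of this notebook. Shuffling matters mainly when the rows may be ordered in some way.

```python
import numpy as np

def shuffled_split(X, y, validation_fraction=0.2, rng=None):
    """Hypothetical helper: randomly assign rows to the validation and training sets."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    n_val = int(n * validation_fraction)
    perm = rng.permutation(n)                     # random ordering of row indices
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Tiny demo: row i of X_demo is [2i, 2i+1] and y_demo[i] = i, so alignment is easy to check.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr_demo, y_tr_demo, X_va_demo, y_va_demo = shuffled_split(
    X_demo, y_demo, 0.2, np.random.default_rng(0)
)
print(X_tr_demo.shape, X_va_demo.shape)
```

The same index arrays are applied to both `X` and `y`, which keeps each row paired with its label.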


In [None]:
# Generate a validation data set
# Does it matter how we generate the split in this case?
# How about in general?
# Hint: what if the data source is sorted in some way?

def generate_training_validation_split(X, y, validation_fraction = 0.2):
    # Returns a tuple of 4 components:
    #   X_train: (1-validation_fraction) rows of X
    #   y_train: (1-validation_fraction) rows of y
    #   X_val: the complement of X_train
    #   y_val: the complement of y_train
    n = X.shape[0]
    n_validation = int(n * validation_fraction)

    ### start Generate_Validation_Split ###

    ### end Generate_Validation_Split ###

    return X_train, y_train, X_val, y_val

sigma = 0.6
X, w, y = generate_data(1250, 63, sigma)
X_train, y_train, X_val, y_val = generate_training_validation_split(X, y)


In [None]:
X_train.shape


In [None]:
def ridge_regress(X, y, lambd):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lambd * np.eye(d), X.T @ y)


Let's first verify that our implementation of ridge regression is working. With $\lambda = 0$, ridge regression reduces to ordinary least squares. **Use the training data to compute an estimate $\hat{w}$ and plot it on the same axes as the true $\vec{w}$**, to see if they are close.
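
A quick independent sanity check of the $\lambda = 0$ case is to compare the closed-form solve against NumPy's least-squares routine. This is a sketch on assumed random toy data; `ridge_closed_form` is a hypothetical copy of the formula used in `ridge_regress`.

```python
import numpy as np

rng = np.random.default_rng(1)
X_small = rng.standard_normal((50, 5))   # assumed toy data, full column rank with probability 1
y_small = rng.standard_normal(50)

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ridge0 = ridge_closed_form(X_small, y_small, 0.0)
w_lstsq, *_ = np.linalg.lstsq(X_small, y_small, rcond=None)
print("lambda = 0 matches least squares:", np.allclose(w_ridge0, w_lstsq))

# A larger lambda shrinks the solution toward zero.
w_ridge10 = ridge_closed_form(X_small, y_small, 10.0)
print(np.linalg.norm(w_ridge10), "<", np.linalg.norm(w_ridge0))
```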


In [None]:
# Show w vs w_hat
### start Run_Least_Squares ###

### end Run_Least_Squares ###
plt.plot(w, label='w')
plt.plot(w_hat, label='w_hat')
plt.legend()
plt.show();


Now we can move on to hyperparameter tuning. One thing to notice is that we cannot actually choose $\lambda$ to minimize $\|\vec{w} - \hat{w}\|^2$ directly, since $\vec{w}$ is not known when working with real-world data. Instead, we should try to minimize $\|y_{val} - X_{val} \hat{w}\|^2$, i.e. the error our model makes on the validation data. Intuitively, these two quantities should track each other: in our case, where $X_{val}^T X_{val}$ is approximately a multiple of the identity, the validation error is approximately a scaled copy of the estimation error plus terms that come only from the validation noise, so the two are closely related but not exactly proportional.
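
To make the relationship concrete, here is a short expansion. It assumes the same model as above for the validation data, $\vec{y}_{val} = X_{val} \vec{w} + \vec{\varepsilon}_{val}$, and writes $\vec{e} = \vec{w} - \hat{w}$; the symbols $\vec{e}$, $\vec{\varepsilon}_{val}$ and $c$ are introduced here only for illustration.

$$\|\vec{y}_{val} - X_{val} \hat{w}\|^2 = \|X_{val} \vec{e} + \vec{\varepsilon}_{val}\|^2 = \vec{e}^T X_{val}^T X_{val} \vec{e} + 2 \vec{\varepsilon}_{val}^T X_{val} \vec{e} + \|\vec{\varepsilon}_{val}\|^2$$

If $X_{val}^T X_{val} = cI$ for some constant $c$, the first term is exactly $c\|\vec{w} - \hat{w}\|^2$. The last term does not depend on $\lambda$, and the cross term has zero mean because $\vec{\varepsilon}_{val}$ is zero-mean and independent of the training data.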

**Plot the validation error made after training the model using a range of lambdas. On the same graph, also plot the *true* estimation error $\|\vec{w} - \hat{w}\|^2$, on a separate vertical axis so the two plots are of similar scale.**


In [None]:
# Show estimation error of w for a range of lambdas
def plot_validation_errors(X_train, y_train, X_val, y_val, candidate_lambdas=np.logspace(-3, 0, 50), newplot=True):
    w_mses = []
    val_errors = []
    true_errors = []
    for l in candidate_lambdas:
        ### start Compute_Errors ###

        ### end Compute_Errors ###

    if newplot:
        plt.figure()
        ax1 = plt.subplot(2,1,1)
        plt.title(r"Error vs $\lambda$")  # raw string: "\l" is not a valid escape sequence
        plt.yscale('log')
        plt.xlabel(r"$\lambda$")

        color = 'tab:red'
        ax1.set_ylabel('Validation Error', color=color)
        ax1.tick_params(axis='y', labelcolor=color)
        ax1.plot(candidate_lambdas, val_errors, color=color)

        ax2 = plt.twinx()
        color = 'tab:blue'
        ax2.set_ylabel('Estimation Error', color=color)  # we already handled the x-label with ax1
        ax2.tick_params(axis='y', labelcolor=color)
        ax2.plot(candidate_lambdas, true_errors, color=color)

        plt.xscale('log')  # Need to set scale after axes are created
    else:
        plt.plot(candidate_lambdas, val_errors)

plot_validation_errors(X_train, y_train, X_val, y_val)


From the plot, you should see an obvious optimal value of $\lambda$ that minimizes the validation error (if you don't, try regenerating the data matrix; you might have just been unlucky), and another value of $\lambda$ that minimizes the true error. **Explain why these values are not the same**.

Then, **write a function that numerically computes the $\lambda$ that, when used to train a model on the training set, minimizes the validation error.**


_Your explanation here_:



In [None]:
def optimize_lambda(X_train, y_train, X_val, y_val, candidate_lambdas=np.logspace(-3, 0, 50)):
    ### start Optimize_Lambda ###

    ### end Optimize_Lambda ###


In [None]:
print("Best lambda = {}".format(optimize_lambda(X_train, y_train, X_val, y_val)))


Is this value the *exact optimal* choice of hyperparameter? No! Since we calculated it empirically from noisy data, it will itself be a noisy estimate of the optimal hyperparameter. It is important to note that *both* noise in the training data and, crucially, noise in the *validation data* will affect the "optimal" lambda that we obtain from validation.

Let's see an example of this. In this next part, we will hold the training data fixed, and keep resampling our validation data from the same distribution. For each set of validation data, we will perform this hyperparameter optimization process again, and obtain a new "optimal" value of lambda.

As a point of comparison, first **compute the optimal value of lambda.**
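
One possible reading of "optimal" is the $\lambda$ that minimizes the *expected* estimation error for a fixed $\vec{w}$. The sketch below works this out under the idealized assumptions that $X^T X = I$ holds exactly for the training data and that $\vec{\varepsilon}$ has i.i.d. zero-mean entries with variance $\sigma^2$; the notebook may intend a different definition.

$$\hat{w} = (X^T X + \lambda I)^{-1} X^T \vec{y} = \frac{\vec{w} + X^T \vec{\varepsilon}}{1 + \lambda}, \qquad \mathbb{E}\,\|X^T \vec{\varepsilon}\|^2 = d\sigma^2$$

$$\mathbb{E}\,\|\vec{w} - \hat{w}\|^2 = \frac{\lambda^2 \|\vec{w}\|^2 + d\sigma^2}{(1 + \lambda)^2}$$

Setting the derivative with respect to $\lambda$ to zero gives $\lambda \|\vec{w}\|^2 = d\sigma^2$, i.e. $\lambda^* = d\sigma^2 / \|\vec{w}\|^2$ under these assumptions. Since $\vec{w}$ is drawn with standard normal entries, $\|\vec{w}\|^2 \approx d$ and so $\lambda^* \approx \sigma^2$.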


In [None]:
def compute_optimal_lambda(d, sigma, w):
    ### start Optimal_Lambda ###

    ### end Optimal_Lambda ###

optimal_lambda = compute_optimal_lambda(d, sigma, w)


In [None]:
def repeatedly_optimize_lambda(n, d, validation_fraction = 0.2, num_samples = 50, sigma=0.4):
    # first, we will compute a *fixed* set of training data, so the
    # only source of randomness is from the validation set
    X_train, w, y_train = generate_data(int(n * (1 - validation_fraction)), d, sigma, suppress_output=True)

    plt.title(r"Error vs $\lambda$")  # raw string: "\l" is not a valid escape sequence
    plt.yscale('log')
    plt.xlabel(r"$\lambda$")
    ax0 = plt.gca()
    plt.twinx()
    plt.yticks([], "")

    optimized_lambdas = []

    for i in range(num_samples):
        X_val, _, y_val = generate_data(int(n * validation_fraction), d, sigma, w) # pass w in to keep it fixed
        optimized_lambdas.append(optimize_lambda(X_train, y_train, X_val, y_val))
        plot_validation_errors(X_train, y_train, X_val, y_val, newplot=False)
        plt.twinx()
        plt.yticks([], "")

    # Have to set the scale last or the plots won't line up correctly
    plt.xscale('log')
    ax0.set_ylabel('Normalized Error')
    return optimized_lambdas, w

sigma=0.6
optimized_lambdas, w = repeatedly_optimize_lambda(n=10000, d=d, validation_fraction=0.5, sigma=sigma)
print("Mean optimized lambda: {}".format(np.mean(optimized_lambdas)))
print("Stdev of optimized lambdas: {}".format(np.std(optimized_lambdas)))
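# Recompute the optimum for the w returned above; the earlier optimal_lambda used a different w
print("Optimal lambda: {}".format(compute_optimal_lambda(d, sigma, w)))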
plt.show()

plt.xlabel(r"$\lambda$")
plt.ylabel("Frequency")
plt.hist(optimized_lambdas, bins=25)
plt.show();


Adjust `n` and `validation_fraction` to keep the size of the training set constant while varying the size of the validation set. The code block below might help with that.

How do the accuracy and variance of the tuned lambdas vary as the size of the validation set increases? Why might the variance be large even with a large validation set?

How do they change with the size of the observation noise, `sigma`?


In [None]:
n_train = 5000
sigma=0.3
for vf in [.1, .2, .4, .6, .8, .9]:
    n_total = int(n_train / (1 - vf))
    print('-' * 40)
    print("Validation Fraction:", vf)
    optimized_lambdas, w = repeatedly_optimize_lambda(n_total, d, vf, sigma=sigma)
    plt.show()
    print("Mean optimized lambda: {}".format(np.mean(optimized_lambdas)))
    print("Stdev of optimized lambdas: {}".format(np.std(optimized_lambdas)))
    print("Optimal lambda: {}".format(compute_optimal_lambda(d=d, sigma=sigma, w=w)))


_Your observations here:_

