# [HW2] Practice With Augmented Regression

Import necessary Python packages and seed the random number generator.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 10, 6
np.random.seed(174515)


## Helper Functions

**Fill in these functions using your code from the hyperparameter tuning problem on this homework.**
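As a reminder of what the ridge helper is expected to compute, here is a minimal sketch on synthetic data. It assumes the un-normalized objective $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$; if your earlier solution scales the loss by $1/n$, the matching $\lambda$ differs by that factor. The name `ridge_closed_form` and the `*_demo` arrays are hypothetical and not part of the assignment.

```python
import numpy as np

def ridge_closed_form(X, y, lambd):
    """Hypothetical helper: solve (X^T X + lambd * I) w = X^T y."""
    d = X.shape[1]
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lambd * np.eye(d), X.T @ y)

# Synthetic data (an assumption for illustration, not the wine dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
w_true = np.arange(1.0, 6.0)
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=50)

w_small = ridge_closed_form(X_demo, y_demo, 1e-3)  # nearly ordinary least squares
w_large = ridge_closed_form(X_demo, y_demo, 1e3)   # heavily shrunk toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

A larger $\lambda$ shrinks the weights, which is the behavior the validation sweep later in the notebook trades off against fit.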


In [None]:
def ridge_regress(X, y, lambd):
    ### start ridge ###

    ### end ridge ###


In [None]:
def optimize_lambda(X_train, y_train, X_val, y_val, candidate_lambdas=np.logspace(-3, 0, 50)):
    ### start Optimize_Lambda ###

    ### end Optimize_Lambda ###


In [None]:
def plot_validation_errors(X_train, y_train, X_val, y_val, candidate_lambdas=np.logspace(-3, 0, 50)):
    # This version of the function only plots the validation error, since we don't know the true
    # vector w.
    w_mses = []
    val_errors = []
    for l in candidate_lambdas:
        ### start Compute_Errors ###

        ### end Compute_Errors ###

    plt.figure()
    plt.title(r"Error vs $\lambda$")  # raw string so "\l" is not parsed as an escape
    plt.yscale('log')
    plt.xscale('log')
    plt.xlabel(r"$\lambda$")
    plt.ylabel('Validation Error')
    plt.plot(candidate_lambdas, val_errors)


## Load the Training Data

This dataset uses physicochemical features of white wines to predict an expert's quality score. It has been lightly preprocessed to normalize the features and labels. As you will see later, preprocessing can be extremely important for model performance. You can access the original data in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).


In [None]:
X = np.load("wine_train_features.npy")
y = np.load("wine_train_labels.npy")


## Part (a): Explicit Ridge Regression

First, we will perform explicit ridge regression and tune $\lambda$ to get a baseline for comparison.

**Generate an 80-20 training/validation split using the loaded data.**


In [None]:
print("All data:", X.shape)
n, d = X.shape
perm = np.random.permutation(n)
ntrain = int(n * 0.8)  # 80% training, 20% validation
idx_train = perm[:ntrain]
idx_val = perm[ntrain:]
X_train = X[idx_train, :]
y_train = y[idx_train]
X_val = X[idx_val, :]
y_val = y[idx_val]
print("Training data:", X_train.shape)
print("Validation data:", X_val.shape)


**Use `plot_validation_errors` and `optimize_lambda` to tune $\lambda$ for use in the rest of this problem.** Try running for several train/validation splits to see how $\lambda$ varies. Pick a reasonable value.
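The tuning step can be sketched as a plain grid search that keeps the candidate with the lowest validation MSE. This is one possible shape for it, shown on synthetic data; `pick_lambda`, `ridge_closed_form`, and the `*_demo` arrays are hypothetical names, and the un-normalized ridge objective is an assumption.

```python
import numpy as np

def ridge_closed_form(X, y, lambd):
    """Hypothetical ridge solver: (X^T X + lambd * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lambd * np.eye(d), X.T @ y)

def pick_lambda(X_tr, y_tr, X_va, y_va, candidates):
    """Return the candidate with the lowest validation MSE, plus all MSEs."""
    val_mses = []
    for lam in candidates:
        w = ridge_closed_form(X_tr, y_tr, lam)
        val_mses.append(np.mean((X_va @ w - y_va) ** 2))
    val_mses = np.array(val_mses)
    return candidates[np.argmin(val_mses)], val_mses

# Synthetic stand-in for the wine data, with an 80-20 split.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo @ rng.normal(size=8) + 0.5 * rng.normal(size=200)
perm = rng.permutation(200)
tr, va = perm[:160], perm[160:]

grid = np.logspace(-2, 3, 100)
best_lam, val_mses = pick_lambda(X_demo[tr], y_demo[tr], X_demo[va], y_demo[va], grid)
print("Best lambda on this split:", best_lam)
```

Re-running with a different seed changes the split and usually the selected value, which is why the prompt suggests trying several splits before settling on one.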


In [None]:
# TODO: Find a reasonable lambda
plot_validation_errors(X_train, y_train, X_val, y_val, candidate_lambdas=np.logspace(-2, 3, 100))
lambd = optimize_lambda(X_train, y_train, X_val, y_val, candidate_lambdas=np.logspace(-2, 3, 100))
print("Tuned lambda: %f" % lambd)


**Use your tuned $\lambda$ to find $\hat{w}$ by regressing on all the data. Print the MSE.**


In [None]:
### start w_hat_rr ###

### end w_hat_rr ###


## Part (b): Regularization with Augmented Features

Here you will train a regressor with augmented features. Refer to 6(c) from HW1 for context.

First **augment the training data.**
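For reference, the identity behind this augmentation can be derived in a few lines. This is a sketch under the assumption that the augmented matrix is $[X \;\; \sqrt{\lambda}\, I_n]$ and that $\eta$ stacks $w$ on top of an extra vector $\xi$; the symbol $\xi$ and the $\sqrt{\lambda}$ scaling are not stated in this notebook, so check them against HW1.

$$
\begin{aligned}
\begin{bmatrix} X & \sqrt{\lambda}\, I_n \end{bmatrix} \begin{bmatrix} w \\ \xi \end{bmatrix} = y
&\;\Longrightarrow\; \xi = \tfrac{1}{\sqrt{\lambda}} (y - Xw), \\
\|\eta\|_2^2 = \|w\|_2^2 + \|\xi\|_2^2
&= \tfrac{1}{\lambda} \left( \|y - Xw\|_2^2 + \lambda \|w\|_2^2 \right).
\end{aligned}
$$

Under that assumption, minimizing $\|\eta\|_2^2$ subject to the constraint is the same as minimizing the ridge objective over $w$.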


In [None]:
# TODO: Augment the training data in feature space.
# Use the appropriate weight for the augmentation identity.
d_raw = X.shape[1]
print("X shape", X.shape)
X_aug = np.zeros((X.shape[0], X.shape[1] + n))
print("X augmented shape", X_aug.shape)
### start b1 ###

### end b1 ###


Next, **perform the minimum-norm least-squares optimization and report the MSE.** When you plot $\hat{w}$ computed both ways, the two curves should overlap exactly, since the results are identical.
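One way to avoid an SVD of the wide augmented matrix is to solve an $n \times n$ system instead. The sketch below assumes the augmented matrix is $A = [X \;\; \sqrt{\lambda}\, I]$, so that $A A^\top = X X^\top + \lambda I$, and uses the minimum-norm formula $\eta = A^\top (A A^\top)^{-1} y$. It runs on synthetic data, and all `*_demo` names and the value of `lam` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_demo, d_demo, lam = 40, 6, 0.7
X_demo = rng.normal(size=(n_demo, d_demo))
y_demo = rng.normal(size=n_demo)

# Augmented (wide) matrix: original features next to a scaled identity.
A = np.hstack([X_demo, np.sqrt(lam) * np.eye(n_demo)])

# Minimum-norm solution eta = A^T (A A^T)^{-1} y.
# A A^T equals X X^T + lam * I, so only an n x n system is solved.
gram = X_demo @ X_demo.T + lam * np.eye(n_demo)
eta_demo = A.T @ np.linalg.solve(gram, y_demo)

# Explicit ridge solution for comparison.
w_ridge = np.linalg.solve(X_demo.T @ X_demo + lam * np.eye(d_demo),
                          X_demo.T @ y_demo)

print(np.max(np.abs(eta_demo[:d_demo] - w_ridge)))  # difference between the two
print(np.max(np.abs(A @ eta_demo - y_demo)))        # residual on the augmented system
```

Because the augmented system is solved exactly, the residual on the augmented data is zero up to floating-point error, which is worth keeping in mind when interpreting the squared error below.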


In [None]:
# TODO: Solve the least-norm linear regression problem. Print the
# residual squared error.

# Be careful! You are working with an underdetermined system so you
# will need to modify the problem you solve from the overdetermined
# least-squares case.
# This can be solved using the SVD; however, explicitly computing the SVD
# of a large matrix takes a LONG time. Think about how to do this
# regression more efficiently.

# Hint: Read and understand https://see.stanford.edu/materials/lsoeldsee263/08-min-norm.pdf

### start b2 ###

### end b2 ###

# Plot w_hat calculated both ways. They should be identical.
plt.plot(w_rr, label=r'$\hat{w}$ ridge regression')
plt.plot(eta[:d_raw], label=r'$\hat{w}$ feature augmentation')
plt.legend();


**Plot $\eta$ and the subset of $\eta$ corresponding to the original features.
Interpret the residual squared error in the context of the augmented dataset.**


In [None]:
# TODO: Plot eta and the subset of eta corresponding to the original features.
#       Interpret the residual squared error in the context of the augmented dataset.
### start b3 ###

### end b3 ###


_Your interpretation of the squared error here_


## Part (c): Predicting with Augmented Features
We have two methods available for prediction. We can extract only the parts of $\eta$ corresponding to $\hat{w}$, or we can appropriately augment the test set and use some other subset of $\eta$ for prediction. Here you will use both methods and see that they give the same results for test prediction.

First, **predict using only the part of $\eta$ corresponding to $w$. Report the train and test MSEs.**
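The claimed equivalence of the two prediction methods can be checked quickly on synthetic data. The sketch below assumes the test set is augmented with a block of zeros, so the extra entries of $\eta$ contribute nothing; `eta_demo` is a random stand-in rather than a fitted vector, and all `*_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_demo, n_test_demo, d_demo = 30, 10, 4

# Random stand-in for a fitted eta of length d + n (an assumption).
eta_demo = rng.normal(size=d_demo + n_demo)
X_test_demo = rng.normal(size=(n_test_demo, d_demo))

# Method 1: keep only the first d entries of eta.
pred_w = X_test_demo @ eta_demo[:d_demo]

# Method 2: pad the test features with zeros and use all of eta.
X_test_aug = np.hstack([X_test_demo, np.zeros((n_test_demo, n_demo))])
pred_aug = X_test_aug @ eta_demo

print(np.max(np.abs(pred_w - pred_aug)))
```

The zero block multiplies the trailing entries of $\eta$, so both methods produce the same predictions and therefore the same MSE.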


In [None]:
# Load the test dataset
# It is good practice to do this after the training has been
# completed to make sure that no training happens on the test
# set!
X_test = np.load("wine_test_features.npy")
y_test = np.load("wine_test_labels.npy")

n_test = y_test.shape[0]

# TODO: Use the w_hat subset of eta and calculate the train and test MSE

# First method: extract only w from eta and use it
### start c1 ###

### end c1 ###


Next, **appropriately augment the test set, and predict using the appropriate subset of $\eta$. Report train and test MSEs.**


In [None]:
# Second method: augment test set features
# -> What should we use for the augmented features?
### start c2 ###

### end c2 ###


## Part (d): Random Test-Set Augmentation
What if, instead of 0s, we place small random weights on the diagonal of the augmented test features?

**Choose a few distributions and plot the test MSE for augmentation with a range of scalings of the distribution.**
One example might be
```
for u in np.linspace(0, 1, 50):
    test_augmented[:, d_raw:] = np.diag(np.random.uniform(0, u, n))
    ...
```
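Note that `np.diag` applied to a length-`n` vector returns an `n` by `n` matrix, so the snippet above only fits the augmented block when the test set has exactly `n` rows. A hedged variant for the rectangular case is sketched below on synthetic data. It uses log-spaced scalings, `eta_demo` is a random stand-in for a fitted vector, and all `*_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n_demo, n_test_demo, d_demo = 30, 10, 4

# Random stand-ins for the fitted eta and the test data (assumptions).
eta_demo = rng.normal(size=d_demo + n_demo)
X_test_demo = rng.normal(size=(n_test_demo, d_demo))
y_test_demo = rng.normal(size=n_test_demo)

scalings_demo = np.logspace(-3, 0, 20)
mses_demo = []
for u in scalings_demo:
    # One random weight per test row, placed on the main diagonal of a
    # rectangular (n_test x n) block; all other entries stay zero.
    diag_vals = rng.uniform(0, u, size=n_test_demo)
    block = np.eye(n_test_demo, n_demo) * diag_vals[:, None]
    X_test_aug = np.hstack([X_test_demo, block])
    mses_demo.append(np.mean((X_test_aug @ eta_demo - y_test_demo) ** 2))

print(mses_demo[0], mses_demo[-1])
```

Log-spaced scalings avoid a zero value on the log-scaled x-axis used in the plot for this part; swapping `rng.uniform` for another sampler such as `rng.normal` is one way to compare distributions.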


In [None]:
# TODO: instead of 0s for test features, use something small and random on the diagonal.
test_mses = []
scalings = []
### start d ###

### end d ###
plt.plot(scalings, test_mses)
plt.title("MSE with random augmentations")
plt.xscale('log')
plt.xlabel('Augmentation Scaling')
plt.ylabel('MSE');


**Comment on what you observe about the effect of distribution and scaling on the test performance.**


_Your comments here_
