# [HW2] Problem : Outlier Removal via OMP (part 2)

This problem is the second part of, and a continuation of, the OMP problem from HW1. We will look at why a validation set is needed here, and at how to use a validation set that itself contains outliers.

## (a)
First, we will rerun the demo from the first part of the problem, which you have already seen.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from helpers import make_state_trace, identify_system, random_input, plot_eigenvalues
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from sklearn.linear_model import OrthogonalMatchingPursuit
import scipy.io

%matplotlib inline


In [None]:
def greedy_OMP(X_aug, y, k):
    """
    Performs orthogonal matching pursuit.

    args:
      X_aug: augmented data matrix, i.e., X_aug = [X I]
      y: noisy observations with outliers
      k: how many non-zero entries in solution of OMP; is equal to
        the number of iterations in the OMP algorithm

    returns:
      remaining_residual: mean squared residual of the OMP fit on the
        training data
      coef: fitted coefficient vector for the columns of X_aug; the
        leading entries correspond to X and the rest to the identity
        columns (one per training point)
    """

    reg = OrthogonalMatchingPursuit(
            n_nonzero_coefs=k, fit_intercept=False).fit(X_aug,y)
    idx = [j for (i, j) in zip(reg.coef_, range(1, len(reg.coef_) + 1))
           if abs(i) > 1e-7]

    test_mat = np.zeros((2, X_aug.shape[1]))
    test_mat[1, 0] = 10.0
    plot_points = reg.predict(test_mat)

    y = np.reshape(y, (y.shape[0], 1))
    err_vec = (y - np.reshape(reg.predict(X_aug), (y.shape[0], 1))) ** 2
    remaining_residual = np.mean(err_vec)
    return remaining_residual, reg.coef_


In [None]:
# Load data
mdict = scipy.io.loadmat("data.mat")

X, y = mdict["X_train"], mdict["y_train"].flatten()
X_test, y_test = mdict["X_test"], mdict["y_test"].flatten()
X_val, y_val = mdict["X_val"], mdict["y_val"].flatten()

n = 50
d = 1


In [None]:
errs = []
weights = []
X_aug = np.concatenate((X, np.eye(X.shape[0])), axis=1)
num = n
idxes_size = -1

while(num > 1):
    err, w = greedy_OMP(X_aug, y, n - num + 2)
    errs.append(err)
    weights.append(w)
    num -= 1

plt.figure(figsize=(10, 3))

plt.subplot(121)
plt.plot(np.arange(1, len(errs) + 1), errs)
plt.title("Error vs number of points removed")

plt.subplot(122)
plt.plot(np.arange(1, len(errs) + 1), errs)
plt.yscale('log')
plt.title("Error (log scale) vs number of points removed")

plt.tight_layout()
plt.show()


The left plot, which we saw in the previous homework, shows the training error of each OMP solution as a function of the number of outliers removed. The trend is hard to see on a linear scale, but the log-scale plot (right) shows that the training error keeps decreasing! Eventually, if we treat every data point as an outlier and remove all but one, we technically reach zero error. However, this is not what we want, because we would then be estimating the true weights from a single data point instead of from all the uncorrupted points.
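The monotone decrease described above can be reproduced on synthetic data. The following is a minimal sketch, not the demo code: it uses a plain-NumPy OMP loop instead of scikit-learn, and the function name `omp_training_errors`, the line slope, the noise level, and the four injected outliers are all assumptions made for illustration.

```python
import numpy as np

def omp_training_errors(X, y, max_k):
    """Plain-NumPy OMP on the augmented matrix [X I]. Returns the mean
    squared training residual after each of the first max_k iterations."""
    n = X.shape[0]
    A = np.concatenate((X, np.eye(n)), axis=1)
    norms = np.linalg.norm(A, axis=0)
    chosen = np.zeros(A.shape[1], dtype=bool)
    errs = []
    residual = y.astype(float).copy()
    for _ in range(max_k):
        # Pick the unused column most correlated with the current residual.
        scores = np.abs(A.T @ residual) / norms
        scores[chosen] = -np.inf
        chosen[int(np.argmax(scores))] = True
        # Refit least squares on the selected columns only.
        coef, *_ = np.linalg.lstsq(A[:, chosen], y, rcond=None)
        residual = y - A[:, chosen] @ coef
        errs.append(float(np.mean(residual ** 2)))
    return errs

# Synthetic 1-D data (assumed): a noisy line with four corrupted points.
rng = np.random.default_rng(0)
n = 20
X = rng.uniform(0.0, 10.0, size=(n, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)
y[:4] += 25.0

train_errs = omp_training_errors(X, y, max_k=n)
```

Because each iteration refits on a superset of the previous columns, `train_errs` can never increase, and once `n` columns are selected the fit interpolates the training data.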

Next, we will show that this is indeed a bad thing to do by evaluating the OMP solutions we obtained earlier on a clean test set. The test set represents the data distribution we want to perform well on, unlike the corrupted data we have for training. **Fill in the space below to compute the errors on this test set using all the weights we obtained from OMP with different numbers of outliers removed.** The list `test_errs` should contain the mean squared error on the test set, indexed by how many purported outliers have been removed.
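One detail worth keeping in mind: each coefficient vector returned by `greedy_OMP` is fitted to `X_aug = [X I]`, so it is longer than the number of features. The sketch below shows one way to score such a vector on clean data, under the assumption that only the leading feature weights apply to points outside the training set. The helper name `clean_mse` and the toy numbers are hypothetical.

```python
import numpy as np

def clean_mse(w_aug, X_clean, y_clean):
    """Mean squared error on clean data, using only the feature weights.

    Assumption: w_aug was fitted to [X I], so its first d entries are the
    feature weights and the remaining entries are per-training-point
    offsets that have no meaning for new points.
    """
    d = X_clean.shape[1]
    w = w_aug[:d]
    return float(np.mean((y_clean - X_clean @ w) ** 2))

# Toy data (assumed): y = 2x exactly, with d = 1 feature.
X_clean = np.array([[1.0], [2.0], [3.0]])
y_clean = np.array([2.0, 4.0, 6.0])

# One feature weight followed by four training-point offsets.
w_good = np.array([2.0, 0.0, 7.5, 0.0, 0.0])
w_bad = np.array([1.0, 0.0, 7.5, 0.0, 0.0])

mse_good = clean_mse(w_good, X_clean, y_clean)
mse_bad = clean_mse(w_bad, X_clean, y_clean)
```

Here the offset entries are ignored entirely, so the two vectors differ only through their first entry.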


In [None]:
test_errs = []

### start 1 ###

### end 1 ###

plt.figure(figsize=(5, 3))
plt.plot(np.arange(1, len(test_errs) + 1), test_errs, color='orange')
plt.title("Test error vs number of outliers removed")
plt.yscale('log')
plt.tight_layout()
plt.show()


**Describe in words the trend you see in the test error. Based on the plot alone, what is a good number of outliers for OMP to remove?**


## (b)

**Compute the validation errors for some of the weight vectors obtained from OMP (`weights` at indices 0, 10, 20, 30, and 40), storing the per-point errors in `val_err`. For each weight vector, plot a histogram of the errors, and compute their mean and median.** See where the mean and the median fall in each histogram.


In [None]:
for idx in [0, 10, 20, 30, 40]:
    ### start 2 ###

    ### end 2 ###
    mean = np.mean(val_err)
    median = np.median(val_err)
    plt.hist(val_err)
    plt.show()
    print('k=%d, mean: %.4f, median: %.4f.' % (idx + 2, mean, median))



## (c)

Does the idea of the median being a robust estimator give you some idea for "robust" validation? **Plot the mean and median of the validation errors with respect to the number of removed outliers. Now determine from the median plot where to stop. Why does the median work better than the mean in this case?**
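As a small illustration of what "robust" means here, the sketch below compares the mean and the median on a made-up set of per-point errors in which a minority of points are outliers. The sizes (45 clean, 5 corrupted) and magnitudes are assumptions chosen for illustration, not values from `data.mat`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed per-point squared errors: most are small, a few are huge.
clean_errs = rng.uniform(0.0, 0.02, size=45)
outlier_errs = np.full(5, 400.0)
val_err = np.concatenate((clean_errs, outlier_errs))

mean_err = np.mean(val_err)      # pulled up by the five large values
median_err = np.median(val_err)  # determined by the clean majority

print('mean: %.4f, median: %.4f' % (mean_err, median_err))
```

As long as fewer than half of the points are corrupted, the median falls among the clean errors, while the mean is dominated by the few large ones.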


In [None]:
mean_val_errs = []
median_val_errs = []

### start 3 ###

### end 3 ###

plt.figure(figsize=(5, 3))
plt.plot(np.arange(1, len(mean_val_errs) + 1), mean_val_errs, label="mean")
plt.plot(np.arange(1, len(median_val_errs) + 1), median_val_errs, label="median")
plt.title("Validation error vs number of augmented bases")
plt.yscale('log')
plt.legend()
plt.tight_layout()
plt.show()
